#05 - Log-Structured Database Storage ✸ SingleStore Database Talk (CMU Intro to Database Systems)

Category: Database Systems

Tags: buffer, compaction, databases, index, storage

Entities: Clickhouse, DJ Cash, Larry Ellison, Oracle, Postgres, Redshift, SingleStore, Snowflake, Vertica, WRCT

Summary

    Announcements and Events
    • DJ Cash will be hosting a radio show on WRCT this Sunday from 1 to 2 p.m.
    • Homework 2 will be released today, and Project 1 is due on September 29th.
    • Industry affiliates visit day will occur on the 16th and 17th, with research talks and company sessions.
    • Larry Ellison, founder of Oracle, is now the richest person in the world.
    Database Systems
    • Buffer manager is responsible for managing database pages in memory.
    • Optimizations include maintaining multiple buffer pools, prefetching, and scan sharing.
    • Index-organized storage uses the index as the storage for the table itself.
    • Log-structured storage focuses on appending new records, improving write efficiency.
    SingleStore Database
    • SingleStore is a hybrid database supporting both OLTP and OLAP workloads.
    • It uses a column store with optimizations for transactional workloads.
    • SingleStore can handle both analytical and transactional benchmarks effectively.
    Actionable Takeaways
    • Consider using multiple buffer pools for better memory management.
    • Prefetch data to reduce query execution time.
    • Use index-organized storage to eliminate separate index lookups.
    • Apply log-structured storage for efficient writes and updates.
    • Explore databases like SingleStore for hybrid workload support.

    Transcript

    00:00

    [Music]

    00:25

    All right, give it up for DJ Cash. Uh, you look well.

    Thank you. Thank you.

    Uh, why? Uh, I'm just coming back from court.

    Um, trying to get back payments on my royalties.

    00:40

    Court. Yeah.

    Did you win or you don't know yet? I'm not too sure.

    I'll let you know next week. Okay.

    Uh, well, there are a lot of announcements. Um, I'm actually announcing for you as well.

    Uh, you have a radio show starting this week. I do.

    I do, on Sunday. This Sunday from 1

    00:56

    to 2 p.m., I have a show starting on WRCT.

    You guys should all walk in. All right.

    This Sunday, DJ Cash is on the radio. Uh, are they paying you? No.

    No. All right.

    You got to work on that. Okay.

    Um for you guys in the class, uh

    01:14

    homework 2 is going to go out today. Uh that'll be on the 21st.

    Project one got released yesterday. Uh and that's going to be due on September 29th.

    Has anybody looked at project one yet? Any questions about project one?

    01:30

    So, we'll announce the recitation; that'll be next week. Uh, but all the files and everything should be on GitHub.

    So, pull down the latest version and make sure you're working off that codebase. And then, as I said last class, next Monday and Tuesday, on the 16th and 17th,

    01:46

    we are having our industry affiliates visit day. So a bunch of database friends from various companies are coming to campus to give talks and meet with you guys.

    So the 15th will be research talks and poster sessions with students and other researchers at CMU. Um, and then on the 16th in the morning,

    02:02

    before like 12:30 there'll be about three sessions of different talks from the companies. You can talk to them about you know internships and full-time positions.

    And like I said, we'll send them your resumes if you fill out the spreadsheet that I posted on Piazza. Okay.

    Yes, question?

    02:17

    The question is: is there a list of research talks and times? Yes, that link in the slides will take you to the page.

    The finalized schedule is not ready yet, but that'll be posted probably tomorrow. And then for the company sessions,

    02:33

    once everyone fills out the form tomorrow, we'll figure out what the actual optimal ordering will be. Okay, any other questions?

    All right. So, this is very unusual.

    Uh, but there was

    02:50

    very, very, very huge news today in databases like two hours ago. Anybody know what it is?

    Like groundbreaking news, like the biggest news story in your lifetime. Nobody.

    Larry Ellison, the founder of Oracle, is

    03:07

    now officially the richest person in the world. He dethroned Elon Musk.

    What's that? Who?

    I don't know. He doesn't do

    03:22

    databases, right? This guy does, right?

    All right. Larry Ellison is like the OG of database money, right?

    Again, now he's the richest person in the world. It's all paid for by databases.

    He owns a Hawaiian island, paid for by databases, and like a big one, right? Not the one that had all the

    03:38

    lepers, right? One you could actually see from, like, Maui.

    Uh, he bought his son from his fourth marriage a movie studio, right?

    He got married last year; the woman's 36, he's 80,

    03:54

    whatever. Like that's his thing, right?

    This is a big deal. Like I know he's worked so hard for it.

    Um and it really means a lot to him and I sent an email congratulating him. Um but this is it.

    This is why we do databases. This right here, right?

    So, I mean, I I would do a round of applause for him, but he's not

    04:10

    here, so that'd be kind of weird. But again, he's been bouncing back and forth between fifth and seventh, and then earlier in the year he went from third to second.

    So, this is a big deal. And I mean, I get emotional, but this

    04:27

    means a lot. Okay.

    All right. So, we're not there yet.

    We're not at Larry's level. Um, but we'll we'll get there.

    All right. So last class we spent time talking about the buffer manager, right?

    We said that this is the component inside of every database system. If

    04:43

    you're not using mmap (you shouldn't be, right?) and you're going to manage your own memory in your database system, it's the component that is responsible for bringing database pages from disk into memory, handing off pointers to other parts of the system, and keeping track of the reference

    04:58

    count, or whether pages are dirty or not. Right? Doing all the machinery that allows us to have the illusion that our database can fit in memory even though it may not.

    Right? So in today's class I want to quickly go over

    05:15

    three optimizations we missed at the end of last class. Then we're going to jump back to the disk level and refresh your memory about what tuple-oriented storage looks like.

    And then we'll look at two alternative methods: index-organized storage and log-structured storage. And then at the end of class today, we'll finish up with a talk from the

    05:31

    SingleStore guys. All right. Which actually started off not as a disk-oriented database but as an in-memory database, where the primary storage location of the database was in memory.

    They switched that, but, you know, it's a pretty

    05:46

    state-of-the-art system. And the founder of SingleStore was also the founder of Neon, which got bought by Databricks for a billion dollars last year. So again, Larry Ellison is not the only one who makes a lot of money doing databases.

    Okay. So all right so last class

    06:03

    again, we were trying to quickly go over the buffer stuff at the end, the disk scheduling stuff, and I want to come back and talk about buffer optimizations. Again, these are going to be examples of things your database system can do because it has full control of the memory and what's being read and written from disk.

    Again, these are things that the operating system is not going to be

    06:19

    able to do very easily, because it doesn't know what queries are running and what the queries want to do. But we, as the database system, we know, right?

    Because SQL is declarative, we know what the query plan is going to look like.

    Therefore, there's a bunch of tricks we can do to make things better.

    06:34

    So the first one is maintaining multiple buffer pools. So I talked about this buffer pool being this region of memory where we're going to put our database pages.

    It doesn't have to be logically organized as a single block of memory. We actually could break it up into subarrays and have a page table per

    06:52

    subarray, and have different policies and algorithms used to optimize those different buffer pools.

    Right? So you can have one per database, one per table, or one per page type.

    If it's going to be indexes versus table pages, you

    07:07

    split those up, right? And so the benefit we're going to get is that, because we can tailor each buffer pool's internal algorithms based on how we know the rest of the system wants to use the data that's going to be put in that buffer pool, we can make better decisions about what to evict and when

    07:22

    to evict. And we're actually going to reduce latch contention when we have multiple workers trying to access the same data structures at the same time, because now things are split across pools.

    So this is something you'll see in the high-end systems, the enterprise systems. Like in DB2, the IBM system, you can go crazy: you can declare a buffer

    07:39

    pool per table, per index, per blob type. You can do a lot of very sophisticated things. Um, MySQL has a pretty simple one.

    We'll see that in a second. So there are basically two ways to do this.

    You can just say, "All right, well, if I

    07:55

    know I've got to do lookups on record IDs or object IDs (the object could be a table or an index), then when I do my lookup to try to get the page that this object belongs to..." I'm showing some arbitrary

    08:11

    command, get record whatever. It's not really SQL, obviously. Um, but you can pick apart this record ID, look at maybe the first portion of it (that's the object ID), and use that to decide which of these different buffer pools is going to have the data that you want.

    And the

    08:27

    thing about the guarantee, if you have multiple buffer pools, is that any page that we want, any physical page, can only exist in one and only one buffer pool at a time, right? You don't want two copies, because then if you do two writes, how do you keep them in sync?

    We don't want to do that, right? So again, with a simple

    08:44

    mapping we can say, okay, this is what we want for this object ID, and we know it's in buffer pool one, so go get it over there. Another simple trick, and this is what MySQL does, is that you can just take whatever record ID it's trying to find, like, I want page 123 or record 123, and just hash it

    09:01

    and mod it by the number of buffer pools that you allocate in the beginning, and that just tells you where to go find it. Again, this has the same guarantee: any page that we're looking for can only be in one of these buffer pools at a time, right?
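    As a rough sketch of that hashing trick (illustrative names only, not MySQL's actual code), hashing the ID and modding by the pool count is enough to guarantee that each page lives in exactly one pool:

```python
# A minimal sketch of hash-based buffer pool routing (assumed names;
# not MySQL's actual implementation).

class BufferPool:
    def __init__(self, name):
        self.name = name
        self.frames = {}                      # toy page table: page_id -> bytes

    def get_page(self, page_id):
        # Fetch from "disk" on a miss; each pool only ever sees its own
        # pages, so it can run its own eviction policy and latches.
        if page_id not in self.frames:
            self.frames[page_id] = f"<contents of page {page_id}>"
        return self.frames[page_id]

N_POOLS = 4
pools = [BufferPool(f"pool-{i}") for i in range(N_POOLS)]

def route(page_id: int) -> BufferPool:
    # Hash the id and mod by the pool count: a given page always lands
    # in the same pool, so no page is ever cached twice.
    return pools[hash(page_id) % N_POOLS]

page = route(123).get_page(123)
```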

    I'm showing two buffer pools. You

    09:17

    could have hundreds, right? And then there's a disk scheduler underneath, because all these buffer pool instances may be trying to read and write data at the same time.

    You don't want to have a bunch of random I/O. The disk scheduler can see all the requests, sort them, and

    09:33

    try to do it in a more efficient manner. Another trick we can do is prefetching.

    So again, we haven't really talked about what query plans look like. Just assume the plan is the program that we know we're going to execute in order to run some query, right?

    So we know everything ahead of time what's

    09:48

    going to happen. So say we see a scan occurring on data, right?

    So query one's going to start. The first thing it needs is page zero, and then it starts reading page one, so go read that.

    But if we recognize that we have this sequential access pattern, then the

    10:04

    database system can say, "Okay, well, you just read page zero and you read page one. I know your query is also going to need page two and page three.

    So, let me go ahead and prefetch that for you. Put that in the buffer pool while you're crunching on page one." Right.

    Right. Run whatever

    10:20

    eviction policy we want to run, and put those pages in there.

    So now when the query says, "All right, I'm done with page one. Now let's go look at page two," it doesn't stall and wait while you go fetch it from disk.

    The data you need is already there, right? And you do that for the rest

    10:35

    of the pages all the way down. Now, mmap can do that.

    Actually, the OS can do this, right? The OS has a way: you can pass hints to say, I want to prefetch as I'm reading a file.

    Um, but it can only prefetch sequentially, right? It can't do

    10:51

    anything more sophisticated because it doesn't know what's in these pages. Doesn't know how they're related to each other.

    So, you can do tricks like this. Say I have a query that wants to get data within a range, and the database system decides, okay, well, I have an index for this, I can use that to find the data that I want. Right? Assuming some tree

    11:08

    data structure (it's most likely going to be a B+ tree, we'll cover that in a few weeks, but it doesn't have to be), the data along the leaf nodes is sorted in the key order we want. So now, as my query starts running, I read index page zero. That's the root; that's

    11:23

    going to tell me whether I want to go left or right. Now I go down here, I go to page one, and I fetch that in. And that just happened to be sequential in my database file. So yeah, you could potentially prefetch that.

    But when I get down here to the bottom, I can see that I'm going to scan across the leaf nodes and I know I'm going to need page

    11:38

    three and page five, and they're not sequential. So if the OS was doing this for us, it would blindly say, "Okay, well, you just read page one and page two.

    Let me go get page three and four for you." But we actually want three and five. Because we know the

    11:55

    content of these pages, we can peek ahead (we'll talk in a few weeks about how we do that) and say, oh, we don't actually want four.

    We actually want five, because we're scanning along the leaf nodes. And we go ahead and prefetch those guys.
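    Here's a toy sketch of the simple sequential case of this (the class name and prefetch depth are made up for illustration): the pool watches the last two page requests, and if they're consecutive it pulls the next couple of pages in before the query asks for them:

```python
# A toy sequential-prefetch sketch (assumed names, not a real DBMS API).
from collections import deque

PREFETCH_DEPTH = 2

class PrefetchingPool:
    def __init__(self):
        self.frames = {}
        self.history = deque(maxlen=2)        # last two page ids requested

    def _load(self, page_id):
        if page_id not in self.frames:
            self.frames[page_id] = f"<page {page_id}>"   # pretend disk read

    def get_page(self, page_id):
        self._load(page_id)
        self.history.append(page_id)
        # If the last two requests were consecutive, assume a sequential
        # scan and pull the next pages in before the query needs them.
        if len(self.history) == 2 and self.history[1] == self.history[0] + 1:
            for nxt in range(page_id + 1, page_id + 1 + PREFETCH_DEPTH):
                self._load(nxt)
        return self.frames[page_id]

pool = PrefetchingPool()
pool.get_page(0)
pool.get_page(1)              # triggers prefetch of pages 2 and 3
assert 3 in pool.frames
```

    The index-aware case described above works the same way, except the "what comes next" decision peeks at the leaf node's sibling pointers instead of assuming page N+1.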

    Right? So again, that's an example: because the database system knows what the

    12:12

    queries are trying to do, it's in a better position to make decisions about how to optimize things. All right, the last one I'll talk about is scan sharing.

    Um, and the idea here is that if we have a query that's running and it's scanning some table and

    12:28

    then some other query comes along and wants to scan the same table, rather than just having that second query start from the beginning and re-read all the same pages the first guy just read, we actually can piggyback and ride along with the first query, see all the pages that it sees, and then

    12:44

    make a determination whether we want to quit, or come back around and find the data that we're looking for. These are sometimes called synchronized scans, and again, this is done at the lowest level of the system, as some cursor is reading pages from the page directory or scanning along a table.

    13:01

    This is not the same as result caching. Result caching is: I run a query, I produce some output.

    If that same query shows up, I just reuse the same output. Scan sharing is done at the lowest level, while we're actually running queries.

    So this is actually rare; few systems support it. Again, the high-end

    13:18

    ones like DB2, SQL Server, and Teradata do this, and Postgres does too; Postgres is the exception. Oracle, again, despite him now being the richest person in the world, doesn't have this feature. They have something that almost looks like it, but it has this limitation where

    13:34

    the queries have to be exactly the same in order to do this piggyback mechanism, right? So, meaning (this is from the documentation), if you have a query like SELECT * FROM employees, and the next query comes along and says SELECT * FROM Employees with a capital E, right? Or there's an extra space in there, it

    13:50

    doesn't match. Even though semantically they're the same, the way they're doing this matching is basically hashing the string and seeing whether it's identical, right?

    But the high-level idea is still the same, right? So say we have this query, SELECT SUM(...) FROM A. It starts at

    14:06

    the beginning. It's going to scan through and read all the pages that it needs.

    And then say at this point, we get here and we need to read page three. We only have three frames in our buffer pool.

    So we're going to throw out page zero. And while we're doing this, another query shows up that basically wants to do another

    14:23

    sequential scan on the entire table, but now compute a different aggregation. So the computation is different, but the data it needs to access is the same.

    So we could do the stupid thing and just start this second query from the very beginning of the table, but it's

    14:39

    going to read page zero. And that's actually the last page we just evicted.

    So we'd just go back and fetch it back into our buffer pool. But if we're smart about it, we can say, okay, well, these are both doing a sequential scan on A.

    So I'll

    14:54

    just have Q2 tag along with Q1. As the cursor reads the table and gets all these pages, each query can do whatever computation it wants.

    Then at the bottom, Q1 says, I've scanned the entire table, and it goes away. But Q2 says, "Well, I started at

    15:10

    sort of the halfway point of this table, and there's a bunch more pages I need to go back and read. So let me wrap around and scan through those." Right?

    Again, we talked about this last class: we want to maximize the amount of computational

    15:25

    work we can do on data pages when we bring them from disk into memory. So if I can satisfy two queries' operations with one single page fetch, that's a huge win for us.
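    A simplified sketch of that idea (a hypothetical API, not Postgres's actual implementation): a leader cursor drives the I/O, an attached query computes on the same pages, and when the leader finishes, the follower wraps around for the prefix it missed:

```python
# A toy scan-sharing (synchronized scan) sketch with assumed names.

class SharedScan:
    def __init__(self, pages):
        self.pages = pages
        self.pos = 0                  # cursor position of the leading scan
        self.followers = []           # (callback, position where it joined)

    def attach(self, cb):
        # A new query piggybacks at the leader's current position.
        self.followers.append((cb, self.pos))

    def step(self, leader_cb):
        # Leader fetches one page; every attached scan computes on it too,
        # so a single page fetch satisfies multiple queries.
        if self.pos >= len(self.pages):
            return False
        page = self.pages[self.pos]
        leader_cb(page)
        for cb, _ in self.followers:
            cb(page)
        self.pos += 1
        return True

    def finish_followers(self):
        # After the leader completes, each follower re-scans the prefix
        # of the table it joined too late to see.
        for cb, start in self.followers:
            for page in self.pages[:start]:
                cb(page)

pages = list(range(6))
scan = SharedScan(pages)
q1, q2 = [], []                       # stand-ins for two aggregations
scan.step(q1.append)
scan.step(q1.append)
scan.step(q1.append)
scan.attach(q2.append)                # Q2 shows up partway through
while scan.step(q1.append):
    pass
scan.finish_followers()               # Q2 wraps around for pages 0-2
assert sorted(q2) == pages
```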

    Yes.

    15:47

    The question is: at this point here, Q2 has not seen page one or page two, but Q1 has.

    Yes. Yes.

    16:06

    Yeah. So his statement here is: could we be more sophisticated and have Q2 recognize that pages one and two are already in memory, by peeking in the buffer pool, and then say, okay, let me read those and then piggyback? Yes. In this PowerPoint example I'm not showing that, but yes,

    16:22

    you could do that. Now, who does that? I don't think Postgres does, and the enterprise guys are closed source, so I don't know what they do. Yes, the question is: how do you know what pages correspond to what table? The page

    16:38

    directory tells you. Yeah, the page directory is basically a catalog. It says, okay, if I want to read table A, here's the directory with the files that have the pages that you want to read. Yeah, we have to know that.

    16:54

    The statement is: this assumes a sequential scan. If there's an index scan, it might not read the same pages.

    Yes, if there's a predicate there, from a WHERE clause, you might not want to read everything if there's an index. Sure.

    Keep it simple: sequential scan,

    17:12

    right? All right.

    All right. So again, those are the three optimizations I wanted to cover in the buffer pool.

    So let's jump back now and talk about disk storage again, the database storage on disk. So we said at

    17:29

    the beginning of the semester that we're going to focus on what I'll call a disk-oriented architecture. Again, we assume that the primary location of the database, when it is at rest, is on nonvolatile storage like a disk.

    And then we talked about a way to break up a database: the entire database goes

    17:46

    in a bunch of these database pages, represented as slotted pages where we could stick in tuples and then do lookups later on. Right?

    And so we discussed the page layout, the page layout being, again, what's actually in the pages

    18:03

    that we're storing in our database files, right? And we talked about tuple-oriented storage; that was the slotted page architecture, where the pages are really all about storing tuples, and we have this indirection layer within

    18:19

    each page that allows us to move data, or tuples, around within a single page itself. So today I want to look at two alternatives.

    Again, we'll refresh what tuple-oriented storage looks like, but then we'll jump into the new stuff for today.

    Right? And these again

    18:36

    these are going to be alternative ways to store data for tables in a database. So again, the slotted page: I said this is the most common layout for databases that are row-oriented.

    Next class we'll discuss what that means in more

    18:51

    detail. Basically, we're going to take any tuple (assuming we can fit it in this page) and store it inside that page.

    And then inside that page, we have this header that tells us some metadata about what's in the page, like a checksum, a version number, and so forth. Then

    19:07

    there'll be a slot array at the beginning and all the tuple data at the bottom. And the slot array is going to have basically just record offsets inside the page that tell us where to jump to find the beginning point for every single tuple.

    And we said the slot array is

    19:23

    allowed to grow from the beginning to the end, and the tuple data is going to grow from the end to the beginning.

    This is the way Postgres does it, and I think the way the textbook describes it.

    Other textbooks might reverse it. In the

    19:39

    end, it doesn't matter whether the slot array or the tuple data is on the top or the bottom; the high-level idea is the same. Right? And again, the slot array is providing us indirection that allows us to change the position of a tuple within the page itself. Right? We

    19:55

    don't want to move it to another page (we may have to, but we're not focusing on that yet), and this allows us to do things like, oh, I'll delete this tuple here, and then I can do compaction and slide this other tuple over to fill in the space, right? And all I have to

    20:11

    do now is update the slot array to tell me where the offset is for that tuple I just moved. And I don't have to update anything else outside this page itself.
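    Here's a toy slotted page that shows those mechanics (sizes and layout simplified for illustration): the slot array grows from the front, tuple bytes grow from the back, and compaction slides live tuples together while only the slot array changes, so handed-out record IDs stay valid:

```python
# A toy slotted page sketch; not any real system's on-disk format.

PAGE_SIZE = 64

class SlottedPage:
    def __init__(self):
        self.data = bytearray(PAGE_SIZE)
        self.slots = []                 # slot id -> (offset, length); None = deleted
        self.free_end = PAGE_SIZE       # tuple data grows down from here

    def insert(self, tup: bytes) -> int:
        self.free_end -= len(tup)
        self.data[self.free_end:self.free_end + len(tup)] = tup
        self.slots.append((self.free_end, len(tup)))
        return len(self.slots) - 1      # the slot number part of a record id

    def delete(self, slot: int):
        self.slots[slot] = None

    def get(self, slot: int) -> bytes:
        off, ln = self.slots[slot]
        return bytes(self.data[off:off + ln])

    def compact(self):
        # Slide live tuples back together at the end of the page and
        # update only the slot array; nothing outside the page changes.
        end = PAGE_SIZE
        # Move in descending-offset order so no tuple is overwritten
        # before it has been copied.
        order = sorted((i for i, s in enumerate(self.slots) if s is not None),
                       key=lambda i: -self.slots[i][0])
        for i in order:
            off, ln = self.slots[i]
            end -= ln
            self.data[end:end + ln] = self.data[off:off + ln]
            self.slots[i] = (end, ln)
        self.free_end = end

page = SlottedPage()
a = page.insert(b"alice")
b = page.insert(b"bob")
c = page.insert(b"carol")
page.delete(b)
page.compact()
assert page.get(a) == b"alice" and page.get(c) == b"carol"
```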

    Then we talked about record IDs and this was the way that we were going to

    20:28

    physically find a tuple within our database, right? And it's going to be some combination of a file name or file ID, a page ID, and then a slot number within that page.

    Right? So the page directory thing

    20:45

    they were asking about before, that'll store the file ID for some object. Like, if I want table foo, here's the file where to go find it. And then if I want page 123, depending on the addressing

    21:01

    scheme the database system uses, it could just be an offset within the file, or the file name itself might have the page number. Right? All of that's within the page directory. Yes, what if you have, like, more rows? The

    21:16

    question is, what if you have more rows than the size of the ID? Um, I mean, it's a 64-bit number.

    Well, hold up. In Oracle, it's a 10-byte number, right?

    That's pretty big. Well, so Ingres is a system

    21:33

    from, like, 1974. Postgres is, yeah,

    six bytes. Would you run out of six bytes? That would be a pretty big table. Yeah, but in theory, yes, you could.

    21:48

    There are other limitations too, right? Like, the ID space is not just for one table; it could be for the whole database.

    Now, depending on the scheme you're using, this could be for the entire database,

    or it could be for a single table, right? I know I want table foo,

    22:06

    and here's the page ID for it. Therefore, I know how to find that.

    Right? There's other limitations we're not going to talk about.

    Like, in Postgres you can only have 2^16 columns in a table. Oracle famously only lets you have a thousand because somebody hardcoded it, right?

    It

    22:22

    used to be like 200, and then they spent a year trying to fix it to go to a thousand. But that was 10 years ago; no one's going to change it again.

    It's stupid things like that. But yeah, you're very unlikely to run out of space for this.

    There are other tricks you can do if you need to

    22:37

    get around this. Like, in the case of Postgres, you can partition a table.

    It's like: here's a parent table, and then within that I have sub-tables. So logically it looks like a single table, but each one has its own, right?

    There's ways to get around it.

    22:55

    Say it again. The question is, for a row-store database system, would it make sense to have a thousand columns?

    People do it. Sure.

    Why not? Right.

    There's easily applications that have a thousand columns. Yeah.

    23:11

    Enterprise stuff is nasty. Like anything from the SAP enterprise resource planning (ERP) thing, the thing they make a lot of money on.

    That thing is insane, right? Because think of like an application running for like 30 years.

    Every two years someone's got to add like 10 columns because something else gets added. Then before you know it, you're

    23:27

    at a thousand easily. To your point, I mean, it's a long tail, right?

    Most tables are going to have maybe, I don't know, 20 or 30 columns, but there's easily stuff out there that'll have that many.

    23:44

    The statement is, at that point you would have a single row across multiple pages. Uh, it depends on what the columns are.

    Then, yes, there are multiple pages.

    23:59

    Sure. But the database system can handle that.

    Yeah, for our discussion here, we've simplified it and assumed it's all going to sit in a single page. But we actually talked about large records, or large attributes: you just have a pointer to another page, and then at runtime the DBMS

    24:16

    knows how to stitch it back together. All right, so in this tuple-oriented storage, reads are pretty simple.

    All right. If I want to get an existing tuple, then I'm going to get its record ID.

    I'm

    24:33

    not saying how to get that just yet, but assume there's some index, something that's going to tell me, for your email address, here's the record ID that has your student record. There's something there.

    Here's the record ID that has your student record. There's something there.

    It's the index that's going to tell us how to find that. But for our purposes right here, we don't care.

    So again, how do we find this? We go look in the page directory.

    Go get

    24:50

    Find the location of the page we want; retrieve that page from that location on disk if it's not already in memory; and then use the slot number embedded inside the record ID to jump into the page's slot array and find the data we're looking for. So reads

    25:05

    are pretty simple. In this case here, we have to do the lookup in the index to find the data that we're looking for. So we do a lookup in this index that's going to

    25:21

    tell us: for a given logical key, like your email address, when we look at the index, it's going to come back and say, here's the record ID. Then we do a lookup to go find the data that we want, given that record ID, within the page. But what if we didn't have to do that? What if we didn't have to do a

    25:38

    separate lookup in the index and a separate lookup in the table heap? This is called index-organized storage, and the basic idea is that we're going to use the index as the storage for the table itself. So again, what is the index doing for

    25:55

    us? It's a mapping from a key to a value.

    In this case it would be a logical key, like your email address, the primary key or something; and then the value would be the record ID, if it was tuple-oriented storage. With index-organized storage, in the leaf nodes, instead of storing a record ID, we're

    26:10

    just going to store the tuple itself. So again, I'm not telling you what data structure this is going to be, but typically it's going to be a B+ tree. Could be a skip list.

    Could be a trie. It could be whatever your favorite tree data structure is.

    So now, within the pages, sorry, the leaf nodes, it's

    26:27

    going to look a lot like the slot array from tuple-oriented storage, but in the beginning, in the header, we're going to have this key-to-offset array mapping that we have to keep in sorted order based on the key. And these are, again, just going to be offsets within the

    26:43

    page that give us the tuple that we want. So now when I do that lookup based on your email address, I would scan down through the inner nodes in the data structure, land on a leaf node, then do a binary search inside this key-offset array, and that'll find

    27:00

    the record that you're looking for. And I don't have to do that separate lookup in the page directory to go find your tuple for a given record ID.

    Everything's right there in the leaf node. So SQLite, and MySQL

    27:15

    with InnoDB, famously give this to you. Like, that's basically how they organize the data in their systems.

    With Oracle and SQL Server, by default you get the slotted page architecture that I described before. But

    27:31

    when you create a table, you can tell it: I want index-organized storage. All right.

    So they'll support this as well. And just like before, the key-offset array grows one way and the tuple data grows the other way.
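    A sketch of an index-organized leaf node under those assumptions (toy layout, not SQLite's or InnoDB's real format): a sorted key-to-offset array in the header, full tuples in the body, and a binary search that lands directly on the tuple with no separate heap lookup:

```python
# A toy index-organized leaf node sketch.
import bisect

class LeafNode:
    def __init__(self):
        self.keys = []            # sorted keys
        self.offsets = []         # offsets[i] locates the tuple for keys[i]
        self.body = bytearray()   # serialized tuples live in the node itself

    def insert(self, key, tup: bytes):
        off = len(self.body)
        self.body += tup
        i = bisect.bisect_left(self.keys, key)
        self.keys.insert(i, key)
        self.offsets.insert(i, (off, len(tup)))

    def lookup(self, key):
        # Binary search the key/offset array; the tuple is right here in
        # the leaf, so there is no second lookup in a table heap.
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            off, ln = self.offsets[i]
            return bytes(self.body[off:off + ln])
        return None

leaf = LeafNode()
leaf.insert("bob@cmu.edu", b"<bob's tuple>")
leaf.insert("alice@cmu.edu", b"<alice's tuple>")
assert leaf.lookup("alice@cmu.edu") == b"<alice's tuple>"
```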

    Okay. Well,

    27:48

    so that solves the problem we had in reads before, where I want to find a record or a tuple based on a key: I have to go look in the index, and then,

    28:03

    for the given key, the index tells me the record ID, and from the record ID I can go to the page that has the tuple data. But we still have to worry about writes. So, to do an insert, it's pretty straightforward,

    28:19

    right? We just go check the page directory to see whether there's a page that has enough free space for the tuple we want to store.

    Again, we know the size of the data we're storing because SQL is declarative. We know exactly what you're telling us to store.

    28:36

    So we can figure out, at the moment you try to insert it, how much space we need. Then we retrieve that page from disk if it's not already in memory, check the slot array, find a free slot, and insert into it, right?

    Ignoring the index organized

    28:51

    storage stuff, like rebalancing the tree; we'll worry about that later.

    So inserting is pretty straightforward. And say we run out of pages, there's no free space in the current pages we have: we just allocate a new page and insert our record in there. No problem.

    29:08

    But now, if we want to update a tuple: go through the same process as before, find its page from its record ID, and go see whether the page exists.

    Then assuming we're going to overwrite

    29:23

    the data (which is not always the case in most systems, but assuming we are), then I'm going to check the size of the new value of the tuple that I'm trying to update. If it's larger than the original value, do I have any space

    29:40

    in my page? Because with the slot array, I can reorganize things to try to free up space within the page.

    And if I can do that and it fits in the same page, great, I'm done.

    But if there's no space in the current

    29:55

    page where the tuple currently exists, then I basically have to treat that as a delete (removing the original tuple) followed by an insert of the new tuple. And that gets expensive, right?

    30:12

    So tuple-oriented storage is really great for reads, because again, it's really easy to go find the thing that I'm looking for. For inserts, not so bad either, right?

    Because I can just fill page after page, over and over again. But updates can actually be quite expensive,

    30:30

    right? So part of the problems we're going to have is that there's going to be fragmentation, right?

    because we can't guarantee that we're going to fully utilize every single page. There's that little space in the middle where the slot array grows one way and the tuple data grows the other, and if they don't line up exactly perfectly, which

    30:47

    they almost never do, then you're going to have a little space in the middle that's just wasted. Then, anytime I want to update a tuple, I've got to fetch that page into memory just to update one tuple inside

    31:02

    of it, right? Depending on the page size: if my pages are really large, then I might be reading 16 kilobytes to go update a five-byte or eight-byte value, something really small.

    So there's potentially a lot of wasted I/O, depending on what my workload is and what my data looks like.

    31:20

    The other problem is that tuples could be stored across multiple pages. That means if I want to update a batch of tuples at the same time, I may have to fetch all those individual pages, bring them into memory, and then update them.

    31:36

    Right? If I had 10 tuples across 10 different pages, I have to bring those 10 pages into memory before I can update anything.

    And then lastly (this is not so much an issue with tuple-oriented storage, but it's a way to think about how to design systems): what if we actually can't do

    31:52

    any in-place updates in our nonvolatile storage, right? I can't bring a page into memory, jump to some offset within it, and then write out new values at that offset, right?

    There's some storage systems that actually don't let you do this. The

    32:08

    Hadoop file system is one, probably one of the earliest ones that did this. Google has their own internal file system called Colossus.

    If you ever use Amazon S3 or other object stores, some of them provide these semantics. So you can't do in-place updates.

    You can only do appends. You can only create new data.

    32:25

    You can't overwrite existing data. If you want to get rid of the old stuff, it's a delete, and then you do more appends.

    So because of these issues, there's a different way to store data,

    32:40

    called log-structured storage. Sometimes these are called log-structured merge trees, or LSMs. And the idea here is that instead of storing tuples in pages and then updating them in place (multi-versioning, which we'll cover later, means what I'm saying is not exactly true for

    32:56

    multi-versioning systems, and what we're talking about today is going to look like multi-versioning, but not in the way we'll talk about later for concurrency control), the high-level idea is the same: instead of being allowed to do in-place updates, the only thing we can do is append new records to data pages.

    33:16

    And because now we're only going to do appends (of course, we still have to handle deletes and removing things), that makes some of the algorithms for reading and writing data more or less efficient, but it's going to make our updates go

    33:31

    really fast. Inserts go really fast as well because you're just appending new records.

    And that's that thing we talked about before, where we want to minimize the amount of random disk I/O in exchange for more sequential I/O. If we only do appends, that's all sequential I/O, and that'll be much faster

    33:48

    for us. So LSMs are an old idea; it goes back to 1996.

    Actually, the guy who invented the LRU-K stuff I talked about last class was also the inventor of the LSM. The paper came out in '96.

    There was

    34:04

    earlier work done on log-structured file systems in the late '80s and early '90s, right? And so this is a successor to that file system work, but applied to databases. So there are essentially going to be two main data structures in a log-structured storage system. There'll be a memtable,

    34:21

    an in-memory data structure where we do all our writes and make all our changes; and in that one, because it's in memory, we can make in-place updates. And then as this memtable gets too big, we're going to write it out to this immutable file called an SSTable. I

    34:39

    think it stands for sorted string table, or sometimes static string, or sorted static table. Whatever; it's the same idea. But it's going to be a sort of compact form of the memtable that's going to have the

    34:54

    changes that were made to the database. So there are only two operations to modify a log-structured system:

    put, where you add a new entry, and delete. Right?

    There's no overwrite, there's no update, there are

    35:11

    no upserts, right? A put is basically the same thing as an insert, but it may not be: a put physically puts a new record in, but logically it might be updating an existing one.

    We'll see what that looks like in a second. So again, now we draw the demarcation line between disk and

    35:28

    memory. Again, we have our memtable up above.

    It's some tree data structure that's going to record changes that we're going to make to the database. I'm not saying what the data structure is.

    Again, it could be a B+ tree. Skip lists are pretty common.

    It could be a trie. It doesn't matter.

    35:44

    So, say now my operation is: I want to put a new value in for key 101. I'm not defining how I'm determining what key 101 is.

    You can think of it the same way we did the row ID in SQLite: like a sequential counter,

    35:59

    where, for every new record you add, the system increments a counter by one to get a new key. So I'm going to apply this put to my memtable and basically add a new entry for it.

    Same

    36:15

    thing: now I want to do a put for key 102 with the new value B1. I'll put that in my memtable as well.

    And now if I want to go back and modify key 101 again, now with a value A2: because the memtable is in memory,

    36:32

    as its name says, I can do an in-place update inside of it. I actually don't need to record the history of the previous changes that were made.

    There's a separate log file we're not going to talk about in today's class; basically, because this is in memory, if you crash, you could lose data.

    But basically,

    36:48

    we are appending the puts to a separate file, and that'll get flushed before we say the transaction is committed. But for now, we can ignore that.

    Just be aware that we have a way to keep the memtable durable.

    37:03

    Same thing if I do another put on key 103, right? I'll update my memtable, and so forth.
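    A bare-bones sketch of the memtable behavior just described (a real one would be a skip list or B+ tree backed by a write-ahead log; a plain dict stands in here): puts overwrite in place, deletes record a tombstone, and a flush emits the surviving entries in sorted key order, which is exactly what becomes an SSTable:

```python
# A toy memtable sketch; names and sizes are illustrative only.

TOMBSTONE = object()          # marker recorded for deletes

class MemTable:
    def __init__(self, limit=4):
        self.entries = {}     # key -> latest value; in-place updates are fine
        self.limit = limit

    def put(self, key, value):
        # In memory we can overwrite: only the newest version survives.
        self.entries[key] = value

    def delete(self, key):
        self.entries[key] = TOMBSTONE

    def full(self):
        return len(self.entries) >= self.limit

    def flush(self):
        # Emit the entries in sorted key order; written out to disk,
        # this sorted run is the SSTable.
        out = sorted(self.entries.items())
        self.entries = {}
        return out

mt = MemTable()
mt.put(101, "a1")
mt.put(102, "b1")
mt.put(101, "a2")             # in-place update; "a1" is simply gone
mt.put(103, "c1")
print(mt.flush())             # [(101, 'a2'), (102, 'b1'), (103, 'c1')]
```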

    So at some point the memtable is going to get full, right? We're going to run out of space; think a couple hundred megabytes or something like that. At which point we're going to

    37:19

    scan the leaf nodes of the memtable (because that's all the entries with the actual values for the different keys that we have) and we're going to store that now in a different form, the SSTable. Again, think of this almost like the write-ahead log.

    Like, here's

    37:35

    all the changes that were made, that this memtable recorded. So again, if I made multiple changes to a key, like I changed key 101 multiple times: because I'm doing in-place updates on the memtable, when I write out the SSTable, I only see whatever the latest version was.

    37:53

    So at this point, once my SSTable has been initialized with the data from the memtable, I'm storing the data from the lowest key to the highest key. So this is in sorted order, based on whatever the key is.

    I'm going to go

    38:09

    ahead and write this out to disk, to nonvolatile storage. Do a flush.

    Make sure it's there. And then I'll just populate a new memtable.

    Just keep doing this over and over again, making new SSTables. And these are now

    38:25

    going to be sorted in timestamp order, from newest to oldest. Right? Within one file, it's sorted from key low to high; across multiple files, I'm going to keep track of them in newest-to-

    38:40

    oldest order; like, you put a timestamp in the file name, right, and the timestamp is always incrementing as things go forward. So I'm going to show one type of compaction, the most common one, but I'll show another one too, and I'll go into more detail about what this compaction scheme looks like. But at some point

    38:56

    this level is going to get full. So then I actually want to combine SSTables that are on disk.

    This is the first level, level zero. I'll combine them into a new SSTable, and then I can blow away the old ones. The idea is that, as you keep going down,

    39:12

    these things get bigger and bigger. Right?

    And the idea is that I'm throwing away things I don't need anymore, and I always try to have whatever the latest version is at each level for a given key. Yes?

    I don't understand why we want multiple

    39:27

    levels; like, why not always merge? The question is, why do we want multiple levels?

    Why not just merge within a level itself? Give me two slides.

    We'll come to that. But one would be better for reads, one would be better for writes.

    39:44

    I'm not showing key ranges here, and that's why it's maybe confusing. Give me a few slides.

    Yes. These are I assume these are operations.

    [Music]

    40:01

    Yes, in two slides, right?

    So his question is (and he's correct): with this compaction, are we throwing out operations that we don't need anymore? Yes.

    So if I do a bunch of puts on a key and then I

    40:17

    delete that key, I don't need to keep the puts. I just need to know that I deleted it.

    Right? So you basically end up with this tiered structure, where you have bigger, longer SSTables at the bottom and progressively smaller ones as you go up.

    40:33

    All right. So that's how we do puts and deletes.

    We need to handle reads now. So if I want to do a read on key 101, the first thing I'm going to do is go check my memtable to see if I have any

    40:49

    entries inside it, because the memtable is going to have the latest version, or the latest changes that were made to a key. So I go check my memtable.

    If it's in there, great. Take whatever the value is and return it.

    I'm also not saying what the value is. I'm kind of being vague here, but it's typically going to be the tuple

    41:05

    itself, right? Just in sort of serialized form.

    But if the key that I want is not in the memtable, then I've now got to go check all these different SSTables, at different levels, and that's going to suck, because I've

    41:20

    got to do a bunch of reads to find, you know, basically a needle in a haystack, right? So the way to deal with this is that you basically maintain what's called a summary table in memory, which provides some additional metadata about what's

    41:36

    actually in these files. Right?

    So I could say, for a given key 101, I could check: does this key even exist at the different levels? So I start at level zero, because again, the levels, depending on my compaction scheme, could be going forward in time;

    they could also be

    41:52

    partitioned across keys. But I want to see which level has my key. And if I know the key exists maybe at this bottom level here, but it also exists at a higher level there, I always want to choose the higher one, because I know

    42:07

    that one's more recent. So again, we're making this trade-off to make our writes go faster in exchange for slower reads, but we can add some additional data structures to make sure the reads aren't as bad.
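    Continuing the memtable sketch from above, here's a rough illustration of this read path. The summary metadata here is just a min/max key range per file; a real system would also keep a Bloom filter, as the next question discusses:

```python
# Builds on the MemTable/TOMBSTONE sketch above; assumed structure,
# not RocksDB's actual file format.

class SSTable:
    def __init__(self, entries):              # entries: sorted (key, value)
        self.entries = dict(entries)
        self.min_key = min(self.entries)      # summary metadata kept in memory
        self.max_key = max(self.entries)

    def might_contain(self, key):
        # Range check only; a Bloom filter would tighten this further
        # (it can false-positive but never false-negative).
        return self.min_key <= key <= self.max_key

def get(key, memtable, sstables_newest_first):
    # 1. The memtable always holds the most recent version of a key.
    if key in memtable.entries:
        val = memtable.entries[key]
        return None if val is TOMBSTONE else val
    # 2. Otherwise walk SSTables from newest to oldest; the first hit
    #    wins, and the summary lets us skip whole files.
    for sst in sstables_newest_first:
        if not sst.might_contain(key):
            continue                           # saved ourselves a disk read
        if key in sst.entries:
            val = sst.entries[key]
            return None if val is TOMBSTONE else val
    return None

older = SSTable([(101, "a1"), (104, "d1")])
newer = SSTable([(101, "a2"), (103, TOMBSTONE)])
assert get(101, MemTable(), [newer, older]) == "a2"   # newest version wins
assert get(103, MemTable(), [newer, older]) is None   # tombstone hides the key
```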

    42:25

    Wait, say again? The question is: is the key filter a Bloom filter? Like, what is the filter?

    Um it could be a range filter to say like

    42:41

    within this range, whether a key exists or not, because that would handle range scans. Or it could be for a single key: does this key exist at this level or not? That's your Bloom filter lookup. We'll talk about Bloom filters in a few weeks; that could give you a false positive.

    It might tell you the key exists at a level, and then

    42:57

    you've got to actually go look and check to see whether it's there or not. Yes?

    So the way I imagine it like let's say I say I want to update

    43:12

    this field for an ID, then I only have the field I'm changing, but there are many other fields. The statement is (again, this is why I was trying to be vague about what the value is):

    43:31

    if I update a record and it's a blind write, I don't need to look to see what the value actually was. I just do a put and I'm done.

    If it's like, uh, update Andy's salary or Andy's age where age equals current age plus

    43:47

    one, I've got to go read whatever the current value was, then I've got to do an update. And then what am I actually putting in the memtable?

    In the case of RocksDB, and I think most systems, it's going to be the full tuple all over again. But my point is, it's just a write.

    I don't care what it was before.

    44:03

    I'm saying, in this case, a blind write. Yes.

    So if I do that and later on I want to read your record. It's not enough to like find your age.

    I need to find everything; your record may appear in multiple SSTables. Yes.

    So his statement is: you're

    44:20

    basically asking, for a given record that has multiple attributes, are those attributes broken across multiple SSTables? No.

    You would always have the full tuple, with all its attributes, in the record that you're storing as part of the value in the memtable or an SSTable. So, in a sense,

    44:36

    internally, a blind write would be like an insert. But to your point, yeah, his point is you'd have to do a read-modify-write: you go fetch the tuple, then update. It depends on what the query is. If it's an update where you're

    44:51

    changing all the attributes, then yeah, you don't need to see what the original values were. You just do an update in place.

    Yeah. Or not in place:

    You can do an update without actually reading it. Other questions?

    45:08

    All right. So again, what are we doing?

    It's a basic key-value storage architecture where, again, the value is going to represent the contents of the tuple. And if we do a delete, we don't actually need to go find the data and actually

    45:24

    delete it, which is kind of nice. Again, in a tuple-oriented architecture on slotted pages, I've got to go fetch the page and then mark it as deleted. In this case, if I have the key of the thing I want to delete, I just put a delete-key record in my mem

    45:42

    table, and then I'm done with it. Then, for anybody else that comes along and wants to read that tuple, we have to make sure that it sees that delete record notifying it that the record no longer logically exists. Even though physically it might still be in a bunch of SSTables, we know we should ignore it if we try to do a

    45:58

    lookup. Yeah.

    Sorry. Yes.

    Yeah. So the question is: when do these merges

    46:13

    actually happen? Next slide.

    Okay, so at some point I'm making all these SSTables and running out of disk. And, you know, if I update a tuple like a thousand times

    46:29

    and then I delete it, I'm going to have a thousand and one records corresponding to that one tuple, but I only really need the latest one. So there'll be this background process that gets triggered based on the size of the SSTables, based on the number of overlapping key

    46:44

    ranges; we'll see that in the next slide.

    There's a bunch of mechanisms to trigger when compaction should kick in. And again, there's this trade-off between doing background maintenance and actually just running queries.

    Remember we talked about the background writer or

    47:00

    the page cleaner in the buffer pool: we could have a thread run through our buffer pool manager, find our dirty pages, and start proactively writing them out so that they get marked clean and can get evicted. But if I'm spending all my time doing that, then I'm going to slow down queries actually trying to do real work.

    So I

    47:16

    could spend all my time doing compaction to clean up all these SSTables, but that's going to make query execution run slower. So there really isn't a one-size-fits-all solution to determine when you should trigger a compaction.

    47:31

    But things like RocksDB have a bunch of different policies. All right.

    So, in the simplest form, say I have two SSTables and I want to merge them.

    And in this case here, I'm assuming the two SSTables are ordered from newest to oldest. So the one all the

    47:47

    way over here is the newest one. The one after that is older.

    So the way I'm going to do the merge is basically the same way you do a sort-merge. If you know that algorithm; we'll talk about it again in two weeks.

    There's a sort-merge join algorithm as well, right? You

    48:02

    basically just have two cursors that start at the beginning of both of these files. Again, I'm showing two SSTables.

    You could have multiple ones; typically you want to merge multiple ones at the same time, right?

    But for our illustration, it's just two. So now what's going to happen is I'm going to look at whatever key is being

    48:17

    referenced wherever my cursor is pointing. In this case here, the first cursor is looking at a delete on key 100.

    This guy has a put on key 101. Since 100 is less than 101,

    I know that there isn't going

    48:33

    to be a reference to key 100 in this SSTable. So, therefore, I can just take whatever that record is and put it in my new SSTable.

    And then I move that cursor down, but I keep the other cursor at the same location it was

    48:49

    pointing at before. So now at this point we have a put on key 101 in the first one, and then a put on key 101 as well in the second one.

    But since this

    49:04

    SSTable is newer than that one, we know we don't care about whatever that put was. We only care about the latest one.

    So we go ahead and ignore that put, and put the first SSTable's entry into the new one. Then, same thing: both of the cursors move down.

    We have a

    49:20

    put on key 102, another put on key 102, in both of them.

    I don't care about the second one. I only keep the first one.

    Keep moving down. Now I have a put on key 103.

    But I also have a delete on key 103. Same thing.

    They're doing some modification to the same key, but I

    49:36

    only care about the latest one. So I go ahead and ignore the second one and keep the first one.

    The first cursor is done. So now I keep scanning along, and whatever comes after that, I know I want to copy into my new SSTable, right?

    So I'd put the put for 104 there.
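    Here's a sketch of that two-cursor sort-merge (toy tuple format, not RocksDB's file layout), using the same entries as the walkthrough. When both tables reference the same key, the newer table's entry wins and the older one is dropped:

```python
# Merge two sorted SSTables, keeping only the newest entry per key.
# Each entry is (key, op, value); 'newer' takes precedence on ties.

def merge_sstables(newer, older):
    out, i, j = [], 0, 0
    while i < len(newer) and j < len(older):
        if newer[i][0] < older[j][0]:
            out.append(newer[i]); i += 1       # key only in the newer run
        elif older[j][0] < newer[i][0]:
            out.append(older[j]); j += 1       # key only in the older run
        else:
            # Same key in both: the newer table wins, the old entry dies.
            out.append(newer[i]); i += 1; j += 1
    out.extend(newer[i:])                      # copy whatever is left over
    out.extend(older[j:])
    return out

newer = [(100, "DEL", None), (101, "PUT", "a3"),
         (102, "PUT", "b2"), (103, "PUT", "c2")]
older = [(101, "PUT", "a2"), (102, "PUT", "b1"),
         (103, "DEL", None), (104, "PUT", "d1")]
merged = merge_sstables(newer, older)
# Keys 101-103 keep only the newer versions; the delete on 100 and the
# put on 104 survive untouched.
assert [k for k, _, _ in merged] == [100, 101, 102, 103, 104]
```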

    49:55

    That's basically what compaction is doing. Now, the tricky thing is going to be how I organize multiple SSTables and how I decide which SSTables to merge; and, as we said, when it gets triggered is based on a bunch of different policies.

    50:12

    So what I showed before was an example called leveled compaction, where I'm going to maintain different levels of SSTables of different sizes. And the goal here is actually to have non-overlapping key ranges within our levels, except for level zero, because

    50:29

    that's the special one where, as memtables get full, we just append the resulting SSTables to disk.

    So in that case they will have overlapping key ranges, but everything below that will not. So again, assuming that the SSTables go from newest to

    50:45

    oldest: our first SSTable that gets written out is going to have a key range from A to R, right?

    Arbitrary letters. The next one is going to have a key range from E to T.

    And then the third one here will have a key range from B to Q. So at some point I'm going to run out of

    51:00

    space, or something gets triggered that says: I want to do compaction on level zero now. So I'm going to make a new level, level one.

    And I'm going to take these three SSTables and combine them into two larger SSTables. And I'll do

    51:16

    the compaction I just showed on the previous slide, where I'll have a cursor go through each one and remove whatever the older entries are for a given key; it'll only keep the latest one. But then, in my new SSTables, I'm

    51:32

going to generate, I now have non-overlapping key ranges within this level. So if I'm looking for key X, for example, and I have to look at level zero, I potentially have to look at all the SSTables to find key

    51:49

X. Well, actually X wouldn't be in any of them, but say I'm looking for key Q.

I would actually have to look at all three SSTables to find that key, because it could fit in each of their ranges. But now when I

    52:04

do the compaction down to level one, since they now have non-overlapping key ranges, if I need key Q, I know I only have to go look in the second SSTable.

It may not exist there. And again, I can use a filter to tell me whether it's going to exist or not.

But if it could exist, then

    52:22

I know I only need to look at one SSTable. Yes?
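On the filter mentioned above: in practice it's commonly a Bloom filter (RocksDB keeps one per SSTable). Here's a toy sketch with a fixed 1024-bit array and two cheap hash probes, both arbitrary choices for illustration. A false answer means the key is definitely not in the SSTable, so we can skip reading it; a true answer means maybe.

```cpp
#include <bitset>
#include <functional>
#include <string>

// Toy Bloom filter: sizes and hash choices are made up for illustration.
struct BloomFilter {
    std::bitset<1024> bits;

    void add(const std::string& key) {
        bits.set(std::hash<std::string>{}(key) % 1024);
        bits.set(std::hash<std::string>{}(key + "#") % 1024);  // crude second hash
    }

    // false => key definitely absent (safe to skip this SSTable).
    // true  => key *may* be present (still have to read the SSTable).
    bool mayContain(const std::string& key) const {
        return bits.test(std::hash<std::string>{}(key) % 1024) &&
               bits.test(std::hash<std::string>{}(key + "#") % 1024);
    }
};
```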

Can you repeat that? Okay, the question is why do we end up with multiple SSTables in the next level.

    It could be just

    52:38

one giant one. Basically, the size of the SSTables is growing.

Sorry, the SSTable size grows per level. And so the cumulative size of the SSTables up above that I'm merging down may be larger than what I can fit into a single

    52:54

SSTable. So I drew two, but it could just be one.

I just wanted to show that you would have these non-overlapping ranges. So once this is done, I can blow away all the SSTables at level zero

    53:09

and keep doing the same thing, just keep adding more and more. Now I have different ranges again, and at some point I want to do a merge. Again, these are going to be overlapping ranges in level zero, but when I create SSTables in

    53:25

level one, they have to be non-overlapping. So I may actually end up recreating all the files: reading them all into memory using my buffer manager and then writing them all back out,

because the keys that are in level zero might have to go

    53:42

in any of the files. I may have to generate a bunch more files.

So if I have, say, 10 gigabytes of data on level one and I run compaction, I'm going to have to read 10 gigabytes in and I might be writing out another 10 gigabytes.
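For the read path just described, here's a hedged sketch; the in-memory SSTable stand-in and the lookup function are hypothetical, but the control flow mirrors the rule from the slides: level-0 runs can overlap, so check them newest to oldest, while at lower levels at most one run's range can contain the key.

```cpp
#include <map>
#include <optional>
#include <string>
#include <vector>

// Hypothetical in-memory stand-in for an on-disk SSTable, for illustration.
struct SSTable {
    int minKey, maxKey;                  // key range of this run
    std::map<int, std::string> entries;  // sorted, like the on-disk format

    std::optional<std::string> get(int key) const {
        auto it = entries.find(key);
        if (it == entries.end()) return std::nullopt;
        return it->second;
    }
};

// levels[0] may contain overlapping runs, ordered newest to oldest;
// levels[1..] have non-overlapping key ranges within each level.
std::optional<std::string> lookup(const std::vector<std::vector<SSTable>>& levels,
                                  int key) {
    for (size_t lvl = 0; lvl < levels.size(); ++lvl) {
        for (const SSTable& t : levels[lvl]) {
            if (key < t.minKey || key > t.maxKey) continue;  // range/filter check
            if (auto v = t.get(key)) return v;  // first hit is the newest version
            if (lvl > 0) break;  // non-overlapping: no other run here can match
        }
    }
    return std::nullopt;  // not found (tombstone handling elided)
}
```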

    54:00

Yes, a question about the number of levels.

The question is whether the number of levels is fixed. No.

They can grow forever. In practice, though, I forget what the default is in RocksDB, you just end up with really big files at the bottom.

    54:22

The question is, for any set of what? Sorry.

    for like

    54:38

The question is whether you would have a setting for a database system that says I'm going to have 10 levels in my compaction scheme, and then you always have 10 levels. No. At the very beginning I could set this

    54:54

thing to have a maximum of 10 levels, but right here at this point you only have two; it grows. Yeah.

So you never exceed that maximum. You just end up with really large SSTables,

a lot of them, at the lowest level. So why is this good for reads?

The question is why

    55:11

is this better for read-heavy workloads? Because, as we'll see with universal compaction on the next slide, if I want to go find a key at a lower level, I only have to look at the SSTables whose range my key fits in.

If you go back

    55:27

to when we only had level zero, and this is fresh from memory onto disk, these are overlapping ranges.

    So if I want key Q, it can exist in the first one, the second one or the third one. So I have to check all of them.

Now, in this case,

    55:43

since they're ordered based on time, if I see it in the first one, I know I don't care about the other two.

    But in the worst case, I have to scan all three. Yep.

Other questions?

    56:06

    The question is, can the same key exist in level zero and level one? Yes.

The question is, because at level

    56:24

zero it's sorted from newest to oldest, does that mean it doesn't impact the search speed? Well, no. If I'm only at level zero and I want key Q, if it's not in

    56:41

the first one, I've got to check the second one, and that's slow. If it's not in the second one, I've got to check the third one, and that's slow, right? So worst-case scenario, I have to check everything at level zero. The summary table helps me maybe skip looking up some

    56:58

SSTables that aren't going to have the data I want. But in the worst case, I may have to check them all.

So yeah, the reads will get slower. But that's why they have the additional levels, because now when I start compacting down the levels, I can

    57:13

guarantee that, again for key Q, it's only going to exist in one SSTable at level one. Likewise, if I keep adding more levels, same thing.

Q can only exist in one SSTable per level. Whereas if it

    57:29

was just me writing SSTables straight from memory like I do in level zero, it could be in any of them. And again, I do the same thing: I do compaction down to the next level, I can blow away things above, and now I again have non-overlapping

    57:45

ranges within the level as it exists now. But again, as I add more things at level zero, there may be overlapping ranges there.

    58:02

The question is similar to the one asked before: what triggers compaction? It depends on the system and how you set it up. It could be that if I have a max size for all the files within a level, when that exceeds

    58:17

some threshold, that triggers compaction. Say it again?

In this particular example, the question is why did I merge one to

    58:32

level two? All right, the question is why, in this case,

did I merge only two SSTables from level one down to level two? It's just the example, right?

    What's that?

    58:48

One. Yeah, you would merge more than two, right?

I'm trying to make this work in PowerPoint, so it's simplified. You can be very aggressive and try to merge everything, but then you're just spending your time doing compaction, not

    59:04

running queries. All right, so let me show universal compaction real quickly.

Universal compaction only has one level, and again, it's the SSTables as they exist, as they're being written out from the memtable in

    59:19

memory to disk. So we want to do compaction on them, but we don't want to maintain multiple levels, because then we'd have this write-amplification problem where every single time we want to do compaction we've got to read a bunch of things in and then write it all back out.

So instead we're going to do more targeted

    59:36

compactions. So in this case, say I take these three SSTables.

Again, they may have overlapping key ranges, but they're ordered through time. And now I'm going to take these three and merge them into a single SSTable, compact them, and reclaim the space.

    And I can do this

    59:53

again for the next two here: compact them and put them into a larger one.
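A minimal sketch of that single-level scheme, assuming a toy Run type and a made-up run-count trigger (RocksDB's universal compaction actually uses size-ratio heuristics): keep the runs ordered newest-first and fold the oldest adjacent runs together, letting newer versions win.

```cpp
#include <map>
#include <string>
#include <vector>

// Toy run: a sorted map from key to value (tombstones elided).
using Run = std::map<int, std::string>;

// One level, runs[0] newest and runs.back() oldest. When there are too
// many runs, merge the two oldest adjacent ones into a single run.
void maybeCompact(std::vector<Run>& runs, size_t maxRuns = 4) {
    while (runs.size() > maxRuns) {
        Run& newer  = runs[runs.size() - 2];
        Run& oldest = runs.back();
        // map::insert keeps existing (newer) entries, so older versions lose.
        newer.insert(oldest.begin(), oldest.end());
        runs.pop_back();  // the two runs are now one larger run
    }
}
```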

So this is great for workloads where you're inserting data very quickly and doing a lot of updates. It's great for workloads where I

    00:09

typically only want the latest data that I just inserted. Like if I'm inserting time-series information, like the temperature of this room, I only care about the last five or ten minutes.

    I don't care about what the temperature was, you know, a year ago. And so in that case, I'll be able to look up very quickly and find the data that I want in

    00:25

the newest SSTables that are on disk, right? It's the classic trade-off in computer science: I'm spending more time and energy when I write things, making the

    00:40

writes more expensive, but my reads will be faster, as opposed to making my writes go really fast but my reads a little bit slower. Right?

So this is making the writes go

    00:56

faster, but the reads could potentially be slower depending on what you're looking up. Okay.

So to finish up: log-structured storage has become super common in the last 15 years, and partly this is because of RocksDB. RocksDB is a log-

    01:12

structured merge-tree key-value store that actually supports both of these compaction schemes. You can set whichever you want.

I think the default is leveled compaction. But it has been used as the starting point for, or the built-in storage manager of, many, many new systems.

    01:29

RocksDB was created by Facebook, or Meta, whatever it is now. But they weren't the original authors of it.

It's actually a fork of something called LevelDB that came out of Google. This was built by Jeff Dean and the people working on Bigtable.

They built this log-structured,

    01:47

embedded key-value store called LevelDB. Facebook forked that and made RocksDB.

What's the first thing Facebook did when they forked LevelDB into RocksDB? What was that?

No, not the copyright. No, no, I think it was an Apache license maybe.

    So, it wasn't an issue. What non-legal thing?

    What's a

    02:04

technical thing? What's the first technical thing they did? Change.

No, that's a legal thing. He said change the license.

No, technical. They changed the code in what way?

Okay, what did they do? Removed mmap, right?

The very first thing they did was get rid of mmap, because it's

    02:20

garbage, and switched to a buffer pool manager the same way we talked about last class, right? And like I said, a bunch of different systems start off using RocksDB as their internal storage manager.

Cockroach did this, and then over time they rewrote it. This is again a small smattering of log-structured storage systems.

    02:36

There are a lot more. And as we said, the challenge of this approach is that it's going to be really great for doing writes, but the compaction is expensive.

And we may have this write-amplification issue where we do one

    02:52

update to a single tuple, and even though we never update it ever again, we're going to read it in, write it out, read it in, write it out, over and over again multiple times because of the compaction process. Whereas in a tuple-oriented architecture, if I write something to a page, it gets

    03:10

written out to disk. I never read it back in again,

and I never write it out over and over again. Yes?

The question is, does this mean all these systems are bad for

    03:25

read-heavy workloads? No.

Things like ClickHouse are really, really good for that, because there's a bunch of extra stuff you can do to mitigate the issues.

But that'll be next class.

    03:43

Okay. So next class we're actually going to break your preconceived notion of what a database looks like.

For now we've been talking about rows that are stored contiguously on disk or in memory. We're going to flip that and talk about column stores,

where we actually store data as columns, not rows. So

    04:00

that'll be next class. For project one, I'm going to quickly go over this before we jump to the speaker.

So this has been posted on Piazza and the code is on GitHub, right? You're going to build your own buffer pool manager in the BusTub

    04:15

system. So you're going to build a replacement policy algorithm based on ARC.

Again, we'll have a recitation next week that goes over the algorithm in more detail, but it's pretty straightforward. Then you build a disk scheduler and then a buffer pool instance, and it'll be due on September 29th.

So the replacement policy is

    04:33

just the algorithm itself that actually decides what to evict when you need a page. You're allowed to use the built-in STL containers for the ghost lists and the most-frequently-used and most-recently-used lists.

So you don't need to build your own separate vectors or lists or things like that

    04:50

inside C++; just use the STL ones, because it's all in memory, right? The buffer pool is in memory; we don't care if we lose what's in these lists if we crash. So it's okay to use the built-in data

    05:07

structures. For the disk scheduler, you're basically going to support asynchronous I/O through std::promise callbacks, and the basic idea is that other parts of the system are going to ask the disk, hey, I want this page.

    You're going to

    05:23

build the piece of the system that's responsible for scheduling the reads, fetching those pages into memory, and then handing them off to the buffer manager, right? Then you put this all together and actually build the frame storage, the buffer manager, and the page table that takes the requests for pages from other parts

    05:39

of the system, goes to the disk scheduler, tries to find the pages you want, brings them into memory, and, if you don't have any space, runs your ARC replacement algorithm to decide what to evict. Okay.
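As a rough illustration of the std::promise pattern, here's a hedged sketch; DiskRequest, DiskScheduler, and the thread-per-request shortcut are invented for this example and are not BusTub's actual API.

```cpp
#include <cstring>
#include <future>
#include <thread>
#include <vector>

// Illustrative only: names and layout are made up, not BusTub's actual API.
struct DiskRequest {
    int page_id;
    char* data;               // destination buffer for the read
    std::promise<bool> done;  // fulfilled when the I/O completes
};

class DiskScheduler {
public:
    // Caller keeps the future and blocks (or polls) until the read finishes.
    std::future<bool> schedule(int page_id, char* data) {
        DiskRequest req{page_id, data, {}};
        std::future<bool> fut = req.done.get_future();
        // A real scheduler would enqueue this for a background worker;
        // here we spawn a thread per request just to keep the sketch short.
        std::thread([r = std::move(req)]() mutable {
            std::memset(r.data, 0, 4096);  // stand-in for the actual disk read
            r.done.set_value(true);        // signal completion to the caller
        }).detach();
        return fut;
    }
};

int main() {
    DiskScheduler sched;
    std::vector<char> page(4096);
    auto fut = sched.schedule(/*page_id=*/7, page.data());
    fut.wait();  // the buffer manager would block here until the page arrives
}
```

The point is just that the requester gets a future it can wait on while the I/O happens elsewhere; the real project queues requests for a background worker instead of spawning threads.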

So, super important: don't change any file other than the ones we tell you to change, because when it runs on Grade

    05:56

scope, it wipes away whatever you give us for anything other than those files, and if you made any changes elsewhere, it won't run. And the projects are all cumulative.

So project one is the start of this, but two, three, and four will be based on your buffer pool implementation

    06:12

from this project. So we have a bunch of tests to make sure you don't screw things up.

But you don't want to fall behind, because if you're trying to fix your buffer manager two months from now, it's going to make all the other projects a lot harder. If you have questions, please post on Piazza or come to office hours, but we won't

    06:28

    sit there with GDB and tell you how to debug things, right? We assume you know how to do that in C++ because of project zero.

Just like project zero, we run a bunch of checks to make sure that your code looks good. And then for extra credit this year, we will have a leaderboard.

So we have a version of Bus

    06:43

Tub that runs on Gradescope, and whoever has the fastest buffer pool implementation, actually for all the projects going forward, will get a ranked order on the leaderboard, and the top 20 get extra credit. And then this year we're actually going to try to get a trophy made, like the Stanley Cup, well, not

    06:59

that big, but whoever has the highest leaderboard score at the end of the semester, we'll put your name and year at the bottom of it, and that way over the years we can keep this thing growing. Okay.

Don't plagiarize, because you'll get messed up. And then ask questions.

    All

    07:16

right. So let me jump to the speaker.

Oh. All right. This is Joseph.

He's been at Single Store for a while. You're coming next week as well, right?

    I'm gonna be here next week. Yeah.

    Right. He's awesome.

    He's super smart.

    07:32

He's coming next week. The floor is yours, man.

    Go for it. All right.

    Cool. So, yeah.

I'm going to tell you guys a little bit about Single Store. Single Store is kind of a unique database in the industry.

I think it's probably one of the most unique databases in the industry. And

    07:49

so let's just dive right in. This is going to be kind of a rapid-fire talk.

I'm going to talk a little bit about the two types of database workloads and how Single Store presents itself as a sort of hybrid option. Then I'm going to switch gears completely and talk about modern cloud systems and the kind of

    08:04

benefits Single Store gets from running in the cloud. So let's get started.

So, the two types of database workloads. You've maybe seen this in class so far, or you will later in the semester, but there are basically two

    08:20

broad categories of database workloads. The first is OLTP, online transaction processing.

It's characterized by a large number of transactions, super high concurrency, small reads and writes, stuff like that. Any row, you

    08:37

need to be able to fetch it at any given moment, and you might need to be able to update it at any given moment. So everything is optimized to be able to find stuff, index stuff, do transactions, change things around.

    And so you typically will have very high

    08:54

concurrency, lots of small queries hitting the database all at once, very short SLAs. You expect things to return in milliseconds, and you really care about ACID.

    You really care about your transactions being atomic and not

    09:11

seeing any weird isolation side effects, and you don't want your writes to be lost. So you care about these kinds of things with that kind of workload.

So you can think about running applications. You can think

    09:28

about running literal transactions at a bank, or, if I click on something on a website, you pay me and I'm going to send you this thing that you just bought.

    09:44

And so typically these kinds of databases will use row-oriented data structures, right, because every row has to be findable and updatable very fast. So they'll typically use B-trees and skip lists, which allow you to find one row

    10:02

very fast. And you've probably learned about that in this class.

So, examples of this: Postgres is basically synonymous with OLTP these days. Older databases Oracle and SQL Server are of course there, plus Cockroach and Spanner, these

    10:20

being the kind of modern large-scale transactional databases. The other broad category is analytical databases, OLAP databases.

These are characterized

    10:38

by large bulk processing queries, and sometimes the queries will take minutes, hours, or even days to run. Maybe these days that's not considered acceptable, but these large reporting queries can take a long time, and you

    10:53

will often run on huge scales of data. Terabytes or petabytes are not uncommon.

So you should think large reporting queries, analytics, dashboards, and big bulk loads that you do

    11:08

    periodically. You're not updating one row.

And so the data structures used to build these kinds of things are very different. You use these column-oriented data structures, and you're going to learn about that in this class,

    I assume

    11:24

    I hope. I love column stores.

I think they're so cool, and they're one of these data structures that just makes me so happy. So it's going to be fun to learn about those.

But yeah, so they use these column-oriented data structures

    11:40

that are really, really good at scanning really fast. Examples of this kind of database are Snowflake, Clickhouse, Redshift, BigQuery, Vertica.

There are different flavors within this category, but

    11:57

they're all pretty much distributed column-store systems, and they're good at big-data analytics. So the question is, why is it hard to do both?

    And like why are there these two different data structures and I can't just have one data structure

    12:13

that does both? You could talk about how row stores cause write amplification and do a bunch of random I/O, so their scan speed is limited, and that's kind of true, but really the reason is that column stores are really, really,

    12:30

really fast. When column stores do scans, they just blow through your data, and it's really impressive and a really cool technology.

But with column stores,

    12:45

the updates are generally not fast. So if you want to scan huge amounts of data, you're kind of forced into this column-store world, and your updates will suck because of it.

The questions of what the other data structures do are almost irrelevant;

    13:00

the column store is just so good. So there's a fundamental fact of the universe: no column store will ever be as fast as a row store on point queries. And so if you have a database, it's either a column store for analytics or a row store for

    13:17

transactional workloads, and typically it's not both. But that's what Single Store is. It's a hybrid row-store column-store thingy.
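As a rough illustration of why scans favor columns (my own sketch, not a benchmark from the talk): summing one attribute in a row layout drags whole records through the cache, while the column layout scans one dense, compressible array.

```cpp
#include <cstdint>
#include <vector>

// Row layout: each record's fields sit together.
struct Row { int64_t id; int64_t price; int64_t qty; };

// Summing one attribute over a row store strides across whole records...
int64_t sumPriceRows(const std::vector<Row>& rows) {
    int64_t sum = 0;
    for (const Row& r : rows) sum += r.price;  // drags id/qty bytes along too
    return sum;
}

// ...while a column store scans one dense array of just that attribute.
int64_t sumPriceColumn(const std::vector<int64_t>& priceColumn) {
    int64_t sum = 0;
    for (int64_t p : priceColumn) sum += p;    // sequential and vectorizable
    return sum;
}
```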

So what is Single Store? What is Single Store as a hybrid option?

Why do we exist? Well, we're not really a hybrid

    13:35

option. It would not be fair to say that we're some kind of franken-thing.

We're a column store with tricks, is what we are. We're a column store with a row store bolted onto it that is kind of considered the

    13:51

first segment of that column store, and it's used to store the hot rows. And then we optimize our column stores in a way that doesn't negatively affect the scan performance but makes it so that

    14:08

they're decent, not as good as a row store, but decent, at transactional workloads. And the way we do that is, first of all, you change the column-store encodings so that they're seekable, which by itself doesn't give you much, but you

    14:23

add these secondary hash indexes that tell you the offsets, so it's like, oh, I know all the offsets, I can seek in real fast. And you kind of use that row store to manage the locking, virtual locks, for the stuff in the column store. And I don't have

    14:39

a ton of time; I could give an hour-long presentation on each of these things. But somehow it's a column store that doesn't fall

    And that is very unique in the industry. Um, and yeah, so

    14:56

column stores are not generally good at concurrent writes, but this one is. And for those things that really need transactional workloads, we do offer dedicated in-memory row-store tables as well.

We used to be called MemSQL. We used to be mostly focused on

    15:11

that in-memory row-store thing; we changed our name because we're not really mostly focused on that anymore. But we do still have those.

What does this buy us? What kind of workloads can we do that our competitors can't?

    15:28

Well, analytical databases love these analytical benchmarks: TPC-H, TPC-DS, there's a ton of them.

These analytical benchmarks kind of have the flavor of, they'll give you a hundred queries or whatever, and Snowflake's better on one,

    15:44

BigQuery is better on a different one, ClickHouse is better on another one, Single Store is better on one. It's just a bag of tricks.

It depends on: does your optimizer have this thing? Does your query execution have this thing?

Does your column-store execution have this thing? And so Single Store

    16:02

is competitive with the best data warehouses on these analytical benchmarks. OLTP databases do not finish these.

If you're not a column store, you don't stand a chance. Don't even bother.

So Single Store can do these things, and we're just as good as anyone else.

    16:17

But we can also do the transactional benchmarks. We can do TPC-C, and we can do other sorts of point workloads, and Snowflake just doesn't work there.

It just doesn't work. So

    16:35

it's not about being better than anyone else. It's that we're the only one that can actually finish all of these, and that's very cool.

So we're the only system that I'm aware of that can finish all three, at least the only commercially available one.

I think there are

    16:51

some academic ones that can do it as well. And that's pretty cool.

So, are we a hybrid database? Maybe our marketing will tell you we're HTAP or something, but I think of us as an analytical database that doesn't fall

    17:07

over under high concurrency or point workloads. We can do a wide variety of mixed workloads: dashboards, logistics, manufacturing, streaming, AI, that kind of stuff.

It's mostly analytical, with enough transaction processing that the other systems can't keep

    17:23

    up. Cool.

So I'm flying through this. I've got a couple more slides on a completely different topic, modern cloud systems, but I'm also over time already, so maybe I should

    17:39

stop for questions and just let that be the presentation. Yes?

Any questions for Joe? Yes?

The question is, would using worse encodings mean higher storage requirements?

    17:55

Yes.

Yeah, he's asking about encodings. Let's hold that question for next class.

Joe's setup is basically the intro for the next

    18:12

lecture. This is fantastic.

So a bunch of stuff may not make sense now. It'll make sense on Monday, and plus he's coming to campus, so you can talk to him again about it.

    Okay. All right, guys.

    Let's thank Joe

    18:28

And like I said, DJ Cash's radio show is this Sunday at 1 o'clock. And then see everyone at the visit day on Monday.

    Hit it. [Music]
