#05 - Log-Structured Database Storage ✸ SingleStore Database Talk (CMU Intro to Database Systems)

Category: Database Systems

Tags: buffer, compaction, databases, index, storage

Entities: Clickhouse, DJ Cash, Larry Ellison, Oracle, Postgres, Redshift, SingleStore, Snowflake, Vertica, WRCT

Summary

    Announcements and Events
    • DJ Cash will be hosting a radio show on WRCT this Sunday from 1 to 2 p.m.
    • Homework 2 will be released today, and Project 1 is due on September 29th.
    • Industry affiliates visit day will occur on the 16th and 17th, with research talks and company sessions.
    • Larry Ellison, founder of Oracle, is now the richest person in the world.
    Database Systems
    • Buffer manager is responsible for managing database pages in memory.
    • Optimizations include maintaining multiple buffer pools, prefetching, and scan sharing.
    • Index-organized storage uses the index as the storage for the table itself.
    • Log-structured storage focuses on appending new records, improving write efficiency.
    SingleStore Database
    • SingleStore is a hybrid database supporting both OLTP and OLAP workloads.
    • It uses a column store with optimizations for transactional workloads.
    • SingleStore can handle both analytical and transactional benchmarks effectively.
    Actionable Takeaways
    • Consider using multiple buffer pools for better memory management.
    • Prefetch data to reduce query execution time.
    • Use index-organized storage to eliminate separate index lookups.
    • Apply log-structured storage for efficient writes and updates.
    • Explore databases like SingleStore for hybrid workload support.

    Transcript

    00:00

    [Music]

    00:25

    All right, give it up for DJ Cash. Uh, you look well.

    Thank you. Thank you.

    Uh, why? Uh, I'm just coming back from court.

    Um, trying to get back payments on my royalties.

    00:40

    Court. Yeah.

    Did you win or you don't know yet? I'm not too sure.

    I'll let you know next week. Okay.

    Uh, well, there are a lot of announcements. Um, I'm actually announcing for you as well.

    Uh, you have a radio show starting this week. I do.

    I do, on Sunday. This Sunday from 1

    00:56

    to 2 p.m., I have a show starting on WRCT.

    You guys should all walk in. All right.

    This Sunday, DJ Cash is on the radio. Uh, are they paying you? No.

    No. All right.

    You got to work on that. Okay.

    Um for you guys in the class, uh

    01:14

    homework 2 is going to go out today. Uh that'll be on the 21st.

    Project one got released yesterday. Uh and that's going to be due on September 29th.

    Has anybody looked at project one yet? Any questions about project one?

    01:30

    So, we'll announce the recitation; that'll be next week. Uh, but all the files and everything should be on GitHub.

    So, pull down the latest version and make sure you're working off that codebase. And then, as I said last class, next Monday and Tuesday, on the 16th and 17th,

    01:46

    we are having our industry affiliates visit day. So a bunch of database friends from various companies are coming to campus to give talks and meet with you guys.

    So the 15th will be research talks and poster sessions with students and other researchers at CMU. Um, and then on the 16th in the morning,

    02:02

    before like 12:30 there'll be about three sessions of different talks from the companies. You can talk to them about you know internships and full-time positions.

    And like I said, we'll send them your resumes if you fill out the spreadsheet that I posted on Piazza. Okay.

    Yes, question?

    02:17

    The question is: is there a list of research talks and times? Yes, that link in the slides will take you to the page.

    The finalized schedule is not ready yet, but that'll be posted probably tomorrow. And then for the company sessions,

    02:33

    once everyone fills out the form tomorrow, we'll figure out what the actual optimal ordering will be. Okay, any other questions?

    All right. So, this is very unusual.

    Uh, but there was

    02:50

    very, very, very huge news today in databases like two hours ago. Anybody know what it is?

    Like groundbreaking news, like the biggest news story in your lifetime. Nobody.

    Larry Ellison, the founder of Oracle, is

    03:07

    now officially the richest person in the world. He dethroned Elon Musk.

    What's that? Who?

    I don't know. He doesn't do

    03:22

    databases, right? This guy does, right?

    All right. Larry Ellison is like the OG of database money, right?

    Again, now he's the richest person in the world. It's all paid for by databases.

    He owns a Hawaiian island, paid for by databases, and like a big one, right? Not the one that had all the

    03:38

    lepers, right? One you could actually see from, like, Maui.

    Uh, he bought his son from his fourth marriage a movie studio, right?

    He got married last year; the woman's 36, he's 80,

    03:54

    whatever. Like that's his thing, right?

    This is a big deal. Like I know he's worked so hard for it.

    Um and it really means a lot to him and I sent an email congratulating him. Um but this is it.

    This is why we do databases. This right here, right?

    So, I mean, I I would do a round of applause for him, but he's not

    04:10

    here, so that'd be kind of weird. But again, he's been bouncing back and forth between fifth and seventh, and then earlier in the year he went from third to second.

    So, this is a big deal. And I mean, I get emotional, but this

    04:27

    means a lot. Okay.

    All right. So, we're not there yet.

    We're not at Larry's level. Um, but we'll we'll get there.

    All right. So last class we spent time talking about the buffer manager, right?

    We said that this is the component inside of every database system. If

    04:43

    you're not using mmap (you shouldn't be, right?) and you're going to manage your own memory in your database system, it's the component that is responsible for bringing database pages from disk into memory, handing off pointers to other parts of the system, and keeping track of the reference

    04:58

    count, or whether pages are dirty or not. Right? Doing all the machinery that allows us to have the illusion that our database can fit in memory even though it may not.

    Right? So in today's class I want to quickly go over

    05:15

    three optimizations we missed at the end of last class. Then we're going to jump back to the disk level and refresh your memory about what tuple-oriented storage looks like.

    And then we'll look at two alternative methods: index-organized storage and log-structured storage. And then at the end of class today, we'll finish up with a talk from the

    05:31

    SingleStore guys. All right. Which actually started off not as a disk-oriented database but as an in-memory database, where the primary storage location of the database was in memory.

    They switched that, but, you know, it's a pretty

    05:46

    state-of-the-art system. And the founder of SingleStore was also the founder of Neon, which got bought by Databricks for a billion dollars last year. So again, Larry Ellison is not the only one who makes a lot of money doing databases.

    Okay. So all right so last class

    06:03

    again, we were trying to quickly go over the buffer stuff at the end, the disk scheduling stuff, and I want to come back and talk about buffer optimizations. Again, these are going to be examples of things your database system can do because it has full control of the memory and what's being read and written from disk.

    Again, these are things that the operating system is not going to be

    06:19

    able to do very easily, because it doesn't know what queries are running and what the queries want to do. But we, as the database system, we know, right?

    Because SQL is declarative, we know what the query plan is going to look like.

    Therefore, there's a bunch of tricks we can do to make things better.

    06:34

    So the first one is maintaining multiple buffer pools. So I talked about this buffer pool being this region of memory where we're going to put our database pages.

    It doesn't have to be logically organized as a single block of memory. We actually could break it up into subarrays and have a page table per

    06:52

    subarray, and have different policies and algorithms used to optimize those different buffer pools.

    Right? So you can have one per database, one per table, or one per page type.

    If it's going to be indexes versus table pages, you

    07:07

    split those up, right? And so the benefit we're going to get is that, because we can tailor each buffer pool's internal algorithms based on how we know the rest of the system wants to use the data that's going to be put in that buffer pool, we can make better decisions about what to evict and when

    07:22

    to evict. And we're actually going to reduce latch contention when we have multiple workers trying to access the same data structures at the same time, because now things are split across pools.

    So this is something you'll see in the high-end systems, the enterprise systems. Like in DB2, the IBM system, you can go crazy: you can declare a buffer

    07:39

    pool per table, per index, per blob type. You can do a lot of very sophisticated things. Um, MySQL has a pretty simple one.

    We'll see that in a second. So there are basically two ways to do this.

    You can just say, "All right, well, if I

    07:55

    know I've got to do lookups on record IDs or object IDs (the object could be a table or an index), then when I do my lookup to try to get the page that this object belongs to..." I'm showing some arbitrary

    08:11

    command, get record whatever. It's not really SQL, obviously. Um, but you can pick apart this record ID, look at maybe the first portion of it (that's the object ID), and use that to decide which of these different buffer pools is going to have the data that you want.

    And the

    08:27

    thing about the guarantee, if you have multiple buffer pools, is that any page that we want, any physical page, can only exist in one and only one buffer pool at a time, right? You don't want two copies, because then if you do two writes, how do you keep them in sync?

    We don't want to do that, right? So again, with a simple

    08:44

    mapping we can say, okay, this is what we want for this object ID, and we know it's in buffer pool one, so go get it over there. Another simple trick, and this is what MySQL does, is that you can just take whatever record ID it's trying to find, like, I want page 123 or record 123, and just hash it

    09:01

    and mod it by the number of buffer pools that you allocate in the beginning, and that just tells you where to go find it. Again, this has the same guarantee: any page that we're looking for can only be in one of these buffer pools at a time, right?
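    As a rough sketch of that hashing trick (illustrative names only, not MySQL's actual code), hashing the ID and modding by the pool count is enough to guarantee that each page lives in exactly one pool:

```python
# A minimal sketch of hash-based buffer pool routing (assumed names;
# not MySQL's actual implementation).

class BufferPool:
    def __init__(self, name):
        self.name = name
        self.frames = {}                      # toy page table: page_id -> bytes

    def get_page(self, page_id):
        # Fetch from "disk" on a miss; each pool only ever sees its own
        # pages, so it can run its own eviction policy and latches.
        if page_id not in self.frames:
            self.frames[page_id] = f"<contents of page {page_id}>"
        return self.frames[page_id]

N_POOLS = 4
pools = [BufferPool(f"pool-{i}") for i in range(N_POOLS)]

def route(page_id: int) -> BufferPool:
    # Hash the id and mod by the pool count: a given page always lands
    # in the same pool, so no page is ever cached twice.
    return pools[hash(page_id) % N_POOLS]

page = route(123).get_page(123)
```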

    I'm showing two buffer pools. You

    09:17

    could have hundreds, right? And then there's a disk scheduler underneath, because all these buffer pool instances may be trying to read and write data at the same time.

    You don't want to have a bunch of random I/O. The disk scheduler can see all the requests, sort them, and

    09:33

    try to do it in a more efficient manner. Another trick we can do is prefetching.

    So again, we haven't really talked about what query plans look like. Just assume the plan is the program that we know we're going to execute in order to run some query, right?

    So we know everything ahead of time what's

    09:48

    going to happen. So say we see a scan occurring on data, right?

    So query one's going to start. The first thing it needs is page zero, and then it starts reading page one, so go read that.

    But if we recognize that we have this sequential access pattern, then the

    10:04

    database system can say, "Okay, well, you just read page zero and you read page one. I know your query is also going to need page two and page three.

    So, let me go ahead and prefetch that for you. Put that in the buffer pool while you're crunching on page one." Right.

    Right. Run whatever

    10:20

    eviction policy we want to run, and put those pages in there.

    So now when the query says, "All right, I'm done with page one. Now let's go look at page two," it doesn't stall and wait while you go fetch it from disk.

    The data you need is already there, right? And you do that for the rest

    10:35

    of the pages all the way down. Now, mmap can do that.

    Actually, the OS can do this, right? The OS has a way: you can pass hints to say, I want to prefetch as I'm reading a file.

    Um, but it can only prefetch sequentially, right? It can't do

    10:51

    anything more sophisticated because it doesn't know what's in these pages. Doesn't know how they're related to each other.

    So, you can do tricks like this. Say I have a query that wants to get data within a range, and the database system decides, okay, well, I have an index for this, I can use that to find the data that I want. Right? Assuming some tree

    11:08

    data structure (it's most likely going to be a B+ tree, we'll cover that in a few weeks, but it doesn't have to be), the data along the leaf nodes is sorted in the key order we want. So now, as my query starts running, I read index page zero. That's the root; that's

    11:23

    going to tell me whether I want to go left or right. Now I go down here, I go to page one, and I fetch that in. And that just happened to be sequential in my database file. So yeah, you could potentially prefetch that.

    But when I get down here to the bottom, I can see that I'm going to scan across the leaf nodes and I know I'm going to need page

    11:38

    three and page five, and they're not sequential. So if the OS was doing this for us, it would blindly say, "Okay, well, you just read page one and page two.

    Let me go get page three and four for you." But we actually want three and five. Because we know the

    11:55

    content of these pages, we can peek ahead (we'll talk in a few weeks about how we do that) and say, oh, we don't actually want four.

    We actually want five, because we're scanning along the leaf nodes. And we go ahead and prefetch those guys.
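    Here's a toy sketch of the simple sequential case of this (the class name and prefetch depth are made up for illustration): the pool watches the last two page requests, and if they're consecutive it pulls the next couple of pages in before the query asks for them:

```python
# A toy sequential-prefetch sketch (assumed names, not a real DBMS API).
from collections import deque

PREFETCH_DEPTH = 2

class PrefetchingPool:
    def __init__(self):
        self.frames = {}
        self.history = deque(maxlen=2)        # last two page ids requested

    def _load(self, page_id):
        if page_id not in self.frames:
            self.frames[page_id] = f"<page {page_id}>"   # pretend disk read

    def get_page(self, page_id):
        self._load(page_id)
        self.history.append(page_id)
        # If the last two requests were consecutive, assume a sequential
        # scan and pull the next pages in before the query needs them.
        if len(self.history) == 2 and self.history[1] == self.history[0] + 1:
            for nxt in range(page_id + 1, page_id + 1 + PREFETCH_DEPTH):
                self._load(nxt)
        return self.frames[page_id]

pool = PrefetchingPool()
pool.get_page(0)
pool.get_page(1)              # triggers prefetch of pages 2 and 3
assert 3 in pool.frames
```

    The index-aware case described above works the same way, except the "what comes next" decision peeks at the leaf node's sibling pointers instead of assuming page N+1.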

    Right? So again, that's an example: because the database system knows what the

    12:12

    queries are trying to do, it's in a better position to make decisions about how to optimize things. All right, the last one I'll talk about is scan sharing.

    Um, and the idea here is that if we have a query that's running and it's scanning some table and

    12:28

    then some other query comes along and wants to scan the same table, rather than just having that second query start from the beginning and re-read all the same pages the first guy just read, we actually can piggyback and ride along with the first query, see all the pages that it sees, and then

    12:44

    make a determination whether we want to quit, or come back around and find the data that we're looking for. These are sometimes called synchronized scans, and again, this is done at the lowest level of the system, as some cursor is reading pages from the page directory or scanning along a table.

    13:01

    This is not the same as result caching. Result caching is: I run a query, I produce some output.

    If that same query shows up, I just reuse the same output. Scan sharing is done at the lowest level, while we're actually running queries.

    So this is actually rare; few systems support it. Again, the high-end

    13:18

    ones like DB2, SQL Server, and Teradata do this, and Postgres does too; Postgres is the exception. Oracle, again, despite him now being the richest person in the world, doesn't have this feature. They have something that almost looks like it, but it has this limitation where

    13:34

    the queries have to be exactly the same in order to do this piggyback mechanism, right? So, meaning (this is from the documentation), if you have a query like SELECT * FROM employees, and the next query comes along and says SELECT * FROM Employees with a capital E, right? Or there's an extra space in there, it

    13:50

    doesn't match. Even though semantically they're the same, the way they're doing this matching is basically hashing the string and seeing whether it's identical, right?

    But the high-level idea is still the same, right? So say we have this query, SELECT SUM(...) FROM A. It starts at

    14:06

    the beginning. It's going to scan through and read all the pages that it needs.

    And then say at this point, we get here and we need to read page three. We only have three frames in our buffer pool.

    So we're going to throw out page zero. And while we're doing this, another query shows up that basically wants to do another

    14:23

    sequential scan on the entire table, but now compute a different aggregation. So the computation is different, but the data it needs to access is the same.

    So we could do the stupid thing and just start this second query from the very beginning of the table, but it's

    14:39

    going to read page zero. And that's actually the last page we just evicted.

    So we'd just go back and fetch it back into our buffer pool. But if we're smart about it, we can say, okay, well, these are both doing a sequential scan on A.

    So I'll

    14:54

    just have Q2 tag along with Q1. As the cursor reads the table and gets all these pages, each query can do whatever computation it wants.

    Then at the bottom, Q1 says, I've scanned the entire table, and it goes away. But Q2 says, "Well, I started at

    15:10

    sort of the halfway point of this table, and there's a bunch more pages I need to go back and read. So let me wrap around and scan through those." Right?

    Again, we talked about this last class: we want to maximize the amount of computational

    15:25

    work we can do on data pages when we bring them from disk into memory. So if I can satisfy two queries' operations with one single page fetch, that's a huge win for us.
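    A simplified sketch of that idea (a hypothetical API, not Postgres's actual implementation): a leader cursor drives the I/O, an attached query computes on the same pages, and when the leader finishes, the follower wraps around for the prefix it missed:

```python
# A toy scan-sharing (synchronized scan) sketch with assumed names.

class SharedScan:
    def __init__(self, pages):
        self.pages = pages
        self.pos = 0                  # cursor position of the leading scan
        self.followers = []           # (callback, position where it joined)

    def attach(self, cb):
        # A new query piggybacks at the leader's current position.
        self.followers.append((cb, self.pos))

    def step(self, leader_cb):
        # Leader fetches one page; every attached scan computes on it too,
        # so a single page fetch satisfies multiple queries.
        if self.pos >= len(self.pages):
            return False
        page = self.pages[self.pos]
        leader_cb(page)
        for cb, _ in self.followers:
            cb(page)
        self.pos += 1
        return True

    def finish_followers(self):
        # After the leader completes, each follower re-scans the prefix
        # of the table it joined too late to see.
        for cb, start in self.followers:
            for page in self.pages[:start]:
                cb(page)

pages = list(range(6))
scan = SharedScan(pages)
q1, q2 = [], []                       # stand-ins for two aggregations
scan.step(q1.append)
scan.step(q1.append)
scan.step(q1.append)
scan.attach(q2.append)                # Q2 shows up partway through
while scan.step(q1.append):
    pass
scan.finish_followers()               # Q2 wraps around for pages 0-2
assert sorted(q2) == pages
```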

    Yes.

    15:47

    The question is: at this point here, Q2 has not seen page one or page two, but Q1 has.

    Yes. Yes.

    16:06

    Yeah. So his statement here is: could we be more sophisticated and have Q2 recognize that pages one and two are already in memory, by peeking in the buffer pool, and then say, okay, let me read those and then piggyback? Yes. In this PowerPoint example I'm not showing that, but yes,

    16:22

    you could do that. Now, who does that? I don't think Postgres does, and the enterprise guys are closed source, so I don't know what they do. Yes, the question is: how do you know what pages correspond to what table? The page

    16:38

    directory tells you. Yeah, the page directory is basically a catalog. It says, okay, if I want to read table A, here's the directory with the files that have the pages that you want to read. Yeah, we have to know that.

    16:54

    The statement is: this assumes a sequential scan. If there's an index scan, it might not read the same pages.

    Yes, if there's a predicate there, from a WHERE clause, you might not want to read everything if there's an index. Sure.

    Keep it simple: sequential scan,

    17:12

    right? All right.

    All right. So again, those are the three optimizations I wanted to cover in the buffer pool.

    So let's jump back now and talk about disk storage again, the database storage on disk. So we said at

    17:29

    the beginning of the semester that we're going to focus on what I'll call a disk-oriented architecture. Again, we assume that the primary location of the database, when it is at rest, is on nonvolatile storage like a disk.

    And then we talked about a way to break up a database: the entire database goes

    17:46

    in a bunch of these database pages, represented as slotted pages where we could stick in tuples and then do lookups later on. Right?

    And so we discussed the page layout, the page layout being, again, what's actually in the pages

    18:03

    that we're storing in our database files, right? And we talked about tuple-oriented storage; that was the slotted page architecture, where the pages are really all about storing tuples, and we have this indirection layer within

    18:19

    each page that allows us to move data, or tuples, around within a single page itself. So today I want to look at two alternatives.

    Again, we'll refresh what tuple-oriented storage looks like, but then we'll jump into the new stuff for today.

    Right? And these again

    18:36

    these are going to be alternative ways to store data for tables in a database. So again, the slotted page: I said this is the most common layout for databases that are row-oriented.

    Next class we'll discuss what that means in more

    18:51

    detail. Basically, we're going to take any tuple (assuming we can fit it in this page) and store it inside that page.

    And then inside that page, we have this header that tells us some metadata about what's in the page, like a checksum, a version number, and so forth. Then

    19:07

    there'll be a slot array at the beginning and all the tuple data at the bottom. And the slot array is going to have basically just record offsets inside the page that tell us where to jump to find the beginning point for every single tuple.

    And we said the slot array is

    19:23

    allowed to grow from the beginning to the end, and the tuple data is going to grow from the end to the beginning.

    This is the way Postgres does it, and I think the way the textbook describes it.

    Other textbooks might reverse it. In the

    19:39

    end, it doesn't matter whether the slot array or the tuple data is on the top or the bottom; the high-level idea is the same. Right? And again, the slot array is providing us indirection that allows us to change the position of a tuple within the page itself. Right? We

    19:55

    don't want to move it to another page (we may have to, but we're not focusing on that yet), and this allows us to do things like, oh, I'll delete this tuple here, and then I can do compaction and slide this other tuple over to fill in the space, right? And all I have to

    20:11

    do now is update the slot array to tell me where the offset is for that tuple I just moved. And I don't have to update anything else outside this page itself.
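    Here's a toy slotted page that shows those mechanics (sizes and layout simplified for illustration): the slot array grows from the front, tuple bytes grow from the back, and compaction slides live tuples together while only the slot array changes, so handed-out record IDs stay valid:

```python
# A toy slotted page sketch; not any real system's on-disk format.

PAGE_SIZE = 64

class SlottedPage:
    def __init__(self):
        self.data = bytearray(PAGE_SIZE)
        self.slots = []                 # slot id -> (offset, length); None = deleted
        self.free_end = PAGE_SIZE       # tuple data grows down from here

    def insert(self, tup: bytes) -> int:
        self.free_end -= len(tup)
        self.data[self.free_end:self.free_end + len(tup)] = tup
        self.slots.append((self.free_end, len(tup)))
        return len(self.slots) - 1      # the slot number part of a record id

    def delete(self, slot: int):
        self.slots[slot] = None

    def get(self, slot: int) -> bytes:
        off, ln = self.slots[slot]
        return bytes(self.data[off:off + ln])

    def compact(self):
        # Slide live tuples back together at the end of the page and
        # update only the slot array; nothing outside the page changes.
        end = PAGE_SIZE
        # Move in descending-offset order so no tuple is overwritten
        # before it has been copied.
        order = sorted((i for i, s in enumerate(self.slots) if s is not None),
                       key=lambda i: -self.slots[i][0])
        for i in order:
            off, ln = self.slots[i]
            end -= ln
            self.data[end:end + ln] = self.data[off:off + ln]
            self.slots[i] = (end, ln)
        self.free_end = end

page = SlottedPage()
a = page.insert(b"alice")
b = page.insert(b"bob")
c = page.insert(b"carol")
page.delete(b)
page.compact()
assert page.get(a) == b"alice" and page.get(c) == b"carol"
```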

    Then we talked about record IDs and this was the way that we were going to

    20:28

    physically find a tuple within our database, right? And it's going to be some combination of a file name or file ID, a page ID, and then a slot number within that page.

    Right? So the page directory thing

    20:45

    they were asking about before, that'll store the file ID for some object. Like, if I want table foo, here's the file where to go find it. And then if I want page 123, depending on the addressing

    21:01

    scheme the database system uses, it could just be an offset within the file, or the file name itself might have the page number. Right? All of that's within the page directory. Yes, what if you have, like, more rows? The

    21:16

    question is, what if you have more rows than the size of the ID? Um, I mean, it's a 64-bit number.

    Well, hold up. In Oracle, it's a 10-byte number, right?

    That's pretty big. Well, so Ingres is a system

    21:33

    from, like, 1974. Postgres is, yeah,

    six bytes. Would you run out of six bytes? That would be a pretty big table. Yeah, but in theory, yes, you could.

    21:48

    There are other limitations too, right? Like, the ID space is not just for one table; it could be for the whole database.

    Now, depending on the scheme you're using, this could be for the entire database,

    or it could be for a single table, right? I know I want table foo,

    22:06

    and here's the page ID for it. Therefore, I know how to find that.

    Right? There's other limitations we're not going to talk about.

    Like, in Postgres you can only have 2^16 columns in a table. Oracle famously only lets you have a thousand because somebody hardcoded it, right?

    It

    22:22

    used to be like 200, and then they spent a year trying to fix it to go to a thousand. But that was 10 years ago; no one's going to change it again.

    It's stupid things like that. But yeah, you're very unlikely to run out of space for this.

    There are other tricks you can do if you need to

    22:37

    get around this. Like, in the case of Postgres, you can partition a table.

    It's like: here's a parent table, and then within that I have sub-tables. So logically it looks like a single table, but each one has its own, right?

    There's ways to get around it.

    22:55

    Say it again. The question is, for a row-store database system, would it make sense to have a thousand columns?

    People do it. Sure.

    Why not? Right.

    There's easily applications that have a thousand columns. Yeah.

    23:11

    Enterprise stuff is nasty. Like anything from the SAP enterprise resource planning (ERP) thing, the thing they make a lot of money on.

    That thing is insane, right? Because think of like an application running for like 30 years.

    Every two years someone's got to add like 10 columns because something else gets added. Then before you know it, you're

    23:27

    at a thousand easily. To your point, I mean, it's a long tail, right?

    Most tables are going to have maybe, I don't know, 20 or 30 columns, but there's easily stuff out there that'll have that many.

    23:44

    The statement is, at that point you would have a single row across multiple pages. Uh, it depends on what the columns are.

    Then, yes, there are multiple pages.

    23:59

    Sure. But the database system can handle that.

    Yeah, for our discussion here, we've simplified it and assumed it's all going to sit in a single page. But we actually talked about large records, or large attributes: you just have a pointer to another page, and then at runtime the DBMS

    24:16

    knows how to stitch it back together. All right, so in this tuple-oriented storage, reads are pretty simple.

    All right. If I want to get an existing tuple, then I'm going to get its record ID.

    I'm

    24:33

    not saying how to get that just yet, but assume there's some index, something that's going to tell me, for your email address, here's the record ID that has your student record. There's something there.

    Here's the record ID that has your student record. There's something there.

    It's the index that's going to tell us how to find that. But for our purposes right here, we don't care.

    So again, how do we find this? We go look in the page directory.

    Go get

    24:50

    Find the location of the page we want; retrieve that page from that location on disk if it's not already in memory; and then use the slot number embedded inside the record ID to jump into the page's slot array and find the data we're looking for. So reads

    25:05

    are pretty simple. In this case here, we have to do the lookup in the index to find the data that we're looking for. So we do a lookup in this index that's going to

    25:21

    tell us: for a given logical key, like your email address, when we look at the index, it's going to come back and say, here's the record ID. Then we do a lookup to go find the data that we want, given that record ID, within the page. But what if we didn't have to do that? What if we didn't have to do a

    25:38

    separate lookup in the index and a separate lookup in the table heap? This is called index-organized storage, and the basic idea is that we're going to use the index as the storage for the table itself. So again, what is the index doing for

    25:55

    us? It's a mapping from a key to a value.

    In this case it would be a logical key, like your email address, the primary key or something; and then the value would be the record ID, if it was tuple-oriented storage. With index-organized storage, in the leaf nodes, instead of storing a record ID, we're

    26:10

    just going to store the tuple itself. So again, I'm not telling you what data structure this is going to be, but typically it's going to be a B+ tree. Could be a skip list.

    Could be a trie. It could be whatever your favorite tree data structure is.

    So now, within the pages, sorry, the leaf nodes, it's

    26:27

    going to look a lot like the slot array from tuple-oriented storage, but in the beginning, in the header, we're going to have this key-to-offset array mapping that we have to keep in sorted order based on the key. And these are, again, just going to be offsets within the

    26:43

    page that give us the tuple that we want. So now when I do that lookup based on your email address, I would scan down through the inner nodes in the data structure, land on a leaf node, then do a binary search inside this key-offset array, and that'll find

    27:00

    the record that you're looking for. And I don't have to do that separate lookup in the page directory to go find your tuple for a given record ID.

    Everything's right there in the leaf node. So SQLite, and MySQL

    27:15

    with InnoDB, famously give this to you. Like, that's basically how they organize the data in their systems.

    With Oracle and SQL Server, by default you get the slotted page architecture that I described before. But

    27:31

    when you create a table, you can tell it: I want index-organized storage. All right.

    So they'll support this as well. And just like before, the key-offset array grows one way and the tuple data grows the other way.
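    A sketch of an index-organized leaf node under those assumptions (toy layout, not SQLite's or InnoDB's real format): a sorted key-to-offset array in the header, full tuples in the body, and a binary search that lands directly on the tuple with no separate heap lookup:

```python
# A toy index-organized leaf node sketch.
import bisect

class LeafNode:
    def __init__(self):
        self.keys = []            # sorted keys
        self.offsets = []         # offsets[i] locates the tuple for keys[i]
        self.body = bytearray()   # serialized tuples live in the node itself

    def insert(self, key, tup: bytes):
        off = len(self.body)
        self.body += tup
        i = bisect.bisect_left(self.keys, key)
        self.keys.insert(i, key)
        self.offsets.insert(i, (off, len(tup)))

    def lookup(self, key):
        # Binary search the key/offset array; the tuple is right here in
        # the leaf, so there is no second lookup in a table heap.
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            off, ln = self.offsets[i]
            return bytes(self.body[off:off + ln])
        return None

leaf = LeafNode()
leaf.insert("bob@cmu.edu", b"<bob's tuple>")
leaf.insert("alice@cmu.edu", b"<alice's tuple>")
assert leaf.lookup("alice@cmu.edu") == b"<alice's tuple>"
```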

    Okay. Well,

    27:48

    so that solves the problem we had in reads before, where I want to find a record or a tuple based on a key: I have to go look in the index, and then,

    28:03

    for the given key, the index tells me the record ID, and from the record ID I can go to the page that has the tuple data. But we still have to worry about writes. So, to do an insert, it's pretty straightforward,

    28:19

    right? We just go check the page directory to see whether there's a page that has enough free space for the tuple we want to store.

    Again, we know the size of the data we're storing because SQL is declarative. We know exactly what you're telling us to store.

    28:36

    So we can figure out, at the moment you try to insert it, how much space we need. Then we retrieve that page from disk if it's not already in memory, check the slot array, find a free slot, and insert into it, right?

    Ignoring the index organized

    28:51

    storage stuff, like rebalancing the tree; we'll worry about that later.

    So inserting is pretty straightforward. And say we run out of pages, there's no free space in the current pages we have: we just allocate a new page and insert our record in there. No problem.

    29:08

    But now, if we want to update a tuple: go through the same process as before, find its page from its record ID, and go see whether the page exists.

    Then assuming we're going to overwrite

    29:23

    the data (which is not always the case in most systems, but assuming we are), then I'm going to check the size of the new value of the tuple that I'm trying to update. If it's larger than the original value, do I have any space

    29:40

    in my page? Because with the slot array, I can reorganize things to try to free up space within the page.

    And if I can do that and it fits in the same page, great, I'm done.

    But if there's no space in the current

    29:55

    page where the tuple currently exists, then I basically have to treat that as a delete (removing the original tuple) followed by an insert of the new tuple. And that gets expensive, right?

    30:12

    So tuple-oriented storage is really great for reads, because again, it's really easy to go find the thing that I'm looking for. For inserts, not so bad either, right?

    Because I can just fill page after page, over and over again. But updates can actually be quite expensive,

    30:30

    right? So part of the problems we're going to have is that there's going to be fragmentation, right?

    because we can't guarantee that we're going to fully utilize every single page. There's that little space in the middle where the slot array grows one way and the tuple data grows the other, and if they don't line up exactly perfectly, which

    30:47

    they almost never do, then you're going to have a little space in the middle that's just wasted. Then, anytime I want to update a tuple, I've got to fetch that page into memory just to update one tuple inside

    31:02

    of it, right? Depending on the page size: if my pages are really large, then I might be reading 16 kilobytes to go update a five-byte or eight-byte value, something really small.

    So there's potentially a lot of wasted I/O, depending on what my workload is and what my data looks like.

    31:20

    The other problem is that tuples could be stored across multiple pages. That means if I want to update a batch of tuples at the same time, I may have to fetch all those individual pages, bring them into memory, and then update them.

    31:36

    Right? If I had 10 tuples across 10 different pages, I have to bring those 10 pages into memory before I can update anything.

    And then lastly (this is not so much an issue with tuple-oriented storage, but it's a way to think about how to design systems): what if we actually can't do

    31:52

    any in-place updates in our nonvolatile storage, right? I can't bring a page into memory, jump to some offset within it, and then write out new values at that offset, right?

    There's some storage systems that actually don't let you do this. The

    32:08

    Hadoop file system is one, probably one of the earliest ones that did this. Google has their own internal file system called Colossus.

    If you ever use Amazon S3 or other object stores, some of them provide these semantics. So you can't do in-place updates.

    You can only do appends. You can only create new data.

    32:25

    You can't overwrite existing data. If you want to get rid of the old stuff, it's a delete, and then you do more appends.

    So because of these issues, there's a different way to store data,

    32:40

    called log-structured storage. Sometimes these are called log-structured merge trees, or LSMs. And the idea here is that instead of storing tuples in pages and then updating them in place (multi-versioning, which we'll cover later, means what I'm saying is not exactly true for

    32:56

    multi-versioning systems, and what we're talking about today is going to look like multi-versioning, but not in the way we'll talk about later for concurrency control), the high-level idea is the same: instead of being allowed to do in-place updates, the only thing we can do is append new records to data pages.

    33:16

    And because now we're only going to do appends (of course, we still have to handle deletes and removing things), that makes some of the algorithms for reading and writing data more or less efficient, but it's going to make our updates go

    33:31

    really fast. Inserts go really fast as well because you're just appending new records.

    And that's that thing we talked about before, where we want to minimize the amount of random disk I/O in exchange for more sequential I/O. If we only do appends, that's all sequential I/O, and that'll be much faster

    33:48

    for us. So LSMs are an old idea; it goes back to 1996.

    Actually, the guy who invented the LRU-K stuff I talked about last class was also the inventor of the LSM. The paper came out in '96.

    There was

    34:04

    earlier work done on log-structured file systems in the late '80s and early '90s, right? And so this is a successor to that file system work, but applied to databases. So there are essentially going to be two main data structures in a log-structured storage system. There'll be a memtable,

    34:21

    an in-memory data structure where we do all our writes and make all our changes; and in that one, because it's in memory, we can make in-place updates. And then as this memtable gets too big, we're going to write it out to this immutable file called an SSTable. I

    34:39

    think it stands for sorted string table, or sometimes static string, or sorted static table. Whatever; it's the same idea. But it's going to be a sort of compact form of the memtable that's going to have the

    34:54

    changes that were made to the database. So there are only two operations to modify a log-structured system:

    put, where you add a new entry, and delete. Right?

    There's no overwrite, there's no update, there are

    35:11

    no upserts, right? A put is basically the same thing as an insert, but it may not be: a put physically puts a new record in, but logically it might be updating an existing one.

    We'll see what that looks like in a second. So again, now we draw the demarcation line between disk and

    35:28

    memory. Again, we have our memtable up above.

    It's some tree data structure that's going to record changes that we're going to make to the database. I'm not saying what the data structure is.

    Again, it could be a B+ tree. Skip lists are pretty common.

    It could be a trie. It doesn't matter.

    35:44

    So, say now my operation is: I want to put a new value in for key 101. I'm not defining how I'm determining what key 101 is.

    You can think of it the same way we did the row ID in SQLite: like a sequential counter,

    35:59

    where, for every new record you add, the system increments a counter by one to get a new key. So I'm going to apply this put to my memtable and basically add a new entry for it.

    Same

    36:15

    thing: now I want to do a put for key 102 with the new value B1. I'll put that in my memtable as well.

    And now if I want to go back and modify key 101 again, now with a value A2: because the memtable is in memory,

    36:32

    as its name says, I can do an in-place update inside of it. I actually don't need to record the history of the previous changes that were made.

    There's a separate log file we're not going to talk about in today's class; basically, because this is in memory, if you crash, you could lose data.

    But basically,

    36:48

    we are appending the puts to a separate file, and that'll get flushed before we say the transaction is committed. But for now, we can ignore that.

    Just be aware that we have a way to keep the memtable durable.

    37:03

    Same thing if I do another put on key 103, right? I'll update my memtable, and so forth.
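    A bare-bones sketch of the memtable behavior just described (a real one would be a skip list or B+ tree backed by a write-ahead log; a plain dict stands in here): puts overwrite in place, deletes record a tombstone, and a flush emits the surviving entries in sorted key order, which is exactly what becomes an SSTable:

```python
# A toy memtable sketch; names and sizes are illustrative only.

TOMBSTONE = object()          # marker recorded for deletes

class MemTable:
    def __init__(self, limit=4):
        self.entries = {}     # key -> latest value; in-place updates are fine
        self.limit = limit

    def put(self, key, value):
        # In memory we can overwrite: only the newest version survives.
        self.entries[key] = value

    def delete(self, key):
        self.entries[key] = TOMBSTONE

    def full(self):
        return len(self.entries) >= self.limit

    def flush(self):
        # Emit the entries in sorted key order; written out to disk,
        # this sorted run is the SSTable.
        out = sorted(self.entries.items())
        self.entries = {}
        return out

mt = MemTable()
mt.put(101, "a1")
mt.put(102, "b1")
mt.put(101, "a2")             # in-place update; "a1" is simply gone
mt.put(103, "c1")
print(mt.flush())             # [(101, 'a2'), (102, 'b1'), (103, 'c1')]
```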

    So at some point the memtable is going to get full, right? We're going to run out of space; think a couple hundred megabytes or something like that. At which point we're going to

    37:19

    scan the leaf nodes of the memtable (because that's all the entries with the actual values for the different keys that we have) and we're going to store that now in a different form, the SSTable. Again, think of this almost like the write-ahead log.

    Like, here's

    37:35

    all the changes that were made, that this memtable recorded. So again, if I made multiple changes to a key, like I changed key 101 multiple times: because I'm doing in-place updates on the memtable, when I write out the SSTable, I only see whatever the latest version was.

    37:53

    So at this point, once my SSTable has been initialized with the data from the memtable, I'm storing the data from the lowest key to the highest key. So this is in sorted order, based on whatever the key is.

    I'm going to go

    38:09

    ahead and write this out to disk, to nonvolatile storage. Do a flush.

    Make sure it's there. And then I'll just populate a new memtable.

    Just keep doing this over and over again, making new SSTables. And these are now

    38:25

    going to be sorted in timestamp order, from newest to oldest. Right? Within one file, it's sorted from key low to high; across multiple files, I'm going to keep track of them in newest-to-

    38:40

    oldest order; like, you put a timestamp in the file name, right, and the timestamp is always incrementing as things go forward. So I'm going to show one type of compaction, the most common one, but I'll show another one too, and I'll go into more detail about what this compaction scheme looks like. But at some point

    38:56

    this level is going to get full. So then I actually want to combine SSTables that are on disk.

    This is the first level, level zero. I'll combine them into a new SSTable, and then I can blow away the old ones. The idea is that, as you keep going down,

    39:12

    these things get bigger and bigger. Right?

    And the idea is that I'm throwing away things I don't need anymore, and I always try to have whatever the latest version is at each level for a given key. Yes?

    I don't understand why we want multiple

    39:27

    levels; like, why not always merge? The question is, why do we want multiple levels?

    Why not just merge within a level itself? Give me two slides.

    We'll come to that. But one would be better for reads, one would be better for writes.

    39:44

    I'm not showing key ranges here, and that's why it's maybe confusing. Give me a few slides.

    Yes. These are I assume these are operations.

    [Music]

    40:01

    Yes, in two slides, right?

    So his question is (and he's correct): with this compaction, are we throwing out operations that we don't need anymore? Yes.

    So if I do a bunch of puts on a key and then I

    40:17

    delete that key, I don't need to keep the puts. I just need to know that I deleted it.

    Right? So you basically end up with this tiered structure, where you have bigger, longer SSTables at the bottom and progressively smaller ones as you go up.

    40:33

    All right. So that's how we do puts and deletes.

    We need to handle reads now. So if I want to do a read on key 101, the first thing I'm going to do is go check my memtable to see if I have any

    40:49

    entries inside it, because the memtable is going to have the latest version, or the latest changes that were made to a key. So I go check my memtable.

    If it's in there, great. Take whatever the value is and return it.

    I'm also not saying what the value is. I'm kind of being vague here, but it's typically going to be the tuple

    41:05

    itself, right? Just in sort of serialized form.

    But if the key that I want is not in the memtable, then I've now got to go check all these different SSTables, at different levels, and that's going to suck, because I've

    41:20

    got to do a bunch of reads to find, you know, basically a needle in a haystack, right? So the way to deal with this is that you basically maintain what's called a summary table in memory, which provides some additional metadata about what's

    41:36

    actually in these files. Right?

    So I could say, for a given key 101, I could check: does this key even exist at the different levels? So I start at level zero, because again, the levels, depending on my compaction scheme, could be going forward in time;

    they could also be

    41:52

    partitioned across keys. But I want to see which level has my key. And if I know the key exists maybe at this bottom level here, but it also exists at a higher level there, I always want to choose the higher one, because I know

    42:07

    that one's more recent. So again, we're making this trade-off to make our writes go faster in exchange for slower reads, but we can add some additional data structures to make sure the reads aren't as bad.
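    Continuing the memtable sketch from above, here's a rough illustration of this read path. The summary metadata here is just a min/max key range per file; a real system would also keep a Bloom filter, as the next question discusses:

```python
# Builds on the MemTable/TOMBSTONE sketch above; assumed structure,
# not RocksDB's actual file format.

class SSTable:
    def __init__(self, entries):              # entries: sorted (key, value)
        self.entries = dict(entries)
        self.min_key = min(self.entries)      # summary metadata kept in memory
        self.max_key = max(self.entries)

    def might_contain(self, key):
        # Range check only; a Bloom filter would tighten this further
        # (it can false-positive but never false-negative).
        return self.min_key <= key <= self.max_key

def get(key, memtable, sstables_newest_first):
    # 1. The memtable always holds the most recent version of a key.
    if key in memtable.entries:
        val = memtable.entries[key]
        return None if val is TOMBSTONE else val
    # 2. Otherwise walk SSTables from newest to oldest; the first hit
    #    wins, and the summary lets us skip whole files.
    for sst in sstables_newest_first:
        if not sst.might_contain(key):
            continue                           # saved ourselves a disk read
        if key in sst.entries:
            val = sst.entries[key]
            return None if val is TOMBSTONE else val
    return None

older = SSTable([(101, "a1"), (104, "d1")])
newer = SSTable([(101, "a2"), (103, TOMBSTONE)])
assert get(101, MemTable(), [newer, older]) == "a2"   # newest version wins
assert get(103, MemTable(), [newer, older]) is None   # tombstone hides the key
```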

    42:25

    Wait, say again? The question is: is the key filter a Bloom filter? Like, what is the filter?

    Um it could be a range filter to say like

    42:41

    within this range, whether a key exists or not, because that would handle range scans. Or it could be for a single key: does this key exist at this level or not? That's your Bloom filter lookup. We'll talk about Bloom filters in a few weeks; that could give you a false positive.

    It might tell you the key exists at a level, and then

    42:57

    you've got to actually go look and check to see whether it's there or not. Yes?

    So the way I imagine it like let's say I say I want to update

    43:12

    this field for an ID, then I only have the field I'm changing, but there are many other fields. The statement is (again, this is why I was trying to be vague about what the value is):

    43:31

    if I update a record and it's a blind write, I don't need to look to see what the value actually was. I just do a put and I'm done.

    If it's like, uh, update Andy's salary or Andy's age where age equals current age plus

    43:47

    one, I've got to go read whatever the current value was, then I've got to do an update. And then what am I actually putting in the memtable?

    In the case of RocksDB, and I think most systems, it's going to be the full tuple all over again. But my point is, it's just a write.

    I don't care what it was before.

    44:03

    I'm saying, in this case, a blind write. Yes.

    So if I do that and later on I want to read your record. It's not enough to like find your age.

    I need to find everything; your record may appear in multiple SSTables. Yes.

    So his statement is: you're

    44:20

    basically asking, for a given record that has multiple attributes, are those attributes broken across multiple SSTables? No.

    You would always have the full tuple, with all its attributes, in the record that you're storing as part of the value in the memtable or an SSTable. So, in a sense,

    44:36

    internally, a blind write would be like an insert. But to your point, yeah, his point is you'd have to do a read-modify-write: you go fetch the tuple, then update. It depends on what the query is. If it's an update where you're

    44:51

    changing all the attributes, then yeah, you don't need to see what the original values were. You just do an update in place.

    Yeah. Or not in place:

    You can do an update without actually reading it. Other questions?

    45:08

    All right. So again, what are we doing?

    It's a basic key-value storage architecture where, again, the value is going to represent the contents of the tuple. And if we do a delete, we don't actually need to go find the data and actually

    45:24

    delete it, which is kind of nice. Again, in a tuple-oriented architecture on slotted pages, I've got to go fetch the page and then mark it as deleted. In this case, if I have the key of the thing I want to delete, I just put a delete-key record in my mem

    45:42

    table, and then I'm done with it. Then, for anybody else that comes along and wants to read that tuple, we have to make sure that it sees that delete record notifying it that the record no longer logically exists. Even though physically it might still be in a bunch of SSTables, we know we should ignore it if we try to do a

    45:58

    lookup. Yeah.

    Sorry. Yes.

    Yeah. So the question is: when do these merges

    46:13

    actually happen? Next slide.

    Okay, so at some point I'm making all these SSTables and running out of disk. And, you know, if I update a tuple like a thousand times

    46:29

    and then I delete it, I'm going to have a thousand and one records corresponding to that one tuple, but I only really need the latest one. So there'll be this background process that gets triggered based on the size of the SSTables, based on the number of overlapping key

    46:44

    ranges; we'll see that in the next slide.

    There's a bunch of mechanisms to trigger when compaction should kick in. And again, there's this trade-off between doing background maintenance and actually just running queries.

    Remember we talked about the background writer or

    47:00

    the page cleaner in the buffer pool: we could have a thread run through our buffer pool manager, find our dirty pages, and start proactively writing them out so that they get marked clean and can get evicted. But if I'm spending all my time doing that, then I'm going to slow down queries actually trying to do real work.

    So I

    47:16

    could spend all my time doing compaction to clean up all these SSTables, but that's going to make query execution run slower. So there really isn't a one-size-fits-all solution to determine when you should trigger a compaction.

    47:31

    But things like RocksDB have a bunch of different policies. All right.

    So, in the simplest form, say I have two SSTables and I want to merge them.

    And in this case here, I'm assuming the two SSTables are ordered from newest to oldest. So the one all the

    47:47

    way over here is the newest one. The one after that is older.

    So the way I'm going to do the merge is basically the same way you do a sort-merge. If you know that algorithm; we'll talk about it again in two weeks.

    There's a sort-merge join algorithm as well, right? You

    48:02

    basically just have two cursors that start at the beginning of both of these files. Again, I'm showing two SSTables.

    You could have multiple ones; typically you want to merge multiple ones at the same time, right?

    But for our illustration, it's just two. So now what's going to happen is I'm going to look at whatever key is being

    48:17

    referenced wherever my cursor is pointing. In this case here, the first cursor is looking at a delete on key 100.

    This guy has a put on key 101. Since 100 is less than 101,

    I know that there isn't going

    48:33

    to be a reference to key 100 in this SSTable. So, therefore, I can just take whatever that record is and put it in my new SSTable.

    And then I move that cursor down, but I keep the other cursor at the same location it was

    48:49

    pointing at before. So now at this point we have a put on key 101 in the first one, and then a put on key 101 as well in the second one.

    But since this

    49:04

    SSTable is newer than that one, we know we don't care about whatever that put was. We only care about the latest one.

    So we go ahead and ignore that put, and put the first SSTable's entry into the new one. Then, same thing: both of the cursors move down.

    We have a

    49:20

    put on key 102, another put on key 102, in both of them.

    I don't care about the second one. I only keep the first one.

    Keep moving down. Now I have a put on key 103.

    But I also have a delete on key 103. Same thing.

    They're doing some modification to the same key, but I

    49:36

    only care about the latest one. So I go ahead and ignore the second one and keep the first one.

    The first cursor is done. So now I keep scanning along, and whatever comes after that, I know I want to copy into my new SSTable, right?

    So I'd put the put for 104 there.
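    Here's a sketch of that two-cursor sort-merge (toy tuple format, not RocksDB's file layout), using the same entries as the walkthrough. When both tables reference the same key, the newer table's entry wins and the older one is dropped:

```python
# Merge two sorted SSTables, keeping only the newest entry per key.
# Each entry is (key, op, value); 'newer' takes precedence on ties.

def merge_sstables(newer, older):
    out, i, j = [], 0, 0
    while i < len(newer) and j < len(older):
        if newer[i][0] < older[j][0]:
            out.append(newer[i]); i += 1       # key only in the newer run
        elif older[j][0] < newer[i][0]:
            out.append(older[j]); j += 1       # key only in the older run
        else:
            # Same key in both: the newer table wins, the old entry dies.
            out.append(newer[i]); i += 1; j += 1
    out.extend(newer[i:])                      # copy whatever is left over
    out.extend(older[j:])
    return out

newer = [(100, "DEL", None), (101, "PUT", "a3"),
         (102, "PUT", "b2"), (103, "PUT", "c2")]
older = [(101, "PUT", "a2"), (102, "PUT", "b1"),
         (103, "DEL", None), (104, "PUT", "d1")]
merged = merge_sstables(newer, older)
# Keys 101-103 keep only the newer versions; the delete on 100 and the
# put on 104 survive untouched.
assert [k for k, _, _ in merged] == [100, 101, 102, 103, 104]
```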

    49:55

    That's basically what compaction is doing. Now, the tricky thing is going to be how I organize multiple SSTables and how I decide which SSTables to merge; and, as we said, when it gets triggered is based on a bunch of different policies.

    50:12

    So what I showed before was an example called leveled compaction, where I'm going to maintain different levels of SSTables of different sizes. And the goal here is actually to have non-overlapping key ranges within our levels, except for level zero, because

    50:29

    that's the special one where, as memtables get full, we just append the resulting SSTables to disk.

    So in that case they will have overlapping key ranges, but everything below that will not. So again, assuming that the SSTables go from newest to

    50:45

    oldest: our first SSTable that gets written out is going to have a key range from A to R, right?

    Arbitrary letters. The next one is going to have a key range from E to T.

    And then the third one here will have a key range from B to Q. So at some point I'm going to run out of

    51:00

    space, or something gets triggered that says: I want to do compaction on level zero now. So I'm going to make a new level, level one.

    And I'm going to take these three SSTables and combine them into two larger SSTables. And I'll do

    51:16

    the compaction I just showed on the previous slide, where I'll have a cursor go through each one and remove whatever the older entries are for a given key; it'll only keep the latest one. But then, in my new SSTables, I'm

    51:32

going to generate, I now have non-overlapping key ranges within this level. So if I'm looking for key X, for example, and I have to look at level zero, I potentially have to look at all the SSTables to find key

    51:49

X. Well, actually X wouldn't be in any of them, but say I'm looking for key Q.

I would actually have to look at all three SSTables to find that key, because it could fit in each of their ranges. But now when I

    52:04

do the compaction down to level one, since they now have non-overlapping key ranges, if I need key Q, I know I only have to go look in the second SSTable.

It may not exist there. And again, I can use a filter to tell me whether it's going to exist or not.

But if it could exist, then

    52:22

I know I only need to look at one SSTable. Yes?
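On the filter mentioned above: in practice it's commonly a Bloom filter (RocksDB keeps one per SSTable). Here's a toy sketch with a fixed 1024-bit array and two cheap hash probes, both arbitrary choices for illustration. A false answer means the key is definitely not in the SSTable, so we can skip reading it; a true answer means maybe.

```cpp
#include <bitset>
#include <functional>
#include <string>

// Toy Bloom filter: sizes and hash choices are made up for illustration.
struct BloomFilter {
    std::bitset<1024> bits;

    void add(const std::string& key) {
        bits.set(std::hash<std::string>{}(key) % 1024);
        bits.set(std::hash<std::string>{}(key + "#") % 1024);  // crude second hash
    }

    // false => key definitely absent (safe to skip this SSTable).
    // true  => key *may* be present (still have to read the SSTable).
    bool mayContain(const std::string& key) const {
        return bits.test(std::hash<std::string>{}(key) % 1024) &&
               bits.test(std::hash<std::string>{}(key + "#") % 1024);
    }
};
```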

Can you repeat that? Okay, the question is why do we end up with multiple SSTables in the next level.

    It could be just

    52:38

one giant one. Basically, the size of the SSTables is growing.

Sorry, the SSTable size grows per level. And so the cumulative size of the SSTables up above that I'm merging down may be larger than what I can fit into a single

    52:54

SSTable. So I drew two, but it could just be one.

I just wanted to show that you would have these non-overlapping ranges. So once this is done, I can blow away all the SSTables at level zero

    53:09

and keep doing the same thing, just keep adding more and more. Now I have different ranges again, and at some point I want to do a merge. Again, these are going to be overlapping ranges in level zero, but when I create SSTables in

    53:25

level one, they have to be non-overlapping. So I may actually end up recreating all the files: reading them all into memory using my buffer manager and then writing them all back out,

because the keys that are in level zero might have to go

    53:42

in any of the files. I may have to generate a bunch more files.

So if I have, say, 10 gigabytes of data on level one and I run compaction, I'm going to have to read 10 gigabytes in and I might be writing out another 10 gigabytes.
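For the read path just described, here's a hedged sketch; the in-memory SSTable stand-in and the lookup function are hypothetical, but the control flow mirrors the rule from the slides: level-0 runs can overlap, so check them newest to oldest, while at lower levels at most one run's range can contain the key.

```cpp
#include <map>
#include <optional>
#include <string>
#include <vector>

// Hypothetical in-memory stand-in for an on-disk SSTable, for illustration.
struct SSTable {
    int minKey, maxKey;                  // key range of this run
    std::map<int, std::string> entries;  // sorted, like the on-disk format

    std::optional<std::string> get(int key) const {
        auto it = entries.find(key);
        if (it == entries.end()) return std::nullopt;
        return it->second;
    }
};

// levels[0] may contain overlapping runs, ordered newest to oldest;
// levels[1..] have non-overlapping key ranges within each level.
std::optional<std::string> lookup(const std::vector<std::vector<SSTable>>& levels,
                                  int key) {
    for (size_t lvl = 0; lvl < levels.size(); ++lvl) {
        for (const SSTable& t : levels[lvl]) {
            if (key < t.minKey || key > t.maxKey) continue;  // range/filter check
            if (auto v = t.get(key)) return v;  // first hit is the newest version
            if (lvl > 0) break;  // non-overlapping: no other run here can match
        }
    }
    return std::nullopt;  // not found (tombstone handling elided)
}
```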

    54:00

Yes, a question about the number of levels.

The question is whether the number of levels is fixed. No.

They can grow forever. In practice, though, I forget what the default is in RocksDB, you just end up with really big files at the bottom.

    54:22

The question is, for any set of what? Sorry.

    for like

    54:38

The question is whether you would have a setting for a database system that says I'm going to have 10 levels in my compaction scheme, and then you always have 10 levels. No. At the very beginning I could set this

    54:54

thing to have a maximum of 10 levels, but right here at this point you only have two; it grows. Yeah.

So you never exceed that maximum. You just end up with really large SSTables,

a lot of them, at the lowest level. So why is this good for reads?

The question is why

    55:11

is this better for read-heavy workloads? Because, as we'll see with universal compaction on the next slide, if I want to go find a key at a lower level, I only have to look at the SSTables whose range my key fits in.

If you go back

    55:27

to when we only had level zero, and this is fresh from memory onto disk, these are overlapping ranges.

    So if I want key Q, it can exist in the first one, the second one or the third one. So I have to check all of them.

Now, in this case,

    55:43

since they're ordered based on time, if I see it in the first one, I know I don't care about the other two.

    But in the worst case, I have to scan all three. Yep.

Other questions?

    56:06

    The question is, can the same key exist in level zero and level one? Yes.

The question is, because at level

    56:24

zero it's sorted from newest to oldest, does that mean it doesn't impact the search speed? Well, no. If I'm only at level zero and I want key Q, if it's not in

    56:41

the first one, I've got to check the second one, and that's slow. If it's not in the second one, I've got to check the third one, and that's slow, right? So worst-case scenario, I have to check everything at level zero. The summary table helps me maybe skip looking up some

    56:58

SSTables that aren't going to have the data I want. But in the worst case, I may have to check them all.

So yeah, the reads will get slower. But that's why they have the additional levels, because now when I start compacting down the levels, I can

    57:13

guarantee that, again for key Q, it's only going to exist in one SSTable at level one. Likewise, if I keep adding more levels, same thing.

Q can only exist in one SSTable per level. Whereas if it

    57:29

was just me writing SSTables straight from memory like I do in level zero, it could be in any of them. And again, I do the same thing: I do compaction down to the next level, I can blow away things above, and now I again have non-overlapping

    57:45

ranges within the level as it exists now. But again, as I add more things at level zero, there may be overlapping ranges there.

    58:02

The question is similar to the one asked before: what triggers compaction? It depends on the system and how you set it up. It could be that if I have a max size for all the files within a level, when that exceeds

    58:17

some threshold, that triggers compaction. Say it again?

In this particular example, the question is why did I merge one to

    58:32

level two? All right, the question is why, in this case,

did I merge only two SSTables from level one down to level two? It's just the example, right?

    What's that?

    58:48

One. Yeah, you would merge more than two, right?

I'm trying to make this work in PowerPoint, so it's simplified. You can be very aggressive and try to merge everything, but then you're just spending your time doing compaction, not

    59:04

running queries. All right, so let me show universal compaction real quickly.

Universal compaction only has one level, and again, it's the SSTables as they exist, as they're being written out from the memtable in

    59:19

memory to disk. So we want to do compaction on them, but we don't want to maintain multiple levels, because then we'd have this write-amplification problem where every single time we want to do compaction we've got to read a bunch of things in and then write it all back out.

So instead we're going to do more targeted

    59:36

compactions. So in this case, say I take these three SSTables.

Again, they may have overlapping key ranges, but they're ordered through time. And now I'm going to take these three and merge them into a single SSTable, compact them, and reclaim the space.

    And I can do this

    59:53

again for the next two here: compact them and put them into a larger one.
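A minimal sketch of that single-level scheme, assuming a toy Run type and a made-up run-count trigger (RocksDB's universal compaction actually uses size-ratio heuristics): keep the runs ordered newest-first and fold the oldest adjacent runs together, letting newer versions win.

```cpp
#include <map>
#include <string>
#include <vector>

// Toy run: a sorted map from key to value (tombstones elided).
using Run = std::map<int, std::string>;

// One level, runs[0] newest and runs.back() oldest. When there are too
// many runs, merge the two oldest adjacent ones into a single run.
void maybeCompact(std::vector<Run>& runs, size_t maxRuns = 4) {
    while (runs.size() > maxRuns) {
        Run& newer  = runs[runs.size() - 2];
        Run& oldest = runs.back();
        // map::insert keeps existing (newer) entries, so older versions lose.
        newer.insert(oldest.begin(), oldest.end());
        runs.pop_back();  // the two runs are now one larger run
    }
}
```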

So this is great for workloads where you're inserting data very quickly and doing a lot of updates. It's great for workloads where I

    00:09

typically only want the latest data that I just inserted. Like if I'm inserting time-series information, like the temperature of this room, I only care about the last five or ten minutes.

    I don't care about what the temperature was, you know, a year ago. And so in that case, I'll be able to look up very quickly and find the data that I want in

    00:25

the newest SSTables that are on disk, right? It's the classic trade-off in computer science: I'm spending more time and energy when I write things, making the

    00:40

writes more expensive, but my reads will be faster, as opposed to making my writes go really fast but my reads a little bit slower. Right?

So this is making the writes go

    00:56

faster, but the reads could potentially be slower depending on what you're looking up. Okay.

So to finish up: log-structured storage has become super common in the last 15 years, and partly this is because of RocksDB. RocksDB is a log-

    01:12

structured merge-tree key-value store that actually supports both of these compaction schemes. You can set whichever you want.

I think the default is leveled compaction. But it has been used as the starting point for, or the built-in storage manager of, many, many new systems.

    01:29

RocksDB was created by Facebook, or Meta, whatever it is now. But they weren't the original authors of it.

It's actually a fork of something called LevelDB that came out of Google. This was built by Jeff Dean and the people working on Bigtable.

They built this log-structured,

    01:47

embedded key-value store called LevelDB. Facebook forked that and made RocksDB.

What's the first thing Facebook did when they forked LevelDB into RocksDB? What was that?

No, not the copyright. No, no, I think it was an Apache license maybe.

    So, it wasn't an issue. What non-legal thing?

    What's a

    02:04

technical thing? What's the first technical thing they did? Change.

No, that's a legal thing. He said change the license.

No, technical. They changed the code in what way?

Okay, what did they do? Removed mmap, right?

The very first thing they did was get rid of mmap, because it's

    02:20

garbage, and switched to a buffer pool manager the same way we talked about last class, right? And like I said, a bunch of different systems start off using RocksDB as their internal storage manager.

Cockroach did this, and then over time they rewrote it. This is again a small smattering of log-structured storage systems.

    02:36

There are a lot more. And as we said, the challenge of this approach is that it's going to be really great for doing writes, but the compaction is expensive.

And we may have this write-amplification issue where we do one

    02:52

update to a single tuple, and even though we never update it ever again, we're going to read it in, write it out, read it in, write it out, over and over again multiple times because of the compaction process. Whereas in a tuple-oriented architecture, if I write something to a page, it gets

    03:10

written out to disk. I never read it back in again,

and I never write it out over and over again. Yes?

The question is, does this mean all these systems are bad for

    03:25

read-heavy workloads? No.

Things like ClickHouse are really, really good for that, because there's a bunch of extra stuff you can do to mitigate the issues.

But that'll be next class.

    03:43

Okay. So next class we're actually going to break your preconceived notion of what a database looks like.

For now we've been talking about rows that are stored contiguously on disk or in memory. We're going to flip that and talk about column stores,

where we actually store data as columns, not rows. So

    04:00

that'll be next class. For project one, I'm going to quickly go over this before we jump to the speaker.

So this has been posted on Piazza and the code is on GitHub, right? You're going to build your own buffer pool manager in the BusTub

    04:15

system. So you're going to build a replacement policy algorithm based on ARC.

Again, we'll have a recitation next week that goes over the algorithm in more detail, but it's pretty straightforward. Then you build a disk scheduler and then a buffer pool instance, and it'll be due on September 29th.

So the replacement policy is

    04:33

just the algorithm itself that actually decides what to evict when you need a page. You're allowed to use the built-in STL containers for the ghost lists and the most-frequently-used and most-recently-used lists.

So you don't need to build your own separate vectors or lists or things like that

    04:50

inside C++; just use the STL ones, because it's all in memory, right? The buffer pool is in memory; we don't care if we lose what's in these lists if we crash. So it's okay to use the built-in data

    05:07

structures. For the disk scheduler, you're basically going to support asynchronous I/O through std::promise callbacks, and the basic idea is that other parts of the system are going to ask the disk, hey, I want this page.

    You're going to

    05:23

build the piece of the system that's responsible for scheduling the reads, fetching those pages into memory, and then handing them off to the buffer manager, right? Then you put this all together and actually build the frame storage, the buffer manager, and the page table that takes the requests for pages from other parts

    05:39

of the system, goes to the disk scheduler, tries to find the pages you want, brings them into memory, and, if you don't have any space, runs your ARC replacement algorithm to decide what to evict. Okay.
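As a rough illustration of the std::promise pattern, here's a hedged sketch; DiskRequest, DiskScheduler, and the thread-per-request shortcut are invented for this example and are not BusTub's actual API.

```cpp
#include <cstring>
#include <future>
#include <thread>
#include <vector>

// Illustrative only: names and layout are made up, not BusTub's actual API.
struct DiskRequest {
    int page_id;
    char* data;               // destination buffer for the read
    std::promise<bool> done;  // fulfilled when the I/O completes
};

class DiskScheduler {
public:
    // Caller keeps the future and blocks (or polls) until the read finishes.
    std::future<bool> schedule(int page_id, char* data) {
        DiskRequest req{page_id, data, {}};
        std::future<bool> fut = req.done.get_future();
        // A real scheduler would enqueue this for a background worker;
        // here we spawn a thread per request just to keep the sketch short.
        std::thread([r = std::move(req)]() mutable {
            std::memset(r.data, 0, 4096);  // stand-in for the actual disk read
            r.done.set_value(true);        // signal completion to the caller
        }).detach();
        return fut;
    }
};

int main() {
    DiskScheduler sched;
    std::vector<char> page(4096);
    auto fut = sched.schedule(/*page_id=*/7, page.data());
    fut.wait();  // the buffer manager would block here until the page arrives
}
```

The point is just that the requester gets a future it can wait on while the I/O happens elsewhere; the real project queues requests for a background worker instead of spawning threads.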

So, super important: don't change any file other than the ones we tell you to change, because when it runs on Grade

    05:56

scope, it wipes away whatever you give us for anything other than those files, and if you made any changes elsewhere, it won't run. And the projects are all cumulative.

So project one is the start of this, but two, three, and four will be based on your buffer pool implementation

    06:12

from this project. So we have a bunch of tests to make sure you don't screw things up.

But you don't want to fall behind, because if you're trying to fix your buffer manager two months from now, it's going to make all the other projects a lot harder. If you have questions, please post on Piazza or come to office hours, but we won't

    06:28

    sit there with GDB and tell you how to debug things, right? We assume you know how to do that in C++ because of project zero.

Just like project zero, we run a bunch of checks to make sure that your code looks good. And then for extra credit this year, we will have a leaderboard.

So we have a version of Bus

    06:43

Tub that runs on Gradescope, and whoever has the fastest buffer pool implementation, actually for all the projects going forward, will get a ranked order on the leaderboard, and the top 20 get extra credit. And then this year we're actually going to try to get a trophy made, like the Stanley Cup, well, not

    06:59

that big, but whoever has the highest leaderboard score at the end of the semester, we'll put your name and year at the bottom of it, and that way over the years we can keep this thing growing. Okay.

Don't plagiarize, because you'll get messed up. And then ask questions.

    All

    07:16

right. So let me jump to the speaker.

Oh. All right. This is Joseph.

He's been at Single Store for a while. You're coming next week as well, right?

    I'm gonna be here next week. Yeah.

    Right. He's awesome.

    He's super smart.

    07:32

He's coming next week. The floor is yours, man.

    Go for it. All right.

    Cool. So, yeah.

I'm going to tell you guys a little bit about Single Store. Single Store is kind of a unique database in the industry.

I think it's probably one of the most unique databases in the industry. And

    07:49

so let's just dive right in. This is going to be kind of a rapid-fire talk.

I'm going to talk a little bit about the two types of database workloads and how Single Store presents itself as a sort of hybrid option. Then I'm going to switch gears completely and talk about modern cloud systems and the kind of

    08:04

benefits Single Store gets from running in the cloud. So let's get started.

So, the two types of database workloads. You've maybe seen this in class so far, or you will later in the semester, but there are basically two

    08:20

broad categories of database workloads. The first is OLTP, online transaction processing.

It's characterized by a large number of transactions, super high concurrency, small reads and writes, stuff like that. Any row, you

    08:37

need to be able to fetch it at any given moment, and you might need to be able to update it at any given moment. So everything is optimized to be able to find stuff, index stuff, do transactions, change things around.

    And so you typically will have very high

    08:54

concurrency, lots of small queries hitting the database all at once, very short SLAs. You expect things to return in milliseconds, and you really care about ACID.

    You really care about your transactions being atomic and not

    09:11

seeing any weird isolation side effects, and you don't want your writes to be lost. So you care about these kinds of things with that kind of workload.

So you can think about running applications. You can think

    09:28

about running literal transactions at a bank, or, if I click on something on a website, you pay me and I'm going to send you this thing that you just bought.

    09:44

And so typically these kinds of databases will use row-oriented data structures, right, because every row has to be findable and updatable very fast. So they'll typically use B-trees and skip lists, which allow you to find one row

    10:02

very fast. And you've probably learned about that in this class.

So, examples of this: Postgres is basically synonymous with OLTP these days. Older databases Oracle and SQL Server are of course there, plus Cockroach and Spanner, these

    10:20

being the kind of modern large-scale transactional databases. The other broad category is analytical databases, OLAP databases.

These are characterized

    10:38

by large bulk processing queries, and sometimes the queries will take minutes, hours, or even days to run. Maybe these days that's not considered acceptable, but these large reporting queries can take a long time, and you

    10:53

will often run on huge scales of data. Terabytes or petabytes are not uncommon.

So you should think large reporting queries, analytics, dashboards, and big bulk loads that you do

    11:08

    periodically. You're not updating one row.

And so the data structures used to build these kinds of things are very different. You use these column-oriented data structures, and you're going to learn about that in this class,

    I assume

    11:24

    I hope. I love column stores.

I think they're so cool, and they're one of these data structures that just makes me so happy. So it's going to be fun to learn about those.

But yeah, so they use these column-oriented data structures

    11:40

that are really, really good at scanning really fast. Examples of this kind of database are Snowflake, Clickhouse, Redshift, BigQuery, Vertica.

There are different flavors within this category, but

    11:57

they're all pretty much distributed column-store systems, and they're good at big-data analytics. So the question is, why is it hard to do both?

    And like why are there these two different data structures and I can't just have one data structure

    12:13

that does both? You could talk about how row stores cause write amplification and do a bunch of random I/O, so their scan speed is limited, and that's kind of true, but really the reason is that column stores are really, really,

    12:30

really fast. When column stores do scans, they just blow through your data, and it's really impressive and a really cool technology.

But with column stores,

    12:45

the updates are generally not fast. So if you want to scan huge amounts of data, you're kind of forced into this column-store world, and your updates will suck because of it.

The questions of what the other data structures do are almost irrelevant;

    13:00

the column store is just so good. So there's a fundamental fact of the universe: no column store will ever be as fast as a row store on point queries. And so if you have a database, it's either a column store for analytics or a row store for

    13:17

transactional workloads, and typically it's not both. But that's what Single Store is. It's a hybrid row-store column-store thingy.
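As a rough illustration of why scans favor columns (my own sketch, not a benchmark from the talk): summing one attribute in a row layout drags whole records through the cache, while the column layout scans one dense, compressible array.

```cpp
#include <cstdint>
#include <vector>

// Row layout: each record's fields sit together.
struct Row { int64_t id; int64_t price; int64_t qty; };

// Summing one attribute over a row store strides across whole records...
int64_t sumPriceRows(const std::vector<Row>& rows) {
    int64_t sum = 0;
    for (const Row& r : rows) sum += r.price;  // drags id/qty bytes along too
    return sum;
}

// ...while a column store scans one dense array of just that attribute.
int64_t sumPriceColumn(const std::vector<int64_t>& priceColumn) {
    int64_t sum = 0;
    for (int64_t p : priceColumn) sum += p;    // sequential and vectorizable
    return sum;
}
```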

So what is Single Store? What is Single Store as a hybrid option?

Why do we exist? Well, we're not really a hybrid

    13:35

option. It would not be fair to say that we're some kind of franken-thing.

We're a column store with tricks, is what we are. We're a column store with a row store bolted onto it that is kind of considered the

    13:51

first segment of that column store, and it's used to store the hot rows. And then we optimize our column stores in a way that doesn't negatively affect the scan performance but makes it so that

    14:08

they're decent, not as good as a row store, but decent, at transactional workloads. And the way we do that is, first of all, you change the column-store encodings so that they're seekable, which by itself doesn't give you much, but you

    14:23

add these secondary hash indexes that tell you the offsets, so it's like, oh, I know all the offsets, I can seek in real fast. And you kind of use that row store to manage the locking, virtual locks, for the stuff in the column store. And I don't have

    14:39

a ton of time; I could give an hour-long presentation on each of these things. But somehow it's a column store that doesn't fall

    And that is very unique in the industry. Um, and yeah, so

    14:56

column stores are not generally good at concurrent writes, but this one is. And for those things that really need transactional workloads, we do offer dedicated in-memory row-store tables as well.

We used to be called MemSQL. We used to be mostly focused on

    15:11

that in-memory row-store thing; we changed our name because we're not really mostly focused on that anymore. But we do still have those.

What does this buy us? What kind of workloads can we do that our competitors can't?

    15:28

Well, analytical databases love these analytical benchmarks: TPC-H, TPC-DS, there's a ton of them.

These analytical benchmarks kind of have the flavor of, they'll give you a hundred queries or whatever, and Snowflake's better on one,

    15:44

BigQuery is better on a different one, ClickHouse is better on another one, Single Store is better on one. It's just a bag of tricks.

It depends on: does your optimizer have this thing? Does your query execution have this thing?

Does your column-store execution have this thing? And so Single Store

    16:02

is competitive with the best data warehouses on these analytical benchmarks. OLTP databases do not finish these.

If you're not a column store, you don't stand a chance. Don't even bother.

So Single Store can do these things, and we're just as good as anyone else.

    16:17

But we can also do the transactional benchmarks. We can do TPC-C, and we can do other sorts of point workloads, and Snowflake just doesn't work there.

It just doesn't work. So

    16:35

it's not about being better than anyone else. It's that we're the only one that can actually finish all of these, and that's very cool.

So we're the only system that I'm aware of that can finish all three, at least the only commercially available one.

I think there are

    16:51

some academic ones that can do it as well. And that's pretty cool.

So, are we a hybrid database? Maybe our marketing will tell you we're HTAP or something, but I think of us as an analytical database that doesn't fall

    17:07

over under high concurrency or point workloads. We can do a wide variety of mixed workloads: dashboards, logistics, manufacturing, streaming, AI, that kind of stuff.

It's mostly analytical, with enough transaction processing that the other systems can't keep

    17:23

    up. Cool.

So I'm flying through this. I've got a couple more slides on a completely different topic, modern cloud systems, but I'm also over time already, so maybe I should

    17:39

stop for questions and just let that be the presentation. Yes?

Any questions for Joe? Yes?

The question is, would using worse encodings mean higher storage requirements?

    17:55

Yes.

Yeah, he's asking about encodings. Let's hold that question for next class.

Joe's setup is basically the intro for the next

    18:12

lecture. This is fantastic.

So a bunch of stuff may not make sense now. It'll make sense on Monday, and plus he's coming to campus, so you can talk to him again about it.

    Okay. All right, guys.

    Let's thank Joe

    18:28

And like I said, DJ Cash's radio show is this Sunday at 1 o'clock. And then see everyone at the visit day on Monday.

    Hit it. [Music]
