🚀 Add to Chrome – It’s Free - YouTube Summarizer
Category: Database Systems
Tags: DatabasesManagementMemoryOSPerformance
Entities: ARCBuffer PoolClockDirect IOFsync GateLinuxLRULRUKMongoDBMySQLPage TablePostgresProject OneProject ZeroWire Tiger
00:00
[Music] I'm still associ. [Music]
00:26
Awesome. Again, I apologize for being out of town last week.
I was in London. Uh if you watched the class video, it was bizarre.
Uh your friend, so cash is his old friend. Fat Fat Face Rick that owes you money.
I was supposed to get your money.
00:42
Uh >> did you not >> I did not get your money. Uh like he like whatever took some drugs and then walked walking around like the bad parts of London and like got beat up by 12-year-olds.
So we had >> 12-y old kids. It was super bizarre.
>> He was up. What am I supposed to
00:57
do? Anyway, sorry.
Um, yeah. So, I apologize for not being around.
Um, that was a weird one. That's a new one for me.
Okay. Um, so let's let's open this.
So, the again, project zero and homework one were due last night. Uh, everyone should have completed project
01:12
zero. We're going to go over the grades um to later today with the TAs.
And again, you have to get 100% on this by last night in order to be allowed to take the class. Um, project one we hope to release later today and that'll be due at the end of the month.
And that's going to be related to what uh today's
01:28
class will be about. Okay, the source code is not pushed yet.
Uh so everyone will have to sort of do uh a get pull and merge in the latest version of from the main branch into uh your local code. Okay.
The other thing coming up uh next week on Monday and Tuesday is that we're
01:44
having our visit day for all our friends in the database industry. So there's two days, Monday, Tuesday.
Obviously the class is on Monday, but we'll we'll duck out and teach that and then we can always go back. Um so the first day we sort of research talks and some intro talks from the companies and then on the
02:00
Tuesday in the morning that'll be these info sessions where each company will present for about an hour talk about what their system is what they're building uh and they talk a little about internships and full-time positions as well. So that's why I posted on piaza over the weekend or Friday whatever that was.
Um so if you want to go uh come to
02:17
these sessions since we're having um since you know we only have so many time slots there will be some overlap between two companies giving the same talk or giving different talks at the same time different rooms. So just put down what your preference is and then we just run whatever stable marriage algorithm to
02:33
generate chat GBT to figure out the best schedule for everyone. And then if you want to uh as I posted else as well if you want to get a database job either internship or full-time position uh go to that PA post there and then there's a spreadsheet just add yourself what you're looking for when you're available and then we send that out to uh all our
02:50
database friends including the companies that are coming next week. So we want to get this out to them before they show up.
And then we have other database friends as well. while we send it to everyone else.
And again, this doesn't go to recruiters except for some rare cases. This goes to people that like co-founders or like VP engineerings, the hiring managers for the companies.
So,
03:05
it's not like if you go to Microsoft's website, you know, you fill it for an internship. That's just going in the pool with everyone else.
We try to skip all that and go directly to the people that care about databases. They know what this course is.
>> Yes. >> The statement is we haven't learned that much databases yet.
Don't say that.
03:21
Actually, here's the weird I've ever heard. Uh, so I know that a bunch of database companies use this class for onboarding new employees.
I had somebody tell me last week in London that they were not a database person. A company wanted to interview them for a database job.
And he's like, he told him straight up, I don't know anything about
03:37
databases. They sent him this class to prep for the job interview.
I didn't know you could do that. Uh, and he got the job.
>> This Yes. Um yeah, for the advanced class, one of
03:54
our projects used to be B like uh another Davis company basically took it and that was like the um like the what the hacker challenge for like to hire you. They would like make you implement something that we were having you do in the class.
So if you do the class, this is 721. You can show up and get hired
04:10
easily. Anyway, uh so any questions about this post on piaza and then uh please join us next week.
Okay. All right.
So last class again it was awkward because I was had to teach it in in in the emergency room. Um but we spent time talking about what databases
04:26
look like at the lowest level on disk. Again at the end of the day a database system just a files on disk.
There's nothing really special about them at least from the OS's perspective right just as if you open up VM or Emacs whatever your favorite editor is and created the file start writing data into it. That's basically what the database
04:41
system is doing. Um, and of course how we layered the the the different concepts on top of each other that sort of makes it so we can be do more sophisticated things uh as we as we get further along in the semester.
So today's class is now to jump ahead and
04:57
say okay well I have a bunch of files on disk uh I in order for me to do anything with those files I got to bring them into memory right the classic vonomian architecture I got to bring the memory and then I I can manipulate them serve queries do whatever I want um and so today's class is really about the the management or the movement of data from
05:14
disk into memory and then back to disk as again we'll talk a little bit about how we handle dirty rights later in the semester sorry in in this class we'll go more detail know how we make sure we do things safely to write back to disk later after the midterm. But for now, we're just trying to say, can we copy
05:30
things in and then how do we serve them up, right? And then next week, today's Monday.
So, starting Wednesday, we actually go back back down. So, back to problem number one and talk about alternative approaches to storing data as files on disks.
But the the buffer pool stuff, the memory management stuff
05:45
we're talking about today will still be relevant for uh those other other uh design schemes. All right.
So in in databases we care about or data systems we care about basically two ideas or two there's two two concepts we need to be mindful of
06:01
when we decide how we want to organize things on disk how we want to bring things into memory what algorithms we may want to use to process data or write to data write data to disk and so forth. So the first idea is the notion of spatial control and that's the idea of keeping track of being mindful about
06:17
where we're going to write our data on disk so that when we have to read things again uh we can do so we can try to maximize the amount of sequential access that we have and I said last class that random IO is much more expensive than sequential IO it's a little less of an
06:32
issue in modern SSDs but if you have any kind of rotating device like a spinning disc hard drive that is you know there's a huge difference in the the access times for sequential IO versus random IO, right? So that means that when we start having to write data at the disk,
06:47
we want to put things that are going to be used with each other very often close to each other. The other idea is uh this notion of temporal control and it's the idea here is that when we bring something into memory from disk since that's a very
07:03
expensive operation the thing we're trying to minimize or avoid as much as possible when we bring something into memory we want to do as much work as we can on that data that we brought into memory before we give it up before we release the memory or write some some
07:18
changes back to the disk right because the worst thing to do is like if I got to read three blocks uh or say I read two blocks and I can read the first one twice and the second one once. I don't want to read the first one, do some work on it, throw it away, read the second one, do some work on it, throw it away, then go back and get the first one again, right?
I want to
07:35
maximize the amount of work I can do for every single page I'm bringing in from uh from disk. And we'll see alternative storage models next week on Monday, next week, how you can go to town on this and avoid reading things you don't even need for queries.
For today's class, we're we're going to mostly ignore that.
07:52
So going back to that architecture I showed last class, right? At the lowest level, you have the disk and uh again this is the the nonvolatile storage.
This is considered the permanent uh location of the database, the the resting point of it. And then we have some kind of database file that's going
08:07
to be broken up into pages. And then we said that at the beginning of of the the database file or at some location that's special, we we know where to find these things.
We have a page directory. This is just a sort of a a database of the database pages that I have.
So keeping
08:23
track of like I want this page for this data or sorry for this database or this table for some file. Here's where to go find the offset where I need it.
In this diagram here I'm showing the database is comprised of one file. That's what SQLite does.
Duct DB does it could be multiple files across multiple
08:39
directories or even across multiple uh machines. But for now we don't care about that.
And then above up above the disk we have the volatile storage. We said this is just memory.
And in in this now we're going to have we're going to talk about today is the buffer pool. And this is
08:55
going to be some allocate somebody some region of memory that's been allocated that's in the address space of our database system that we can use to put pages that we fetch from disk into memory so that we can hand them off to other parts of the system. So in the in memory we have our buffer pool.
We're gonna say that the
09:12
placeholders, the locations where we can store pages from disk into memory, we're gonna call these frames, right? And that's just going to say that's going to distant by buffer and we have some offset said that's where a frame starts and we can put a page directly in there because all the pages
09:28
within a single file are going to be the same size, right? These frames can be reused and we know how to easily jump to different offsets.
They're called frames because we're running out of words. We already have pages.
Sometimes it's called blocks, right? We don't call them a buffer because the whole thing is called a buffer or cache, right?
So
09:45
we're we're calling them frames because that's all we got. So then we have an execution engine and again so this thing is where actually execute queries and as it's running somehow it says I want to read some some data and it's in page two.
Doesn't know where page two is but it knows to talk to the buffer pool
10:01
manager and exposes an API that allows you to go get page two. So in order to figure out what page two is, we got to bring the the page directory in.
I'm showing you this as a sort of separate step. In actuality, when the system boots up, you usually load that in the very first thing you do because we need to know what's in there.
But for now,
10:18
it's fine. We're just showing it as a separate step.
Then now we consult the page directory and then that tells us where we want to find page two. We find a free frame in our buffer pool where we just copy that page from disk into memory.
And then now we hand back the execution engine a pointer to to that
10:34
page in that frame. And the guarantee that that the bufferable manager is going to provide for the rest of of our system is that if we go tell this this execution engine here's the pointer to the page that you wanted that pointer will be valid until
10:49
that execution engine or whatever it asks for that page comes back and says I'm done with it. We'll talk about how we do that in a second.
Now the cool thing about the buffer manager is that we can actually reorder and change the location of a page uh anyway that we want if we have to
11:05
bring it in multiple times. So let's say now we we run some kind of eviction policy.
We decide that we want to throw away page two to free memory. Realize we have a free space.
We would have kept it but we could ignore that. So then now if the execution engine comes back and says hey I want page two again.
The bufferable manager would recognize it doesn't have that page in in memory.
11:22
It's got to go back to disk and get it. But now it's going to copy into to this frame and it hands back that pointer to the database system to the execution engine.
And that's perfectly fine, right? It's just an address in memory telling you here's where to go find the thing that that you wanted.
11:38
>> What's that? >> The question is what about the previous pointer going back here?
So at some point when we hand off this pointer the the the the contract is I'm giving you this memory. You got to tell me when you're done with it.
So at some point the executioner says I'm done with it.
11:55
That's why we were allowed to then evict it. >> What's that?
>> The statement is something like reference counting. Yes, but we'll get there in a second.
But there's more. Yeah, >> the question is what if the execution
12:10
engine asks for more frames that are in memory? We got to throw things away.
We'll handle that. Right.
The whole point of like we said at the beginning, we want to give the illusion that we have more memory than we actually do. So we'll have we'll have to handle that ourselves.
Other questions?
12:25
>> Yes. >> Uh the question is is it random how the frames are being selected by the execution engine or or for eviction?
>> No. >> How do we decide how do we decide to do
12:41
the frame? We'll cover that in a second, >> but But you want to write the the rest of the system, the execution engine, so that we don't actually care.
It could be random. Who cares?
You ask me for this
12:56
page. Here's the memory for it.
Go for it. >> Yes.
>> The question is, is the size ratio from a page to frame one to one? Yes, it has to be.
Um and well we'll talk about like there are
13:12
some advanced systems like IBM DB2 where you can actually have a buffer pool that has different page sizes uh per table per index right uh and you can even specify what the replacement policy should be per per buffer pool instance
13:27
but in general like if it has to be like if I'm reading from this I know I'm reading from this database file on disk it's organized in these database page sizes my bufferable has to have frames of those sizes. Okay, so I'm showing this example here
13:44
that we're using the buffer pool to read disk pages or the data pages for for a database. But in actuality, you're going to be using these buffer pools for all the different parts of the database system.
Basically, when when the system boots up, when data system boots up, you want to call Malo, get all the memory you're going to ever need to run
14:00
queries, and then you're done. Never want to go back to OS and get more memory.
Not every system does that. Uh, but in general, that's what you want.
That's that's the ideal case because the worst thing for me to do is start trying to out do a maloc and get some memory I need for like a temp buffer and then the
14:16
OS says I'm out, right? You want to know that sooner rather than later.
So, in general, anytime you're going to allocate memory, um, not from the stack, but anything from the heap for your for your application or your data system, you want that to come from one of these buffer pools. Some of them will be
14:32
backed by disk meaning like if I as you saying if I have if I'm trying to allocate more memory than I actually have available to me it can spill to disk. Other times you don't want to do that like if it's you just say if I'm out of memory then I I'll just kill the query right
14:49
so again we're we won't really talk about these right now we talk when we talk about join algorithms and sorting algorithms make more sense log buffers as well just be aware that there's memory being used for other things other than just tupils or indexes but the main one we want to focus on is those things
15:06
all right so today we're going to talk about the basics of what a buffer hole manager is I think the textbook calls them a buffer cache, a buffer cache manager. Sometimes it's called a a buffer manager.
They're all they're all basically the same thing. It's managing the memory that the data system is going to use to to run queries.
Then we'll talk about um my my pet uh my
15:26
pet topic of why you don't want to use the operating system for any of this because it's going to ruin your life. Um and then we'll talk about replacement policies, how to decide when we want to evict things, uh how to write things out to disk, and then do some additional optimizations.
Okay. All right.
So the buffer pool as I said
15:42
it's just a bunch of memory that we've allocated in our data system that we're going to use to put in fixed size pages and those locations within that that giant array is going to be uh each location where we could put could put a page in we'll call that a frame. So
15:57
again when now the data system other parts of the data system will says I want to get this page it ask the buffer pool manager for it. If it's available it's already in memory then the buffer pool manager just hands back the pointer for it.
If it's not available, then it has to decide what what frame to put it in. And if there's not a free frame,
16:14
we'll have to handle that. And the page you're putting into memory that you're going to hand off to the other parts of the system, it'll be an exact copy of what is given what what is uh what is being read into the from disk, right?
There isn't like
16:30
uh make sure this is true. If the like the low level like the say you're using like a a storage appliance, it might do its own compression down below actually the hardware level, but when you actually do a read against that file, you're going to get back the uncompressed bytes and that's what the
16:46
data center is going to hand back to uh to the exchange engine. Now within that page, it actually may be compressed and the data center knows how to interpret that compression.
We'll see how we do that next week, but in general, it's a one to one copy for whatever's on disk goes goes into memory. So we look at an example like this.
We have four pages on
17:03
disk. We want to read page one.
The buff manager is going to say, "All right, I got to put this somewhere." It picks the first frame, makes a copy into it in there. All right.
Now, once we read page three, same thing. Make a copy and go in there.
So, the buffer pool is actually the memory location where we actually
17:18
store this. There's an additional data structure called the page table that's going to sit in front of this memory, which is actually the entry point for how you find pages that are in memory or not or check whether they're in memory or not.
I think this is just another hash table where I do a look up on the
17:33
page ID and then within that slot within that hash table it's going to tell me whether the the the entry exists or not and where to go find it. So there's some additional metadata in here uh sort of what what they brought about like reference counting.
I'm also going to keep track of like how many how
17:49
many times have I handed out this uh this page to other parts of the system. It's called a pin counter.
Reference counter same idea. And so you would say a page is pinned in memory means that some other part of the system currently has a pointer to that memory and therefore I don't want to free that
18:05
frame up and throw away whatever's in there because I don't want somebody some other part of the system start reading what's interpreting what's inside the page and then I replace it with another page and now it starts getting garbage data or incorrect data. So the pin counter is how we're going to avoid that problem.
18:20
We also can uh maintain a uh a latch inside our data structure to ensure that while we're doing a lookup in the page table to say does this yes question yes >> the question is what sorry question is wouldn't what are the page
18:36
table entry sizes uh less than a kilobyte right because the big part is the is the actual frames the actual memory the hash tables hash table could be can be large in terms to have a lot of entries but Each entry itself is not that big, right? Like so there'll be
18:52
like a dirty flag whether it's been modified that's that's one bite. The reference counter probably a single bite.
Access tracking information that would be like um what transaction is actually is is accessing this for me like who's
19:07
actually what is the thread or the worker ID that's actually holds the pin for this. Again it's it's not that big.
All right. All right.
So, I need a latch within my uh my page table as well. So, that like if I say, "All right, well, I need to go get page two.
I have a free
19:24
uh location in my page table. Let me protect this while I go then do the disk read to go put that into a frame to ensure that nobody else comes along and tries to put something in my page table that that same slot, that same location, right?
And then once I'm done, I can
19:40
release the latch and and throw away the pin." Does anybody know what a latch is? Go for it.
>> UTEx. Perfect.
Yes. So, if you're coming from the OS world, what we call latches,
19:57
they'll call locks. And the end of the day, it's it's a mutx.
You could use the OS's mutex or the Pthread mutx. You shouldn't, but you could.
Uh, we can always do better in the data system. But the reason why we have this between locks and latches is that locks in the
20:13
database world is a separate concept that is about protecting higher level logical entities like I can lock a page, I can lock a a tupil, I can lock lock an index, right? And a latch is going to be what you protect a internal critical section of any kind of data structure inside
20:30
your database system uh from like multiple threads or multiple workers accessing it at the same time. And the reason why we have to make this distinction is that in the lock case we have to deal with stupid people on the outside of the database system like someone starts a transaction and then
20:46
decides to go out you know for a cup of coffee and the transaction still open. So we as a data system have to deal with somebody you know walking away from from the computer and and be able to rectify any issues like deadlocks or live locks and so forth.
a latch since that's protecting internal data structure.
21:02
That's us like who actually building the data system. We're paid a lot of money to do it and therefore we won't be stupid in theory.
Uh and therefore it's up for us to make sure that we don't have deadlocks or other problems like because there isn't going to be this lock manager thing above that's going to kill us if we have deadlocks. So it's up
21:18
for us to make sure we we do this correctly. So this would be a low-level mutex.
You could use the again P3 mutex the Apple one called the parking mutx one or parking locks. That's better.
Um but we we'll cover that in a few more weeks. But the main main idea here is that it's g again to protect the
21:33
internal data structure of the system, right? And there isn't going to be anybody that can make sure that we don't have problems.
And there's not going to be any way to roll back changes unless we do it ourselves. In the case of a lock, if I open a transaction through SQL and then I make some changes and then my transaction gets killed, the
21:48
database system will clean clean up the mess afterwards for me. It won't do that for for a latch.
Again, just also as a reminder between the page table and the page directory. Page directory is that thing I said in the beginning where I use that to find the pages I want on disk uh given an ID
22:05
and the page table is this infeal internal data structure that keeps track of here's the location within frames of all the pages that I brought in from disk and where they reside in memory. Again, you don't have to store that on on disk.
If you crash, you know, you
22:20
lose whatever is in memory anyway, so who cares? There are some systems actually can will write out what the contents of the page table is so that if I crash and come back rather than sort of organically populating the the page table through whoever access whoever accesses what I can actually preload or
22:37
prefetch all the uh all the pages that I had in memory before after the crash and bring that back in. You see this in serverless systems again we'll cover this later but just in general the page table does not does not need to be durable does not need to be stored on disk.
All right. So, what does this sound
22:53
like? What I've been talking about so far.
He sort of brought up the point before where what happens if I have, you know, my database is large amount of memory that I have. We're basically again trying to provide the illusion that our system has more memory than it actually physically has.
What does that sound like?
23:09
Say swapping, but like what is the higher level concept of that? >> Virtual memory.
Yes. All right.
Well, yeah, if you take an OS class, this is what this is what the OS wants to do for you uh through virtual memory. And it has a a a mechanism called memory map files that allows you to take a file
23:26
that resides on disk and you call mm mapap open on it and it basically maps the contents of that file into the the address space of your process right it obviously doesn't do this eagerly it does this lazily so I call
23:42
call the open on the file and it then gives me a starting point of of the memory address and anytime I jump to an offset within that starting memory address it will correspond to some off some offset within the pages of the file that I mapped in. So the OS was going to
23:57
be trying to do this uh basically trying to do the same thing we're doing talk about today, but it's going to do a terrible job at it. Um I guess I'm sort of poisoning the well already, but trust me, you don't want to do this.
All right, so here's that file we had on disc before and we have virtual memory. So again I I call m map on it and then
24:14
in my address base my process it's going to say all right well here's the four pages starting at some offset uh for the file that you mapped in. So then now when any kind of any thread or process accesses one of these pages right you try to again just read a memory address
24:29
uh at one of these offsets the OS says oh the thing you want isn't actually in memory it's a major page fault so it pauses your uh pauses your thread go then fetches whatever the the page you wanted puts it into physical memory and then wires up the virtual memory table
24:45
to now point to that physical memory and then your your process gets gets gets control back again and you can do whatever you with it, right? Do another read another page.
Same thing, right? It's not in memory.
We get a major page fault. We block fetches it in.
We we wire it up and then our process can run
25:01
again. So now the challenge is going to be what happens when we try to access something again for this file.
But now we're out of physical memory, right? What's going to happen?
I've already said, right? It's going to
25:16
block your process. Then it's going to go decide now which page in physical memory to evict.
You know if it's if it's not dirty it just throws it away. If it is dirty meaning it's in modified since you've read it in it'll write it back out to disk.
Then copy whatever the pages that
25:33
you want it in once it's put put in the physical memory and then hand you back control. So again, this seems like exactly what we want in our uh in our database system, but because the operating system is doing it for us, it doesn't know anything about SQL, doesn't know
25:49
anything about queries or transactions or whatever we're trying to do, doesn't know anything about what actually is in these files, and therefore it's not in a good position to make decisions about what the right thing to evict is. So what are the problems?
So the first thing is that the OS can
26:04
decide to swap out a dirty page anytime that it wants. Right?
This background writer can go through and says, "All right, you've you've modified this page. So, let me go ahead and preemptively write it out for you so that at some point if I need to evict it, I don't have and since I'm installing your your process, I don't
26:20
have to block you for a longer long time while I write it out, I can do this in the background. And that way when you want to go evict this, you just can throw it away because it's now been marked clean." But the problem is we have no control.
we in the database system have no control what it's going to write out and
26:37
when it's going to write it out because again it doesn't know what's in these files doesn't know how how the upper level part of the transaction is actually modifying them so there may be a case where there's a dependency between uh multiple pages where I can't write out this dirty page till this other dirty page has been written now
26:54
okay we'll see this when we talk about transactions later on but like you basically if I modify a page in the data system I want to make sure I write a log record that says here's the change I made to this page before I write out that change. But the OS doesn't know that.
It doesn't know that there's a write ahead log or this other thing over here. It says, "All right, I got some
27:09
ready pages. Let me go ahead and write them out." You can prevent them from being evicted using MLOCK, which basically the same thing as a PIN for the pages, but that doesn't prevent it from from running it out, right?
So that that's a problem for us. Then we have
27:25
these this stalling issue where the data system doesn't know what pages are in memory. it tries to go then access something and if it's not memory then my worker either a thread or process will have to stall while the the OS decides to go fetch it uh fetch it in.
So that
27:40
means that my worker is not actually doing any useful uh you know any useful work that it could be doing while the while it's going fetching something from the disk in the background because it's it got blocked. Now, you could start, you know, doing a little play play a little game where you
27:55
have a separate thread on the side that can go touch a page uh that you you think you're going to need. And that way, if it it gets blocked, uh the the worker that's running the query can still do other stuff, but then you're actually you're adding more complexity to your system when what the OS is
28:11
trying to do, you know, trying to make it simp, you know, trying to do something simple for you, but now you're actually adding a bunch of stuff in order to deal with the OS doing stupid things for you. Another problem that's actually more nuance and not really obvious until you start building real systems is that error
28:27
handling becomes more more challenging now with the OS uh with MAPAP because a failure can occur at any time anywhere in your in your application in your system. That means that at any time I'm touch that memory, it might have got evicted or might been there might be a problem
28:44
in in the actual underlying hardware and I'm I'm not not going to get an exception. I'm going to get a interruption or interrupt.
So now I got to have interrupt handler code all throughout my system. Whereas if I was managing disk and memory myself in my database system, at the time I try to go
29:00
do read it, if I can't read it and I get an error, that's the only location where I need to handle that. But if you use MAP because it's again trying to be transparent to you, it's all over in your system.
>> Yes.
29:16
>> Because I would think if there's a page just because the page is not for you, right? >> So it's not a page fault.
It'd be like an interrupt like the disc is broken or something. >> Yeah.
So you try to read something and the
29:32
someone ripped the file out. It's gone.
What do you do or lip lip rip the hardware out? >> You get interrupt.
>> Uh the question is statement is and they're correct. Uh that like if if someone screws around
29:48
the system uh and starts breaking your files. Uh, but the point I'm trying to make is like if I say so I read something from disk, I put it into memory and then I rip rip the hard drive out, >> then in theory the system can still run
30:05
because it has it in memory, right? And so so if I then try to read that page again since the file is gone, I'll get a failure at that at that moment and I can handle it right there.
Whereas with MAPAP, if I if some other part system tried to read it again and since it's trying to be transparent to me, I just do a memory access.
30:23
non-tremistic. Yes.
Yes. And it's a bunch of more engineering work you have to do.
And the last one, I wouldn't belabor this too much, but uh just sort of take my word on this. Uh it's going to be slower than doing everything ourselves because what is the how the OS tracking virtual memory?
It's the same
30:40
thing as the page table that we had before. So, it's going to have its own internal data structures with its own latches it used to protect protect those things.
And we in the data system world, we can always do a better job. So again, I don't want to go through this too much, but saying like you can
30:56
start tricking around the the the OS to make it look more like what we actually want to build in our database system, but it's it's just as much work, and in the end, you're you're relinquishing control to what's being brought to memory and and when when it's being written out, at what time, and when. like all that is uh you're giving up to
31:13
to to the OS and you're just better off just writing everything yourself which is what this class is about. So there's a bunch of systems that have been uh tried to use MAPAP over the years.
Uh probably the one that you're most familiar with is is MongoDB, right?
31:30
When MongoDB first started, they didn't have their own buffer pool we're talking about today. They relied on MAPAP, right?
But since then all these systems have have gotten rid of it because they basically realized all the mistakes that I'm I'm telling you here. Uh you know they realized that over the years like
31:45
this thing is just not maintainable that you want to be doing everything yourself in in your data system. I like to point out because went IPO in what 2018 they had a ton of money.
You had a ton of top engineers and if they couldn't make mat work be functional uh then you know what what
32:02
what uh hope does everyone else have? So instead what what MongoDB did was they bought Wire Tiger.
Wire Tiger only uses MAPAP for one very small use case but the whole regular system is still based off uh uh you know buffer pools that we're talking here today. SQL light has
32:18
MAPAP. It's for portability reasons and by default you don't get MAPAP.
You have to tell it you want MAP. By default you get the buff stuff that we're talking about today.
So this will be an over overarching theme throughout the entire semester is that the database system meaning us actually building the system. We're
32:35
always going to be in a better position to make decisions about anything and everything, right? Because we know what the queries are.
We know what the query we know what things we've executed in the past or what we think we're executing in the future. We know what our data looks like, what the dependencies are between them.
Therefore, we're always going to make better decisions than the OS. And we'll
32:52
talk about these tricks later on, but like we'll have better prefetching than the OS. We'll have better replacement policy in the OS.
We can schedule things differently. The end of the day, we don't want to talk to the OS because it's not going to be our friend.
It's always going to try to screw us over. If you don't believe me, go look up Lionus's post on the mailing list about
33:07
what he thinks about database people, right? He's always going to try to ruin our lives.
So, we never wanted to talk to it as much as possible. So, we actually ended up writing a paper a few years ago explicitly on this exact topic.
So, why saying it's my pet pet uh pet project? Because all these companies would come give talks at CMU and talk
33:23
about how they're using MAPAP. And I was like, "That seems like a bad idea." Like, "Oh, what are you talking about?
It's great." And then like two years later, they were like, "Oh, yeah, you were right. That was terrible." So we then we wrote a paper paper about it.
Uh you go to that link there, there's a video that can sort of summarize the key things I'm talking about today. Um as
33:38
well. And then uh just a few weeks ago, yeah, a few weeks ago, somebody posted on Hacker News saying like, "Hey, I started building a data system using MAPAP uh because I thought I could figure it out." Nope.
They were wrong. And then they were like, "Oh, I" because he knew about the paper that we wrote before, then he realized, "Oh, yeah, I
33:55
guess I have to do what Andy said." Uh, so again, the rest of your life, never use MAP in your data system. Okay.
So, if now we're responsible for memory and we're responsible for deciding what
34:10
comes in and out, we have to decide what to evict when we run out of memory. Yes.
>> But I don't get by. Isn't the database the statement is um uh how do we bypass
34:26
virtual memory? I mean, yes.
So, so when you call Malo, yes, you get you get anonymous MAPAP. The thing I care about is I don't want the OS to decide how to how to flush pages out.
So, it could start thrashing for us because we do
34:42
anonymous MPAP, you get a bunch of memory that when you call maloc and then it starts running memory, it starts paging things out. But in general, like I was saying, when you when the system boots up, you want to allocate all the memory you you need ever at that moment.
And therefore, you have to tell it use exactly the amount of memory that's
34:57
physically available. So if you're if your if your your system you're running on has 100 gigs, you would tell your data system, hey, you have 80 gigs, do whatever you want with it, and it mallets whatever it wants, and that'll never get thrashed.
>> But that's not true. Like another user program, then
35:14
>> the statement is that's not true. If the if another if another system running on the same box allocates a bunch of memory too, could could you start thrashing?
Yes. Don't do that.
That's like what else would like you never want to run another process or
35:31
another system on your data system. Same machine.
That's considered better, you know, general the right way to do things. If it's like a word WordPress website, who cares, right?
It's something real simple. But if it's like any mission anything mission critical, you don't run you don't run like a
35:46
Bitcoin miner in the same box as your your data system. >> But then why would the OS >> why would the OS want to page you out?
>> Why would you? >> So um the statement is why would the OS
36:02
page you out if you're if you're saying that I'm running on a box that only has the memory that I need for the only thing running is my data system and has all the memory I could ever want. You're correct.
it won't page you out, but it could write out the dirty pages and that that you're screwed on
36:18
as well as all the other other stuff. There's other reporting issues, too, like you want to say, all right, how much database, how much memory is my data center using?
Well, now I got to like go look in whatever is in the the talk about the page cast in a second. Like you can't just say that you can't just go to the OS and say, how much memory is this process using because
36:35
it's like that plus the page cast or whatever math stuff. It becomes more complicated for reporting.
Okay. All right.
So, go back here. Sorry.
Um, so we we have some fixed amount of memory that we can use to store data uh
36:52
for the pages we bring in memory. But then again, at some point we're going to run out of memory, run out of free frames.
We have to decide how to make up make space. And this is called the buffer replacement policy.
And the idea is that we're going to run some algorithm that can choose which which
37:08
page we think we're we're least likely going to need in the future in the ideal scenario and we're going to go ahead and invict it, free that frame up and then now bring the the requested new requested page into memory and reuse that frame. So there's a you know this is an old problem in in in just
37:25
computing computer science in general like it's basically caching, right? because we have to deal with that memory hierarchy where the things that are have more space are going to be slower and farther away, but the fast stuff like DRAM is going to be much more limited in size.
So now when we run our replacement
37:41
policy algorithm, we obviously want to be correct to the extent we can because we want to make sure that we throw things away that we're least likely going to need in the future. Uh, and we want this to be as fast as possible because again, if if the the cost of a dis IO is like 1 millisecond, but if a
37:56
replacement algorithm takes 10 milliseconds to run, we're just better off just reading things from disk because that's always be faster, right? And the amount of memor memory we're going to use to store the whatever metadata we have about what's in our buffer pool, we want that to be as small as possible because if we're spending,
38:12
you know, megabytes to store uh again metadata about what's within the the memory, then that's memory we can't be using to store more data. So I said this is an old problem.
Uh one of those common solutions uh in in
38:29
systems goes back to 1965. LRU least recently used.
And the basic idea is that you just keep track of a time stamp for all the pages you have in memory based on when when they were last accessed. And then when it comes time to evict something, you just pick whatever one that was accessed the the longest
38:46
ago. And anytime you access something again, you just update that time stamp.
All right? So say I have a simple a database has three pages and my prefer pool has three pages.
And so again, I can just maintain this this link list here. It's one way to do So it could do a table where the order in which they
39:02
exist in the link list corresponds to the order in which they were last accessed. >> Yes.
>> Question is is where would these timestamps being stored? Yeah.
Metadata page tables.
39:19
access and then you >> the statement is uh let me walk through this and I think it'll make sense but yeah when when you go try to get a page I want to get page one I go look my buffer pool I see that
39:35
it exists there so I'm I don't go to dis and get it but then I go find whatever it is in the in the the LRU list or could be a table doesn't matter and I just move it to the to the front and then then I hand back the pointer to whoever wanted Yes,
39:52
>> you save it as if you have a list. Why do you need the time stamp?
>> You could sort the time you sort of as a list, but then you got to like search list every time you or you have a table and just store the time stamp in the table effectively the same way. This is just two representations.
I'll show when I show l I'll show a table.
40:12
All right. So then now when I want to fetch something I don't memory in, right?
I just I pick whatever's on the last page at the end of the list and I go ahead and pick that and then uh free up space for the data that I want. Right?
So again, this will give you the exact order in which uh the table is being
40:28
accessed. But maybe you actually don't need the exact order.
You can do a more simplified version called clock which came out in 1969. And it's basically an approximate LRU.
And the idea is now instead of having a time stamp uh or a link list to keep track of like when things are accessed,
40:44
I just have a simple reference counter that's a single bit for every single page. And then now all I need to do is check to see has this bit been set meaning has this has this page been accessed since the last time I checked it.
40:59
Right? So at the very beginning I I would look at at this one here the if I access page one I'll set its reference count now to uh to to one and then now there's this thing called the clock hand.
I'm showing this in sort of circular pattern but obviously we store
41:15
this in in a list and it's just a cursor that goes through anytime I want to evict something to try try a free frame I just go through and check to see whether the reference count is set to one. If it is, then I'd put it back to zero.
And I keep scanning along till I find one where the reference count is zero. Meaning since the last time this
41:31
algorithm ran, this page was an access. So therefore, it's safe for me to go ahead and and evict it.
>> Yes, >> you're using the word page, but here are you frames? >> So yes, the statement is um
41:46
I'm I'm using the word page, but do I mean frames? So these are the pages that are in frames.
So these are the pages that are in memory. >> A frame is like a memory location where I can put a page in.
>> So you memor
42:02
statement is do you am I iterating over the frame locations and not the page ids for these are it's a powerpoint diagram. It doesn't matter but in general like you you would just you would have the the page table would be the separate thing you iterate over that that's separate from the frames.
42:22
Sorry. Yes.
Here. So in the page table, you have all the possible page ID insert.
>> No, the page table is going to have what are pages that are in frames right now in memory. >> It's not a mapping of page ID.
>> That's the page directory. The page directory is like here's all the frames
42:37
that exist. The page table says here's the mapping from a page ID to a frame.
>> Right? So it has all the page IDs >> that are in memory.
>> The list of keys changes. >> Yes.
Because I because I if I evict something that I hash
42:53
>> yeah it's a hashmap but if I if I evict something I I have to evict going back here I evict page two right in my page table I have an entry for page two that's going to point to a frame where page two exists. So now when I want to go to evict it I would say assuming this this
43:09
algorithm runs said all right I'm going to evict page two I remove it from the page table. So if anybody comes now looking for page two they're not going to see it and they're they know they have a cache miss.
So what's in the buffer pool? Sorry, what's in the page table is whatever's in the memory right now.
43:25
>> We'll talk about ghost pages in a second. There you could keep track of all the pages.
That's not this though. That's later.
I don't want to copy anything. Other questions?
Yes. >> Statement is this is not a reference
43:40
counter. It's just a bit.
It's it's a reference counter that goes up to one. Just one.
That's it. I It's not a reference counter in terms of like who has Yes.
who has who's holding this page right now. It's just a saying this thing
43:56
has this thing referenced accessed. Yes.
The terminology maybe it's hard to they reuse a lot of these terms. It's called call it a reference bit, but it's really a an access bit.
Was this thing accessed since the last time I checked?
44:14
All right. Again, just looping through set.
bring in page five, set this reference count to zero, and I keep you say these other pages are accessed. So, as I scan along, I change their reference bit to back to zero.
I come back up here. This guy's set to zero again.
So, I go ahead, I can go ahead and invict it.
44:31
This is what basically Linux does in its own internal page table. Uh, it's a it's a multi-hand clock.
Um, the high level idea is basically the same. >> And again, for some things, this is okay.
Again, we're not having exact listing, but this is probably good enough. Yes.
For each iteration of this
44:46
does the cursor or does it start? >> Yes.
So he said in does this algorithm resume from the beginning or where it left off last? That's where we left off last.
>> Yes. >> This only runs when we need to.
45:03
>> Yes. The statement is and it's correct.
This algorithm only runs when you need to evict something. So everything we're talking about today is like how do you decide what to evict when I when I need a frame?
>> But the clock's like motion. Does it only saying we're accessing
45:18
page five and z question like when does the clock actually run so if I access a page I set the bit to one I have to right the the
45:33
clock sweep only occurs when someone says I need a frame I don't have any free ones then it kicks in and runs and it starts setting them back to zero >> why is this Question is why is this faster than LU? Because the amount of metadata I have to record for this is like super simple.
45:49
It's a bit >> then is does this take longer? Like is the computational complexity less than LRU?
>> Uh again it depends on how you
46:10
Sure. But when I act statement is like it's 01 to pop something off the end of list, but when I have to update it, I got to go find it and then put it put it to the front again.
46:26
>> Yeah, that's >> Yeah, but this is this is like this is nothing. >> We all right.
I don't got this point yet. They they bring up a very good point.
This is not your algorithms class where like, oh, if it's constant time, they're all the
46:42
same. No, constants matter a lot, right?
Like the difference between 1 millisecond and 10 millconds is a lot. It's an order of magnitude.
So, this isn't this isn't who teaches I don't teaches algorithms. This isn't algorithms.
We care about this, right? So, like it is constant, but like you
46:58
know computation it's the wall clock time is significant. Yes.
>> Does the clock start at the same point every time? >> No.
The question is when where does the clock start at the same point every time? No.
Because in that case, you'd be hammering whatever the first one is. It's where you left off, right?
So going back here,
47:14
right? So I flip this guy, right?
Say now the the clock runs again. So these guys get access down below, right?
Set to one. Then I got to evict something.
It when it starts where it left off. All right.
I don't spend too much time
47:29
on clock. Not does this.
Uh it's just an approximation of of LU. LRU is actually common but not in the simplified form that I showed before and that's because both LU and clock are susceptible to a problem called sequential flooding which
47:46
basically means that if I have a query that comes along and does a sequential scan which means I'm going to read the table from beginning to end then that's going to blow away any uh uh recency information that I' that I've been maintaining in my database system these pages because this is going to go
48:02
through and rip out throw away things and put a bunch of pages into my buffer pool that maybe I actually don't want to be in there because I only need it for that that that sequential scan. So I lose all sort of the information that I have about again the recency information because you're basically polluting the
48:17
buffer pool. And in some workloads like OLAP for analytics, we'll talk about that next week where you're not really doing single lookups on single keys.
You're actually doing these scans over and over again and you don't actually want least re uh least recency recently used. you actually most recently used
48:35
because then you start playing games of like oh I I have to do I do much more scans so I'll scan from beginning to the end next query shows up maybe I want to scan from from the end to the beginning because the pages that I just read are going to be in the buff in the buffer pool right so what looks like this so say I
48:52
have a query at the top I'm doing a it's called a point query I'm trying to find a single record where ID equals one assuming ID is the primary key so somehow I figured out that ID equals one is in page zero. So I go ahead and put that by my buffer pool.
Then now a a
49:07
scan query comes along. I want to compute the average from some column in the table and that's going to go from beginning to end.
And at some point it's going to it's going to run uh when it gets to this last page here, page three. Doesn't have any more free space in the buffer pool.
So then it's got to run the the you know the the LU algorithm decide
49:23
what to kick out. Well, page zero was the last thing that it read the one least recently read.
and I'm going to go ahead and invict that and put it in page three. But let's say this query at the top where I'm doing these ID lookups, that's actually super common and I'm doing I'm doing a bunch of lookups on
49:39
this ID equals one. Like a workloads are very skewed where people always trying to read the same thing or write the same thing over and over.
Taking like you know Taylor Swift, she's on Instagram whatever her Instagram account versus like some random guy's cat's Instagram account. There's more people reading Taylor Swift's thing rather than the the
49:55
cat guys thing. So, we want to keep the Taylor Swift thing in memory.
But if we have if we're just using LRU, we wouldn't be able to do that because we would not know that Taylor Swift is popular and we would end up evicting her page if we used that that basic LRU algorithm,
50:12
right? So, you say, all right, well, what if I do something like keep track of uh how often these pages are being used and there when I want to run my eviction algorithm, I want to do the one that has been the least frequently accessed, right?
just maintain a counter for every single page of how often they're being
50:29
accessed in when they're in memory. And then now when I want to evict something, I just pick whatever one that has the smallest count, right?
Seems reasonable. Of course, there's a paper on this from 1971, two years after uh clock.
Well, this causes more problems because
50:45
now to his point about the complexity, now it's logarithmic complexity maintaining this information. All right, because for all the the table.
And then furthermore, if I have someone that does something weird like go look up the cat guy's uh Instagram account in the middle of the
51:01
night a billion times and then at no point does did everyone go back and look at that that cat guy's account, his counter is be super high and I'm never going to end up being evicting it even though it's not useful anymore anymore at all because there's no information about time in this count. It's just like
51:17
how many times would it was it accessed now? how how long ago was it accessed?
So the solution to this is a simple enhancement that came uh out in the early 90s uh called LRUK and there's a great paper that that talks about this and this is actually what's used in SQL
51:33
server. This is actually what Postgress implemented first in the the late 90s uh written by this guy Tom Lane who's he did his PhD here at CMU.
Um, and the basic idea is that we're going to maintain multiple LRU lists and then when it comes time comes time to evict
51:49
something, we're going to uh calculate the the the length of time from when it was last accessed or the the oldest access and the current time. And then that way we can decide who who has been accessed the longest ago.
And since now
52:05
we're gonna have these multiple uh LU lists, we would know whether someone just accessed something once and never accessed it again. And so therefore, we don't have that that blowout before where someone has a high count that never comes back.
So the
52:23
in order to make sure that if we evict something and then bring it back in, we don't have to learn from scratch all over again that this thing is a hot page or not. We can actually also maintain what are called ghost caches.
a ghost list where we can keep track of even though something got evicted, we'll
52:38
still keep its access timestamps in memory, so that when we fetch it back in the second time around, we can prepopulate our LRU lists with the timestamps it had before it got written out to disk. So this provides more
52:54
persistence in our algorithm: if we write things out and bring them back, we're not learning from scratch all over again. So let's look at a simple example.
Now, instead of using a linked list to keep track of the ordering, I'm going to maintain a separate table, and you can think
53:10
of this as like the page table, where I'm just maintaining timestamps. Say that in the past I've accessed pages 0, 1, and 2, and that's what's in my buffer pool, and assume the timestamp is just a monotonically increasing
53:26
counter, like a 32-bit integer: 1, 2, 3, 4, 5, 6, and so forth. The query starts off here with an IN clause where I'm going to access four different values on four different pages.
Now, I'm going to show them in the order that
53:41
they're specified in the IN clause. But again, SQL is unordered.
In actuality, you'd want the system to figure out what pages these records, like 1, 131, 11, and 41, are on and then access them in sequential order. But to
53:58
show how the page history works, I'm going to have them jump around randomly. When my query starts, I want to read ID = 1;
that's on page zero. Because I already have this entry in my buffer pool, I just update its history. There are two columns: one is the last access,
54:15
the most recent one; and since we're doing k = 2 here, I have a second column that keeps track of the oldest of the last two timestamps. At the very beginning, if I've accessed a page only once, there's nothing in that second column.
Again, I'm showing k = 2. You can go up to k = 6.
Some
54:33
systems can do that. All right.
In this case, I access page zero again at timestamp four, so timestamp one slides over to the second column and I put timestamp four in the first.
Then I jump down to access page three.
Now, this page is not in memory,
54:49
so it's not in our page table. We've got to decide what to evict to free up space for it.
The eviction policy takes the current timestamp; again, that's just an increasing counter.
Then we look at the oldest last-access timestamp,
55:06
the k-th one. In this case, k = 2.
We look at the second last-access column and subtract it from the current timestamp to figure out the distance between the two accesses, then choose whichever page's distance is largest, meaning it's been the longest since it was accessed
55:22
before. If a page doesn't have an entry in the second column, you just use infinity, right?
And if you have ties, you pick whichever one has the oldest timestamp. So again, we look at the second column here.
You go through and take the current timestamp, five, and subtract whatever is in the second
55:39
column. In this case, five minus one gives you four. But the other two pages don't have entries there,
so they're infinity. Now I say: all right, both of these are infinity,
so I'm going to pick whichever has the oldest timestamp in the first column. In this case, it's page
55:56
one. We go ahead and evict that, put page three in its frame, and record timestamp five.
Do it again. Now I access page one, which is the thing I just evicted,
so it's not in memory. I have to run the same algorithm as before, where I
56:11
subtract the last-access timestamps from the current one. In this case, the two bottom pages are still infinity.
I pick page two, because its timestamp, three, is less than five. So that's what I'm going to evict,
and that's where I'll put page one. But remember, I said I have that ghost list or ghost
56:27
cache where I keep track of the timestamps of pages I just evicted. I evicted page one before,
so if I have the ghost cache, I can put timestamp two in there. That way, when the eviction policy runs again, this page doesn't end up getting
56:43
thrown out immediately all over again. Then I access the last page, page four.
Run the same algorithm, run the same calculation. Now I get the middle entry, because it hasn't been accessed twice:
I'm going to evict page three and put page four
56:59
in there.
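To tie the walkthrough together, here's a minimal sketch of that k = 2 eviction rule. The names are illustrative, not the project's actual interface, and a real implementation would also have to skip pinned pages.

```cpp
#include <cstdint>
#include <deque>
#include <unordered_map>

class LruKReplacer {
 public:
  explicit LruKReplacer(size_t k) : k_(k) {}

  void RecordAccess(uint64_t page_id) {
    auto &hist = history_[page_id];
    hist.push_back(++now_);                  // logical clock, not wall-clock time
    if (hist.size() > k_) hist.pop_front();  // keep only the last k accesses
  }

  // Evict the page with the largest backward k-distance (now minus the k-th
  // most recent access). Fewer than k accesses counts as distance = infinity;
  // ties among "infinite" pages go to the oldest recorded access.
  uint64_t PickVictim() {
    uint64_t victim = 0;
    bool found = false, best_inf = false;
    uint64_t best_dist = 0, best_oldest = 0;
    for (const auto &[page_id, hist] : history_) {
      bool inf = hist.size() < k_;
      uint64_t oldest = hist.front();  // the k-th most recent when hist is full
      uint64_t dist = inf ? 0 : now_ - oldest;
      bool better = !found ||
                    (inf && !best_inf) ||
                    (inf == best_inf &&
                     (inf ? oldest < best_oldest : dist > best_dist));
      if (better) {
        victim = page_id;
        found = true;
        best_inf = inf;
        best_dist = dist;
        best_oldest = oldest;
      }
    }
    history_.erase(victim);  // a ghost cache would keep this history instead
    return victim;
  }

 private:
  size_t k_;
  uint64_t now_{0};
  std::unordered_map<uint64_t, std::deque<uint64_t>> history_;  // oldest..newest
};
```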
Yes? >> The question is: is the ghost cache part of the page table, or is it a separate
57:15
data structure? It's usually a separate data structure.
It's not that big, so it's not a big deal.
Yes?
57:31
>> The question is about the timestamp: I'm showing it as a logical counter. Would you do that in a real system, or would you use hardware clocks, like wall-clock time?
You would typically use a logical one, because wall-clock time
57:47
can change, with things like daylight savings; time can go backwards. That's going to be a bigger problem when we talk about transactions later on,
and how to deal with that. Yeah.
Oh, time is terrible. Yeah.
58:04
Other questions? Okay.
Okay, so let me show a simplified version that MySQL uses, called approximate LRU-K. The basic idea is that they still have one linked list.
Again, it could be a table or a linked list, doesn't matter. But
58:20
they partition it into young and old regions, and you have two entry points into the linked list that let you find the beginning of the old region and the beginning of the young region. Now when I access a page, say page one
58:36
here, I've got to evict something because it's not in my buffer pool. So I go into the old list and slide everything over by one.
I get rid of page eight, the others slide over, and that's where I put page one.
58:53
The reason I put it in the old list rather than all the way at the beginning is that if I never access page one again, it'll just get sloughed off at the end. No big deal.
But if I do access page one again, then I want to put it at the beginning of
59:09
the young list, because I'm saying, oh, this is clearly important since it keeps getting accessed. So let me improve its chances of not getting evicted soon by putting it at the front of the young list.
Right.
59:25
Again, it's approximate. It's not exactly the way LRU-K did it, but it gives you roughly the same benefit.
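Here's a hypothetical sketch of that young/old idea. MySQL's InnoDB does this inside one list with a midpoint pointer; I'm modeling it as two lists to keep the sketch short, and the names are made up.

```cpp
#include <cstdint>
#include <list>
#include <unordered_map>

class MidpointLru {
 public:
  explicit MidpointLru(size_t capacity) : capacity_(capacity) {}

  void Access(uint64_t page_id) {
    auto it = pos_.find(page_id);
    if (it != pos_.end()) {
      // Second (or later) touch: promote to the head of the young region.
      young_.splice(young_.begin(), *it->second.list, it->second.iter);
      it->second.list = &young_;
      return;
    }
    if (pos_.size() >= capacity_) {
      // Evict from the tail of the old region (fall back to young if empty).
      auto &src = old_.empty() ? young_ : old_;
      pos_.erase(src.back());
      src.pop_back();
    }
    old_.push_front(page_id);  // first touch lands at the head of the old region
    pos_[page_id] = {&old_, old_.begin()};
  }

 private:
  struct Slot {
    std::list<uint64_t> *list;
    std::list<uint64_t>::iterator iter;
  };
  size_t capacity_;
  std::list<uint64_t> young_, old_;  // two regions of one conceptual list
  std::unordered_map<uint64_t, Slot> pos_;
};
```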
All right, we're short on time, but I do want to talk about ARC, because this is what you have to implement in project one. And
59:43
as I said, the projects this year are harder than in previous years, because you can by all means use LLMs to help you code. This is called the adaptive replacement cache.
It was invented by IBM Research for their disk storage systems, but I think it went into IBM DB2. It's in ZFS as
00:00
well, and it was actually added to Postgres around 2004 or 2005. But IBM patented it, and the Postgres people were kind of worried about that.
So they made some little tweaks to the algorithm to make it look
00:16
less like the patent, although IBM hasn't enforced it. And actually, I think it expires next year.
But this kind of policy is roughly considered the state of the art for buffer replacement
00:32
algorithms. The basic idea is that we want all the benefit of least-recently-used as well as the benefit of least-frequently-used.
But instead of having to balance how
00:50
much we depend on one list versus the other and manually tune that, we want a built-in policy that decides how to size the different lists based on the access patterns of the queries. And of course we maintain a ghost list to
01:06
keep track of what was evicted recently, so that if we fetch those pages back in, we can rely on their recorded history. In the interest of time, I know you need this for the project, but we can follow up online with a recitation
01:22
separately. Basically, there are multiple lists we maintain, and there's a parameter p that tracks how much we favor the recent list versus the frequent list.
So if we find ourselves accessing the same things over and over again, we allow
01:39
the frequent list to grow and correspondingly shrink the recent list. You automatically get the best of both worlds, least recently used and least frequently used,
and it can eventually narrow in on exactly what
01:55
the workload actually wants. And because it's dynamic, if the workload and access pattern change over time, this thing can handle it.
Again, we'll follow up with a separate recitation on the algorithm because you'll need it for the project, and there are a lot of
02:11
videos online that discuss how it works. But it's basically the same concepts as before, except you have this p parameter that changes where things are going to go.
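As a starting point before that recitation, here's a deliberately simplified ARC skeleton, loosely following the Megiddo and Modha paper. The real algorithm has more case analysis than this, and the project's interface will differ.

```cpp
#include <algorithm>
#include <cstdint>
#include <list>

class ArcSketch {
 public:
  explicit ArcSketch(size_t c) : c_(c) {}

  void Access(uint64_t x) {
    if (Remove(t1_, x) || Remove(t2_, x)) {  // resident hit
      t2_.push_front(x);                     // promote to the frequent list
      return;
    }
    if (Remove(b1_, x)) {  // ghost hit in B1: recency is winning, grow p
      p_ = std::min(c_, p_ + std::max<size_t>(1, b2_.size() / std::max<size_t>(1, b1_.size())));
      Replace();
      t2_.push_front(x);
      return;
    }
    if (Remove(b2_, x)) {  // ghost hit in B2: frequency is winning, shrink p
      p_ -= std::min(p_, std::max<size_t>(1, b1_.size() / std::max<size_t>(1, b2_.size())));
      Replace();
      t2_.push_front(x);
      return;
    }
    if (t1_.size() + t2_.size() >= c_) Replace();  // complete miss, cache full
    t1_.push_front(x);                             // first touch: recent list
    if (b1_.size() > c_) b1_.pop_back();           // bound the ghost lists
    if (b2_.size() > c_) b2_.pop_back();
  }

 private:
  // Evict from T1 if it exceeds its target size p, else from T2, remembering
  // the victim's identity (not its data) in the matching ghost list.
  void Replace() {
    if (!t1_.empty() && t1_.size() > p_) {
      b1_.push_front(t1_.back());
      t1_.pop_back();
    } else if (!t2_.empty()) {
      b2_.push_front(t2_.back());
      t2_.pop_back();
    } else if (!t1_.empty()) {  // fallback so a full cache always evicts
      b1_.push_front(t1_.back());
      t1_.pop_back();
    }
  }
  // Linear scans for brevity; a real version keeps a hash map of iterators.
  static bool Remove(std::list<uint64_t> &l, uint64_t x) {
    auto it = std::find(l.begin(), l.end(), x);
    if (it == l.end()) return false;
    l.erase(it);
    return true;
  }

  size_t c_;     // cache capacity
  size_t p_{0};  // adaptive target size of T1 (the recent list)
  std::list<uint64_t> t1_, t2_;  // resident: seen once / seen more than once
  std::list<uint64_t> b1_, b2_;  // ghosts: recently evicted from T1 / T2
};
```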
Okay. All right.
Other tricks you can do to make things go faster involve having
02:27
information about how a page is actually going to be used. If I know a query is doing a sequential scan over a table, and it's unlikely that this table is going to be used by other queries, then I don't want to put it into my
02:43
global buffer pool. I want a little separate side cache, a portion of the buffer pool dedicated just to my one running query.
Then as I'm scanning through, I'm not putting things in the global buffer pool and triggering its eviction policy. I'm putting them in my own little buffer
02:59
pool on the side, which is really just a partitioned piece of the total memory available to the system. Postgres does this.
They maintain a circular ring buffer for some portion of the buffer pool, so every time I ask for a new page, I just append to the end, and eventually it wraps back around.
03:15
It's really cheap to implement, and you minimize the effect of sequential scans, or sequential flooding, on the rest of the system.
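The whole trick is just a cursor that wraps, so a sketch of the ring, with made-up names, is tiny:

```cpp
#include <cstdint>
#include <vector>

class ScanRingBuffer {
 public:
  explicit ScanRingBuffer(size_t n_frames) : frames_(n_frames, kInvalid) {}

  // Returns the frame to use for the next page of the scan, overwriting
  // (and implicitly evicting) whatever the ring held in that slot before.
  size_t FrameForNextPage(uint64_t page_id) {
    size_t frame = next_;
    frames_[frame] = page_id;
    next_ = (next_ + 1) % frames_.size();  // wrap around: reuse the oldest frame
    return frame;
  }

 private:
  static constexpr uint64_t kInvalid = ~0ULL;
  std::vector<uint64_t> frames_;  // which page each ring slot currently holds
  size_t next_{0};
};
```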
Another trick is to maintain priority hints about how the query is going to behave, or how the execution
03:31
engine is using this data and where it lives in the pages, so that when the buffer pool replacement algorithm runs, it knows to avoid pages it considers more important than others. Think of something like an index.
We haven't talked about B+trees yet, but assume it's a tree
03:47
data structure. I would have priority hints for my pages saying the top of the index, the root page, is super important, because that's where everyone has to go to get inside the data structure and figure out where they need to go, right?
So, and this is not valid SQL, but assume I'm inserting
04:04
values where I'm just incrementing some counter by one over and over again. Well, if the leaf nodes along the bottom are sorted from min value to max value, then anytime I want to insert something, I always have to go to the root, and I'm always going down the same side of the tree.
So
04:20
therefore, I should tell my database system, hey, that index page zero is super important.
Yes, I would see it being accessed all the time in my frequency information, but maybe I want to make sure the database system never pages it out, no matter what,
04:36
because the first thing I always do when I access the index is go to that root. Same thing if I want to do a lookup.
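The mechanism for that can be as simple as a set of pinned page IDs that the eviction scan skips; here's a hypothetical sketch.

```cpp
#include <cstdint>
#include <unordered_set>

// A page flagged as always-hot (like a B+tree root) is simply skipped by
// the eviction scan, regardless of what the replacement policy says.
class PinnedPages {
 public:
  void PinForever(uint64_t page_id) { pinned_.insert(page_id); }
  bool IsEvictable(uint64_t page_id) const { return pinned_.count(page_id) == 0; }

 private:
  std::unordered_set<uint64_t> pinned_;
};
```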
There are other tricks you can do with priorities. If there's a transaction that's running
04:51
multiple queries, and the first thing it does is read some record and then immediately update that record, I don't want that to look like two separate unique accesses to my buffer pool manager.
I want to tell it: by the way, this is part of the same transaction. Therefore,
05:07
it only updates the access counter once, because the page didn't become more popular from everyone trying to access it; it's just one transaction accessing it multiple times.
SQL Server can do stuff like that. So there's a bunch of hints and information we can provide to the buffer manager based on
05:24
how we know queries are accessing this data. I didn't say this at the beginning, but I'll say it
multiple times throughout the semester: this is one of the things that separates the expensive enterprise systems, the Oracles, DB2s, and Teradatas, from the
05:39
open-source systems like Postgres, MySQL, and SQLite. They've spent a lot more time and money making these buffer replacement algorithms as sophisticated as possible. The ARC paper came out of IBM, and Postgres
05:56
implemented it, but there's a bunch more they've added since then in the DB2 version that isn't public. All right.
So what happens if we want to evict a page and we've
06:12
identified the one we want to remove, but it's been marked dirty, meaning some part of the system has modified it since it was brought into memory? We can't just throw it away. We have to write it back out to disk before we can reuse the frame.
So if the page has not been modified
06:27
then it's super easy: we just drop it from memory, no big deal. If it has been modified, then in order to reuse that frame, we have to wait until we write that data out to disk,
until it gets flushed. In some cases we first have to write out another record or page that
06:43
tells us what was modified in that page, like a log record. We have to flush that record first,
then flush the page itself, before we can free the frame up. And that's obviously going to be slow for us.
So ideally, when our buffer replacement algorithm runs, we only
06:59
want to consider clean pages; the more clean pages we have, the better our chance of finding one that's least likely to be used in the future.
07:15
So, similar to how the OS does preemptive background writing for pages in mmap, the database system can do the same thing. You just walk through the page table, find any pages that are marked dirty, and write them out in a background thread.
07:30
The idea is that by the time someone wants to free a frame, the dirty data has already been written out and we don't have to block and wait. This is sometimes called page cleaning or buffer flushing; the basic idea is the same.
Once we've written out the dirty page, we can flip the dirty bit,
07:46
and then it can be evicted easily. Of course, the challenge is that if we spend all our time doing background writing, that slows the system down, because now all our I/O is going to flushing dirty pages when we could be using it to run other queries.
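Here's a minimal sketch of such a page cleaner, against a hypothetical page-table API; none of these names are real, and note the log-before-page rule from a moment ago still applies.

```cpp
#include <chrono>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical interfaces; a real buffer pool's page table looks different.
struct PageTable {
  virtual std::vector<uint64_t> GetDirtyPages(size_t limit) = 0;
  virtual const void *GetFrame(uint64_t page_id) = 0;
  virtual void ClearDirtyBit(uint64_t page_id) = 0;
  virtual ~PageTable() = default;
};
struct Disk {
  virtual void WritePage(uint64_t page_id, const void *frame) = 0;
  virtual ~Disk() = default;
};

// Flush a bounded batch per cycle so cleaning doesn't starve foreground I/O.
void PageCleanerLoop(PageTable &table, Disk &disk, size_t batch_size) {
  while (true) {
    for (uint64_t page_id : table.GetDirtyPages(batch_size)) {
      // NOTE: a real system must first flush the log records that describe
      // this page's changes before writing the page itself.
      disk.WritePage(page_id, table.GetFrame(page_id));
      table.ClearDirtyBit(page_id);  // the frame is now cheap to evict
    }
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
  }
}
```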
So again, the high-end enterprise systems have all
08:03
sorts of thresholds and other mechanisms you can put in place to control how aggressive the background writing is, where other systems might just run it periodically on a schedule, or kick in once enough pages have been modified. I
08:19
think Postgres kicks in its background writer once enough dirty pages have accumulated in memory. There are simple things you can do like that, but the commercial systems are more sophisticated. All right, let me try to get through disk I/O quickly, and then we can
08:37
pick back up on the optimizations you can have in your buffer pool next class. When we have to write pages out to or read pages from disk, we want to do it in such a way that we
08:53
maximize the throughput or the bandwidth of the hardware, so that for a certain number of fetches in a certain amount of time, we get as much data as we can from disk into memory. Disks have gotten a lot faster in recent years,
but they're still a pretty significant
09:09
bottleneck. The way the operating system and the hardware try to do this on modern devices like NVMe drives is to maximize the amount of parallelism.
They try to be clever about reordering I/O requests to maximize the
09:26
amount of sequential ordering. But the challenge is that the OS doesn't know which of these requests have a higher priority than others, because it doesn't see that information.
You can set the I/O priority for disk reads and
09:42
disk writes, but you can only do it at the process level in Linux. So say a single database system has two threads that want to read data:
one is for a mission-critical query, and one is for some background task where who cares if it's slow.
Linux doesn't know anything about the difference between those two requests. So
09:58
this is again why we want to do all the scheduling in our own system. A lot of systems will tell you to turn off any sophisticated scheduling policies in Linux and use the simplest thing, because we're going to do everything better ourselves inside
10:14
our database system. Basically, the way this works is that the database system maintains its own internal queue of the I/O requests coming down from the buffer pool manager, and it can reorder things based
10:30
on how queries are accessing the data and on what it knows are the important things to consider. The simplest example is sequential I/O versus random I/O:
if I have a bunch of requests for random pages from multiple workers or threads, I can just look at
10:47
all the requests and sort them by page ID or location on disk. That way I can turn all these random I/Os into sequential I/O.
And if a request is part of a query or transaction that's important, I want to prioritize it over random
11:04
background tasks. If I know I'm writing out log data, it's really important to write that first and make sure things are durable before I write out other random dirty pages.
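Sketching that scheduling decision with made-up field names: sort the pending queue so log writes go first, and everything else goes in disk order.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct IoRequest {
  uint64_t disk_offset;  // page ID or byte offset on the device
  bool is_log_write;     // durability-critical, so it jumps the queue
};

// Order one batch of pending requests: log writes first, then everything
// else by position on disk, so random requests become a sequential pass.
void ScheduleBatch(std::vector<IoRequest> &queue) {
  std::sort(queue.begin(), queue.end(),
            [](const IoRequest &a, const IoRequest &b) {
              if (a.is_log_write != b.is_log_write) return a.is_log_write;
              return a.disk_offset < b.disk_offset;
            });
}
```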
There's a whole bunch of things we can consider that the OS is simply not going to know about, because it doesn't know what's inside our application or
11:20
system. I keep saying application because, from the OS's perspective, the database system is just another application.
I don't mean your Node.js application over there; I mean the system we're actually building.
From the OS's point of view it's just another program, and the OS doesn't know
11:37
any of this, and it's just going to make things harder for us. So we want to avoid it as much as possible.
Another thing we've got to be mindful of is the OS page cache. When I call fread or read using a libc POSIX call, that's going
11:53
to go into the kernel, into the file system running there, and it's going to consult the OS's internal page cache. If the data is not in the page cache, it goes down to disk, copies it into the page cache, and then copies it again into a
12:08
user-space buffer that goes up to our database system above.
But if I'm maintaining my own buffer pool manager, that's another copy of the pages that are out on disk. So there are actually two copies in memory:
12:24
one copy of a page in the OS page cache, and another copy up in user space in our database system. That's super wasteful.
We don't want that. Instead, you can use what's called direct I/O, where you basically tell the file system: don't put
12:39
it in your page cache. Bypass all that.
Go directly to the hardware, get what you want, and hand me back the memory buffer for it. That avoids the additional copy.
Things are much faster. There are other optimizations you can do using DPDK, the Data
12:57
Plane Development Kit; we'll cover that in the advanced class.
You can have the database system talk directly to the hardware, not go through the file system at all, and get the data out. So most database systems went with direct I/O.
There's one system that's widely used
13:13
that doesn't use direct I/O. Anyone know which one it is?
>> SQLite. Hmm, I think you can turn it on there now.
It might even be on by default.
13:28
It's Postgres, right? Postgres famously relies on the OS page cache, as some relic of the 1980s.
Every other database system out there tells you, as I said before: if you have 100 gigs, use 80% of the memory for the buffer pool in your database system. Postgres says use 25%,
because they
13:45
want the remaining memory to be used by the OS page cache. But this has all sorts of performance problems, because again, you're maintaining multiple copies of data, and it's wasteful. So Postgres has been trying to get off the OS page cache and switch to direct I/O for
14:01
many, many years. This is a LinkedIn post from last year talking about how they have patches for this, and coming out in two weeks is Postgres 18.
I don't think it has direct I/O on by default, but it is going to have asynchronous I/O. So now you can have
14:17
separate background threads or processes in Postgres go retrieve data from disk.
It's still going to use the OS page cache, but you don't have to block the rest of the system while doing it. You need asynchronous I/O before you can do direct I/O, because direct I/O will otherwise be slower, since
14:33
you're never going to hit the OS page cache. Yes?
>> The question is: do you have to make changes in the kernel to do this?
Direct I/O? No. Direct I/O is just a POSIX
14:48
command. When you call open, you just pass in O_DIRECT, right?
By default, it's not set.
Yeah, you don't make any changes to the kernel.
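For instance, a minimal direct-I/O read looks roughly like this; the filename and the 4 KB size are placeholders, and the alignment requirement is the part people usually trip over.

```cpp
// Direct I/O from user space; Linux-specific (O_DIRECT), no kernel changes.
// Buffer, file offset, and length must all be aligned, typically to 512
// bytes or the filesystem block size.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // exposes O_DIRECT on Linux
#endif
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int main() {
  int fd = open("test.db", O_RDONLY | O_DIRECT);  // bypass the OS page cache
  if (fd < 0) return 1;
  void *buf = nullptr;
  if (posix_memalign(&buf, 4096, 4096) != 0) {  // aligned buffer is required
    close(fd);
    return 1;
  }
  ssize_t n = pread(fd, buf, 4096, 0);  // the read goes straight to the device
  free(buf);
  close(fd);
  return n == 4096 ? 0 : 1;
}
```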
All right. Bottom line: this is coming in Postgres.
They're going to get off the OS
15:04
page cache very soon, because again, the OS is always going to cause problems for us.
All right, I'm going to finish up with this last example. I know I sound crazy up here saying the OS is our enemy.
We don't want to talk to it.
15:20
Let me show why. All right.
Say I have a dirty page in my buffer pool. I've got to write it out to disk.
I call fwrite. What happens?
15:38
>> What's that? Say it again, sorry.
>> It writes to the file.
>> Okay, sorry, I misheard you the first time. All right.
15:55
>> It goes to the file, but where does it sit? Is it on the disk or not? >> It depends on when you flush.
Right, we're not flushing yet. So it sits in a kernel buffer.
So, to his point, if I
16:12
really want to get it to the hardware, I've got to call fsync. That's another POSIX command.
It's a blocking call where you tell the OS: make sure this thing is actually written to the hardware. Now, the hardware can play games and lie to you too,
because the hardware can have its own battery down there: you do an fsync, it writes to battery-backed RAM or
16:30
storage, and it's not really on the actual storage medium yet. In the rare case you lose power, the battery lets it finish the final write.
But when I call fsync, I block until I get a notification saying my flush has succeeded.
16:46
But if that fails, what happens? Well, in the old days, Linux would mark the dirty pages as clean. Say a bunch of pages get modified with fwrite;
17:02
the OS keeps track of them. I call fsync.
It wants to write the dirty pages out, mark them clean, and then notify you that the sync completed. But if fsync fails for whatever reason, because the hardware flipped out or something, it comes back and you get a failure notification.
You get a
17:18
failure error number, but the OS still marks those dirty pages as clean. So the problem was that a bunch of database systems would call fsync in a while loop:
while fsync fails, try again. So the
17:34
first time you call fsync, it fails, and you loop back around and try again.
But since Linux marked the pages clean after the failed fsync, the second time you call fsync it says, "Oh yeah, you're good. Done." It looks like it succeeded, even though the data was never written.
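In code, the anti-pattern looked roughly like this:

```cpp
#include <cstdio>
#include <unistd.h>

// The fsyncgate anti-pattern, sketched. On pre-fix Linux kernels, a failed
// fsync() marked the dirty pages clean anyway, so the retry had nothing left
// to write and reported success while the data was silently lost.
void FlushOrRetryForever(int fd) {
  while (fsync(fd) != 0) {    // WRONG: the failed writes may already be gone
    perror("fsync");          // the second call "succeeds" without writing
  }
}
// The safer response to a failed fsync is to treat the writes as lost:
// crash, restart, and recover from the write-ahead log instead of retrying.
```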
17:51
So why would Linux do this?
They've since fixed it, but why did they used to do this? Because Linux wasn't worried about the database system, even though it's the most important thing in the world.
They were worried about someone pulling out a thumb drive and
18:06
leaving a bunch of pages in the OS page cache from a device that's never coming back. So they would mark them clean so they could be reclaimed later on.
Right? But it was lying to us in the database system, saying our page was actually written when it wasn't.
So some guy on the Postgres
18:23
mailing list posted about this around 2017 or 2018: hey, I think Postgres lost some data for me, and I didn't have a kernel panic or a disk crash.
It was a silent error. Someone started rooting through the code in both Postgres and Linux,
and it turns out Linux had been lying to us for
18:40
20 years. And it wasn't just Postgres;
MySQL had this problem, and I think MongoDB or WiredTiger had the same problem.
So I'm going to put this up multiple times this semester: don't do this, right?
Don't call fsync again and assume everything's going to be okay, because the OS is going to lie to you.
This was a big scandal, called fsyncgate. It
18:56
has since been fixed in Linux, and also in Postgres, MySQL, and MongoDB. But it's a good example: you read the documentation, it says one thing, and the OS actually does something different. So we have to be super protective, assume anything that can fail will fail, and
19:12
do extra work in our database system to make sure we don't lose data. Okay, we're over time, apologies. We'll pick this up next class. We'll post everything on Piazza,
19:28
and we'll post project one there today.