#03 - Database Storage: Files, Pages, Tuples (CMU Intro to Database Systems)

Category: Database Systems

Tags: Architecture, Databases, Pages, SQL, Storage

Entities: Fatface Rick, IBM DB2, MySQL, Oracle, PostgreSQL, SQL Server, SQLite

Summary

    Class Introduction
    • Instructor explains absence due to being in London and uses the time to teach about databases.
    • Mentions the importance of understanding database fundamentals and project deadlines.
    Database System Architecture
    • Discusses the layered architecture of database systems, emphasizing the roles of disk manager, buffer manager, query planner, and optimizer.
    • Explains the concept of single-node systems and scaling to multiple nodes.
    • Describes the interaction between non-volatile and volatile storage in database systems.
    Storage Management
    • Explains the role of the storage manager in handling data read/write operations.
    • Describes the organization of databases as files on disk, using pages and page directories.
    • Discusses the importance of managing pages and the concept of slotted pages for efficient data storage.
    Data Representation
    • Explains how tuples are stored as byte arrays with headers containing metadata.
    • Discusses the alignment and padding of data to ensure efficient storage and retrieval.
    • Highlights the handling of null values and overflow pages for large attributes.
    Takeaways
    • Understand the layered architecture of database systems and the role of each layer.
    • Recognize the importance of efficient data storage and retrieval strategies.
    • Learn about the management of non-volatile and volatile storage in databases.
    • Understand the concept of slotted pages and their use in organizing data.
    • Recognize the significance of handling null values and large attributes in databases.

    Transcript

    00:00

    [Music] I'm still associate. [Music]

    00:25

    as I said in class last week, um, I can't be on campus this week.

    I'm in London. Um, and actually right now we are in the emergency room here in London.

    Um,

    00:41

    so Fatface Rick uh did a microdose of LSD last night, went walking around, and then he got beat up by a bunch of like 12-year-olds.

    Uh, he basically got beat up pretty badly. Um, so we had to bring him to the hospital and I'm out here in the waiting room

    00:57

    waiting for him to get cleaned up, stitched up. Um, and then we can go back.

    So I figured while I'm sitting here, we have to teach about databases because, you know, it's super important in life. So let's jump into this.

    So last class we were talking um about SQL, and then today

    01:14

    we're going to start talking about how to actually build the system. So as a reminder, project zero is due this coming Sunday.

    Again, you're required to pass that if you want to get into the class, and homework one is also due this coming Sunday, um, at the same time.

    01:31

    So now that we understand what SQL looks like at the sort of logical level, like how to define tables, how to run queries on them, what the properties of relational algebra are, and how that fits into the relational model.

    01:46

    That's sort of the last time we'll be looking at the application level of a database for most of the semester. Now we're going to really start talking about how to build the software to run the queries that we saw, or store the data that we want to

    02:02

    put in the database. And so that's really where we're going from this point in the semester: actually, how do we build that system? And so the outline of the course roughly looks like this.

    So, we've already sort of covered what the relational database looks like in the first week, and we're now going to go sort of

    02:18

    level by level and start looking at, okay, here's how you build the different layers that you need to put together and have them interact to actually build a full-fledged database system. And so, this is sort of the rough outline of where the course is going for this semester.

    Um, you know, a

    02:34

    lot of the things we'll be getting to at the very beginning will be like how to build a single-node system. Then once we understand that, we'll finish up and talk about how you can scale it out and run across multiple nodes.

    And so the way to think about this is that a database system is essentially a bunch of layers, where there's a well-

    02:52

    defined API that one layer exposes to the next layer in order to provide certain guarantees or certain operations on data, at whatever level they're trying to interact with. So you can figure at the very bottom you have this disk manager, the thing that actually reads and writes data from disk.

    A database is essentially just files on disk. And then above that you have the buffer manager, which is going to be responsible for bringing pages in from disk, bringing

    03:22

    them into memory and then handing them out correctly to the other components, such as access methods, for accessing data on behalf of operators that are executing queries. And above that we have a query planner and optimizer to take SQL queries and convert them into physical plans that we can

    03:39

    then execute in our system. So again, you can just think of the database system as a bunch of layers.

    And what we're going to be going through this semester is we're going to start at the bottom and work our way up the stack. So the application is up here.

    This is what we talked about before, like the relational database and SQL, right? And it

    03:56

    interacts with the front end that's going to do query planning. And then we're going to talk about how to go up the stack and be able to execute those queries.

    So in today's class we're actually starting at the very bottom. We're going to talk about storage.

    Next class, next week, uh, when I come back in town, we'll actually go back

    04:11

    up to the buffer manager and talk about how to bring pages into memory, but then we actually need to go back down and talk about different alternatives for storing things on disk. So that'll be where we're at for the next two weeks.

    So today again we're going to start talking about what the background is of

    04:27

    what the database is going to see and interact with in terms of hardware. Then we'll talk about the lowest level: actually how you store things in files.

    How you then divide those files into pages, and how you then divide those pages into tuples. So today's class is a

    04:44

    row-oriented centric view. Meaning we assume we're going to organize the data as tuples in rows.

    We'll see in two weeks how we actually flip that on its head and store things in an alternative way. So today is going to be the classical database system architecture, how the original

    05:00

    ones were built in the 1970s, and then we'll go from there and talk about the alternative, more modern methods in a few weeks. We want to understand the basics first before we get to the other ones.

    Okay. >> So the other thing about this class too

    05:15

    I guess bring up is that we'll be focusing on what I'll call a disk-based or disk-oriented system architecture. And what I mean by that is that the software, the database system, is going to assume that the primary storage location of a

    05:32

    database will be on non-volatile disk, like an SSD, network storage, it doesn't matter, but it's considered non-volatile. And then it's a classic von Neumann architecture where you can't operate on anything while it

    05:47

    exists on disk. Instead, you have to bring pages or files from disk into memory, into volatile storage like DRAM, and then that's where you read and write the data that you want, and the database system is then responsible for

    06:03

    writing those changes back to the non-volatile disk at some later point. So this is really what we're going to be focusing on, and all of the components and algorithms that we're going to talk about are really predicated on this architecture

    06:20

    assumption. So the way to think about how we're going to interact with non-volatile and volatile storage is this classic hierarchy.

    You've probably seen this in classes before, right? This should not be anything new.

    Um, but you sort of think of the bottom

    06:36

    layer as having the very large but slower storage, and at the very top you have the very fast but more expensive storage. Right?

    So the top would be things like CPU registers, because that's the smallest amount of data you can store or access on a CPU. Um, and obviously it's going to

    06:51

    be really, really fast, but it's also really expensive. It's going to be very limited.

    Then as you go down the stack, things get bigger. Um, the capacity is larger.

    Uh, they're going to get slower, and they're actually going to be much cheaper to store per, you know, per byte or per gigabyte. So, this is the sort of

    07:07

    way we want to think about how these different storage layers are going to work with each other. But there's also this division line that I'll draw here between DRAM and SSDs.

    Basically, at the top you have again what I call volatile storage, um, where if you lose power on

    07:23

    whatever computing device you're using to store this, everything is blown away. Um, but it's going to support what I'll call random access or byte-addressable access, meaning we can jump to individual byte offsets without worrying

    07:38

    about bringing in a bunch of other data that maybe we don't actually need. Now, with cache lines on the CPU, that's not entirely true, but for our purposes in this semester, uh, we can ignore that.

    And then below this demarcation line, we have non-volatile storage. So this

    07:54

    will be things like spinning hard drives or SSDs or network storage like S3 on Amazon or Google Cloud Storage and so forth. So these devices will be non-volatile, meaning if we write to them, we can send a command to flush the data to the storage

    08:11

    device and then be guaranteed that if power is lost, when we come back the data will still be there. Obviously, if the machine catches fire and the disks melt, we can't do that.

    But there are ways to get around that we'll cover later in the semester. These devices are also going to be what are called block-addressable.

    And that

    08:27

    means that the smallest granularity at which we can access data is going to be a block. And typically this can be 4 kilobytes, sometimes larger, sometimes smaller.

    We can't do the byte-addressable access that we can do in DRAM or CPU caches. Meaning if we want

    08:44

    to access a single byte, we have to bring in the whole page or block where that byte resides, even if we don't need that other data, right?

    That's sort of the trade-off we have to make in our system design when we actually want to start storing things on non-volatile storage.

    09:02

    So for the purposes of this semester, what we're going to say is that anything below this demarcation line between non-volatile and volatile storage, we're just going to call disk, right? It doesn't matter whether it's an SSD or a spinning hard drive or network storage.

    Even below

    09:18

    network storage, you can have things like tape drives, but people really only use that for archival stuff, like Amazon Glacier. So everything down below is block-addressable and non-volatile.

    We'll just call that disk. And then above that line, we're going to say that

    09:34

    whatever is in DRAM, we're just going to call that memory.

    Um, and then above that will be anything that's directly on the CPU, like on the actual die of the socket itself, like CPU caches and CPU registers, right? We're just going to call that CPU storage.

    And so for

    09:50

    the purposes of this semester, uh, we're actually going to ignore CPU-related stuff. We could talk a little bit about more efficient ways to store things at that level, but for our purposes

    10:06

    here, we can ignore all that, right? We'll cover these things in the advanced class.

    That's where we really care about things like, you know, cache locality. For this semester it's really about: is it in memory or is it on disk, and then how do we move it back and forth.

    >> Now the other thing to be mindful of too

    10:23

    and this is going to be where we deviate from maybe traditional algorithms classes you've taken in computer science before, where we actually need to be aware of how long it's going to take for us to access data at these different levels. And so there's this great table

    10:38

    here, um, that comes from this website you see in the corner. Uh, Jeff Dean has a version of this, and it actually predates him, going back to Jim Gray from the early 1990s.

    But basically it's a table of all the time it takes to access different levels of storage, right, from the volatile ones

    10:55

    down to the non-volatile ones. And here I'm showing these numbers measured in nanoseconds.

    So it basically just shows, like, if I'm going to read something from DRAM, it's going to take 100 nanoseconds. Um, but if I read from an SSD now, it's going to take me

    11:10

    16,000, maybe 50,000 nanoseconds, right? So it's like two orders of magnitude slower than reading from DRAM.

    So, obviously, we want to be mindful of when we're going to read things from disk and try to minimize how many times we have to do that. Then, once you get below an

    11:28

    SSD, that's when things just get super, super slow. So I realize that as humans it's hard for us to wrap our heads around times in terms of nanoseconds.

    So there's a really easy trick that Jim Gray came up with, where if you just

    11:43

    replace nanosecond with second, then you really start to see how expensive these things are going to be relative to reading data that's in memory. So if you read something from L1 cache, that would take a second.

    So that's like, if I want to read a book, I just go over to that chair there and

    11:59

    read a page in the book, and it would take me a second, or 4 seconds. But then if I want to read something on a tape drive, um, it's going to take 31 years.

    That's the equivalent of me flying to Pluto and back to read a single page in a book. It'd be ridiculously long.

    Um you want to try to avoid that as much as

    12:16

    possible, right? So again, as I said, most systems are not going to be stored on tape drives.

    So that's not a level of latency you have to worry about, but certainly a lot of systems today use SSDs, spinning hard drives, and network storage like S3, and so

    12:31

    we've got to be mindful of when we're going to read and write data from disk, and try to avoid having to block the system while we're waiting for those things. The other thing we need to be mindful of in our implementation and our algorithms is the difference between

    12:46

    sequential and random access, or random IO. So again, when you're talking about quicksort algorithms in intro classes and so forth, you just assume you can read or write anything in memory, and reading any object or entity takes the same

    13:02

    amount of time. But in reality, on real hardware, the difference in latency between random access and sequential access can be quite significant.

    So random access would be like I'm jumping around to different locations and reading different bits or bytes as I

    13:17

    need them, where sequential access would be reading a stride of data that is contiguous, so I can do one lookup and get multiple bytes or kilobytes that are all close to each other. So just to give you some ballpark numbers

    13:33

    here: random IO on a fast SSD would be maybe 80 to 100 microseconds. So, you know, that's speed on the high end for a consumer-grade drive, whereas with sequential IO you can

    13:50

    read a bunch of data in maybe 10 to 100 microseconds, and again it depends on how much data you're reading and how many threads are reading. This is all a whole bunch more complicated than we're really going to cover in this class.
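    To see the sequential-versus-random gap for yourself, a quick (and admittedly crude) experiment is to read the same pages of a file in order and then in a shuffled order. Note that on a small file the OS page cache and a fast SSD will mute the difference; on a spinning disk with a cold cache it is dramatic:

```python
import os
import random
import tempfile
import time

PAGE = 4096
N_PAGES = 2048  # an 8 MiB throwaway file, small enough to create quickly

# Create a scratch file full of random pages.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(os.urandom(PAGE * N_PAGES))

def timed_reads(offsets):
    """Read one page at each offset and return the elapsed wall-clock time."""
    start = time.perf_counter()
    with open(path, "rb") as f:
        for off in offsets:
            f.seek(off)
            f.read(PAGE)
    return time.perf_counter() - start

sequential = [i * PAGE for i in range(N_PAGES)]
shuffled = sequential[:]
random.shuffle(shuffled)

t_seq = timed_reads(sequential)
t_rand = timed_reads(shuffled)
print(f"sequential: {t_seq:.4f}s  random: {t_rand:.4f}s")
os.remove(path)
```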

    But the main thing to be mindful of is that there is a huge difference between sequential

    14:06

    and random IO, especially on spinning hard drives, and we want to try to maximize the amount of sequential access we can do whenever we have to read or write anything from disk. And that means we'll make certain choices in our algorithms that try to reduce the number of writes to random pages so that

    14:26

    >> we store data in contiguous blocks. And certain systems will do certain things that seem crazy, but once you're aware that sequential access is much faster than random access, they sort of make sense. So MySQL, for example, will,

    14:44

    when they write dirty pages to disk, first write them to this doublewrite buffer, and that's a sequential write where they put all the dirty pages they want to flush out and write them sequentially. Then in the background they'll do the

    14:59

    random writes at a later point, but at that point, you know, you're not blocked waiting for those dirty pages to be written, because you've written them sequentially first. All right. So, the ultimate goal for the database system we're going to try to

    15:15

    build this semester is a system that can manage a database that's larger than the amount of memory available to the system. So, if you only have 8 gigs of RAM, you can handle a database that's 16

    15:31

    or 32 gigabytes or larger. So the challenge, of course, is that to provide this illusion we have to move data back and forth between disk and memory, right?

    As I was saying, sort of the von Neumann architecture, and so we want to do this

    15:48

    in a clever way and take advantage of as many optimizations as we can, where we want to avoid large stalls while we're fetching things from disk, and avoid having the system sort of grind to a halt just because we're waiting on IO. And so there'll be a bunch of things

    16:03

    we'll do as we go along, like allowing multiple queries to run at the same time, um, reusing data that we bring into memory, being clever about what we write out from memory to disk.

    So there's a bunch of things we're going to want to

    16:19

    do, and we'll design certain algorithms in such a way that we can reduce that burden, that cost, of having to read data from disk, and again make it look like we have enough memory to handle everything. And of course, at some point, if the working set size of the data

    16:34

    of all the queries that are running, meaning all the things that need to be in memory, is larger than the memory you have, that's going to be a problem. There's no way to get around that. But otherwise, in the general case, we'll be okay.

    16:52

    >> All right, so everything I've said here is kind of hand-wavy and vague. So let's actually go into more detail and look at a high-level picture of what a disk-oriented database system could look like, and then we'll see what we're trying to design and build as we go along through the rest of

    17:07

    the semester. So as I said before, the database in a disk-oriented system is just files on disk.

    Sometimes it's one file; some systems, like DuckDB and SQLite, pride themselves on storing the entire database as a single file. But most systems are going

    17:24

    to store the database across multiple files on disk. These files aren't special in terms of the operating system.

    Meaning, to the operating system it's just a bunch of files created with, you know, fwrite or fopen. Um, it's really the database system's interpretation of those files where

    17:40

    the sort of important stuff, the magic, happens in the system. So we have, on our non-volatile disk,

    some kind of database file, and we're going to break this up into a bunch of pages, and these will be sort of fixed-length segments or divisions

    17:57

    of the file. And at the header, or at some special location, usually the header, there'll be what's called a page directory, and that's just a kind of internal data structure keeping track of, like, here are all the pages that

    18:12

    I have on disk. And then in memory, for my database system, I have what I call a buffer pool.

    Sometimes it's called the buffer cache or the memory cache or something. Different systems call it different things, but it's basically the same thing.

    And this is going to have a

    18:29

    bunch of frames, we'll call them, where I can store the pages I'm going to bring into memory. Um, so again, the database system's job is really about managing the movement of pages back and forth between disk and memory in order

    18:44

    to service whatever queries I'm executing. >> number 157 please go to position 4 >> It's not us, okay. Um, so I have my execution engine, right? I'm not defining what this is; for now assume there's

    19:00

    something that runs queries, and it's executing something that says "I want to get page number two." So at the very beginning, assume that nothing's in memory.

    The very first thing we need to do is bring the directory into memory, because the directory is going to tell us what pages we have and where to go find them. And

    19:15

    again, for us right now in this example, we assume it's a single file, but it could be spread across multiple files and multiple directories, or even across multiple machines. But at this point it doesn't matter.

    So, we bring the directory in, and

    19:30

    we'll look inside that, and that's going to tell us where to go find page two, right? For simplicity, assume we can just compute the offset in this file and it finds it for us.

    Then we go fetch that page into memory, and we hand back to the execution engine a pointer to that page in memory, and the

    19:48

    buffer pool is going to guarantee that that pointer is not going to get replaced or swapped out with another page until the execution engine says it's done with it. Now, also notice here that we didn't say anything about what's inside these pages.

    It's up to whatever's

    20:04

    up above in the system to interpret what's in these pages and decide what the bytes actually mean. At this lowest level, we don't know and we don't actually care right now.

    So then the execution engine does whatever it does, interprets the page layout, and let's say it's doing an update. So it wants

    20:19

    to update the contents of page number two. Then it writes the change back to the page and can say "I've completed," and at some later point the buffer pool is going to write out that dirty page to disk and update the files, and it does it in a way that's safe, so that if

    we crash and come back, we don't lose any of the changes that the execution engine made to that page. So we're not going to go into the details of all these things in this lecture.

    we crash and come back we don't lose any of those changes that the the execution engine made to that page. So we're not going to go into the details of all these things in this lecture.

    We'll discuss these other lectures, but we'll discuss some of what these pages actually look like at the

    20:50

    files in this lecture. And then we'll come back to this in five and six.

    And then we're going to spend in lecture four, we're going to spend time discussing what this buffer pool manager looks like uh in in memory and how it does some of the writing out to disk. We won't discuss everything how to do this in a transaction safe manner.

    That'll be

    21:07

    later in the semester after the midterm, but we'll have a general idea of how the mechanism is actually going to work. Um, so we'll cover that next week in lecture four, also because that'll be what you'll need for the first project. And then for what the execution engine looks like, we'll cover that in

    21:23

    lectures 13 and 14, in a few weeks. Okay.

    All right. So there are two big questions we have to deal with in our system.

    The first is how we're going to represent the database as files on disk. And then the second question we

    21:38

    have to deal with is how we're going to move those pages back and forth from disk into memory and then hand them off to other parts of the system in order to interpret their bytes. So for today's class we're going to focus on the first one.

    As I said next class we'll then come back to problem number

    21:54

    two and understand what the buffer manager is actually doing for us and how it works. So as I said, the database is just files on disk.

    Um and there's nothing special about them in terms of again

    22:09

    what the operating system sees. It's just as if I, you know, created a bunch of files in any application, and then it's up to the database system to interpret what's in them.

    Now, historically, these file formats have always been proprietary, meaning

    22:24

    they're specific to whatever database system created them. Um, and they're not usually interpretable by other database systems.

    So, like, you can't take a Postgres file, its file format, and then open it up in MySQL or Oracle, right? It's not going to

    22:40

    know how to interpret those bytes, because they don't mean anything to the other system. Now, there's a newer trend of building what are called portable file formats, things like Parquet, where it's a file format with an open spec

    22:56

    >> so that any system can read and write to them. But we're not going to discuss that in this class; we'll cover that when we talk about column stores in a few weeks. The other thing I'll also point out is that the database system is typically going to run on an off-the-shelf file system like ext4

    23:14

    or NTFS, whatever your operating system installs, and that's typically going to be good enough for what we need. Um, there were some systems in the 1980s that would actually build custom file systems on top of raw block storage.

    So

    23:30

    you'd buy, like, a storage device, just raw blocks, and then the database would install its own custom file system inside of that.

    You typically don't see that anymore, certainly not in open-source systems. Um, you only really see this in sort of really high-end enterprise systems like Oracle

    23:48

    and Teradata. Most people aren't going to do this.

    So if you look at Oracle's documentation, they have this thing called Oracle ASM. It basically says: instead of having the operating system manage the file system for you with a logical volume manager or

    24:04

    whatever, the database system is going to do that for you. Um, it's basically injecting a custom file system into the operating system.

    You know, it still exposes the standard POSIX API the operating system expects, but it lays

    24:19

    out data in a way that's very specific to what the database system wants. This has been studied for a while, and you get roughly a 15% improvement in performance from it.

    Um, it's a major engineering effort, and most systems aren't going to do this.

    24:38

    All right. So then the database, at the lowest level, is going to have this thing called the storage manager.

    Um, again, sometimes it's called the disk manager, but it's usually the same thing: it's the component of the system that's responsible for reading and writing data out to disk. Um, so again, you could rely on the OS

    24:56

    to do some things, but you don't want to have it do everything. Um, most of the higher-end systems, the better systems, actually do their own scheduling for reads and writes and figure out where they're actually going to put data on their own.

    Don't let the OS just figure crap out

    25:12

    for you, because it's always going to do a bad job. All right.

    And then the files we're storing, as I said, can be broken up into pages, usually fixed-length within the file. And some systems will allow you to define a different

    25:28

    page size per table, like IBM DB2, you can do this there. But still, within the file itself, the collection of files for a table, they'll all be the same size. >> number 159 please go to position

    25:46

    And then the storage manager is going to be responsible for keeping track of these pages, reading and writing to them. It keeps track of available space, and then when there's a request from up above that says, "hey, I need to store so many bytes," it

    26:03

    can then figure out where the free space is to actually store this. Now, the other thing to be mindful of is that within the files on disk, for a single instance running on a box by itself, we're not going to maintain

    26:19

    multiple copies of a single page on disk at the physical level. At a logical level we may have multiple copies of data within a tuple, because we might store it in heap files, which we'll talk about in a second, and then store it again in the log file. But it's

    26:35

    not the storage manager's responsibility to say, "oh, I've got to write this page out, I'm going to make multiple copies of it."

    That's either going to happen at the level below it, like in the file system, with something like RAID or a storage appliance that can make multiple copies, or have, you know, replicas in S3. Or it's going to

    26:52

    happen above the storage manager, where there'll be something that says, "I have to have logical copies of this data across multiple nodes." All right, at the lowest level, the storage manager is not going to do that.

    It's really about writing data to a single, you know, a single

    27:09

    instance, a single location, for a single page. All right.
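    The "find free space for so many bytes" request mentioned a moment ago is often answered with a free-space map. Here's a minimal first-fit sketch; real systems (Postgres's free space map, for example) track space at a coarser granularity, and the class and method names here are invented for illustration:

```python
class FreeSpaceMap:
    """Track roughly how many free bytes each page has, so an allocation
    request can be pointed at a page with room (first-fit policy)."""

    def __init__(self, page_size: int = 4096):
        self.page_size = page_size
        self.free_bytes = {}  # page_id -> free bytes remaining

    def register_page(self, page_id: int) -> None:
        # A freshly allocated page starts completely empty.
        self.free_bytes[page_id] = self.page_size

    def find_page_with_space(self, nbytes: int):
        # Return the first page that can hold the request, else None
        # (in which case the caller must allocate a fresh page).
        for pid, free in self.free_bytes.items():
            if free >= nbytes:
                return pid
        return None

    def record_insert(self, page_id: int, nbytes: int) -> None:
        # After storing a tuple, shrink the page's remaining free space.
        self.free_bytes[page_id] -= nbytes
```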

    Then, as I said, a page is going to be a fixed-size block of data, and that's going to basically contain everything that's in our database system. Right?

    So it contains the tuples that are in the database. It contains metadata about that data.

    The

    27:25

    catalog of what tables I have, what information I have. The page directory is basically stored as pages as well, along with indexes and records.

    All of this is going to be stored and divided up internally as pages. Now, most systems are not going to mix

    27:42

    the page types. Um, meaning like you're not going to have within a single file pages for table X and table Y or index X and table together.

    It's typically going to be like here's one file has data just for this one one page, right?

    27:58

    Some systems are also going to require the pages to be self-contained, meaning everything you need to know about what's inside the page has to be stored in the page itself.

    So you can imagine, if it's a data page for a table,

    28:14

    you would say, here's all the metadata I need to know, like that this page is for this table, right? You would do that for crash recovery, but you only see that in systems like Oracle that are, you know, paranoid about not losing people's data, and rightfully so.

    The other important thing to know is

    28:29

    also that each page is given a unique identifier called a page ID. A page ID is going to be a unique number that the database system maintains to keep track of how to address and find the particular page I'm looking for.

    And it

    28:46

    could be unique per database instance, per table, or per database, but it's going to be some number that tells us how to address that single page. Now, it could be an offset in the file, but it could also just be a logical location that we would then use the

    29:01

    page directory to resolve to where that thing's actually located. So now, an important thing to understand is that there are actually three notions of pages in a system.

    So at

    29:17

    the lowest level you have what's called a hardware page, right? This is typically 4 kilobytes, and it's the smallest atomic unit of data that the hardware can guarantee it can write out to the storage

    29:34

    device, the actual physical medium of the storage, atomically. So if I have to write out, say, two 4 kilobyte pages, or 8 kilobytes, the hardware can only guarantee that it can write the 4 kilobyte

    29:50

    one atomically. So it can't guarantee that it will write both pages or neither of them.

    It can only guarantee that it can write one of them. All right.

    So then above that you have the OS page, and by default in

    30:06

    Linux this is 4 kilobytes. In newer versions of Linux you can get what are called huge pages, which can be 2 megabytes to 1 gigabyte. But this is basically the mapping of a hardware page to

    30:22

    a logical page that the OS is going to keep track of in its own internal page directory. Then above that we have the database page, and this is how the database is going to organize its own pages.

    So again, the

    30:40

    hardware page is always going to be 4 kilobytes, and that's the largest block guaranteed to be written atomically. But the OS can have its own page size, and the database system can have its own page size as well, and we just have to be mindful of how we're going to do this mapping from a database page

    30:56

    to a hardware page, and be aware that if we have to write more data than we can write atomically in a page, we have to do a bunch of extra steps to make sure that things are written safely before we tell the outside world that everything has been

    31:13

    flushed. So the database page size is going to vary per system. On the small end, with something like SQLite, I know for some embedded devices it can go down to 512 bytes; by default it's 4 kilobytes. Oracle by default is

    31:31

    4 kilobytes, DB2 is 4 kilobytes, RocksDB, which is an embedded storage engine used in a lot of systems, is 4 kilobytes, and WiredTiger, the storage engine for MongoDB, is 4 kilobytes. Then you have things like SQL Server and

    31:47

    Postgres, right? They're 8 kilobytes.

    Then MySQL famously goes up to 16 kilobytes. So the size of a database page, what you actually want to use in your system, depends on many different factors, like the hardware

    32:04

    environment the system is running in, and what data you're actually storing. Are they really wide tables with a lot of columns or a small number of columns? Are they big blobs of data or just small integers?

    Right? And it also depends on the workload you expect to run in your database system.

    So in

    32:22

    general, for database systems that are doing read-heavy workloads, and we'll see more about this in a week or two,

    you actually want larger page sizes, because you know you're going to do a lot of sequential scans where you're reading a lot of data that's all contiguous. So for every page I have

    32:39

    to read, I'm going to bring in tuples that I know I need for my query. So every fetch of a database page brings in useful data for me.

    We'll see how we handle that when we switch from a row store to a column store. That's

    32:54

    going to be one of the big things they do. And then for systems that are more write-heavy, meaning I'm doing a lot of updates, deletes, and inserts on tuples, you typically want a database page in the 4 to 16

    33:11

    kilobyte range, because again the non-volatile storage is block-addressable. That means if I'm only updating a single byte within a page, I still have to write that whole page out.

    So if it's a 4 kilobyte page and I update a single byte, I still have to write 4 kilobytes out to disk.

    33:28

    So I don't want massively large database pages, because I would have to write all that out; I can't do fine-grained updates on things.

    So for this class we'll start off with 4 kilobyte pages.

    33:44

    In lecture six, we'll see how larger pages can make a difference when we do read-heavy workloads.

    All right. So now there are a bunch of different ways we can manage these pages that are on disk.

    34:01

    And again, they're all going to have different trade-offs in how certain things are done. The heap file will be the most common one, the basic one, and we'll assume we're using that for this lecture and the next lecture.

    We just assume a bunch of pages, and we address them

    34:17

    and bring them in, and they're not ordered in any way. Another approach would be a tree file organization, where I'm actually storing the pages in a tree data structure, and as I traverse the tree I'm bringing in pages as I need them. We'll talk a little bit

    34:33

    about that later on. Again, for this class we don't need the sequential or sorted file organization, sometimes called ISAM.

    This is an older architecture from the 1970s that people really don't use anymore, but it's sort of similar to the tree

    34:49

    one. We don't have to worry about this.

    And then hashing is another variation, similar to a heap file, just that things aren't ordered.

    You can jump around into them. Right?

    So as I said, for this lecture we're going to assume we're doing heap files. And for the most part, for the next few

    35:04

    slides, we don't have to care about what's inside the pages; we'll go into much more detail about what they look like later. For now we're just understanding what the page structure looks like; it doesn't matter what's inside them.

    So a heap file is going to be an unordered collection of pages, where,

    35:21

    again, because SQL models data as unordered, we can store tuples anywhere we want, in random order, just wherever we have free space. It then exposes a basic API to create pages, because those are the basic operations we need at this lowest

    35:37

    level in the storage manager. We also need to be able to iterate over all the pages, meaning give me the page IDs for all the pages of a given table or a given index, because we'll need that to do a sequential scan. When we look up things in indexes, we'll

    35:55

    see how we get the page ID from an index lookup and then need to be able to get that single page. We also need to be able to scan across all the pages. So this means that, again, we need additional metadata, essentially what we had in the page directory, to keep track of the location

    36:10

    of these files, and also keep track of where we have free space, because anytime we need to store something, we've got to say where we can put it. And since we're going to assume that we're storing things in, say, 8 kilobyte pages like Postgres does, and most tuples aren't 8 kilobytes, we

    36:26

    need to make sure that we can fill in the free space we have, so our data doesn't balloon to be too big. So here's a basic configuration where we assume the database is a single file.

    36:43

    Assume something says, "Hey, I want to go get page number two." I'm not defining how I'm doing that just yet; some part of the system says, "Go give me page two." I can just do simple arithmetic: I know the starting location of my file,

    I know my page number, and I know my page size, because again I'm assuming all the page

    36:58

    sizes are the same on disk. And I just do simple math to jump to the offset where the thing I'm looking for lives.
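The simple math just described can be sketched like this (a hypothetical helper, not code from any real system; the 4 kilobyte page size is an assumption):

```python
# Hypothetical sketch of locating a page by offset arithmetic.
# Assumes fixed-size pages stored back-to-back in a single database file.
PAGE_SIZE = 4096  # assumed 4 KB pages

def page_offset(page_id: int) -> int:
    """Byte offset where a page starts inside the file."""
    return page_id * PAGE_SIZE

# "Give me page two" -> seek to this byte and read PAGE_SIZE bytes.
print(page_offset(2))  # 8192
```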

    In cases where the data is broken up across multiple files on disk, say we want to get something

    37:15

    like page number 23. Again, there's this page directory thing, which we haven't really defined yet, that's going to have the metadata we need to find the file that contains the page. And once we get in there, we can then do simple math to say, for a given page number and a given

    37:32

    page size, what's the offset I should jump to to find the bytes that I need. It's a pretty basic architecture, and again, different database systems implement the page directory differently, but at a high level all these systems basically work this way.

    37:48

    So a common architecture is to use what's called a page directory, as I mentioned before, and this is just like a database within the database that keeps track of all the pages I have for logical database objects. You could break it up

    38:03

    in different ways: if your database system supports multiple databases, you could have a directory per database, or a directory per table or per index, or whatever. Different systems will do different things, but the basic idea is that I have a way to say, I want to get data

    38:19

    from table X or table Y; here's the file, or here's the directory, or here's the location on the network that has this data. And then within that I can jump to the offset I need, based on the page directory's information about

    where to find the pages that I'm looking for.

    the where to find the pages that I'm looking for. Now the tricky thing is going to be is making sure that this page directory is synchronized um on disk with the data pages um because you know if we crash and lose this thing

    38:52

    we're kind of screwed because now we don't know where to find anything else um and then if we start allocating new new pages for our files we want our page directory sort of reflect that. Now, the truth is it doesn't actually have to be 100% always synchronized because if you crash and come back and this thing gets

    39:07

    trashed, if you have enough metadata in your files, you could actually recreate this. It just makes recovery a lot slower.

    And a bunch of systems are going to have different ways to do this more efficiently. But in general, this is something you want

    39:24

    to keep synchronized if you can, at least in memory. Certainly in memory.

    But on disk we can be a little bit looser than we would be with other data. All right.

    Again, it's a trade-off between how fast you want to run at runtime versus how long it takes to recover if there's a crash.

    39:40

    So we can keep track of additional metadata: for every file, how much free space we have in a given page or a given file, which pages are completely empty so we can jump to them really quickly, and whether these pages hold metadata about a

    39:56

    table or an index or whatever, or actual user data from tuples. The page directory can maintain all this information for us. All right, so that's what things look like on disk.

    40:12

    Assume that the database is a bunch of files broken into pages. So now let's talk about what the pages actually look like on the inside. In general, at a high level, every page is going to have what we'll call the

    40:28

    header, where we keep track of metadata about what's inside the page. Obviously you keep things like the page size, a checksum of the contents of the page, and the version of the software that wrote that information, so that if you upgrade the version and change the

    40:44

    layout of pages, you know whether this page reflects that new layout or not. There's also information we can maintain about transaction visibility, like what data should be visible to what transactions; we don't have to worry about that in this class, we'll cover it

    40:59

    again after the midterm. It could have additional information about how the data is being compressed, what encoding is used to store things. In some cases, like in Oracle, where we said pages have to be self-contained, they actually keep the table schema in the header of

    41:15

    the page as well, so if there's a crash, you can come back, just look at the page, and say, I know what table this belongs to, and here's the data that's inside of it. Sometimes you'll also see summary information or sketches that keep track of the values that I have in my page, so that I don't

    41:32

    have to scan the entire page looking for certain pieces of data. Like, say I'm looking for all records, or all tuples, that have the name Andy in them.

    If I have a summary in my page that keeps track of the number of Andy tuples inside of it, I can just read that, and if it's zero,

    41:48

    then I know I can skip the rest of the tuples. That's the high-level idea of how the summary stuff works, and you mostly see it in the read-heavy systems we'll cover later.

    The checksum is important because if you crash, come back, and read the page back in, or actually, not even after a crash, just any time you go read a page in, you want to know whether

    42:04

    it got corrupted on disk and whether you can trust the bytes you're reading; the checksum will tell you that. Again, different systems will do different things, and it's the enterprise ones that'll have more safety controls in place to make sure that you're not losing or corrupting data. Okay.
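To make the checksum idea concrete, here's a minimal sketch using CRC-32 (the actual algorithm and header layout vary per system; this layout is made up for illustration):

```python
import struct
import zlib

def write_page(body: bytes) -> bytes:
    """Prefix the page body with a 4-byte CRC-32 checksum as a tiny 'header'."""
    return struct.pack("<I", zlib.crc32(body)) + body

def read_page(page: bytes) -> bytes:
    """Recompute the checksum on read to detect on-disk corruption."""
    (stored,) = struct.unpack_from("<I", page, 0)
    body = page[4:]
    if zlib.crc32(body) != stored:
        raise IOError("page corrupted on disk")
    return body

page = write_page(b"some tuple bytes")
assert read_page(page) == b"some tuple bytes"  # a clean read passes the check
# Flipping any byte in the body would make read_page raise IOError.
```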

    So then now

    42:20

    within the page we want to figure out how we're actually going to store tuples in it. Right.

    As I said, at the high level there's always going to be a header. But we haven't yet talked about how we actually put tuples inside the page.

    We'll put data inside the pages, right? Assume we're doing tuples.

    We won't worry about indexes for now.

    42:37

    That'll come later. And then we're also going to assume that we're doing what's called a row-oriented storage model.

    Meaning we're going to organize each record, or each tuple, as a row, where all its data is stored contiguously on the page. And that means we also assume that each

    42:54

    tuple has to fit inside a single page. We'll see how we handle larger values in a second.

    But for our purposes in this discussion right now, every tuple is stored contiguously within a single page. So there are actually three ways to do

    43:11

    this, to organize data into pages. So for this class here today, we're going to assume we're doing tuple-oriented storage,

    meaning the entire tuple is stored in the page in its entirety. But there are actually two other

    43:27

    methods we'll cover next week, next Wednesday. One is called log-structured storage, where we're actually going to store not the tuple but

    43:43

    basically the delta of a tuple when things get changed, and then we end up having multiple copies of a tuple if it gets updated multiple times; that'll make more sense when we talk about it later on. Index-organized storage is basically storing the data as a tree structure, as I said, but we'll

    43:59

    come to that later. For our purposes here, assume that there's only one copy of a single tuple, and it's going to be stored in the page in its entirety, contiguously. Okay, so if we wanted to start doing this, how should we keep track of the tuples in a page? Well, we could just

    44:17

    do something really simple, where you say, all right, I'm assuming that the tuples are all the same length, and I'm just going to append each tuple at the first free location in my page. In my header I just keep track of the number of tuples that I have. So I can do simple math to say at

    44:33

    what offset in the page I jump to to find the data that I want. So every time I insert a new tuple, it gets appended one after another in the page like this.

    So now if I delete a tuple, well, that becomes more tricky, right? If I delete tuple two, now I have the number of

    44:49

    tuples equal to two. So I can't just jump to the end anymore, because I have to make sure I don't overwrite tuple 3.

    So maybe I can just do a quick scan, try to find the first free slot, and put my tuple there. But then this starts to get kind of

    45:04

    expensive if I have to scan every single time. And furthermore, I can't do any compaction now, because if I move tuple 3 back to where tuple 2's slot or location was in the page, then I'd have to update a

    45:21

    bunch of other metadata or index data structures, because now the physical address of tuple 3 would have changed in my page. But also, this doesn't work if I have variable-length data, because now I may have a bunch of free

    45:36

    space that I can't actually use to store a new tuple, or I have to start moving things around and again pay that penalty of updating indexes or other parts of the database system. So to overcome this problem, we're going to use a technique called slotted pages.

    45:52

    And what I'm describing here is basically, at a high level, how any row-oriented or tuple-oriented storage system is going to do this. The exact details of what metadata is stored, or whether the slot array is at the beginning or the end, vary per system.

    But at high

    46:07

    level, this is basically how they work. So what's going to happen is that at the beginning of the page we always have a header, because it tells us, again, the metadata of what's inside the page, like a checksum and so forth.

    But then I'm going to have this slot array, and it's going to be a fixed-

    46:25

    length array that keeps track of an offset within the page for the tuple at a given slot. And the bottom of the page is where I'm going to actually store all my data.

    So this is going to be any fixed-length or variable-

    46:40

    length data from my tuples; they're going to be at the bottom. And so the slot array is basically giving us pointers, or locations, or offsets within the page.

    Like, if they're 4 kilobyte pages, then the offsets aren't

    46:57

    going to be that big. And it's going to tell me, for a given tuple identified by a slot, here's the offset within the page to go find it.

    Right? So as I add more tuples, the slot array is going to grow one way and the tuple data is going to grow the other

    47:12

    way. At some point they meet in the middle, where there's no more free space, and then the page is considered full.

    Now, is it guaranteed to be at 100% occupancy with data? No.

    Right? Because the variable-length data down below might not fit or align exactly within the page

    47:29

    and use all the space. But in general this works well enough.

    And the advantage of this approach is that now if I delete a tuple, say tuple three here, I can either leave tuple 4

    47:45

    where it is by itself, and hopefully another tuple that comes along and gets inserted can fill in that space; or I can just move tuple 4 up to where tuple 2 ends, and then all I need to do is update the slot array to say the

    48:02

    location of tuple 4 is now at this offset. And what I get is indirection that allows me to change the location of a tuple within the page without having to go update other parts of the system. So

    48:17

    I don't have to go update indexes; I don't have to go tell them that the location of tuple 4 has changed.

    So the blast radius of the modification I do from this compaction is limited to just the single page.
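Here's a minimal slotted-page sketch in code (the layout is hypothetical and much simplified; in a real system the header and slot array live inside the page's byte array itself):

```python
PAGE_SIZE = 4096  # assumed page size

class SlottedPage:
    """Toy slotted page: slot array grows from the front, tuple data from the back."""

    def __init__(self) -> None:
        self.data = bytearray(PAGE_SIZE)
        self.slots = []            # slot number -> (offset, length), or None if deleted
        self.free_end = PAGE_SIZE  # tuple data grows downward from the end

    def insert(self, tup: bytes) -> int:
        self.free_end -= len(tup)
        self.data[self.free_end:self.free_end + len(tup)] = tup
        self.slots.append((self.free_end, len(tup)))
        return len(self.slots) - 1  # the slot number identifies the tuple

    def get(self, slot: int) -> bytes:
        offset, length = self.slots[slot]
        return bytes(self.data[offset:offset + length])

    def delete(self, slot: int) -> None:
        # Only the slot entry changes; the data below can be compacted later
        # without anything outside this page noticing.
        self.slots[slot] = None

page = SlottedPage()
s1 = page.insert(b"tuple-1")
s2 = page.insert(b"tuple-2")
page.delete(s1)
print(page.get(s2))  # b'tuple-2': slot s2 still resolves after the delete
```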

    48:36

    And as I said before, the cost of bringing things in from disk is very expensive versus the cost of manipulating things in memory. So once I bring the page into memory, I can change

    48:52

    around the contents of this page, doing compaction or not. And that's cheap relative to having to read my data from disk all over again.

    So once it's in memory, I'm pretty much free to do whatever I want to reorganize this page. Some systems

    49:08

    like Postgres won't do it. Some systems like SQL Server will do it every time you update a tuple.

    But having this indirection layer makes a big difference for us, because we can change locations in the page and not worry about updating

    49:25

    indexes and other things. All right.

    So the other cool thing, well, it's not that one thing is cool, it's all cool, it's databases. Another thing we want to talk about is how we're going to identify each logical tuple with a

    49:43

    physical location. For this we're going to use what we call record IDs.

    Some systems will call them tuple IDs, record IDs, or row IDs. Essentially, the idea is all the same: it's a unique

    49:59

    identifier for a logical tuple that represents its physical location within the database. So an example would be a record ID that says, here's the file ID, and

    50:14

    the page ID, and then the slot number within that page. And then in other parts of the system, if I want to be able to say what an index is pointing to, what record or address I'm referencing, you would represent that through this record ID.
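As an illustration, a record ID like that can be packed into a single integer (this exact layout is made up; Postgres's ctid, for example, uses its own six-byte encoding):

```python
SLOT_BITS = 16  # assumption: low 16 bits hold the slot number

def make_rid(page_id: int, slot: int) -> int:
    """Pack (page_id, slot) into one integer record ID."""
    return (page_id << SLOT_BITS) | slot

def rid_page(rid: int) -> int:
    return rid >> SLOT_BITS

def rid_slot(rid: int) -> int:
    return rid & ((1 << SLOT_BITS) - 1)

rid = make_rid(page_id=23, slot=4)
print(rid_page(rid), rid_slot(rid))  # 23 4
```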

    Now most

    50:30

    systems don't store this; instead it's derived at runtime as you read the data. So, looking at a bunch of different implementations: Ingres has

    50:45

    a tuple ID that's four bytes; Postgres has the ctid, which is six bytes, made up of the page ID and slot number. These systems don't actually store this, but you can uniquely address tuples with it. SQLite is an interesting one

    51:03

    for its rowid, because it does actually store it for you, and no matter what you declare as the primary key when you call CREATE TABLE, it's going to use the rowid as the primary key. Think of it like a

    51:19

    monotonically increasing value, and it uses that because it guarantees it can uniquely identify any given record even if we don't provide a primary key when we create the table.

    51:34

    Now, it's important to understand that this is something that can be exposed to you through the SQL interface. Hopefully I can do a demo when I come back to this and cut it into the video here, but

    51:50

    you'll be able to see this through SQL, and sometimes you can address the records through it. But it's not something you would want to rely on in the application, because these IDs can change for a logical tuple based on how the database system

    52:05

    starts moving data around. So it could do compaction and change the physical location of data, and it's still correct, because at a logical level, since SQL models data as unordered, I can change the ordering of tuples and

    52:21

    everything's still correct; the record ID may change, but again, I shouldn't have relied on it in my application. All right. So, we know what files look like.

    We know what pages look like. Let's go deeper inside the pages

    52:37

    and talk about what the tuples actually look like. So, at its core, a tuple is just a bunch of bytes.

    It's a byte array that has a header inside of it. Just like a page has a header, tuples are going to have a header, too, that contains metadata about what's in the tuple

    52:54

    itself. Now, it typically doesn't contain metadata like what the schema is, because that's done at the page level, and this is why you don't mix page types or mix data within a page: you don't want to redundantly store the same schema information over and over again for every single tuple in the page.

    The page

    53:09

    header just has that for you, right? But the tuple header will have information like a bitmap of which columns are null,

    and information about what transactions can see the data, or whether they can see that tuple

    53:25

    or not; for our purposes we don't need to worry about that. The database system is going to maintain a catalog that keeps track of the schema for every table, so that when you go read a page

    53:42

    for a given table, you know how to interpret the bytes within these byte arrays for a tuple, to say what the data actually contains. [Music] So as I said, the header is going to contain basic information like visibility, like what transactions

    53:57

    created this data and at what time, and so forth, plus a bitmap for null values. Then everything else after that is just the attribute data.
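The null bitmap part of the header can be sketched like this (a hypothetical layout: one bit per attribute, set when that attribute is NULL):

```python
def build_null_bitmap(values) -> int:
    """One bit per attribute; bit i is set when attribute i is NULL."""
    bitmap = 0
    for i, value in enumerate(values):
        if value is None:
            bitmap |= 1 << i
    return bitmap

def is_null(bitmap: int, attr_index: int) -> bool:
    return bool(bitmap & (1 << attr_index))

bitmap = build_null_bitmap([1001, None, "andy", None])  # attributes 1 and 3 are NULL
print(is_null(bitmap, 1), is_null(bitmap, 2))  # True False
```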

    So typically the attributes for a tuple are going to be stored in the order that you

    54:12

    declare them. We'll see some ways you could reorganize them to ensure things are aligned correctly, but most systems don't do that.

    Instead, they're going to rely on padding to make sure things are correctly aligned across,

    54:29

    typically, 64-bit word boundaries. So, let's say we have a simple example here.

    We have a table foo, and it has five columns: three integers followed by a double and a float. Again, the byte array for a tuple will just be stored in the order defined

    54:46

    by the CREATE TABLE statement. So now the challenge is going to be when you have things that span these boundaries.

    And you can be clever about how you're going to handle it. So this example here is pretty simple, right?

    I have an ID that's 32 bits. I have a value that's a bigint, and that's 64

    55:02

    bits, right? And assuming the header is 32 bits, this will fit nicely within a 128-bit sequence of bytes.

    So again, if I want to jump to get this attribute, I get

    55:17

    my offset to the tuple by looking up the slot array in the page. And then I just do some more arithmetic to jump over the header and then jump to this location to find the attribute that I want.

    And then I basically do the equivalent of a reinterpret_cast in C++

    55:35

    to convert the address that I'm looking at into a 32-bit integer. Right?
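That offset-plus-reinterpret step can be sketched with Python's struct module standing in for the C++ cast (the 4-byte header and the column layout here are assumptions for illustration):

```python
import struct

# A fake tuple: 4-byte header, then a 4-byte int id, then an 8-byte bigint value.
tuple_bytes = struct.pack("<iiq", 0x1234, 42, 9_999_999_999)

HEADER_SIZE = 4
(id_val,) = struct.unpack_from("<i", tuple_bytes, HEADER_SIZE)      # skip the header
(value,) = struct.unpack_from("<q", tuple_bytes, HEADER_SIZE + 4)   # skip header + id
print(id_val, value)  # 42 9999999999
```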

    This is compiler-directed; there's actually no computation done at runtime.

    It's just ensuring that any code that comes after this

    55:51

    interpretation, this casting here, is going to operate on a 32-bit signed integer. So now the tricky thing is going to be when you have data that doesn't fit nicely into these sort of word

    56:06

    boundaries, right? For simplicity, we'll assume it's 64-bit words.

    So the challenge is that the table might have smaller types or larger types that go across these boundaries, and then we have problems, right? This 32-bit integer here fits nicely in the first half of a 64-bit word, but now I have a

    56:22

    64-bit date or timestamp here, and that's going to span two words. Likewise, I now have a two-byte character, 16 bits,

    and then a zip code like this, right?

    56:38

    So if I don't have things exactly aligned, I'm going to have this problem where, when I try to go access the date here, the timestamp, right, I've got to go fetch two 64-bit words to read that in, and

    56:56

    the compiler might complain or the CPU might freak out depending on what architecture I'm running on; this becomes problematic. So an easy way to get around this is just to pad, to make sure that everything's always correctly aligned.

    It wastes space, but it ensures that I don't have these

    57:11

    problems that I talked about before. So in the first case, the integer for the ID is 32 bits.

    So I just put another 32 bits of zeros after it. The database system knows that anytime it tries to access this data, it only needs the first 32 bits, and it just ignores what comes

    57:29

    after that. But this ensures that now when I do a lookup on the date, I'm jumping to the right location and it fits nicely within my word.
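You can see the same padding behavior with Python's struct module: '=' packs a 4-byte int and an 8-byte value back to back, while '@' uses native C alignment and inserts the pad bytes (the sizes below assume a typical 64-bit platform):

```python
import struct

packed = struct.calcsize("=iq")   # int32 then int64, no padding
aligned = struct.calcsize("@iq")  # native alignment pads the int32 out to 8 bytes
print(packed, aligned)  # 12 16 on common 64-bit platforms
```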

Another approach, you could do this, I've only seen this in the research literature, I don't think any system actually does this, is just to reorder things so that they're nicely

    57:45

    packed in. Right?

So if I just move the columns around, or the attributes around, like this, then I land on the nice alignment that I would want. Now logically it's still defined in the order from when they called CREATE TABLE, but physically the bits are moved around.

    Um and then it

    58:01

just knows how to do that reinterpretation at the right location to get the data that I would need. Like I said, you could do this.

I don't know of any system that actually does it. Um, at least I haven't come across any that do.

    58:16

And then to make sure that the next tuple that comes afterwards is nicely aligned, we just put, you know, padding after the last one.

So again, we still waste some space, which is part of why I said systems don't actually do this. All right.

    So now what are the actual

    58:32

attributes look like? So, for the basic data types like ints, floats, strings, and timestamps,

um, this is roughly how the system actually implements them. You

    58:47

know, there could be some variation based on the hardware you're actually running on, but in general this is what you would do. So for a 32-bit integer or a big 64-bit integer, the same way you would represent things in C++ or C is what you would get in your system, and in general

    59:04

that would be the same across hardware. You do have to worry about endianness, um, but for our purposes we'll assume it's x86 and we're not going to worry about that problem. Like, SQLite, for example, stores

    59:22

everything as strings and then interprets it at runtime. So that way, if you run on a little-endian or big-endian system, it always comes out to be correct.

Um, but most systems do not do that. Um, and they do a little extra work to keep track of, like, what the endianness is when they store things, so that if you

    59:37

come back up on a different, you know, you copy your database to another machine that flips things around, you end up with the data as you'd expect. Um, but for now we ignore that.

So for floats and reals, uh, for

    59:53

floating-point numbers, this will follow what's called the IEEE 754 standard. Again, this is a hardware spec that says how these floating-point numbers should be represented in hardware, and again, most systems follow that. Numerics and decimals are fixed-point decimals; these are going to be database

    00:08

system specific; we'll see what they look like in a second. For text, strings, either varchar, varbinary, text, or blobs, um, they either have things be inline within the tuple itself, so it'll be, like, the length of the

    00:24

    attribute followed by the actual bytes or there'll be a pointer to another location or another page that'll have that data um that you can follow along. We'll see what that looks like in a second.

Um, if the varchars are small enough, you'll store them inline. If they get too big, then you store them in

    00:41

overflow pages. We're not going to talk in this class about collation and sorting, but you do have to be super careful, um, for text strings, about who's actually deciding, you know, the right way to lexically order data.

    Um

    00:59

Different languages have different orderings, uh, and the database has to worry about that. But for our purposes here, at this lowest level, for these bytes, we're not worried about that.

    We'll talk about that later when it comes to sorting. But it is something we need to be mindful of.

Um, and most things, you know, by default some systems will store

    01:14

things as ASCII; you now have to worry about Unicode, or UTF-8 and UTF-16, you may store things as that as well. Like, it's gotten more tricky in recent years, or the last 20 years, um, and you should be mindful that people might be

    01:30

storing things that aren't, you know, English characters. All right, and then timestamps, um, and dates. Oftentimes internally you would represent these things as the number of microseconds or milliseconds since the Unix epoch. Um, sometimes some

    01:48

systems have their own internal date format or time zone format, and they keep track of time zones or not. Right?

This varies per system. Um, but the one I want to spend a little time talking about is the difference between, like, floating-point reals and numerics, because this is a good example where the

    02:04

database systems are going to do a bunch of extra stuff to make sure your data is actually stored correctly, in the way you expect, even though it may not be the fastest way to do this, because the hardware can do things much faster than we can; but we guarantee that, you know, we don't have any loss of data or loss of precision in what we're storing.

    02:22

So, um, for variable-precision numbers, all right, these would be what we call the floating-point numbers. So think of these as just using the native data types you would have in C++, like, you call float or you call double, right? And that's, in the code,

    02:38

defined by this IEEE 754 standard, and this is what most architectures are actually going to implement, and they're also going to have specific instructions that can operate on this data really, really fast. Um, but the problem is going to be that,

    02:55

again, they can't guarantee that they can store exact values. So the example I always like to give is something like this, right? You have some simple C code.

Um, and you want to store two floating-point numbers, x and y, as 0.1 and 0.2. And so if I run this code

    03:10

here, you get output that looks like this, right? It looks like what I would expect, right?

If I add x + y, I get 0.3 followed by a bunch of zeros. Same with just storing 0.3, by writing out 0.3 by itself.

    But if I change the

    03:25

precision, basically the number of decimal digits I want after the decimal point, like this, now you see you actually get wildly different numbers, right? And neither one is exactly 0.3, right? And this is because of the way the

    03:40

hardware is going to store these floating-point numbers: it's using variable precision and it can't exactly store certain decimal values like 0.3, right? So even though, you know, with less precision in my output it looks like it's the same,

    03:56

in actuality the actual bits in the hardware are not the same. And so now, if I write this out to disk and I read it back in, based on what calculation I'm trying to do, I may end up with incorrect results or unexpected results. And this is why database systems will

    04:11

maintain, um, what are called fixed-precision numbers, where the database system can basically keep track of where the decimal point is, and the scale, and other metadata about every single number that it's going to store. And then now I would store maybe, like, a

    04:27

string representation of the decimal, and then I use that extra metadata I'm storing to be able to interpret what that byte array actually means, to guarantee that for any calculation I do, I get the

    04:44

    expected precision I would want on results. Right?

This is sometimes called fixed-point decimals. Right?

It's basically the same idea: like, I'm doing decimal points, but I'm managing everything inside the database system myself, because, you know, we can again guarantee that we end up with correct values.

    05:00

So this is super important in things like, certainly, banking, um, computing interest; you don't want rounding errors. Anything with, like, scientific instruments, you don't want rounding errors, especially if you're, like, sending a rocket to the moon or something, right? So that's why we want to be able to do this, and it's not, uh, you

    05:16

know, they don't take it lightly; it's non-trivial to actually implement these things to guarantee correctness under all possible circumstances. So the example I always like to show is actually Postgres's implementation of their fixed-point decimals. They have a data type called numeric. And if you actually go look in the Postgres code, this is essentially what

    05:33

it looks like, right? You have this struct here, and it has a bunch of metadata about what's going to be in a fixed-point decimal.

And then at its core, the decimal, the value

    05:48

itself, is being stored as a byte array. Um, and then all this additional metadata is then being used to interpret that byte array to generate the answer,

uh, you know, to convert it back into its correct form, uh, that guarantees the

    06:06

exact value that you expect. And of course, now this means that when we want to do simple things like add two numerics together, right, it's not just, you know, calling number plus number as an instruction on a register; we have to implement all this stuff ourselves.

    So this is the code from a

    06:22

Postgres from a few years ago, uh, but basically it's the same in the newer versions, to add two numerics. And you can see here there's a bunch of if-then-elses checking to see: is the number positive, is it negative, is it infinity, is it not-a-number, right? And this is all code just to add two numbers

    06:37

together. And certainly this will be way more expensive to compute than just calling the simple addition instruction on two data registers; this is something where you have to run all this code. But this will guarantee that the value you would get is

    06:53

    what you would expect. Right?

All right. So the other thing to be mindful of is how we're going to handle nulls.

So there's basically three approaches, and as I said, the most common one would be: in the header of every

    07:09

tuple, you have a bitmap that keeps track of which values within that tuple are null or not. Um, and typically you just store this bitmap to be the same size as the number of entries that you have, regardless of whether columns could be null or not, right?

    It's not worth the overhead of

    07:26

checking that, um, you know, as you're interpreting. Um, but this is the most common approach people use.

    One that's more rare is to do special values. You basically say within the domain or the range of values you could have for a given data type, you

    07:42

    designate one of them to represent null. Right?

So if you're storing things as 32-bit integers, you know, 32-bit signed integers, you would say that the smallest value you could have, represented by INT32_MIN, you say that value is

    07:59

considered null. And so you prevent anybody from inserting that value.

And of course, this means you have one less possible value you can have in your domain, but now you don't need to store that header per tuple. So this will be more common in column stores, which we'll cover in a few weeks.

    Um and this

    08:16

is something you sometimes see in in-memory databases as well; they're trying to reduce the size of the metadata they're storing per tuple.

>> But, um, for row stores here, the top one is more common. >> The last one is to store a

    08:31

flag per attribute in the tuple to tell you whether this value is null or not. Right?

So think of it like, if I'm storing a 32-bit integer, I'd have to have a one-bit flag in front of that value, for that 32-bit integer, that

    08:47

tells me whether it's null or not. Of course, now as we talked about with alignment, you can't just store things as single bits.

You'd have to store an entire byte. So that means a 32-bit integer has to be stored as, uh, 40 bits, right?

To keep track of whether the value is null or

    09:02

    not. So this is a bad idea.

I'm only bringing this up to say this is a possibility, something you could do, but as we go throughout the entire semester, I'll try to put this little marker here to tell you, like, hey, this exists, but don't do this. So definitely don't do this.

I only know one system that did this,

    09:18

and they got rid of it a few years later. Um, so this is a bad idea.

    If you have a row store, you want the first one. If you have a column store, you want the second one here.

And we actually have our own paper on how to store nulls correctly in column stores. Again, we'll cover that more in a few weeks, but if you take

    09:34

the advanced class, we'll read this paper and go into way more detail about how to do it. All right.

So, earlier I set the assumption that a tuple has to fit into a single page. Um, of course, now if I have to

    09:50

support really large attributes, this is not going to work, right? If the attribute I want to store is larger than my page size, what do I actually have to do?

    So this is where overflow pages come in. The basic idea is that we can have separate pages where if the value of a

    10:07

given tuple is larger than what can fit in the page, then I have a separate page where I put that value. And now within my tuple, I store a pointer, uh, basically a record ID, um, to that other overflow page.

    And then now as I'm

    10:23

scanning the data, I'm going to interpret this data within a tuple, and when I've come across one of these pointers to an overflow page,

I know I need to follow it to get the original data. Right?

So say this contents attribute in this table here is too big. So I have an overflow page where I'm just going to store all this variable-length data.

    And then now within

    10:40

my original tuple, I'll store the size of the data and then the page number and offset, like a record ID, for where to go find this overflow page. So different databases do different things to trigger when you store things in overflow.

    So in

    10:55

Postgres, by default, if anything you try to store is larger than 2 kilobytes, even though they have 8-kilobyte pages, they put it in the overflow page. In MySQL and, uh, SQL Server, and I think Oracle as well, if the thing you're trying to store is larger than the page

    11:13

size, or larger than half the page size, then they put it in the overflow page. There's a bunch of optimizations we can do to mitigate the overhead of trying to fetch these pages.

Um, so we could take these overflow pages and compress them using, like, Snappy or gzip,

    11:29

and we'll cover that in two weeks. Uh, another technique is called German strings.

Um, this obviously comes from the Germans, where in addition to the size and offset that I store in the original tuple, uh, I can actually store a prefix of the value.

    So if

    11:45

I'm just trying to do string matching, where, like, you know, find all records where the contents field starts with the word "andy", I could check the prefix in the original tuple, uh, and see whether I have a match there before even following the, uh, pointer to go fetch the overflow

    12:01

page. It's this idea we said at the beginning, of trying to avoid having to read data from disk that I don't actually need.

If I spend a little storage overhead storing a prefix, I can avoid having to go do that in certain situations,

    12:17

right? And if the data I'm storing in my overflow page is larger than the overflow page, I can just, again, put another record ID at the end of it, or the front of it, to tell you where the subsequent page can be found.

So I can chain these things together to reconstruct large tuples.

    12:32

In some systems, also, you can actually reference data in pages that are not managed by the database system. This is called external value storage.

So this would be like, if I have, you know, a 20-gigabyte movie, I maybe don't want to put that in my database system,

    12:48

but I still want the database to be aware that this data exists. So I could have a record that has essentially a URI or pointer to some external storage that's managed outside the database system, so that if I need to, I

    13:04

can suck it in, read it, and hand it back to queries. But the responsibility of maintaining that data, making sure that it's transactionally safe, and making sure that, you know, any guarantees the system would have for making sure the data is

    13:20

durable or safe, like for the regular data, it's not going to guarantee those things. But it still allows me to stream data out, to fetch it and serve it into tuples as needed.

So, not every system supports this. Oracle calls them BFILEs; Microsoft, FILESTREAMs.

    It's basically a way to like, you know,

    13:36

store large files outside the database system, but then still suck them in if you need to. And there's a paper from a few years ago, actually it's over 20 years now,

um, from Jim Gray and others,

    13:54

where they examined whether you should put things in the database system or not. I think their recommendation was, like, anything larger than, I think it was, like, 128 kilobytes should be external; less than that, internal.

    14:09

I don't know if that's entirely true anymore, but people have thought about this problem of when you should store large things in the database or not. In general, database systems usually run on more expensive hardware.

So you don't want to store, like, massive files there. Um, and something like external value storage is a way to sort of get around that.

    So, it looks like it's in the database even though it's not. Then you

    14:25

    can put it on cheaper storage. All right, so we've covered a lot.

Uh, Fatface Rick is still not out of the hospital yet. So I've got to go figure out where he is.

    Um but we've covered data storage the

    14:40

basics of it. So again, a database is basically a collection of pages that can be stored on disk or some non-volatile storage, and then we have different ways to keep track of where those pages are located, and then we'll have different ways to store those pages on disk, a single file, multiple files, or whatever, and then we

    14:57

have different ways to store the tuples that are inside those pages. Um, and then again, because of this layered architecture— >> Number 169, please go to position two, ticket number 35.

    15:12

>> Because we have this sort of architecture, the other parts of the system don't need to be entirely aware of, like, oh, this is actually how I'm storing things, like in a slotted-page architecture or not. Uh, all that is abstracted away, and then we can still, you know, change things out and

    15:28

modify them and extend them and enhance them as needed, without worrying about breaking the entire system. That's not always the case.

Uh, but ultimately that's the goal. Of course, there's a trade-off of having too much abstraction that makes things inefficient.

But again, it depends on the implementation. We'll see when we want

    15:43

    to relax certain guarantees uh or certain restrictions later on. All right.

So again, today's class was about how to represent data as a database of files on disk. Next class we'll then talk about, when we bring things into memory, how do we keep track of that?

    15:59

How do we then hand off those pages of memory to other parts of the system? Then, how do we make sure that we write them out in a timely manner, in a safe manner, um, so that we don't crash and lose anything?

    Then after that, we'll come back and talk about uh different ways

    16:14

to store things on disk again. But I'm going to cover the buffer manager, the memory stuff, first, because you're going to need that for project one, uh, which goes out next week.

    Okay. That's money over

    16:31

    [Music] the fortune. Get the fortune

    16:47

    maintain flow with the brain. [Music]