🚀 Add to Chrome – It’s Free - YouTube Summarizer
Category: Computer Science
Tags: AlgorithmsB+treeDatabasesDataStructuresIndexing
Entities: B treesB+ treesDatabase systemsMySQLOraclePhil LemanPostgreSQLSQLSQL ServerWrite-optimized trees
00:00
[Music] I'm still ass. [Music]
00:24
All right, round of applause for being cash. Thank you.
uh a lot to cover and we do people don't people have comments from feedback for you. We'll get that in a second.
Uh but this is the first time we started on the time all semester. So yay awesome right and it's also the most important lecture because it's going to be the most important data structure in all of
00:40
computer science in all the world is be B+ trees and hopefully today I'll convey to you why that's the case. I don't care about any other data.
I mean you care about them because you need them but it's all about the B+ tree. That's the most important thing.
All right. So again reminder project one is due this Sunday coming up.
Uh we had recitation
00:57
last week. The video is on uh is on piaza.
Go to that post there along with the slides. And then as I said beginning of the semester for every project since they're all due on Sundays and we don't have office hours on Saturdays and Sundays normally, but the day before the project is due, we have a special Saturday office hours with multiple TAs
01:14
for multiple hours on campus. Uh so you can come and ask hopefully what aren't last minute questions like how do I get started?
Like more deeper things to help you try to try to debug things. Okay.
Then homework three will be released on Wednesday and that'll be due on October 5th. And then just as to put this in
01:31
your radar, the midterm exam will be in this room uh on Wednesday, October 8th. And it's going to cover lecture 1 to 11 inclusive.
All that material up to and including but not the week of the exam. So the week before the exam, all of that will be on there.
And then I'll release a study guide with a practice exam next
01:47
week. Okay.
Any questions about any of these things? All right.
So also starting today is we're kicking off the beginning of our seminar series this semester. So a bunch of you emailed and reached out and said how I love databases maybe not as
02:03
obsessively as much as I am but like I do but like how do I get how can I get more databases in my life and go beyond the course. So again we have these optional seminars starting today every Monday.
Uh and the theme this semester will be on iceberg. If you don't know what iceberg is, uh it's one of the most important pieces of database technology
02:20
that's come around the last three or four or five years, right? There's a reason why data bricks paid, I think, $2 billion for it.
Billion with a B. Snowflake tried to buy them for 600 million.
Data bricks came in and uh kneecapped them and took it for six or two billion, right? So today will be
02:37
introduction to iceberg. Uh and then next week will be hoodie which is the alternative to iceberg.
So both these sort of technology or systems were built at the same time. Iceberg came out of Netflix.
Hoodie came out of Uber. Iceberg is the dominant one.
And then following after that we'll have mother
02:52
duck give a talk uh about the duck lake system which is another alternative to uh iceberg. Okay.
Again this is optional. All right.
So what I do want to cover is the the feedback we've gotten through the course. Believe it or not we get emails and we get messages on YouTube and we do
03:09
read them. Uh so the first feedback we got was actually for DJ Cash.
So the first guy says uh he heard your radio show which I heard went really really well. Uh at least from what I heard.
Uh he's basically saying like you to get more money. Not me, you should get more
03:24
money. Uh this guy uh Sweaty the Glove uh says they you got the best beats and that he wants to buy your merch.
So we You're doing that, right? You're releasing something.
We're working on it right now. you release like stickers and shirts or something?
03:39
T-shirt. Okay.
All right. Awesome.
Uh so that'll be available for for DJ Cash coming up. Uh for me, somebody said in the first lecture that I wasn't uh hyped up enough.
I wasn't speaking fast enough that so you could still watch me at 1.25x. Uh so that was surprising.
Again,
03:55
maybe I was down that day, but I'm you know, I get excited. I start thinking really fast.
I'm surprised this guy's complaining that I was too slow. Um and then this guy says that you can't be a nerd or a gangster.
You can be either nerd or gangster. Can't be both.
I don't want to comment about my past, but like,
04:10
you know, was I arrested? Yes.
Am I here teaching you guys? Yes.
Databases? Yes.
So, I'm not saying I'm doing something really malicious, but like whatever. I know a lot of people that have gotten in a lot of trouble in life and end up doing databases.
So, this guy doesn't know what he's talking about. Plus, he's got a photo with his kid.
So, that's his
04:26
problem. All right.
But one feedback that is actually correct. This guy here uh pointed out that I made a mistake in I think lecture six um where I talked about how there's this distinction between OLTP and OLAP and I made the claim that OLAP was coined by Jim Gray.
04:41
Uh and this guy points out and he's absolutely correct that is actually wasn't Jim Gray it was the other touring award winner in databases although there's there's four of them. It was one of the other ones, uh, Ted Cod, the guy that invented rational model.
He was the one that was getting paid to write an article about like, hey, here's what
04:56
OLAP is. And then the article got got retracted once they found out it was, uh, he was basically kind of shilling for this company called Sspace, which is a data cube.
You don't even know what a data cube is. It's another way to do analytics for column stores, but this is how people did things in the 90s.
And anyway, so he's right that this in this
05:12
lecture here when I show this picture, it wasn't Jim Gray. Uh, it was it was Ted Cot.
Okay. So again feedback like that is super useful not maybe not me you know complain about me but feedback for cache and feedback for the course material is always appreciated.
Okay, so
05:28
last class uh we were talking about hash tables. We said these are an important data structure you need in your database system to support various functionalities, various parts of the of the system and we spent a lot of time talking about static hash tables.
We kind of rushed at the end on dynamic hash tables. So I'll pick that up again
05:44
and go over that one one more time. Um but the most of the time these these hash tables that we're talking about will be used as for again for internal data structures like things to keep track of like your your page table and so forth, right?
not so much for indexes which we'll see mostly how they're going to be used in B+ uh today. Right?
So
06:03
again remember last class when we finished off dynamic hashing there was essentially um there was three schemes. There was chain hashing where you just keep adding things to this long chain.
We said oh that kind of sucks because now you if in the worst case scenario you're you're regressing to a linear scan and you're you're hitting up O. Um
06:19
so we talked about these two approaches extendable hashing and linear hashing. So I quickly just go over the the example again and then we didn't get through deletes for uh for linear hashing.
So I I want to cover that as well again. So these are two approaches that allow us to reorganize and resize our hash table without having to
06:36
reorganize everything. Right?
When we had these static hash tables when we ran out of space or the fill factor got too big uh that we then basically had to make a second copy of a hash table, usually double the size of the original one and then take all the keys out of the first one and put them in the second one. Right?
And that's a very expensive thing to do especially if your hash
06:51
table is really really big. So with extendable hashing linear hashing the idea is that we can split buckets incrementally uh and grow them without having to reorganize everything.
So in the case of standable hashing you have this notion of a uh a global bit counter. Basically it's the number of bits we have to examine uh in our
07:08
hashes. Right?
So when we hash a value or hash a key we're always going to get 32 bits or 64 bits. this global uh marker just says how many bits uh do we actually need to examine to figure out where to go and then we had this bucket array that was going to correspond to uh
07:23
bits uh that that that are the part of the hash hash values. So in this case here we'd have uh some some of some of the bits or sorry some of some of the buckets only require one bit and they're end up pointing to the same bucket and then these other two in the bottom are looking at two bits right so then when
07:41
we started doing things like a lookup a hash on a we would look at the global uh the global bit counter is set to two. So we know we only need to look at two bits to figure out where in our bucket array do we need to jump to.
Right? In this case 01.
We follow that uh and we end up with this bucket here at the top. We
07:57
find our key. Same thing.
Now we do a put in B. Uh we only need two bits.
It lands in this. No problem.
That's fine. Now we do a put at C.
It requires two bits. But then when we follow the pointer, we land up the the bucket here in the middle.
but it's out of at a out of slots in our buckets. So now we got
08:14
our overflow. So when we do an overflow in uh external hashing, we're basically going to double the size of our bucket array, right?
Uh in actuality, you obviously usually allocate maybe most of it uh or large enough size. You're not doubling over
08:30
time. Also too, this thing is just not that big.
So it's not like the whole hash table. We have the keys and the values and everything else.
So it's very expensive to double things. In this case, it's not that not expensive.
So now that I doubled the size, I can split the bucket that overflowed. And then I set now the the global counter to three.
08:47
And then some of the buckets that have that weren't split still only require one bit. In that case, they're all pointing to the first bucket.
The second ones that need two bits point to the other one here. And then for the ones that we split before, now they're using three bits and they're now pointing to two different buckets.
So now when I go back and I want to put C back in again,
09:04
now I look at three bits. I find find the the location in my bucket array and I follow the pointer to the bucket that I that I want.
Okay, so linear hatching was a different approach where instead of splitting the
09:20
bucket that overflowed, we're actually going to end up splitting the bucket that's where something else is pointing to this thing called a split pointer. So there was a question last time, oh this seems kind of wasteful that like why are we not splitting the bucket that overflowed?
We're just splitting whatever one the split pointer is
09:35
pointing to. And the idea is that over time uh things are sort of balanced out.
Eventually you'll end up splitting the bucket that overflowed and things sort of nicely balanced out. All right, let's go back here.
So we still have our bucket array. Uh and now again we're not
09:50
looking at just the bits, just sort of what offset we are in our bucket array. And then we have a split pointer that's that's a just a demarcation to say what's the next bucket we're going to split.
So the very beginning it's going to soon as we overflow we'll end up splitting bucket zero and there be some
10:06
hash function we just hash the key then mod by n with ns the the the size of of this bucket array. So we have n at this point n equals four.
All right so now I want to put so when we get six hash by six just say it's the identity hash mod by four and I get two
10:22
and then I know how to jump to my bucket array at offset two and I land to the bucket that I want and there's the key I want. Now I'm going to put 17 17 hashes to uh you know when I mod four I end up with one.
It points to that bucket here. But that now that bucket is full.
So I
10:38
have to overflow. But instead of splitting this like I did in extendable hashing, I'm going to extend it out like I do in chain hashing.
Just add another bucket and have it maintain a length list. But then now I'm going to split whatever the split pointer is pointing at.
So in this case here it's pointing at zero. So I'm going to go ahead and
10:53
split that guy at the top. even though that's not the one that overflowed, but this is just how the algorithm and the protocol works.
So now what I'm going to do, I'm going to add a new entry to my bucket array. So I I have a slot four at the bottom.
And then now I have a second hash key or hash function where I'm
11:08
going to modify two times the original one. So before we had n equals two and now we're going to modify by eight, sorry, by four, right?
because it's sorry n equals two because I have yeah n equals four in the original one going back here and then now when I add the
11:25
new hash function now it's 2 n so that'll be eight so I'll have and eventually I'll have eight slots in my bucket array and the split pointer is keep moving down and then at some point it'll wrap around and then I start the whole process all over again all right so in this case here again so we're going to take whatever the the the bucket at the top that the split point
11:41
is pointing at that we're going to end up splitting that so we have to look at all the keys that are inside that bucket and figure out hash them again with the new hash function and figure out where they go. In the case of eight, eight stays where it's at because eight mod 8 is zero.
So that stays at the top. But then when I eight, when I take 20 and mod 8, I end up with uh four.
So it
11:58
lands here at the bottom there. Okay?
So it's not like when I do the second hash because I'm just doing 2n in the new hash function, it's not going to land in any arbitrary bucket. It's either going to be the original bucket or the new bucket I just created.
12:14
Right? And then this split pointer moves down by one.
And now when I do a lookup on 20 all right 20 mod 4 is is zero. So when I do my first lookup I pro I would see that oh I'm this thing for the first hash function is zero.
Well the split
12:29
pointer is pointing it now at one. So since zero is less than one I know that I've already split it anything above the split pointer.
So to figure out where 20 actually is I got to hash it again now with the uh with with the second hash function. And then now I'm gonna land
12:45
the location that has the key that I want. Same thing.
Get on eight. Eight mod 4 is just one.
One is where the slip pointer is at. So one is equal to one.
So I know I don't need to run the second hash function. I can use I can just jump to
13:00
the first one and find the thing I'm looking for. Yes.
The statement is the decision of when to use hash one or two is is determines
13:16
based on it has to do with what the first hash function points at and is above the split pointer. So going back here I always got to hash the first one.
I hash the first one first time I get zero. Zero is less than one where the split point is pointing at.
So I know I have to go to the to the second hash function.
13:39
the statement is and he's correct that say I keep splitting because overflows and I'm going to keep ex adding new entries to my my buck array up until I get to seven right and then at that point the split pointer wraps back
13:54
around to zero I can yes throw away key the first hash function key one mod n or key mod n and I'll just have and then key mod 2n will be the new hash function or sorry the first hash function and then I'll have key mod 4N after that. Yes.
14:12
Is this just like a link or something like Yeah, this question is how do we do the overflow? It's just the bucket hashing we talked about before chain hashing.
So just add a pointer. Keep going.
If we put something that is what we're going to end up splitting again.
14:27
So the question is uh if we Yes, he's correct. So say now say this bucket here one say it was actually at the bottom right and I keep inserting into it keeps overflowing.
Yes, you would keep splitting anything up above it the split pointer because then eventually you
14:45
you'll overfl it'll get to it then it'll split. In this case we already overflow right?
Yes. So so your question is like if I assert something again here does that count as an overflow?
Yes. because I'm I mean like you could not do it and say
15:00
all right I'll just keep going until that second bucket is is full because I've already created no big deal but again think of it extremes I don't want to have this really I don't want I have to do a linear scan jump inside that to find things I'm looking for so if I if I can avoid that by splitting it's I'm worth it's worth doing that
15:18
okay so again this I've already said before basically the idea is that we're just going to keep doing this until even though we're not splitting the one that overflowed eventually But we will get to this right to do delete uh it's basically just the reverse right so I want to delete key 20
15:35
uh same thing I first uh do the first hash function I get zero I know zero is above our split pointer so so then I'm going to hash it again and now I get four I jump down here right I delete 20 but now the the the bucket is basically
15:50
empty so you could just leave it there if you wanted to or you just do the reverse you could kill it uh roll the flip pan pointer back back up and then you know that four slot in the bucket array nothing's really pointing to it. Nothing can get to it anyway because because my my hash function won't ever jump to it.
So I I could just leave it
16:06
there, right? And the reason why I maybe don't want to clean things up so quick or eagerly immediately as it happens because if someone now inserts 21 back in, I'm going to be back where I started before and just have another overflow and just do do the same thing all over again.
So,
16:23
I'm saying in in implementations of this, you you kind of relax the rules about how quickly you want to uh revert back on deletes because you may be doing the same thing over again and just you're just spending time moving this one back and forth and and uh free, you
16:39
know, reusing space or freeing up space and then allocating it all over again. And we'll see the same idea when we talk about B+ today where there'll be there'll be times where the so go through what the what the algorithm says you have to do but in real limitations
16:55
they'll relax this a little bit because it'll make a difference in performance. Okay.
All right. So let's jump into today's lecture.
Uh again the the beautiful BL trees. Um, so first we do an overview on what the data actually looks like and then we'll talk about different design
17:11
choices you have and actually implementation and then I'll try to plow through as many optimizations as we can until we run out of time. But I'll cover the main ones that I think are the most interesting, most important ones uh in in the beginning.
Okay, so B post is by far the most
17:26
common data structure you're going to have uh in databases used for indexes. like when you call create index in in nearly every single data system you're going to be getting a a B+ tree or something that looks like a B+ tree.
uh in some systems like postgress that support hash hash uh hasht indexes you
17:44
have to explicitly say I want to create table using hash or right and then because otherwise the default is always using the B+ tree the reason why is because B+ trees can do any search you can do point queries you can do range scans you can do partial key lookup sometimes we'll see how to do that where
17:59
in hash tables you can only do uh point lookups like go get this one tupil I can't do range scans because everything's unordered in the hash table so it doesn't mean anything there's no notion of a range the values aren't ordered. And if I have multiple keys or multiple attributes in
18:14
my uh my index key like table or column A and column B, I can't do a lookup on that index unless I have both columns, right? Because I can't go give me the hash on A and then have that have me be mean have that mean anything without B
18:29
because I have to have both of those keys together. So B plus is going to be way more versatile.
So now the tricky thing is going to be is the name is going to be a bit confusing right so there's going to be a class of data structures called B trees of which there's a specific data structure called
18:45
a B+ tree but then there's also another data structure called a B tree but the B B+ tree and the B tree are both B tree data structures but the B+ tree is not the same as the B tree I'll explain what I mean okay so the but
19:00
generally you know this this sort of category of of data structures refer to what are called balance uh tree or balance trees. All right.
Um the the original B tree paper uh is roughly attributed around 1970. No one's quite
19:15
sure exactly when when this maybe 1969 1970. Um and the the original paper is basically this one here uh from from came out of Boeing, right?
And this second author here, McCree, he's actually CMU alum. He did his PhD here
19:30
in the 60s and then he went off to Boeing and then they were building a data structure. Uh they built a data system and they needed the data structure and they basically invented the the D tree a B+ tree.
Uh but the paper most people site refer to as the like the the original definition of the
19:47
B+ tree came out nine years later. This famous paper called ubiquitous B tree.
Right? So again think you know it's the early days of computing uh and obviously there wasn't the internet so things moved slowly more slowly than they do now but by 1979 the B tree or the BSG was considered you know the essential
20:03
data structure we use in data systems so much they were it was everywhere and they were calling it ubiquitous right so the challenge I think it's going to be is like there's original definition of the B tree the B+ tree but no one follows what those original papers actually talk about they're actually going to use bits and pieces of other
20:19
data structures that are extensions to B+ trees that come along over the years and in particular the one we're going to talk about today as well is called the be link tree which actually was invented here in 1981 in in Carnegie Melon and the first author for this Phil Leman
20:36
he's still here at CMU so he's on the fifth floor or the sixth floor at the dean's office right the dude's dude's awesome he's still here um and so when you go look at the postgrads code also as well they're basically be implementing this this data structure as well if you go look in the source code it'll say for the actual
20:52
you know the the B tree data the B+ tree data structure so they're going to call it a B tree but it actually is just a B+ tree again people get fuzzy with the names but lo and behold it says right there the this directory contains a correct implementation of Leman and Yao's uh high concurrency B tree management algorithm come comes from
21:08
this paper written by the dude over there in uh in gates right so again so post is called this B tree but it's really a B+ tree and actually they're going to call this an NB tree which is non a a nonbalance. So they're they're going to relax some of the
21:24
things that we talked about before. Everyone take a guess what the B means in B+3.
Boeing. Yes.
He's seen the video. Yes.
Okay. Uh so people think it means balanced.
People mistakenly think it means binary. Binary like a binary B
21:39
tree is basically where M equals three. We'll cover that in a second.
Uh or next slide. Right.
But according to Phil, he said he's like 99% sure, 90% sure that when he talked to like the dudes that wrote all these other papers uh that were at Boeing, he said, "Yeah, they
21:55
named it after Boeing because they were at Boeing, right?" All right. So, so the Boeing tree doesn't sound very uh catchy.
Uh but you know, people people just heard it as the B tree or the B+ tree. So, I'll try to be very careful and I'll distinct I'll make a
22:10
distinction what a B+ tree is versus a B tree is. But for going going forward when you when a database system says they have a B tree 99.9% of the time they really have a B+ tree.
Okay. All right.
All right. So, B+ tree is going to be a self-balanced ordered Mway
22:26
tree that is going to support the searches, sequential access, insertion, deletions, all the things that we'd want to do in a data structure in log n uh time where m's going to be defining the fan out the number of sort of uh
22:41
pointers that are going to come out of your of each node which either could be to another node in the tree or could be to the record or whatever it is that that we're trying to find. So, we define m by that.
And so for every node, if you say it's an uh it's an Mway tree, it'll have m outbound pointers and that's the
22:57
degree of the node. But then it's going to have n minus one keys that it can reference inside of of each node.
So it's being claimed to be perfect balance, meaning that the distance from the root to any leaf node will always be the same anywhere out through the
23:13
throughout the tree. It means we're going to have to do a bunch of stuff in our data structure to ensure that this is the case.
And one of the key requirements is that we're going to say that every single node except for the root because it's kind of a special case or it is a special case. Every single route needs to be at least half full,
23:29
right? And then when it becomes less than half full, we got to do a bunch of extra work to to to get make it be half full by by taking keys from our neighbors or merging things together, right?
And so the reason why this data
23:45
structure is going to be so important because especially in the early days but even now in in with modern hardware it's be optimized for reading and writing large blocks of data very efficiently right this data structure is going to convert what would otherwise be potentially random IO into sequential IO
24:01
and as we said in the beginning that's going to make a big deal when it comes to actually you know constant time factors for actually running running our system right if we can maximize amount of of IO then we're going to get better performance And as I said, some systems implement that implement the B+3 will relax these
24:17
properties. Uh but we'll ignore that for now and go strictly by by this definition.
I say also too the fan out is going to be pretty large, right? So m is usually a large number.
Uh you know, think of like a of a 8 kilobyte page in
24:32
Postgress. I can put a lot of key values in pairs inside that thing, right?
So the the height of the of an a real system of these data structures of B+ it's be like maybe four or five levels because the fan out is going to be quite large. But in case of PowerPoint, you know, I have to make it fit.
So we'll
24:48
look at simple examples here. So this would be a a three-way tree.
So there's be uh the fan out is three for every node. So I have three pointers coming out and I'll store two keys inside each node.
So at the top we have the root node. The middle layers we call the inner or non-leaf nodes.
Uh and then the
25:06
bottom nodes are called the leaf nodes. The way to think about this is that along the leaf nodes are our our keys are going to be sorted from this case here low to high.
You can define you want to be the opposite you can go high to low. But these are going to be sorted according to some coalation ordering that that that we would define in our database system.
25:23
So the contents of the nodes uh for the the root and the the non-leaf on the inner nodes is going to be alternating key value pairs of some kind of pointer to another node in our data structure followed by a discriminating key that's going to allow us to determine whether we want to go left or right to find the
25:39
entry that that we're looking for. And then in the leaf nodes uh these will be key value pairs where it'll be the actual key that we we we want to find and then its corresponding value which could be a pointer just some data structure something else or could be the record ID in the case of other systems
25:54
like like my SQL could actually be the entire tupil for our purposes now it it doesn't quite matter. So way to understand these keys, they're basically in the inner nodes in the root.
They're guideposts that allow us to figure out where we want to go when we do our traversal into into the index. And so
26:11
for this example here, I'm going to say I have a my root node, I have key 20. So if I want to go to our left, uh that'll be all the keys that are less than 20.
And then on the other side of the right, there'll be all keys that are equal to or greater than 20. So for any given key, I can just look at that uh look at
26:29
what you know look at that node, look at the key and decide whether I I want to go left or right. One thing we're we're going to also add as well, and this will make it easier for us to do splits and merges in a second, is add sibling pointers.
This is what the the the BLink tree in 1981
26:44
added for us. So now when I want to do a lookup like a range scan like find me all the keys that are uh greater than 15 uh 15 or greater I could go start my root node traverse down to the bottom and then now scan across to on the on
27:00
the on the leaf nodes and follow those sibling pointers to find all the entries that I want and I don't have to backtrack and go back up to my parent to jump down to the next one. Yes.
27:15
the question. Yeah.
Yes. We're sorry.
What is the question? So what is the values of m?
Uh so I mean n is just the number of keys at the bottom. You count those.
Uh but n would be um m in this case there's three
27:33
because for every node right going back uh say say the one at the top right you have you have node pointer tells you how to go down to one side. You have a key, the discriminated key that tells you whether you know what whether you want to go left or right based on the value of it followed by the next pointer.
27:50
Yes, for range query I don't see why you would need sibling pointers. The question is for range pointers you don't see you don't know why you need sibling pointers in line you'll need them for splits and merges but for for once you reach to the bottom you you if you're just reading you never to go back up.
28:06
Yes. question in that root node there's three node pointers yes why what question is why do I need three node pointers I'm only pointing at two how am I allocating this
28:23
thinking a real system what's the backing memory of this no no what's the backing memory of the data in a node like I got to I have stored the node the contents of the node somewhere what am I storing it in what is it a page
28:41
pages are all fixed size. So in my PowerPoint example, yeah, I have two pointers coming out of it.
But in a real system, I'm going to allocate a 4 kilob, 8 kilobyte page. I'm I have to, you know, I'm going to allocate the space for it.
So it's always going to be there.
29:00
Because his question is why does that why am I why did I just what I just said explain what there's three pointers? Because it's not like an inmemory data structure where I'm just calling maloc and I only get the data that I actually need.
I'm gonna go to the buffer manager say I need a page right and you're gonna get back an 8 kilobyte chunk of memory so I'm going to allocate the space for
29:15
it in my diagram yes the space is not being used in a real system you would you would have that space allocated anyway even it's not still not being used okay so on the leaf nodes um
29:31
I'm just showing that that you we you'd have pointers for siblings uh on the back side of the array right so at its core What is what is the B+ stream? What are we doing?
29:48
He says binary search tree over over what? Yeah, but what's the data structure at the bottom?
What is this? No, not hash table link list.
Right. So all this data structure is and all these
30:03
tree data structures is just allow us to jump into our link list more efficiently because otherwise without this stuff at the top what do I have to do? I got to scan along linearly on the leaf nodes to find the thing that I want.
Right? So sorry go back.
So it's not that
30:19
exotic if you just think about think about the top part is just the scaffolding we're going to build up on top of this link list so that we can jump to the bottom to find the thing that we want or or the keys that we want. Right?
Skit list basically the same thing. Try is a little bit different but
30:35
but you know in the end of the day that's what this data structure is doing for us. It allows us to jump to locations in our link list without having to scan from beginning to end every single time.
Again because always thinking extremes right if I have a billion keys or trillion keys I don't want to do that linear scan every single
30:52
time. My BS3 will allow me to jump into it.
And as I said before the fan out will be quite large in a real system. So I'm only have to jump down at most maybe five levels and I can represent a lot of keys in in my in my leaf nodes.
31:07
All right. So the leaf nodes themselves the node themselves are just think about the arrays of key value pairs, right?
The the values that we're going to be storing are going to be different based on whether we are an inner node versus a leaf node, right? If it's an inner node, the values are going to be pointers to
31:23
other nodes and these are just page numbers, right? that I can do a lookup in the on the buffer pool to go go get the data that I want.
And then the leaf nodes will vary based on what actually want to store. Do I want to store uh record ids to pointers to tupils?
Uh do I want to store the entire tupil itself?
31:39
I can do a my SQL and SQL light. In some systems like my SQL as well, the leaf nodes will actually store a primary key as the values.
So then I can do take the value the primary key and do a lookup in the primary key index to actually get the get the record ID or get the table. Right?
And then for nulls we have to treat them
31:55
specially and typically you either put them at the beginning of the the the the list or at the end of the leaf nodes. I think by default in postgress they put them at the end.
So again I'm showing this sort of diagram like this uh of you know in my before because it has to fit in PowerPoint. I'm showing these sort of
32:11
key value pairs over over and over again. In actuality a real system wouldn't actually store it like this.
Um you would have some uh some extra metadata. They would store it at the top.
uh they say like what level I'm at, how many slots can I store, what's my
32:26
previous pointer and and and next pointer. Sometimes there's additional metadata say like how many keys do I have inside here?
Uh what's the high key? What's the max value I can store in my page?
Some extra things that make me find find data more quickly. And then the you could store the example I was shown before you just have the key and
32:43
key followed by its its uh record ID or sorry the pointer right afterwards as the value. In a lot of systems, instead what you do is you store all the keys sorted as a separate array.
Then you store all the values just in a fixed length array uh or or some kind of array
32:58
that you can then just jump to offsets like we did in the column store based on whatever key that I'm looking for. So now I I can if I want to find the key that I want uh I can just do binary search on the key array and not have to jump over data that that I actually don't need.
Yes. I still don't get the relationship
33:15
between a node one page. So question is one node one page.
Uh not always but you just you could for simplicity think of it like that. Yes.
If a the database page is 8 kilobytes the node is going to be 8 kilobytes. That's what I'm saying.
The fan out be
33:31
going to be quite big because you can put a lot of key value pairs in 8 kilobytes. What's that?
Question is when will it not be one page? Give me a few more slides.
Sometimes my my my keys can be quite big and I got to overflow.
33:47
The question is how do you maintain sort of keys? I don't pick your favorite sorting algorithm.
Quick sort right it's in memory. The question is does is that mean whenever a key is sorted a key is inserted am I going to resort it in most systems?
Yes. You don't have to
34:04
but in general yes but again like it's in memory I'm already reading it. So I hold the I hold the latch on it, the mutex on it.
So I can do whatever I want once I have that. Question that's a lot of movement.
Uh not this is going to fit in cash lines.
34:20
It's not it's not gonna be that bad. I can I can sort 100 keys pretty quickly.
Yes. Sorry.
Question is keys be variable length. Yes.
We'll we'll solve that. Give me a few more slides.
Yes. That makes it
34:36
harder. All right.
So, I've already said this. What are going to be the the values on our leaf nodes?
Uh, in most systems, it'll be the record ID and it's like a page number and an offset. That's always fixed length.
And that'll that's, you know, that'll allow us to jump to find that we want. Um, in other systems can
34:53
actually be the tool themselves. We saw this in index organized storage in MySQL and Postgress and Oracle and other systems, right?
My SQL and SQL light give you this by default. Like you when you create table, you get it.
In SQL server and Oracle you have to explicitly say that I want it.
35:08
All right. So now that you understand the basics of B trees and B+ trees or you know the basics of B+ trees let me make the distinction what a difference between a B tree is and a B+ tree.
Actually this should be 197 not 1971 but in the original B tree all the values were values could exist
35:25
anywhere in the in the inner nodes. Meaning if I'm doing traversal to find the thing and I'm find the key that I want, I may hit the first root node and then I'll find the key that I want and then I get the value right then and there.
35:40
In the B+ tree, the values again like the record ids or whatever it is that's actually representing the data of the tupil that I'm trying to find in my data structure only exist in the leaf nodes. So in order to find whether a key exists, I always got to get down to the
35:56
leaf node and it's either there or not there. So B trees are going to be more efficient in terms of space, but I now got may have to do this uh depth first search and jump up and down and come back up to the root over and over again to try to find the thing that I want.
36:13
And that's going to be random IO. That's gonna be bad.
In the case of a B+ tree, once I get to the leaf nodes, ignoring splits and merges, once I get to the leaf nodes, I'm trying to find data. I can do sequential scans to find the all the find the things that I'm looking for.
36:28
And that's a key distinction. So when we delete a key now, it may in a B+ tree, it may still exist in the inner nodes, but it'll it'll get removed from the leaf nodes.
And that's okay. In a B tree, if we delete a key, it has to be completely removed from the tree.
36:45
Red, black trees work the same way. All the other ones work the same way, but they're going to be, you know, they're going to have their own issues, right?
So B plus trees are going to going to be slightly larger than B trees, but we're going to get better performance because we do more sequential scans. And actually, we'll see in two weeks or next
37:02
week it'll when we start having multiple threads jumping into our data structure, B+ is going to be way better than a B tree. All right.
So let's first talk about how we want to do insert. So the protocol uh as I said for for doing a a lookup is again you just look at the keys figure
37:18
out where you want to go left and right and then jump down and keep going till you find the leaf node and you either find the data you want or you don't. Now to do an insert we're going to do that same lookup procedure where we're going to traverse the the the data structure find the leaf node that should have the key that we want to insert and then go
37:35
ahead and try to insert it. If there's enough space in that leaf node, great.
We're done. If there's not enough space, meaning the all the the the the array is full, then we have to do a split.
And that means we we in this case here, we want to now redistribute our keys that
37:51
are in the the the node that overflowed to to other nodes, which then may require us to then do a bunch of splits going up the data structure. So upon insert we may have to reorganize the entire tree in the worst case scenario but in most of the cases uh you know
38:08
most cases we'll have enough space in the leaf node we're trying to insert into and then worst case scenario we'll have to then maybe split our parent right and then the split process we're just going to redistribute the keys evenly by just taking whatever the middle key is and using that as the
38:23
halfway point split things in half and then push the middle key up as a discriminator in the in the tree in the tree above us and recursively do this till everything's balanced. So going back to a really simple example here, now we have again a uh four tree with three keys.
So there's four uh four
38:40
branches that could come out of every single node and then we have our sort of a range ranges here. And now I'm showing like just four greater than equal to four n less than 12.
Here's the more compact form of that. Right?
So say I want to insert six. So again I start at
38:55
my root node. I want I I I look at the keys that I had there.
Six is greater than four but less than 12. So I know I want to go down to this leaf node here.
But this node is full. So I have to split it.
So what I'm going to do is I'm going to end up creating a new node, a new leaf
39:12
node that's currently empty. Then I'm going to take half the keys in my my first node uh or take all the keys, figure out what the middle the middle point is.
Half the keys are going to go on the in the original node. The other half the keys are going to go into the other node.
So you would end up like
39:28
this. Right.
The six went into the original node, but nine and 10 got moved over to the other one. Am I done?
No. Right now, I got to go back up to my to my my my parent and add a key there to make sure
39:44
that if anybody's looking for nine and 10, they know how to go find it. So, in this case here, I can just choose the the the smallest key on the on on this note here.
Nine is basically what I split on before and put nine at the top there. And this point I'm done,
40:02
right? I just I just update my well you wouldn't store this the ranges up here the keys essentially giving you that ranges right because when you traverse the tree you know how to interpret the contents.
Okay. Yes.
40:20
The question is you don't remove the nine. Why would you do that?
statement is why why did I why don't I remove the nine then if the key exists it has to exist in the leaf nodes right so going back here nine was in here
40:36
before right 5 9 10 so nine's in there same is and he's correct any key that exists my data structure has to exist in leaf nodes yes
40:52
H question is why isn't four there? Because four might have got four was inserted and then got deleted.
That's okay. In a B tree, you can't do that.
In a B+ tree, you can. All right, let's do another insert.
Let's now insert uh eight. Eight.
Again,
41:08
root node, we follow the the pointer. Eight is less than nine but greater than four.
That takes us down here. We go ahead insert.
And that's fine, right? And let's look at let's look at a bigger table.
So, I want to insert 17. 17 should go here.
we uh that's fine. We
41:23
can absorb that. Now, I want to insert 16.
16 should go between 15 and 16, but there's not enough space in here where I I can store it. So, I can I'm going to split the node again.
Uh so, essentially, I'm going to create a new node and then shuffle all the keys
41:39
from the the new node that that got overflowed. Put half on one, half in the other.
Right? So, say I split on uh on 16.
So everything 16 and greater goes in to the new node. Everything 16 less stays in the original node.
But now I
41:54
have to I have to update my my my root parent in this case the root node to now add 16 there. So anybody that is coming down can find us.
But of course now the problem is our our root node is is full as well. So again you recursively go up and keep trying to add insert new keys
42:11
back up. And if you have to split your parent node, you got to run basically the whole protocol again.
And as you go up from one level to the next, you don't reorganize anything below you because you've already handled that. Make sure it's balanced.
But as you go up, you make sure everything else is balanced until you reach the root and you're
42:27
done. So in this case here, our root node is full.
So we're going to split it again and then grow the tree and make it taller. So 13 is going to get moved up because that'll be the middle key for our uh for for you know for for the new route of
42:43
the entire tree. And then 16 is going to put where between nine and 19, right?
But now what we need to do is split this because anytime I I've inserted something up, I have to split whatever node uh that that has got overflowed. And so I'm going to create two new nodes
42:59
here. Well, you just copy you you you create a new one and then reallocate things in the original one.
That's fine. And then now I update my uh my root node the new root node on 13 to now point to uh values less than 13 which is five and nine and values greater than 13 which is
43:14
16 and 19 on this side or greater than equal to 13 which is 1619 here. So the new organization of the discriminator keys is is looks like this.
Yes. Is there a reason why when we initially try to
43:32
move stuff to the at the very beginning. Yeah.
Yes. So, he's talking about optimization and we'll see this when we do deletes.
43:48
If I'm back here, I I want to insert 16. Uh I could I could recognize, oh, well, the guy on where where node 911 are or even 20 213, they can absorb one more key.
So I could follow my sibling pointers and just shuffle things around.
44:05
Move 13 and 14 over the other one and then put 16 on this one. I wouldn't have to split it.
You could do that. It's basically what you do when sometimes deletes and other things.
It's just more machinery, right? It's an optimization like like I'm just trying to do the basics first.
44:21
But yes, you could do that and it still would be correct, right? And then the question always people ask is like going back here, right?
at this point here, why not just say we're done here, right? Why do I have to create the a new new node at the second level because then it's actually
44:37
no longer balanced, right? It doesn't follow the protocol where you have exactly um you know it's still the traversal is still locally still correctly correct but from the top point from the root node perspective all the keys are on one side of it not the other side.
So it's not considered balanced in
44:53
that way. So that's why we always have to split whatever the one uh whatever the level just overflowed to create new nodes and then reconstruct the tree like that.
And again in the and since these are all backed by pages if all the keys on this
45:10
side of the tree were out on disk and and not in our buffer pool manager or not in our buffer pool we don't have to bring them in and do anything because we're just updating pointers to the page IDs for the other parts of the tree. And so the nice thing about this data structure is that we just, you know,
45:25
there's parts of the tree that we don't need while we're reorganizing things. They just can sit on disk and not not bother anybody and not take up space.
All right. So look at the deletes.
Deletes are basically the opposite of this.
45:42
Again, the idea is that when we delete a key, if after the deletion of a key from a leaf node, it's still at least half full, congrats, we're done. We don't do anything, right?
If though deleting that key now makes us less than half full, we we can first try
45:57
to redistribute and rebalance the data structure in the way he was suggesting by try to stealing keys from our siblings uh so that we again we don't have to reorganize everything all over again. But if stealing a a key from an adjacent node and usually you only look at so the
46:13
one one guy over you don't try to look at all your siblings. If if it's if you can if you can steal it without rebalancing, you're done.
If though if you steal a key from either one of your siblings and makes them less than half full, then you have to do a merge. In which case, you may you will
46:30
end up deleting a key discriminator key on on your parent above you, which may again recursively go all the way at the top and cause you to rewrite the entire tree. All right, so let's here we're going to delete key six.
We go ahead and delete that. doing turtle right now we're less
46:46
than half full after that. So we can we can try to borrow from a rich sibling meaning a a sibling that can can absorb a delete from from us and not and still be let more than half full.
So in this case here we can take nine from our our our sibling go ahead
47:04
and put that in data structure but now we need to update our parent pointer or parent key for the discriminator because anybody looking for any value greater than equal to nine as in this current form in the tree would end up falling to the along the right side and at least
47:19
this leaf node and then completely missed the key and think it was actually not there when it actually was there. So we got to go up and update the that guy here by putting 12 there.
And then now now our tree is considered correct. Yes.
So in a production system if you're
47:35
going to do something like this would you store metadata about the number of children in the number. Oh wait.
So you said for uh in a protection system. What do you mean?
Sorry. Production system.
What's that? A production system.
Oh production. Sorry.
Sorry. Sorry.
47:50
Yeah. Um question is how do you find out whether you can steal keys from your neighbor?
large. Yeah.
Say you only look like you only look the next door next door neighbors. You don't try to look at all the possible siblings.
You could, but it's just like now you're taking latches,
48:06
which again we'll talk about next week. You're just taking more locks and taking latches, not locks.
Taking latches and that just can slow everything else down. Yes.
48:22
Yes. Yes.
But for for simplicity, we're just assuming leaf nodes. All right, let's look at a more complicated example.
So I'm going to delete key 15 here. So again, if I delete key 15 for this node or when I
48:38
delete key 15, the the node is less than half full. So I can borrow from for steal from a rich sibling, right?
And then 17 goes over there. Update the parent now to put 19 there.
And now everything is is all all coc and correct, right? So that's fine.
48:54
But now I want to go ahead and delete 19. Right?
So I go ahead and I delete 19 and I want to try to see whether I can steal from one of my neighbors and I can't because if I take either from 1317 or 2123 take from any of those nodes
49:10
they'll be less than half full and then now that's not balanced and we got to fix that. So in this case here we don't have a choice.
We're gonna have we're going to have to do a a merge, right? So, we're going to merge from our our neighbor to our right.
It could do it from the left. Different systems do
49:25
different things. It doesn't matter, right?
As long as long as we sort of to go one way, it doesn't matter. Actually, I think you you go both ways.
For simplicity, most people go one way. All right.
So, we're going to go ahead and merge these guys. So, in this case here, we're going to basically take all the keys 2123 and put it in the the node
49:43
where 20 is located here. And we have to then delete the the discriminator key in our parent because it doesn't need we don't need to point down now to this other node that doesn't exist anymore.
Right? So after we do this, our leaf nodes are all balanced.
That's fine. But now this parent node is less than half
49:59
full. So now we got to do this.
We got we got to do a uh we have to do a merge. And essentially to do this because it's the root, we can pull the root down here.
merge these guys together into a to a new node like this, right? And the
50:16
height of our tree has shrunk and then all of the discriminator keys correctly point to the pages that we want. Yes.
So his question is what what am I
50:33
what do I mean by pull down? Yeah, you would see well this guy's under full.
Uh if I check take my sibling following that sibling point I'm just not showing in the diagram then that guy would be under full. So I have no choice but to go up in this case here is the root.
So therefore I know if I just I can bring it down and then uh collapse them
50:50
together into a single node. But a point thing to point out here so I deleted key 19 but I still have key 19 in my my inner node.
And again, in a B+ tree, that's okay because it's just a guidepost to tell us where to go find the things that we want. You wouldn't
51:05
actually say, "Oh, I got key 19 in node. I'm I'm looking for key 18.
I'm done." You always have to check the leaf node to see whether it actually exists or not. So, it's okay that key 19's in in there already.
Which means that we can throw away anything we don't need in our inner nodes when we do this compaction or merging because again, we're not
51:23
throwing away real data. As long as our discriminator keys correctly point to the parts that we want in our in our leaf nodes, it's okay for us to go from five keys to four keys.
Yes. His question is has the ribbon underfill
51:39
from the beginning in this example. Yes.
The rooe is like special case. You can kind of let that you can let that go less than half full.
That's usually how it works. Yes.
We have to kind of redistribute
51:54
approach. Yeah.
Question is if if the root if the the node 59 could could handle taking a a sending a key over to the sibling, what do we have to redistribute? Again, you wouldn't have to touch the leaf nodes, right?
All you you'd have to do say there was key I don't know four,
52:10
five, nine, and I send nine over, right? All you need to do is drop the pointer from from I use the laser pointer.
Drop that pointer here. Get rid of this.
Then now you have a new one point to that and that's all fine. move on.
Yeah, that's fine.
52:28
Okay. Well, if you follow this so far, congrats.
This is this is the B+ streak, right? Of course, the devil's in the details.
Implementing will be challenging for project two. Uh but the the gist of it is is pretty straightforward.
The challenge really
52:45
comes to be when you start having to do uh concurrent access of doing splits and merges while other threads are trying to read and write data at the same time. That's where things get tricky.
All right. So I sort of mentioned this before that uh the great things about B+ trees is that you can you can use
53:00
composite keys or composite indexes use them as composite indexes which basically means you have a key that's comprised of multiple attributes. So I can create a table on create an index on table XXX and in here you can see that I have uh I'm using three columns AB and C for uh for for my key.
And I can also
53:18
specify additional metadata about like what should be the sort order on a per key basis or whether I want the nulls to be first or last on a per key basis, right? And then now in in a B+ tree because I uh because I'm not hashing where I'm sort of randomizing the the
53:34
keys, I can do a bunch of different lookups on having either the entire entire keys or all the keys or a subset of the keys and everything just all works. So again, have a key index on A, B, and C.
I can do a lookup on A and B and C or next A and B and then some
53:52
systems uh Oracle and now SQ or postcrist now has has two you can do what are called skip scans or partial keys where I don't have the prefix I don't have A but I can still do lookups on B and C there's extra machinery to make that work um we'll see that in in
54:08
the slides. Okay, so again, say I have a key on A and B, and so again you see that the keys now are comprised of multiple attributes. And these are just the bytes stored sequentially, one after another. There's nothing really special about it.
And because it's all declared in SQL, I know what the size of
54:25
the data type I'm trying to store is. So I know how to interpret a byte array as: the first 32 bits is an integer for key A, and the next 32 bits is an integer for key B, right?
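To make the byte layout concrete, here is a minimal sketch, with hypothetical names, of packing a two-column integer key into one byte array so that a plain byte comparison gives the same ordering as comparing column by column. Big-endian encoding is an assumption made to get that property; a real system may store native-endian values and compare per column, and also has to handle signedness and NULLs using the catalog metadata described below.

```cpp
// Minimal sketch of a two-column composite key (a INT, b INT) packed into one
// byte array. Big-endian encoding is assumed so that memcmp ordering matches
// comparing (a, b) column by column. Names are illustrative only.
#include <array>
#include <cstdint>
#include <cstring>
#include <iostream>

using CompositeKey = std::array<uint8_t, 8>;  // first 32 bits: a, next 32: b

static void PutBigEndian32(uint8_t *dst, uint32_t v) {
  dst[0] = v >> 24; dst[1] = v >> 16; dst[2] = v >> 8; dst[3] = v;
}

CompositeKey EncodeKey(uint32_t a, uint32_t b) {
  CompositeKey key;
  PutBigEndian32(key.data(), a);      // bytes 0..3 hold column a
  PutBigEndian32(key.data() + 4, b);  // bytes 4..7 hold column b
  return key;
}

// Byte-wise comparison == lexicographic (a, b) comparison with this encoding.
bool KeyLess(const CompositeKey &x, const CompositeKey &y) {
  return std::memcmp(x.data(), y.data(), x.size()) < 0;
}

int main() {
  std::cout << KeyLess(EncodeKey(1, 2), EncodeKey(1, 3)) << " "    // 1
            << KeyLess(EncodeKey(1, 3), EncodeKey(2, 0)) << "\n";  // 1
  return 0;
}
```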
All that you know because it's in the catalog from when you created the table or created the index. So if I want to do
54:40
a lookup on key (1, 2), or A equals one and B equals two, again I just jump to the first node.
I look at the first key in my composite key: one is less than or equal to one, and then two is less than or equal to three, and I know how to
54:56
then jump down to the leaf node to find the thing that I'm looking for. So that's the lookup when you have all the values, the full lookup. If you only have the first key and not the other keys, like a prefix lookup, then, well, in this case here I just do the lookup for the first key that I do have:
55:13
one is less than or equal to one. I then jump down to this leaf node here.
But now, since there may be multiple keys, or multiple key-value pairs, in my leaf nodes where key A equals one, I have to scan along
55:28
the leaf nodes until I find the first entry where key A does not equal one. Right?
So in this case here, I would come across the next key in this leaf node, where key A equals two.
I know that doesn't equal one. So I can stop
55:44
scanning there and not have to look at anything after that. Right?
So I can do a sequential scan along the leaf nodes and find everything I want. And the last one, as I said, you can actually do the suffix search, or what Oracle calls skip scans.
Basically, it says: I don't know what's in the prefix
56:00
of my key. I don't have that.
I only have the second part, right? So I don't have A, but I have B.
So in this case here, there's nothing really magical you can do. You kind of got to look at everything, right?
You got to look at all the leaf nodes. Uh but you can do this in parallel.
They
56:15
can all scan across the leaf nodes until you find all the matches that you want, where B equals one, because again, the key is sorted first on A, then B. So all the A-equals-one entries are going to be on
56:31
one side, then the twos and threes and so forth, and then the B values are sort of within each subset of the A values, like a GROUP BY with multiple columns. Yes.
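As a rough sketch of the three lookup modes just described (full key, prefix on A, and the skip-scan style lookup on B only), here a single sorted vector stands in for the chain of leaf nodes; all names are illustrative, and a real index would reach the first leaf by walking the inner nodes.

```cpp
// Rough sketch of lookups on an index keyed on (a, b), with a sorted vector
// standing in for the chain of B+ tree leaf nodes. Illustrative only.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <utility>
#include <vector>

using Entry = std::pair<int32_t, int32_t>;  // (a, b), sorted by a then b

// Full lookup: we have both a and b, so an exact search works.
bool FullLookup(const std::vector<Entry> &leaves, int32_t a, int32_t b) {
  return std::binary_search(leaves.begin(), leaves.end(), Entry{a, b});
}

// Prefix lookup: jump to the first entry with this a, then scan forward
// until a changes (the leaf-node scan described above).
std::vector<Entry> PrefixLookup(const std::vector<Entry> &leaves, int32_t a) {
  std::vector<Entry> out;
  auto it = std::lower_bound(leaves.begin(), leaves.end(), Entry{a, INT32_MIN});
  for (; it != leaves.end() && it->first == a; ++it) out.push_back(*it);
  return out;
}

// Skip-scan flavor: we only have b, so every entry has to be examined
// (though a real system can at least parallelize this scan).
std::vector<Entry> SuffixLookup(const std::vector<Entry> &leaves, int32_t b) {
  std::vector<Entry> out;
  for (const Entry &e : leaves)
    if (e.second == b) out.push_back(e);
  return out;
}

int main() {
  std::vector<Entry> leaves = {{1, 1}, {1, 2}, {1, 3}, {2, 1}, {3, 2}};
  std::cout << FullLookup(leaves, 1, 2) << " "          // 1 (found)
            << PrefixLookup(leaves, 1).size() << " "    // 3 entries with a=1
            << SuffixLookup(leaves, 2).size() << "\n";  // 2 entries with b=2
  return 0;
}
```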
56:56
What do you mean, reorder the keys? Is it smart enough to reorder them? The question is, um,
57:13
do I have to specify, in my SQL query's WHERE clause, the keys in the same order that the index is defined on? No. Again, it's declarative. SQL says I do a lookup on A equals 1 and B equals 2, right? I can write B equals 2 and A equals 1. Semantically they're the same.
57:30
Just because you wrote them in that order doesn't mean I'm going to run them in that order. Right?
Uh, the question is, are the query engines actually smart enough to see that? The good ones are, right? Like, uh, most of the ones you can think of,
57:47
like Oracle, Postgres, MySQL, all of them will do it right. When we talk about join ordering, uh, there's a bunch of systems...
With join ordering, you basically want to figure out the right order to join the tables, and that can make a huge
58:02
difference in performance. There's a bunch of systems that will go in the order that you defined in the SQL query, and they don't try to reorder it for you.
The good ones say like I don't care how you define it. I'm going to figure out the best way to do it right.
So Oracle used to do that. Oracle
58:17
in the 80s, we'll cover this later, used to say, "Oh, yeah, no, we're going to give you a query plan in the order that you defined the tables," because, they called it a semantic optimizer, it was all bull. It's like, "Oh, this is the order you meant it to be in.
So we're going to run it that way," even though that wasn't actually the right way to do it.
58:32
We'll cover this later. Yes, there are a lot of systems getting this right.
It's not hard. Join ordering is a harder one, and some systems will cheap out on that.
Okay. All right, duplicate keys.
58:50
So we've talked about primary key indexes. Primary key indexes have to be guaranteed unique, right?
Or rather, the primary keys have to be guaranteed unique. Uh, you can also have secondary indexes, uh, where again you can define that the secondary index has to be unique on the keys.
In actuality though a lot of
59:07
systems actually still don't store things uniquely; they still will have duplicate keys, uh, because you may do multi-versioning.
We'll cover that later. But I may have multiple versions of a tuple that are all going to have the same primary key, and I've got to put them in the index in
59:23
order to find them. So, how do you handle that, right?
Or think of it like this: if you have an index on the zip code of people's mailing addresses, there's a bunch of people that live in 15217, uh, 15213 in Pittsburgh.
So, you know,
59:39
how am I going to handle having duplicate keys in that case? So there's two ways to do this.
The first one is the right one. The second one, don't do this.
Right? So the first one is you basically store a hidden value, or sorry, a hidden attribute, in the key of
59:55
the index. That's going to be the record ID of the thing I want to point to.
Remember, the record ID is like a page number and an offset. It's guaranteed to be physically unique for every single physical value.
So even though I'm going to declare my index on the zip code, I'm actually
00:13
storing, along with the key, the record ID as well, and it'll be hidden from you as the end user. You can't see it through SQL.
And the reason why B+ trees work for this is because we can do that prefix scan: if I have an index on A and B, but I only have the
00:30
value of key A, the first part, I can still do lookups on it. So now I can do my lookup on the zip code without having a record ID and find all the entries that I wanted.
The alternative is to use overflow
00:45
leaf nodes. Um, now, we'll see overflow nodes later to handle large keys, which you kind of have to do.
There's no way to get around that. But if you want to handle duplicate keys, you could also overflow the leaf nodes. Don't do this, even though it's going to look very similar.
One is, and this is all semantics,
01:02
sort of overflowing down versus across. Don't do the down one, do the across one; it'll make sense on my next slide. All right, so the first one: append the record ID. So the key now is going to be whatever the real key is,
01:17
as I defined it, like, on column A, but also, in that same byte array for that key, I'm going to store the record ID, right? And that'll guarantee uniqueness, because there's only one physical tuple that can be at a given physical address in my pages at an offset. So this guarantees uniqueness
01:33
even though I may have the same key over and over again. So now I want to do an insert: I insert six into my data structure, my B+ tree.
What I'm really doing is inserting six followed by the page number and the offset. And then I just do the traversal down, looking at
01:48
the first key, six. Then I land in, you know, whatever leaf node I want and store the thing I want.
In this case here I have to do a split, right? And everything works out fine.
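A small sketch of the record-ID suffix trick, using a std::set as a stand-in for the B+ tree's sorted entries; the field names and the zip-code example are just for illustration.

```cpp
// Sketch of the "append the record ID" approach: the user-visible key (a zip
// code here) is extended with a hidden (page, slot) suffix so every entry is
// physically unique, while prefix probes on just the zip code still work.
#include <cstdint>
#include <iostream>
#include <set>
#include <tuple>

struct RecordId {
  uint32_t page_id;  // page the tuple lives on
  uint16_t slot;     // offset/slot within that page
};

struct IndexKey {
  int32_t zip;   // the declared key column
  RecordId rid;  // hidden suffix, never visible through SQL
};

// Compare the real key first, then break ties with the record ID, so two
// tuples with the same zip code never collide in the index.
inline bool operator<(const IndexKey &a, const IndexKey &b) {
  return std::tie(a.zip, a.rid.page_id, a.rid.slot) <
         std::tie(b.zip, b.rid.page_id, b.rid.slot);
}

int main() {
  std::set<IndexKey> index;       // stand-in for the tree's sorted entries
  index.insert({15213, {7, 2}});
  index.insert({15213, {9, 0}});  // duplicate zip, different tuple: fine
  index.insert({15217, {3, 1}});

  // Prefix probe: seek to (zip, lowest possible RID), scan while zip matches.
  int matches = 0;
  for (auto it = index.lower_bound({15213, {0, 0}});
       it != index.end() && it->zip == 15213; ++it)
    ++matches;
  std::cout << matches << "\n";  // 2
  return 0;
}
```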
Now, the overflow node alternative: I already have key six in my data
02:04
structure. Now I'm going to insert six again.
So I traverse down, I end up in this leaf node here, and see that it's already full.
So I'm going to add another page to it and just maintain, like, a linked list going in the vertical direction, going down, which
02:22
violates that, uh, you know, logarithmic lookup, because now I have to jump to this one and maybe keep scanning. But it's sort of like the chained hash table, where I can just keep adding buckets to my chain, right,
in order to accommodate the overflow. It's
02:37
basically the same idea. I'm going to insert seven, and so forth.
Same down here. So in this case here, I can absorb a bunch of writes into this overflowed leaf node, and I'm not changing any of the discriminator keys in my inner nodes up above.
Insert six, you do the same thing,
02:54
right? The reason why this is a bad idea is because of what happens if I now need to do a split, right?
Because I want to insert, say, key 7.5, which logically should go between seven and eight. Now I've got to do a split,
03:09
and now I've got to move all this other stuff down below as well. I have to scan them all,
bring them all into memory, figure out where they need to go, and split them up. And they now may have overflows on both sides, or on both of the new pages.
Yes.
03:34
The question is, for this one here, because key six already existed, am I only inserting the page number and the slot number for the new key I'm inserting, or for all the keys? It has to be all the keys.
The question is, do you change the B+ tree entries for
03:50
all the keys? Yes, Postgres does this.
Most systems have to do this. Okay.
04:05
So, uh, clustered indexes. Sometimes you hear things referred to as a clustered index or a clustered table; it just means a sorted table, and it'll be sorted based on some, uh, B+ tree up above, right? You can't use a
04:21
hash index or hash table, because that's not sorted, but it'll be, uh, a B+ tree up above. And so if you do index-organized storage, like MySQL and SQLite, you get this for free, because the leaf nodes are storing the data sorted
04:36
based on the primary key up above in the inner nodes. In some systems you can create a table using, you know, heap storage, so unordered tuples, but then you can tell the database system: oh, by the way, manage this, make sure things are sorted based on this
04:52
other data structure here. And some systems will maintain it, so if I insert new things, it'll make sure everything's still sorted. In other systems, like Postgres, there's a CLUSTER command you can call; it'll sort your table based on some index, but it'll do it once, and then if the table is updated and changed over time, it doesn't maintain the sort order. You've got to run the
05:08
CLUSTER thing again. Yeah. So the reason why this matters is that, uh, it can make our lookups more efficient in many cases, right,
when we have to get things from disk. So if my table
05:23
pages are sorted in the same order that's defined by the index, then when I want to do a scan and try to find data, I can just, again, use my index to get to some offset within, or some location at, a table page, and then scan along the rest of the pages and I'll
05:39
find everything that I want. Where things can go wrong is if the index defines one sort order, but the table's data is sorted in another sort order.
I may have really inefficient access if I
05:56
naively retrieve whatever page the index is pointing to, one at a time. So let's say I'm trying to find some range of data between these pages here.
And so if you look at the pointers, or the record IDs, in my index's leaf nodes, they're
06:13
basically pointing to all different locations in these different pages. So if I do a sequential scan, just starting at whatever the first key is that I want, and then go get that data, then go look at the next key in the order defined by the index, I'm going to start jumping around these
06:29
different pages over and over again, unnecessarily. So if you think of, like, a really stupid system that only had one frame in its buffer pool, I would read page 102, throw that away, then read page 103, uh, throw that away, and then read page 104.
In this case, here I have two keys that are both in
06:46
page 104, and I can read that page once and use it twice, right?
So in a bunch of systems, a simple optimization is that you'd actually first do a scan on the index, find all the record IDs and the corresponding page IDs that you know you're going to need to access, sort them by the page
07:03
ID, and then go retrieve them in that sort order. So that guarantees I'm only retrieving each page once and only once.
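A minimal sketch of that "scan the index first, then fetch pages in sorted order" idea; FetchPage() is a made-up stand-in for a buffer pool or disk read call, and the names here are hypothetical.

```cpp
// Collect the record IDs from the index, sort by page ID, read each page once.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

struct RecordId {
  uint32_t page_id;
  uint16_t slot;
};

void FetchPage(uint32_t page_id) {
  std::cout << "read page " << page_id << "\n";  // pretend this costs an I/O
}

void FetchInPageOrder(std::vector<RecordId> rids) {
  // Sort by page ID so all slots on the same page are visited together.
  std::sort(rids.begin(), rids.end(),
            [](const RecordId &a, const RecordId &b) {
              return a.page_id < b.page_id;
            });
  uint32_t last_page = UINT32_MAX;
  for (const RecordId &rid : rids) {
    if (rid.page_id != last_page) {  // each page is fetched once and only once
      FetchPage(rid.page_id);
      last_page = rid.page_id;
    }
    // ...copy out the tuple at (rid.page_id, rid.slot) here...
  }
}

int main() {
  // Record IDs in the order they came out of the index (pages scattered).
  FetchInPageOrder({{102, 3}, {104, 0}, {103, 7}, {104, 5}, {102, 1}});
  return 0;  // reads pages 102, 103, 104 exactly once each
}
```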
I still have to maintain some metadata to keep track of like what's the key sort order that came out of the index, but I'm basically going to scan the index first, figure out what I need,
07:18
then go get it. And you can start playing other games, which we'll see next class or in a couple of weeks: like, if I have WHERE a = 1 and I have an index on that, and b = 2 and I have a separate index on that, I'll do an index scan across them both, then do an
07:36
intersection of those two lists of page IDs to figure out the minimal set that I actually need, then go get those pages. Postgres calls this a bitmap heap scan. So we can use these indexes to help us be more intelligent about going and getting the data that we need, to avoid random
07:53
I/O. All right, so let me get through the node design choices, and, like I said, I want to plow through as many optimizations as we can.
These ones are pretty straightforward. Um, and again, just to reiterate how important this
08:08
data structure is, just for humanity, right? Uh, there's a book written about ten years ago on modern B-trees.
So here's a data structure from the 1970s, and here are all the modern incarnations and optimizations you can do to it, uh, to make these things go better.
And again,
08:24
it's so important that they wrote a second book, uh, that came out last year, on more modern B-trees. And the guy that wrote it, we're going to see multiple times throughout the semester: he invented the iterator model of the Volcano query processing model.
He invented the way we want to build query
08:40
optimizers. He invented the way we're going to do parallel operations in our database system.
So this guy is brilliant and he's written the definitive guide on B+ trees. Again, notice though he calls them B trees instead of B+ trees, right?
But it's the same thing. All right, node size should be pretty
08:57
obvious that you can have different node sizes for the inner nodes versus the leaf nodes. Typically, you want to maybe have larger inner nodes than leaf nodes, assuming you're not storing the tuples themselves, just record IDs, because then, within one node, you
09:12
can have a higher fan-out and find things more quickly. And then, of course, the size of the nodes depends on what the hardware looks like.
So the way the research basically plays out is that if your hardware is very very fast, your storage is very very fast like in memory, then you actually want really
09:27
small node sizes, because you want to be able to go acquire the latch on one small page, find what you want quickly, and immediately jump out of it. But if your hardware is slower, like a spinning-disk hard drive, then you want your node sizes to be maybe one megabyte.
09:44
So in a system like Postgres, you can't vary the size of index pages. But in enterprise systems like DB2, again, you can specify different page sizes on a per-table basis and on a per-index basis, and I think they can even discriminate between the inner nodes and the leaf nodes and have different page
09:59
sizes for them. So the merge threshold we talked about is, like, if a node is less than half full, then I have to do a merge or try to steal from my sibling, right? If you follow exactly that protocol, then you roughly end up with an occupancy
10:15
rate of about 69 70%. So about 70% of your nodes are going to be uh full at any given time.
Right? So if you can delay the merge process as long as possible, then you can avoid some unnecessary
10:32
disk I/O, because you're not in a situation where I'm merging and then someone immediately inserts something and I have to split again. So you want to let things go a little less than half full for a while.
Eventually, maybe say, "Oh, I'm not using this space. Let me go ahead and clean it up." But it'll allow you to
10:50
handle more inserts without having to do additional splits, if you don't enforce the merge right away. So it ends up being not exactly balanced anymore.
This is why the Postgres B+ tree implementation ("nbtree") is described as a non-balanced B-tree. Overall it's balanced, but sometimes it's okay to relax that.
11:09
All right, the last question is how to handle variable-length keys. So there's a bunch of ways to do this.
One option is just to store pointers. So instead of actually storing the variable-length key in my nodes, my pages, I just store a pointer, the record ID, to the
11:25
actual tuple with the attribute I want, because I already know how to store variable-length data in my table pages, which we talked about before. So instead of storing it again in my index pages, I store a record ID pointing to it.
Yes. As I said, you still need the key to do the search.
You've got to follow the pointer
11:42
to go get the key to figure out how to do the search. Yes.
Yes. Yes.
Go ahead. Yes.
Don't do that. So this shows up in
12:00
in-memory database systems in the 80s. There was a system from Oracle called TimesTen, in the 90s, that did this.
Uh, the basic idea is like, oh, memory is so precious, instead of storing these redundant keys, because essentially the index is storing another copy of data, right? It's like a secondary copy of the data that's in your
12:16
tables. So instead of storing the secondary copy of the data in my index, if I can go get the data very quickly, because everything's in memory, then I'll just store the pointer to it.
Turns out though that random access still sucks in memory uh as well as
12:31
definitely on disk. So nobody actually does this, at least that I'm aware of. Like, Oracle still supports this in TimesTen, but by default they went back to a regular B+ tree. Uh, you could also have variable-length nodes, which basically allows the size of
12:50
every single node to be arbitrarily different. Usually you'd do something almost like a slab allocator: you'll have, you know, a 1-kilobyte node, a 2-kilobyte node, 4, and so forth, right? This one you can do, but it only shows up in academic systems, because it
13:05
requires more memory management, because now the frames in your buffer pool manager could be different sizes and you have to be able to handle that. You're basically reimplementing malloc in your database system, which may not be a good idea.
A cheap way out would be just padding:
13:20
you know the max size of any given key, like it's going to be 32 characters. So no matter whether it's 32 characters or something less, you just pad it with a bunch of zeros at the end so that things always fit nicely.
The implementation that everyone usually does is the same
13:36
way we had slotted pages: you take a page and have the data growing from the end toward the beginning, and you do the same thing for your variable-length keys.
So then in your node you have an offset array that says, for a given key, here's the offset
13:51
within my page to find the actual key that I want. So the last one is usually what people do.
And then if you have really, really large keys that don't fit, then you just overflow horizontally. And that's okay.
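Here is a rough sketch of that slotted-page-style layout for variable-length keys inside a node, with an offset/length array in front and the key bytes packed behind it; sizes and names are hypothetical, and keeping the slot array in key-sorted order is omitted for brevity.

```cpp
// Slotted layout for variable-length keys inside a node: a (offset, length)
// slot array in front, raw key bytes packed behind it. Illustrative only; a
// real node keeps the slots sorted by key and reclaims space on deletes.
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

struct KeySlot {
  uint16_t offset;  // where this key's bytes start inside the node
  uint16_t length;  // how many bytes the key occupies
};

struct Node {
  static constexpr size_t kNodeSize = 4096;  // assumed node/page size
  std::vector<KeySlot> slots;                // indirection array
  std::vector<uint8_t> heap;                 // packed key bytes

  // Returns false when the key (plus its slot entry) no longer fits, which is
  // when a real implementation would split or overflow the node.
  bool AddKey(const std::string &key) {
    size_t used = heap.size() + (slots.size() + 1) * sizeof(KeySlot);
    if (used + key.size() > kNodeSize) return false;
    slots.push_back({static_cast<uint16_t>(heap.size()),
                     static_cast<uint16_t>(key.size())});
    heap.insert(heap.end(), key.begin(), key.end());
    return true;
  }

  std::string KeyAt(size_t i) const {
    const KeySlot &s = slots[i];
    return std::string(reinterpret_cast<const char *>(heap.data()) + s.offset,
                       s.length);
  }
};

int main() {
  Node node;
  node.AddKey("abc");
  node.AddKey("a-much-longer-variable-length-key");
  std::cout << node.KeyAt(1) << "\n";
  return 0;
}
```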
14:08
All right. How do you find the data within a node?
So the first, most obvious thing to do is just a linear search. So I want to find key 8.
I've got to find it in my array. I just start from the beginning, scan across until I find it, and I'm done.
Again, it's not the full table or
14:27
the full data set. So this is okay, right?
The number of keys we have within a node is not going to be massively large. So, yeah, it's a linear scan, but, um, it's not too bad because it's in memory.
One optimization you can do if you have certain data types that
14:44
are fixed size, you can use SIMD to do vectorized operations. Who here is taking 418?
All right. What's that?
taking right now. Have you covered SIMD yet?
Yeah. You're right.
SIMD basically means single instruction, multiple data. So think of like it's a single CPU instruction that allows you to do some
15:01
operation on multiple data items at the same time. So SISD is what you normally think about: single instruction, single data, like 1 plus 1 equals 2. You take one register for one value, another register for the other one, you add them together, a single instruction, and it produces one output.
SIMD
15:19
allows you to take a single instruction and a vector of data, do some operation on it, and produce a vector as output. So to do something like this, say I have, uh, you can use a 128-bit SIMD instruction.
These are Intel intrinsics, or x86 intrinsics. So I
15:37
can take the first four keys, four, five, six, and seven. I can then build a vector of just the key I'm looking for, eight.
And then in a single instruction, I can do a comparison of the input vector with my target vector, and it'll produce a bitmap that says, here are the offsets of the
15:53
entries that match. In this case here, none matches, but it's a single instruction to do this comparison and produce the output.
And there are other SIMD instructions that then, you know, check for any set bits to see whether I have a match or not. In this case here, I don't have a match.
So I just slide over and do the next SIMD instruction on this. In this case here, I only have three keys
16:10
instead of, sorry, four. So, you know, the last lane's ignored.
And now I find a match. I see a one.
I know I've found the thing I'm looking for. So again, instead of having to scan one key at a time, I can use a smaller number of instructions to do this in parallel.
x86 has this. ARM has this.
Uh, RISC-V
16:27
has its own; like, a bunch of architectures have vectorized instructions for this. But it only works, you know, it doesn't work on strings; uh, if things are variable length, it doesn't work. You want fixed-length values.
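A small sketch of that SIMD comparison using SSE2 intrinsics on x86, assuming fixed-size 32-bit keys and, for brevity, a key count that is a multiple of four; a real implementation would also handle the remainder and alignment, and __builtin_ctz here is a GCC/Clang builtin.

```cpp
// Intra-node SIMD search with SSE2: compare four 32-bit keys against the
// probe in one instruction and inspect the resulting bitmask.
#include <emmintrin.h>  // SSE2: _mm_set1_epi32, _mm_cmpeq_epi32, ...
#include <cstdint>
#include <iostream>
#include <vector>

// Returns the index of `target` in `keys`, or -1 if it is not present.
int SimdFind(const std::vector<int32_t> &keys, int32_t target) {
  __m128i probe = _mm_set1_epi32(target);  // [t, t, t, t]
  for (size_t i = 0; i + 4 <= keys.size(); i += 4) {
    __m128i chunk =
        _mm_loadu_si128(reinterpret_cast<const __m128i *>(&keys[i]));
    __m128i eq = _mm_cmpeq_epi32(chunk, probe);  // lane = all-ones on a match
    int mask = _mm_movemask_epi8(eq);            // one bit per byte, 16 bits
    if (mask != 0)
      return static_cast<int>(i) + __builtin_ctz(mask) / 4;  // 4 bits per lane
  }
  return -1;
}

int main() {
  std::vector<int32_t> keys = {4, 5, 6, 7, 9, 10, 12, 13};  // one node's keys
  std::cout << SimdFind(keys, 8) << " " << SimdFind(keys, 9) << "\n";  // -1 4
  return 0;
}
```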
Another obvious choice is to do binary search, because if everything's sorted, I just jump in the middle, look for key 8. I
16:44
know key 8 is greater than seven, so I want to look at the other side, binary search over here, and eventually land on the key I want, right? That's pretty obvious.
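For completeness, the binary-search case is essentially std::lower_bound over the node's sorted key array, as in this tiny sketch:

```cpp
// Binary search over a node's sorted key array; std::lower_bound gives the
// index of the first key >= the probe in O(log n) comparisons.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

size_t FindSlot(const std::vector<int32_t> &keys, int32_t target) {
  return std::lower_bound(keys.begin(), keys.end(), target) - keys.begin();
}

int main() {
  std::vector<int32_t> keys = {4, 5, 6, 7, 8, 9, 10};
  std::cout << FindSlot(keys, 8) << "\n";  // 4
  return 0;
}
```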
And the last one is rare; uh, no commercial system or real system actually does this: interpolation,
17:01
where we just do some math and say, well, I know I have a densely packed array of keys with no gaps, like four, five, six, seven, eight, nine, ten, and the key I'm looking for is key eight, so I can just do the simple math and say, well, if key eight exists, then if I just take the min value and
17:17
the max value and jump to my offset, I know it's going to be there, right? If you land on something that's not eight, then you know it doesn't exist. But, like, what if the array is not...?
In case you couldn't hear, the question is, the
17:34
question is, what if the array is sparse, not dense, not packed? Then you can't do this.
That's what I'm saying. It's a huge win if you could do this, but you can only do it in rare cases.
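A sketch of that interpolation idea for the special case where it applies, a dense, gap-free run of integers, with the bounds checks that make it safe to bail out otherwise:

```cpp
// Interpolation lookup for a dense, gap-free, sorted run of integer keys,
// where a key's position can be computed directly from the minimum value.
#include <cstdint>
#include <iostream>
#include <vector>

// Returns the index of `target`, or -1 if it cannot be in this dense run.
int InterpolationFind(const std::vector<int32_t> &keys, int32_t target) {
  if (keys.empty() || target < keys.front() || target > keys.back()) return -1;
  size_t pos = static_cast<size_t>(target - keys.front());  // offset = key - min
  if (pos >= keys.size()) return -1;  // array was sparser than assumed
  // This final check is what breaks down when the "no gaps" assumption fails.
  return keys[pos] == target ? static_cast<int>(pos) : -1;
}

int main() {
  std::vector<int32_t> keys = {4, 5, 6, 7, 8, 9, 10};
  std::cout << InterpolationFind(keys, 8) << "\n";  // 4
  return 0;
}
```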
All right, three minutes left. I'll get through four of
17:50
these. All right.
Again, it's a data structure from the 1970s. Look at all the crazy stuff you can do with it to make these things actually work.
And this is what real systems are actually doing. All right.
First one is pointer swizzling. This one is super common.
I don't know why it's called swizzling. I didn't make the name up.
It's the 80s.
18:06
Uh I think it's a reference but I wasn't there so I don't know. Um all right.
So when we do lookups when we follow pointers in our B+ tree what are we doing? Right?
We're not actually storing memory pointers. We're storing page numbers.
So, if I want to do a traversal,
18:22
find keys greater than three, the first thing I've got to do is look at my root node. It says six, so I know I want to go down this side.
Okay. Well, how do I get down to that leaf node?
Well, I'm going to have a page number in my root
18:37
node that tells me how to go down there.
So, to get that page, I've got to go to my buffer pool and say, "Hey, go give me page two. Give me back the pointer in memory for the frame that has that data." Okay, once I get that, then I can do the traversal.
Now I got to scan along the leaf nodes following the
18:53
sibling pointers. Same thing.
I want to get page three. What I got to do?
I've got to go to the buffer manager: hey, go give me page three.
Right? But because now we know it's a hierarchical data structure, we can have some guarantees about uh the order in
19:08
which things would get evicted uh if we're a bit careful about this. So instead of actually storing the page numbers in the pages, we actually store just the pointers to the actual data.
We've got to pin those pages, make sure they don't get swapped out and replaced with something else, right? Uh, but now, when I
19:26
when I when I have swizzle pointer, I basically have a little region in my page where I can store the extra raw memory pointers and I have to know that if it gets victed at the disk, I bring it back in. Those pointers are not valid anymore.
But now I can just do my look at my page and say here's directly to the location that I want as I do my traversal. and you get a huge win if you
19:42
do this. And the data structure provides a nice guarantee, we would say: I don't want to evict anything in a root node, or sorry, in an inner node, if one of its children has been
19:58
swizzled or is in memory. So this makes sure we have some guarantee about how we order things, so that we don't end up with pointers that go nowhere, or pointers that go to a frame that no longer has the data we want.
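A rough sketch of the swizzling idea: a child reference keeps the on-disk page number and, once the child is pinned, a raw pointer to its frame. The buffer pool interface here is made up, and real systems often pack the two into a single tagged word instead of two fields.

```cpp
// Pointer swizzling sketch: resolve a child once through the buffer pool,
// then follow the cached raw pointer on later traversals.
#include <cstdint>
#include <iostream>

struct Node { uint32_t id; };  // toy B+ tree node, just enough to compile

static Node g_frames[8];       // pretend buffer pool frames
Node *FetchAndPin(uint32_t page_id) {
  // A real call would consult the page table, maybe read from disk, and pin
  // the frame so it cannot be evicted while we hold the raw pointer.
  g_frames[page_id].id = page_id;
  return &g_frames[page_id];
}

struct ChildRef {
  uint32_t page_id = 0;  // durable reference (always valid)
  Node *ptr = nullptr;   // swizzled reference (valid only while pinned)

  Node *Resolve() {
    if (ptr == nullptr)            // not swizzled: pay the buffer-pool lookup
      ptr = FetchAndPin(page_id);  // remember the raw pointer for next time
    return ptr;                    // swizzled: follow the pointer directly
  }

  // Must run before the child (or this node) is evicted, so we fall back to
  // the page number instead of chasing a dangling pointer.
  void Unswizzle() { ptr = nullptr; }
};

int main() {
  ChildRef child;
  child.page_id = 2;
  std::cout << child.Resolve()->id << "\n";  // first call goes through the "pool"
  std::cout << child.Resolve()->id << "\n";  // second call skips it
  return 0;
}
```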
All right, the last one I want to talk about is um well I'll go two more. So
20:16
it should be clear that modifying the B+ tree when we have to do splits and merges is expensive. If it's inserts, we just keep inserting to one side if our node has space for it.
That's fast. That's fine.
Same with deletes, right? But the worst case scenario is that we do a delete or an insert and we got to reorganize the entire tree to make sure
20:32
it's balanced, right? So ideally it'd be kind of nice if we could delay these splits and merges, accumulate a bunch of them, and then do them in a batch, rather than having to treat them as, you know, single inserts and updates one at a time.
20:48
So this is what's called a write-optimized tree. I think Wikipedia calls them fractal trees.
That was sort of the brand name, right? Like Kleenex is the brand name for face tissue, right?
So fractal tree is the brand name. The generic name is a B-epsilon tree.
And the basic
21:03
idea here is that, in the same way you saw MySQL add a mod log in front of its compressed pages, we're going to add a mod log in front of every single node. So now, anytime we want to do an operation, we don't necessarily have to traverse all the way down to the bottom of the tree.
We just
21:19
find the first spot where we can put something in a mod log and put it in there. So I want to insert seven, right?
In a regular B+ tree, I'd have to traverse all the way down and then insert it in this node here. Instead, what I'm going to do is put an insert entry in my mod log.
21:38
It's like the memtable, more or less, or not quite the memtable, but like how in log-structured storage we store these log records: here are the changes I made, without necessarily applying them yet. So I'm going to put the insert of seven in my mod log.
Now I'm going to do delete 10. Same thing.
And I put it in there.
21:54
Right? So now, if anybody comes along and wants to do a lookup on 10, you do the traversal you'd normally do in a B+ tree, but you check the mod log first, and if the mod log contains the thing you're actually looking for, the actual key that you want, then you know you can stop right there, because you've found what you
22:10
want and you don't have to do the full traversal. So to find 10, I land in the root node, I look in the mod log, and I see the delete of 10.
Well, I know at that point that the last thing that happened to key 10 was that it was deleted, because it's at the top of the tree. So I can go ahead and stop my search, because I
22:25
know it's not there, even though physically it's still in the leaf node here. And I've got to be careful if anybody scans along and tries to find it, right? There's extra bookkeeping I have to do. But in this case here, I don't have to propagate the changes immediately. So
22:42
what happens if the mod log does get full? Like, I want to do an insert of 40, but now I don't have any more space in the mod log in my root node, so then I've got to propagate the changes down, right? And then whether you just go one level down or all the way down depends on the implementation.
22:58
So this is kind of getting the benefit of the log-structured merge tree, but in the context of a B+ tree. You're kind of blending the best of both worlds. Okay.
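A very simplified sketch of the mod-log idea, with a toy node that buffers inserts and deletes, answers lookups from the buffer when it can, and flushes downward (here, straight into its "leaves") only when the buffer fills; this is illustrative only, not how any particular system implements these trees.

```cpp
// Toy Bε-style node: pending inserts/deletes live in a mod log, lookups check
// the log first, and the log is flushed to the "children" only when full.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <map>
#include <vector>

enum class Op { kInsert, kDelete };

struct BufferedNode {
  static constexpr size_t kLogCapacity = 4;
  std::map<int32_t, Op> mod_log;   // pending changes; last write to a key wins
  std::vector<int32_t> leaf_keys;  // stand-in for the subtree below

  void Apply(int32_t key, Op op) {
    mod_log[key] = op;                            // absorb the write up here
    if (mod_log.size() >= kLogCapacity) Flush();  // propagate only when full
  }

  bool Lookup(int32_t key) const {
    auto it = mod_log.find(key);  // check the mod log first
    if (it != mod_log.end()) return it->second == Op::kInsert;
    for (int32_t k : leaf_keys)   // otherwise fall through to the leaves
      if (k == key) return true;
    return false;
  }

  void Flush() {
    for (const auto &entry : mod_log) {
      auto pos = std::find(leaf_keys.begin(), leaf_keys.end(), entry.first);
      if (entry.second == Op::kInsert && pos == leaf_keys.end())
        leaf_keys.push_back(entry.first);
      else if (entry.second == Op::kDelete && pos != leaf_keys.end())
        leaf_keys.erase(pos);
    }
    mod_log.clear();
  }
};

int main() {
  BufferedNode root;
  root.leaf_keys = {6, 10, 12};
  root.Apply(7, Op::kInsert);   // buffered; no traversal to the leaves yet
  root.Apply(10, Op::kDelete);  // key 10 still sits in leaf_keys physically
  std::cout << root.Lookup(7) << " " << root.Lookup(10) << "\n";  // 1 0
  return 0;
}
```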
This is very rare. I think the data structure was invented in the 90s but
23:14
very few systems actually implement this. There was a system called TokuDB; uh, they got bought by Percona.
The new professor in computer science, William Kuszmaul, his dad was one of the inventors of that data structure, even though he claims not to be a database person. The best implementation that
23:30
was out for a while is this thing called SplinterDB from VMware, which I think has been discontinued because the guy left to go to Cornell. But RelationalAI, which is in the US, and, uh, Chromodb, which is in China, those are the two systems that I know of that are pushing this pretty hard.
Okay. All right.
Quick
23:47
question. Yes, go for it.
The question is, why isn't this more popular? Because it's hard, right?
Because, like I said, if I scan along the leaf nodes, I've got to know how to find my way back up above. Okay.
All right, guys. See you on
24:02
Wednesday. Hit it.