#06 - Column Stores + Database Compression (CMU Intro to Database Systems)


Category: Database Systems

Tags: compression, databases, OLAP, OLTP, storage

Entities: Amazon, Ashton Kutcher, DJ Cash, HTAP, MemSQL, MySQL, OLAP, OLTP, PAX, PostgreSQL, Redshift, Single Store, Vector Capital, Wikipedia


Summary

    Course Logistics
    • Homework 2 is due this upcoming Sunday, and Project 1 is due the following week.
    • The recitation for Project 1 will be held over Zoom tomorrow night.
    • The recitation will cover high-level project components and implementation details.
    Industry Visit
    • Industry friends are visiting today and tomorrow for database talks.
    • An optimal schedule was created using a stable marriage algorithm to accommodate preferences.
    • Single Store was acquired by Vector Capital during a previous class session.
    • Ashton Kutcher was an early investor in Single Store, formerly known as MemSQL.
    Database Storage Models
    • Discussed row storage (NSM) and column storage (DSM) models.
    • Row storage is efficient for OLTP workloads with simple queries requiring entire tuples.
    • Column storage is beneficial for OLAP workloads, allowing efficient access to specific attributes.
    • PAX (Partition Attributes Across) is a hybrid model that combines benefits of row and column stores.
    Database Workloads
    • Categories of database workloads: OLTP, OLAP, and HTAP.
    • OLTP focuses on fast transactions with small data updates.
    • OLAP involves complex queries for analytical purposes.
    • HTAP aims to combine OLTP and OLAP capabilities but may not excel in either.
    Data Compression Techniques
    • Compression reduces IO costs and improves query performance.
    • Run-length encoding (RLE) is useful for repeated values.
    • Bit packing reduces storage for over-allocated data types.
    • Dictionary encoding replaces frequent values with fixed-length codes (see the code sketch after this summary).
    Actionable Takeaways
    • Ensure to meet deadlines for Homework 2 and Project 1.
    • Join the recitation session for Project 1 to clarify doubts.
    • Understand the differences and applications of row and column storage models.
    • Explore the benefits of PAX for optimizing database storage.
    • Consider compression techniques to improve database efficiency.
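
The compression bullets above name the techniques without showing them. As a rough illustration, here is a minimal C++ sketch of run-length encoding and bit packing under simple assumptions (32-bit inputs, values that happen to fit in 8 bits for the packing case); the function names are made up for this example, and dictionary encoding is sketched further down, after the lecture's compression goals.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Run-length encoding: collapse runs of repeated values into (value, count) pairs.
// Works best on sorted or low-cardinality columns.
std::vector<std::pair<int32_t, uint32_t>> RleEncode(const std::vector<int32_t> &col) {
  std::vector<std::pair<int32_t, uint32_t>> runs;
  for (int32_t v : col) {
    if (!runs.empty() && runs.back().first == v) {
      runs.back().second++;          // extend the current run
    } else {
      runs.push_back({v, 1});        // start a new run
    }
  }
  return runs;
}

// Bit packing: if every value in the column fits in 8 bits, store the
// over-allocated 32-bit integers as 1 byte each (a 4x reduction here).
std::vector<uint8_t> BitPackTo8(const std::vector<int32_t> &col) {
  std::vector<uint8_t> packed;
  packed.reserve(col.size());
  for (int32_t v : col) {
    packed.push_back(static_cast<uint8_t>(v));  // assumes 0 <= v <= 255
  }
  return packed;
}
```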

    Transcript

    00:00

    [Music]

    00:25

    Round of applause for DJ Cash. Uh, glad you made it, even if a little late.

    Yeah, I was busy getting money. Again, yes, I understand you want money a lot, but focus on databases, focus on your

    00:41

    DJ skills, not just like trying to raise money. All right.

    Uh, we have a lot to cover, so I want to get through everything. Um, so today, before we jump into the course material, again a reminder that homework two and project one are out and they're both going to be due in the coming weeks.

    00:58

    So homework two will be due this upcoming Sunday. Let me turn on the lights.

    Sorry. Um, yeah, homework two will be due the Sunday coming up, and then project one will be due the following week.

    Um

    01:14

    those dates don't match up. 21st, 29th, whatever the Sundays are, go by the Sundays.

    Um, the recitation for project one will be tomorrow night over Zoom. Go check on Piazza.

    And again, this will be a recorded session over

    01:29

    Zoom, but if you have questions and you want live answers, we encourage you to join and ask them.

    And again, you shouldn't ask basic things like, you know, how do I compile my C++ code? It should be like, I've started the project and I want to

    01:44

    understand a bit more about what this component means, or what this file means, or what this class means. Okay.

    So again, don't treat this as teaching you how to start the very beginning of the project. More like we'll give you a high-level overview of what the ARC algorithm is.

    We'll go over that in more detail than

    01:59

    I did in class and some other parts of the system. But then if you have more detailed questions about the implementation, this is the opportunity to do that.

    Okay. It'll be recorded, and we'll post it and the slides on Box later that evening.

    Okay,

    02:15

    that's tomorrow night over Zoom, not anywhere in person. For the upcoming database talks, as I said last class, today and tomorrow are our visit days from all our industry friends. So we have two sessions or

    02:30

    sorry, three sessions in the morning tomorrow starting at 9:30; everything will be in the Gates building. So you can see here, again, I know we asked you to select what companies you wanted to go visit.

    That wasn't like a contract saying I'm going to meet this company at this time.

    It was more like we wanted to know what your preferences and availability were.

    02:46

    So then we ran this through a stable marriage optimizer algorithm and it generated the optimal schedule that tries to meet the most people's requests and demands. Okay.

    There was one big announcement last Wednesday when we had class. Actually while we were presenting in class uh one

    03:02

    of these companies got acquired. Anybody know which one it was?

    What's that? PingCAP.

    No. The guy actually speaking, Joyo from Single Store.

    While he was talking, his company got acquired uh by Vector

    03:18

    Capital. Uh so as he said, Single Store has been around for a while.

    It was originally called MemSQL. Uh, that's how I first met these guys, back in like 2011, 2012.

    Um, we can talk a little offline about what the initial incarnation of MemSQL

    03:34

    looked like. Um, they had one very famous investor who, you know, did pretty well from last week's announcement, and that is Ashton Kutcher.

    Believe it or not, Ashton Kutcher was an early investor in MemSQL, now

    03:50

    Single Store. I don't know how they met him.

    I don't know how they got him, or whether Ashton Kutcher even knows anything about databases. Uh, but this tweet is real, from 10 years ago, right?

    So, yeah, congrats to those guys, although none of them are there anymore, right? Eric Frenkiel left.

    Nikita, I mentioned

    04:07

    he left Single Store and went to Neon, which got bought by Databricks for a billion. Uh, so whatever, they're fine.

    All right. So last class we talked about an alternative to the tuple-

    04:23

    oriented storage we saw before with slotted pages, and we said it isn't the only way to store data, right, to store tuples in pages. And we looked mostly at the log-structured storage scheme where, instead of storing the entire tuple and every time it gets updated

    04:40

    overwriting the existing tuple, we're just appending to these log files that get written out to disk, and then in the background there's some compaction mechanism (we talked about level compaction and universal compaction) to combine them together to reduce redundant information and

    04:55

    compress the sizes. With index-organized storage, the idea was it looks basically like the slotted page approach, but instead of having this heap file where I'm jumping to unordered pages to find the data I'm looking for, I'm actually going to use a tree-based

    05:11

    data structure like a B+ tree, which we'll discuss in a few weeks, and traverse down the tree to find the data I'm looking for, and then the leaf nodes, rather than being a pointer to where the data exists in a heap page, are actually the data itself. Right, MySQL InnoDB does this, SQLite does this. And again, each of these

    05:28

    approaches have pros and cons, but we saw that they were better for doing update-heavy or write-heavy workloads where you're ingesting a lot of new information from the outside world. Right, most of the queries that exist in the real world, though, are not going to be

    05:44

    write queries. Most of the queries are going to be read queries or select queries.

    And in fact, for certain read queries that are maybe looking at a lot of data or trying to do more complex things, these approaches are actually going to be terrible, right? Even the heap file, sorry, the tuple-oriented

    06:00

    storage we talked about before with slotted pages, that's going to be terrible for analytical queries. And the Single Store talk that he gave last class is actually the introduction for today's lecture, because a lot of the things he was saying about why you want to use a column store maybe didn't make sense yet.

    That's what we're going to talk about today. So, I'm going to briefly go

    06:17

    over again what the Single Store guy talked about last class, and talk about the different types of workloads that are out there for databases, and see how we can design our database system to account for these workloads and come

    06:34

    up with more optimal schemes. And the beauty of SQL and the relational model, with the separation between the logical layer and the physical layer, is that I can still expose a relational table interface to the application, like I'll call CREATE TABLE and I'll write my SQL queries on that.

    But underneath the covers we can use one of these

    06:50

    alternative storage models to come up with a more optimized scheme for our data to make queries run faster, and not just a little bit faster. I mean orders of magnitude faster.

    I mean, think of, in the early days of column stores, I had a

    07:06

    friend who worked at one of them, called Vertica. He told me there was some Australian telephone company that had queries that would take days, and when they switched over to a column store system, which we'll talk about today, it went to like minutes.

    Their minds were blown, and we'll see why that's the case here. So, we'll talk

    07:21

    about different storage models, and then we'll talk about techniques to make this run even faster by compressing the data. So this is a bit of a repeat of what the Single Store guy talked about, but I want to put it in my own words.

    There's roughly three categories

    07:36

    of database workloads. There's OLTP, OLAP, and the last one is HTAP, hybrid transactional and analytical processing.

    OLTP is what people typically start with. They're building a brand new application; like, if you're a brand new startup, you don't have any data.

    You're going to build some kind of app that people can connect

    07:52

    to and interact with, and then you'll be ingesting new data and updates from the outside world. That's an OLTP application, right?

    You're taking new data from the outside world, but the amount of data that you're taking for every single operation, for every single query, is going to be small. Like, think

    08:10

    of like posting a message on Reddit, a comment on Reddit, like you typing in a box and hitting send. The amount of data they're storing for that one comment request is actually not that big.

    Like your username, a time stamp, and then the text field. Or think like, you know, buying

    08:26

    something on Amazon. You put something in your cart that updates the database.

    It's like your user account information, a timestamp, the item you want to buy, and then the quantity. It's not that much data.

    But we're going to do a lot of these, and we want them to run very, very quickly.

    08:41

    OLAP is once you have a lot of existing data and you want to run analytical operations on it. Basically asking questions about that data to extract new knowledge based on the aggregate data that you have.

    So going back to the Amazon example, OLTP

    08:57

    would be like someone buying one item, right? For every person buying one item, that's a small amount of data per person, but the total amount of data that they're generating across all the orders, across all people using Amazon, is actually quite a lot.

    So then now when you want to run analytical

    09:12

    queries, you can answer questions like: what's the number one item bought in the city of Pittsburgh on this day in September when the temperature was above this or that degree, and so forth, right?

    That's an analytical query because you're asking a question that looks across multiple entities or multiple

    09:29

    entries in the database or in the tables. And so these queries are going to be more complex than the ones we do in OLTP, right?

    They'll be joining multiple tables, doing window functions, aggregations, all the kind of stuff you did in homework one. You typically see them in OLAP workloads.

    09:47

    HTAP, or hybrid transactional and analytical processing. This is like the holy grail for systems, but it doesn't always pan out. You want to be able to do fast transactions, fast OLTP operations.

    You can ingest new data very quickly, but then you also want to ask questions about that data where it

    10:03

    exists in that database without slowing down the rest of the workload. Single Store is trying to push this.

    Uh, PingCAP's TiDB is another one doing this. This can be a bit hit or miss, because oftentimes it isn't the best OLTP system and it isn't the best

    10:19

    OLAP system, and if you're in a big company, the people that care about OLTP want the best one, so they'll pick something else, and the people that want the best OLAP system will pick something else. Right, but this is a market segment that does exist. AI is kind of separate from this,

    10:34

    and it's kind of what comes after OLAP. So with OLAP you can ask questions about what was this, what was that, based on the data you have. The AI stuff that's starting to come out is starting to ask questions about why things were a certain way, like why was this item the number one bought

    10:51

    item in Pittsburgh at this date and time, right, and what aspects of that make it interesting for people in a certain demographic, and so forth. Those are the kind of things AI cares about, and we'll see how we handle that later on when we talk about vector indexes,

    11:08

    but it smells enough like OLAP that we don't have to show it as a separate category here. So a really simple quad chart to understand the distinction between these categories of workloads is looking at whether the workload is write-heavy versus read-heavy, and whether the operation or the query that you're

    11:23

    trying to execute in the database system is going to be simple or complex. And complexity is not really a scientific term or something we can measure a query on; it's usually something like how many tables does it join, does it do aggregations, how many GROUP BYs, how many nested queries, right?

    11:40

    In OLTP, queries are pretty simple, like SELECT * FROM users WHERE username = 'Andy'; you're going to get like one thing back, doing a lookup on an index. So you can roughly divide the workloads like this. So on one end, in the

    11:56

    bottom corner you have OLTP. These workloads are considerably more write-heavy than OLAP because again you're ingesting new data from the outside world, although most of the time, even in an OLTP system, most of the operations are read queries. And the queries you're actually running are pretty simple, like go select Andy's bank account information, go

    12:12

    get, you know, Andy's orders from the last year. OLAP is at the other end up here.

    Uh, and those workloads are typically almost always more read-heavy and the queries are more complex. So OLTP is an older term.

    It goes back to like the 80s. OLAP came out

    12:29

    of the 90s. It was actually invented by this very famous database researcher called Jim Gray.

    He won the Turing Award for databases in the 1990s. He invented a lot of stuff we're going to talk about in this class,

    like two-phase locking and other techniques as we go along. Um, he famously got lost

    12:46

    at sea in 2007 when he was sailing his boat out of San Francisco Bay, and he was so famous they actually retasked satellites to go back over the bay to look for him or his ship, and they never found him. So he's considered lost at sea.

    Truth is, though, when he invented this OLAP term he was actually doing it at the behest

    13:03

    of another company that was trying to sell a new product and say, hey, look, here's this brand new category of systems called OLAP. So he wrote a paper like, "Hey, look, there's this thing called OLAP.

    This is why it matters." And then when people found out he got paid to write the article, it got retracted. Uh, other than that, other

    Uh other than that, other

    13:18

    than that little snafu, I never met him. He was considered a good guy though.

    All right. And then HTAP would be somewhere here in the middle, right?

    Again, this is not a scientific chart. It's trying to show how to roughly think about the different workloads.

    So for today's class, we're going to use an example actually from Wikipedia. So

    13:34

    this is actually based on the Wikipedia schema. It's open source.

    It's written in PHP. They run off MySQL.

    It's one of the largest MySQL clusters around, I think. Actually, they might have switched to MariaDB by now, after Oracle bought MySQL.

    But this is roughly what the schema looks like. So, we're going to have a user account table.

    Again,

    13:50

    that's your login information. We'll have articles or pages that define, like, you know, here's an article on Star Wars or whatever.

    And then we'll have a revisions table where we're going to put all the new versions of the pages that come along, right? Or new versions of an article, right?

    14:07

    And what I'm showing here is that the content is inside of this revisions table; in the real schema they actually put it in a separate table, right, because they don't want to have a giant blob in the page row itself, but we can ignore that. But you can see here there are foreign key references all around.

    14:24

    Wikipedia is a sore subject for me because a few years ago somebody actually wrote a Wikipedia article about me, but then they put in this line that I used to have in my bio somewhere, which said, "Well, I was born and raised in the streets of Baltimore," which is true.

    I was born in Baltimore, uh, but obviously not in the streets. Um, and

    14:42

    then Wikipedia flagged that and said "in the streets" is not scientific or encyclopedic enough. And then it got flagged, or it got taken down.

    And whatever. Since then, I get emails basically every three months from these scammers saying, "Hey, give us money.

    We'll

    14:57

    write your article for you." Right. Um, all right.

    So, if anybody's watching this, was I born in Baltimore? Yes.

    Was I born in the streets of Baltimore? No.

    Right. Uh, that should go without saying.

    Okay. Yes.

    15:19

    The question is why are we making a distinction right now between OLAP and OLTP? At this point I'm only contrasting the workloads, and then we'll see how we'll design a system for those different workloads.

    And then at the end of the class, we'll finish up and say, okay, well, do I have separate

    15:36

    systems, and if so, how do I get data from one to the other, or can I have a single system handle both? So, we'll cover that.

    I think that'll answer your question. Any other questions, other than about me being from Baltimore?

    Okay. So, again, I said this at the

    15:53

    beginning, the relational model is beautiful because it says, hey, this is what the data should look like, how to represent it at a high level. It doesn't say anything about how we actually store and represent that data as physical bits or bytes on the underlying storage

    16:09

    uh medium. Right?

    So there's nothing about the relational model that says, just because you declare a table and the table has these attributes, those attributes have to be stored contiguously on a disk page or in memory. In fact, we already saw that we could break that

    16:25

    notion when we had the oversized values, right? We talked about the TOAST table, the external storage, like in Postgres: if I have a really large text field, I don't store that in my slotted page; I have a separate page where I store it and then store a pointer to it. And now when I run queries on it, underneath the

    16:42

    covers the database system is going to stitch that back together by following the pointer, but as the application you don't know that things have been separated. So it turns out for some workloads, particularly OLAP workloads, storing things contiguously is actually not going to be the best way to do it.

    16:58

    So for OLTP, as I said, the queries are simple and they're accessing a small amount of data, and again, this is what people normally build when they first build a new application; if you're a new startup or whatever new project, this is what you end up building. So here are three sort of simple examples of OLTP queries on the

    17:14

    Wikipedia schema. So one is go and get the latest revision for a given page by looking up the page ID.

    So there'll be an index on page ID because it's the primary key, right? There'll be an index on that and we can do a lookup.

    Uh, another one is when someone actually logs in. So we'll update the

    17:33

    last login timestamp with the current timestamp and whatever hostname they used to log in, and again we'll do a lookup based on the user ID. So again that'll be another index lookup that's very fast.

    And the last one is like inserting a new entry into the revisions table. Right?

    So if we store things contiguously

    17:50

    for a single tuple, well, most of these queries, all these queries here, are touching, you know, the bottom two are touching one tuple, and the top one can touch two tuples, because for a given page there's whatever the latest revision is. An OLAP query looks

    18:08

    something like this, where I'm trying to count the number of times someone has logged into the service where their hostname ended with the .gov

    18:23

    address or namespace, right? This is actually an example from a real scandal 15 years ago where a bunch of Congresspeople were having their staff, who are on government salaries, log into Wikipedia and update the entry for the Congressperson

    18:39

    saying like, "Look how great they are. Here's them doing, you know, charity work, right?" Right?

    So you're basically paying government employees to go put out propaganda on Wikipedia. So you could use this query.

    This is basically, roughly, the query they used to find it. So in this example here, now we're not looking at a single

    18:54

    record. We're not looking at one person logging in.

    We're looking at all the login entries to try to find a match here. Right?

    So there were a bunch of people who got busted for this. Joe Biden did this.

    Uh Mike Pence got busted for this. Right?

    19:10

    Basically all of them were doing it. So the thing we haven't talked about yet is the storage model, which again is the way we're going to physically represent the data that's in our database tables.

    And it's basically going to tell

    19:26

    us how we're going to organize things either in memory or on disk. And as I said, based on the workload, the different storage models we could use are going to have wildly different performance, because it depends on what data is actually needed and how it's

    19:41

    going to be used by the query. So there are two basic approaches.

    The N-ary storage model, or row storage. This is probably what we're most familiar with and this is what we've been seeing in our examples.

    And then the column store is the one we've been alluding to, where instead of storing things row-

    19:57

    based, we store them column-based. We'll see why that matters.

    And then the last one is this hybrid storage model, also known as PAX. This is the way most systems actually implement their column stores.

    So if a system comes out and says, "Yeah, we're a column store," they don't actually mean the second one. They mean this third one here.

    And so I'll

    20:14

    explain what that looks like, what the difference is. So the N-ary storage model, or NSM, is the academic term, but if you say "I have a row store," this is what you mean.

    Everyone will understand what you're talking about. This basically says that for a given tuple in our database, in

    20:30

    a table, we're going to store all its attributes contiguously within a page, assuming they fit; otherwise we have to spill to an overflow page or oversized page, but that's a separate issue. The idea is that all the data for a given tuple is stored right next to each other within that page,

    20:47

    and the next tuple that we're storing in the page doesn't start until we finish writing out the data for the current tuple. Right?

    And this is going to be great for OLTP workloads, because the kind of queries we're going to run are either update queries where we need

    21:03

    to have the whole tuple anyway, or inserts where we're inserting the tuple in its entirety, or select queries that are only accessing one tuple and oftentimes need all the data within that tuple. So you want all the attributes.

    21:20

    So I'm going to go through this quickly, but this is basically the slotted page that we saw before. I'm showing this so we see a distinction later on when we talk about the column store.

    So within our database page, we have our header. Then we have the slot array at the top.

    And again, now as we scan down through

    21:35

    the data we want to store up above, right, we're just going to write it from the end toward the beginning and add pointers and offsets for the different locations of the tuples. And we keep going as we scan down, over and over again.

    Right.

    21:52

    Right. So this is just rehashing the things we talked about before: you keep going until you fill it up and then you run out of space.
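
As a rough illustration of the slotted-page layout just described (a header, a slot array growing from the front, and tuple data written from the end of the page back toward the front), here is a simplified C++ sketch. The field names and the 4 KB page size are illustrative assumptions, not BusTub's or any particular system's actual format.

```cpp
#include <cstdint>
#include <cstring>
#include <optional>
#include <vector>

constexpr size_t kPageSize = 4096;  // illustrative page size

// A simplified slotted page: slots grow forward after the header,
// tuple bytes are written backward from the end of the page.
struct SlottedPage {
  struct Slot { uint16_t offset; uint16_t length; };

  uint16_t num_slots = 0;
  uint16_t free_space_end = kPageSize;   // tuples are written just before this point
  std::vector<Slot> slots;               // in a real page this lives in the header area
  uint8_t data[kPageSize] = {};

  // Append a tuple; returns its slot number, or nothing if the page is full.
  std::optional<uint16_t> Insert(const uint8_t *tuple, uint16_t len) {
    size_t header_bytes = sizeof(uint16_t) * 2 + (num_slots + 1) * sizeof(Slot);
    if (free_space_end < header_bytes + len) return std::nullopt;  // out of space
    free_space_end -= len;
    std::memcpy(data + free_space_end, tuple, len);
    slots.push_back({free_space_end, len});
    return num_slots++;
  }

  // Follow the slot array to jump to a tuple's bytes.
  const uint8_t *Get(uint16_t slot) const { return data + slots[slot].offset; }
};
```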

    So let's go back to one of these queries on the Wikipedia schema. So say here I'm trying to get the user information for one account.

    So this is somebody logging in. So I'm going to pass in the

    22:10

    username and then like a hashed password, right? Whenever you see data breaches and they say your password got leaked, it's because they stored raw passwords unencrypted in the database.

    Don't do that. Do

    all that encryption stuff outside in the application and then the database just stores the encrypted data,

    22:26

    right? But this is basically the query you would run.

    So now for this username, I'm going to have some kind of index that I'm going to use to help me find the record ID that I want. Now I'm not going to explain what this index is.

    That's in two weeks, when we

    22:42

    start talking about indexes. But think of this thing like a hash map or a B+ tree or some data structure that says, for a given username, here's the record ID of where to go find it.

    And then my database system is going to go look in its page directory and say, "Okay, well, you want record one, two, three, four. That's on

    22:57

    page five. That's its location in my database file." I bring that into memory and I know how to then use the slot array to jump to the offset that I want.

    And lo and behold, there's the tuple that I was looking for, right?

    23:13

    And I'm happy. And now if I want to do an insert, same thing: I have some free space map that says, okay, here's a page that has free space.

    All I need to do is make sure that page is in memory, jump to some offset, and just write it out.
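
To make that lookup path concrete (index gives a record ID, the page directory says which page to fetch, and the slot array inside the page gives the tuple), here is a hedged C++ sketch. The maps standing in for the real index, page directory, and buffer pool are simplifications for illustration, and the disk read itself is elided.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// A record ID names a page and a slot within that page.
struct RecordId { uint32_t page_id; uint16_t slot; };

// Minimal stand-in for a slotted page already loaded into memory.
struct Page {
  std::vector<std::vector<uint8_t>> tuples;                    // tuples addressed by slot number
  const std::vector<uint8_t> &Get(uint16_t slot) const { return tuples[slot]; }
};

// Illustrative stand-ins for the real index, page directory, and buffer pool.
std::unordered_map<std::string, RecordId> username_index;  // key -> record ID
std::unordered_map<uint32_t, uint64_t> page_directory;     // page ID -> offset in the DB file
std::unordered_map<uint32_t, Page> buffer_pool;            // pages currently pinned in memory

// Fetch a page; on a miss, a real system would consult page_directory[page_id],
// read the page off disk at that offset, and pin it in the buffer pool first.
Page &FetchPage(uint32_t page_id) {
  return buffer_pool[page_id];
}

const std::vector<uint8_t> &LookupUser(const std::string &username) {
  RecordId rid = username_index.at(username);  // 1. index lookup -> record ID
  Page &page = FetchPage(rid.page_id);         // 2. page directory + buffer pool
  return page.Get(rid.slot);                   // 3. slot array -> tuple bytes
}
```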

    23:30

    But now let's do that OLAP query from before, where we count the number of times people have logged in per month from a .gov hostname. So in this case here, there isn't a lookup on any single primary

    23:47

    key or discriminating key. I've got to look at all the pages because I don't know where any .gov hostnames are going to be.

    So I'm going to scan all the pages and bring them in one by one. Again, this is a row store.

    So the first thing I've got to do is run my WHERE clause where

    24:04

    I say find me all the hostnames that end with .gov, right?

    So I've got to look at all these values in here, and then once I find matches I've got to run my EXTRACT clause to convert the last login into the month

    24:20

    name. So then I can compute my aggregations.

    So then I've got to look at, you know, this data over here. So what's the problem with these other three attributes?

    24:36

    Did this query need them? No.

    Do we have to read them from disk and put them into memory? Yes.

    So because this is a row store, whatever attributes a tuple has, they come along for

    24:53

    the ride because they're in the page with all the other data that I do need. Right?

    So in this case here, I'm not showing the size of the data, but you could roughly say about 50% of

    25:09

    the data for a tuple was actually not needed for this query at all, but I still had to go to disk and read it. So this is why row stores aren't going to be great for doing analytical queries, because most of the time in analytical queries I don't need all the attributes.

    You don't

    25:24

    see SELECT * queries in analytical workloads very often, right?

    You shouldn't run SELECT * anyway, but in general, you don't really see that unless someone's trying to do a bulk export, and even then there are faster ways to do that than SELECT *. So for

    25:40

    row stores, they can be great for fast inserts, updates, and deletes. They're great for queries that need the entire tuple:

    I need all the attributes for a small number of tuples. And I can do a bunch of tricks like index-organized storage and all the clustering techniques we'll cover later on to make this go really fast, so that I'm only

    25:56

    reading the exact single page I need to go get the data that I want. There are other tricks: you can put a bunch of data in the index as well.

    So maybe I only need to traverse the index, even if I'm not using index-organized storage, and all my data is in the index. I don't even need to touch the tuples.

    We'll

    26:12

    cover that later. This is going to be bad for analytical queries because we'll have a lot more pressure in our buffer pool because now we're copying in pages with data that we don't even need for this particular query.

    So the alternative is the decomposition

    26:27

    storage model. Think of these as two opposite ends of the spectrum.

    So the N-ary or row storage puts all the data for a tuple contiguous with each other; the decomposition storage model is the academic name for the column store. We're actually going to split up all the

    26:43

    data for each tuple and store the attributes separately. But now within our pages, we're just going to have the data for one attribute within the table.

    All of those are going to be contiguous across multiple tuples. So this is going to be better for

    26:59

    analytical queries, because if I'm only going to access a small number of the attributes for a given tuple, I don't bring in pages that have data that I don't need. I only get exactly the data that I do need.

    27:14

    Of course, now what's the challenge of this? What's the problem?

    Faster reads, slow for what? Writes.

    Because what do I need to do? If I insert a tuple and I have a thousand columns, I've got to break it up into a thousand different pieces and write to a thousand different pages.

    27:30

    And again, he was sort of asking about this. We'll see how to handle that later in today's lecture.

    So let's go back to our table from before. So essentially all the data within each column is going to be fixed-length

    27:47

    values stored in basically simple arrays. So I'm going to take this first column, A.

    I'll have my first page. There'll be some header that says what's in the page.

    And then I'll have a null bitmap that's going to correspond to all the data within

    28:04

    this column for all the tuples. Right?

    Remember, in the row storage model every tuple had its own header, and inside that header it had its own little bitmap to say whether certain values are null or not. So now I don't need to store a header per tuple.

    I just

    28:20

    have a header for the single column and that's enough to account for all the data within it. And I'll do the same thing for the next column, and then the same thing for the next column.
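
A minimal sketch of what one of these column pages might look like, with a single header and a single null bitmap covering every value of the attribute instead of a header per tuple. The layout and names are illustrative, not any specific system's format.

```cpp
#include <cstdint>
#include <vector>

// One page (or column chunk) holding a single attribute for many tuples.
// One header and one null bitmap cover all values, instead of per-tuple headers.
struct ColumnPage {
  uint32_t num_values = 0;
  std::vector<uint8_t> null_bitmap;    // 1 bit per value
  std::vector<int64_t> values;         // fixed-length values in a simple array

  void Append(int64_t v, bool is_null) {
    if (num_values % 8 == 0) null_bitmap.push_back(0);
    if (is_null) null_bitmap[num_values / 8] |= (1u << (num_values % 8));
    values.push_back(is_null ? 0 : v);  // store a placeholder for NULLs
    num_values++;
  }

  bool IsNull(uint32_t i) const { return null_bitmap[i / 8] & (1u << (i % 8)); }
  int64_t Get(uint32_t i) const { return values[i]; }  // offset i = tuple i's value
};
```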

    Right? So we go back now to our example before.

    28:35

    So we take our Wikipedia data, right, the login information, the user information, and again instead of storing it as a row store where all the data is contiguous across the tuple, we're now going to split it up so that all the hostname data will be

    28:50

    in one page, all the login information will be in another page, all the password information will be in another page, right, but now they're all separate pages. So now if I run that query from before, where I'm trying to find all people logging in with a government hostname, the first thing we need to do is, okay, I've got to run this WHERE clause here

    29:07

    and I need to look at the hostname attribute. So, I just go get the one page or the set of pages that have that hostname information, bring that into memory in my buffer pool, and then rip through that and find all my matches.

    29:23

    Again, I'm not jumping over data that I don't need. I'm not jumping over uh username information or the user ID.

    The pages I'm bringing in only have the host name. So I can do that very efficiently and rip through that very fast.

    So then I collect which tuples match my

    29:40

    predicate and say, okay, the next thing I need to do is compute the aggregation and run the EXTRACT function on the last login attribute. So I go fetch the page that has that information, and now I know how to jump to these different offsets.

    I'll

    29:56

    explain how to do this in a second, to find the data that I know matched my hostname from the WHERE clause, and then here are all the tuples that I wanted. Yes?

    Does this mean that the size of this can be

    30:13

    different sizes? The question is, does this mean the page sizes in this scheme or storage model will be different per attribute, or different per page itself, or what do you mean?

    30:29

    show each column. Mhm.

    30:51

    He's basically asking: would it be the case that the database page size for an attribute within a table will be different across attributes? No.

    It always needs some fixed size, because we want to put that in our buffer pool. If a buffer pool has, you know, 8

    31:07

    kilobyte frames (actually, in the OLAP world you want to go bigger, like 32 megabyte frames),

    I don't want to have to start figuring out how to pack things in from different page sizes. So in general, these will all be the same page size.

    Now that means of course some some

    31:22

    attributes will have more pages and that's fine. Yeah.

    So how do I do this magic of going back here? How do I figure out which tuples are actually matching based on

    31:38

    the predicate? Well, there are two basic ways to do this.

    One is to use fixed-length offsets, where, if I'm scanning through the data and I find matches for my WHERE clause, I keep track of where my offset is in the

    31:54

    array. So I'm at offset two, offset five, offset six.

    So then when I go look at another attribute, I look at its pages; again, it's just an array of fixed-length values, so I know how to do simple arithmetic to jump to the offset that I

    32:10

    want, to find the corresponding tuple that matched whatever the previous operation was. The other approach is you actually embed tuple IDs in the columns themselves.

    So think of like for every

    32:26

    single value I have a little 8-bit or maybe 16-bit tuple ID, so that I then have some kind of other index that allows me to jump to find the data that I'm looking for. So the values could be variable-length if you wanted, but then I would have to have an index to map the IDs to them.

    32:44

    I'm only showing the second one here just to say you could do this. Somebody did try to do this.

    Don't do this. Um, this is what DATAllegro did.

    They basically took Ingres, this was in the 2000s, and kind of

    33:00

    hacked it up to make it look like a column store, but because it was a row store, not a true column store, they had to play this game to make it actually work. And then Microsoft bought them for like $200 million.

    And then within three months, they realized that was a huge waste of money and threw all the code away and started building the

    33:16

    Parallel Data Warehouse as a real column store. Actually, Jignesh Patel, the other database professor here, his adviser

    was Dave DeWitt. Microsoft hired him to go take the guys that were doing this and make it actually be production ready.

    And he's like, "This is total crap." He told them, "You're wasting your money." And they

    33:31

    threw it all away, right? So don't do this.

    I'm just showing that you could do this. We want to do this one over here, the fixed-length offsets.

    But now the question is going to be, okay, well, it only works if things are fixed-length. For integers, that's fine.

    Floats are fixed-length. Fine.

    33:47

    Why is it so bad? The question is why are embedded IDs bad?

    Because now I have to store an extra number for every single value that I'm actually storing. Then I need to maintain a separate index in my file that says here's how to jump to, you know, offset three because these things can be whatever arbitrary length that

    34:03

    they want to be. Sorry to know where to jump.

    Because if I'm doing a match, if I'm scanning through, say, column D and I find a match here, I need to know, okay, well,

    34:19

    this is offset three, so that when I jump over to column A, I know how to get to offset three, which allows me to stitch the tuples back together. Don't do this.
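
Here is a small C++ sketch of the fixed-length offset approach the lecture recommends (not the embedded-ID scheme): scan one column, remember the offsets that matched, then use the same offsets to pull the corresponding values out of another column. The column names and predicate are made-up examples.

```cpp
#include <cstdint>
#include <vector>

// Each column is a plain array of fixed-length values; position i in every
// column belongs to tuple i, so a matching offset in one column can be used
// to fetch the rest of that tuple from the other columns.
struct LoginColumns {
  std::vector<int32_t> hostname_code;  // dictionary-encoded hostnames (fixed-length)
  std::vector<int64_t> last_login;     // epoch seconds
};

// Pass 1: evaluate the predicate on one column, keeping the matching offsets.
std::vector<size_t> MatchOffsets(const std::vector<int32_t> &col, int32_t wanted_code) {
  std::vector<size_t> offsets;
  for (size_t i = 0; i < col.size(); i++) {
    if (col[i] == wanted_code) offsets.push_back(i);
  }
  return offsets;
}

// Pass 2: reuse the offsets to pull another attribute for the matching tuples.
std::vector<int64_t> FetchLastLogins(const LoginColumns &cols, int32_t wanted_code) {
  std::vector<int64_t> result;
  for (size_t off : MatchOffsets(cols.hostname_code, wanted_code)) {
    result.push_back(cols.last_login[off]);  // simple arithmetic: same index, other column
  }
  return result;
}
```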

    All right. So again, if we require all the values in our column to be fixed-

    34:34

    length, how does this work if we have strings?

    Well, we'll cover more about this in a few more slides, but basically the trick is we're going to compress the data and convert anything that is variable-length to fixed-length, so we don't have this problem.

    So, that means any

    34:50

    string is going to be converted to a fixed-length number, and then we'll have basically a dictionary lookup to say, okay, for this given number, what's the original value? So, we can store that as an auxiliary data structure on the side, but then as we rip through our columns, we're only looking at integers and that's going to be

    35:06

    way, way faster. The alternative is you could just pad out the strings: if it's a 32-character varchar, just always put a bunch of extra spaces at the end if it's not 32 characters so it always fits, but that's going to be wasteful as well. Instead the answer is

    going to be dictionary compression; again, we'll cover more about that in a few more slides. So to round out the DSM, to summarize it: the benefit we're going to get is that we're going to avoid wasted IO because we're only going to get the bare minimum data we actually need to

    going to be dictionary compression again we'll cover more about that in a few more slides so to round out the DSM to summarize it all right the benefit we're going to get is that we're going to avoid wasted IO because we're only going to get the the bare minimum in data we actually need to

    35:37

    process the query. We're going to bring that in and not bring in any other attributes that we don't care about.

    The queries themselves are actually execute faster because now as we're ripping through the the columns, we're kind of just looking at contiguous

    35:54

    data like as as strides in memory. And that's be great for modern CPUs.

    I know we said we we're going to talk about CPUs too much, but like Intel CPUs or modern supercaler CPUs, they don't want any branches. They don't want any jumps.

    So if I can avoid if clauses, if I have to jump over things,

    36:10

    my CPU is going to be way faster because I'm not going to have stalls in my pipeline. So if all my data now is all contiguous and it's all in the same value domain, like I'm just looking at like looking at integers, then I I can rip through that very very quickly and we'll see how we handle that uh in

    36:26

    lecture 14. Furthermore, I can actually compress the data much better than I can in a row store because now all the data within a page are going to be in the same domain.

    Right? It's going to be like think think of like the the if I'm recording the temperature for this room if I just have

    36:42

    my data pages just record temperatures. They're a bunch of integers.

    I can compress the hell out of that way better than I could if like it had integers and strings and floats and a bunch bunch of random stuff. Like it's thinking like trying to compress a JPEG doesn't if you try to run JPEG a JPEG through like gzip or whatever doesn't do a very good job

    36:58

    because it's kind of a bunch of random data that's already been compressed. But if you open a text file and put a bunch of zeros in it and run that through gzip, it's going to compress the hell out of it.

    That's basically what we're going to get with our column store. Of course, nothing come nothing nothing's for free.

    It's be slow for point queries doing single uh single

    37:14

    tupal lookups because now I got to stitch all the data back together from multiple pages. Insert up deletes are going to have the same problem, right?

    Because you basically have to be do this stitching and and splitting over and over again. Okay, any questions about column stores?

    37:35

    So this is going to be great if we have queries on tables that are very wide, meaning they have a lot of columns, and I only need to touch a bare minimum number of attributes

    37:50

    within that table. The best-case scenario is an OLAP query that touches one column, right?

    But truth be told, those kinds of queries are not that common. And so at some point, I'm going to have to stitch my tuple back together.

    And so if my data

    38:08

    is completely broken apart across multiple pages, if I break everything up as columns, then even though I may filter out a lot of data during the query, say I have a billion tuples but I end up only needing a thousand, I may have to go get,

    38:24

    for each of those thousand tuples, a thousand different pages to put that tuple back together again. That's if I break everything up as completely as a pure column store.

    So what we really want is something that kind of smells like a column store. We get that benefit of not bringing in

    38:39

    pages that we don't need or looking at data that we don't need. And the data that is in the same column or same attribute is close together,

    so we get great compression. But if I have to go get data from the other attributes for that tuple, they're relatively close by, where I

    38:55

    either have already fetched them into memory or it's not a big leap for me to go get them. So this is what the hybrid storage model called PAX supports.

    So PAX stands for Partition Attributes Across. Uh, this was actually

    39:12

    a paper that came out, I think in 2001. The first author is Anastassia Ailamaki; she was the database professor here at CMU before I started. She's brilliant. She left for Switzerland in 2007 and then I was hired to replace

    39:29

    her. Um, so the paper looks like it came from CMU, but this was actually the last chapter of her thesis, which she did at the University of Wisconsin with Dave DeWitt, the guy who, as I just told you, told Microsoft they wasted $200 million buying that company. So he was

    39:45

    her adviser. Uh he's brilliant.

    Anyway, the idea of this approach is that it's going to look like a column store, but the data for a single tuple is going to be close enough to each other

    40:01

    that I don't have to make a very long jump; I may have already brought that data into memory even though I may not need all of it. So again, most systems, when they say they're going to use a column store, almost 99% of the time they're

    40:18

    doing this. I can only think of one or two systems that don't do this, but everyone else is doing this.

    And then there are these open-source file formats like Parquet and ORC that you might be familiar with. And now there are modern versions of these, like Nimble from Facebook or Vortex from the SpiralDB guys,

    40:35

    and these are all essentially doing this. So if we go back to our diagram here again, what we're now going to do is horizontally partition or horizontally split up the data based on some criteria,

    40:53

    like the number of tuples or the size of the data, right? There's some way to say, okay, here's going to be some chunk of data, or sorry, some region of data, and now we're going to write the data out within that region

    41:08

    in a columnar format. But again, within that region all the data for a single tuple will be located close to each other. So when you create this PAX file, you end up writing the header at the end, and we're going to

    41:24

    do this because, think of it, we're trying to generate, you know, a one gigabyte file, so I don't actually know what all the data is going to be before I actually scan it. So I'll keep track of things in memory and then when I'm finished I write out the footer that has all this metadata.

    Right? So I'm showing it appearing first.

    41:41

    This is actually the last thing you would write when you create one of these files. So first, again, I'll break up the data: say the first three tuples I'll put in some chunk.

    I don't want to call it a block of data; I'm trying to use generic terms. In Parquet

    41:57

    they call it a row group; that's considered the canonical term. Basically it's just some set of tuples I'm going to put together in a single, I don't want to use the word cluster, a single region within the file. Again, in Parquet it's, I think, like one million tuples; in ORC it's by default like

    42:14

    100 megs of data. There are different criteria to decide how you partition these things up. So again, within my row group I'll have another header that says here's all the data within it, and then inside of it you'll see I put all the data for column A together,

    42:29

    all the data for column B together, and all the data for column C, right? And these column pieces within the row group are called column chunks. And I'll do the same thing for the next row group, right? And each row group is going to

    42:45

    have its own metadata that keeps track of things like I'm compressing this way, or here's the dictionary you can use to decompress this data. Right? So if you go look at any presentation from Databricks or Snowflake or other companies, right,

    when they talk about Parquet organization, it's roughly going to look like this.

    43:01

    And they're talking about PAX, right? Is this clear?
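
A rough C++ sketch of the PAX-style file layout described above: the table is split horizontally into row groups, each row group stores its columns as contiguous column chunks, and a footer written last tells readers where everything is. The struct names follow the Parquet-style terms used in the lecture, but the layout itself is simplified and illustrative.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// All values for one attribute within one row group, stored contiguously.
struct ColumnChunk {
  std::string column_name;
  std::vector<uint8_t> data;          // encoded/compressed values for this column
  uint32_t encoding = 0;              // e.g. dictionary or RLE (illustrative field)
};

// A horizontal slice of the table (e.g. ~1M tuples or ~100 MB), laid out column by column.
struct RowGroup {
  uint64_t num_rows = 0;
  std::vector<ColumnChunk> chunks;    // one chunk per column of the table
};

// Footer metadata, written last once all row groups are known,
// so a reader can jump straight to the row group / column chunk it needs.
struct FileFooter {
  std::vector<uint64_t> row_group_offsets;  // byte offset of each row group in the file
  std::vector<std::string> schema;          // column names / types (simplified)
};

struct PaxFile {
  std::vector<RowGroup> row_groups;   // written first, in order
  FileFooter footer;                  // written at the end of the file
};
```

A reader that only needs one column can consult the footer, jump to a row group, and read just that column chunk, which is the access pattern the lecture describes.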

    So again, don't think of these like the page sizes we talked about before, where it was always some multiple of four, like 4 kilobyte, 8

    43:17

    kilobyte, 16 kilobyte pages. In the OLAP world, I'm reading large amounts of data.

    So these things are usually organized in sizes like S3 buckets, like 8 megabytes or larger. I'm trying to get a lot of data all at once.

    And so, even though I'm saying

    43:32

    this row group here might be kind of big, much larger than 4 or 8 kilobytes, and maybe I'm bringing in data that I don't actually need, it's still going to be a big win if I have to stitch things together. [Student question about deciding the number of rows in a column chunk.]

    The question is what goes into deciding

    43:48

    the number of rows in a column chunk is. Or maybe it's the number of rows in a row group.

    I said the default in Parquet is like 1 million, and that's just whatever they picked, and then I think ORC goes by data size. So as I'm writing it out, if I wrote out like 100 megs, I say, okay,

    44:03

    that's it, make a new row group. But you can tweak those.

    Yes. Yes.

    44:22

    [Music] Your statement is that the row group metadata is wasteful. Yeah.

    The statement is that the

    44:38

    metadata across row groups could be repetitive and then we have a lot of wasted data, for example for dictionaries. Yes.

    Um, but the metadata is not that big compared to the rest of the data. Like if it's a kilobyte and your data is like

    44:54

    100 megabytes it's nothing. Yes.

    Yeah. Yeah.

    Good. Good.

    Okay. So, the statement is and he's correct.

    This is my fault. So, I'm saying this is a PAX

    45:11

    file organization. It is not page organization.

    Uh, yeah. So you could think of it like, here's a file, and then within that I have different pages, and those pages will all be sort of fixed-length, right?

    Within the row group, the

    45:27

    metadata stuff at the bottom is usually whatever size it is, but you might still organize that in four kilobyte pages. But yeah, think of it like I'm not defining what the size of the pages is within a column chunk within a row group, but at the end of the day when it lands on disk, it's still

    45:43

    going to be four kilobytes, but the database might perceive it as, I think, eight megabytes or one megabyte or something like that, a larger size. Everything still ends up being split into pages.

    Yes. But the way systems typically address it in a Parquet file, it would be at

    46:00

    the row group level. So then you have to read the row group header, because that's going to tell you what the offsets are for the column chunks.

    So you can do things like, if I only need column A from this row group, I could go read this,

    46:16

    that tells me where the starting point is for each row group. And then I could jump to get the header for that row group.

    And within that I can then get the offset to the column chunk that I want. You could store that down here but then that makes this thing kind of bigger.

    Now we're getting implementation

    46:33

    details, but in the case of Parquet this is implemented as Protobuf, and with Protobuf you can't deserialize incrementally. You have to deserialize the whole thing.

    So if I have all my metadata inside this and this thing's like 10 megabytes, I've got to decompress 10 megs even though I only need like 10 bytes out of it. That's in the

    46:48

    weeds. Yes, the newer file formats replace Protobuf with FlatBuffers from Google.

    That's the right way to do it. Okay.

    All right. So, let's try to plow through compression real quickly.

    So,

    47:04

    at the end of the day, for this class, for what we're talking about this semester, the IO cost is always the main bottleneck when we actually want to run queries. In the advanced class, we'll start worrying about cache lines and other things, but for this class, we'll assume it's

    47:21

    always disk, right? And especially if you're running in the cloud and the disk is some box running over the network, that can get slow.

    And again, that's going to be a problem. So what we want to be able to do is compress our pages when we write

    47:36

    them out to disk, so that when we read them back in compressed form the amount of useful data we're getting per IO increases. So if a page is uncompressed I can put 10 tuples in it, but if I compress it I can put 100 tuples in it.

    Now for the same IO

    47:53

    with compressed data I can get 100 tuples and bring them into memory, and I have the memory space to deal with it.

    In some cases you actually don't need to decompress it. We'll see how to handle that.

    Um, but the main trade-off we're going to face here is sort of classic computer science: you know, speed versus

    48:10

    storage, or compute versus storage. So, I can pay more CPU cost to compress my data and decompress it when I need to access it, in exchange for having lower storage overhead and lower IO cost to get data in and out.

    48:25

    So the things we're going to want in our compression scheme in our database system are: number one, we have to make sure that our compression scheme is always going to produce fixed-length values, because we need that in order for the offset mechanism to work to address

    48:40

    tuples within a column. Because if I'm taking 32-bit integers and I compress them and now some are 16 bits and some are 24 bits, I have no way to know how to jump to offset one, two, three to find the data that I'm looking for. I'd have to use an index, and I don't want to have to do that.

    So

    48:56

    all our data needs to be fixed length. Variable-length data we're going to store in separate pages or a separate storage pool, and we can run whatever compression scheme we want on that and let it stay variable length.

    We don't care about that. Not entirely true,

    49:12

    but for our class, yes. Another thing we want to be able to do is postpone when we actually have to decompress data in order to use it or give it back to the user as part of the result of the query.

    So that means that

    49:29

    we want to be able to run queries directly on compressed data, without having to decompress it to figure out what's actually inside. And the last one should be sort of obvious: we want to make sure that whatever compression scheme we're using is lossless.

    49:46

    What does that mean? Like lossy versus lossless.

    Yes. Yeah, basically if you compress data and then decompress it, you get the exact same data back.

    The exact same bits. That's considered a lossless scheme.

    A lossy

    50:03

    scheme would be something like MP4 or MP3, right? I can take a video file or a sound file and compress it in such a way that the loss isn't perceptible to humans, through our eyes or our ears, because we can't detect the things it's throwing away. But it's not an exact copy of the data

    50:19

    we had before. So, we don't want to do any lossy compression.

    We only do lossless. You can do lossy compression in database systems, but it's something a human has to tell us to do.

    So the example I always like to use is say I'm recording the temperature in this room

    50:34

    at one-second granularity for an entire year. But maybe a year from now I don't need to know exactly what the temperature was at this exact moment, this given second.

    Maybe I can bucket the data into 10-minute aggregations.

    50:50

    So if I do that and delete the one-second data, that's a lossy scheme, because now I'm throwing away the original data, but for my use case that might be good enough. So database systems will do lossy compression for you, but you have to tell them you want it.

    There's even other techniques where you can actually

    51:06

    keep the data in its original form but then run lossy algorithms on it to compute queries, called approximate query execution. You get estimates of things, and sometimes they give you bounds, but again we'll worry about that later.

    For this class, for this lecture today, we can't lose data.

    51:23

    So there are four different choices for where we can compress things. The first is the block level or the page level, right?

    We're basically taking a page of data and we're going to compress it. At the tuple level, we're going to take the entire tuple itself and compress that.

    And you can really only do this in a row store (NSM). Attribute-level compression takes a

    51:39

    single value within a single column for a single tuple, and we compress that individually.

    We saw that before with the overflow pages for large varchars and text fields, like Postgres will store in this thing called TOAST. It's separate pages,

    51:54

    but then it runs something like gzip on that and compresses it, but only for a single attribute. And the last one is going to be columnar compression, and we'll spend most of our time on that, because that will get the biggest win if we're using DSM or PAX. So pretty

    52:09

    much every system today that's using a column store is going to use this bottom one. Some systems will give you the third one. The second one is pretty rare; I know of one system out of China called Terark. They got bought by ByteDance, and the

    52:26

    guy was unhappy having it at ByteDance and he has a new startup. I forget what it's called.

    Uh, but they do tuple-level compression. His website has him complaining about a parking ticket or something.

    It's in Chinese. I can't read it, but I can post it on Piazza.

    Um, and then block level we'll talk about real quickly. This is rare, but to me this is interesting too.

    52:43

    So for this we're going to use what I'll call a naive or general-purpose compression scheme: pick your favorite off-the-shelf algorithm, gzip, zstandard, LZ4, snappy, whatever. Just take your block of data, run it through the compression scheme, and you get a

    52:59

    compressed block out, and it's going to be variable length. We'll have to account for that in a second, but in general this is just: take whatever data you want and throw it at it. Oracle of course has their own patented compression scheme that nobody else can use, called OZIP. Um,

    53:15

    but again, the trade-off, as with any compression scheme, is how fast we can encode and decode versus how good the compression ratio is going to be.
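    A minimal sketch of this kind of naive block-level compression, using zlib only because it ships with Python; zstandard, LZ4, or snappy behave the same way from the database's point of view:

    ```python
    import zlib

    page = (b"some tuple data " * 500)[:8192]   # pretend this is a page image
    blob = zlib.compress(page, level=6)

    print(len(page), len(blob))        # output length is variable; it depends on the data
    restored = zlib.decompress(blob)   # the DBMS can't peek inside without doing this
    assert restored == page            # lossless: the exact same bytes come back
    ```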

    So let me show you what MySQL does with InnoDB. With MySQL you have compressed pages on disk, and these are

    53:30

    going to be at some power-of-two size (1, 2, 4, 8, or 16 kilobytes). So by default InnoDB uses 16-kilobyte pages, but if you compress them, the compressed page will fall within one of these size buckets.

    So if you compress your page and it comes out at, like, 900 bytes, they'll

    53:46

    just pad it and round it up to be 1 kilobyte, right? Pick the smallest bucket it fits in.

    And then on top of every page, in the header, there's this thing called the mod log. It's like a write-ahead log where you record any changes that you make to the page but that you haven't applied yet, because you want to keep

    54:02

    the data compressed. So now if I need to access a page, just like before, my buffer pool (or the disk manager) goes out and gets it from disk, brings it into memory, and it starts off in compressed form.

    So now if I have a write operation that wants

    54:19

    to make a change to a tuple that's in this compressed page, assuming I have an index to tell me how to get there, then in some cases I don't actually have to decompress the data. Like if I'm updating the login field for someone logging in on Wikipedia, I don't need to know what the old value

    54:35

    was in order to make that write. I just overwrite whatever the old one was.

    So I can put a log record in the mod log that says, hey, for this tuple, update it with this new value for this attribute. I can do a bunch of my writes through the mod log. At some point the mod log gets full and I'll have to decompress the page and apply the changes, but

    54:52

    for some updates I don't have to. But now if I want to do a read... sorry, yes, it's like the write-ahead log. Remember how in the LSM tree we just had these little records that said

    55:07

    update this, put that, right? Same thing. The write-ahead log, which we'll cover later, is basically the set of operations that change the data. Yes.

    Sorry. So does the mod log only exist when you have log-structured storage, or could you also

    55:23

    have a mod log? The question is: does a mod log exist only if you have log-structured storage?

    No. InnoDB is not log-structured storage, but it has it here.

    And we'll see this again when we talk about things like a delta store. It's a common approach where you have

    55:41

    an area where you can put fast inserts of changes without actually applying them to the data, because that's really fast to do, and then you have some background process fix it up later. That's a common approach in systems. Yes? So the mod log cannot exceed a certain

    55:58

    level? You said the word "level", so don't think of log-structured storage.

    Just think of it as: every page has its own mod log, and I can put changes in there, and at some point, if the mod log is full, I have to decompress the page, apply the changes, and clear out the mod log.

    56:15

    It's just a way to absorb updates to a page, when I can, without having to decompress it. The question is: will this stall writes if I have to

    56:31

    apply the mod log? Absolutely.

    Yes. No free lunch.

    Yeah. All right.

    Yes. Yes.

    The question is: is the mod log stored on disk? Yes.

    I want to keep this data compressed as long as I can. So if I'm

    56:48

    going to flush it out, write it out to disk without applying the changes, I'll go ahead and do that, right? Why not?

    Okay, so now if I have to do a read: if the read is like, go get me this value for this tuple, like my example where I was updating

    57:04

    that login field where I just do a blind write without actually reading the data. If the read can be serviced by the mod log, fantastic.

    I just access it there. But if I have to actually look at what the values are inside the tuple, then I decompress it and apply whatever changes I have in my mod log.

    And when I

    57:20

    decompress it, it's always going to end up being 16 kilobytes, because that's the default page size in InnoDB. Then I apply the changes from my mod log.

    I still keep the compressed version around, because if I didn't make any changes, if there were no mod log updates to apply, then I can

    57:37

    just throw the decompressed copy away if I need to free a page. But the idea is, again, if I have to know what's inside of it, then I decompress it.
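    Here's a toy sketch of the mod log idea: a per-page log that absorbs blind writes while the page body stays compressed. The class name, threshold, and storage format are made up for illustration; this is not InnoDB's actual on-disk layout:

    ```python
    import pickle, zlib

    class CompressedPage:
        """Toy sketch of a compressed page with a per-page mod log."""
        MOD_LOG_CAPACITY = 8   # made-up threshold

        def __init__(self, tuples):
            self.body = zlib.compress(pickle.dumps(tuples))  # page body stays compressed
            self.mod_log = []                                # pending (tuple_id, attr, value)

        def blind_write(self, tuple_id, attr, value):
            # A blind write doesn't need the old value, so no decompression is needed.
            self.mod_log.append((tuple_id, attr, value))
            if len(self.mod_log) >= self.MOD_LOG_CAPACITY:
                self._merge()                                # log full: decompress, apply, recompress

        def read(self, tuple_id, attr):
            # Serve the read from the mod log if we can; otherwise decompress.
            for tid, a, v in reversed(self.mod_log):
                if tid == tuple_id and a == attr:
                    return v
            tuples = pickle.loads(zlib.decompress(self.body))
            return tuples[tuple_id][attr]

        def _merge(self):
            tuples = pickle.loads(zlib.decompress(self.body))
            for tid, a, v in self.mod_log:
                tuples[tid][a] = v
            self.body = zlib.compress(pickle.dumps(tuples))
            self.mod_log.clear()

    page = CompressedPage({1: {"login": "2024-01-01"}})
    page.blind_write(1, "login", "2025-02-03")   # no decompression happens here
    print(page.read(1, "login"))                 # served straight from the mod log
    ```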

    Yes.

    57:58

    The question is: is there a way to have the compressed page only decompress the data that you need? In this MySQL scheme, no.

    That would be choice number two, tuple-level compression.

    Like I said,

    58:13

    there's only one system that does that. You basically want to be able to do incremental decompression.

    gzip, Snappy, all those guys can't do that. You have to decompress the whole stream to jump to any one offset.

    There are

    58:30

    algorithms to do that. For tuple-level, that Terark one is the only one that I know of that actually did it.

    I don't think anybody else does. I... hold up, hold up.

    Uh, SQL Server can do it too, but I forget how they do it. And I don't know whether it's on by default.

    I don't think they say how they

    58:45

    do it. So: SQL Server, and this fork of RocksDB called TerarkDB.

    I can follow up on Piazza with links to that. All right, I don't want to spend too much time on this because it's MySQL and they're the ones that do it.

    I'm just saying, if you want to run snappy or gzip, this is roughly how you could do it, and

    59:00

    I actually think this is a good idea. Postgres can't do this, right?

    Postgres can't do any compression except for those TOAST tables, those separate overflow attributes, right? All right.

    So the big thing to point out is in this case here the compression

    59:16

    algorithm we're using, whether it's gzip or snappy or zstandard, is considered an opaque box to the database system, meaning I have a bunch of compressed bytes and the database system doesn't know anything about what those bytes actually mean. The only way it can get back to the original data is if it

    59:33

    runs the decoder, runs the decompression algorithm, to put it back into its original form. All right. These compression schemes also don't understand the high-level meaning or semantics of the data.

    So it doesn't know that maybe it should pack

    59:49

    certain pages together and compress them because they're related to each other. It just says: here's the data you want to compress.

    Go for it and here's the output. Right?

    So what we really want to be able to do is have our database system operate on compressed data without decompressing

    00:05

    it. So this is a high level example.

    I'll show you how we'll do it in a second. But the basic idea is this.

    Say I have a simple table with names and salaries. I want to run some compression algorithm and convert this data into compressed form.

    I'm showing this with two tuples because that's what fits in

    00:21

    PowerPoint. Always think in extremes.

    So what if I have a trillion tuples, right? I can store this in much less space than I would if it was uncompressed.

    And now say my query shows up: SELECT * FROM users WHERE name = 'Andy'. I'm not going to say how I'm going to do this yet, but

    00:37

    what I want to be able to do is take whatever constant I'm doing a lookup on, like name = 'Andy', and somehow convert that into its compressed form. So now when I do my comparison, instead of comparing two strings, does 'Andy' equal 'Andy', does 'Andy' equal 'Mesh', I'm

    00:54

    taking whatever this XXX thing is, assuming it's like an integer or something, and doing a faster comparison on those, right? And if I'm doing integer comparisons, there are instructions on CPUs to do that really fast, versus having to jump over bytes to do string matching.
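    A rough sketch of that trick, with a made-up dictionary and column; the point is that the query constant gets converted to a code once, and the scan then compares integers instead of strings:

    ```python
    # Hypothetical dictionary mapping string values to small integer codes.
    dictionary = {"Andy": 7, "Mesh": 12, "DJ": 3}
    encoded_name_column = [7, 12, 3, 7, 7, 12]   # the column stored as codes, not strings

    def select_where_name_equals(constant: str):
        code = dictionary.get(constant)          # convert the query constant once...
        if code is None:
            return []                            # value never appears: no matches at all
        # ...then scan comparing integers instead of strings.
        return [i for i, c in enumerate(encoded_name_column) if c == code]

    print(select_where_name_equals("Andy"))      # -> [0, 3, 4]
    ```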

    01:11

    So this is the goal. This is what we want to be able to do.

    And because we're the database system, we can control everything. We can do this.

    Whereas an off-the-shelf compression scheme can't do that. Even though they're actually doing a lot of the same techniques that I'm going to talk about here in a second, like gzip and

    01:26

    snappy, they're basically doing the same thing. They don't expose any of that to you on the outside.

    So they're doing dictionary encoding, but they don't expose the dictionary to you so that you could then do more fine-grained lookups. So going back here, we're going to be doing this in the column store,

    01:42

    because now the data within a single attribute is all contiguous and all from the same value domain, meaning we're going to see a lot of repeated values and a lot of values that are very close to each other, and we can exploit that in our algorithms.

    01:58

    So here's a rough smattering of the main compression schemes, and I'll go through these one by one. Obviously the usefulness of these depends on the data set and the workload.

    In general, dictionary encoding is going to be the most common. Like, I think ORC

    02:14

    dictionary encodes everything in its files, and obviously for variable-length data you need it. But for a lot of these other ones, if they fit the domain you're working in, you can get a big win, much better than you can with dictionary encoding. So run-length encoding, RLE, is when you

    02:31

    have repeated values within a column. So instead of storing that value over and over again, we're just going to store: hey, I have this value, I'm at this offset within my column, and here's how many times it's repeated. And I'll just store

    02:47

    that triplet in succession rather than the original values over and over again. So obviously, to make this work you've got to sort your data in such a way that you can exploit this, because otherwise you may end up ballooning the size of the compressed data.

    So

    03:03

    let's say I have a really simple table with some people information, where I have some ID for the person and then a flag to say whether they're dead or not, like a boolean flag, yes or no. So as you can see, in this boolean is_dead flag I have a lot of repeated

    03:18

    values. So what I can do now, instead of storing the original value over and over again, is store the value, the offset I'm at, because that's going to help me jump around if I need to find a match at a given offset, and then the length

    03:35

    of the run, right? So if I have queries like this, select the number of people that are dead and the number of people that are alive, right?

    I can just operate directly on this is_dead part

    03:50

    here in compressed form and compute that very efficiently. Of course, now the challenge is going to be if I have alternating values. So here we have no followed by yes followed by no, and that's the worst-case scenario for us, because now the size of the run is one for those three

    04:07

    values, and the size of the compressed data is actually larger than the original data. So I sort the table based on the column I want to compress, and I've got to update all the other columns to make sure my offsets still match going across

    04:24

    horizontally. Now you see I have all the yeses followed by all the nos, and now I can store this column as just two run-length entries.
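    A minimal sketch of run-length encoding as (value, start offset, run length) triplets, plus a count query answered directly on the runs (toy values, not any particular system's format):

    ```python
    def rle_encode(values):
        """Encode a column as [value, start_offset, run_length] triplets."""
        runs = []
        for offset, v in enumerate(values):
            if runs and runs[-1][0] == v:
                runs[-1][2] += 1
            else:
                runs.append([v, offset, 1])
        return runs

    def count_eq(runs, target):
        """SELECT COUNT(*) WHERE col = target, answered on the compressed runs."""
        return sum(length for value, _, length in runs if value == target)

    is_dead = ["Y"] * 4 + ["N"] * 5    # sorted column, so the runs are long
    runs = rle_encode(is_dead)
    print(runs)                        # [['Y', 0, 4], ['N', 4, 5]]
    print(count_eq(runs, "N"))         # 5, without touching the original values
    ```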

    All right. So again, I'm only showing

    04:40

    nine tuples here; think in trillions or billions. This will be a huge win, and we'll see in a second that we can combine RLE with other compression schemes. These things are actually multiplicative:

    I can compress in one form and compress it again in another

    04:55

    form. Yes, the statement is: doesn't this mess up my tuple IDs?

    This is actually not an internal ID; this is the application's ID.

    I have this column

    05:12

    called id. It's like a social security number or an email address.

    I'm just showing it as a simple number. All the other columns get reordered based on whatever column you're sorting on.

    Correct. And he's right.

    This might be

    05:27

    good for one of the columns but suck for the other ones. Absolutely.

    Yes. So, how do you pick what the best sorting scheme is?

    NP-hard. All right, next one is bit packing.

    05:42

    So again, in SQL you call CREATE TABLE, you say these are the columns I have, and you have to define the data type for them, right? And oftentimes people declare things to be much larger than they actually need. Like, I have an age column, right? I'll define that as an

    05:58

    integer, that's a 32-bit or four-byte integer in SQL. But if most of my values aren't going to hit the max size of a 32-bit integer, I've got a bunch of wasted space storing zeros that don't mean anything for the tuples. But because it's declared to be 32

    06:14

    bits, I'm going to store 32 bits. So, if the database system can recognize that you've oversubscribed or overallocated the amount of storage space for your data, it can actually throw things away.

    So, if I'm storing all these values here as again

    06:30

    as 32-bit integers, you see we have 256 bits total, but the only data that matters is this side over here. Everything on the other side is wasted space, a bunch of zeros.

    So with bit packing, you basically say, uh, I'm going to store this in a

    06:46

    compressed form. Throw away those other 24 bits and only store the eight bits that actually matter.

    And now for this example here, we go from 256 bits down to 64 bits with no loss of precision.

    And they're all fixed length because they're

    07:02

    all eight-bit integers. So now I can easily jump to whatever offset I need for a given tuple.

    What's the challenge with this? What if someone comes along and inserts a value that doesn't fit in just eight bits,

    07:19

    something larger? So, this is called patching.

    Or if you're using Amazon Redshift, they call this mostly encoding. And the idea here is that most of the values can be bit packed into a smaller form.

    But for the outliers that are maybe too big, I just have a little

    07:35

    special marker to say, "All right, this value is actually not in line here with the rest of the values. Go jump to some patch table that will tell you what the real value should be." So if I add this large number 999999, it doesn't fit within eight bits.

    I'll store everything else as 8 bits.

    07:51

    And again, in Redshift, they call this MOSTLY8. And then I have this patch table here that says: if you scan along and find this bit sequence XXX, think of it as all the bits set to one, then I know that for that offset I go look up in this table and I

    08:07

    find what the original value is. So in this case here, instead of storing this as 256 bits, I store all my values as eight bits each, plus that lookup table, which I can maybe get down to just 48 bits.
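    A minimal sketch of bit packing with a patch table for outliers. The all-ones sentinel and the 8-bit width are assumptions for illustration; Redshift's MOSTLY encodings and real patched schemes differ in their on-disk details:

    ```python
    SENTINEL = 0xFF  # all bits set: "look in the patch table" (so 255 itself must also be patched)

    def pack_with_patches(values):
        packed, patches = [], {}
        for offset, v in enumerate(values):
            if 0 <= v < SENTINEL:
                packed.append(v)           # fits in 8 bits: store it inline
            else:
                packed.append(SENTINEL)    # outlier: store the marker...
                patches[offset] = v        # ...and the real value in the patch table
        return bytes(packed), patches

    def unpack(packed, patches):
        return [patches[i] if b == SENTINEL else b for i, b in enumerate(packed)]

    values = [13, 191, 56, 92, 999999, 81, 172, 62]
    packed, patches = pack_with_patches(values)
    print(len(packed), patches)            # 8 inline bytes plus one patch entry for the outlier
    assert unpack(packed, patches) == values
    ```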

    08:22

    Yes. The question is: how do I differentiate between a value that's a sentinel saying go look at the patch table, versus an original value?

    So for that one, if you set all the bits to one as the sentinel, then you

    08:38

    can't let anything be that value; it has to be special-cased. Correct, and he's right:

    If you had a value that was actually that special case, then you

    08:54

    have to store that in the patch table. Yes.

    But a trick you can also do, and here I'm showing offset 399, is that if you have that repeated sentinel value over and over again, you only need to store

    09:09

    that once in the patch table. You reuse the entry.

    So it's not like you need to store it repeatedly. How do you store the offsets?

    Just a fixed length array.

    09:24

    Okay. Bitmap encoding.

    For this one, instead of storing the actual values themselves, for every unique value you have in your domain we're going to maintain a bitmap. And then you set a one within that bitmap at an offset depending on whether the tuple at that

    09:41

    offset has that value. I'm going through these kind of quickly.

    We only have 10 minutes left. This would be fantastic if the cardinality of the attribute is low, meaning the number of unique values it's going to have is low. Like my one

    09:56

    example, is_dead, yes or no. Anybody's either dead or not dead, right? There are only two possible choices, and this is going to work really great for that. So we go back to this, right, is_dead yes/no. I'll maintain one bitmap for my yeses and one bitmap for my nos, and as you see,

    10:13

    at the different offsets it's going to be a one or a zero in either one, right? And now if I run my queries, I can just rip through and find all the matches where somebody's dead.

    I only need to

    10:30

    scan through the is_dead=yes bitmap and I don't need to look at the other bitmap. So the original size of this was 64 bits.

    Assuming I can store these as single bits instead of yes/no, I can get this down to 16 bits.
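    A minimal sketch of bitmap encoding for a low-cardinality column, using Python integers as the bitmaps (toy example):

    ```python
    def bitmap_encode(values):
        """One bitmap per distinct value; bit i is set if tuple i has that value."""
        bitmaps = {}
        for i, v in enumerate(values):
            bitmaps[v] = bitmaps.get(v, 0) | (1 << i)
        return bitmaps

    is_dead = ["Y", "Y", "Y", "Y", "N", "N", "N", "N", "N"]
    bitmaps = bitmap_encode(is_dead)

    # COUNT(*) WHERE is_dead = 'N' becomes a population count over one bitmap,
    # which CPUs can do a whole word at a time.
    print(bin(bitmaps["N"]))             # 0b111110000
    print(bin(bitmaps["N"]).count("1"))  # 5
    ```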

    Yes. Where would you ever do that?

    10:46

    Like, isn't it better to just store one bit that says zero or one? The question is: why would you ever want to do this?

    Is it better to just store is_dead directly? In this example, yes.

    Even if there are more values, is it better to use one small number rather than a bunch of

    11:04

    bitmaps? The question is: if there are four unique values, should you do something RLE-ish instead? Are you saying do RLE or do this? Basically, the statement is: could you

    11:21

    just represent the bit sequence as a number and store the number? Yes. For two columns it would be; I'm trying to show a simple example. Yes, but I just don't understand why this would ever be a good idea. So the question is: why would this ever be a good idea? Well, if the thing you're

    11:38

    storing isn't binary... well... yeah. Yes.

    So like isn't it always smaller to just have like one small integer where like

    11:53

    zero means one value and one means another, stored directly, instead of having these sparse bitmaps? His question is: isn't it better to have a small number represent a value within my column

    12:09

    rather than do one-hot encoding? Well, no, because I can rip through bitmaps very quickly, much faster than I can rip through integers, right? Take the simple example here. This example isn't great, because I could just store

    12:26

    is_dead as a single bitmap column directly, right? So you're saying it's easier to compare one bit? Correct, yes, because there are a bunch of bit-twiddling tricks on modern CPUs where I can take a whole

    12:42

    vector of bits and, with a single instruction, find all the matches, whereas with integers it would take a lot more. Your statement is that bitmaps would be

    12:57

    faster. Yeah, for aggregation.

    Yes. Okay.

    My example here, yes/no, is binary.

    So storing just this array here, or this array here, would be the same thing as storing is_dead directly. I'm showing two values because that's what fits on PowerPoint, but if you have more,

    13:13

    it would not be a good idea. All right.

    Let me quickly show why it can be a bad idea. So if you have a lot of columns... sorry, a lot of unique values, like the number of zip codes, then the storage space with the naive scheme would

    13:28

    be terrible. All right, so you don't want to bitmap encode everything.

    There are tricks to compress these further using something called roaring bitmaps. We don't have time to cover them, but think of them as compressed bitmaps; it's a way to compress sparse bitmaps.

    So if I have a bunch of zeros, roaring bitmaps are

    13:44

    the right way to do that, and a bunch of systems use them. All right, let me get through the last two compression schemes:

    delta encoding and dictionary compression. So with delta encoding, the idea is that if I'm storing values that are very close to each other repeatedly, then I don't need to store

    14:00

    the actual absolute value for each of those measurements or whatever I'm trying to record. I just store a diff or delta from either the previous value or some base value at the beginning of the page, and now the delta I'm storing is much smaller than the full value.

    14:17

    So again, say I'm storing the temperature in this room at a one-minute granularity (hopefully not this room, 99 degrees would be a lot). If I'm recording the temperature over and over again, there aren't going to be wild fluctuations from one measurement to

    14:34

    the next. Same with the time:

    the time is always incrementing by one minute. So instead, I'll just store some base value at the top of the column, and then I'll store each delta from the previous value.

    So now as I'm

    14:51

    scanning along, I know how to apply those deltas to get back whatever value it is I wanted, right? And we can compress this further, because if you notice in this case here, in the time column, what do we have?

    A bunch of plus-ones. So we don't need to

    15:09

    store plus-one over and over again. We can RLE that, and now we get multiplicative compression, where I just store: plus one, repeated four times.

    Right? Some of these compression schemes are composable.

    So the benefit we're going to get is:

    15:24

    the original data in this example here is 320 bits. If I just do delta encoding I get 128 bits.

    But if I do RLE plus delta encoding, I can get it down to 96 bits. Again, we're talking bits here.

    Always think at larger scales. This will matter when you get to gigabyte and petabyte scale.

    15:40

    In this example here with delta encoding, I have the base value at the beginning of the page that I use as the starting point when I compute these things. There's another approach called frame of reference, where basically you have a global minimum value that you use across multiple pages, but the basic idea is the same.
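    A minimal sketch of delta encoding with RLE applied on top of the deltas (toy values; real systems also bit-pack the deltas):

    ```python
    def delta_encode(values):
        base = values[0]
        deltas = [values[i] - values[i - 1] for i in range(1, len(values))]
        return base, deltas

    def rle(values):
        runs = []
        for v in values:
            if runs and runs[-1][0] == v:
                runs[-1][1] += 1
            else:
                runs.append([v, 1])
        return runs

    times = [1200, 1201, 1202, 1203, 1204]   # one reading per minute
    temps = [99, 99, 98, 98, 99]

    base, deltas = delta_encode(times)
    print(base, deltas)         # 1200 [1, 1, 1, 1]
    print(rle(deltas))          # [[1, 4]]  -- "plus one, repeated four times"
    print(delta_encode(temps))  # (99, [0, -1, 0, 1])
    ```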

    15:59

    All right, the last one I want to get through is dictionary encoding. The idea here is that we're going to replace frequent values that we see within a column with some kind of small fixed-length integer code.

    And then we'll store those integer

    16:14

    codes contiguously in memory. And then we'll have a separate data structure called the dictionary, which is oftentimes just an array that we jump into by offset, so that if I ever need to go

    16:29

    back to the original value, I can use that dictionary to do the conversion. Typically, most implementations have one dictionary code per value.

    So, if I have like a one-megabyte block of text, I can convert that down to a single 32-bit integer,

    16:45

    and then I have a dictionary that stores that entire block of text elsewhere. You could start doing more fine-grained things like Huffman coding,

    but as far as I know, no system does that in production. This is also basically what LZ4 and zstandard and all those other compression schemes are doing underneath the covers.

    They just don't expose the

    17:01

    dictionary to you. So this would be great for both point queries and range queries, because I can rely on that dictionary to do fast comparisons on the data in compressed form without having to go decompress it.

    17:18

    So look at a simple example like this. So here's my original data, right?

    I have a bunch of strings, and they're all different lengths.

    So when I compress it, I convert the name column into these compressed integers. Then I have this dictionary data structure over here where, for any given

    17:34

    value, here's the corresponding code for it. And I have a way to go in both directions in my dictionary: for a given code, give me the original value, and for an original value, give me the code.

    So I need to support both operations. And this dictionary is actually sorted based on the

    17:50

    values that we're compressing, because then we can still apply certain comparison operators when we run queries on the compressed data and still get all the order-preserving guarantees we want, because the data has actually been sorted. So now I can use some tricks like this.

    If I have a query like:

    18:06

    find users where the name starts with the prefix 'and', I can take my query, extract out the constant I'm doing the lookup on, and basically apply that LIKE clause to my dictionary, which is way smaller

    18:22

    than the original data, and convert my boundaries into dictionary codes, in this case 10 and 20. Now, in my rewritten query, I can rip through the name column with that BETWEEN clause and

    18:38

    just do comparisons on integers, which is way faster, to find all my matches. So I still had to do some string lookups, but I did those on the dictionary, and if I have a lot of repeated values that's fantastic, because I'm not looking at all the original string data; I can rip through the encoded column very

    18:54

    efficiently. So this is what the order-preserving guarantee of keeping the dictionary in sorted order gives us.
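    A minimal sketch of an order-preserving dictionary and the LIKE-to-range rewrite described above (the values and codes are made up; real systems build the sorted dictionary in a pass over the column):

    ```python
    import bisect

    # Sorted distinct values; code = position in the sorted array, so code order
    # matches value order (this is the order-preserving part).
    sorted_values = ["alice", "andrea", "andy", "bob", "mesh"]
    code_of = {v: i for i, v in enumerate(sorted_values)}   # value -> code
    name_codes = [2, 4, 0, 1, 2, 3, 2]                      # the encoded name column

    print(code_of["andy"], sorted_values[2])                # both directions of the dictionary

    def code_range_for_prefix(prefix):
        """Rewrite LIKE 'prefix%' into a half-open [lo, hi) range of codes."""
        lo = bisect.bisect_left(sorted_values, prefix)
        hi = bisect.bisect_left(sorted_values, prefix + "\uffff")
        return lo, hi

    lo, hi = code_range_for_prefix("and")
    print(lo, hi)                                           # 1 3: the codes for 'andrea' and 'andy'
    print([i for i, c in enumerate(name_codes) if lo <= c < hi])   # matching tuple offsets
    ```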

    So if I have a query like SELECT name FROM users WHERE name LIKE 'and%', right, I still have to do a scan on the original table,

    19:10

    because I still have to get back the original values for the tuples that match. But in this case here, if I only need the distinct names, I only have to go look in the dictionary, because I just need

    19:25

    to find whatever matches I have there, and then I'm done. Yes?

    The question is: how do I handle deletes?

    How do I delete a value from the

    19:40

    dictionary? Deletes are easy, because you can just delete the value from the dictionary, and a gap in the codes doesn't matter.

    Inserting is trickier, because if I have to insert something between, say, 10 and 11 and I don't have a code available in between, I've got to re-encode things. The great thing about OLAP is that the data is mostly immutable, so it's not

    19:57

    really an issue. If you see the phrase dictionary compression, does it mean order-preserving by definition, or are there dictionary compression schemes that are not order-preserving?

    The question is: if someone says they're using dictionary compression, does that mean by default it's order-preserving? Typically, yes.

    Are there

    20:14

    dictionary compression schemes that are not order-preserving? Yes.

    But most of them are. I wanted to highlight that benefit we get.

    Okay, we're out of time here. So I'll pick up on how we handle OLTP and OLAP together next class, and then we'll have

    20:29

    the guest speaker from Yugabyte, and then again the recitation is tomorrow night, and all the database company talks are tomorrow morning, as I posted on Piazza. Okay, hit it.
