#01 - Relational Model & Algebra (CMU Intro to Database Systems) | YouTube Summarizer

Transcript

00:00

[Music] I'm still ass [Music]

00:28

15.5 6.5 All right. All right.

Before we get started, again, apologize for running late waiting for us uh whenever time and we had everything set up and we weren't working at the last moment. So, I apologize.

Uh so, quick shout outs to people who help us get where we are today. Uh if the Greek out in

00:44

California, JL in Seattle, my main man easy in Brooklyn, and then I also want to recognize the people that aren't with us anymore. Uh so, it's been a year since DJ got tables passed away.

RIP to him. And then DJ Mushu is still in lockdown in Cook County.

We hope to get

01:00

him out uh this year. DJ GPL is still in Pittsburgh.

I still think he still is dating the same girl as before, so he's doing fine. Um but uh the good news is that we have a DJ again.

So give it up for DJ Cash.

01:16

So DJ Cash is not just a DJ, you're also a producer, right? Right.

>> So show us a little something what you can do. Go for it.

01:40

[Music] about this like all those uh samples we can clear whereas before we can always play everything we wanted on the air. How you doing that?

I'm doing great. How

01:55

are you? >> Well, I was died last semester.

That sucks. But I'm not dead.

I'm here. >> Everything's okay with you.

>> Um it would be nice if I had a lot more cash with me. >> What?

>> Um I'm just looking for that money, man. Just here to get that money.

02:11

>> Okay, that's a weird thing to say. Okay.

Uh we'll we'll take care of it. We'll be fine.

Okay. We'll figure something.

Uh you should have told me this before. That's weird.

All right. Sorry.

Um All right. So, uh, today's lecture, we're going to go quickly through the, uh,

02:26

some course logistics, uh, and then we'll go jump right into all the course material again. So, if you're not enrolling the class, you're currently on the wait list.

Uh, I think that if I checked this morning, I think we're like 28 or something like that. I think the

02:42

total enrollment is 130 or 140, something like that. Um, so we don't control the weight list anymore because in the previous years, we've had too many students trying to get on uh, and the weight list got too long.

to the admins in uh SCS control list. So they will add people off the wait list as spots become available.

I can't promise

02:58

you whether you're on the wait list, whether uh you just go get in and this is not official, but it doesn't go by your position for reasons I can't get into, but there's a lot of background politics between different programs. Uh and it doesn't always go by the uh your position.

Okay. So the good news is that

03:15

if uh if you can't make it this semester, then it's offered every semester. Uh so if you can't get this year it'll be in the spring.

Um if you want to make sure that when you if you do get off the wait list that you're caught up to speed uh project zero which I'll talk about in a second should make

03:30

sure you do that and follow the lectures because that project will leave the determination whether you're not taking class or not as well even if you are bored. All right so this class will be about the designation of database management systems right how do you actually build this this software but what's going on on the inside.

Um, so

03:47

it's not a course about how to use or administer a database system. So some people say like why am I in upgrading system like I just want to use Oracle just use my SQL post whatever that's not this class that's taught over in Heines College and information systems and this is really about computer science level

04:03

understanding what's going on inside the software to understand what the uh what these systems are actually trying to do for you. It's okay.

If that's what you want, go go over there. What's going on there?

04:19

All right. Um the all the course logistics uh to the syllabus and the schedule are now posted online.

So please go read that. Make sure we understand what's be expected of you as a student in this class.

All the discussion and announcements for the course will be on through piaza. If

04:35

you're enrolled, everything should be everyone should have been added uh as of a couple days ago. I think I can refresh it for the new students.

Uh and the same thing grades scope that'll sync with Canvas. So everything will be the roster there will be added are you enroll there.

Uh and then final grades and the

04:51

total grades throughout the spectrum that'll be done through through campus as well. So we say this every year if you're not a CFU and you really want to be taking the class uh because we post everything online.

Uh we know there's a non team student ratio. Use that code.

Make sure you put your university as part of your university. uh and then you

05:07

can follow along and get all the things we have uh in the class uh except for obviously posting PASA. If you're not a student in the class, please don't email the students or the TAs or myself.

Uh don't post your solutions on GitHub. If you are in normal class, you don't want to be in this grade scope.

You should be automatically added to the real one.

05:23

Okay. All right.

So, for in the lectures today are throughout the rest of the semester, I get very excited when I talk about databases because I I love databases. They're the best thing in the world.

Number two in my life. Uh, I'll talk about number one in a second.

Um, but

05:39

when I get excited, I start talking very fast and I know that can be hard for international students. So, if I'm going too fast and you need me to slow down, just raise your hand and tell me like, I don't understand that.

Can you repeat yourself or slow down? Okay.

Uh, and if you have any questions lately that goes on in the lectures, again, please

05:55

interrupt. Raise your hand and say, "I don't understand what you're talking about on that slide as we're going along." And I'll stop and repeat myself because if you have questions, then it's very likely that somebody else is going to have questions, too.

So please go ahead and interrupt me. So what I won't do is that at the end of this end of

06:10

each class I won't allow you to come up and say hey on slide one 123 you said this or what did that actually mean? So immediately after the lecture you can't ask me questions about that lecture because I want you to interrupt me as we're going along.

Okay? So you can ask me about anything else after the name of the class like how to get a jobless how

06:26

to deal with the cops or whatever you want like that you can do but I won't answer questions about the lecture. Okay?

Because again if you have questions other people do. I don't want this to be sort of a back and forth collaborative.

All the projects in the special will be on this uh educational data system we've been building for several years now at

06:41

Carnegie Melon called busub. Uh it's in C++ 20.

You can probably get by C++ 17 if you're not familiar with the the new features in the 20. Um but the main thing is going to be in C++.

There is no class at CMU that teaches you C++. So you kind of have to pick it up on on

06:57

your own. Um, and so if you think if you don't know C++ and you think you're going to learn it as we go along, that's probably going to be a bad idea.

Uh, especially when it comes to debugging the system. Um, because you think about you're trying to learn C++ and travel

07:13

databases. It can be challenging to do this at the same time.

So every year every student comes along, there's always students saying, "Hey, I know C++ and I know C, I know Java, I can pick up C++." And that's not always the case. So, uh, we had this assignment at the beginning of the semester that we

07:29

implemented today. I had to fix up some things in the latest version of Clang last night, but should be out, uh, in a few hours.

Um, called project zero. It's basically implementing a simple data structure that has to be multi-threaded.

Uh, and you don't get any grade for it

07:44

because it's not meant to take a long time, but it's it's meant for you to come to realization what you actually know enough to get by in the class. So, you have two weeks to complete it.

you have to finish and get a complete 100% uh in in that two weeks or you'll be asked to drop the course because again

08:00

when we start doing more complicated things like a you know multi-threaded data structures C++ is going to be very challenging for you if you you don't have the experience okay this also force you get access to your dev environment all that good stuff as well so any questions about project zero

08:16

we'll post information about this on the side and the code will go out later today and again we try to people to uh get through this student rather later before we get past the ad drop deadline. So now I have to show this every time in the

08:33

beginning of the class warning about plagiarism, right? So everything you do for homeworks and the project should be your own work.

Um means we don't want you to copy from other people on the internet or other people uh in the class. Um, but I'm going to say you are allowed to use generate AI tools flaw

08:49

check pick your favorite one to help you write the the project right because this is the future like it's stupid to say don't use these tools be ask you like not use debugger to debug your code be stupid like this is the way the world is now um so you the lines get blurred you

09:05

know what's actually plagiarism versus not plagiarism when using LLM and if you're not sure ask us use your best judgment but by all means you should be using these tools to help you write The tricky thing is going to be you you can have it you know check will generate a bunch of code for you but if you don't understand the fundamentals what we're

09:21

trying to teach in this semester and understand systems you're screwed because you don't know whether the thing it's going to put out is going to be correct or not and we'll talk about the leaderboard in once we release project one but there's ways to get extra credit and by based on how fast your code actually can actually run and I'll be

09:37

honest I play and claw other ones and it's the sometimes correct. It's definitely not going to be performant versus stuff, right?

Writing assembly versus writing C code in the compil. Okay.

09:52

The other thing we're doing this semester is that we have uh on every Wednesday, we'll have inclass lectures that resume from all our friends in the data industry. So we'll have these little flashing lectures that'll cover quickly some some particular data system or project that's out in the real world to get you to understand what uh how the

10:09

things are talked about during the semester relate to uh to real systems. We're also having all these companies come give uh uh recruiting talks in in two weeks or three weeks in September.

Uh every class will be invited to companies things and we can talk to them

10:25

about internships and full-time positions. That'll be on September 15th and 16th.

We kind of restricted this just to be the data classes this this semester and maybe previous semesters and then also the advanced class teaching being taught at the same time. So it's not like the career fair with everyone showing up uh you know they

10:42

took a shower that morning they're trying to hand their resume. This is specifically just for databases and we'll post on kiosk about how to do this and then we've also asked the companies to provide us uh with information about their intern and fulltime positions directly for these students and that again that'll be

11:01

all right if you want to go beyond the things we're talking about in the class uh we're also having a seminar series starting in uh September this is optional for students uh and every semester we always try to do a different topic this topic or this semester will be about data lake systems. If you don't

11:16

know what that is, don't worry about it. We'll cover it later on.

Uh but this is sort of the hot trend right now where people are building cloud-based systems where they do SQL on top of files and Amazon S3 cloud storage. So iceberg is probably the most famous one of all these.

11:32

Anybody know who bought iceberg recently or last year? Data bricks.

Anybody knew what data bricks paid for them? I think it was two billion or a billion B.

We're talking about a shitload of money here. Um so uh bunch of people are

11:50

probably going to come talk about what they're building. Uh and again this will be on uh on Mondays after class at 4:30 and it'll be on Zoom on YouTube as well.

Right. I mean this option for students if you you're like me you really love databases uh you know spend more time and this is one way to do that.

12:06

>> Is there a reason why we don't have data bricks on these slides? question is do we is the reason why we don't have data bricks in the slide we do uh twice I've already said one they bought iceberg they bought tager there's one more in there from data bricks everybody know what it is >> delta lake yes

12:22

>> absolutely yes so delta lake was uh databicks's version of iceberg uh and then they bought the iceberg guys the story is they the snowflake was trying to buy by the iceberg guys and then they offered him 600 billion uh and then dat

12:38

Swin and offered them like you know two billion might have been a billion might be off but still a lot of money and then uh snow snowflake put out Polaris which is the catalog version it's iceberg compatible hoodie came out of um Uber I

12:54

think uh and so iceberg came out of Netflix Uber came out of uh Hoodie came out of Uber and then now that's being commercialized by a company called OneHouse um you'll see throughout there's a lot put a data system out there because there's a lot of money in

13:10

this and every year everyone's trying to build a new system to try to solve some problem that existing systems don't solve and sometimes it's a good idea and often times it's a bad idea and what this class is going to teach you how to spell out which one's the bad ideas and not because I said so because you understand the fundamentals of

13:25

consistent trying to do okay all right uh now this doesn't relate to the class but this is something because we're going to put this on YouTube um I sort of mentioned before that, you know, DJ uh Mushu is is in is in jail in in

13:42

Illinois. Uh there's actually somebody else we know who's actually in jail right now in prison.

Um there's this guy named Preston Thorp. Uh so if you read Tech Fun Hacker News, you might have seen this a few few weeks ago, right?

And he's got a blog article on um

13:58

torso's website about how he's in currently in jail right now. He he got caught dealing pink or smothering pink in uh from um you know up in in New England.

Uh so if you read this blog article, you got this little blurb here. It says that uh while he was in jail, he

14:15

started watching these classes and he became fell in the databases and he actually currently in prison right now. We are trying to find a lawyer to get him out.

Uh and he's doing databases, right? That's awesome, right?

So I realize that there's other people maybe in the same boat. Again, everyone's here

14:30

and hopefully don't end up being in prison. But it's not just through.

Uh so this semester starting, if you're not uh if you're currently in prison or you know somebody that's in prison, you send us an email and we'll send you a package up version of this course. Uh it'll be

14:45

like, you know, if you don't have a computer, we're pronouncing things, but like you'll get sent like a thumb drive, the videos. So obviously we can't send you this semester because this semester hasn't happened yet.

So it'll be like in the previous semester. Okay.

And we thank our friends at Convex uh which is a database company uh for for sponsoring

15:02

this. Okay.

All right. Any questions about about the logistics of the course?

What's expected? Again, the course sale list tells you the breakdown for the grades for the projects, all the the assignment list when all the projects will be released and how much released and when they're actually going to be due.

And then you post all the questions on the other. Okay.

15:21

All right. Let's talk to this.

So, as I said before, I love databases, but they're the second most important thing in my life. Uh, number one is my wife and my biological daughter.

Uh, number two is is databases. Like everything else doesn't matter.

And you'll see this

15:38

throughout your life that pretty much everything is going to be not just computer science and tech. Everything can be thought of as a database problem, right?

So, I don't have any friends outside databases. I don't talk to my family because they're Trump supporters.

Like it they they don't like databases. It's just obsessively about databases.

15:53

So what I'm going to hopefully impart from you in today's class and throughout the rest of the semester is how awesome databases are and why they're so important and why again you should pursue what I would call a database system lifestyle which is every day you wake up how can I work on databases or how can I run my database so in today's

16:10

class we're talk about data systems background but I'm making mistakes being database and database systems because they're equally important and then I'll talk about the relational model which is the best data model you could have for a database system we'll talk about relational algebra which is how you can actually end up write queries and the

16:25

fundamentals of building blocks of how you write queries and manipulating databases and then I'll finish up talking about some alternative data models that are out there you might come across in the real world and why they are inferior or just subsets of what the relational data model can do. Okay.

And

16:41

as I said I love data so much I get really excited. So please tell me to stop and slow down if you have questions.

Okay. All right.

Let's make try to start with easy question. Anybody name a database?

Just shout that out. >> Postgress one.

What's that?

17:00

>> SQL >> like SQL server >> SQLite. My SQL.

Yes. >> You said you said SQLite.

That's fine, right? For my SQL.

>> Click house. Yes.

One more.

17:16

>> Microsoft. >> Microsoft server.

There's actually two there's two product there SQL server and Microsoft SQL server different things. Yes, right there last one.

>> Reddit monitor. We'll cover that later.

17:31

Um you're all wrong. So what did they list?

Database systems. I asked for a database.

So again this is the word about databases. We can be very pedantic about what what our definitions are because we have to understand what

17:47

we talk about when we say what's a database and a database system. You have to go to data systems to manage it.

So you're not wrong if you go to a bar anywhere and you know your tech people and say hey I'm using this database they'll know what you mean but I understand what the data we actually want to store and what the system is

18:03

we're going to build actually managing right so databases can be a sort of a collection of data that's interrelated in some way uh that's meant to model the real world is so we listed click post my single single like again these are all

18:18

data systems that we use to host and store and query a database. A database would be like the list of students that are enrolled in this class.

You have a name, you have an email address, right? You have a date of birth and so forth.

And what is that trying to do? Trying to

18:34

represent you guys that are taking this class in the sort of digital world inside of a database, right? And as said before, these can be the most important computer application that you'll see for the rest of your life.

Even if you don't take this class and go off and actually build data systems at all these various companies that we just mentioned, uh I

18:51

guarantee you for the rest of your life, you're going to come across data data systems. And what this class can hopefully teach you is like when you you send a query to your data system and it's slow or something weird and doesn't behave the way you want to behave.

This class will help you understand why is that that's the case. The end of the

19:07

day, what is computer science? Computer science is just like taking some inputs, doing some kind of manipulation on it, and then producing some output.

That's a database at a high level. What is a possible possible database?

What is LLM? It's basically a database.

Everything is going to be a database,

19:23

right? Compilers, I mean, it's kind of rules inside that thing, but it seems some input producing output.

You could argue compiler is a database. A game's a database.

Everything is a database. Trust me.

Okay. So, let's look at a simple example and

19:39

we'll see one way we can actually implement a data system actually to maintain it and run it. And then we'll see why sort of the straw main idea is a bad idea and we'll see again we'll help motivate why they actually built something more sophisticated why we need to do this course.

So let's say we're going to

19:55

build a clone of Spotify or iTunes and we want a database that keeps track of a bunch of albums that are out there, music albums, and then the artists that are on them, right? So what basic things we need to track is like the name of the artist, maybe the year that that they that the album released, right?

And then

20:11

what artists appear in different albums, right? You think about what what are we trying to model?

The data is trying to model a real music store where you actually go in and see records and CDs and tapes and things like that. But here we're trying to do it in inside of the database.

We're trying to keep track of all the attributes we have for that for

20:28

those real world entities inside of our database. So what's one easy way we could we could do this?

>> Yeah. Go ahead.

Yes.

20:44

>> He said put an RC in each row. >> Yes.

So basically a way more specific to say let's just say we have a bunch of CSV files comma separated value files text files you have on your laptop or your your computer and I have one file for the artist and one file for the

21:00

albums and as they said I'll just have every line is going to be a different artist every line of that file you know separated by a new line character is going to be an album and then I'll have commas separate what they are have one one file for artist and one file for app

21:17

Right. So now anytime I want to query this database, this is a database.

Now I'm just going to write some code that to open up a file call f the each line by the new line, split by the new line and then split it by the comma and find the data I'm looking for.

21:34

Right? So let's say that I want to write a query that says give me the year that Jiza went solo.

Right? So I have this artist uh uh file and I can write some pseudo Java code or JavaScript code that just opens the file reads line by line

21:50

parses each line and then I know that the the first offset is going to be the name of the artist Jiza. So the first offset when I split them on the comma is Jiza and I know I have the match and I just output the the the second line

22:05

record offset one. You notice here I'm I'm the in the CSV file everything's treated as a string.

If I want to produce the actual integer of the year that came it went so well I I cast it to an integer. So is this a good idea or

22:23

I've already said maybe why >> yes >> you said it's a linear time every time you look at one entity. Yes.

We never know where the next line is.

22:39

>> They said you uh in order to find any line you got to parse the previous line to a sort of related a linear scan or sequential scan to the file. >> There's no >> there's no typing.

Yes. >> The logic and variable field names are

22:54

hardcoded. >> They said the first >> you said the logic and the name or the off the fields are hardcoded in the code.

Yes. >> One more.

Yes. No consistency guarantees.

>> He said no consistent guarantees. What do you mean by that?

23:14

>> He said no acid. >> Give us give us like 10 weeks before we get to that.

Uh maybe another way to say is like there's no guarantees that the data I put in is going to be correct and no guarantees that when I write the data that's going to be safe and no

23:30

guarantees that uh the the data across the two different files will be uh consistent meaning like I I could put a album that has an artist that doesn't exist in the artist the artist platform that's basically assess

23:48

right so you can sort of think of like the different levels like the integrity of the data maybe like if I put data in is it going to is because I'm trying to model the real world can I manipulate and and change the data in such a way

24:03

that I can put it into an invalid invalid state that doesn't match what the real world looks like right so as I said before how do we make sure that the uh if I if I add an album I put an artist name that artist name actually

24:19

appears and it's correct in the artist file or what if I have an artist has multiple albums the Wuen clan must have a bunch of albums how do I make sure that the the artist name isn't like you know you know I don't sometimes use a hyphen don't use the hyphen so I have

24:34

two different versions of the Wuang clan even though logically I know they're the same thing but the actual bits I'm storing are actually going to be different so I don't think they're not going to match my sort of simple example here I was assuming that there's a one to one correspondence between an artist and an

24:50

album and an album would only have one artist because there's only one field for it or people people put out mixtapz all the time or with collaborators do you have multiple artists on the album but I can't represent that in that database then what happens again if I delete an

25:06

an artist but I don't think the album that they're they're involved in now I have basically a dangling pointer the imotation going to suck as well right how do I find a record as we already said I got to parse every single line and then jump to the offset hardcoded in my my program to find the

25:23

data that that I'm looking for. That's going to suck.

But then what I want to write a new application that I want to use the same database and this one's in Python. Maybe I write the next one in SQL or Rust, right?

I got to run all the same logic I had before to parse these files now in a new program and how to

25:40

make sure that these things are in sync. Be really hard to do because one's in they're different languages.

And then what happens now if I have two threads try to open up the file and try to write a new record at the same time. What would happen?

Who would win? Right?

Is it the last writer wins or should the second guy

25:56

fail? Right?

That's sort of related to the assets we just talked about. We'll cover that later.

Then the last one that's super important because people get really pissed off if you lose data. Like think of your bank account.

If you put money in, you know, deposit money in the bank and they lost

26:11

that deposit, you'd be pissed, right? Because it's money.

So the bank does a bunch of stuff in their database to make sure they don't use those records. So in our example, you know, it's not the end of the world.

We miss an album, but people would get pissed off pretty quickly if the application didn't work

26:26

as expected, right? So what happens if we start pending new records to the file and then then we crash the computer crash the power?

What should happen? What should you see when you come back?

What happens if you want to duplicate the data so that instead of having on

26:43

one machine and that one machine goes down we lose everything if we want to you know running a real website a real service it's got to be always available so I want to start duplicating the data and how to make sure those are things are in sync and so this is be basically why you

26:59

don't want to write a database system yourself in your application code people do it all the time it's a terrible idea you want to use a database system because these this is you know highly vetted software part that is uh where people spending a lot of time you know not worrying about the high level things

27:14

like you know how do I actually store uh represent a music music site but actually how do I store data and make sure you don't lose anything right so you data manage systems are going to be important piece of software that going to be sort of the bedrock for pretty much every application that

27:30

exists today now that said just because you're using data system doesn't mean the people building that system have done that correctly and you can guarantee that you're not going to lose any data or have other problems. This blog article actually came out two days ago.

Uh so there's a company out of the

27:45

UK called Surreal DB and without going into details. Basically they they turned off uh safe rights by default makes themselves look like that faster than they actually really are.

So they're they're storing data very fast for you but it's not actually safe safely being

28:02

stored. So you can crash or lose power if something happened and you can corrupt your database and you lose data.

So I'm just I'm pointing this one out because this was two days ago. MongoDB had a long history.

We'll cover throughout the semester, but they were kind of playing the same kind of games. But this is not a new trend.

This is not

28:17

just pointing these guys out. I'm just pointing out to say that again what this class will do is teach you why what they're doing is a bad idea.

Uh and you should really understand what the system is is claiming what what they're trying to claim that they can provide and see whether it's actually real or not. This

28:33

is something when you're hiding something in the documentation that nobody realized they need to turn on until it was too late. All right.

So now make distinction between the the data systems that we mentioned at the beginning versus the database. The data benefit system is this category of software that is meant

28:51

to be for storing application data in such a way that it makes it easy to store data and to analyze it and ask questions about that data at some later point. So a general purpose database system like the various systems we've all mentioned today like click house my

29:06

postgress and so forth right these are these are designed such that any application that comes along could start storing their data in it without uh having to write everything from scratch right you you define schema we'll cover

29:22

in a second you you tell the what you want your data to actually look like recording some data model which we need the the next slide um and then these systems will actually store this for you. As I said, this is not something you want to build yourself.

Almost always the case from you almost never

29:38

want to build this yourself. Um, often times people say like, oh, there was no new system that met my needs.

I had to start building my own. 99% of the time that's actually not true, right?

You can get very very far starting with Postgress or even something like SQLite. And that should usually be the first

29:53

choice uh for building any new application. Actually, if you get nothing out of this course, remember your first choice should be Postgress.

They have amazing front end, terrible back end. I'll explain why for that.

Okay. All right.

So, a data data system is

30:11

going to provide you with uh expose you what's called a data model, which is a high level abstraction that sort of specifies how you're going to represent data in the database. Right?

So, relational model is is one example of a data model. The document data model in

30:26

in Mong, that's another example. Cover that in a second.

Right? Think of these like the rules that that you would rules that can define what types of things can exist and their relationships between each other in in a database.

30:41

Outside of like computer science, you can think of like a um a data model is like the the rules for architecture. Like you're going to build a building, right?

A a building would have things like a room and a doors. like these are the types of things you're allowed to have within a architectural diagram or

30:58

or specification for a building. And then a schema is going to be a description of a particular database or a collection of data uh according to some some data model.

It's just like a way to define here's the things I'm

31:13

actually going want to store in my database according to the data model that I'm required to specify things as. So going back to my building example, you can think of the the the schema an actual blueprint diagram of a building, right?

Say here's where the doors are.

31:29

Here's the size and dimensions of the room, right? And I can take that schema and I can I can make different versions or different um uh different instances of that house or whatever that building is according to that blueprint.

But the data model defines what the blueprint is

31:45

allowed to have. It has windows and so forth, right?

So the relational model is one example of a data model. It wasn't the first one.

We'll cover the the struggle ones in a second. Um but pretty much today most data systems are going to be the follow the relational data model.

32:01

pretty much all the systems everyone mentioned except for Reddus and uh are relational database systems for simple things like key value systems. Uh think of like it's most simplest data you can have database you can have.

You have a key call live value

32:17

associative array like a hashmap you can drop in program languages a key value pair it's pretty basic you see this in like a lot of caching systems uh key follow or a chunk of data you kind of cache um for several things it

32:34

looks like this there's a whole another category of systems called new SQL systems raise your hand if you're ter getting less and less each year this is good Um I was I was not saying I was very impression but I was I was um I was

32:52

going to say an early critic of NoSQL systems and turns out they're right pretty much all of them are adding model now but um you'll see these wrapped databases they're not dead yet they're around the dock database JSON database XML databases these are ones people mostly think about in the world yes

33:09

>> so is key value not no SQL >> question is is key value not SQL. I mean there is not a academic scientific definition of NoSQL.

You could throw them in there as well. Yes, those would be like Reddit would be a no

33:25

system. That would be a key dive.

But when most of the think they usually think the things like that, right? All these other data models we don't we don't care about except for the documents that we'll cover in a second.

Then there's uh array databases

33:41

like you know one-dimensional vectors, two dimensional tensors uh and matrices. You typically see this in machine learning workloads and scientific workloads except the vector stuff I'll talk about I'll talk about at the end but this is very common now in databases to build rag applications do semantic

33:58

search and your favorite things. We'll talk what that looks like.

Um and then there's these guys at the bottom higher core network semantic and relationship. These are all the ones you've seen really old systems from like the 1970s, 1980s.

Uh high pilot was the first one. Network is another one that's around

34:14

like you would nobody would no new start saying I'm going to build my database or I'm going to build my application running off a highle database. Like that's like IBM IMS they built that to keep track of the all the parts of the moon that they use.

That thing is still around. IBM still makes a ton of money

34:30

on it. Every bank still runs it.

uh but they put a relational veneer on top of that so it doesn't look like it did back in the day. Uh so think of like 1960s 1970s early 1980s.

We don't have to worry about these except we'll see why they're a bad idea and then it'll help motivate why we want to do the data

34:46

model. For this whole course it's going to be a relation data model.

Uh because again pretty much in the same way that like 1 plus 1 equals 2 is the basic for arithmetic and math. Relational data model in my opinion is is the basic building blocks of how you want to build

35:02

any given. The only time relational data models will make completely sense is when you start doing matrices because you can model them in a relational database or stop them.

Not always the best way to do that. Again, we'll cover these later on.

35:17

All right. So, let's go back to 1960s.

None of us none of us were born yet. And none of us were alive, but people were building, right?

So in the early early or mid 1960s people started donating um the

35:33

some of the first systems the first one was actually built by General Electric GE in 1965 uh system called ID IDM integrated data management uh it's written in assembly uh they built this to to to host the

35:50

some Seattle chamber companies uh you know their data to keep track of all the logs or something Um but IDS so IDM was the early ones IMS mentioned they built this for the Apollo moon mission before NASA was actually called NASA and so the

36:06

in these early systems they they so before SQL formational model it was very hands-on about how you would actually write code to query the database I think they were like this even before C existed you know people were running assembly they do everything and then cobalt came along and it was sort of a

36:24

higher level language to a higher level in assembly but it was still pretty low level for how you'd actually interact with database and so there's this paper uh from 1973 I think or 72 or 73 by Charles Bachmann

36:40

he won the touring award for databases in 1973 um because he built IBM and then uh IDS and Anthony kodil and it was his way of how he thought the how database should look like so the name of caper is the the programmer as a navigator so He

36:55

has this whole little step step-by-step process here is how someone would interact with it, you know, could write queries against the database system. I don't care about the details of it.

Just pointing out there's one it's sexist everything, but two also like he can do this, he can do that. The paper talks about like you know you as a programmer

37:12

being like capernicus or like sailing the seas as you navigating through your database to find all the data that you want, right? It's a terrible idea because it's essentially you as the programmer have to understand exactly how the data is being stored in the system and write code against that those data structures.

37:29

So it's got this little blurb here talking about the some BS like the synergistic usage of the collection of which gives the programmer a great expanded powers to come and go within a large database while assessing only those records of interest. This sounds like a terrible idea.

37:44

But why? Right.

Well, so again, this guy was a big proponent of something called Kodasil. Nobody here has ever heard of Kodasil.

Probably never heard of code ball, but this is what how they were defining it in the early 1970s on what data should look like and how you program against them.

38:01

So what does it look like? Well, I I've never written code.

So check that. So say you want to write a query that says give me all the artists that appear on DJ's tribute album that we want to put out this year.

Right? This is basically what it looks like.

We're running a bunch of nested for loops to look at all

38:17

the artists and this artist they all the albums all the albums and artists basically writing traversals in the data system of these explicit data structures to find the thing you want. You're essentially telling the data exactly how you want to navigate the data to find the thing you're looking for.

38:34

You think it's a good idea or bad idea? I've already said it's bad idea.

Why? depends on which data they say it depends on what data

38:49

structure you're using underlying system. Uh I mean it's one of those AI wasn't exposed but in this case here for code of sale it's high level enough that the don't know what data structure is but you need to know that there's data structures or collections you have to

39:05

traverse. So in my example before I had what?

I had two two tables or two files. I have artists and albums.

Is it better to do the for loop on the artists first or the albums first?

39:20

You don't know, right? Well, maybe you know because you always ask how many how many lines or how many how many records are in each file.

But that may be true today, but what about tomorrow or a year from now? And then now this is the code you're writing queries on.

And now maybe

39:35

whatever assumptions you made at the beginning when you wrote this query is not the same as it is now. But because you hardcoded exactly what execution plan you want, this computer has to run this.

This is a bad idea for a whole bunch of reasons. We'll see in a second.

But the

39:51

equivalent SQL query would be something like this like joining the artist and album cable by just the the album name. Notice here it's if you never seen this before, I hope you have.

I'm not telling

40:06

you what how to actually pick the answer I want. I'm telling you what the answer I want.

And then the data system gets up their responsibility to go figure out how to actually generate that answer for you. Right?

It seems obvious now, but back then this was like a mind-blowing idea

40:22

and codill was considered the hot thing. So I don't want to know too much details of this but there was this meeting of the minds in 1974 at University of Michigan in Ann Arbor where all the code guys came along were there and all the relational data model guys were there

40:38

and they thought about it and it's all public everything's written down what they were uh talking about uh and they were pointing out all the reasons why a a hierle or navigation data model is inferior to something like a relation data model and this was considered

40:54

considered a major pivotal point in databases where now people realize oh data model is the right way to go forward but at the time there weren't any systems that could actually implement this and then after this point people started building them so the guy that proposed the the rich data model is

41:10

named Ted Cod uh and so he was at at this conference along with the code of sale guy I just mentioned Charles Bachmann uh as well as Jim Gray the guy locking built one of the early systems at uh at IBM called systemr we'll cover next

41:26

slide as well as Mike Stoner the guy who built ingress the later built postgress you ever wonder why postgress is called Postgress because it's post ingress he built first ingress and he built Postgress as the following right there have been four Tony worthy databases

41:44

these the four right here firstly Cod Bach and Gray are all dead still kicking it he's like 81 he's awesome right uh he's helping was trying to find a lawyer for the guy getting out of get him out of jail. Uh, right.

But this was like a

41:59

big big conference, big big deal back in 1974. And so the Kota guys were saying, "Oh, we should model because the paper came out in 70 1970s early 169.

You couldn't build a system like this. It's way too complicated.

Nobody nobody can reason about relation database." And

42:16

then the the relational database guys are saying all the crap they're talking about code like navigating your database that's a terrible idea. It's going to make your data really brittle and it again you're basically hard coding exactly what the query is going to be and again it's you know 30 40 years here

42:33

50 years later uh and relational data model guys have won because you all listed relational databases at the beginning no one listed a a uh cod they still exist still can pay money to somebody maintain it for you but no

42:49

startup is going to do system all right so what is about the coreation data model. Three basic key ideas that are really simple to understand.

The first is that the way you define your data through the data model is just these highle

43:06

uh collections of data called relations is in sets and that you would then represent the uh you then represent the connection between data at the logical level through values. Say like this album is

43:22

part of this this artist this album by saying like keeping track of like identifiers between those two relations rather than low-level physical pointers like this memory address or this disc offset say things are related. Everything is done at sort of a high

43:38

level logical level. You also be able to specify constraints now in your data to say what data is allowed to be stored in it to make sure that nobody's allowed to put invalid data.

Like in my file before it gives a file disc. I could open a text editor and start changing years and certain email addresses and that would

43:55

break all my programs. A relational database won't me from doing that because it'll say this column with this attribute has to be this data type and doesn't let you store anything that would viate that not always true.

Some systems will see like single let you store bunch of crap in there you

44:10

shouldn't have but data model is is sort of separate from implementation. And then the last one is the way you would manipulate or query the data is through a high level API that you declare what answer you want defined through relations.

Think of like

44:27

sets and then I don't care about the actual implementation of the database system in my query. I say this is the answer I want.

Now it's left up to the database system to decide the most efficient query plan, the most efficient way to store that data and execute that query

44:43

for you. So any assumptions you made during you know the time you wrote that query as the data changes over time and the same query shows up if I wrote it in SQL or whatever high level language I want then the database system say oh my data actually looks like this before it look like that so here's actually the

44:59

right way to run the query instead of a faster way to run the query rather than before here's the best way to store the data you're actually storing uh that you want right now rather than what what I what I thought it looked like before so we have this nice independence between the physical level and the logical level

45:14

which I'll show real quick in a second allows it a bunch of freedom to the data system implementation to do the most efficient way to store things question all right so let's talk about this independent thing again so this this is in the 1970s

45:30

computer science was still in the early days so this is like this is obvious to us now but back then this was mind like in IMS that that sit from IBM they mentioned about the API in IMS you declared I want to store this this table and it's either going to be a tree data

45:46

structure or a hash table or hashmap and then depending on what those those two data structures I chose I got a different API because I can't do range stands on a hashmap and I can't do uh uh certain kind of lookups on on on the on

46:01

the tree but then if I decide oh that was a mistake I actually want to store my tree structure and in a hash structure the reverse I got to go back and rewrite all the code to reflect that change that the API has changed But in a relation data model in SQL right language relation relational data

46:18

model I don't have that problem because I don't know what the data structure is and it's saying I want to access this data at the logical level of of a relation or a table right so at the lowest level we have this thing called a database storage uh and then above that we would have the

46:34

system keep track of like the physical schema like I have a bunch of files a bunch of pages I have a bunch of extents I declare on on this this storage medium. I'm keeping track of where all the information is and where the bits are stored and how they're being laid out.

And then above that, I have the logical scheas that define what my

46:52

tables are, what my collections are, what attributes they have, their names are, what their data types are, right? And I have a nice separation between the two of them.

So if I change one versus change the other, uh, I can change one without necessarily changing the other. I can change the physical layer without

47:07

changing the logical layer. I go even crazier.

I can have another level called the external schema. This is a way to expose or extract way further what the actual logical schema would be.

So that maybe so that I can do certain things or have uh certain data

47:24

exposed a certain way without actually changing the logical level or the physical level. So say I have a uh a table of everyone's uh every student in this class.

Maybe there's a column for like your password or something like that. But I don't want to expose that password column to every single

47:40

application. They want to access that table.

So I can declare a view to say this is what you're about to see. You hide certain things.

It's just another level abstraction. And above that is what the application is going to see.

You would access it through a high level

47:55

declarative language something like SQL. So you think of this this this point here in our graph that's the physical data independence because I can declare my my schema that here's the columns that I want but I'm not defining how I actually want to store them and the data

48:12

is free to change this this lower level all at once and it doesn't break your application and then logical data will be up here because so we won't cover this too much in this semester but it's basically a way to say yes I have this schema but for certain applications they see certain things and certain applications

48:28

should see other things and I can change all things without uh without having to rewrite any problem. All right.

So I've been going on the last 10 minutes about how great data model is what it actually what it actually is. So at its core, it's going

48:45

to represent everything as relation. Just think of relations as as like a set an unordered set where we keep track of the relationship not between the different collections.

There's actually the relationship between the attributes within a single single collection. Right?

So you have a student, a student

49:01

has a name, student has an email address. The relation represents those relationship between those those attributes for that high level entity as a student.

Um in the previous slide what what part of of this uh hierarchy is a database uh

49:19

database management system question is what in this slide here what part of the system handling everything before the application everything right we'll go throughout the semester but like down here maybe say like is

49:34

like the operating system opposite is terrible we always get in our way we hate it is that uh we need it to to survive not always but we need it like to do certain things for us but we almost as a data system we always always want to do

49:49

things without the operating system so you can think of like maybe the OS is kind of in here but it's always going to be a big problem for us we try to get around it we'll cover that as well but this everything below the application down is a data system

50:05

uh if I sometimes you can put the whole application to the ACC2. Uh not always a good idea.

Those are going to write write code that way. But you could do that.

It could be the whole thing. For this message, we'll say it's the application.

50:24

All right. All right.

So relation is going to be unordered set of data. Uh and then we'll say is a tuple.

Sometimes people call them tpples because this was around a couple. I say tuple.

um a table is going to be the set of the attributes within a

50:40

sole entry within the the the within a relation right and then for every act you could have it would be something called domain that specifies what the range or allowed types allowed value instances allow to happen so if I have an integer 32 integer the domain

50:58

would specify like 0 to 2 to 32 minus one or if it's unsigned right if it's a name it string field and so forth, right? It's basically defined with this.

We'll have a special value called null that's going to represent uh remember

51:15

every domain if it's allowed you can't have null but it's a way to represent that the data is unknown right and it be very similar to nulls in C code but it's at the logical level where

51:31

there is a value we're storing for null but we can't reason about anything about it so we can't ask questions about what that value actually is if it's null because the answer is unknown Does it make more sense next class? We'll see it.

But the main thing is like you think of like a relation as a tuple

51:47

or table with n columns. And most of the time people refer to them as relations.

People refer to them as tables and columns and rows. But when we talk about different storage models, what a column is and a row is doesn't always line up exactly like this.

But

52:03

for now, this is fine. And so in relational model we have important primary keys is a way to us to uniquely identify a single tuple in a relation.

I think it's a unique value that uniquely identifies that that a

52:18

single record and you can define unique values in other columns but it's not always considered like the the identifying that is going to be the primary. So in this case here for our artist table we have Wuang clan we have notoriiz but

52:33

certainly there's bands that come along that have the same name as the previous bands right sometimes we try to be tricky and you know remove vowels and make sure the names don't match but often times it doesn't exist so in this case here using the name would not be a good primary key because two artists didn't have the same name right so often

52:52

times you see uh or databases introduce synthetic primary keys or identifying columns Sometimes they're called uh they're called identity columns. Uh things like it's a unique number assigned to uh you know to to some record.

often times when

53:08

you see your websites look at the URLs it'll say when you see like a number represented with an order number almost always gonna be something like this it's a counter every single

53:24

identification you prefer natural identifier uh we didn't find what a natural key is natural would be like >> for example the name >> yeah the name would be like a natural key but in this it it would be a bad example here. Uh at CMU a natural peak

53:41

would be that your email address unique. Um sometimes there's programing frameworks like OMS they'll automatically generate you these identity columns.

Um I always just use that right I think it's fine it's fine

53:59

right so in this case the primary key now is the ID column because we're going to say every time we insert a new record just think there's some counter call sequence we add one to it every single time we add a new record so it's guaranteed we can also have foreign keys and this

54:15

is how we're going to allow us to identify the relationship between different pupils within different collections or relations these cells So going back here for artist and um an artist and artist album artist and album right so now I've added the the ID

54:30

column that's uniquely identifies each each record as it primary key but now in the case of the album I want to keep track of what artist appears on it and for certain albums this is fine if there's only one artist right I could have a mapping between I basically store

54:46

the the artist ID within the album itself and I know how to map that back But as I said, the challenge is going to be is when I have albums that have multiple multiple artists on them like in this case here the same. So to represent and model this, I can introduce new sort of cross reference or

55:02

uh dependency tables that allow us to basically store the relationships between records or tupils across different collections or different relations. So in this case here I would have a uh I would have a separate table now keep track of the artist ID and album ID and I would I have a record in

55:19

this table any time an artist appears on a given album. So this allows me now the freedom to represent to have multiple uh multiple artists appear on multiple albums um and not assume that there's a one to one correspondence.

Basically I can do

55:34

many many uh representations through and now I don't need to store the artists ID in the album itself. I just represent that as separate.

All right I'm going to skip constraints for the sake of time. I apologize.

55:51

Basically, it's a way to specify what values are allowed in uh additional constraints beyond the data type. So, I can say things like no name can be null.

I put a not null there. I have other constraints that say like no artist can be born or can go solo after before 1950

56:08

or 1900. It's a bad example.

We can imagine other things like no email address at CMU can can not contain andrew.cu.edu something like that. You can declare this when you find the table.

You also have global assertions missing every single time you insert a record. It runs

56:26

a SQL query here and here to check this out val because it's slow. Uh but it's a way to model more complex constraints and data restrictions and the basic one

56:41

for >> question. What is the foreign key?

Sorry, thank you for stopping. So foreign key is going to be the uh the foreign key would in this example would be over here.

So you have a relation that's defining values but you're saying that the for a given column or attribute

56:58

it really is a value that has to exist in some other relation here. So you would say I would say I'm declaring this relation called artist album.

I have an album ID and artist ID. for any album ID in this column here, that album ID has to exist in in this column or this

57:16

table. And that way I avoid the problem like if I delete a record here, then I don't want to have a dangling pointer where I have now an album that has an artist.

I have an artist on album doesn't exist anymore. The data sync would know, okay, you're trying to delete an artist, but I know that I have

57:32

a foreign key reference to the artist ID here, and this has a foreign key reference to an album over here. So that prevents me from deleting one and not the other.

>> Yes. >> The question is what is the primary key for the uh for this table here relation

57:49

here. So anytime I have the underline that's the that's the primary key.

So it's actually the combination of the two columns together is the primary key. >> Can you do it in SQL?

Absolutely. Yeah.

Why not? >> Question is can I can I have can I

58:05

declare a primary key with multiple columns? Absolutely.

Yeah, >> why not? >> Yeah, >> we can see that next class.

>> Okay, so what I'm talking about so far again is

58:21

the high level uh concepts of relational models, how to define a relation or a table, right? Again, the ter the terms used interchangeably like in SQL you say create table, right?

you refer to things as tables, but in the in the sort of

58:36

research literature or in the original definition of the paper of the relational data model, they weren't called tables, they were called relations. They're basically the same thing.

So we now we have to find what our relations look like. Now we want to actually run queries on that.

So this is called the DML, the data manipulation

58:53

language. Even though also too we're not going to be uh we have read only queries like select statements, they still fall under the category of of the DML.

manipulating the data to perform it somehow. So there's basically two ways to do this.

There is a procedural uh language

59:12

and a nonprocedural language or declarative language. So what I'm going to teach you today will be relational algebra which should be a procedural language where I'm defining the exact steps I want the data system to use to execute my query and I'm actually defining the order in which

59:28

I'm going to execute them which actually goes against what I said before where you don't want to be doing that you want to have things be high levels decide the best way to execute that and this is what so tech cod defined model with algorith at the very

59:43

beginning and then the high level languages that allow you to do the have that declarative interface that came later on the the declarative way using what's called relational calculus right SQL basically maps like this so we'll cover

59:59

that next class you don't really need to know relational calculus unless you put into a query optimizer which is not this class for the most part uh we used to teach relational calculus in this class but you go out in the real world nobody uses it so you don't really need to know it but you already need to know relational algorithm that's going to be the the fundamental building blocks of

00:16

how power and should be pretty straightforward to understand right so relational algebra the original definition from pod seven operators

00:34

listed here we go through each one one by one uh we'll say at the end this has been extended over here for things you actually need in in different runfield queries There's no notion of sorting in these operators. There's no notion of uh doing aggregations like think of like the

00:49

average of a right that all came later as extensions but the original ones these seven here and you can think of like you can build more complex queries from just these seven operators by chaining them together where you allow the output of one operator be set as the input to the

01:06

next operator and you can use that to build uh larger more complex next right we'll go through each one by and I'll show you what the corresponding SQL looks like. And then that'll segue us into next class.

All right. The most

01:22

basic operator you can have is called select statement. Uh and this is just like you're you're filtering out the tuples you don't want based on some predicate.

You're defining the the first order predicate logic to say these are the tuples that are allowed to be emitted as the output for this select operator. So say we have a simple

01:38

relation here R. It has two attributes a ID and B ID.

So I can have a select operator that says give me all the records from the material from R where a idals a2 right and then there's some the data is

01:55

somehow filtering out that data according to that predicate producing producing the output I can do more complicated things I can start combining together through junctions and disjunctions I can have multiple predicates be evaluated so in this case here I'm saying all the the r

02:11

a A ID equals 2 and B ID is greater than 102 and that produces the output again in SQL it's basically equivalent to the wear right and it's just it's just fully in logic to find what predicates or sorry

02:27

what tool match according to the values of whatever is in my expressions one more time so that's pretty easy to understand right I can also have projection projection allows me to manipulate what

02:45

the output of those attributes have select them on uh in such a way that that I need them the way that I want them produced as part of my output. So I can rearrange the ordering of the attributes, right?

I can I put one column first before another one. I can remove columns I don't I don't care

03:01

about as part of my output. Remember I said filtering out things like the the the password of a table.

I don't want somebody else to be able to see that. I can do that through low projection and I also manipulate the the values of the attributes produce to derive new

03:17

values according to whatever it is that that I want for my application. So you define something like this.

This here I have my uh select operator where I'm printing out all the uh all the records with a a equals 2 and then I find my projection that says take the b

03:34

attribute for any that matches that to my select that's being fed as the input and take the b attribute subtract 100 from it and then produce the the also the the attribute a id. So again in a select statement

03:50

projection is just this front part here the after the select you're defining the projection list defining the attributes and what comes as the output. >> Yes.

>> So the previous slide was called select like the title of the previous slide was select. >> Yes.

Really we were talking about the where keyword previous is the word

04:07

select and relational say it is um is the select operator English operator is that different than the select operator in in SQL think of like a select as uh the select

04:25

algorith it's a way for you to yeah to you Think of these relation algebra as the building blocks to build a larger language like SQL. So you need a wear clause and so the relation operator and the select

04:42

operator will give you that wear clause. You need projection out list project operator will give you that and then the from clause will be joins as a follow that operation instead of

04:58

that operation. question is uh instead of using pi if I say sigma for select with the reordering of the attributes >> the same comm that work with a select operator

05:14

>> if I replace no uh question is if I if I instead of using a projection if I do a select operator with the same expression is that still valid Well, in this case, no, because I

05:31

have a comma separated list where I can't evaluate whether a a tuple evaluates to true for for that expression list. If I only had B I less minus 100, what does it mean to evaluate that from boolean logic?

B B B B B B B B

05:46

B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B The idus 100 it equals whatever that value is. But is that true?

Technically, yes, because it's not zero boolean logic. So that that would always value true.

But the semantics are not the same. You just can't swap them out.

06:02

>> Yes. Yes.

06:29

>> Give me like for more slides and then we'll come to that. The question is basically I made a big deal about how you want to compare a language like SQL.

Now I'm showing you like you know the defining the order of the steps based on relation algebra. relation alpha will be the building blocks for how we then build a recover language alpha.

Another

06:46

way to think about this is like when I build the execution engine for for database the thing that actually runs the queries it's going to look a lot like relational alpha because I'm going to find a project operator. >> Yes.

07:05

Uh the statement is so are these all these operators commutive? Not always but sometimes yes.

>> Uh actually you know relational they're always uh

07:20

if you start projecting things that you haven't produced output yet that they're not communive but in general yes they are communitive. Um the point I'm trying to make is like when I build the execution engine to run a SQL query it's going to the operators the actual implementation of the execution engine it's going to local

07:38

but you we'll see decision in a second but to to the point maybe point out is like if I took the relation algorith as my input and actually started executing it I would execute exactly the way that's defined algorith which is not what we want when and have all the freedom we

07:54

want and that's what SQL does not SQL is not going to have that problem. >> Yes.

>> So this SQL that's on the screen is that is that a selector? >> Question is what is this?

What is this select statement here? >> Uh it's a high level sorry it is a

08:10

select statement in SQL that contains a projection called projection call and you can think of like a bus 100 is equivalent to the pi operator with the this expression above that. So that's not being

08:27

>> uh question the like the output here >> does it correct any output of a single statement it's going to be a table or relation any output of a relational operator integration operator it's going to be another relation so you can daisy chain them together and produce take the

08:42

output of one produce that input from another okay uh union for you know basic accessory your teach as a basic mathematics prerequis

08:58

you uh you can union together two relations and concatenate all that it uh it removes two things. So for this in union operator intersection uh and difference you have to have the same attributes in in the two relations

09:13

because otherwise it doesn't match up because the output needs to be have the output relation of the operator needs to have the exact number of the same relation same attributes as as its inputs. You can do this in SQL.

Basically

09:28

something like this, right? I'm going through this fast.

I apologize. Intersection is basically the same thing, right?

They have to have both input relations have to have same attributes. Same for difference.

Again, this is all basic set theory. This should not be anything new.

You can do

09:44

the same thing in SQL. Reason I'm rushing is I want to get to joins and and product.

So up until now, what have you done? We've taken one relation as our input, manipulated and taken as the output and send it along to a

10:01

the next time I look up. And then for union and intersection and difference, I'm combining together multiple relations, but I'm not really leveraging any of the uh relation between the values within those those relations.

I'm

10:17

not leveraging the foreign key stuff before. Um kind of running out of battery.

There we go. So with joins with the cartisian product the idea is that we can now combine them together and do some some lookup to say what tools should actually

10:33

satisfied uh in my when I combine these records together. So cartisian product is literally just mashing together the two uh two attributes the attributes two relations producing a new a large has all the attributes from one relation all

10:48

the attribute the other relation right you can do this in SQL you pass the cross join uh or you don't specify cross join you get that by default right but this is not actually what we want because we want to say things like for a given album give me all the artists that are on that album you don't want to get

11:04

all combinations of artists and albums that's kind It's a bunch of noisy ideas that I don't care about. So this is where the join comes in.

And you can kind of think of join as like a cartisian product. You have a filter to say I have a or have a select predicate say these are the the

11:20

tuples that should match according to some value of one relation and the value in another relation. So I can do things like this.

Um I don't specify what the join calls it. It's called a natural join.

And in this case here it's going to take all the attributes that that are in one relation. all the other relation

11:36

that have the same name, the same type and I look for matches and I lose my output, right? And I'm gonna include all the I'm going to include all the columns that are two relations as it were the product just I remove all any of the

11:52

duplicates here. So I only get the I only get the the address that are unique from one column from one relation and the other relation.

So you can declare this single through a natural join. I don't recommend doing that because it's implicitly doing a look up between the different relations

12:07

to see what where the matching attributes are. You said you want to use a join clause where you specify either what attributes you want to join on or actually what I would recommend is explicitly saying what the join predicate should be.

So it it knows how to do that match

12:24

>> in this case question is this in join? Yes, >> we can cover joins later.

All right, we're almost out of time. So, I'm going to quickly just say there's other operators that exist.

Uh the I'll finish this last few slides and

12:40

that'll segue into next class. But the as they pointed out the relation relation oper I showed you so far that's specifying the order in which I want things to apply.

Right? And so I want to say give me all the uh

12:55

join relation ration S and do a look up for uh filter on the the B equals 102. I can have wildly different performance uh of these query.

If I do the join on RNS first then do the filter versus the uh

13:13

do the filter on S first and then do the the join. Right?

Always think of extremes. something I say throughout the entire semester.

If I have like think of like a table with like five records, think of a table with a trillion records. So, does it make do I want to do a join between

13:29

RNS of of two tables with a trillion records and then filter out the ones I actually want because what if only two match and I I did a bunch of join a bunch of tables I didn't I didn't actually I didn't actually need and I'm better off maybe doing the filter first then do the

13:44

join. So this is what the high level language of of something like language what SQL is going to do for us.

I'm going to say this is the answer that I want and I'll figure out that's what actually is. So I'll stop here and say this is where we're going to go next class now we're going to say instead of

14:01

writing procedural code Python lang we're going to be writing SQL. Okay.

That's money over

14:19

[Music] the fortune. Get the fortune

14:35

maintain flow with the grain. Get the fortune maintain flow with the grain.

#01 - Relational Model & Algebra (CMU Intro to Database Systems)

Summary

Transcript