#02 - Modern SQL ✸ dbt Database Talk (CMU Intro to Database Systems)


Category: Database Management

Tags: Data, DBT, Queries, Relational, SQL

Entities: BigQuery, Data Warehouse, DB2, DBH Cash, DBT, Drew Banin, IBM, Larry Ellison, Oracle, Pace Rick, Redshift, Snowflake, SQL, Ted Codd


Summary

    Class Announcements
    • No class next week in person; will be recorded and posted on YouTube.
    • Project zero and homework one are due on September 7th.
    SQL and Relational Databases
    • Relational model is the fundamental way to represent data in databases.
    • SQL, developed by IBM, became the standard query language for relational databases.
    • SQL is continually evolving, with the latest version adding features like property graph queries.
    • Despite claims of its replacement, SQL remains integral in data management.
    SQL Syntax and Features
    • SQL supports aggregations, group by, and having clauses for data analysis.
    • Nested queries and common table expressions (CTEs) allow complex data retrieval.
    • Window functions enable computations over ordered data sets.
    DBT Overview by Drew Banin
    • DBT helps companies manage and transform data within data warehouses.
    • It allows version control of SQL code and visualizes data lineage.
    • DBT is widely used for creating derived data sets and improving data quality.
    Actionable Takeaways
    • Start project zero and homework one early to manage deadlines.
    • Understand the importance of SQL in data management and analysis.
    • Learn the basics of SQL, including nested queries and CTEs, for effective data querying.
    • Explore DBT for managing data transformation and lineage in data projects.
    • Stay updated with SQL's evolving standards and features.

    Transcript

    00:00

    [Music]

    00:25

    All right, let's get started. All right, give it up for DBH Cash again.

    [Music] Awesome. Uh, all right.

    So, quick reminder, I'm going to be gone next week. I'm going to London to see your boy, uh, Pace Rick.

    Oh, my man.

    00:41

    Yes. So, I'm so no class next week uh in person.

    I'll record it while I'm remote and post on YouTube uh and I'll say hi to your your friend, whatever. Uh, so all right.

    And then project zero, that went out on Monday. That'll be due on September 7th.

    Homework one will be

    00:56

    released later today and that'll be also be due on the same date. So again, project zero is the C++ project to prove that you know C++ and you can you can debug multi-threaded code.

    Uh so please go you know start that sooner rather than later because you'll set up your dev dev environment and make sure everything's working. All right.

    So last

    01:12

    class we were a bit rushed but we talked about how the relational model was the superior data model uh for databases and it pretty much is the fundamental way you would represent any you know any major application database or data uh in in a system and then we also talked about how relational algebra will be is

    01:28

    the building blocks for us to uh manipulate and query data in a relational model database in a relational database system and so today's class is really about sort of we won't go we We're going to go back and look at relation algebra again. Now

    01:44

    we'll talk about SQL which is in a higher form of a query language than what relational algebra is. But then when we start actually talking about how to implement the system itself and and actually execute the queries uh that we'll talk about today that's when relational algebra will come back and sort of fitting okay now you see how

    01:59

    these these are the building blocks for building the system. Okay.

    All right. So so uh let's go back to the 70s um and the very beginning of of SQL.

    So as I said, when Ted Codd wrote the original paper on the relational model in 1969/1970, it was just the relational model

    02:16

    and relational algebra he didn't go into anything in details about the implementation of the system that could could run a relational model and he didn't say anything about what the query language should be. Um, and so the other people at IBM picked up the paper and said, "Hey, this is actually a really

    02:31

    good idea." Even though IBM was making a ton of early money on a non relational model system called IMS. And so they decided to start building a prototype in England.

    Uh, and the guys that invented SQL first developed a query language

    02:46

    called SQUARE, and it's considered to be the first relational query language, but nobody actually used it because the syntax is pretty esoteric and there wasn't really a keyboard or any application to let you write code in it.

    You write things in this vertical

    03:02

    in this vertical manner. This is the classic example of getting the average salary of employees in a department.

    Right? So again so so this was square.

    Nobody actually did this. uh they threw it away and then the a bunch of the people that were working in in England they moved to uh San Jose in

    03:18

    California at IBM Research, and they started building the first prototype — or the second prototype — at IBM for a relational database system called System R. It's very influential; a lot of seminal work came out of this. We'll see many times throughout the entire semester where I'll say, hey,

    03:34

    this is a kind of cool idea, and here's a bunch of new systems in the modern era that actually implement it — well, it turns out IBM invented a bunch of that crap in the 1970s, right? And so they first developed a query language called SEQUEL, spelled out S-E-Q-U-E-L, in 1972, and this was Chamberlin

    03:52

    and Boyce, and it originally stood for Structured English Query Language. But then IBM later released a commercial product based on the relational model and SQL in the 1980s.

    Then they got sued, because some other company out there was already using the name SEQUEL, and they

    04:09

    changed it to be SQL. Right?

    So sometimes you'll see people pronounce MySQL as "my-sequel." A lot of the older guys from back in the 70s and 80s refer to it as "sequel" because that was the original name.

    So IBM released

    04:24

    They never released System R. They later released other commercial systems like System/38.

    That's not really around anymore. SQL/DS is around but it's kind of in maintenance mode.

    But DB2 in 1983 was the big one that IBM put out and that's still out today. I guarantee if you run any

    04:40

    transaction online with some website, some bank, chances are very likely the bank is using DB2, right? There were other commercial systems in the 1970s.

    The first one was probably Oracle which was

    04:56

    founded by Larry Ellison. If you don't know him, he's the third richest person in the world.

    He owns a Hawaiian island. Uh all paid for by databases.

    Again, this is why this class is important. All right.

    So SQL uh was originally again this proposal from IBM but again IBM was

    05:11

    the juggernaut of computing back in the 1960s, 70s, and 80s. So whatever IBM did, everyone said, okay, that's the standard.

    The Oracle guys lucked out because they basically copied what IBM was doing. The Berkeley guys that built Ingres

    05:26

    had their own query language called QUEL. But then when IBM put out DB2, SQL became the de facto standard.

    It became an ANSI standard in 1986 within the United States and then an international standard in 1987, and then they changed the official expansion to be the Structured

    05:42

    Query Language. So even though SQL is old — I just told you SQL was invented in 1972-73 — it's not a dead language, right?

    It's actively being used everywhere. I keep telling you while you know throughout the entire semester knowing

    05:57

    SQL and writing SQL is super important and they put out new versions of it all the time. The latest one came out in March 2023 uh where they added a bunch of new features like property graph queries uh and multi-dimensional arrays.

    You ever heard of systems like Neo4j, Dgraph, there's a bunch of other these

    06:13

    graph databases. Now you can do all that in SQL.

    So the great thing about SQL is like it evolves over time and as applications change uh and what people want to do with data changes. SQL uh gets added.

    They add new features to SQL

    06:28

    to adapt to the times. And the challenge is going to be — and we'll see this throughout this lecture — that the SQL standard is going to say one thing, but nobody's actually going to implement it exactly.

    Every single data system is going to have their own quirks, their own little extra stuff that sprinkle in. Sometimes they're

    06:43

    ahead of the standard and they try to get whatever their version of a feature is into the standard. And sometimes they the standard comes out and then they add it later on, but they still take their own uh liberties, if you will, with it.

    And we'll see what some of the biggest offenders are are today in our demos. So

    06:59

    if you have a data system today and you say you're going to support SQL the bare minimum you need to have say I I have a SQL database is what was defined in the SQL 92 standard. So this is your select insert update deletes uh your create tables create indexes the the transactions the very basically you need

    07:15

    to say you know SQL and I have SQL is that you follow the SQL 92 standard but today's class we'll go a little bit talk about the newer stuff. Um and so people come along every 10 years and say SQL is a terrible language.

    It's not great. Certainly has its uh rough spots.

    Uh but

    07:32

    every attempt to try to replace it in the last 50 60 years has failed. So right now we're in the phase where all the AI guys are saying, "Oh yeah, SQL's going to die.

    "You use LLMs for everything," yada yada yada. So this is a guy working on a system called Vespa.

    He's basically saying natural language can replace SQL. This guy's

    07:50

    like a crypto bro, and he's trying to sell some video demo or some code-academy-type thing, right, where you can use ChatGPT instead of writing SQL.

    Um, so I'm not here to say that SQL is not going to be uh

    08:05

    replaced in my lifetime. It it may be, but it's very unlikely.

    It was here before you're born. It'll be here after we die.

    So it's not going to go away. But when you sit down, are you going to immediately write SQL by hand in many cases? Oftentimes,

    No. Right.

    Yes.

    08:26

    The question is what is what would replace SQL? I don't know.

    So we actually we uh we did a seminar series last semester. We had all the people that are trying to replace SQL talk about how that would replace SQL.

    The closest one that I think could not replace SQL but extend

    08:42

    it is actually Google's pipe SQL syntax. And I can give a little demo of that in a second.

    What's that? You read Google?

    Yeah. So So that thing's Oh, you wrote it.

    Okay. Yes.

    So there you go. So that's probably the closest thing, but it's still fundamentally it's it's a an enhancement to SQL.

    09:01

    What's that? Question.

    What keeps it going? Again, going back to this, it evolves, right?

    Like here's all the different versions like say 15 years ago what was the hot database MongoDB because of web 2.0 0 whatever the buzz word was at the time you were storing everything as JSON well

    09:17

    the SQL standard now adds support for JSON. Twenty, thirty years ago XML was the hot thing — well, SQL supports XML. The point is the flexibility of SQL: SQL will provide you a basic building

    09:33

    block that you can then add more advanced features and do additional things right um now again I mentioned like are they doing things is it is it perfect no one of the biggest complaints is that the select clause clause becomes comes before the from clause, right? Well, in the original proposal from IBM

    09:49

    in the 1970s, the FROM clause came first, then the SELECT. But then they did a user study at Arizona State University, and all the students they tested said, "Oh, we like the SELECT before the FROM," and that's why that's the way it's been for 50-60 years — because of those people,

    10:05

    right? We'll we'll see some data systems we we can play up we can play around with u and see how you can change things and sometimes you don't always have to have in that order but in general you do.

    All right. So again I like I've said many many public statements before that like

    10:22

    uh SQL is is is the most important uh programming language you need need to learn as a student. Uh because again no matter whether you're actually building a database system or just using one the rest of your life you're going to come across SQL.

    And maybe the LLMs can provide you a first draft of of a query

    10:37

    language. Maybe it'll end up being like assembly where there's the high level things you can write and then that gets transpiled down to SQL which is what a lot of replacements of SQL are trying to do.

    Um but everybody has to learn SQL. It's very very important.

    And so to give you how uh an example how serious I am

    10:54

    about learning SQL uh the audio up here. Oh, there's no audio.

    That sucks. Um, well, this is my 5-year-old biological daughter.

    And every morning we wake up at 7:00 a.m. and she has to write SQL before she can have breakfast.

    11:10

    Um, is that considered child abuse? Um, going to love me more since I wrote.

    I tell her mom loves her more when she writes SQL and it works. Okay.

    Um,

    11:26

    all right. So, what what does the relational language give it?

    What is SQL going to give us? There'll be sort of three basic components.

    The DML, DDL, and DCL. DML is the data manipulation language.

    When you write a select statement or update, insert, update, deletes. Those are all going to be DMLs.

    The DDL will be the definition of like a

    11:42

    table, an index, or a view. And the DCL will be about controlling who's allowed to see what data, which accounts have access, and so forth.

    There's other things we'll cover later in the semester like transactions, like how to call begin transaction and roll back and things and things like that, but we we'll we'll we'll worry about that

    11:57

    later. One important thing we we got to understand also too is that although the relational model is based on sets, right?

    We looked at unions and intersections; that was basic set theory. SQL is going to be based on bags.

    So sets are unordered collections

    12:13

    of data where you don't have duplicates. Bags are unordered collections of data where you do have duplicates.

    Right? Lists would be ordered collections that have a defined order.

    Bags and sets do not. So that means that

    12:29

    In SQL they made the choice to allow duplicates, because it requires extra work to remove the duplicates, and so if you really want to remove them you have to add extra keywords like DISTINCT and other things to remove what you don't want. So by default we'll have

    12:45

    duplicates, and most of the time it's not a problem, but if you do care about removing duplicates, SQL allows you to do this. All right.

    So, we're going to do a quick crash course on on the sort of the basics of SQL, but I want to cover the things that are that are that are on the more modern in the

    13:02

    um uh in the in the spec and we'll look at some examples of what how real systems actually run these things. And then uh for today, we actually have one of our first flash talks for uh for the semester.

    So, the the guys from DBT are going to come give a talk about their system. Who here has ever heard of DBT

    13:17

    or used DBT before? One.

    Okay. It is one of the most widely used database tools in in the world today.

    Uh it's basically if you're a data scientist or doing any major data analysis, chances are they're running on dbt, right? It's it's pretty

    13:33

    significant. And guess what?

    It's just just based on SQL. SQL with templates, right?

    So they'll show you uh they'll talk about what what they're doing over there and then they'll come again uh come give a pitch for recruiting uh later in the semester. like one of our best students from 445 last year.

    She's

    13:50

    there right now. All right.

    So today's class, the example database we're going to use will be just a really simple representation of a university, right? There's a student table, there's a course table, and there's an enrolled table: students take those courses and they get grades, right?

    So we'll just use

    14:06

    this as a running example throughout uh throughout today's class. All right.
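
    For reference, a rough sketch of that example schema; the exact column names and types are assumptions based on how they're referenced later in the lecture:

        CREATE TABLE student  (sid INT PRIMARY KEY, name VARCHAR(32), login VARCHAR(32), gpa FLOAT);
        CREATE TABLE course   (cid VARCHAR(32) PRIMARY KEY, name VARCHAR(32));
        CREATE TABLE enrolled (sid INT REFERENCES student (sid),        -- students take courses
                               cid VARCHAR(32) REFERENCES course (cid),
                               grade CHAR(1));                          -- and get grades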

    So the very first thing we can want to do is aggregations, right? I'm assuming everyone here has seen SQL before.

    Of course, you know how to do basic selects, inserts, update, deletes, right? We don't need to cover those things.

    But now we want to start

    14:22

    extrapolating new data or new information from the data we've already collected and put it into our database. So aggregations are the basic form of how to do this.

    The idea is that you're going to take multiple tuples, or multiple values across multiple tuples, and coalesce them into a single

    14:39

    scalar value that computes some aggregate function like an average, min/max, or count. Other systems support things like standard deviations or geometric means; the basic idea is you're collapsing down multiple values and producing a single output.

    14:55

    Right? So the aggregation function can pretty much be used uh only in the output list or projection list of a SQL query.

    Not always the case because of nested queries, but for now we we can ignore that. And so you can do really basic things like you can say uh count the

    15:10

    number of students that have an '@cs' login, right? And then in our projection output list here, we have a COUNT function that's computing, again, for every student that matches that predicate.

    I think of then the the the filter from from last class.

    15:26

    My projection output is going to maintain a running tally of the number of students that it counts. But notice I have COUNT with the login field in there.

    Does that make any sense? Does that sound right?

    Well, what does it mean to be counting login?

    15:42

    Well, it turns out for some functions like COUNT it doesn't actually matter what you put in there. I can do COUNT(*) — produces the same answer.

    I can do COUNT(1) — produces the same answer. You can put COUNT(1+1), whatever.

    Okay, because at the end of the day all it's really doing is counting the number of tuples that get

    15:57

    matched here. Right?
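
    A sketch of the counting example being described, assuming the student table above; all three forms return the same count:

        SELECT COUNT(login) FROM student WHERE login LIKE '%@cs';
        SELECT COUNT(*)     FROM student WHERE login LIKE '%@cs';   -- same answer
        SELECT COUNT(1)     FROM student WHERE login LIKE '%@cs';   -- same answer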

    So that that's a basic aggregation. If we try to do things like uh have multiple attributes in our projection list in addition to an aggregation, we can have some problems.

    Right? So in this query here, we're

    16:13

    trying to get the average GPA of a students enrolled in each course. And obviously now we want to include the course ID, right?

    Right? Because we want to know what what course the average is being computed for, right?

    But is this correct? Shaking his head.

    No. Why?

    16:29

    Well, he says it's group by that that'll be the answer. How to fix it?

    But what does it mean to compute the average GPA of all the tuples, but then spit out some kind of course ID, right? What's that?

    16:44

    Statement is there isn't a canonical one that should be part of the output. Correct.

    Right? So, this is invalid.

    Most database systems will throw an error and say you can't do this. SQLite will let you do this.

    MySQL used to let you do it. It doesn't let you do it anymore.

    Um and so the way to get around

    17:01

    this would be: you can put ANY_VALUE in there. This is something new that SQL added, and it produces some output.

    Is this correct? I don't know.

    But like from the database systems perspective, you told it what you wanted and it computed it for you,

    17:16

    right? So he brought up the way to solve this: GROUP BY.

    So GROUP BY is a way to partition or cluster the data as we're scanning along so that when we

    17:33

    compute our aggregation, the aggregation will be derived from each group, each cluster. So again, the query is: I want to get the average GPA for each course.

    So I add my GROUP BY clause with the course ID, and say this is my original table. Now when I

    17:49

    scan through I can I can look at the course ID and basically keep track of you know where the boundaries of these are. And then when I compute my aggregation I know to compute the average per per group.

    I didn't say how to implement this, right?

    18:04

    The data system is doing this for us. You could either build a hash table.

    You could actually sort things first, then do the group by. We'll cover how to do that later in the semester.

    But at this point in the class, I'm just trying to say, hey, look, I can group things up and the DBMS will figure out how to make sure I produce the right

    18:20

    calculation. Right?
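
    Roughly, the GROUP BY version of the average-GPA-per-course query being described (table and column names assumed):

        SELECT e.cid, AVG(s.gpa)
          FROM enrolled AS e JOIN student AS s ON e.sid = s.sid
         GROUP BY e.cid;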

    You can go more sophisticated with GROUP BYs with what are called grouping sets. There's also a thing called ROLLUP and CUBE.

    We we don't have to worry about that. But say you want to do something like I want to compute

    18:36

    multiple aggregations uh on my data but I want to be able to do this in a single query. Now I can do the union operator that we saw last class.

    I can add that to SQL and do a bunch of separate queries and mash things all together. Um but if I can tell the data system that like this is

    18:52

    exactly what I want uh and kind of put everything together in a single package or a single query then although we said it's just be declarative and we don't have to tell the data system what actually we how to actually execute the query we can be nice to it and actually produce queries for it that it has a

    19:09

    better time reasoning about and crunching on. So you have GROUP BY and then you specify GROUPING SETS, and now within my GROUPING SETS clause I can specify the separate groupings that I want, and each of these will be executed as if it were a separate query, but really

    19:25

    what's happening is it's scanning through the data once and computing this answer for you and sort of mashing everything together all at the end. Right?

    So the first group I say group everything by the course and the grade then group everything by just the the course name and then get give me the overall number uh for the total. So this

    19:41

    is trying to count the number of students per course and per grade, per course alone, and in total across all students. What about what?

    Sorry. Okay.

    Yeah. Okay.

    All right. So again,

    19:58

    just think of like and these are like groups or clusters. So here's the red one, right?

    There's the blue one. Here the bottom and then the the overall total is at the top.

    And then for the cases where the the output doesn't have the the column that's being grouped on,

    20:14

    they'll put a NULL there. It's kind of funky, but again, the one recurring theme we'll see with queries is that we want to reduce the amount of back and forth between the database server and the application.

    So we can give as much information all in one one single query, give it to the database system. It can ideally do a

    20:31

    better job at running this more efficiently than it would from multiple round round trips. Is this clear?
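
    A sketch of the grouping-sets query just described: one scan produces counts per (course, grade), per course, and an overall total, with NULLs filling in the columns that a given grouping doesn't use:

        SELECT e.cid, e.grade, COUNT(*)
          FROM enrolled AS e
         GROUP BY GROUPING SETS ((e.cid, e.grade), (e.cid), ());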

    All right. Let's say I want to do some additional filtering after I've done my aggregation with the group by.

    Well, SQL

    20:46

    supports that with the having clause, right? And the basic idea is that I want to be able to filter things out based on whatever is being computed as as the as part of the aggregation, right?

    So in this case here, in my WHERE clause, I want to filter the courses

    21:04

    where the average GPA is greater than 3.9. So I only want the ones above 3.9.

    But what am I trying to do here? I'm trying to reference average GPA but that's being produced as the output in my projection list for my select statement.

    So at this point again think of like the data is actually

    21:20

    executing this. You're scanning through the data.

    You're trying to apply some filter. you don't have this average computed yet.

    So, how do you know whether to throw away a record or not? So, you can't do this.

    Instead, you want to do this through a having clause. But now, I can put what my filter should be

    21:37

    at the bottom. Now, in some database systems, they're really strict about what's down here in the in the having clause.

    Right? Again, I have an alias for the average GPA.

    I'm calling it avg_gpa.

    In some systems, you can't do that either, because the system doesn't know about the

    21:53

    alias up above is at its point as it's scanning. And I sort of said SQL is not supposed to be about thinking about how things are actually going to be executed in like what what order of the operations, but in some cases it it actually is that case or some cases it it is that way.

    So some systems won't

    22:10

    let you do this, and instead you have to repeat the actual aggregation you want down here at the bottom. Postgres requires this.
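
    A sketch of the HAVING query being described; some dialects accept the alias in HAVING, others (Postgres among them) make you repeat the aggregation:

        SELECT e.cid, AVG(s.gpa) AS avg_gpa
          FROM enrolled AS e, student AS s
         WHERE e.sid = s.sid
         GROUP BY e.cid
        HAVING AVG(s.gpa) > 3.9;   -- some systems also allow: HAVING avg_gpa > 3.9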

    Yes. So the from where clause is.

    The statement is uh the from where

    22:26

    clause is effectively a join. Uh oh, sorry.

    Yeah, sorry. Um, yeah.

    So, right here I have from enrolled as E, comma, student as s. Yes.

    So,

    22:41

    No, let's not talk about the Cartesian product. So in this syntax here, I didn't explicitly say JOIN.

    Uh this is like the old way of writing it without the explicit join. Don't do what I did here.

    You should always include join. Uh but you the database will be

    22:58

    smart enough to recognize that I have e.sid = s.sid.

    So it would know, okay, this is from this table here.

    This is from that table there. I've got a join.

    And it's an inner join. It'll ideally do that rather than the full Cartesian product.

    Yes.

    23:14

    [Music] All right. So this question is I have the average aggregation in both the having and the select output.

    Is the system going to be stupid enough to run

    23:30

    both of them? Hopefully not.

    There's no guarantee, but ideally it only computes it once. Most systems will be smart enough not to do that.

    Just for whatever reason some systems don't let you reference the alias things up above

    23:47

    other questions. Okay.

    Right. And it produces the output like this.

    All right. So that's the basic for aggregations.

    Um and again with some slight nuances about what the having clause looks like and

    24:03

    what the grouping sets may look like. Uh in general most data systems will support all the things I just showed you here today.

    Right. Next we'll talk about different things you can do starting different data types.

    We'll start with strings that should be mostly the same across different data systems. Then we'll come into dates and times and that's when it

    24:19

    all falls apart, right? And although the first homework for you guys will be based on DuckDB, that's really because it's a great system and it runs on every single laptop or machine you have.

    It even runs on the Andrew machines, even though they're running an old version of Ubuntu here,

    24:35

    right? Other than SQLite, there isn't another system we could give you that could run everywhere.

    But I'm not trying to say, okay, you're learning DuckDB SQL syntax. What I really care about is: do you understand what you can possibly do with SQL at a

    24:51

    high level because when you go out in the real world or use different data systems again all these nuances that we'll talk about here they're going to be every system going be have their own little quirks and be slightly different. So if you understand the fundamentals of SQL then you can take that and go understand how to uh you know uh apply

    25:06

    that knowledge to to different data systems. I would say it is uh easier to take a SQL dialect um from one system and and apply it to another.

    Now it's not going to be a one to one like it's not going to be I'm not trying to say you take SQL and just like do search and

    25:23

    replace and magically runs on the next system. There's a lot of people who spend a lot of money trying to convert from from one system to another.

    I'm just trying to say that like it would not take you a a a lot of time to get up to speed pretty quickly with a new system. Like going from Python to Rust,

    25:38

    that's a major lift. Going from MySQL to Postgres, again, mentally less so.

    Okay. All right.

    So, there's string data types. Of course, uh the SQL standard says that any string you store in a database will be case sensitive.

    uh and

    25:55

    you can represent them or you you the constants of them as single quoted strings. Most systems follow that.

    MySQL is going to be the dark horse here, where strings are going to be case insensitive for whatever reason.

    Uh and they're going to support both single

    26:11

    and and and double. So let's see what what this what happens when you do this.

    Can we all read that or should I make the screen uh is that better?

    26:28

    It's too dark. All right.

    Wow. All right.

    Let me log in with my laptop here. Um because it's a pain to type on the surface here.

    All right. So, I said that strings were were uppercase.

    Sorry. In

    26:45

    in the SQL standard strings are case sensitive and you represent them with single quotes. So this is Postgres.

    So I can do select star — oh, I lost my connection. Awesome.

    There

    27:00

    we go. So look up the student table where uh name equals RZA,

    27:21

    right? Nothing shows up.

    Make sure I have capital. See, so that doesn't work because it has to be a capital A, right?

    Why is this going so slow? Sorry.

    27:37

    You know it is. I think I'm up uploading your your videos from last time.

    Let me get rid of that. Sorry.

    Much better. All right.

    So, if we now try this in MySQL,

    27:55

    right, that works. But I can also do this, right?

    That seems that seems kind of weird, right? I can also use double quotes

    28:14

    and that works. If I go try this the exact same query on other systems, it should not work.

    So this is Postgress. Sorry, doesn't work.

    Okay, I do that.

    28:33

    Right? In Postgres, double quotes are meant to represent an escaped name of a column or table.

    But if I have a table with a space in it, I can use double quotes to take care of that. Over in SQLite now, it is case sensitive,

    28:53

    but it lets me do the double quotes. In DuckDB, it's case sensitive; here it's complaining because it's treating the double-quoted string as the name of a column, so I

    29:08

    have to use single quotes instead, and then it is case sensitive. We'll come back to ClickHouse and Firebolt and SQL Server in a second, but again, I'm just trying to show you that even a basic thing like strings is going to be different across systems.

    29:26

    So there are basic string operations. If you want to do wildcard matches on a string, the SQL standard specifies the LIKE clause, and it doesn't follow the more common

    29:42

    regular expression syntax you might have seen before with like stars and dots and so forth. Uh the percent sign is meant meant me meant to represent any substring including empty strings and then if you want to put a match a single character you do underscore.

    So this is trying to get all of the students uh

    29:58

    that are enrolled in '15-' followed by a wildcard — all the 15-xxx courses — and you would use the LIKE clause for that. LIKE is case sensitive; if you want it to be case insensitive, you would use ILIKE.
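
    A sketch of that wildcard match (the '15-445'-style course IDs are assumed):

        SELECT * FROM enrolled AS e
         WHERE e.cid LIKE '15-%';    -- '%' matches any substring, '_' matches one character
        -- case-insensitive variant in Postgres/DuckDB: WHERE e.cid ILIKE '15-%'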

    Yes.

    30:14

    His statement is: are the field names and table names always case sensitive in SQL? Depends on the implementation. Postgres will automatically lowercase any table name you give it.

    Other systems don't do that. I have no idea what what the SQL

    30:29

    standard sets right. What's that?

    David is if you use uh in post if you use double quotes to name a column does that um will that give you escaped? I

    30:45

    don't know. Let's find out.

    Create table. Sorry.

    create table you want to use xxx like that

    31:00

    with int field um yeah all right so postgress took it uh and and maintain it so what if I can I do this can I do select star from uh Wi-Fi

    31:17

    sorry select star from xxx. Yeah, I didn't know about that.

    Yeah. So his point by default unless unless you quote it in postgress you you

    31:33

    will not get uh you you will not get the uh case sensitivity. All right.

    Right, so again, if you want regular expressions, the SQL standard specifies an operator called

    31:50

    SIMILAR TO. Postgres supports this, DuckDB supports this, many other systems do not; instead they have their own regular expression syntax. Again, a good example where the SQL standard says one thing and systems implement other things.

    32:05

    There's a bunch of built-in string functions you can use to manipulate strings — things like SUBSTRING, LOWER, UPPER, and so forth, right? And most systems have the same implementation of these functions; they roughly provide the same

    32:20

    semantics. Um to something really basic like concatenating string.

    This is where things go wrong as well, right? The SQL standard says specifies that you have a double uh double bar to say I want to concatenate two things.

    Uh in SQL

    32:36

    Server, it's the plus operator. And then in MySQL — I should test this again in the newer versions — they don't have either one.

    You have to run an explicit CONCAT function, right? Again, this is painful.

    It's

    32:54

    annoying, but it's not it's not the end of the world, right? You you can easily navigate this from going from one system to the next.
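
    For reference, the three concatenation spellings just mentioned:

        SELECT 'data' || 'base';         -- SQL standard (Postgres, SQLite, DuckDB, ...)
        SELECT 'data' + 'base';          -- SQL Server
        SELECT CONCAT('data', 'base');   -- MySQL (no || or + for strings by default)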

    But let's talk about dates and times, right? Obviously, we in our database.

    Yes. Question.

    33:18

    Yes. So question is when I say most systems don't don't do support it or don't support it do not do not support its uh so the SQL 92 would give you like the basic select select insert update

    33:35

    deletes — the things we're talking about today. So far everything we're talking about today is back in the SQL-92 standard, and they're off slightly on this one thing. There's no system in the world that I'm aware of that implements exactly the SQL-92 standard

    33:52

    as as as written. They're always going to be slightly off.

    MySQL used to be the worst offender. They've gotten better in the last 10 years.

    So

    34:09

    that standard. His statement is: when I say the SQL:2023 standard came out, who actually supports that? Again, nobody. So property graph queries — the only system I know

    34:24

    that supports that is Oracle, because Oracle was pushing to get that in there. I'm pretty sure Postgres doesn't support it.

    DuckDB doesn't support it either. The other commercial vendors, same thing.

    It's not to say they won't, just they haven't got to it yet. But it's like, will they implement it exactly as written?

    Probably not because

    34:39

    for whatever reason, right? Again, MySQL used to have all sorts of weird stuff.

    And then we had the guy that invented MySQL give a talk with us last semester, and I asked him at the beginning of the talk — it's not recorded — like, hey, what's up? Why did you do all this?

    Right. And he's like,

    34:55

    "Because we just did. It was the 90s. I don't know."

    This is what he said. Right?

    It's it's oftentimes it's many times it's like some guy sitting there like, "Oh, I'll do it this way." And then everyone's like, "All right, good." Right? We'll see this when we talk about lateral joins.

    Uh lateral joins was in

    35:11

    the standard. My understanding is the Microsoft guys didn't know it existed and they made their own called cross apply.

    It has basically the same semantics as a lateral join, but they don't call it a lateral join. They call it CROSS APPLY.

    Right? Again, there's so many implementations out there, nobody actually follows it.

    35:28

    All right, dates, times. This is where it gets rough.

    So the SQL standard specifies a basic date and time type. I think it also has times with time zones, right?

    By default, you get whatever the time zone is on the on the

    35:44

    data system where it's running when when you insert a record. Uh, if you're lucky, maybe UTC is a time zone always.

    Not always, not always the case. Um, if you explicitly want to know what the time zone is, you want to store that in the type with it.

    But again, not every system will support that. This is my

    36:00

    favorite demo to give every year. We're going to do something which seems like it should be really simple to do.

    We're going to compute the number of days from today, August 27th, 2025 to the beginning of the year, January 1st, 2025.

    36:15

    Right? It seems pretty pretty simple, but as we'll see, every system is gonna do something slightly different.

    How to get that answer? All right.

    So in um you see all right so we need to get

    36:32

    dates. All right.

    So how do we get the current date? Well, in the SQL standard there's a NOW() function, and that'll give you the current date and time as a timestamp — the full timestamp.
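
    For reference, the two spellings being tried across systems in the demo that follows; which one works depends on the system:

        SELECT NOW();               -- function form
        SELECT CURRENT_TIMESTAMP;   -- keyword form from the SQL standard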

    So now we go along to the different systems. This is MySQL.

    MySQL

    36:48

    has it. That's great.

    Go to SQLite. SQLite doesn't have it.

    Okay. Let's go to DuckDB.

    DuckDB follows the grammar from Postgres. They literally have a copy of the Postgres grammar and they're using that.

    So a lot of times, what Postgres has, DuckDB

    37:03

    will have. Let's go over to ClickHouse.

    ClickHouse has it. That's good.

    Let's go to Firebolt. Firebolt's got that.

    So that's good. Um and then last one is SQL Server.

    uh for reasons of the 1980s I don't

    37:19

    fully understand, you have to write GO after every query, because it treats it like a batch. There are other clients where you don't have to do this, but for the one that Microsoft gives you, you always have to write GO. All right, so they complain they don't have

    37:35

    NOW. All right, well, there's another way to do this in the SQL standard: you can also use something called CURRENT_TIMESTAMP. All right. So this is Postgres, except Postgres

    37:51

    doesn't have it. They don't have the function but they have the keyword.

    Let's go to MySQL. Sorry.

    38:07

    They have the function. They have the keyword.

    Let's go to SQLite. SQLite doesn't have the function but it has the keyword.

    DuckDB doesn't have the function but it has the keyword,

    38:22

    just like Postgres. ClickHouse has the function, does not have the keyword.

    Firebolt has the function

    38:38

    has the function does not have the keyword. And then uh SQL server I got to write go uh does not have the function but it has the keyword.

    All right. Well that kind of sucks

    38:54

    because, again, nobody has the same thing. So this turned out to be really easy to do in Postgres and in a bunch of systems.

    So what I can do is take the date as a string — so 2025,

    39:11

    the 27th of August — and the beginning of the year, cast these as dates, and just subtract them, and I get 238, which is actually correct. So what am I doing here?

    This cast function is taking the string and I'm

    39:26

    telling it I want to cast this data this this constant string I'm giving you cast it as a date. For whatever reason you got to put as oh you can't even see that.

    Sorry. Um, please say something next time.

    Make sure this looks good. All right,

    39:42

    there we go. And probably it's going to be make sure I get over here.

    Okay, there we go. All right, so again, I take the string, I call this cast function, I say I want as a date, right?

    So that's fine. Uh what I like about in

    40:01

    one example of an idiom in Postgres that I like a lot is that instead of calling that CAST function, I can put two colons after any piece of data and then the data type I want to cast it into. Right?

    So that's nice. So I can cast my strings to a date.
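
    A sketch of the Postgres version being typed here:

        SELECT CAST('2025-08-27' AS DATE) - CAST('2025-01-01' AS DATE);   -- 238
        SELECT '2025-08-27'::DATE - '2025-01-01'::DATE;                   -- same thing, Postgres '::' cast idiom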

    So

    40:16

    that works great. So let's pop over now to try to do the same thing in uh in my SQL.

    I get this 726 again just as a refresh your brain it's

    40:33

    there's no there's not 726 days in a year uh it's 2338 days since the beginning so what is this number I didn't know either some rando in YouTube in the YouTube comment of all

    40:49

    places posted what it was and like oh he was right um so what this is this is the first number seven is the current month of August 8 subtracting the uh January one you get seven. Then you get the same thing for the the number of days.

    So

    41:05

    what's happening here is that my SQL is casting the date as an unsigned integer like this. We're just mashing together the year, the month, the day.

    Uh and so it's really computing this,

    41:22

    right? That sucks.

    There's your 726. So, uh, turns out there's actually a date diff function in, uh, my SQL where I can easily get back 23 238 days.

    Okay. Well,

    41:38

    let's go back to SQL server or sorry, to Postgress. And of course, they don't have it either, right?

    All right. So, let's see how to do this in um in SQL Server.

    We'll come back to the other ones. SQL Server you can cast

    41:54

    things as a date. Got to write go, right?

    And I get a date like that. And then they actually have the date diff function.

    Here I go. And I get 238.

    But notice now I have this new thing here. I have this get date,

    42:12

    right? Where that's that's kind that's different.

    Didn't see that before. So let's see whether anybody has that.

    My SQL doesn't have it. doesn't have it.

    Right? So, you see where I'm getting at.

    All right. So, let's try this now in um in Clickhouse

    42:32

    actually to prove that this works in um prove it works in ductb and um in firebolt. So, let's go to ductb.

    I can do it the postgress way. Boom.

    I get the right answer. I can do this in firebolt.

    I do it the postgress way. I

    42:48

    get the right answer. All right.

    Right. So fireballd and postgress are are doing the same thing.

    So that's good. If you go back to our boy click house.

    So we try this before that produc that produced the right answer. That worked out pretty well.

    Um

    43:04

    chatbt gave me this right but it's complaining that it doesn't know about this today function. Okay.

    Well I've never seen that before. I said it was now.

    What is today?

    43:19

    Uh, they don't have it. Does my SQL have it?

    Nope. All right.

    Well, the documentation of ClickUp says they have it, but

    43:34

    it's actually case sensitive. It's lowercase today.

    Then you get that. Right.

    Right. Exactly.

    So before I think we did now, right? That's uppercase.

    Okay. Can I go now?

    43:51

    Yeah. So now can do uppercase lower case but today cannot.

    Awesome. All right.

    Last one. SQLite.

    So they don't have date diff. Uh so if I try to do it the

    44:08

    if I try to do it the postgress way, I get zero. That's not good.

    Um, so I actually came up with the same answer that chatbt came up with where you convert the

    44:24

    uh the timestamps into the Julian calendar which is the number of days since 452 BC or whatever like when I mean Julian Caesar converted everyone to Julian calendar in like 45 BC uh of what

    44:40

    the dating goes back uh even farther right so you can do something like and you get 238, but but it's a decimal. Um, so if you cast it as a you cast it as an integer,

    44:57

    you get 238. Another way to do this would be to convert this to the Unix epoch.

    If you know what that is, right? It's the number of seconds since the the dawn of Unix was like January 1st, 1970.

    So you can convert it to that. convert the

    45:14

    the current time stamp to that number of seconds that to the beginning of the year to the time stamps divided by 60 days or 60 seconds by 60 minutes times 24 hours and why is it integer because uh

    45:31

    because it's doing some kind of casting yeah which is that I don't I don't want to get the type conversion because this is a CMU and people get really uptight about it but yeah that's another problem All right. So the main take away from all this is like a really simple thing

    45:46

    like let's calculate dates and think things go wrong. All right.

    So um let me skip through output control or get through this real quickly. You can do order by if again SQL or rational model is unordered.

    If you care about order you can add order by clause. Uh since the relational model

    46:04

    is or SQL is also um it's unordered and uh if you care about the order, you'd add order by. Sometimes you maybe don't want all the results.

    You can add a limit clause or a fetch offset clause where you basically say I only want the first five rows or first uh 10 rows and

    46:20

    you can even offset it and then you can specify what to deal with when you have ties or two if two tupils have should be part of the same output. What do you do?

    The shorthand way is do to use a limit clause. Uh that's not in the SQL standard but many systems support that at least the duct DB and postgress do.

    46:36

    my SQL and then SQL server has their own syntax of saying top 10 uh in the front. Uh right so J long here so the as I said many times what you want to be able to do is put have a single SQL

    46:52

    query produce all the answer without having to go back for the application. So you can you can take the output of a SQL query and put it into an actual a table or a temp table and then do additional queries on that.

    So in the SQL standard you say select with the into clause and

    47:07

    then where it says course ids that's the name of the table I'm I'm going to insert into uh in postgress you can actually specify that it's a temporary table so that when you close the client the table gets blown away um but that's that's not in standard in all systems. All right so let's jump into the more

    47:23

    challenging things uh in the last half an hour. So nested queries CTE and um and window functions.

    So many times you're going to find yourself not be able to express what you want to compute within a single select statement and you

    47:40

    actually maybe use multiple select statements and as I said you could write with your inate result out to a temp table but often times you may want to do a select inside of your inside of an existing select and so these are called inner queries or nested queries and they can appear pretty much anywhere in a SQL

    47:56

    statement. So I can do something really simple like this where I have a select statement what we'll call the outer query uh the top part here and then inside my wear clause now I have another select statement called the inner query where I'm computing some answer that can then be used or or leveraged in the in

    48:12

    the outer query. Right?

    So this is an example putting in the wear clause. You can put it in the projection output.

    You can put it in the from clause. I think some systems let you put it in like the group by clause the having clause which is not a good idea but you could do that.

    And so we're not going to talk about how

    48:29

    to actually implement this or execute this efficiently just yet. But it's obviously most the dumbest thing to do would be treat this as like two nested for loops where say in the top one here I'm for every single student I'm going to then run that inner query in its entirety for you know to try to find a

    48:44

    match. My SQL used to do that.

    Most systems when they first doing start doing SQ queries they do that. And obviously that's very inefficient because the answer is not going to change from one tupole on the outer on the outer query to the next.

    So we'll talk about this later um after the midterm but ideally what you want to

    49:01

    be able to do is rewrite that query into a join because data systems know how to execute joins very efficiently and so you can take a if you have a nested query like that if you can convert it to a join it'll run much faster but not every system can do that post is actually very bad at it

    49:17

    the question the statement is there might be some cases where the subquery is actually faster um So this is getting the we but like you can imagine something like a computing aggregation like a a max ID if I could

    49:33

    run that once cache the result and reuse that then yeah that that would be fast but you but you almost never want to run I don't think you ever want to run the inner query over and over again it's like stupid question is what what system is actually

    49:50

    good at converting these things there's a system called out of uh Germany called Umbra. It's been written by one of the the the best database system researchers in the world.

    Uh and he has a new paper that came out this year. He can handle basically any any nested query he can

    50:05

    rewrite. Statement is well making more mainstream systems.

    So DuckDB does an earlier version of what he did uh because they just copied what he did because it was in the papers, right? Uh, but one of my

    50:20

    PJ students here came up with a came up with a a nested query that actually broke break ductb and broke this other system. It it won't work correctly.

    It it it it runs out of memory because it blows up. Uh because just think of like you just

    50:36

    you're just generating a bunch of results and the the balloons up. You run out of memory.

    So the newer version of this German system called Umbra which is now commercialized as Cedar DB that system actually supports it. This all I have to cut on the video.

    I can't see this. Yes, we buried the bodies and the

    50:51

    cops never said anything. Um, that's the way it is.

    All right. All right.

    So, uh, nested query. So, let's see.

    We want to get the student the names of the students that are in 445, right? And so, when you start doing homework one, you got to think about how you're going to construct these things.

    With nested queries, you want to sort of

    51:07

    maybe start with the outer query first and figure out what you what what you want the final output to be. And then oftentimes I'll just write maybe in like natural language what the inner inner query should be and I figure out how to sort of you know what that how to rewrite that integrated fit into what I want in the outer query.

    So if I want to

    51:24

    get all the names of the students that are in 445 I know I want to do select name from student with some kind of wear clause and my wear clause is going to say something like get this get the list of the people the set of people that are taking 445. So my inner query would look something like this.

    But now I need to mash the two together. And so this is

    51:40

    where I can add things like an in clause or other other u other operators in my expressions to specify what I how I want to link together the inner query with the outer query. So in this case here I'm saying where SID is in and then some

    51:56

    some nest query here. So this is doing set membership.

    I'll match on my outer uh my outer tables query or the outer query any record that has an SID that is in my set that is produced by the inner query. And so his point he was saying

    52:12

    before sometimes it's more efficient to run the the nested query in without doing a join. And obviously in this one it's pretty obvious that you want to you want to run that query once the inner query and be able to reuse the result over and over again.

    And a join would basically do that for you.

    52:28

    So notice also here too that like I'm now I'm referencing the the student ID in the inner query that's going to get bound to the enroll table inside the inner query and then the student ID is getting bound to the one on the outside. There are things like correlated subqueries where you can then you can match things on the inside with the

    52:44

    outside but we don't have to worry about that just yet. All right.

    So you can do the bunch of different uh operators to specify what the matching should be. My example I did before was was in is equipment to equals any but you can say like all have to

    53:00

    match uh at least one has to match or none has to match right you can do a bunch of uh these kind of operations all right the more complicated things let's find the student record that has the highest ID that is enrolled in at least one course right so the outer query I know I'm going to want to say I

    53:15

    want the student ID and their name and then the inner query is be something like is the highest enrolled student ID so my inner query could be like this right uh I want to get the max student ID if I'm enrolled and I want to match that with the student on on the outer query right and I just instead of

    53:31

    writing in equals any or equals all I'll just put in the end call clause like this this is one way to do this right you could do this with uh sorting the table and then fetching fetching the first row that would be the max I can put the the the max query the

    53:46

    aggregation inside the join clause right there's a bunch of different ways that you can write you write queries uh to produce the correct answer. I'm not saying one way is better than another.

    It's going to depend highly on what the system can actually support and run. In theory, SQL and it shouldn't

    54:03

    matter, but oftentimes it does for your for homework one. You're running on Doug DDB on your local machine.

    It there won't be a huge difference performance rewriting different things. At least there shouldn't be.

    All right, let's look at one more example. Find all the courses that have

    54:19

    no students enrolled in them. My outer query should select all the courses, and I want to match where there isn't a tuple for that course in the enrolled table.

    So I can do something really simple like this. I can say SELECT * FROM enrolled where the course ID in the

    54:35

    enrolled table matches the course ID from the course table on the outside here. And notice how this inner one here is being bound to the one on the outside.

    Right? So this is called a correlated subquery, where the result being

    54:51

    computed on the inside depends on whatever tuple you're looking at on the outside. Many systems can't convert that into joins, but now SQL Server, DuckDB, and the German systems CedarDB and Umbra can.
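
    A sketch of that correlated query, again with the assumed course(cid, name) and enrolled(sid, cid) tables; the reference to course.cid inside the inner query is what makes it correlated:

```sql
-- Find all courses with no students enrolled in them.
SELECT *
  FROM course
 WHERE NOT EXISTS (SELECT *
                     FROM enrolled
                    WHERE enrolled.cid = course.cid);  -- bound to the outer row
```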

    55:08

    All right. So when you do homework one, sometimes it'll be obvious that you want to use a CTE, which we'll cover in a second, and sometimes you want to do a nested query. But in general it's up to preference. We'll see CTEs in a second.

    55:26

    Lateral joins are a special kind of join that allows a query that's technically at the same level of nesting within one query to reference data in another query at that same level. So what do I mean by that?

    So think

    55:43

    think of it like two for loops, where I can now have a previously executed query within my same outer query be referenced by queries that come after it. So I said before that ideally in a declarative language like SQL you don't

    55:58

    want to specify what the order of the operations should be, but in something like a lateral join you do have to specify it, because this thing has to get computed before this other one gets computed. So this is a really toy example.

    Select star from a derived table where I have SELECT 1 AS x. So this is making a

    56:15

    synthetic virtual table that has one tuple with one value of one. And then, in my lateral operator, in this second query here, I'm now allowed to reference the nested query that came before it.
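
    A minimal sketch of that toy example, which systems that support LATERAL (Postgres, for instance) should accept:

```sql
-- t1 is a one-row, one-column "virtual" table.
-- The LATERAL subquery t2 is allowed to look sideways at t1.
SELECT *
  FROM (SELECT 1 AS x) AS t1,
       LATERAL (SELECT t1.x + 1 AS y) AS t2;
-- Result: one row with x = 1, y = 2.
```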

    56:30

    And it's roughly equivalent to producing the output of this this Python code like this. I'm seeing a lot of confused faces.

    Let's look at another example. So I want to calculate the number of students that enrolled in each course and get their average GPA and I want to

    56:47

    sort them based on the enrollment count in descending order. So the outer query again is going to be select star from course: get all the attributes for any course tuple.

    The

    57:02

    first lateral query is going to be, for each course, compute the number of enrolled students, and then the second lateral query is going to be, for each course, compute the average GPA of the enrolled students. So it would look roughly like this.
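
    Roughly what that looks like, hedging on the exact column names (assuming enrolled(sid, cid) and student(sid, gpa)):

```sql
SELECT *
  FROM course,
       -- For each course, count how many students are enrolled in it.
       LATERAL (SELECT COUNT(*) AS cnt
                  FROM enrolled
                 WHERE enrolled.cid = course.cid) AS t1,
       -- For each course, compute the average GPA of its enrolled students.
       LATERAL (SELECT AVG(s.gpa) AS avg_gpa
                  FROM student AS s
                  JOIN enrolled AS e ON s.sid = e.sid
                 WHERE e.cid = course.cid) AS t2
 ORDER BY cnt DESC;
```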

    57:18

    So in some versions of SQL you have to specify that it's a lateral join explicitly with a join operator. In this case here, in Postgres, you can just say I have a LATERAL and then you have a nested query, and that can do whatever it wants.

    So the first

    57:34

    one here, again, we're counting the number of students that are enrolled in each given course. Right?

    So in this case here the the course ID I'm referencing is going to come from the the the outer query.

    57:49

    The second one down here is where I'm computing the average GPA for every student that is enrolled in the course. Again, same thing here.

    Now, this course ID can come from this outer one here, or it could come from the one in the middle. It doesn't matter.

    They're actually both semantically correct. They would both produce the

    58:05

    same result, right? And you would get something like this.

    And I would sort again by the count in descending order. All right, let me quickly show you what this looks like.

    So, DuckDB, MySQL, Postgres, and Fireball

    58:22

    are all going to produce the same result for lateral joins, which is nice.

    Sorry, say it again. For that statement, could you do that with a normal join, for

    58:38

    that example? Yes.

    Actually, in the interest of time, let me get through CTEs, because we have the speaker today, and I can always show what this looks like afterwards. All right: CTEs, common table expressions.

    This is a newer construct in SQL. Think of this as like instead of

    58:56

    having two separate SELECT statements, one to create a temp table and then another to query that temp table, I can create a temp table within my query that gets populated and materialized and then gets thrown away when the query is over.

    Right? So the way this is specified, I

    59:14

    have my WITH clause at the top. I give the name of my CTE, sort of the name of the table, and

    whatever the columns are going to be as the output. Notice I don't have to put the types, because it's going to derive the types based on whatever is inside of the AS clause here.

    And then now down below

    59:31

    after the closing parenthesis, I can then reference that CTE as if it were a real table. Think of it like a macro: I can take whatever is in that WITH clause and inject it

    59:47

    as a nested query here. [Student question] Can you parameterize it?

    His question is: can you parameterize it? What do you mean by that?

    Oh, yeah. So, the question

    00:03

    is, can you pass in a value that causes the... I don't think you can.

    00:20

    We can take this offline. There is research where you can take arbitrary procedural code, like Python or whatever, you know, Pascal; people write these stored procedures or functions in this language called PL/SQL.

    There's ways to convert automatically

    00:35

    convert procedural code into CTEs, which they do. And it does look like this.

    But now you're actually maintaining a state table of what the input is, to basically mimic the same thing. You don't need to know that for this class.

    Okay.

    00:52

    All right. So quick example.

    Now I'm going to find the student record with the highest ID that is enrolled in at least one course. You saw this before.

    Instead of doing a nested query, I now have my CTE that I can do a join against, and so the database system could be smart enough to compute the CTE once and

    01:08

    then fill in the value and use that later on. It's nicer syntax than using nested queries, because you declare them up front and then you can reference them down below, whereas nested queries are

    01:24

    kind of embedded, which makes the SQL quite large. [Student question] Yes, nested queries cannot reference things inline like that, correct. But sometimes this is okay.
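
    A sketch of the CTE version of the earlier "highest enrolled student ID" example, under the same assumed schema:

```sql
-- Declare the derived table up front, then join against it below.
WITH cte_max (max_sid) AS (
    SELECT MAX(sid) FROM enrolled
)
SELECT s.sid, s.name
  FROM student AS s
  JOIN cte_max ON s.sid = cte_max.max_sid;
```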

    01:41

    All right, the last thing I want to cover is window functions. With all the aggregations and things we talked about before, you can't easily keep track of how things are being processed, the order that you're

    01:57

    looking at them in. So you can't compute things that require a notion of ordering, because again SQL is unordered. But sometimes you want to do things like a moving average. Think of a time-series data set where I have different timestamps throughout the day.

    I'm

    02:13

    recording the temperature. So I want to say: what was the moving average of the temperature at this given time? Stock prices would be another good example of this.

    But with window functions, you can now do this um and now generate uh

    02:28

    multiple aggregation outputs as you logically scan along the data in your SQL query. So it's sort of like an aggregation, but you're not grouping everything to produce a single output.

    You could have multiple outputs for every for every single tuple. So the

    02:44

    way this works is you specify, in your projection list, a function name. There are the basic aggregation functions plus additional functions that are specific to window functions, and then you have this OVER clause to specify how you want to split things up, or slice things up, as you go along.

    03:01

    So the aggregations could be anything we talked about before; again, if I want to compute the moving average, the min, the max, the count of some window as I scan. And then the special functions are ROW_NUMBER and RANK. So ROW_NUMBER would be, like, what offset you are at in

    03:17

    a group as you slice it up, and RANK would be the position according to some sort order. Right?

    So my query is SELECT star FROM the enrolled table, and I'm going to compute the row number without doing any slicing or any

    03:33

    group bys. And so now you would see that I'm introducing this new column here, called row number, that keeps track of where that tuple's position is within that group, the window.
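
    A sketch of that first window query, with no partitioning at all (same assumed enrolled table):

```sql
-- Adds a row_number column that just numbers the tuples 1, 2, 3, ...
SELECT *, ROW_NUMBER() OVER () AS row_num
  FROM enrolled;
```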

    03:49

    If I want to do something like a group by again, they have a PARTITION BY keyword. So now I'm going to get the course ID and the student ID for everyone enrolled in a course.

    And I want to know what the row number is as as I partition them by the course ID, right? And so again, just

    04:05

    like the GROUP BY clause, it rolls things up into groups when I produce my output. Same thing here.

    Now you can see the row number gets reset back to one every time I enter a new partition or enter a new group. I can sort things within my group as well, right?

    So, in the OVER

    04:21

    clause, I can specify an ORDER BY, and then, in the interest of time, I'll pass over the rest; you can do more complicated things in there.
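
    A sketch of the partitioned versions being described; the second one also sorts within each partition (ordering by sid here just to illustrate ORDER BY inside OVER):

```sql
-- Row number restarts at 1 for each course.
SELECT cid, sid,
       ROW_NUMBER() OVER (PARTITION BY cid) AS row_num
  FROM enrolled;

-- Same thing, but number the rows within each course by descending student ID.
SELECT cid, sid,
       ROW_NUMBER() OVER (PARTITION BY cid ORDER BY sid DESC) AS row_num
  FROM enrolled;
```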

    All right, so to finish up again, SQL's very important. It's not going away.

    04:37

    There's this survey the IEEE does every year about what are the most important languages people need to learn. SQL is now always number one.

    Python's number two. Uh number three is Java.

    Right? So again, as I said, for the rest

    04:54

    of your life, you're going to come across SQL, and it's really important to at least know the basics of it. So homework one and this class today will provide the foundation that you can then expand upon in your future careers.

    All right, so homework one is going out today. Uh again doing basic data

    05:11

    analysis using DuckDB. You can run everything locally and test it, and then you submit to Gradescope, and your output has to match whatever the expected output is on Gradescope. Okay.

    Should you use an LLM? Yes.

    Try it. See how far you can get.

    05:30

    But you need to understand what the thing is actually spitting out, because the exam might ask you some basic SQL questions. Okay.

    All right. So again, next class I won't be here.

    I'll be in London. Monday's a holiday, so there's

    no class for anybody. And then

    05:47

    Wednesday we'll post it on YouTube. Okay.

    So, new classroom. So, this is Drew.

    He's at DBT. And as I said before, this is arguably one of the most important

    06:03

    applications in the data space right now. It's widely used everywhere.

    And it's super important. All right.

    So, Drew, sorry for being late. Sorry for the technical mess up.

    The floor is yours. Go for it.

    Great, Andy. Thanks for having me on.

    06:19

    Um, I appreciate the kind words. My name's Drew Bannon, one of the co-founders at DBT Labs.

    Uh, we do make a product called DBT. DBT is definitely not a database, definitely not a data warehouse.

    Uh, but we help companies use their data warehouses and make sense of

    06:34

    the data, you know, in in the database. Um, I think I've got about 10 minutes with you today.

    So this is a sort of modified version of an onboarding session that I do for new employees, just talking about what dbt is and why it matters and how it fits into a broader ecosystem. So I've got an app and a

    06:49

    pencil. Let's sketch it out.

    Um so this is the setup that you'll see in a lot of modern data oriented companies. They've got data coming from data sources.

    These are transactional databases like Postgres, MySQL. They've got advertising data and payments data,

    07:05

    finance, telemetry, sales, customer support, you name it. What they want to do is extract and load all this data into a centralized place, a data warehouse or data lake.

    And so practically what happens is they've got these extract-and-load processes that are creating, you

    07:20

    know, hundreds or thousands of tables loaded by different teams and different departments. Sometimes duplicated, sometimes stale, but you just get a bunch of tables in a data warehouse.

    On the other side of the data warehouse, we have the consumers. So there's BI and data science dashboards or notebooks or

    07:38

    scripts that are querying these tables. You've got your ML and AI use cases,

    pulling from maybe the raw data directly, and they've got operational stuff like wanting to send emails to people who left, you know, items in the shopping cart on an e-commerce

    07:54

    website. So that's querying some more tables, too.

    And so the net result is that there's a giant mess here. There's no semblance of data lineage, meaning that you can't actually see where the data is coming from that's powering this BI report, right?

    You just have to trust that it works and and usually it's wrong in some way. Um, there's no

    08:11

    documentation. So, it's hard to find things like data assets that already exist.

    It's hard to reuse someone else's work. You frequently have to rebuild things from scratch.

    Um, there's no semblance of a deployment process here. You just edit stuff in prod.

    You edit the dashboard in prod. You change the tables in prod and hope you don't break

    08:27

    anything for anyone. There's no sense of QA, which means things are broken all the time.

    Quality assurance. And fundamentally, you get duplicated business logic for each of your different use cases or dashboards.

    And so here, imagine recalculating what it means to be a

    08:43

    customer a hundred times over; that leads to inconsistencies. And so the alternative approach here is the one that we take with DBT, and the big idea is, as before, we want to extract and load data into a

    08:58

    data warehouse. Uh but this time we'll draw this imaginary line down the middle of the database.

    Let's say the left hand side is for raw data and that's where we oops extract and load you know our raw tables loaded from our data sources and

    09:14

    then we have these blue pipelines here that I drew. So this is actually what dbt does.

    It takes your raw data sets and helps you create derived data sets that are you know enriched or apply your business logic to the raw data. The net result is that you get these higher

    09:29

    levels of abstraction, and these tables that you can query. Basically, the data set you expose to the business represents the terminology that folks in the business use, and not the raw data coming out of Salesforce or coming out of, you know, Postgres. So it's a sort of translation

    09:46

    or transformation layer. The benefit of doing it with DBT is that you can source control your business logic as SQL code.

    You can visualize your data lineage so you can see how data flows from raw tables through data models into dashboards.

    10:03

    You can use an actual CI process, so you can develop code in dev and not in prod. And you can find out that your data is broken before the CEO hears about it, which is always the better time to find out that your data's broken.

    And finally, you know, you can reuse assets.

    10:19

    We built this transformed table. Other colleagues across the business can build on top of it, and

    we can amplify each other's impact. So I want to show you what this looks like a little bit in dbt.

    The big idea here is that, you know, we're looking at a dbt model called stage GitHub stars,

    10:35

    in this case version controlled in GitHub.

    And so stage GitHub stars is the name of our table here. STG it's like a staging model.

    This is like going from your raw data to kind of one layer of preparation. You can rename some columns for clarity.

    Uh filter out

    10:51

    invalid or soft-deleted records and things like that, and then, fundamentally, our model code is just SQL.

    So it's select star from, you know, some source, and then we've got a little window function action here to deduplicate records by, you know, the first

    11:09

    star update. The only difference between this and pure SQL is that dbt has a built-in templating language called Jinja, and this allows us to do pretty cool things; specifically, it allows us to represent edges between different nodes in a dependency graph.

    So here we're saying

    11:25

    stage GitHub stars, oops, stage GitHub stars depends on the raw source GitHub stargazers, in our case loaded by Fivetran. So that's one thing, but you can also do things like write for loops and iterate over tables and union a bunch of tables

    11:40

    together with divergent schemas. There's really a lot you can do with this template language, but fundamentally what we're doing here is defining this logic in SQL that gets version controlled.
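
    As a rough sketch (not the actual model from the slide), a dbt staging model is just a SELECT plus Jinja. The source name, column names, and file path below are made up for illustration; {{ source() }} is the Jinja call dbt provides for declaring the edge from a raw table to this model:

```sql
-- models/staging/stg_github_stars.sql  (hypothetical path and columns)
WITH ranked AS (
    SELECT
        id          AS star_id,                          -- rename raw columns for clarity
        user_id,
        starred_at,
        ROW_NUMBER() OVER (PARTITION BY id
                           ORDER BY starred_at) AS rn    -- window function to deduplicate
    FROM {{ source('github', 'stargazers') }}            -- Jinja: edge to the raw loaded table
)
SELECT star_id, user_id, starred_at
  FROM ranked
 WHERE rn = 1
```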

    The net is that you get a sort of dependency graph that looks like this in dbt. So here's just a sample of an

    11:58

    overall like much bigger dbt dependency graph. And every single one of these nodes is either a data source, you know, in this case these are sources, or it's a sort of intermediate transformation like we see here.

    So this is us breaking a lot of business logic into smaller

    12:14

    pieces, kind of like you would with functions in another programming language. But each of these nodes in the graph is itself a table.

    And so there's some really cool stuff you can do with dbt like testing your data sets or documenting your data sets. Um so we can create assertions

    12:30

    that say, you know, the invoice ID of this table should always be unique and not null, or something like that. And now we can find out if our assumptions about the data set don't hold, again before this does.
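
    Under the hood, a dbt test is essentially a query that returns the rows violating an assertion; dbt generates this kind of SQL from a short YAML config rather than you writing it by hand. The table and column names here are hypothetical:

```sql
-- Rows that violate "not null": the test fails if this returns anything.
SELECT *
  FROM invoices
 WHERE invoice_id IS NULL;

-- Rows that violate "unique": again, any result means the assumption broke.
SELECT invoice_id, COUNT(*) AS n
  FROM invoices
 GROUP BY invoice_id
HAVING COUNT(*) > 1;
```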

    Finally we output a sort of dimensional model here.

    12:46

    This is dim strike customers, and that's what we want to actually expose to the BI tools or data analysts in the company who are querying these data sets and trying to answer questions about the business. And I can guarantee you it is a much nicer experience for that person to query this table, which has all the

    13:02

    information that they need, flowing in from 40 different source tables, than it would be to go query those source tables directly, because we did all this hard work along the way to translate from the raw data sets into sort of the language of the business.

    13:19

    That was the six or seven minute answer to what exactly dbt is. If you take anything away, it's: transform data in the data warehouse, and do it like a software engineer would write code.

    Okay, awesome. Any questions for Drew?

    13:35

    Drew? Yes.

    Yes. Uh, yeah.

    So, you mentioned that you write your uh data pipelines in SQL. Do you have like a an efficient execution engine that can run this efficiently over like big amounts of data?

    13:50

    The question is, Drew, you mentioned that people define these DAGs through SQL.

    Do you guys have an execution engine that can efficiently execute them? Great question. We do, and it's called Snowflake, or Databricks, or BigQuery, or Redshift, and so on and so on. So dbt,

    14:07

    again, is not a data warehouse itself. What we do is we run queries on the customer's data warehouse.

    Which, for us, I think the biggest four are Snowflake, Redshift, BigQuery, Databricks. And so our customers really

    14:22

    like that because they did all this work to get their data into their data warehouse. It's in their like single source of truth.

    They want the data to stay there. So they like that DBT will connect to their Snowflake account, their BigQuery project, and run the queries on the data in situ, where it lives.

    A large percentage of

    14:41

    all queries that run on, like, Snowflake, BigQuery, and so on comes from DBT. It's massive.

    [Question from the audience: is it like a data lineage marketplace tool?] Drew, if you want...

    15:00

    Well, what I would say is, we can do two things, and maybe this diagram is confusing, but actually DBT's job is to create these tables, any of these nodes.

    It's a table, or it could be a view. So, DBT actually creates these based on the version-controlled

    15:15

    definition in your DBT project. But we can also visualize it for you, so that you can trace, you know, where does the data come from that ultimately lands in the dim strike customers table.

    Well, you can trace it all the way back to these data sources and some other stuff over here that's not shown. Uh, but DBT's primary job is to actually

    15:32

    build those data sets in dependency order. We just so happen to also visualize it for you, as like a debugging aid or a data catalog.

    Cool. All right.

    Awesome. Let's thank Drew.

    No, no, no. Go for it.

    Go for it.

    15:48

    Uh, yeah. If you were to look at a typical dbt model, basically, I showed this before.

    What we'll do is wrap your select statement in, you know, CREATE TABLE AS blah blah blah. And so this is how we push down the

    16:03

    execution of the logic. The user's job is to define the business logic in a SQL SELECT statement.

    DBT's job is to materialize that logic as tables for the user, incrementally update tables, things like that. So that's how this all fits together.
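
    A minimal sketch of that wrapping, with hypothetical schema and table names; the exact DDL dbt emits depends on the chosen materialization (table, view, incremental), but for a table it looks roughly like this:

```sql
-- The user writes only the inner SELECT; dbt wraps it to materialize the model.
CREATE TABLE analytics.dim_customers AS
SELECT customer_id,
       MIN(order_date) AS first_order_date,
       COUNT(*)        AS lifetime_orders
  FROM analytics.stg_orders
 GROUP BY customer_id;
```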

    All right. So you may not all realize

    16:19

    the significance of what DBT is, because you don't have any data, right? At some point in your life, you're going to have a bunch of Python scripts trying to mash this together, right?

    And then you realize this is terrible. How are you ever going to maintain this?

    This is the problem that DBT solves, right? That's why I

    16:34

    keep saying it's a big deal. It's widely used because it is a way to declare, here's my pipeline for my data projects, whereas before it was just a bunch of random stuff people would have in GitHub.

    Okay. All right.

    Let's thank Drew. [Music] [Applause]

    16:50

    All right. Thanks, man.

    I appreciate it. All right.

    Again, Monday is the holiday, no classes, and then Wednesday I'm in London. Uh, that'll be on YouTube.

    And then please get started on Project Zero and the homework. Okay.

    Thank you. Hit it.

    17:06

    [Music] over

    17:21

    [Music] the fortune. Get the maintain flow with the grain.

    Get the fame. Maintain flow with the grain.