00:00
Hey friends, so today we are diving into something very exciting: building together a modern SQL data warehouse project. But this is not just any project, this one is special. Not only will you learn how to build a modern data warehouse from scratch, you
00:15
will also learn how I implement this kind of project in real-world companies. I'm Baraa Salkini, I have built more than five successful data warehouse projects in different companies, and right now I'm leading big data and BI projects at Mercedes-Benz. So that's me, and I'm sharing
00:30
with you real skills and real knowledge from complex projects. Here's what you will get out of this project: as a data architect, we will design a modern data architecture following best practices; as a data engineer, you will write the code to clean,
00:46
transform, load, and prepare the data for analysis; and as a data modeler, you will learn the basics of data modeling, and we will create a new data model for analytics from scratch. And my friends, by the end of this project you will have a professional portfolio project to
01:03
showcase your new skills, for example on LinkedIn. So feel free to take the project, modify it, and share it with others, but it would mean the world to me if you share my content. And guess what, everything is free, so there are no hidden costs at all. In
01:18
this project we will be using SQL Server, but if you prefer other databases like MySQL or PostgreSQL, don't worry, you can follow along just fine. All right my friends, so now, if you want to do data analytics projects using SQL,
01:35
there are three different types. The first type of project you can do is data warehousing: it's all about how to organize, structure, and prepare your data for data analysis, and it is the foundation of any data analytics project. In the next step you can do exploratory
01:50
data analysis (EDA), where all you have to do is understand and uncover insights about your data sets. In this kind of project you learn how to ask the right questions and how to find the answers using just basic SQL skills. Then we move on to the last stage,
02:06
where you can do advanced analytics projects. There you use advanced SQL techniques to answer business questions, like finding trends over time, comparing performance, segmenting your data into different sections, and generating reports for your stakeholders. So here you will
02:22
be solving real business questions using advanced SQL techniques. Now, we're going to start with the first type of project, SQL data warehousing, where you will gain the following skills: first, you will learn how to do ETL/ELT processing using SQL in order to prepare the data; you will
02:38
also learn how to build a data architecture, how to do data integration where we merge multiple sources together, and how to do data loading and data modeling. So if I got you interested, grab your coffee and let's jump into the
02:53
project. All right my friends, so now, before we dive deep into the tools and the cool stuff, we first have to get a good understanding of what exactly a data warehouse is and why companies try to build such a data management system. So the question is: what is a data warehouse? I will just use the definition
03:10
from the father of the data warehouse, Bill Inmon: a data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data designed to support management's decision-making process. Okay, I know that might be confusing. Subject-oriented
03:25
means the data warehouse is always focused on a business area, like sales, customers, finance, and so on. Integrated, because it integrates multiple source systems; usually you build a warehouse not just for one source but for multiple sources. Time-variant means you can keep
03:42
historical data inside the data warehouse. Non-volatile means that once data enters the data warehouse, it is not deleted or modified. So this is how Bill Inmon defined the data warehouse. Okay, so now I'm going to show you a scenario where your company doesn't have real data management. Let's say that
03:58
you have one system, and one data analyst has to go to this system and start collecting and extracting the data. Then he's going to spend days and sometimes weeks transforming the raw data into something meaningful, and once he has the report he's going to share it,
04:14
and this data analyst is sharing the report using Excel. Then you have another source of data and another data analyst doing maybe the same steps: collecting the data, spending a lot of time transforming the data, and then sharing a report at the end, and this time she is sharing the
04:30
data using PowerPoint. And a third system, same story, but this time the report is shared using maybe Power BI. Now, if the company works like this, there are a lot of issues. First, this process takes way too long; I saw a lot of scenarios where it takes
04:46
weeks and even months until the employees have manually generated those reports. And of course, what happens for the users? They are consuming multiple reports with multiple states of the data: one report is 40 days old, another one 10 days, and a third one is maybe 5 days old, so
05:02
it's going to be really hard to make a real decision based on this setup. A manual process is always slow and stressful, and the more employees you involve in the process, the more you open the door for human errors, and errors in reports of course lead to bad decisions. Another issue is
05:19
handling big data: if one of your sources generates a massive amount of data, the data analyst is going to struggle to collect it, and in some scenarios it will no longer be possible to get the data at all, so the whole process can break and you cannot generate fresh data anymore for
05:35
specific reports. And one last, very big issue: if one of your stakeholders asks for an integrated report from multiple sources, well, good luck with that, because merging all that data manually is chaotic, time-consuming, and full of risk. So this is the
05:52
picture if a company is working without proper data management, without a data lake, data warehouse, or data lakehouse. In order to make real, good decisions you need data management. So now let's talk about the scenario with a data warehouse. The first thing that
06:08
changes is that you will not have your data team collecting the data manually; you're going to have a very important component called ETL. ETL stands for extract, transform, and load: it is a process you run in order to extract the data from the sources, then apply
06:24
multiple transformations on those sources, and at the end load the data into the data warehouse. This becomes the single point of truth for analytics and reporting, and it is called the data warehouse. So now all your reports are going to consume
06:40
this single point of truth. With that you can create your multiple reports, and you can also create integrated reports from multiple sources, not only from one single source. Now, looking at the right side, it already looks organized, right? And the whole process is
06:55
completely automated: there are no more manual steps, which of course reduces human error, and it is also pretty fast. Usually you can load the data from the sources all the way to the reports in a matter of hours or sometimes minutes, so there is no need to wait weeks
07:12
and months in order to refresh anything. And of course the big advantage is that the data warehouse itself is completely integrated, which means it brings all those sources together in one place, and that makes reporting much easier. And not only
07:27
integration: you can also build history in the data warehouse, so we now have the possibility to access historical data. What is also amazing is that all those reports have the same data status, maybe one day old
07:43
or so. And of course, if you have a modern data warehouse on a cloud platform, you can easily handle any big data source, so no need to panic if one of your sources is delivering a massive amount of data. And of course, in order to build the data warehouse you need different types of developers. Usually
08:00
the one who builds the ETL component and the data warehouse is the data engineer: they are the ones accessing the sources, scripting the ETLs, and building the database for the data warehouse. For the other part, the one responsible is the
08:16
data analyst: they are the ones consuming the data warehouse, building different data models and reports, and sharing them with the stakeholders. So they usually contact the stakeholders, understand the requirements, and build multiple reports on top of
08:31
the data warehouse. Now, if you look at those two scenarios, this is exactly why we need data management: your data team is not wasting time fighting with the data, they are more organized and more focused, and with a data warehouse you are delivering
08:47
professional, fresh reports that your company can count on in order to make good, fast decisions. So this is why you need data management like a data warehouse. Think about a data warehouse as a busy restaurant: every day different suppliers bring in fresh ingredients,
09:04
vegetables, spices, meat, you name it. They don't just use them immediately and throw everything in one pot, right? They clean them, chop them, organize everything, and store each ingredient in the right place, fridge or freezer. This is the preparation phase, and when an order comes
09:20
in, they quickly grab the prepared ingredients, create a perfect dish, and serve it to the customers of the restaurant. This process is exactly like the data warehouse process: it is like the kitchen, where the raw ingredients, your data, are cleaned, sorted,
09:35
and stored, and when you need a report or analysis, it is ready to serve up exactly as you need it. Okay, so now we're going to zoom in and focus on the ETL component. If you are building such a project, you're going
09:50
to spend almost 90% of the time just building this component, the ETL, so it is the core element of the data warehouse, and I want you to have a clear understanding of what exactly an ETL is. Our data exists in a source system, and what we want to do
10:05
is get our data from the source and move it to the target; source and target could be, for example, database tables. Now, the first step is to specify which data we have to load from the source. Of course we can say that we want to load everything, but let's say we are doing incremental
10:22
loads, so we're going to specify a subset of the data from the source in order to prepare it and load it later to the target. This step in the ETL process is called extract: we are just identifying the data that we need, we pull it out, and we don't change anything;
10:37
it's going to be one-to-one with the source system. So the extract has only one task: identify the data that you have to pull out from the source, and do not change anything. We will not manipulate the data at all; it stays as it is. This is the first step in
10:52
the ETL process, the extract. Now moving on to stage number two: we're going to take this extracted data and do some manipulations, transformations, changing the shape of the data. This step involves a lot of heavy lifting: we can
11:07
do a lot of things like data cleansing, data integration, formatting, and data normalization. So this is the second step in the ETL process, the transformation: we take the original data and reshape it,
11:23
into exactly the format and shape that we need for analysis and reporting. Finally we get to the last step in the ETL process, the load. In this step we take this new data and insert it into the target. It is very simple: we're going
11:40
to take the prepared data from the transformation step and move it into its final destination, the target, for example the data warehouse. So that's ETL in a nutshell: first extract the raw data, then transform it into something meaningful, and finally
11:55
load it into a target where it's going to make a difference. So that's it, this is what we mean by the ETL process.
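To make those three steps concrete, here is a minimal T-SQL sketch of one ETL step. The table and column names are made up for illustration, they are not part of this project:

```sql
-- Extract: identify and pull out only the data we need, one-to-one with the source.
SELECT order_id, customer_name, order_date, amount
INTO #extracted_orders
FROM source_db.dbo.orders
WHERE order_date >= '2024-01-01';   -- e.g. an incremental subset

-- Transform + Load: reshape the extracted data and insert it into the target.
INSERT INTO dwh.dbo.fact_orders (order_id, customer_name, order_year, amount)
SELECT
    order_id,
    TRIM(customer_name),              -- cleansing: remove unwanted spaces
    YEAR(order_date),                 -- derived column
    CAST(amount AS DECIMAL(10, 2))    -- cast the data type
FROM #extracted_orders;
```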
12:11
Now, in real projects we don't just have one source and one target: our data architecture is going to have multiple layers, depending on your design, whether you are building a data warehouse, a data lake, or a data lakehouse. And there are different ways to move the data between all those layers; to load the data from one layer to another you can use the ETL process in multiple ways. For example, when loading the data from the source to layer
12:27
number one, you might only extract the data from the source and load it directly into layer number one, without doing any transformations, because you want to see the data as it is in the first layer. Then between layer number one and layer number two you might use the full ETL: we extract
12:44
from layer one, transform it, and then load it to layer number two, so there we are using the whole ETL process. Between layer two and layer three we can do only transformation and load: we don't have to deal with extracting the data, because it is
12:59
maybe using the same technology and we are taking all the data from layer two to layer three, so we transform the whole of layer two and then load it to layer three. And between three and four you can use only the L, maybe something like duplicating and replicating the data, and only afterwards
13:16
doing the transformation: you load into the new layer and then transform it. Of course this is not a real scenario, I'm just showing you that in order to move from a source to a target you don't always have to use a complete ETL; depending on the design of your data architecture, you
13:31
might use only a few components of the ETL. Okay, so this is how ETL looks in real projects. Okay, so now I would like to show you an overview of the different techniques and methods in ETL. We have a wide range of possibilities, and you have to make decisions about
13:46
which ones you want to apply to your project. Let's start with the extraction. The first thing I want to show you is that we have different methods of extraction: either you go to the source system and pull the data from the source, or the source system pushes the data to the data warehouse.
14:02
Those are the two main methods for extracting data. Then within extraction we have two types: a full extraction, where we take everything, all the records from the tables, and every day we load all the data to the data warehouse; or we do something smarter, an incremental extraction,
14:19
where every day we identify only the new and changed data, so we don't have to load the whole thing, only the new data: we extract it and then load it into the data warehouse.
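A minimal sketch of the difference between those two types, assuming a hypothetical source table with a last-modified timestamp column:

```sql
-- Full extraction: pull every record, every run.
SELECT *
FROM source_db.dbo.sales_orders;

-- Incremental extraction: pull only records created or changed since the
-- last successful load (@last_load_date would normally come from a
-- load-control table maintained by the ETL).
DECLARE @last_load_date DATETIME = '2024-06-01';

SELECT *
FROM source_db.dbo.sales_orders
WHERE last_modified_date > @last_load_date;
```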
14:35
Now, for data extraction we also have different techniques. The first one is manual, where someone has to access the source system and extract the data by hand. Or we connect to a database and use a query in order to extract the data. Or we have a file that we have to parse into the data warehouse. Another technique is to connect to an API and make calls in order to extract the data. Or, if the data is
14:52
available as a stream, like in Kafka, we can do event-based streaming in order to extract the data. Another way is to use change data capture (CDC), which is something very similar to streaming. And another way is web scraping, where you have code that runs
15:08
and extracts all the information from the web. So those are the different techniques and types we have for extraction. Now, talking about the transformation, there is a wide range of different transformations we can do on our data, for example data enrichment, where we add values to our
15:25
data sets. Or we do a data integration, where we have multiple sources and bring everything into one data model. Or we derive new columns based on already existing ones. Another type of data transformation is data normalization: the sources have values that are like codes, and you map
15:42
them to friendlier values for the analysts, which are easier to understand and to use. Another transformation is business rules and logic: depending on the business, you can define different criteria in order to build new columns. Also belonging to transformations is data
15:59
aggregation, where we aggregate the data to a different granularity. And then we have a type of transformation called data cleansing. There are many different ways to clean our data, for example removing duplicates, data filtering, handling missing data,
16:14
handling invalid values, removing unwanted spaces, casting the data types, detecting outliers, and many more. So we have different kinds of data cleansing that we can do in our data warehouse, and this is a very important transformation.
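As a small sketch of what several of those cleansing steps can look like in SQL (the table and columns are hypothetical, and TRIM assumes SQL Server 2017 or later):

```sql
-- Deduplicate, trim spaces, handle missing values, and cast types in one query.
WITH ranked AS (
    SELECT
        customer_id,
        TRIM(first_name)           AS first_name,   -- remove unwanted spaces
        ISNULL(country, 'n/a')     AS country,      -- handle missing data
        CAST(birth_date AS DATE)   AS birth_date,   -- cast the data type
        ROW_NUMBER() OVER (
            PARTITION BY customer_id
            ORDER BY create_date DESC
        ) AS rn                                     -- rank duplicates, keep the latest
    FROM bronze.crm_customers
)
SELECT customer_id, first_name, country, birth_date
FROM ranked
WHERE rn = 1;                                       -- remove the duplicates
```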
16:30
So as you can see, there are many different types of transformations that we can do in our data warehouse. Now moving on to the load: what do we have here? We have different processing types, either batch processing or stream processing. Batch processing means we are loading the data
16:45
warehouse in one big batch of data that runs and loads the data warehouse, so it is a one-time job in order to refresh the content of the data warehouse and the reports; that means we schedule the data warehouse to be loaded once or twice a day. The other type is
17:02
stream processing. This means that if there is a change in the source system, we process this change as soon as possible, pushing it through all the layers of the data warehouse as soon as something changes in the source system. So we are streaming the data in order to have a real-
17:18
time data warehouse, which is a very challenging thing to do in data warehousing. And talking about the loads, we have two methods: either we do a full load or an incremental load, the same idea as with extraction, right? For the full load in databases there are different methods for how
17:33
to do it. For example, we truncate and then insert: that means we make the table completely empty and then insert everything from scratch. Or we do an update-insert, which we call an upsert: we update the existing records and then insert the new
17:49
ones. Another way is drop, create, and insert: we drop the whole table, create it again from scratch, and then insert the data. It is very similar to the truncate method, but here we are also removing and dropping the whole table. So those are the different methods for full loads.
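Here is a rough sketch of the first and third of those full-load variants in T-SQL, with hypothetical table names (DROP TABLE IF EXISTS assumes SQL Server 2016 or later):

```sql
-- 1) Truncate & insert: empty the table but keep its definition, then reload everything.
TRUNCATE TABLE silver.products;
INSERT INTO silver.products (product_id, product_name, category)
SELECT product_id, product_name, category
FROM bronze.products;

-- 2) Drop, create & insert: remove the table entirely, recreate it, then reload.
DROP TABLE IF EXISTS silver.products;
CREATE TABLE silver.products (
    product_id   INT,
    product_name NVARCHAR(100),
    category     NVARCHAR(50)
);
INSERT INTO silver.products (product_id, product_name, category)
SELECT product_id, product_name, category
FROM bronze.products;
```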
18:05
For the incremental load we can also use upserts, so update and insert: we run update or insert statements against our tables. Or, if the source is something like a log, we can do inserts only, always appending the data to the table without having to update anything.
18:20
Another way to do an incremental load is a merge, which is very similar to the upsert but also includes a delete: update, insert, delete. So those are the different methods for loading the data into your tables.
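A sketch of that last method, using SQL Server's MERGE statement against hypothetical tables:

```sql
-- Upsert with delete: update matching rows, insert new ones,
-- and delete rows that no longer exist in the source.
MERGE silver.customers AS tgt
USING bronze.crm_customers AS src
    ON tgt.customer_id = src.customer_id
WHEN MATCHED THEN
    UPDATE SET tgt.first_name = src.first_name,
               tgt.country    = src.country
WHEN NOT MATCHED BY TARGET THEN
    INSERT (customer_id, first_name, country)
    VALUES (src.customer_id, src.first_name, src.country)
WHEN NOT MATCHED BY SOURCE THEN
    DELETE;
```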
18:36
One more thing: in data warehousing we have something called slowly changing dimensions (SCD). Here it's all about the historization of your table, and there are different ways to handle history in your tables. The first type is SCD 0: there is no historization and nothing should be changed at all, which means you are not going to update anything. The second one,
18:52
which is more common, is SCD 1: you do an overwrite, which means you update the records with the new information from the source system by overwriting the old values. So we do something like the upsert, update and insert, but of course you lose
19:08
history. Then we have SCD 2, where you want to add historization to your table. What we do here is insert a new record for each change we get from the source system; we do not overwrite or delete the old data, we just mark it as
19:24
inactive, and the new record becomes the active one. So there are different methods for historization as well while you are loading the data into the data warehouse.
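A rough sketch of that SCD 2 pattern, for a hypothetical customer dimension with start_date, end_date, and is_current columns:

```sql
-- Step 1: close the current version of any record whose tracked attribute changed.
UPDATE dim
SET dim.end_date   = GETDATE(),
    dim.is_current = 0
FROM gold.dim_customers AS dim
JOIN staging.customers  AS src
  ON dim.customer_id = src.customer_id
WHERE dim.is_current = 1
  AND dim.country <> src.country;          -- the attribute we track history for

-- Step 2: insert a new active version for every customer without an active row
-- (brand-new customers plus the ones we just closed above).
INSERT INTO gold.dim_customers (customer_id, country, start_date, end_date, is_current)
SELECT src.customer_id, src.country, GETDATE(), NULL, 1
FROM staging.customers AS src
LEFT JOIN gold.dim_customers AS dim
       ON dim.customer_id = src.customer_id
      AND dim.is_current = 1
WHERE dim.customer_id IS NULL;
```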
19:39
All right, so those are the different types and techniques you might encounter in data management projects. Now I'm going to show you quickly which of them we will be using in our project. If we are talking about the extraction, we will be doing a pull extraction; full or incremental, it's going to be a full extraction; and as the technique, we are going to be parsing files into the
19:55
data warehouse. Now, about the data transformations: here we will cover everything. All the types of transformations I just showed you are going to be part of the project, because I believe that in every data project you will face those transformations. If we look at the load, our
20:12
project is going to use batch processing, and for the load method we will be doing a full load, since we have a full extraction, using truncate and insert. About the historization, we will be doing SCD 1, which means we will be updating the
20:28
content of the data warehouse. So those are the different techniques and types that we will be using in the ETL process for this project.
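Put together, the load pattern for this project's first layer could look roughly like this: a full load via truncate and insert, with the CSV file parsed by BULK INSERT (the path, table name, and options below are placeholders, not the project's real ones):

```sql
-- Full load of one bronze table from a CSV file.
TRUNCATE TABLE bronze.crm_cust_info;

BULK INSERT bronze.crm_cust_info
FROM 'C:\datasets\source_crm\cust_info.csv'
WITH (
    FIRSTROW = 2,            -- skip the header row
    FIELDTERMINATOR = ',',   -- CSV column separator
    TABLOCK                  -- lock the table for a faster bulk load
);
```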
20:43
All right, so with that we now have a clear understanding of what a data warehouse is, and we are done with the theory part. The next step is to start with the project. The first thing you have to do is prepare your environment to develop the project, so let's start with that. We go to the link in the description, and from there we go to the downloads, and
20:58
here you can find all the materials for all courses and projects, but the one we need now is the SQL data warehouse project. So let's follow the link, and here we have a bunch of links that we need for the project. The most important one, to get all the data and files, is this one: download all project
21:14
files. So let's do that. After you do that, you're going to get a zip file with a lot of stuff inside, so let's extract it. Inside it you will find the repository structure from Git, and the most important part here is the data sets: you have two sources, the CRM
21:31
and the ERP, and in each of them there are three CSV files. Those are the data sets for the project. Don't worry about the other stuff, we will explain it during the project. So go and get the data and put it somewhere on your PC where you won't lose it. Okay, so what else do we have?
21:47
We have a link to the Git repository. This is the link to my repository that I created for the project, so you can go and access it, but don't worry about it, we're going to explain the whole structure during the project and you will be creating your own repository. And we also have the link to Notion, where we do the
22:04
project management. There you're going to find the main steps, the main phases of the SQL project that we will do, as well as all the tasks that we will be doing together during the project. And then we have links to the project tools. If you don't have it already, go and download SQL Server Express: it's
22:21
a server that runs locally on your PC, where your database is going to live. Another one that you have to download is SQL Server Management Studio; it is just a client for interacting with the database, and that's where we're going to run all our queries. Then a link to GitHub, and also a link
22:36
to draw.io. If you don't have it already, go and download it: it is a free and amazing tool for drawing diagrams. Throughout the project we will be drawing data models, the data architecture, the data lineage, so a lot of stuff we'll be doing with this tool, so
22:51
go and download it. And the last thing, which is nice to have: you have a link to Notion, where you can of course create a free account if you want to build the project plan and follow along with me by creating the project steps and tasks. Okay, so that's all;
23:06
those are all the links for the project. So go and download all that stuff, create the accounts, and once you are ready we continue with the project. All right, so now I hope that you have downloaded all the tools and created the accounts. Now it's time to
23:23
move to a very important step that almost everyone skips when doing projects, and that is creating the project plan. For that we will be using the tool Notion. Notion is of course a free tool, and it can help you organize your ideas, your plans, and your
23:39
resources all in one place. I use it very intensively for my private projects, for example for creating this course, and I can tell you: creating a project plan is the key to success. A data warehouse project is usually very complex, and according to Gartner reports,
23:54
over 50% of data warehouse projects fail. In my opinion, for any complex project the key to success is to have a clear project plan. At this phase of the project we're going to create a rough project plan, because at the moment we don't yet have a clear understanding
24:11
of the data architecture. So let's go. Okay, now let's create a new page and call it Data Warehouse Project. The first thing is that we have to create the main phases and stages of the project, and for that we need a table. To do that, hit slash and then type "database inline", and then
24:28
let's call it something like "Data Warehouse Epics", and we're going to hide the title because I don't like it, and then on the table we can rename it to, for example, "Project Epics", something like that. Now what we're going to do is list all the big
24:43
tasks of the project. An epic is usually a large task that needs a lot of effort to solve; you can call them epics, stages, phases of the project, whatever you want. So we're going to list our project steps: it starts with the requirements
24:58
analysis, then designing the data architecture, and another one, the project initialization. Those are the first three big tasks in the project. Now what do we need? We need another table for the small chunks of work, the
25:15
subtasks, and we're going to do the same thing: hit slash, search for the inline table, and do the same as before. First we're going to call it "Data Warehouse Tasks" and then hide the title, and over here we're going to rename it and say this is the "Project Tasks". Now what we're going to
25:32
do is go to the plus icon over here and search for "relation", this one with the arrow, and then search for the name of the first table. We called it "Data Warehouse Epics", so let's click it, and we're going to choose a two-way relation as well, so let's add the
25:49
relation. With that we get a field in the new table called "Data Warehouse Epics", which comes from the first table, and we also have here "Data Warehouse Tasks", which comes from the table below, so as you can see we have linked them together. Now I'm going to move this to the left side, and then what
26:04
we're going to do is select one of those epics, for example let's take "Design the Data Architecture", and break this epic down into multiple tasks, for example "Choose data management approach".
26:19
Then we have another task, for which we select the same epic: maybe the next step is "Brainstorm and design the layers". Then let's go to another epic, for example the project initialization, and here we add, for example, "Create Git
26:36
repo and prepare the structure". We can make another one in the same epic, let's say "Create the database and the schemas". So as you can see, I'm just defining the subtasks of those epics. Now what we're going to do is add a checkbox
26:51
in order to track whether we have done a task or not. So we go to the plus, search for "checkbox", and we're going to make the column really small, like this. With that, each time we are done with a task, we click on it just to mark that we
27:07
have done the task. Now, there is one more thing that isn't working nicely: we're going to end up with a long list of tasks, and it's really annoying. So we go to the plus over here and search for "rollup", and select it. Now we have to select the
27:23
relationship, which is going to be the Data Warehouse Tasks, and after that we go to the property and set it to the checkbox. As you can see, the first table now shows how many tasks are closed, but I don't want to show it like this, so we go to the calculation, choose percent, and then "percent checked",
27:39
and with that we can see the progress of our project. And now, instead of the numbers, we can have a really nice bar. Great. We can also give it a name, like "Progress". That's it, and we can hide the Data Warehouse Tasks column, and with that we have a really nice progress bar for each epic, and if we
27:55
close all the tasks of this epic, we can see that we have reached 100%. So this is the main structure. Now we can add some cosmetics and rename stuff to make things look nicer; for example, if I go to the tasks over here I can call it "Tasks" and also
28:11
change the icon to something like this. And if you'd like to have an icon for all those epics, we go to the epic, for example "Design Data Architecture", and if you hover over the title you can see "Add an icon", and you can pick any icon that you want, for
28:27
example this one. And now, as you can see, we have defined it here at the top, and the icon appears in the table below as well. Okay, one more thing we can do for the project tasks is group them by the epics: if you go to the three dots, then to groups, we can
28:42
group by the epics, and as you can see we now have a section for each epic. You can also sort the epics if you want: go over here, sort, then manual, and you can start sorting the epics as you like. With that you can expand and collapse
28:58
each section if you don't want to always see all tasks in one go. So this is a really nice way to build project management for your projects. Of course, in companies we use professional tools for projects, for example Jira, but for the private projects that I do, I always do it like
29:15
this, and I really recommend you do it too, not only for this project but for any project you are doing, because if you see the whole project in one go you can see the big picture, and closing tasks like this, these small things can make you really satisfied and keep you motivated to finish the whole
29:30
project, and make you proud. Okay friends, so I just went and added a few icons, renamed some stuff, and added more tasks for each epic, and this is going to be our starting point for the project. Once we have more information, we're going to add more details about how exactly
29:46
we're going to build the data warehouse. At the start we're going to analyze and understand the requirements, and only after that do we start designing the data architecture. Here we have three tasks: first we have to choose the data management approach, after that we do the brainstorming and design the layers
30:03
of the data warehouse, and at the end we draw the data architecture. With that we have a clear understanding of how the data architecture looks. After that we go to the next epic, where we start preparing our project. Once we have a clear understanding of the data
30:18
architecture, the first task here is to create detailed project tasks, so we add more epics and more tasks. Once we are done with that, we create the naming conventions for the project, just to make sure that we have rules and standards across the whole project. Next, we're going
30:34
to create a repository in Git and prepare the structure of the repository, so that we always commit our work there, and then we can start with the first script, where we create the database and the schemas. So my friends, this is the initial plan for the project.
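As a preview, and only as a sketch of what that first script might look like (the database name and layer schemas here follow the Medallion layout we decide on later), it could be as simple as:

```sql
-- Create one database for the warehouse and one schema per layer.
CREATE DATABASE DataWarehouse;
GO

USE DataWarehouse;
GO

CREATE SCHEMA bronze;
GO
CREATE SCHEMA silver;
GO
CREATE SCHEMA gold;
GO
```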
30:51
Now let's start with the first epic, the requirements analysis. Analyzing the requirements is very important in order to understand which type of data warehouse you're going to build, because there is not just one standard way to build it, and if you
31:07
go and implement the data warehouse blindly, you might be doing a lot of stuff that is totally unnecessary and you will burn a lot of time. That's why you have to sit with the stakeholders, with the departments, and understand what exactly has to be built, and depending on the requirements you
31:23
design the shape of the data warehouse. So now let's analyze the requirements of this project. The whole project is split into two main sections: in the first section we have to build a data warehouse, so this is a data engineering task, where we will develop the ETLs and the data warehouse, and once
31:41
we have done that, we have to build analytics and reporting, business intelligence, so we're going to do data analysis. But first we will focus on the first part, building the data warehouse. So what do we have here? The statement is very simple; it says:
31:56
develop a modern data warehouse using SQL Server to consolidate sales data, enabling analytical reporting and informed decision making. So this is the main statement, and then we have the specifications. The first one is about the data sources: it says import data
32:12
from two source systems, ERP and CRM, and they are provided as CSV files. The second point is about data quality: we have to clean and fix data quality issues before we do the data analysis, because, let's be real, there is no raw data that is perfect; something is
32:29
always missing, and we have to clean that up. The next point is about the integration: it says we have to combine both sources into one single, user-friendly data model that is designed for analytics and reporting. That means we have to merge those
32:45
two sources into one single data model. Then we have another specification: focus on the latest data set, so there is no need for historization, which means we don't have to build history in the database. And the final requirement is about the documentation: it
33:01
says provide clear documentation of the data model, the last product of the data warehouse, to support both the business users and the analytics teams. That means we have to produce a manual that helps the users and makes life easier for the consumers of our data. So as you can see,
33:18
these may be very generic requirements, but they already carry a lot of information: they say we have to use SQL Server as the platform, we have two source systems delivered as CSV files, it sounds like the data quality in the sources is really bad, it wants us to focus on building a
33:35
completely new data model designed for reporting, it says we don't have to do historization, and it is expected that we produce documentation of the system. These are the requirements for the data engineering part, where we're going to build a data warehouse that fulfills
33:52
these requirements. All right, so with that we have analyzed the requirements, and we have also closed the first and easiest epic; we are done with it, so let's close it. Now let's open another one: here we have to design the data architecture, and the first task is to choose the data management approach, so
34:10
let's go. Designing the data architecture is exactly like building a house. Before construction starts, an architect designs a plan, a blueprint for the house: how the rooms
34:25
will be connected, how to make the house functional, safe, and wonderful. Without this blueprint from the architect, the builders might create something unstable, inefficient, or maybe unlivable. The same goes for data projects: a data architect is like a house architect; they design how your
34:41
data will flow, integrate, and be accessed. As data architects we make sure the data warehouse is not only functional but also scalable and easy to maintain, and this is exactly what we will do now: we will play the role of the data architect and start brainstorming and designing the
34:58
architecture of the data warehouse. Now I'm going to show you a sketch to help you understand the different approaches for designing a data architecture. This phase of a project is usually very exciting for me, because this is my main role in data projects, I am a data architect, and
35:14
I discuss a lot of different projects where we try to find the best design for the project. All right, so let's go. The first step of building a data architecture is to make a very important decision: choosing between four major
35:30
types. The first approach is to build a data warehouse. It is very suitable if you have only structured data and your business wants to build a solid foundation for reporting and business intelligence. Another approach is to build a data lake. This one is way more flexible than
35:47
a data warehouse, because you can store not only structured data but also semi-structured and unstructured data. We usually use this approach if you have mixed types of data, like database tables, logs, images, videos, and your business wants to focus not only on reporting but also on
36:03
advanced analytics or machine learning. But it's not as organized as a data warehouse, and a data lake, if it's too disorganized, can turn into a data swamp, and this is where we need the next approach. The next one is to build a data lakehouse, which is like a
36:18
mix between a data warehouse and a data lake: you get the flexibility of having different types of data from the data lake, but you still structure and organize your data like we do in a data warehouse, so you mix those two worlds into one. This is a very modern way of building data architectures, and it
36:35
is currently my favorite way of building a data management system. Now, the last and most recent approach is to build a data mesh. This is a little bit different: instead of having a centralized data management system, the idea in a data mesh is to make it decentralized. You don't have one centralized
36:51
data management system, because centralized always means bottleneck; instead you have multiple departments and multiple domains, where each of them builds a data product and shares it with the others. So now you have to pick one of those approaches, and in this project we will
37:07
be focusing on the data warehouse. So now the question is: how do we build the data warehouse? Well, there are also four different approaches to building it. The first one is the Inmon approach. Again you have your sources, and the first layer starts with the staging, where the raw data lands. Then in
37:23
the next layer you organize your data in something called the enterprise data warehouse, where you model the data using the third normal form; it's about how to structure and normalize your tables, so you are building a new, integrated data model
37:38
from the multiple sources. Then we go to the third layer, called the data marts, where you take a small subset of the data warehouse and design it in a way that is ready to be consumed for reporting, focused on only one topic, for example
37:53
customers, sales, or products. After that you connect your BI tool, like Power BI or Tableau, to the data marts. So with that you have three layers to prepare the data before reporting. Moving on to the next one, we have the Kimball approach. He says, you know what,
38:08
building this enterprise data warehouse wastes a lot of time, so what we can do is jump immediately from the stage layer to the final data marts, because building the enterprise data warehouse is a big struggle and usually wastes a lot of time. So he
38:23
wants you to focus on building the data marts as quickly as possible. It is a faster approach than Inmon, but over time you might get chaos in the data marts, because you are not always focusing on the big picture and you might be repeating the same transformations and integrations in different data marts. So
38:40
there is a trade-off between speed and a consistent data warehouse. Moving on to the third approach, we have the Data Vault. We still have the stage and the data marts, and it says we still need this central data warehouse in the middle, but this middle layer gets more standards and
38:56
rules: it tells you to split this middle layer into two layers, the raw vault and the business vault. In the raw vault you have the original data, and in the business vault you have all the business rules and transformations that prepare the data for the data marts. So the
39:12
Data Vault is very similar to Inmon, but it brings more standards and rules to the middle layer. Now I'm going to add a fourth one, which I'll call the Medallion architecture, and this one is my favorite because it is very easy to understand and to
39:27
build. It says you're going to build three layers: bronze, silver, and gold. The bronze layer is very similar to the stage, and we have learned over time that the stage layer is very important, because having the original data as it is helps a lot with traceability and finding issues. The
39:44
next layer is the silver layer; it is where we do transformations and data cleansing, but we don't apply any business rules yet. Moving on to the last layer, the gold layer: it is also very similar to the data marts, but there we can build different types of objects, not only for reporting but also for
40:00
machine learning, for AI, and for many different purposes. They are business-ready objects that you want to share as data products. So those are the four approaches you can use to build a data warehouse. Again, if you are building a data architecture
40:16
you have to specify which approach you want to follow: at the start we said we want to build a data warehouse, and then we had to decide between those four approaches for how to build it, and in this project we will be using the Medallion architecture. So this is a very important question that you have to answer as the first
40:32
step of building a data architecture. All right, so with that we have decided on the approach, so we can mark it as done. The next step is to design the layers of the data warehouse. Now, there is no 100%
40:50
standard way with fixed rules for each layer; what you have to do as a data architect is define exactly what the purpose of each layer is. We start with the bronze layer, and we say it is going to store raw, unprocessed data as it is from the sources. Why are we doing
41:06
that? It is for traceability and debugging. If you have a layer where you keep the raw data, exactly as it is from the sources, you can always go back to the bronze layer and investigate the data of a specific source if something goes wrong. So the main objective is to
41:22
have raw, untouched data that helps you as a data engineer analyze the root cause of issues. Moving on to the silver layer: it is the layer where we store clean and standardized data, and this is the place where we do basic transformations in order to prepare the
41:39
data for the final layer. Now, the gold layer is going to contain business-ready data, so the main goal here is to provide data that can be consumed by business users and analysts in order to build reporting and analytics. So with that we have defined the main goal for
41:54
each layer. Next, I would like to define the object types, and since we are talking about a data warehouse in a database, we generally have two types here: either a table or a view. For the bronze layer and the silver layer we are going with tables, but for the gold layer we are going with
42:10
views. The best practice says: for the last layer in your data warehouse, make it virtual using views. That gives you a lot of flexibility and of course speed in building it, since we don't need a load process for it.
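To make that concrete, a business-ready object in the gold layer could be sketched as a view over silver tables like this (the names are illustrative, not the project's final data model):

```sql
-- A virtual, business-ready object: no load process, always reflects the silver layer.
CREATE VIEW gold.dim_customers AS
SELECT
    ci.customer_id,
    ci.first_name,
    ci.last_name,
    ei.country,
    ei.birth_date
FROM silver.crm_cust_info AS ci
LEFT JOIN silver.erp_cust_info AS ei
       ON ci.customer_id = ei.customer_id;
```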
42:25
Now, the next step is to define the load method. In this project I have decided to go with the full load, using the truncate-and-insert method; it is just faster and way easier. So for the bronze layer we go with the full load, and you have to specify it for the silver layer as well: there we also go with the full load, and of course for the views we don't need any
42:41
load process. So every time you decide to go with tables, you have to define the load method: full load, incremental load, and so on. Now we come to the very interesting part, the data transformations. For the bronze layer this is the easiest topic, because we don't have any transformations: we have to commit
42:58
to not touching the data, not manipulating it, not changing anything, so it stays as it is; if it comes in bad, it stays bad in the bronze layer. Then we come to the silver layer, where we have the heavy lifting. As we committed in the objective, we have to produce clean and standardized data, and for
43:14
that we have different types of transformations: we have to do data cleansing, data standardization, data normalization, we have to derive new columns and do data enrichment. So there is a whole bunch of transformations we have to do in order to prepare the data. Our focus here is to transform
43:32
the data to make it clean and standards-compliant, and to try to push all business transformations to the next layer. That means in the gold layer we will be focusing on the business transformations that are needed by the consumers, for the use cases. So what do we do there? We do data
43:47
integration between the source systems, we do data aggregations, we apply a lot of business logic and rules, and we build a data model that is ready for, for example, business intelligence. So in the gold layer we do a lot of business transformations, and in the silver layer we do basic data
44:02
transformations. It is really important here to clearly decide what types of transformations are done in each layer and to make sure that you commit to those rules. Now, the next aspect is data modeling. In the bronze layer and the silver layer
44:17
we will not break the data model that comes from the source system: if the source system delivers five tables, we're going to have five tables here too, and in the silver layer as well we will not denormalize or normalize or make something new; we leave it exactly as it comes from the source
44:32
system. That's because we're going to build the data model in the gold layer, and here you have to define which data model you want to follow: are you following the star schema, the snowflake schema, or are you just making aggregated objects? So you have to make a list of all the data model types
44:47
that you're going to follow in the gold layer. And at the end, what you can specify for each layer is the target audience, and this is of course a very important decision. For the bronze layer you don't want to give access to any end user; it is really important to make sure that only data engineers access the bronze layer. It makes no
45:03
sense for data analysts or data scientists to go to the bad data, because you have a better version of it in the silver layer. In the silver layer, of course, the data engineers have to have access, as well as the data analysts, the data scientists, and so on, but you still don't give it to any
45:19
business user who can't deal with the raw data model from the sources, because for the business users you're going to build a better layer, and that is the gold layer. The gold layer is suitable for the data analysts as well as the business users, because usually
45:35
the business users don't have deep knowledge of the technicalities of the silver layer. So if you are designing multiple layers, you have to discuss all those topics and make clear decisions for each layer. All right my friends, so now, before we proceed with the design, I want to tell you a secret principle, a concept
45:51
that every data architect must know, and that is the separation of concerns. So what is that? As you are designing an architecture, you have to make sure to break down the complex system into smaller, independent parts, where each part is responsible for a specific task. And
46:08
here comes the magic: the components of your architecture must not be duplicated, so you cannot have two parts doing the same thing. The idea is to not mix everything, and this is one of the biggest mistakes in any big project; I have seen it almost everywhere.
46:25
A good data architect follows this concept, this principle. For example, if you look at our data architecture, we have already done that: we have defined a unique set of tasks for each layer. For example, we have said that in the silver layer we do data cleansing, but in
46:41
the gold layer we do business transformations, and with that you will not be allowed to do any business transformations in the silver layer, and the same goes for the gold layer: you don't do any data cleansing in the gold layer. So each layer has its own unique tasks. The same goes for
46:57
the bronze layer and the silver layer: you are not allowed to load data from the source systems directly into the silver layer, because we have decided that the landing layer, the first layer, is the bronze layer; otherwise you would have one set of source systems that are loaded first into the bronze layer and another set
47:14
that skips the layer and goes into the silver, and with that you have overlap: you are doing data ingestion in two different layers. So my friends, if you have this mindset, separation of concerns, I promise you, you're going to be a data architect, so think about it. All right my
47:29
friends, so with that we have designed the layers of the data warehouse, and we can close that task. The next step is to go to draw.io and start drawing the data architecture. There is no single standard for how to draw a data
47:45
architecture; you can add your own style and do it the way you want. Now, the first thing we have to show in the data architecture is the different layers that we have. The first layer is the source system layer, so let's take a box like this, make it a little bit bigger, and I'm just going to work on
48:00
the design: I'm going to remove the fill and make the line a dotted one, and after that I'm going to change the color to something like this gray. So now we have a container for the first layer, and then we have to add a text on top of it. So I'm going to take another box and type inside it
48:17
"Sources", and I'm going to style it: go to the text, make it maybe 24, then remove the lines like this, make it a little bit smaller and put it on top. So this is the first layer, this is where the data comes from, and then the data goes inside a
48:33
data warehouse, so I'm just going to duplicate this one; this one is the data warehouse. All right, so what is the third layer going to be? It's going to be the consumers who will be consuming this data warehouse, so I'm going to put in
48:50
another box and say this is the consume layer. Okay, so those are the three containers. Now, inside the data warehouse we have decided to build with the Medallion architecture, so we're going to have three layers inside the warehouse. So I'm going to take another box,
49:06
and I'm going to call this one the bronze layer. Now we have to give it a design: I'm going to go with this color over here, then the text, maybe something like 20, and then make it a little bit smaller and just put it here. Beneath that we're
49:22
going to have the components, so this is just the title of a container; I'm going to have it like this, remove the text from inside it and remove the fill. So this container is for the bronze layer. Let's duplicate it for the next one, so this one is going to be
49:38
the silver layer, and of course we can change the coloring to gray, because it is silver, and the lines as well, and remove the fill. Great. Now maybe I'm going to make the font bold. All right, now the third layer is going to be the gold
49:54
layer and we have to go and pick it color for that so style and here we have like something like yellow the same thing for the container I remove the filling so with that we are showing now the different layers inside our data warehouse now those containers are empty what we're going to do we're going to go
50:10
inside each one of them and start adding contents so now in the sources it is very important to make it clear what are the different types of source system that you are connecting to the data warehouse because in real project there are like multiple types you might have a database API files CFA and here it's
50:26
important to show those different types in our projects we have folders and inside those folders We have CSV files so now what you have to do we have to make it clear in this layer that the input for our project is CSV file so it really depend how you want to show that I'm going to go over here and say maybe
50:42
folder and then I'm going to go and take the folder and put it here inside and then maybe search for file more results and go pick one of those icons for example I'm going to go with this one over here so I'm going to make it smaller and add it on top of the folder so with that we make it clear for
50:57
everyone seeing the architecture that the sources is not a database is not an API it is a file inside the folder so now very important here to show is the source systems what are the sources that is involved in the project so here what we're going to do we're going to go and give it a name for example we have one
51:13
source called CRM B like this and maybe make the icon and we have another source called Erp so we going to go and duplicate it put it over here and then rename it Erp so now it is for everyone clear we have two sources for the this project and the technology is used is
51:28
simply a file so now what we can do as well we can go and add some descriptions inside this box to make it more clear so what I'm going to do I'm going to take a line because I want to split the description from the icons something like this and make it gray and then below it we're going to go and add some text and we're going to say is CSV file
51:47
and the next point and we can say the interface is simply files in a folder and of course you can go and add any specifications and explanation about the sources if it is a database you can say the type of the database and so on so with that we made it in the data
52:02
architecture clear what are the sources of our data warehouse and now the next step what we're going to do we're going to go and design the content of the bronze silver and gold so I'm going to start by adding like an icon in each container just to show that we are talking about a database so what we're going to do we're going to go and search
52:18
for database and then more results I'm going to go with this icon over here so let's go and make it bigger something like this maybe change the color of that so we're going to have the bronze and as well here the silver
52:34
and the gold so now what we're going to do we're going to go and add some arrows between those layers so we're going to go over here so we can go and search for Arrow and maybe go and pick one of those let's go and put it here and we can go and pick a color for that maybe something like this and adjust it so now
52:50
we can have this nice Arrow between all the layers just to explain the direction of our architecture right so we can read this from left to right and as well between the gold layer and the consume okay so now what I'm going to do next we're going to go and add one statement
53:05
about each layer the main objective so let's go and grab a text and put it beneath the database and we're going to say for example for the bronze layer it's going to be the raw data maybe make the text bigger so you are the raw data and then the next one in the silver you are
53:21
cleansed standardized data and then the last one for the gold we can say business ready data so with that we make the objective clear for each layer now below all those icons what we going to do we're going to have a separator again
53:36
like this make it like colored and beneath it we're going to add the most important specifications of this layer so let's go and add those separators in each layer okay so now we need a text below it let's take this one here so what is the object type of the bronze
53:53
layer it's going to be a table and we can go and add the load methods we say this is batch processing since we are not doing streaming we can say it is a full load we are not doing incremental load so we can say here
54:08
truncate and insert and then we add one more section maybe about the transformations so we can say no transformations and one more about the data model we're going to say none (as is) and now what I'm going to do I'm going to go and add those specifications as
54:24
well for the silver and gold so here what we have discussed the object type the load process the transformations and whether we are building the data model or not the same thing for the gold layer so I can say with that we have really nice layering of the data warehouse and what we are
54:39
left with is the consumers over here you can go and add the different use cases and tools that can access your data warehouse like for example I'm adding here business intelligence and reporting maybe using Power BI or Tableau or you can say you can access my data warehouse in order to do ad-hoc analysis using SQL
54:56
queries and this is what we're going to focus on in the projects after we build the data warehouse and as well you can offer it for machine learning purposes and of course it is really nice to add some icons in your architecture and usually I use this nice website called Flaticon it has really amazing icons that you can
55:12
go and use it in your architecture now of course we can go and keep adding icons and stuff to explain the data architecture and as well the system like for example it is very important here to say which tools you are using in order to build this data warehouse is it in the cloud are you using Azure data
55:27
bricks or maybe Snowflake so we're going to go and add for our project the icon of SQL Server since we are building this data warehouse completely in the SQL Server so for now I'm really happy about it as you can see we have now a plan right all right guys so with that we have designed the data architecture
55:43
using draw.io and with that we have done the last step in this epic and now with that we have a design for the data architecture and we can say we have closed this epic now let's go to the next one we will start doing the first step to prepare our projects and the first task here is to create a detailed
55:59
project plan all right my friends so now it's clear for us that we have three layers and we have to go and build them so that means our big epics are going to be named after the layers so here I have added three more epics so we have build bronze layer
56:16
build silver layer and build gold layer and after that I went and started defining all the different tasks that we have to follow in the projects so at the start will be analyzing then coding and after that we're going to go and do testing and once everything is ready we're going
56:32
to go and document stuff and at the end we have to commit our work in the git repo all those epics are following the same like pattern in the tasks so as you can see now we have a very detailed project structure and now things are clearer for us how we going to
56:47
build the data warehouse so with that we are done from this task and now the next task we have to go and Define the naming Convention of the projects all right so now at this phase of the projects we usually Define the naming
57:03
conventions so what is that it is a set of rules that you define for naming everything in the projects whether it is a database schema tables stored procedures folders anything and if you don't do that at the early phase of the project I promise you chaos can happen
57:21
because what going to happen you will have different developers in your projects and each of those developers have their own style of course so one developer might name a table dimension_customers where everything is lowercase and between the words an underscore and you have another developer creating another table
57:36
called DimensionProducts but using the camel case so there is no separation between the words and the first character is capitalized and maybe another one using some prefixes like dim_categories so we have here like a shortcut of the dimension so as you can see there are different designs and
57:53
styles and if you leave the door open what can happen in the middle of the projects you will notice okay everything looks inconsistent and you will have to define a big task to go and rename everything following a specific rule so instead of wasting all this time at this phase you
58:09
go and Define the naming conventions and let's go and do that so we will start with a very important decision and that is which naming convention we going to follow in the whole project so you have different cases like the camel case the Pascal case the Kebab case and the snake
58:25
case and for this project we're going to go with the snake case where all the letters of a word going to be lowercase and the separation between words going to be an underscore for example a table name called customer_info customer is lowercase info is as well lowercase
58:42
and between them an underscore so this is always the first thing that you have to decide for your data project the second thing is to decide the language so for example I work in Germany and there is always like a decision that we have to make whether we use German or English so we have to decide for our
58:58
project which language we're going to use and a very important general rule is to avoid reserved words so don't use a SQL reserved word as an object name like for example table don't give a table the name table so those are the general principles so those are the
59:15
general rules that you have to follow in the whole project this applies for everything for tables columns stored procedures any names that you are giving in your scripts now moving on we have specifications for the table names and here we have different set of rules for
59:30
each layer so here the rule says source system underscore entity so we are saying all the tables in the bronze layer should start first with the source system name like for example CRM or ERP and after that we have an underscore and then at the end we have the entity name
59:47
or the table name so for example we have this table name crm_cust_info so that means this table comes from the source system CRM and then we have the table name the entity name customer info so this is the rule that we're going to follow in naming all tables in the bronze layer then
00:03
moving on to the silver layer it is exactly like the bronze because we are not going to rename anything we are not going to build any new data model so the naming going to be one to one like the bronze so it is exactly the same rules as the bronze but if we go to the gold
00:19
here since we are building new data model we have to go and rename things and since as well we are integrating multi sources together we will not be using the source system name in the tables because inside one table you could have multiple sources so the rule says all the names must be meaningful
00:36
business aligned names for the tables starting with the category prefix so here the rule says it starts with category then underscore and then entity now what is category we have in the gold layer different types of tables so we could build a table called a fact table
00:53
another one could be a dimension a third type could be an aggregation or report so we have different types of tables and we can specify those types as a prefix at the start so for example we are seeing here fact_sales so the
01:09
category is fact and the table name called sales and here I just made like a table with different types of patterns so we could have a dimension so we say it starts with dim_ for example dim_customers or dim_products and then we have another type called fact table so
01:25
it starts with fact_ or an aggregated table where we have the first three characters agg_ like aggregating the customers or the sales monthly so as you can see as you are creating a naming convention you have first to make it clear what is the rule describe each
01:40
part of the rule and start giving examples so with that we make it clear for the whole team which names they should follow so we talked here about the table naming convention then you can as well go and make naming convention for the columns like for example in the gold layer we're going to go and have
01:56
surrogate keys so we can define it like this the surrogate key should start with the table name and then underscore key like for example we can call it customer_key it is a surrogate key in the dimension customers the same thing for technical columns as a data engineer
02:11
we might add our own columns to the tables that don't come from the source system and those columns are the technical columns or sometimes we call them metadata columns now in order to separate them from the original columns that comes from the source system we can have like a prefix for that like
02:28
for example the rule says if you are building any technical or metadata columns the column should start with dwh_ and then the column name for example if you want the metadata load date we can have dwh_load_date so with that if anyone
02:44
sees that column starts with dwh we understand this data comes from a data engineer and we can keep adding rules like for example the stored procedures over here if you are making an ETL script then it should start with the prefix load_ and then the layer
02:59
for example the stored procedure that is responsible for loading the bronze going to be called load_bronze and for the silver load_silver so those are currently the rules for the stored procedures so this is how I do it usually in my projects.
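To make the conventions above concrete, here is a small hedged illustration; the exact object names below are just examples that follow the rules described in the video, not a fixed list from the project (the agg_ and dwh_ prefixes in particular are assumptions):

```sql
-- Illustrative names only, following the snake_case conventions described above.
-- Bronze / Silver: <source_system>_<entity>, kept one-to-one with the source
SELECT * FROM bronze.crm_cust_info;
SELECT * FROM silver.crm_cust_info;

-- Gold: <category>_<entity> with business-aligned names
SELECT * FROM gold.dim_customers;      -- dimension
SELECT * FROM gold.fact_sales;         -- fact
SELECT * FROM gold.agg_sales_monthly;  -- aggregation (assumed agg_ prefix)

-- Surrogate keys: <table>_key, e.g. customer_key
-- Technical / metadata columns: dwh_<column>, e.g. dwh_load_date (assumed dwh_ prefix)
-- Stored procedures: load_<layer>
EXEC bronze.load_bronze;
EXEC silver.load_silver;
```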
03:14
All right my friends so with that we have solid naming conventions for our projects so this is done and next we're going to go to git and we will create a brand new repository and we're going to prepare its structure so let's go all right so now we come to another
03:31
important step in any projects and that's creating the git repository so if you are new to git don't worry about it it is simpler than it sounds so it's all about having a safe place where you can put the code that you are developing and you will have the possibility to track everything that happens to the code and as well you can use it
03:48
in order to collaborate with your team and if something goes wrong you can always roll back and the best part here once you are done with the project you can share your repository as a part of your portfolio and it is a really amazing thing if you are applying for a job by showcasing your skills that you have built a data warehouse by using a well
04:05
documented git repository so now let's go and create the repository of the project now we are at the overview of our account so the first thing that you have to do is to go to the repositories over here and then we're going to go to this green button and click on New the first thing that we have to do is to give
04:21
the repository a name so let's call it SQL data warehouse project and then here we can go and give it a description so for example I'm saying building a modern data warehouse with SQL Server now the next option is whether you want to make it public or private I'm going to leave it
04:37
as public and then let's go and add here a readme file and then here about the license we can go over here and select the MIT license the MIT license gives everyone the freedom of using and modifying your code okay so I think I'm happy with the setup let's go and create
04:53
the repository and with that we have our brand new repository now the next step that I usually do is to create the structure of the repository and usually I always follow the same patterns in any projects so here we need a few folders in order to put our files right so what I
05:09
usually do I go over here to add file create a new file and I start creating the structure over here so the first thing is that we need datasets then a slash and with that the repo can understand this is a folder not a file and then you can go and add anything
05:24
like here placeholder just an empty file this is just to help me to create the folders so let's go and commit so commit the changes and now if you go back to the main projects you can see now we have a folder called datasets so I'm going to go and keep creating stuff so I
05:41
will go and create the documents placeholder commit the changes and then I'm going to go and create the scripts placeholder and the final one what I usually add is the tests something like this so with that
06:00
as you can see now we have the main folders of our repository now what I usually do the next with that I'm going to go and edit the main readme so you can see it over here as well so what we're going to do we're going to go inside the read me and then we're going to go to the edit button here and we're going to start writing the main
06:16
information about our project this really depends on your style so you can go and add whatever you want this is the main page of your repository and now as you can see the file extension here is .md it stands for markdown it is just an easy
06:31
and friendly format in order to write a text so if you have like documentation you are writing a text it is a really nice format in order to organize it structure it and it is very friendly so what I'm going to do at the start I'm going to give a brief description about the project so we have the main title
06:47
and then we have like a welcome message and what this repository is about and in the next section maybe we can start with the project requirements and then maybe at the end you can say a few words about the licensing and a few words about you so as you can see it's like the homepage of
07:02
the project and the repository so once you are done we're going to go and commit the changes and now if you go to the main page of the repository you can see always the folder and files at the start and then below it we're going to see the informations from the read me so again here we have the welcome statement
07:19
and then the project requirements and at the end we have the licensing and about me so my friends that's it we have now a repository and we have now the main structure of the projects and through the projects as we are building the data warehouse we're going to go and commit all our work in this
07:35
repository nice right all right so with that we have now your repository ready and as we go in the projects we will be adding stuff to it so this step is done and now the last step finally we're going to go to the SQL server and we're going to write our first scripts where
07:51
we're going to create a database and schemas all right now the first step is we have to go and create brand new database so now in order to do that first we have to switch to the database master so you can do it like this use master and semicolon
08:09
and if you go and execute it now we are switched to the master database it is a system database in SQL Server where you can go and create other databases and you can see from the toolbar that we are now logged into the master database now the next step we have to go and create our new database so we're going to say
08:25
create database and you can call it whatever you want so I'm going to go with data warehouse semicolon let's go and execute it and with that we have created our database let's go and check it from the object explorer let's go and refresh and you can see our new data
08:41
warehouse this is our new database awesome right now to the next step we're going to go and switch to the new database so we're going to say use data warehouse and semicolon so let's go and switch to it and you can see now we are logged into the data warehouse
08:57
database and now we can go and start building stuff inside this data warehouse so now the first step that I usually do is I go and start creating the schemas so what is the schema think about it it's like a folder or a container that helps you to keep things organized so now as we decided in the
09:14
architecture we have three layers bronze silver gold and now we're going to go and create for each layer a schema so let's go and do that we're going to start with the first one create schema and the first one is bronze so let's do it like this and a semicolon let's go
09:29
and create the first schema nice so we have new schema let's go to our database and then in order to check the schemas we go to the security and then to the schemas over here and as you can see we have the bronze and if you don't find it you have to go and refresh the whole schemas and then you will find the new
09:46
schema great so now we have the first schema now what we're going to do we're going to go and create the other two so I'm just going to go and duplicate it so the next one going to be the silver and the third one going to be the gold so let's go and execute those two together we will get an error and that's because
10:02
we don't have the GO in between so after each command let's have a GO and now if I highlight the silver and gold and then execute it will be working the GO in SQL it is like a separator so it tells SQL first execute completely the
10:18
first command before going to the next one so it is just a separator now let's go to our schemas refresh and now we can see as well we have the gold and the silver so with this we have now a database we have the three layers and we can start developing each layer individually.
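If you want to follow along in code, a minimal sketch of what we just typed looks roughly like this, using the database and schema names from the video; GO is the batch separator discussed above:

```sql
-- Switch to the system database, create the new database, then switch into it
USE master;
GO

CREATE DATABASE DataWarehouse;
GO

USE DataWarehouse;
GO

-- One schema per layer; GO separates the batches so each CREATE SCHEMA
-- runs on its own (CREATE SCHEMA must be the only statement in its batch)
CREATE SCHEMA bronze;
GO
CREATE SCHEMA silver;
GO
CREATE SCHEMA gold;
GO
```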
10:37
Okay so now let's go and commit our work in the git so now since it is a script and code we're going to go to the folder scripts over here and then we're going to go and add a new file let's call it init_database.sql and now we're going to go and paste our code
10:52
over here so now I have done a few modifications like for example before we create the database we have to check whether the database exists this is an important step if you are recreating the database otherwise if you don't do that you will get an error where it's going to say the database already exists
11:09
so first it is checking whether the database exists then it drops it I have added a few comments like here we are saying creating the data warehouse creating the schemas and now we have a very important step we have to go and add a header comment at the start of each script to be honest after 3 months
11:26
from now you will not be remembering all the details of these scripts and adding a comment like this it is like a sticky note for you later once you visit this script again and it is as well very important for the other developers in the team because each time you open a scripts the first question going to be
11:42
what is the purpose of this script because if you or anyone in the team open the file the first question going to be what is the purpose of these scripts why we are doing this stuff so as you can see here we have a comment saying this script creates a new data warehouse after checking if it already
11:59
exists if the database exists it's going to drop it and recreate it and additionally it's going to go and create three schemas bronze silver gold so that it gives clarity what this script is about and it makes everyone's life easier now the second reason why this is very
12:14
important to add is that you can add warnings and especially for this script it is very important to add these notes because if you run these scripts what's going to happen it's going to go and destroy the whole database imagine someone open the script and run it imagine an admin open the script and run
12:31
it in your database everything going to be destroyed and all the data will be lost and this going to be a disaster if you don't have any backup so with that we have a nice header comment and we have added a few comments in our code and now we are ready to commit our code so let's go and commit it.
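A hedged sketch of how the top of such an init_database.sql can look, with the header comment, the warning and the existence check described here; the drop logic shown is one common pattern, and the wording is up to you:

```sql
/*
=============================================================
Create Database and Schemas
=============================================================
Purpose:
    Creates a new database 'DataWarehouse' after checking if it
    already exists; if it exists it is dropped and recreated,
    and the three schemas bronze, silver and gold are created.

WARNING:
    Running this script drops the entire 'DataWarehouse' database.
    All data will be permanently lost - make sure you have backups.
=============================================================
*/
USE master;
GO

-- Drop the database if it already exists (one common pattern)
IF EXISTS (SELECT 1 FROM sys.databases WHERE name = 'DataWarehouse')
BEGIN
    ALTER DATABASE DataWarehouse SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
    DROP DATABASE DataWarehouse;
END;
GO

-- ... then CREATE DATABASE, USE it and CREATE the schemas as shown earlier
```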
12:49
And now we have our scripts in the git as well and of course if you are doing any modifications make sure to update the changes in the git okay my friends so with that we have an empty database and schemas and we are done with this task and as well we are done with the whole epic so we have completed the project
13:05
initialization and now we're going to go to the interesting stuff we will go and build the bronze layer so now the first task is to analyze the source systems so let's go all right so now the big question is how to build the bronze layer so first
13:22
thing first we do analyzing as you are developing anything you don't immediately start writing a code so before we start coding the bronze layer what we usually do is we have to understand the source system so what I usually do I make an interview with the source system experts and ask them many
13:38
many questions in order to understand the nature of the source system that I'm connecting to the data warehouse and once you know the source systems then we can start coding and the main focus here is to do the data ingestion so that means we have to find a way on how to
13:54
load the data from The Source into the data warehouse so it's like we are building a bridge between the source and our Target system the data warehouse and once we have the code ready the next step is we have to do data validation so here comes the quality control it is
14:09
very important in the bronze layer to check the data completeness so that means we have to compare the number of Records between the source system and the bronze layer just to make sure we are not losing any data in between and another check that we will be doing is the schema checks and that's to make
14:24
sure that the data is placed in the right position and finally we don't have to forget about documentation and committing our work in the git so this is the process that we're going to follow to build the bronze
14:39
layer all right my friends so now before connecting any source systems to our data warehouse we have to take a very important step which is to understand the sources so how I usually do it I set up a meeting with the source systems experts in order to interview them to ask them a lot of stuff about the source
14:56
and gaining this knowledge is very important because asking the right question will help you to design the correct scripts in order to extract the data and to avoid a lot of mistakes and challenges and now I'm going to show you the most common questions that I usually ask before connecting anything okay so
15:12
we start first by understanding the business context and the ownership so I would like to understand the story behind the data I would like to understand who is responsible for the data which IT departments and so on and then it's nice to understand as well what business process it supports does
15:27
it support the customer transactions the supply chain Logistics or maybe Finance reporting so with that you're going to understand the importance of your data and then I ask about the system and data documentation so having documentations from the source is your learning materials about your data and it going
15:43
to save you a lot of time later when you are working and designing maybe new data models and as well I would like always to understand the data model for the source system and if they have like descriptions of the columns and the tables it's going to be nice to have the data catalog this can help me a lot in the
15:59
data warehouse how I'm going to go and join the tables together so with that you get a solid foundations about the business context the processes and the ownership of the data and now in The Next Step we're going to start talking about the technicality so I would like to understand the architecture and as well the technology stack so the first
16:16
question that I usually ask is how the source system is storing the data do we have the data on-prem like an SQL Server Oracle or is it in the cloud like Azure AWS and so on and then once we understand that then we can discuss what are the integration capabilities like
16:33
how I'm going to go and get the data does the source system offer APIs maybe Kafka or do they have only like file extractions or are they going to give you like a direct connection to the database so once you understand the technology that you're going to use in order to extract the data then we're going to deep dive
16:49
into more technical questions and here we can understand how to extract the data from the source system and then load it into the data warehouse so the first things that we have to discuss with the experts can we do an incremental load or a full load and then after that we're going to discuss the
17:04
data scope the historization do we need all data do we need only maybe 10 years of the data is the history already in the source system or should we build it in the data warehouse and so on and then we're going to go and discuss what is the expected size of the extracts are
17:20
we talking here about megabytes gigabytes terabytes and this is very important to understand whether we have the right tools and platform to connect the source system and then I try to understand whether there are any data volume limitations like if you have some Old Source systems they might struggle a
17:37
lot with performance and so on so if you have like an ETL that extracts a large amount of data you might bring the performance down of the source system so that's why you have to try to understand whether there are any limitations for your extracts and as well other aspects that might impact the performance of The
17:53
Source system this is very important if they give you an access to the database you have to be responsible that you are not bringing the performance of the database down and of course very important question is to ask about the authentication and the authorization like how you going to go and access the
18:08
data in the source system do you need any tokens keys passwords and so on so those are the questions that you have to ask if you are connecting a new source system to the data warehouse and once you have the answers for those questions you can proceed with the next steps to connect the sources to the data
18:24
warehouse all right my friends so with that you have learned how to analyze a new source system that you want to connect to your data warehouse so this step is done and now we're going to go back to coding where we're going to write scripts in order to do the data ingestion from the CSV files to the bronze
18:43
layer and let's have a quick look again to our bronze layer specifications so we just have to load the data from the sources to the data warehouse we're going to build tables in the bronze layer we are doing a full load so that means we are truncating and then inserting the data there will be no data
18:59
Transformations at all in the bronze layer and as well we will not be creating any data model so this is the specifications of the bronze layer all right now in order to create the ddl script for the bronze layer creating the tables of the bronze we have to understand the metadata the structure
19:15
the schema of the incoming data and here either you ask the technical experts from The Source system about these informations or you can go and explore the incoming data and try to define the structure of your tables so now what we're going to do we're going to start with the First Source system the CRM so
19:32
let's go inside it and we're going to start with the first table the customer info now if you open the file and check the data inside it you see we have header information and that is very good because now we have the names of the columns that are coming from the source and from the content you can define of
19:47
course the data types so let's go and do that first we're going to say create table and then we have to define the layer it's going to be the bronze and now very important we have to follow the naming convention so we start with the name of the source system it is the CRM underscore and then after that the table
20:03
name from the source system so it's going to be cust_info so this is the name of our first table in the bronze layer then the next step we have to go and define of course the columns and here again the column names in the bronze layer going to be one to one exactly like the source system so the
20:20
first one going to be the ID and I will go with the data type integer then the next one going to be the key NVARCHAR and the length I will go with
20:35
50 and the last one going to be the create date it's going to be DATE so with that we have covered all the columns available from the source system so let's go and check and yes the last one is the create date so that's it for the first table now a semicolon of course
20:51
at the end let's go and execute it and now we're going to go to the object explorer over here refresh and we can see the first table inside our data warehouse amazing right.
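As a rough sketch, the DDL for this first bronze table could look something like the block below; the column names and lengths are illustrative guesses based on what the video shows (an id, a key, name and status fields and a create date), so adjust them to whatever your CSV header actually contains:

```sql
-- Bronze table for the CRM customer info file (illustrative column names)
CREATE TABLE bronze.crm_cust_info (
    cst_id          INT,
    cst_key         NVARCHAR(50),
    cst_firstname   NVARCHAR(50),
    cst_lastname    NVARCHAR(50),
    cst_status      NVARCHAR(50),
    cst_create_date DATE
);
```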
21:07
So now next what you have to do is to go and create a DDL statement for each file for those two systems so for the CRM we need three DDLs and as well for the other system the ERP we have as well to create three DDLs for the three files so at the end we're going to have in the bronze layer six tables six DDLs so now pause the
21:22
video go create those DDLs I will be doing the same as well and we will see you soon all right so now I hope you have created all those DDLs I'm going to show you what I have just created so the second table in the source CRM we have
21:39
the product information and the third one is the sales details then we go to the second system and here we make sure that we are following the naming convention so first the source system ERP and then the table name so the second system was really easy you can see we have only here like two columns
21:55
and for the customers like only three and for the categories only four columns all right so after defining those stuff of course we have to go and execute them so let's go and do that and then we go to the object Explorer over here refresh the tables and with that you can see we
22:11
have six empty tables in the bronze layer and with that we have all the tables from the two Source systems inside our database but still we don't have any data and you can see our naming convention is really nice you see the first three tables comes from the CRM
22:26
source system and then the other three come from the ERP so we can see in the bronze layer the things are really split nicely and you can identify quickly which table belongs to which source system now there is something else that I usually add to the ddl script is to check whether the table
22:42
exists before creating so for example let's say that you are renaming or you would like to change the data type of a specific field if you just go and run this query you will get an error because the database going to say we already have this table so in other databases you can say create or replace
22:58
table but in SQL Server you have to go and build the T-SQL logic so it is very simple first we have to go and check whether the object exists in the database so we say IF OBJECT_ID and then we have to go and specify the table name so let's go and copy the whole thing over
23:15
here and make sure you get exactly the same name as the table name so there is like a space I'm just going to go and remove it and then we're going to go and define the object type so it's going to be 'U' it stands for user it is the user-defined tables so if this table is not null so this means the database did find
23:32
this object in the database so what can happen we say go and drop the table so the whole thing again and a semicolon so again if the table exists in the database it is not null then go and drop the table and after that go and create it so now if
23:49
you go and highlight the whole thing and then execute it it will be working so first drop the table if it exists then go and create the table from scratch now what you have to do is to go and add this check before creating any table inside our database so it's going to be
24:06
the same thing for the next table and so on I went and added all those checks for each table and what can happen if I go and execute the whole thing it's going to work so with that I'm recreating all the tables in the bronze layer from scratch.
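In T-SQL that check-then-drop pattern looks roughly like this, using the same bronze table as an example:

```sql
-- Drop the table if it already exists ('U' = user-defined table), then recreate it
IF OBJECT_ID('bronze.crm_cust_info', 'U') IS NOT NULL
    DROP TABLE bronze.crm_cust_info;
GO

CREATE TABLE bronze.crm_cust_info (
    cst_id  INT,
    cst_key NVARCHAR(50)
    -- ... remaining columns as in the DDL sketch above
);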
24:25
now the method that we're going to use in order to load the data from the source to the data warehouse is the bulk insert bulk insert is a method of loading a massive amount of data very quickly from files like CSV files or maybe a text file directly into a
24:41
database it is not like the classical normal inserts where it's going to go and insert the data row by row but instead the bulk insert is one operation that's going to load all the data in one go into the database and that's what makes it very fast so let's go and use this method okay so now let's start
24:58
writing the script in order to load the first table in the source CRM so we're going to go and load the table customer info from the CSV file to the database table so the syntax is very simple we're going to start by saying BULK INSERT so with that SQL understands we are doing
25:14
not a normal insert we are doing a bulk insert and then we have to go and specify the table name so it is bronze.crm_cust_info so now we have to specify the full location of the file that we are trying to load in this table
25:29
so now what we have to do is to go and get the path where the file is stored so I'm going to go and copy the whole path and then add it to the BULK INSERT exactly like where the data exists so for me it is in the sql data warehouse project folder under datasets in the source CRM folder and then I have to
25:47
specify the file name so it's going to be cust_info.csv you have to get it exactly like the path of your files otherwise it will not be working so after the path now we come to the WITH clause now we have to tell the SQL Server how to handle our file so here
26:04
comes the specifications there is a lot of stuff that we can define so let's start with a very important one the header row now if you check the content of our files you can see always the first row includes the header information of the file so those
26:19
values are actually not the data it's just the column names the actual data starts from the second row and we have to tell the database about this information so we're going to say the first row is actually the second row so with that we are telling SQL to skip the
26:37
first row in the file we don't need to load that information because we have already defined the structure of our table so this is the first specification the next one which is as well very important in loading any CSV file is the separator between fields the
26:52
delimiter between fields so it really depends on the file structure that you are getting from the source as you can see all those values are split with a comma and we call this comma a file separator or a delimiter and I saw a lot of different CSVs like sometimes they use
27:07
a semicolon or a pipe or a special character like a hash and so on so you have to understand how the values are split and in this file it's split by the comma and we have to tell SQL about this info it's very important so we going to say FIELDTERMINATOR and then
27:22
we're going to say it is the comma and basically those two pieces of information are very important for SQL in order to be able to read your CSV file now there are like many different options that you can go and add for example TABLOCK it is an option in order to improve the
27:38
performance where you are locking the entire table during loading it so as SQL is loading the data to this table it going to go and lock the whole table so that's it for now I'm just going to go and add the semicolon and let's go and insert the data from the file inside our
27:53
bronze table let's execute it and now you can see SQL did insert around 18,000 rows inside our table so it is working we just loaded the file into our database.
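Putting those options together, the load statement looks roughly like this; the file path is just a placeholder for wherever you stored the datasets on your machine:

```sql
-- Bulk load the CRM customer file into the bronze table:
-- FIRSTROW = 2 skips the header row, FIELDTERMINATOR = ',' matches the CSV delimiter,
-- TABLOCK locks the table during the load for better performance
BULK INSERT bronze.crm_cust_info
FROM 'C:\path\to\datasets\source_crm\cust_info.csv'   -- placeholder path
WITH (
    FIRSTROW = 2,
    FIELDTERMINATOR = ',',
    TABLOCK
);
```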
28:09
But now it is not enough to just write the script you have to test the quality of your bronze table especially if you are working with files so let's go and just do a simple SELECT * from our new table and let's run it so now the first thing that I check is do we have data like in each column well yes as you can see we have data and the second thing is
28:26
do we have the data in the correct column this is very critical as you are loading the data from a file to a database do we have the data in the correct column so for example here we have the first name which of course makes sense and here we have the last name but what could happen and this mistake happens a lot is that you find
28:43
the first name informations inside the key and as well you see the last name inside the first name and the status inside the last name so there is like shifting of the data and this data engineering mistake is very common if you are working with CSV files and there are like different reasons why it
28:59
happens maybe the definition of your table is wrong or the field separator is wrong maybe it's not a comma it's something else or the separator is a bad separator because sometimes maybe in the keys or in the first name there is a comma and the SQL is not able to split
29:15
the data correctly so the quality of the CSV file is not really good and there are many different reasons why you are not getting the data in the correct column but for now everything looks fine for us and the next step is that I go and count the rows inside this table so
29:31
let's go and select that so we can see we have 18,490 and now what we can do we can go to our CSV file and check how many rows do we have inside this file and as you can see we have 18,490 we are almost there there is like
29:46
one extra row inside the file and that's because of the header the first header row is not loaded inside our table and that's why always in our tables we're going to have one less row than the original files so everything looks nice and we have done this step correctly.
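The quick quality checks described here are just a couple of simple queries, for example:

```sql
-- Eyeball the data: are the values sitting in the right columns?
SELECT * FROM bronze.crm_cust_info;

-- Completeness check: the count should match the file's row count minus its header row
SELECT COUNT(*) AS row_count FROM bronze.crm_cust_info;
```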
30:01
Now if I go and run it again what's going to happen we will get duplicates inside the bronze layer so now we have loaded the file like twice inside the same table which is not really correct the method that we have discussed is first to make the table empty and then
30:18
load truncate and then insert in order to do that before the bulk insert what we're going to do we're going to say TRUNCATE TABLE and then we're going to have our table and that's it with a semicolon so now what we are doing is first we are
30:34
making the table empty and then we start loading from the scratch we are loading the whole content of the file inside the table and this is what we call full load so now let's go and Mark everything together and execute and again if you go and check the content of the table you
30:50
can see we have only 18,000 rows let's go and run it again the count of the bronze layer you can see we still have the 18,000 so each time you run this script now we are refreshing the table customer info from the file into the
31:05
database table so we are refreshing the bronze layer table so that means if there is like now any changes in the file it will be loaded to the table so this is how you do a full load in the bronze layer by truncating the table and then doing the inserts.
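So the full-load pattern for one table is simply truncate, then bulk insert again, something like:

```sql
-- Full load: empty the table, then reload the whole file from scratch
TRUNCATE TABLE bronze.crm_cust_info;

BULK INSERT bronze.crm_cust_info
FROM 'C:\path\to\datasets\source_crm\cust_info.csv'   -- placeholder path
WITH (FIRSTROW = 2, FIELDTERMINATOR = ',', TABLOCK);
```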
31:21
And now of course what we have to do is to pause the video and go and write the same script for all six files so let's go and do that okay back so I hope that you have as well written all those scripts so I
31:37
have the three tables in order to load the First Source system and then three sections in order to load the Second Source system and as I'm writing those scripts make sure to have the correct path so for the Second Source system you have to go and change the path for the other folder and as well don't forget the table name on the bronze layer is
31:54
different from the file name because we start always with the source system name with the files we don't have that so now I think everything is ready so let's go and execute the whole thing perfect awesome so everything is working let me check the messages so we can see
32:10
from the message how many rows are inserted in each table and now of course the task is to go through each table and check the content so that means now we have a really nice script in order to load the bronze
32:26
layer and we will use this script on a daily basis every day we have to run it in order to get new content to the data warehouse and as you learned before if you have like a script of SQL that is frequently used what we can do we can go and create a stored procedure from those
32:43
scripts so let's go and do that it's going to be very simple we're going to go over here and say CREATE OR ALTER PROCEDURE and now we have to define the name of the stored procedure I'm going to go and put it in the schema bronze because it belongs to the bronze layer so then we're going to go and follow the
32:59
naming convention the stored procedure starts with load_ and then the bronze layer so that's it about the name and then very important we have to define the BEGIN and as well the END of our SQL statements so here is the beginning and let's go to the end and say this is the
33:16
end and then let's go highlight everything in between and give it one push with tab so with that it is easier to read so now the next thing we're going to do we're going to go and execute it so let's go and create this stored procedure and now if you want to go and check your stored procedure you go to the database and
33:31
then we have here a folder called programmability and then inside we have stored procedures so if you go and refresh you will see our new stored procedure let's go and test it so I'm going to go and open a new query and what we're going to do we're going to say EXEC bronze.load_bronze so let's go and execute it.
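A stripped-down sketch of that stored procedure, wrapping the truncate-and-insert blocks; only one table is shown here, the real one repeats the pattern for all six tables:

```sql
-- Wrap the load script in a stored procedure so it can be re-run with one command
CREATE OR ALTER PROCEDURE bronze.load_bronze AS
BEGIN
    TRUNCATE TABLE bronze.crm_cust_info;
    BULK INSERT bronze.crm_cust_info
    FROM 'C:\path\to\datasets\source_crm\cust_info.csv'   -- placeholder path
    WITH (FIRSTROW = 2, FIELDTERMINATOR = ',', TABLOCK);
    -- ... same truncate + bulk insert for the remaining CRM and ERP tables
END;
GO

-- Refresh the whole bronze layer with a single call
EXEC bronze.load_bronze;
```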
33:48
And with that we have just loaded completely the bronze layer so as you can see SQL did go and insert all the data from the files to the bronze layer it is way easier than each time running those scripts of course all right so now the next step is that as you can see the output message it is really not having a
34:06
lot of information the message of your ETL stored procedure will not be really clear so that's why if you are writing an ETL script always take care of the messaging of your code so let me show you a nice design let's go back to our stored procedure so now what we can do
34:21
we can go and divide the messages based on our code so now we can start with a message for example over here let's say PRINT and we say what you are doing with this stored procedure we are loading the bronze layer so this is the main message the most important one and we can go and
34:37
play with the separators like this so we can say PRINT and now we can go and add some nice separators like for example the equals at the start and at the end just to have like a section so this is just a nice message at the start so now by looking at our code we can see that
34:52
our code is split into two sections the first section we are loading all the tables from the source system CRM and the second section is loading the tables from the ERP so we can split the prints by the source system so let's go and do that so we're going to say PRINT and
35:08
we're going to say loading CRM tables this is for the first section and then we can go and add some nice separators like this one let's take the minus and of course don't forget to add semicolons like me so we go and have a semicolon
35:24
for each print same thing over here I will go and copy the whole thing because we're going to have it at the start and as well at the end let's go copy the whole thing for the second section so for the ERP it starts over here and we're going to have it like this and
35:39
we're going to call it loading ERP so with that in the output we can see a nice separation between loading each source system now we go to the next step where we go and add like a print for each action so for example here we are truncating the table so we say PRINT and
35:55
now what we can do we can go and add two arrows and we say what we are doing so we are truncating the table and then we can go and add the table name in the message as well so this is the first action that we are doing and we can go and add another print for inserting the data so
36:10
we can say inserting data into and then we have the table name so with that in the output we can understand what SQL is doing so let's go and repeat this for all other tables Okay so I just added all those prints and don't forget the
36:25
semicolon at the end so I would say let's go and execute it and check the output so let's go and do that and then maybe at the start just to have quick output execute our stored procedure like this so let's see now if you check the output you can see things are more
36:42
organized than before so at the start we are reading okay we are loading the bronze layer now first we are loading the source system CRM and then the second section is for the ERP and we can see the actions so we are truncating inserting truncating inserting for each table and as
36:57
well the same thing for the second source so as you can see it is nice and cosmetic but it's very important as you are debugging any errors and speaking of errors we have to go and handle the errors in our stored procedure so let's go and do that it's going to be the first thing that we do we say BEGIN TRY and
37:14
then we go to the end of our scripts and we say before the last END we say END TRY and then the next thing we have to add the catch so we're going to say BEGIN CATCH and END CATCH so now first let's go and organize our code I'm going
37:30
to take the whole code and give it one more push and as well the BEGIN TRY so it is more organized and as you know the try and catch is going to go and execute the try and if there is like any errors during executing this script the second
37:47
section going to be executed so the catch will be executed only if the SQL failed to run that try so now what we have to do is to go and Define for SQL what to do if there's like an error in your code and here we can do multiple stuff like maybe creating a logging
38:02
tables and add the messages inside this table or we can go and add some nice messaging to the output like for example we can go and add like a section again over here so again some equals and we can go and repeat it over here and then add some content in between so we
38:19
can start with something like to say error occurred during loading bronze layer and then we can go and add many stuff like for example we can go and add the error message and here we can go and call the
38:35
function ERROR_MESSAGE and we can go and add as well for example the error number so ERROR_NUMBER and of course the output of this going to be a number but the error message here is a text so we have to go and change the data type so we're going
38:51
to do a CAST AS NVARCHAR like this and then there are like many functions that you can add to the output like for example the ERROR_STATE and so on so you can design what can happen if there is an error in the ETL.
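Inside the procedure, the messaging and the error handling can be sketched like this; the exact wording of the PRINT messages is cosmetic:

```sql
CREATE OR ALTER PROCEDURE bronze.load_bronze AS
BEGIN
    BEGIN TRY
        PRINT '=============================================';
        PRINT 'Loading Bronze Layer';
        PRINT '=============================================';

        PRINT '---------------------------------------------';
        PRINT 'Loading CRM Tables';
        PRINT '---------------------------------------------';

        PRINT '>> Truncating Table: bronze.crm_cust_info';
        TRUNCATE TABLE bronze.crm_cust_info;
        PRINT '>> Inserting Data Into: bronze.crm_cust_info';
        BULK INSERT bronze.crm_cust_info
        FROM 'C:\path\to\datasets\source_crm\cust_info.csv'   -- placeholder path
        WITH (FIRSTROW = 2, FIELDTERMINATOR = ',', TABLOCK);
        -- ... remaining tables and the 'Loading ERP Tables' section
    END TRY
    BEGIN CATCH
        -- Runs only if something inside the TRY block fails
        PRINT '=============================================';
        PRINT 'ERROR OCCURRED DURING LOADING BRONZE LAYER';
        PRINT 'Error Message: ' + ERROR_MESSAGE();
        PRINT 'Error Number : ' + CAST(ERROR_NUMBER() AS NVARCHAR);
        PRINT 'Error State  : ' + CAST(ERROR_STATE() AS NVARCHAR);
        PRINT '=============================================';
    END CATCH
END;
```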
39:09
Now what else is very important in each ETL process is to add the duration of each step so for example I would like to understand how long it takes to load this table over here but looking at the output I don't have any information on how long it is taking to load my tables and this is very important because as you are building like a big data warehouse
39:26
the ETL process is going to take a long time and you would like to understand where is the issue where is the bottleneck which table is consuming a lot of time to be loaded so that's why we have to add this information as well to the output or even maybe log it in a table so let's go and
39:41
add as well this step so we're going to go to the start and now in order to calculate the duration you need the starting time and the end time so we have to understand when we started loading and when we ended loading the table so now the first thing is we have to go and declare the variables so we're
39:58
going to say DECLARE and then let's make one called start_time and the data type of this going to be DATETIME I need exactly the second when it started and then another one for the end time so another variable end_time and as well the same thing DATETIME so with that we
40:14
have declared the variables and the next step is to go and use them so now let's go to the first table to the customer info and at the start we're going to say SET start_time equal to GETDATE so we will get the exact time when we start loading
40:31
this table and then let's go and copy the whole thing and go to the end of loading over here so we're going to say SET this time the end_time as well equal to GETDATE so with that now we have the values of when we start loading this table and when we completed loading
40:47
the table and now the next step is we have to go and print the duration information so over here we can go and say PRINT and we can go and have again the same design so two arrows and we can say very simply load duration and then a colon and a space and now
41:04
what we have to do is to calculate the duration and we can do that using the date and time function DATEDIFF in order to find the interval between two dates so we're going to say plus over here and then use DATEDIFF and here we have to define three arguments first one
41:19
is the unit so you can define second minute hours and so on so we're going to go with a second and then we're going to define the start of the interval it's going to be the start_time and then the last argument is going to be the end of the boundary it's going to be the end_time and now of course the output of this going to be a number that's why we
41:35
have to go and cast it so we're going to say CAST AS NVARCHAR and then we're going to close it like this and maybe at the end we're going to say plus space seconds in order to have a nice message so again what we have done we have declared the two variables and we are
41:51
using them at the start we are getting the current date and time and at the end of loading the table we are getting the current date and time and then we are finding the difference between them in order to get the load duration and in this case we are just printing this information and now we can
42:07
go of course and add some nice separator between each table so I'm going to go and do it like this just a few dashes not a lot of stuff so now what we have to do is to go and add this mechanism for each table in order to measure the speed of the ETL for each one of them.
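The per-table timing logic can be sketched like this; the same DATEDIFF idea is reused a bit later for the whole batch:

```sql
DECLARE @start_time DATETIME, @end_time DATETIME;

-- Capture the time right before loading the table
SET @start_time = GETDATE();

TRUNCATE TABLE bronze.crm_cust_info;
BULK INSERT bronze.crm_cust_info
FROM 'C:\path\to\datasets\source_crm\cust_info.csv'   -- placeholder path
WITH (FIRSTROW = 2, FIELDTERMINATOR = ',', TABLOCK);

-- Capture the time right after, then print the duration in seconds
SET @end_time = GETDATE();
PRINT '>> Load Duration: ' + CAST(DATEDIFF(second, @start_time, @end_time) AS NVARCHAR) + ' seconds';
PRINT '>> -------------';
-- ... repeat the same pattern for every other table
```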
42:24
Okay so now I have added all those configurations for each table and let's go and run the whole thing now so let's go and edit the stored procedure like this and
42:40
we're going to go and run it so let's go and execute so now as you can see we have here one more info about the load durations and it is everywhere I can see we have zero seconds and that's because it is super fast loading this information we are doing everything locally on the PC so loading the data from
42:57
files to the database is going to be mega fast but of course in real projects you have like different servers and networking between them and you have millions of rows in the tables of course the duration going to be not like 0 seconds things going to be slower and now you can see easily how long it takes to load
43:14
each of your tables and now of course what is very interesting is to understand how long it takes to load the whole bronze layer so now your task is as well to print at the end information about the whole batch how long it took to load the bronze
43:32
layer okay I hope we are done now I have done it like this we have to define two new variables so the start time of the batch and the end time of the batch and the first step in the stored procedure is to get the date and time information
43:49
for the first variable and exactly at the end the last thing that we do in the stored procedure we're going to go and get the date and time information for the end time so we say again SET GETDATE for the batch end time and then all what you have to do is to go and print a
44:05
message so we are saying loading bronze layer is completed and then we are printing total load duration and the same thing with a date difference between the batch start time and the end time and we are calculating the seconds and so on so now what you have to do is to go and execute the whole thing so
44:21
let's go and refresh the definition of the stored procedure and then let's go and execute it so in the output we have to go to the last message and we can see loading bronze layer is completed and the total load duration is as well 0 seconds because the execution time is less than
44:38
1 second so with that you are getting now a feeling about how to build an ETL process so as you can see the data engineering is not all about how to load the data it's how to engineer the whole pipeline how to measure the speed of loading the data what can happen
44:53
if there's like an error and to print each step in your ETL process and make everything organized and clear in the output and maybe in the logging just to make debugging and optimizing the performance way easier and there is like a lot of things that we can add we can
45:08
add the quality measures and stuff so we can add many stuff to our ETL scripts to make our data warehouse professional all right my friends so with that we have developed a code in order to load the bronze layer and we have tested that as well and now in the next step we're going to go back to draw.io because we want
45:24
to draw a diagram about the data flow so let's go so now what is a data flow diagram we're going to draw a simple visual in order to map the flow of your data where it comes from and where it ends up so we
45:41
want just to make clear how the data flows through different layers of your projects and that helps us to create something called the data lineage and this is really nice especially if you are analyzing an issue so if you have like multiple layers and you don't have a real data lineage or flow it's going
45:57
to be really hard to analyze the scripts in order to understand the origin of the data and having this diagram is going to improve the process of finding issues so now let's go and create one okay so now back to draw.io and we're going to go and build the flow diagram so we're going to start first with the source system so
46:14
let's build the layer I'm going to go and remove the fill dotted and then we're going to go and add like a box saying sources and we're going to put it over here increase the size 24 and as well without any lines now what do we
46:30
have inside the sources we have like folder and files so let's go and search for a folder icon I'm going to go and take this one over here and say you are the CRM and we can as well increase the size and we have another source we have the
46:46
Erp okay so this is the first layer let's go and now have the bronze layer so we're going to go and grab another box and we're going to go and make the coloring like this and instead of Auto maybe take the hatch maybe something like this whatever you know so rounded
47:03
and then we can go and put on top of it like the title so we can say you are the bronze layer and increase as well the size of the font so now what you're going to do we're going to go and add boxes for each table that we have in the bronze layer so for example we have the
47:20
sales details we can go and make it little bit smaller so maybe 16 and not bold and we have other two tables from the CRM we have the customer info and as well the product info so those are the three tables that comes from the CRM and
47:37
now what we're going to do we're going to go and connect now the source CRM with all three tables so what we going to do we're going to go to the folder and start making arrows from the folder to the bronze layer like this and now we have to do the same thing for the Erp
47:54
source so as you can see the data flow diagram shows us in one picture the data lineage between the two layers so here we can see easily those three tables actually comes from the CRM and as well those three tables in the bronze layer are coming from the Erp I understand if
48:09
we have like a lot of tables it's going to be a huge mess but if you have a small or medium data warehouse building those diagrams is going to make things really easier to understand how everything is flowing from the sources into the different layers in your data warehouse all right so with that we have
48:26
the first version of the data flow so this step is done and the final step is to commit our code to the Git repo okay so now let's go and commit our work since it is scripts we're going to go to the folder scripts and here we're
48:42
going to have scripts for the bronze silver and gold that's why maybe it makes sense to create a folder for each layer so let's go and start creating the bronze folder so I'm going to go and create a new file and then I'm going to say bronze/ and then we can have the DDL script of the bronze layer dot
48:59
SQL so now I'm going to go and paste the DDL code that we have created so those six tables and as usual at the start we have a comment where we are explaining the purpose of this script so we are saying this script creates tables in the bronze schema and by running the script you are redefining the DDL
49:15
structure of the bronze tables so let's have it like that and I'm going to go and commit the changes all right so now as you can see inside the scripts we have a folder called bronze and inside it we have the ddl script for the bronze layer and as well in the bronze folder
49:31
we're going to go and put our stored procedure so we're going to go and create a new file let's call it proc load bronze. SQL and then let's go and paste our scripts and as usual I have put at the start an explanation about the stored procedure so we are saying this
49:47
stored procedure is going to load the data from the CSV files into the bronze schema so it is going to truncate the tables first and then do a bulk insert and about the parameters this stored procedure does not accept any parameters or return any values and here is a quick
50:02
example how to execute it all right so I think I'm happy with that so let's go and commit it all right my friends so with that we have committed our code into the Git repo and with that we are done building the bronze layer so the whole layer is done.
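A header comment along those lines could look roughly like this (the procedure name is an assumption, just to illustrate the idea):

    /*
    =============================================================
    Stored Procedure: Load Bronze Layer (Source -> Bronze)
    =============================================================
    Purpose:
        Loads data from external CSV files into the 'bronze' schema.
        It truncates the bronze tables first and then uses BULK INSERT
        to reload them from the files.
    Parameters:
        None. This stored procedure does not accept any parameters
        or return any values.
    Usage Example:
        EXEC bronze.load_bronze;
    =============================================================
    */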
50:18
Now we're going to go to the next one, the silver layer. This one is going to be more advanced than the bronze layer because there will be a lot of struggle with cleaning the data and so on so we're going to start with the first task where we're going to analyze and explore the data in the source systems so let's
50:34
go okay so now we're going to start with the big question how to build the silver layer what is the process okay as usual first things first we have to analyze and now the task before building anything in the silver layer we have to go and explore the data in order to understand the content of our sources
50:51
once we have it what we're going to do we will start coding and here the transformation that we're going to do is data cleansing this is usually a process that takes a really long time and I usually do it in three steps the first step is to check the data quality issues that we have in the bronze layer so before
51:07
writing any data transformations first we have to understand what the issues are and only then I start writing data transformations in order to fix all those quality issues that we have in the bronze and the last step once I have clean results what we're going to do we're going to go and insert it into the
51:23
silver layer and those are the three phases that we will be going through as we are writing the code for the silver layer and the third step once we have all the data in the silver layer we have to make sure that the data is now correct and we don't have any quality issues anymore and if you find any issues of course
51:38
what are you going to do we're going to go back to coding we're going to do the data cleansing and again check so it is like a cycle between validating and coding once the quality of the silver layer is good we can move on to the last phase where we are going to document and commit our work in the Git repo and here we're
51:53
going to have two new pieces of documentation we're going to build the data flow diagram and as well the data integration diagram after we understood the relationships between the sources from the first step so this is the process and this is how we are going to build the silver
52:11
layer all right so now exploring the data in the bronze layer so why is it very important because understanding the data is the key to making smart decisions in the silver layer it was not the focus in the bronze layer to understand the content of the data at all we focused only on how
52:26
to get the data to the data warehouse so that's why we have now to take a moment in order to explore and understand the tables and as well how to connect them what are the relationship between these tables and it is very important as you are learning about a new source system
52:42
is to create like some kind of documentation so now let's go and explore the sources okay so now let's go and explore them one by one we can start with the first one from the CRM we have the customer info so right click on it and say select top thousand rows and this is of course important if you have
52:57
like a lot of data don't go and explore millions of rows always limit your queries so for example here we are using TOP 1000 just to make sure that you are not impacting the system with your queries so now let's have a look at the content of this table.
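The exploration queries are nothing fancier than limited SELECTs, for example (the table names here follow the naming convention but are assumptions):

    -- Always limit exploration queries so you don't scan millions of rows
    SELECT TOP (1000) * FROM bronze.crm_cust_info;
    SELECT TOP (1000) * FROM bronze.crm_prd_info;
    SELECT TOP (1000) * FROM bronze.crm_sales_details;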
53:14
So we can see that we have customer information here so we have an ID we have a key for the customer we have first name last name marital status gender and the creation date of the customer so simply this is a table for customer information and a lot of details for the customers and here we have two identifiers one is like a technical ID and another one is like the customer
53:32
number so maybe we can use either the ID or the key in order to join it with other tables so now what I usually do is to go and draw like data model or let's say integration model just to document and visual what I am understanding because if you don't do that you're going to forget it after a while so now
53:48
we go and search for a shape let's search for table and I'm going to go and pick this one over here so here we can go and change the style for example we can make it rounded or you can go make it sketch and so on and we can go and change the color so I'm going to make it blue then go to the text make sure to
54:04
select the whole thing and let's make it bigger 26 and then what I'm going to do for those items I'm just going to select them and go to arrange and maybe make it 40 something like this so now what we're going to do we're going to just go and put the table name so this is the one
54:21
that we are now learning about and what I'm going to do I'm just going to go and put here the primary key I will not go and list all the informations so the primary key was the ID and I will go and remove all those stuff I don't need it now as you can see the table name is not really friendly so I can go and bring a
54:38
text and put it here on top and say this is the customer information just to make it friendly and do not forget about it and as well going to increase the size to maybe 20 something like this okay with that we have our first table and
54:53
we're going to go and keep exploring so let's move to the second one we're going to take the product information right click on it and select the top thousand rows I will just put it below the previous query query it now by looking to this table we can see we have product informations so we have here a primary
55:10
key for the product and then we have like key or let's say product number and after that we have the full name of the product the product costs and then we have the product line and then we have like start and end well this is interesting to understand why we have start and ends let's have a
55:26
look for example for those three rows all of those three having the same key but they have different IDs so it is the same product but with different costs so for 2011 we have the cost of 12 then 2012 we have 14 and for the last year
55:44
2013 we have 13 so it's like we have a history of the changes so this table is not only holding the current information of the product but also historical information of the products and that's why we have those two dates start and end now let's go back and draw this
56:00
information over here so I'm just going to go and duplicate it so the name of this table is going to be the prd info and let's go and give it like a short description current and historical product information something like this just to not forget that we have history in this
56:16
table and here we have as well the PRD ID and there is like nothing that we can use in order to join those two tables we don't have like a customer ID here or in the other table we don't have any product ID okay so that's it for this table let's jump to the third table and the last one in the CRM so let's go and
56:34
select I just made other queries as well short so let's go and execute so what do you have over here we have a lot of informations about the order the sales and a lot of measures order number we have the product key so this is something that we can use in order to join it with the product table we have
56:50
the customer ID we don't have the customer key so here we have like ID and here we have key so there's like two different ways on how to join tables and then we have here like dates the order dates the shipping date the due date and then we have the sales amount the
57:06
quantity and the price so this is like an event table it is transactional table about the orders and sales and it is great table in order to connect the customers with the products and as well with the orders so let's document this new information that we have so the
57:22
table name is the sales details so we can go and describe it like this transactional records about sales and orders and now we have to go and describe how we can connect this table to the other two so we are not using the
57:39
product ID we are using the product key and now we need a new column over here so you can hold control and enter or you can go over here and add a new row and the other row is going to be the customer ID so now for the the customer ID it is easy we can gr and grab an
57:54
arrow in order to connect those two tables but for the product key we are not using the ID so that's why I'm just going to go and remove this one and say product key let's have here again a check so this is a product key it's not a product ID and if we go and check the old table the products info you can see
58:11
we are using this key and not the primary key so what we're going to do now we will just go and Link it like this and maybe switch those two tables so I will put the customer below just perfect it looks nice okay so let's keep moving let's go now to the other source
58:27
system we have the ERP and the first one is the ERP customer table with its cryptic name let's go and select the data so now here it's a small table and we have only three pieces of information so we have here something called CID and then we have something I think this is the birthday
58:43
and the gender information so we have here male female and so on so it looks again like the customer informations but here we have like extra data about the birthday and now if you go and compare it to the customer table that we have from the other source system let's go and query it you can see the new table
58:59
from the ERP doesn't have IDs it actually has the customer number or the key so we can go and join those two tables using the customer key let's go and document this information so I will just go and copy paste and put it here on the right side I will just go and change the
59:15
color now since we are now talking about a different source system and here the table name is going to be this one and the key is called CID now in order to join this table with the customer info we cannot join it with the customer ID we need the customer key that's why here we
59:31
have to go and add a new row so contrl enter and we're going to say customer key and then we have to go and make a nice Arrow between those two keys so we're going to go and give it a description customer information and here we have the birth
59:47
dates okay so now let's keep going we're going to go to the next one we have the Erp location let's go and query this table so what do you have over here we have the CID again and as you can see we have country informations and this is of
00:02
course again the customer number and we have only this information the country so let's go and document this information this is the customer location the table name is going to be like this and we still have the same ID so we have here still the customer ID and we can go and join it
00:18
using the customer key and we have to give it the description location of customers and we can say here the country okay so now let's go to the last table and explore it we have the ERP PX category table so let's go and query those
00:35
informations so what do we have here we have like an ID a category a subcategory and the maintenance here we have like either yes and no so by looking to this table we have all the categories and the subcategories of the products and here we have like special identifier for
00:52
those informations now the question is how to join it so I would like to join it actually with the product informations so let's go and check those two tables together okay so in the products we don't have any ID for the categories but we have these informations actually in the product key
01:08
so the first five characters of the product key is actually the category ID so we can use this information over here in order to join it with the categories so we can go and describe this information like this and then we have to go and give it a name and then here
01:24
we have the ID and the ID could be joined using the product key so that means for the product information we don't need at all the product ID the primary key all what we need is the product key or the product number and what I would like to do is like to group
01:39
those informations in a box so let's go grab like any boxes here on the left side and make it bigger and then make the edges a little bit smaller let's remove move the fill and the line I will make a dotted line and then let's grab
01:56
another box over here and say this is the CRM and we can go and increase the size maybe something like 40 smaller 35 bold and change the color to Blue and just place it here on top of this box so with that we can understand all those
02:11
tables belongs to the source system CRM and we can do the same stuff for the right side as well now of course we have to go and add the description here so it's going to be the product categories all right so with that we have now clear understanding how the
02:27
tables are connected to each others we understand now the content of each table and of course it can to help us to clean up the data in the silver layer in order to prepare it so as you can see it is very important to take time understanding the structure of the tables the relationship between them
02:44
before starting to write any code all right so with that we have now a clear understanding about the sources and with that we have as well created a data integration diagram in draw.io so with that we have more understanding about how to connect the sources and now in the next two tasks we will go back to SQL where
03:00
we're going to start checking the quality and as well doing a lot of data transformations so let's go okay so now let's have a quick look at the specifications of the silver layer so the main objective is to have clean and standardized data we have to
03:17
prepare the data before going to the gold layer and we will be building tables inside the silver layer and the way of loading the data from the bronze to the silver is a full load so that means we're going to truncate and then insert and here we're going to have a lot of data transformations so we're going to clean the data we're going to
03:33
bring normalizations standardizations we're going to derive new columns we will be doing as well data enrichment so a lot of things to be done in the data transformation but we will not be building any new data model so those are the specifications and we have to commit ourself to this scope okay so now
03:50
building the ddl script for the layer going to be way easier than the bronze because the definition and the structure of each table in the silver going to be identical to the bronze layer we are not doing anything new so all what you have to do is to take the ddl script from the
04:05
bronze layer and just go and search and replace for the schema I'm just using the notepad++ for the scripts so I'm going to go over here and say replace the bronze dots with silver dots and I'm going to go and replace all so with that now all the ddl is targeting the schema
04:22
silver layer which is exactly what we need all right now before we execute our new ddl script for the silver we have to talk about something called the metadata columns they are additional columns or fields that the data Engineers add to each table that don't come directly from
04:38
the source systems but the data Engineers use it in order to provide extra informations for each record like we can add a column called create date is when the record was loaded or an update date when the the record got updated or we can add the source system
04:55
in order to understand the origin of the data that we have or sometimes we can add the file location in order to understand the lineage from which file the data come from those are great tool if you have data issue in your data warehouse if there is like corrupt data
05:10
and so on this can help you to track exactly where this issue happens and when and as well it is great in order to understand whether I have gaps in my data especially if you are doing incremental loads it is like putting labels on everything and you will thank yourself
05:25
later when you start using them in hard times as you have an issue in your data warehouse so now back to our ddl scripts and all what you have to do is to go and do the following so for example for the first table I will go and add at the end one more extra column so it start with
05:41
the prefix dwh as we have defined in the naming convention and then an underscore let's have the create date and the data type is going to be DATETIME2 and now what we can do is we can go and add a default value for it I want the database to generate this information
05:57
automatically we don't have to specify that in any ETL scripts so which value is it going to be it's GETDATE() so each record that gets inserted into this table will automatically get a value from the current date and time so now as you can see the naming convention is
06:12
very important all those columns come from the source system and only this one column comes from the data engineer of the data warehouse okay so that's it let's go and repeat the same thing for all the other tables so I will just go and add this piece of information to each DDL.
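As a sketch, one of the silver DDL statements with the metadata column could look roughly like this (the column list and the exact names are assumptions):

    CREATE TABLE silver.crm_cust_info (
        cst_id             INT,
        cst_key            NVARCHAR(50),
        cst_firstname      NVARCHAR(50),
        cst_lastname       NVARCHAR(50),
        cst_marital_status NVARCHAR(50),
        cst_gndr           NVARCHAR(50),
        cst_create_date    DATE,
        -- metadata column: added by the data engineer, not coming from the source
        dwh_create_date    DATETIME2 DEFAULT GETDATE()
    );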
06:29
All right so I think that's it all you have to do now is go and execute the whole ddl script for the silver layer let's go and do that all right perfect there are no errors let's go and refresh the tables in the object Explorer and with that as you can see we have six tables for the silver layer it
06:46
is identical to the bronze layer but we have one extra column for the metadata all right so now for the silver layer before we start writing any data transformations and cleansing we have first to detect the quality issues in
07:03
the bronze layer without knowing the issues we cannot find solutions right we will explore the quality issues first and only then start writing the transformation scripts so let's
07:19
go okay so now what we're going to do we're going to go through all the tables of the bronze layer clean up the data and then insert it into the silver layer so let's start with the first table the first bronze table from the source CRM so we're going to go to the bronze CRM
07:34
customer info so let's go and query the data over here now of course before writing any data and Transformations we have to go and detect and identify the quality issues of this table so usually I start with the first check where we go and check the primary key so we have to
07:51
go and check whether there are nulls inside the primary key and whether there are duplicates so now in order to detect the duplicates in the primary key what we have to do is to go and aggregate the primary key if we find any value in the primary key that exist more than once that means it is not unique and we have
08:07
duplicates in the table so let's go and write query for that so what we're going to do we're going to go with the customer ID and then we're going to go and count and then we have to group up the data so Group by based on the primary key and of course we don't need all the results we need only where we
08:23
have an issue so we're going to say having counts higher than one so we are interested in the values where the count is higher than one so let's go and execute it now as you can see we have issue in this table we have duplicates
08:38
because all those IDs exist more than one in the table which is completely wrong we should have the primary key unique and you can see as well we have three records where the primary key is empty which is as well a bad thing now there is an issue here if we have only one null it will not be here at the
08:55
result so what I'm going to do I'm going to go over here and say or the primary key is null just in case if we have only one null I'm still interested to see the results so if I go and run it again we'll get the same results so this is a quality check that you can do on the table.
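The primary key check is essentially this pattern (the table and column names are assumptions):

    -- Check for NULLs or duplicates in the primary key
    -- Expectation: no results
    SELECT
        cst_id,
        COUNT(*) AS occurrences
    FROM bronze.crm_cust_info
    GROUP BY cst_id
    HAVING COUNT(*) > 1 OR cst_id IS NULL;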
09:11
As you can see it is not meeting the expectation so that means we have to do something about it so let's go and create a new query so here what we're going to do we can start writing the query that is doing the data transformation and the data cleansing so let's start again by selecting the
09:28
data and execute it again so now what I usually do I go and focus on the issue so for example let's go and take one of those values and focus on it before starting to write the transformation so we're going to say where customer ID
09:44
equal to this value all right so now as you can see we have here the issue where the ID exist three times but actually we are interested only on one of them so the question is how to pick one of those usually we search for a timestamp or date value to help us so if you check
10:00
the creation date over here we can understand that this record this one over here is the newest one and the previous two are older than it so that means if I have to go and pick one of those values I would like to get the latest one because it holds the most
10:16
fresh information so what we have to do is go and rank all those values based on the create date and only pick the highest one so that means we need a ranking function and for that in SQL we have the amazing window functions so let's go and do that we
10:32
will use the function row number over and then Partition by and here we have to divide the table by the customer ID so we're going to divide it by the customer ID and in order now to rank those rows we have to sort the data by
10:49
something so order by and as we discussed we want to sort the data by the creation date so create date and we're going to sort it descending so the highest first then the lowest so let's go and do that and now we're going to go and give it the name
11:04
flag last so now let's go and executed now the data is sorted by the creation date and you can see over here that this record is the number one then the one that is older is two and the oldest one is three of course we are interested in the rank number one now let's go and
11:21
remove the filter and check everything so now if you have a look at the table you can see that in the flag we have everywhere like one and that's because those primary keys exist only once but sometimes we will not have one we will have two three and so on if there are duplicates we can go of course and
11:37
do a double check so let's go over here and say select star from this query and we're going to say where flag last is not equal to one so let's go and query it and now we can see all the data that we don't need because they are causing duplicates in the
11:52
primary key and they have like an old status so what we're going to do we're going to say equal to one and with that we guarantee that our primary key is unique and each value exist only once so if I go and query it like this you will see we will not find any duplicate inside our table and we can go and check
12:09
that of course so let's go and check this primary key and we're going to say and customer ID equal to this value and you can see it exists now only once and we are getting the freshest data for this key so with that we have defined a transformation in order to remove any duplicates.
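Put together, the de-duplication looks roughly like this (a sketch, with assumed names):

    -- Keep only the freshest record per customer id
    SELECT *
    FROM (
        SELECT
            *,
            ROW_NUMBER() OVER (
                PARTITION BY cst_id              -- one window per customer id
                ORDER BY cst_create_date DESC    -- newest record gets rank 1
            ) AS flag_last
        FROM bronze.crm_cust_info
    ) t
    WHERE flag_last = 1;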
12:25
Okay so now moving on to the next one as you can see in our table we have a lot of values that are string values now for these string values we have to check for unwanted spaces so now let's go and write a query that's going to detect those unwanted
12:41
spaces so we're going to say select this column the first name from our table bronze customer information so let's go and query it now by just looking to the data it's going to be really hard to find those unwanted spaces especially if they are at the end
12:58
of the value but there is a very easy way to detect those issues so what we're going to do we're going to do a filter so now we're going to say the first name is not equal to the first name after trimming the values so if you use the function TRIM what it's going to do is go and remove all the
13:15
leading and trailing spaces so the first name so if this value is not equal to the first name after trimming it then we have an issue so it is very simple let's go and execute it so now in the result we will get the list of all first names
13:31
where we have spaces either at the start or at the end so again the expectation here is no results and the same thing we can go and check something else like for example the last name so let's go and do that over here and here let's go and
13:48
execute it we see in the result we have as well customers where they have like space in their last name which is not really good and we can go and keep checking all the string values that you have inside the table so for example the gender so let's go and check
14:04
that and execute now as you can see we don't have any results that means the quality of the gender is better and we don't have any unwanted spaces so now we have to go and write transformation in order to clean up those two columns now what I'm going to do I'm just going to go and list all the column in the query
14:22
instead of the star all right so now I have a list of all the columns that I need and now what we have to do is to go to those two columns and start removing The Unwanted spaces so we'll just use the trim it's very simple and give it a name of course the
14:37
same name and we will trim the last name as well so let's go and query this and with that we have cleaned up those two columns from any unwanted spaces.
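The unwanted-spaces check and the fix follow this pattern (the column names are assumptions):

    -- Check: values that differ from their trimmed version have unwanted spaces
    -- Expectation: no results
    SELECT cst_firstname
    FROM bronze.crm_cust_info
    WHERE cst_firstname != TRIM(cst_firstname);

    -- Fix inside the cleansing query: trim the affected string columns
    SELECT
        TRIM(cst_firstname) AS cst_firstname,
        TRIM(cst_lastname)  AS cst_lastname
    FROM bronze.crm_cust_info;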
14:53
Okay so now moving on we have those two columns the marital status and as well the gender if you check the values inside those two columns as you can see we have low cardinality so we have a limited number of possible values used inside those two columns so what we usually do is go and check the data consistency inside those two columns so it's very simple
15:09
what we're going to do we're going to do the following we're going to say distinct and we're going to check the values let's go and do that and now as you can see we have only three possible values either null F or M which is okay we can stay like this of course but we
15:24
can make a rule in our project where we can say we will not be working with data abbreviations we will go and use only friendly full names so instead of having an F we're going to have like a full word female and instead of M we're going to have like male and we make it as a
15:40
rule for the whole project so each time we find the gender informations we try to give the full name of it so let's go and map those two values to a friendly one so we're going to go to the gender of over here and say case when and we're going to say the gender is equal to F
15:57
then make it a female and when it is equal to M then map it to male and now we have to make a decision about the nulls as you can see over here we have nulls so do we
16:13
want to leave it as a null or we want to use always the value unknown so with that we are replacing the missing values with a standard default value or you can leave it as a null but let's say in our project that we are replacing all the missing value with a default value so
16:29
let's go and do that we going to say else I'm going to go with the na not available or you can go with the unknown of course so that's for the gender information like this and we can go and remove the old one and now there is one thing that I usually do in this case
16:46
where sometimes what happens currently we are getting the capital F and the capital M but maybe in the the time something changed and you will get like lower M and lower F so just to make sure in those cases we still are able to map those values to the correct value what we're going to do we're going to just
17:01
use the function upper just to make sure that if we get any lowercase values we are able to catch it so the same thing over here as well and now one more thing that you can add as well of course if you are not trusting the data because we
17:17
saw some unwanted spaces in the first name and the last name you might not trust that in the future you will get here as well unwanted spaces you can go and make sure to trim everything just to make sure that you are catching all those cases so that's
17:33
it for now let's go and execute now as you can see we don't have an M and an F we have the full words male and female and if we don't have a value we don't have a null we are getting here not available now we can go and do the same stuff for the marital status you can see as well we
17:49
have only three possibilities the S null and an M we can go and do the same stuff so I will just go and copy everything from here and I will go and use the marital status I just remove this one from here and now what are the possible values we have the S so it's
18:05
going to be single we have an M for married and we have as well a null and with that we are getting the not available so with that we are making as well data standardizations for this column so let's go and execute it now as
18:21
you can see we don't have those short values we have a full friendly value for the status and as well for the gender and at the same time we are handling the nulls inside those two columns so with that we are done with those two columns.
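The normalization of those two coded columns boils down to this kind of CASE expression (the 'n/a' default and the column names are assumptions):

    SELECT
        CASE WHEN UPPER(TRIM(cst_gndr)) = 'F' THEN 'Female'
             WHEN UPPER(TRIM(cst_gndr)) = 'M' THEN 'Male'
             ELSE 'n/a'                                   -- default value for NULLs / unknown codes
        END AS cst_gndr,
        CASE WHEN UPPER(TRIM(cst_marital_status)) = 'S' THEN 'Single'
             WHEN UPPER(TRIM(cst_marital_status)) = 'M' THEN 'Married'
             ELSE 'n/a'
        END AS cst_marital_status
    FROM bronze.crm_cust_info;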
18:36
Now we can go to the last one the create date for this type of information we make sure that this column is a real date and not a string or varchar and as we defined in the data type it is a date which is completely correct so nothing to do for this column and now the next step is that we're going to go and write the insert statement so how are we going to
18:53
do it we're going to go to the start over here and say insert into silver dot CRM customer info now we have to go and specify all the columns that should be inserted so we're going to go and type it so something like this and then we
19:08
have the query over here let's go and execute it so let's do that so with that we have inserted clean data into the silver table so now what we're going to do we're going to go and take all the queries that we have used in order to check the quality of the bronze and let's go and take them to another query
19:25
and instead of having bronze we're going to say silver so this is about the primary key let's go and execute it perfect we don't have any results so we don't have any duplicates the same thing for the next one so the silver and it was for the first name so let's go and
19:43
check the first name and run it as you can see there is no results it is perfect we don't have any issues you can of course go and check the last name and run it again we don't have any result over here and now we can go and
19:58
check those low cardinality columns like for example the gender let's go and execute it so as you can see we have the not available or the unknown male and female so perfect and you can go and have a final look to the table to the silver
20:14
customer info let's go and check that so now we can have a look at all those columns as you can see everything looks perfect and you can see the metadata information that we have added to the table definition is working it now says when we have inserted all those records into the table which is really
20:31
amazing information to have a track and audit okay so now by looking to the script we have done different types of data Transformations the first one is with the first name and the last name here we have done trimming removing unwanted spaces this is one of the types of data cleansing so we remove
20:47
unnecessary spaces or unwanted characters to ensure data consistency now moving on to the next transformation we have this CASE WHEN so what we have done here is data normalization or we sometimes call it data standardization so this transformation is a type of data
21:03
cleansing where we map coded values to meaningful user-friendly descriptions and we have done the same transformation for the gender as well another type of transformation that we have done in the same CASE WHEN is that we have handled the missing values so instead of
21:19
nulls we can have not available so handling missing data is as well type of data cleansing where we are filling the blanks by adding for example a default value so instead of having an empty string or a null we're going to have a default value like the not available or
21:35
unknown another type of data and Transformations that we have done in this script is we have removed the duplicates so removing duplicates is as well type of data cleansing where we ensure only one record for each primary key by identifying and retaining only
21:50
the most relevant row to ensure there are no duplicates inside our data and as we are removing the duplicates of course we are doing data filtering so those are the different types of data transformations that we have done in this
22:06
script all right moving on to the second table in the bronze layer from the CRM we have the product info and of course as usual before we start writing any Transformations we have to search for data quality issues and we start with the first one we have to check the primary key so we have to check whether
22:22
we have duplicates or nulls inside this key so what we have to do is group up the data by the primary key or check whether we have nulls so let's go and execute it so as you can see everything is safe we don't have duplicates or nulls in the primary key now moving on to the next one we have the product key
22:38
here we have in this column a lot of informations so now what you have to do is to go and split this string into two informations so we are deriving new two columns so now let's start with the first one is the category ID the first five characters they are actually the category ID and we can go and use the
22:55
substring function in order to extract part of a string it needs three arguments the first one going to be the column that we want to extract from and then we have to define the position where to extract and since the first part is on the left side we going to
23:10
start from the first position and then we have to specify the length so how many characters we want to extract we need five characters so 1 2 3 4 five so that's set for the category ID category ID let's go and execute it now as you can see we have a new column called the
23:27
category ID and it contains the first part of the string and in our database from the other source system we have as well the category ID now we can go and double check just in order to make sure that we can join data together so we're going to go and check the ID from the
23:43
bronze ERP table the one for the category so in this table we have the category ID and you can see over here those are the IDs of the categories and in the silver layer we have to go and join those two tables but here we still have an
23:58
issue we have here an underscore between the category and the subcategory but in our table we have actually a minus so we have to replace that with an underscore in order to have matching informations between those two tables otherwise we will not be able to join the tables so
24:14
we're going to use the function REPLACE and what are we replacing we are replacing the minus with an underscore something like this and if you go now and execute it we will get an underscore exactly like in the other table and of course we can go and check
24:31
whether everything is matching by having very simple query where we say this new information not in and then we have this nice subquery so we are trying to find any category ID that is not available in the second table so let's go and execute
24:48
it now as you can see we have only one category that is not matching we are not finding it in this table which is maybe correct so if you go over here you will not find this category I just make it a little bit bigger so we are not finding this one category from this table which
25:03
is fine so our check is okay okay so with that we have the first part now we have to go and extract the second part and we're going to do the same thing so we're going to use the substring and the three argument the product key but this time we will not start cutting from the first position we have to be in the
25:19
middle so 1 2 2 3 4 5 6 7 so we start from the position number seven and now we have to define the length how many characters to be extracted but if you look over here you can see that we have different length of the product keys it
25:35
is not fixed like the category ID so we cannot go and use a fixed number we have to make something dynamic and there is a trick in order to do that we can go and use the length of the whole column with that we make sure that we are always getting enough characters to be extracted and we will not be losing
25:51
any information so we will make it dynamic like this we will not have it as a fixed length and with that we have the product key so let's go and execute it as you can see we are now extracting the second part of the string.
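Both derived columns come out of the same SUBSTRING pattern, roughly like this (column names assumed):

    SELECT
        -- first 5 characters of prd_key, with '-' replaced by '_' so it matches the ERP category ids
        REPLACE(SUBSTRING(prd_key, 1, 5), '-', '_') AS cat_id,
        -- everything from position 7 onwards; LEN() keeps the extracted length dynamic
        SUBSTRING(prd_key, 7, LEN(prd_key))         AS prd_key
    FROM bronze.crm_prd_info;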
26:07
Now why do we need the product key we need it in order to join it with another table called sales details so let's go and check the sales details so let me just check the column name it is SLS product key so from bronze
26:24
CRM sales let's go and check the data over here and it looks wonderful so actually we can go and join those informations together but of course we can go and check that so we're going to say where and we're going to take our new column and we're going to say not in
26:40
the subquery just to make sure that we are not missing anything so let's go and execute so it looks like we have a lot of products that don't have any orders well I don't have a nice feelings about it let's go and try something like this
26:55
one here and we say where sls prd key is like this value over here so I'll just cut the last three characters just to search inside this table so we really don't have such keys let me just cut the second one so
27:12
let's go and search for it we don't have it as well so anything that starts with the FK we don't have any order with the product where it starts with the F key so let's go and remove it but still we are able to join the tables right so if I go and say in instead of not in so
27:30
with that you are able to match all those products so that means everything is fine actually it's just products that don't have any orders so with that I'm happy with this transformation now moving on to the next one we have here the name of the product we can go and
27:46
check whether there is unwanted spaces so let's go to our quality checks make sure to use the same table and we're going to use the product name and check whether we find any unmatching after trimming so let's go and do it well it looks really fine so we don't have to
28:03
trim anything this column is safe now moving on to the next one we have the costs so here we have numbers and we have to check the quality of the numbers so what we can do we can check whether we have nulls or negative numbers so negative costs or negative prices which
28:19
is not really realistic depend on the business of course so let's say in our business we don't have any negative costs so it's going to be like this let's go and check whether is something less than zero or whether we have costs that is null so let's go and check those
28:37
informations well as you can see we don't have any negative values but we have nulls so we can go and handle that by replacing the null with a zero of course if the business allow that so in SQL server in order to replace the null with a zero we have a very nice function
28:52
called ISNULL so we are saying if it is null then replace this value with a zero it is very simple like this and we give it a name of course so let's go and execute it and as you can see we don't have any more nulls we have zero which is better for the calculations if you are later doing any aggregate functions like the average.
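The cost check and the fix are as simple as this (a sketch with assumed names):

    -- Check for negative or missing costs
    -- Expectation: no results
    SELECT prd_cost
    FROM bronze.crm_prd_info
    WHERE prd_cost < 0 OR prd_cost IS NULL;

    -- Fix in the cleansing query: replace missing costs with 0
    SELECT ISNULL(prd_cost, 0) AS prd_cost
    FROM bronze.crm_prd_info;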
29:09
Now moving on to the next one we have the product line this is again an abbreviation of something and the cardinality is low so let's go and check all the possible values inside this
29:24
column so we're just going to use the distinct it's going to be prd line so let's go and execute it and as you can see the possible values are null M R S and T and again those are abbreviations but in our data warehouse we have decided to give full nice names so we have to go and
29:41
replace those codes those abbreviations with a friendly value and of course in order to get this information I usually go and ask the expert from the source system or an expert from the process so let's start building our CASE WHEN and then let's use the UPPER and as
29:58
well the trim just to make sure that we are having all the cases so the BRD line is equal to so let's start with the first value the M then we will get the friendly value it's going to be Mountain
30:14
then to the next one so I will just copy and paste here if it is an R then it is Road and another one for let me check what do we have here we have M R and then S the S stands for other sales and we
30:32
have the T so let's go and get the T so the T stands for touring we have at the end an else for unknown not available so we don't need any nulls so that's it and we're going to name it as before so product line so
30:48
let's remove the old one and let's execute it and as you can see we don't have here anymore those shortcuts and the abbreviations we have now full friendly value but I will go and have here like capital O it looks nicer so
31:03
that we have nice friendly value now by looking to this case when as you can see it is always like we are mapping one value to another value and we are repeating all time upper time upper time and so on we have here a quick form in the case when if it is just a simple mapping so the syntax is very simple we
31:19
say case and then we have the column so we are evaluating this value over here and then we just say when without the equal so if it is an M then make it Mountain the same thing for the next one and so so with that we have the
31:36
functions only once and we don't have to go and keep repeating the same function over and over and this works only if you are mapping values but if you have complex conditions you can do it like this but for now I'm going to stay with the quick form of the CASE WHEN it looks nicer and shorter so let's go and execute it we will get the same results.
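For comparison, the quick (simple) form of the CASE evaluates the expression once and then just maps values, something like this (the values are as discussed, the rest assumed):

    SELECT
        CASE UPPER(TRIM(prd_line))        -- the expression is written only once
            WHEN 'M' THEN 'Mountain'
            WHEN 'R' THEN 'Road'
            WHEN 'S' THEN 'Other Sales'
            WHEN 'T' THEN 'Touring'
            ELSE 'n/a'
        END AS prd_line
    FROM bronze.crm_prd_info;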
31:53
okay so now back to our table let's go to the last two columns we have the start and end date so it's like defining an interval we have start and end so let's go and check the quality of the start and end dates we're going to go and say select star from our bronze table and now we're
32:09
going to go and search it like this we are searching for the end date that is smaller than the start so prd start date so let's go and query this so you can see the start is always after the end which makes no sense at
32:26
all so we have here data issue with those two dates so now for this kind of data Transformations what I usually do is I go and grab few examples and put it in Excel and try to think about how I'm going to go and fix it so here I took like two products this one and this one over here and for that we have like
32:41
three rows for each one of them and we have this situation over here so the question now how we going to go and fix it I will go and make like a copy of one solution where we're going to say it's very simple let's go and switch the start date with the end date so if I go and grab the end dates and put it at the
32:58
starts things are going to look way nicer right so we have the start always before the end but my friends the data now makes no sense because we say it starts from 2007 and ends by 2011 the price was 12 but between 2008 and 2012
33:15
we have 14 which is not really good because if you take for example the year 2010 for 2010 it was 12 and at the same time 14 so it is really bad to have an overlapping between those two dates it should start from 2007 and end with 11
33:32
and then start from 2012 and end with something else there should be no overlapping between years so it's not enough to say the start should always be smaller than the end but as well the end of the first record should be earlier
33:47
than the start of the next record this is as well a rule in order to have no overlapping this one has no start but already has an end which is not really okay because we always have to have a start each new record in the historization
34:02
has to have a start so for this record over here this is as well wrong and of course it is okay to have a start without an end so in this scenario it's fine because this indicates the current information about the costs so again this solution is not working at
34:19
all so now for the solution what we can say is let's go and ignore the end date completely and take only the start dates so let's go and paste it over here but now we go and rebuild the end date completely from the start date following the rules that we have defined
34:34
so the rule says the end date of the current record comes from the start date of the next record so here this end date comes from this value over here from the next record so that means we take the next start date and put it at
34:49
the end date for the previous record so with that as you can see it is working the end date is higher than the start date and as well we are making sure this date is not overlapping with the next record but as well in order to make it way nicer we can subtract one from it
35:05
so we can take the previous day like this so with that we are making sure the end date is smaller than the next start now for the next record this one over here the end date is going to come from the next start date so we will take this one over here and put it as an end date and
35:22
subtract one so we will get the previous day so now if you compare those two you can see it's still higher than the start and if you compare it with the next record this one over here it is still smaller than the next one so there is no overlapping and now for the last record
35:39
since we don't have here any information it will be a null which is totally fine so as you can see I'm really happy with this scenario over here of course you can go and validate this with an expert from the source system let's say I've done that and they approved it and now I can go and clean up the data using this new logic so this
35:57
is how I usually brainstorm about fixing an issue if I have complex stuff I go and use Excel and then discuss it with the expert using this example it's way better than showing database queries and so on it just makes things easier to explain and as well to discuss
36:12
so now how I usually do it I usually go and make a focus on only the columns that I need and take only one two scenarios while I'm building the logic and once everything is ready I go and integrate it in the query so now I'm focusing only on these columns and only for these products so now let's go and
36:29
build our logic now in SQL if you are at a specific record and you want to access information from another record we have two amazing window functions for that we have LEAD and LAG in this scenario we want to access the next record that's why we have to go with
36:44
the function lead so let's go and build it lead and then what do we need we need the lead or the start date so we want the start date of the next records and then we say over and we have to partition the data so the window going to be focusing on only one
37:02
product which is the product key and not the product ID so we are dividing the data by product key and of course we have to go and sort the data so order by and we are sorting the data by the start dates and ascending so from the lowest to the highest and let's go and give it
37:18
another name so as let's say test for example just to test the data so let's go and execute and I think I missed something here it say Partition by so let's go and execute again and now let's go and check the results for the first partition over here so the start is 2011
37:35
and the end is 2012 and this information came from the next record so this data is moved to the previous record over here and the same thing for this record so the end date comes from the next record so our logic is working and the
37:50
last record over here is null because we are at the end of the window and there is no next data that's why we will get null and this is perfect of course so it looks really awesome but what is missing is we have to go and get the previous day and we can do that very simply using minus one we are just subtracting one
38:07
day so we have no overlapping between those two dates and the same thing for those two dates so as you can see we have just built a perfect end date which is way better than the original data that we got from the source system now let's take this one over here and put it inside our query so we don't need the
38:24
old end date we need our new end date we just remove that test alias and execute now it looks perfect all right now we are not done yet with those two dates they are still date and time but we don't have here any information about the time it is always zero so it makes no
38:41
sense to have these informations inside our data so what we can do we can do a very simple cast and we make this column as a date instead of date time so this is for the first one and as well for the next one as dates so let's try that out
38:57
and as you can see it is nicer we don't have the time information of course we can tell the source systems about all those issues but since they don't provide the time it makes no sense to keep date and time okay so it was a long run but we have now clean product information and this is way nicer than the original product information that we got from the source CRM.
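The rebuilt end date plus the date casting look roughly like this (a sketch with assumed names; it assumes the source start date column is a DATETIME, so subtracting 1 works, and DATEADD(DAY, -1, ...) would be the more explicit equivalent):

    SELECT
        prd_key,
        CAST(prd_start_dt AS DATE) AS prd_start_dt,
        CAST(
            LEAD(prd_start_dt) OVER (PARTITION BY prd_key ORDER BY prd_start_dt) - 1
            AS DATE
        ) AS prd_end_dt   -- one day before the next record starts; NULL for the current record
    FROM bronze.crm_prd_info;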
39:14
Now if you grab the DDL of the silver table you can see that we don't have a category ID so we have product ID and product key and as well for those two date columns we have to change the data type it is date time here but we have changed that to a date so that
39:31
means we have to go and do a few modifications to the ddl so what we're going to do we're going to go over here and add the category ID and I will be using the same data type and for the start and end this time it's going to be date and not date time so that's it for now let's go and execute it in order to
39:47
repair the ddl and this is what happen in the silver layer sometimes we have to adjust the metadata if the quality of the data types and so on is not good or we are building new derived informations in order later to integrate the data so it will be like very close to the bronze
40:02
layer but with few modifications so make sure to update your ddl scripts and now the next step is that we're going to go and insert the data into the table and now the next step we're going to go and insert the result of this query that is cleaning up the bronze table into the
40:18
silver table so as we' done it before insert into silver the product info and then we have to go and list all the columns I've just prepared those columns so with that we can go and now run our query in order to insert the data so now as you can see SQL did insert the data
40:35
and the very important step is now to check the quality of the silver table so we go back to our data quality checks and we go switch to the silver so let's check the primary key there is no issues and we can go and check for example here the the trims there is as well no issue
40:51
and now let's go and check the costs it should not be negative or null which is perfect let's go and check the data standardizations as you can see they are friendly and we don't have any nulls and now very interesting the order of the dates so let's go and check that as you
41:07
can see we don't have any issues and finally what I do is go and have a final look at the silver table and as we can see everything is inserted correctly in the correct columns so all those columns come from the source system and
41:22
the last one is automatically generated from the ddl indicate when we loaded this table now let's sit back and have a look to our script what are the different types of data Transformations that we have done here is for example over here the category ID and the product key we have derived new columns
41:38
so it is when we create a new column based on calculations or transformations of an existing one so sometimes we need columns only for analytics and we cannot each time go to the source system and ask them to create it so instead of that we derive our own columns that we need
41:54
for the analytics another transformation we have is the ISNULL over here so we are handling here missing information instead of null we're going to have a zero and one more transformation we have over here for the product line we have done here data normalization instead of having a code value we have a friendly
42:12
value and as well we have handled the missing data for example over here instead of having a null we're going to have not available all right moving on to another data transformation we have done data type casting so we are converting the data type from one to another and this considered as well to
42:27
be a data transformation and now moving on to the last one we are doing as well data type casting but what's more important we are doing data enrichment this type of transformation it's all about adding a value to your data so we are adding a new relevant data to our
42:42
data sets so those are the different types of data Transformations that we have done for this table okay so let's keep going we have the sales details and this is the last table in the CRM so what do you have over here we have the order number and this is a
42:59
string of course we can go and check whether we have an issue with the unwanted spaces so we can search whether we're going to find something so we can say trim and something like this and let's go and execute it so we can see that we don't have any unwanted spaces that means we don't have to transform
43:14
this column so we can leave it as it is now the next two columns they are like keys and IDs in order to connect it with the other tables as we learned before we are using the product key in order to connect it with the product informations and we are connecting the customer ID with the customer ID from
43:30
the customer info so that means we have to go and check whether everything is working perfectly so we can go and check the Integrity of those columns where we say the product key NOT IN and then we make a subquery and this time we can work with the silver layer right so we can say the product key from silver dot
43:48
product info so let's go and query this and as you can see we are not getting any issue that means all the product keys from the sales details can be used and connected with the product info the same thing we can go and check the Integrity of the customer ID and this time not against the products we go to the
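A small sketch of this kind of integrity check, with assumed table and column names:

-- product keys in the sales details that have no match in the silver product table
SELECT sls_prd_key
FROM bronze.crm_sales_details
WHERE sls_prd_key NOT IN (SELECT prd_key FROM silver.crm_prd_info);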
44:04
customer info and the name was CST ID so let's go and query that and the same thing we don't have here any issues so that means we can go and connect the sales with the customers using the customer ID and we don't have to do any Transformations for it so things looks
44:19
really nice for those three columns now we come to the challenging one we have here the dates now those dates are not actual dates they are integers so those are numbers and we don't want to have it like this we would like to clean that up we have to change the data type from integer to a date now if you want to
44:36
convert an integer to a date we have to be careful with the values that we have inside each of those columns so now let's check the quality for example of the order dates let's say where order dates is less than zero for example something negative well we don't have any negative values which is good let's
44:53
go and check whether we have any zeros well this is bad so we have here a lot of zeros now what we can do we can replace those informations with a null we can use of course the null IF function like this we can say null if and if it is zero then make it null so
45:08
let's execute it and as you can see now all those informations are null now let's go and check again the data so now this integer has the years information at the start then the months and then the day so here we have to have like 1 2 3 4 5 so the length of each number
45:24
should be eight and if the length is less than eight or higher than eight then we have an issue let's go and check that so we're going to say or length sales order is not equal to eight that means less or higher let's go and execute it now let's go and check the results over here and
45:41
those two informations they don't look like dates so we cannot go and make from these informations a real dates they are just bad data and of course you can go and check the boundaries of a DAT like for example it should not be higher than for example let's go and get this value
45:57
2050 and then I need for the month and the date so let's go and execute it and if we just remove those informations just to make sure so we don't have any date that is outside of the boundaries that you have in your business or you go for example and say the boundary should be not less than depend when your
46:13
business started maybe something like this we are getting of course those values because they are less than that boundary but if you have values around these dates you will get it as well in the query so we can go and add the rest so all those checks like validate the column that has date informations and it has the data
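A rough sketch of these validation checks for an integer date column (column names and boundary values are just examples):

-- zeros, wrong lengths and out-of-range values in an integer YYYYMMDD column
SELECT NULLIF(sls_order_dt, 0) AS sls_order_dt
FROM bronze.crm_sales_details
WHERE sls_order_dt <= 0
   OR LEN(sls_order_dt) != 8
   OR sls_order_dt > 20500101
   OR sls_order_dt < 19000101;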
46:30
type integer so again what are the issues over here we have zeros and sometimes we have like strange numbers that cannot be converted to dates so let's go and fix that in our query so we can say case when the sales order the order date is equal to zero or the length of the
46:47
order date is not equal to 8 then null right we don't want to deal with those values they are just wrong and they are not real dates otherwise we say else it's going to be the order dates now what we're going to do we're going to go and convert this to a date we don't want
47:02
this as an integer so how we can do that we can go and cast it first to varchar because we cannot cast from integer to date in SQL Server first you have to convert it to a varchar and then from varchar you go to a date well this is how we do it in SQL Server so we cast it
47:19
first to a varchar and then we cast it to a date like this that's it so we have an end and we are using the same column name so this is how we transform an integer to a date so let's go and query this and as you can see the order date
47:36
now is a real date it is not a number so we can go and get rid of the old column now we have to go and do the same stuff for the shipping dates so we can go over here and replace everything with the shipping date and let's go query well as you can see the shipping date is perfect
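A minimal sketch of the integer-to-date conversion just described, with an assumed column name:

SELECT
    CASE
        WHEN sls_order_dt = 0 OR LEN(sls_order_dt) != 8 THEN NULL
        ELSE CAST(CAST(sls_order_dt AS VARCHAR) AS DATE)  -- int -> varchar -> date
    END AS sls_order_dt
FROM bronze.crm_sales_details;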
47:52
we don't have any issue with this column but still I don't like that we found a lot of issues with the order dates so what we're going to do just in case this happens for the shipping date in the future I will go and apply the same rules to the shipping dates oh let's take the shipping date like this and if you don't want to
48:09
apply it now you have always to build like quality checks that runs every day in order to detect those issues and once you detect it then you can go and do the Transformations but for now I'm going to apply it right away so that is for the shipping date now we go to the due date and we will do the same
48:26
test let's go and execute it and as well it is perfect so still I'm going to apply the same rules so let's put the due date everywhere here in the query just make sure you don't miss anything here so let's go and execute now perfect as you
48:41
can see we have the order date shipping date and due date and all of them are date and don't have any wrong data inside those columns now still there is one more check that we can do and is that the order date should be always smaller than the shipping date or the due date because it's makes no sense
48:57
right if you are delivering an item without an order so first the order should happen then we are shipping the items so there is like an order of those dates and we can go and check that so we are checking now for invalid date orders where we going to say the order date is higher than the shipping date or we are
49:15
searching as well for an order where the order date date is higher than the due dates so we going to have it like this due dates so let's go and check well that's really good we don't have such a mistake on the data and the quality looks good so the order date is always
49:30
smaller than the shipping date or the due dates so we don't have to do any Transformations or cleanup okay friends now moving on to the last three columns we have the sales quantity and the price all those informations are connected to each others so we have a business rule or calculation it says the sales must be
49:48
equal to quantity multiplied by the price and all sales quantity and price informations must be positive numbers so it's not allowed to be negative zero or null so those are the business rules and we have to check the data consistency in our table does all those three
50:04
informations following our rules so we're going to start first with our rule right so we're going to say if the sales is not equal to quantity multiplied by the price so we are searching where the result is not matching our expectation
50:20
and as well we can go and check other stuff like the nulls so for example we can say or sales is null or quantity is null and the last one for the price and as well we can go and check whether they
50:35
are negative numbers or zero so we can go over here and say less or equal to zero and apply it for the other columns as well so with that we are checking the calculation and as well we are checking whether we have nulls zeros or negative numbers let's go and check our
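A sketch of this consistency check, assuming hypothetical column names:

SELECT DISTINCT sls_sales, sls_quantity, sls_price
FROM bronze.crm_sales_details
WHERE sls_sales != sls_quantity * sls_price   -- broken business rule
   OR sls_sales IS NULL OR sls_quantity IS NULL OR sls_price IS NULL
   OR sls_sales <= 0 OR sls_quantity <= 0 OR sls_price <= 0
ORDER BY sls_sales, sls_quantity, sls_price;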
50:51
informations I'm going to have here A distinct so let's go and query it and of course we have here bad data but we can go and sort the data by the sales quantity and the price so let's do it now by looking to the data we can see in
51:06
the sales we have nulls we have negative numbers and zeros so we have all bad combinations and as well we have here bad calculations so as you can see the price here is 50 the quantity is one but the sales is two which is not correct and here we have as well wrong calculations here we have to have a 10
51:23
and here nine or maybe the price is wrong and by looking to the quantity now you can see we don't have any nulls we don't have any zeros or negative numbers so the quantity looks better than the sales and if you look to the prices we have nulls we have negatives and yeah we
51:39
don't have zeros so that means the quality of the sales and the price is wrong the calculation is not working and we have these scenarios now of course how I do it here I don't go and try now to transform everything on my own I usually go and talk to an expert maybe someone from the business or from the
51:54
source system and I show those scenarios and discuss and usually there is like two answers either they are going to tell me you know what we will fix it in the source so until then I have to live with it there is incoming bad data and the bad data can be present in the warehouse until the source system cleans up those issues and
52:10
the other answer you might get you know what we don't have the budget and those data are really old and we are not going to do anything so here you have to decide either you leave it as it is or you say you know what let's go and improve the quality of the data but here you have to ask for the experts to support you solving these issues because
52:28
it really depends on their rules different rules make different Transformations so now let's say that we have the following rules if the sales informations are null or negative or zero then use the calculation the formula by multiplying the quantity with the price and now if the prices are
52:44
wrong for example we have here null or zero then go and calculate it from the sales and a quantity and if you have a price that is a minus like minus 21 a negative number then you have to go and convert it to a 21 so from a negative to a positive without any calculations so
53:01
those are the rules and now we're going to go and build the Transformations based on those rules so let's do it step by step I will go over here and we're going to start building the new sales so what is the rule Sals case when of course as usual if the sales is null or let's say the sales is
53:20
negative number or equal to zero or another scenario we have a sales information but it is not following the calculation so we have wrong information in the sales so we're going to say the sales is not equal to the quantity multiplied by the price but of course we
53:36
will not leave the price like this by using the function ABS the absolute value it's going to go and convert everything from negative to positive then what we have to do is to go and use the calculation so it's going to be the quantity multiplied by the price so that means we
53:53
are not using the value that come from the source system we are recalculating it now let's say the sales is correct and not one of those scenarios so we can say else we will go with the sales as it is that comes from the source because it is correct it's really nice let's go and
54:08
say an end and give it the same name I will go and rename the old one here as an old value and the same for the price the quantity we will not touch because it is correct so like this and now let's go and transform the prices so again as
54:24
usual we go with case wi so what are the scenarios the price is null or the price is less or equal to zero then what we're going to do we're going to do the calculation so it going to be the sales divided by the quantity the SLS quantity
54:42
but here we have to make sure that we are not dividing by zero currently we don't have any zeros in the quantity but you don't know future you might get a zero and the whole code going to break so what you have to do is to go and say if you get any zero replace it with a null so null if if it is zero then make
54:59
it null so that's it now if the price is not null and the price is not negative or equal to zero then everything is fine and that's why we're going to have now the else it's going to be the price as it is from The Source system so that's it we're going to say end as price so
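Putting the described rules together, a rough sketch could look like this (column names are assumptions):

SELECT
    CASE
        WHEN sls_sales IS NULL OR sls_sales <= 0
             OR sls_sales != sls_quantity * ABS(sls_price)
            THEN sls_quantity * ABS(sls_price)        -- recalculate bad sales values
        ELSE sls_sales
    END AS sls_sales,
    sls_quantity,
    CASE
        WHEN sls_price IS NULL OR sls_price <= 0
            THEN sls_sales / NULLIF(sls_quantity, 0)  -- derive price, avoid division by zero
        ELSE sls_price
    END AS sls_price
FROM bronze.crm_sales_details;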
55:15
I'm totally happy with that let's go and execute it and check of course so those are the old informations and those are the new transformed cleaned up informations so here previously we have a null but now we have two so two multiply with one we are getting two so the sales is here correct now moving on
55:31
to the next one we have in the sales 40 but the price is two so two multiplied with one we should get two so the new sales is correct it is two and not 40 now to the next one over here the old sales is zero but if you go and multiply the four with the quantity you will get
55:47
four so the sales here is not correct that's why in the new sales we have it correct as a four and let's go and get a minus so in this case we have a minus which is not correct so we are getting the price multiplied with one we should get here a nine and this sales here is correct now let's go and get a scenario
56:04
where the price is a null like this here so we don't have here price but we calculated from the sales and the quantity so we divided the 10 by two and we have five so the new price is better and the same thing for the minuses so we have here minus 21 and in the output we
56:20
have 21 which is correct so for now I don't see any scenario where the data is wrong so everything looks better than before and with that we have applied the business rules from the experts and we have cleaned up the data in the data warehouse and this is way better than
56:35
before because we are presenting now better data for analyzes and Reporting but it is challenging and you have exactly to understand the business so now what we're going to do we're going to go and copy those informations and integrate it in our query so instead of sales we're going to get our new
56:51
calculation and instead of the price we will get our correct calculation and here I'm missing the end let's go and run the whole thing again so with that we have as well now cleaned sales quantity and price and it is following
57:06
our business rules so with that we are done cleaning up the sales details The Next Step we're going to go and inserted to the sales details but we have to go and check again the ddl so now all what you have to do is to compare those results with the ddl so the first one is the order number it's fine the product
57:22
key the customer ID but here we have an issue all those informations now are date and not an integer so we have to go and change the data type and with that we have better data type than before then the sales quantity price it is correct let's go and drop the table and
57:38
create it from scratch again and don't forget to update your ddl script so that's it for this and we're going to go now and insert the results into our silver table sales details and we have to go and list now all the columns I have already prepared the list of all the
57:53
columns so make sure that you have the correct order of the columns so let's go now and insert the data and with that and with that we can see that the SQL did insert data to our sales details but now very important is to check the health of the silver table so what we
58:08
going to do instead here of bronze we're going to go and switch it to Silver so let's check over here so here always the order is smaller than the shipping and the due date which is really nice but now I'm very interested on the calculations so here we're going to switch it from bronze to Silver and I'm
58:25
going to go and get rid of all those calculations because we don't need it this and now let's see whether we have any issue well perfect our data is following the business rules we don't have any nulls negative values zeros now as usual the last step the final check
58:41
we will just have a final look to the table so we have the order number the product key the customer ID the three dates we have have the sales quantity and the price and of course we have our metadata column everything is perfect so now by looking to our code what are the
58:56
different types of data Transformations that we are doing so in those three columns we are doing the following so at the start we are handling invalid data and this is as well type of transformation and as well at the same time we are doing data type casting so we are changing it to more correct data
59:13
type and if you are looking to the sales over here then what we are doing over here is we are handling the missing data and as well the invalid data by deriving the column from already existing one and it is as well very similar for the price we are handling as well the invalid data
59:31
by deriving it from specific calculation over here so those are the different types of data Transformations that we have done in these scripts all right now let's keep moving to the next source system we have the customer AZ12 table so here we have
59:48
like only three columns and let's start with the ID first so here again we have the customers informations and if we go and check again our model you can see that we can connect this table with the CRM table customer info using the customer key so that means we have to go
00:03
and make sure that we can go and connect those two tables so let's go and check the other table we can go and check of course the silver layer so let's query it and we can query both of the tables now we can see there is here like extra characters that are not included in the
00:20
customer key from the CRM so let's go and search for example for this customer over here where C ID like so we are searching for customer has similar ID now as you can see we are finding this
00:35
customer but the issue is that we have those three characters NAS there is no specification or explanation why we have the NAS so actually what we have to do is to go and remove those informations we don't need it so let's again check the data so it looks like the old data has an NAS at the start
00:53
and then afterward we have new data without those three characters so we have to clean up those IDs in order to be able to connect it with other tables so we're going to do it like this we're going to start with the case wiin since we have like two scenarios in our data so if the C ID is like the three
01:10
characters NAS so if the ID starts with those three characters then we're going to go and apply a transformation function otherwise else it's going to stay like it is so that's it so now we have to go and build the transformation so we're
01:25
going to use substring and then we have to define the string it's going to be the C ID and then we have to define the position where it start cutting or extracting so we can say 1 2 3 and then four so we have to define the position number four and then we have to define
01:42
the string how many characters should be extracted I will make it dynamic so I will go with the LEN function I will not go and count how much so we're going to say the length of the C ID so it looks good if it's like NAS then go and extract from the C ID at the position
01:57
number four the rest of the characters so let's go and execute it and I'm missing here a comma again where we don't have any Nas at the start and if you scroll down you can see those as well are not affected so with that we
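As a small sketch of this transformation (the 'NAS' prefix, table and column names are assumptions):

SELECT
    CASE
        WHEN cid LIKE 'NAS%' THEN SUBSTRING(cid, 4, LEN(cid))  -- cut off the prefix
        ELSE cid
    END AS cid
FROM bronze.erp_cust_az12;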
02:13
have now a nice ID to be joined with the other table of course we can go and test it like this where and then we take the whole thing the whole transformation and say NOT IN we remove of course the alias name we don't need it and then we make a very simple subquery select distinct
02:30
CST key the customer key from the silver table can be silver CRM cost info so that's it let's go and check so as you can see it is working fine so we are not able to find any unmatching data between
02:46
the customer info from ERP and the CRM but of course that is after the transformation if you don't use the transformation so if I just remove it like this we will find a lot of unmatching data so this means our transformation is working perfectly and we can go and remove the original value so that's it for the
03:03
First Column okay now moving on to the next field we have the birthday of their customers so the first thing to do is to check the data type it is a date so it's fine it is not an integer or a string so we don't have to convert anything but still there is something to check with the birth dates so we can check whether
03:19
we have something out of range so for example we can go and check whether we have really old dates at the birth dates so let's take 1900 and let's say 24 and we can take the first date of the month so let's go and check that well it looks like that we have customers that are
03:36
older than 100 years well I don't know maybe this is correct but it sounds of course strange so it's a bit up to the business of course this is Creed and he is in charge of something that is correct say hi to the kids hi kids yay and then we can go and
03:53
check the other boundary where it is almost impossible to have a customer that the birthday is in the future so we can say birth date is higher than the current dates like this so let's go and query this information well it will not
04:09
work because we have to have like an or between them and now if we check the list over here we have dates that are invalid for the birth dates so all those dates they are all birthday in the future and this is totally unacceptable so this is an indicator for bad data
04:25
quality of course you can go and report it to the source system in order to correct it so here it's up to you what to do with those dates either leave it as it is as a bad data or we can go and clean that up by replacing all those dates with a null or maybe replacing only the one that is Extreme where it is
04:41
100% is incorrect so let's go and write the transformation for that as usual we're going to start with case when birth date is larger than the current date and time then null otherwise we can have an else where we have the birth date as
04:58
it is and then we have an end as birth date so let's go and execute it and with that we should not get any customer with the birthday in the future so that's it for the birth dates now let's move to the next one we have the gender now
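A minimal sketch of that birthdate rule, with assumed names:

SELECT
    CASE
        WHEN bdate > GETDATE() THEN NULL   -- birthdays in the future are invalid
        ELSE bdate
    END AS bdate
FROM bronze.erp_cust_az12;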
05:13
again the gender information is low cardinality so we have to go and check all the possible values inside this column so in order to check all the possible values we're going to use select distinct gen from our table so let's go and execute it and now the data doesn't look really good so we have here
05:30
a null we have an F we have here an empty string we have male female and again we have the m so this is not really good what we going to do we're going to go and clean up all those informations in order to have only three values male female and not available so
05:46
we're going to do it like this we're going to say case when and now we're going to go and trim the values just to make sure there is like no empty spaces and as well I'm going to go and use the upper function just to make sure that in the future if we get any lower cases and so on we are covering all the different
06:01
scenarios so case when this is an F or let's say female then make it as female and we can go and do the same thing for the male like this so if it is an M or a male
06:17
make sure it is capital letters because here we are using the upper then it is a male otherwise all other scenarios it should be not available so whether it is an empty string or nulls and so on so we have to have an end of course as gen so
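A sketch of this normalization (column name and the default label are illustrative):

SELECT
    CASE
        WHEN UPPER(TRIM(gen)) IN ('F', 'FEMALE') THEN 'Female'
        WHEN UPPER(TRIM(gen)) IN ('M', 'MALE')   THEN 'Male'
        ELSE 'n/a'   -- stands in for the "not available" default used in the project
    END AS gen
FROM bronze.erp_cust_az12;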
06:32
now let's go and test it and check whether we have covered everything so you can see the m is now male the empty is not available the f is female the empty string or maybe spaces here is not available female going to stay as it is and the same for the male so with that
06:47
we are covering all the scenarios and we are following our standards in the project so I'm going to go and cut this and put it in our original query over here so let's go and execute the whole thing and with that we have cleaned up all those three columns now the question
07:03
is did we change anything in the ddl well we didn't change anything we didn't introduce any new column or change any data type so that means the next step is we're going to go and insert it in the silver layer so as usual we're going to say here insert into silver Erp the
07:20
customer and then we're going to go and list all the column names so C ID birth dat and the gender all right so let's go and execute it and with that we can see it inserted all the data and of course the very important step as the next is to check that data quality so let's go
07:36
back to our query over here and change it from bronze to Silver so let's go and check the silver layer well of course we are getting those very old customers but we didn't change that we only change the birthday that is in the future and we don't see it here in the results so that
07:52
means everything is clean so for the next one let's go and check the different genders and as you can see we have only those three values and of course we can go and take a final look to our table so you can see the C ID here the birth date the gender and then we see our metadata column and
08:08
everything looks amazing so that's it what are the different types of data Transformations that we have done first with the ID what you have done we have handled inv valid values so we have removed this part where it is not needed and the same thing goes for the birth dates we have handled as well invalid
08:25
values and then for the last one for the gender we have done data normalizations by mapping the code to more friendly value and as well we have handled the missing values so those are the types that we have done in this
08:41
code okay moving on to the second table we have the location informations so we have Erp location a101 so now here the task is easy because we have only two columns and if you go and check the integration model we can find our table over here so we can go and connect it together with the
08:58
customer info from the other system using the CI ID with the customer key so those two informations must be matching in order to join the tables so that means we have to go and check the data so let's go and select the data CST key
09:13
from let's go and get the silver Data customer info so let's now if you go and check the result you can see over here that we have an issue with the CI ID there is like a minus between the characters and the numbers but the customer ID the customer number we don't
09:29
have anything that splits the characters with the numbers so if you go and join those two informations it will not be working so what we have to do we have to go and get rid of this minus because it is totally unnecessary so let's go and fix that it's going to be very simple so what we're going to do we're going to say C ID so we're going to go and search
09:46
for the minus and replace it with nothing it's very simple like this so let's go and query it again and with that things look very similar to each other and as well we can go and query it so we're going to say where our transformation is NOT IN then we can go
10:02
and use this as a subquery like this so let's go and execute it and as you can see we are not finding any unmatching data now so that means our transformation is working and with that we can go and connect those two tables together so if I take take the transformation away you can see that we
10:19
will find a lot of unmatching data so the transformation is okay we're going to stay with it and now let's speak about the countries now we have here multiple values and so on what I'm going to do this is low cardinality and we have to go and check all possible values inside this column so that means we are
10:36
checking whether the data is consistent so we can do it like this select distinct the country from our table I'm just going to go and copy it like this and as well I'm going to go and sort the data by the country so let's go and check the informations now
10:52
you can see we have a null we have an empty string which is really bad and then we have a full name of country and then we have as well an abbreviation of the countries well this is a mix this is not really good because sometimes we have the DE and sometimes we have Germany
11:08
and then we have the United Kingdom and then for the United States we have like three versions of the same information which is as well not really good so the quality of this column is not really good so let's go and work on the transformation as usual we're going to start with the case when if trim
11:24
country is equal to DE then we're going to transform it to Germany and the next one it's going to be about the USA so if trim country is in so now let's go and get those two values the US and the USA
11:41
so us and USA then it's going to be the United States so with that we have covered as well those three cases now we have to talk about the null and the empty string so we're going to say when trim country is equal to empty
11:59
string or country is null then it's going to be not available otherwise I would like to get the country as it is so trim country just to make sure that we don't have any leading or trailing spaces so that's it let's go and say
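Combined, a rough sketch of the location cleanup could look like this (table, column names and codes are assumptions):

SELECT
    REPLACE(cid, '-', '') AS cid,   -- remove the separator so it joins with the CRM key
    CASE
        WHEN TRIM(cntry) = 'DE'                THEN 'Germany'
        WHEN TRIM(cntry) IN ('US', 'USA')      THEN 'United States'
        WHEN TRIM(cntry) = '' OR cntry IS NULL THEN 'n/a'
        ELSE TRIM(cntry)
    END AS cntry
FROM bronze.erp_loc_a101;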
12:16
this is the country so it is working and the country information is transformed and now what I'm going to do I'm going to take the whole new transformation and compare it to the old one let me just call this as old country and let's go and query it so now we can check those
12:33
values they stayed as before so nothing did change the DE is now Germany the empty string is not available the null the same thing and the United Kingdom stayed like it was before and now we have one value for all those informations so
12:48
it's only the United States so it looks perfect and with that we have cleaned as well the second column so with that we have now clean results and now the question did we change anything in the ddl well we haven't changed anything both of them are varar so we can go now immediately and insert it into our table
13:06
so insert into silver customer location and here we have to specify the columns it's very simple the ID and the country so let's go and execute it and as you can see we got now inserted all those values of course as a next we go and double check those informations I would
13:23
just go and remove all those stuff as well here and instead of bronze let's go with the silver so as you can see all the values of the country looks good and let's have a final look to the table so like this so we have the IDS without the
13:38
separator we have the countries and as well our metadata information so with that we have cleaned up the data for the location okay so now what are the different types of data transformation that we have done here is first we have handled invalid values so we have removed the minus with an empty string
13:54
and for the country we have done data normalization so we have replaced codes with friendly values and as well at the same time we have handled missing values by replacing the empty string and null with not available and one more thing of course we have removed the unwanted
14:11
spaces so those are the different types of transformation that we have done for this table okay guys now keep the energy up keep the spirit up we have to go and clean up the last table in the bronze layer and
14:26
of course we cannot go and Skip anything we have to check the quality and to detect all the errors so now we have a table about the categories for the products and here we have like four columns let's go and start with the first one the ID as you can see in our integration model we can connect this
14:42
table together with the product info from the CRM using the product key and as as you remember in the silver layer we have created an extra column for that in the product info so if you go and select those data you can see we have a column called category ID and this one
14:57
is exactly matching the ID that we have in this table and we have done the testing so this ID is ready to be used together with the other table so there is nothing to do over here and now for the next columns they are string and of course we can go and check whether there
15:13
are any unwanted spaces so we are checking for The Unwanted spaces is so let's go and check select star from and we're going to go and get the same table like this here and first we are checking the category so the category is not equal to the category after trimming The
15:30
Unwanted spaces so let's go and execute it and as you can see we don't have any results so there are no unwanted spaces let's go and check the other column for example the subcategory the next one so let's get the subcategory and the under
15:45
query as well we don't have anything so that means we don't have unwanted spaces for the subcategory let's go now and check the last column so I will just copy and paste now let's get the maintenance and let's go and execute and as well no results perfect we don't have
16:01
any unwanted spaces inside this table so now the next step is that we're going to go and check the data standardizations because all those columns has low cardinality so what we're going to do we're going to say select this thing let's get the cat category from our table I'll just copy
16:19
and paste it and check all values so as you can see we have the accessories bikes clothing and components everything looks perfect we don't have to change anything in this column let's go and check the subcategory and if you scroll down all values are friendly and nice as
16:35
well nothing to change here and let's go and check the last column the maintenance perfect we have only two values yes and no we don't have any nulls so my friends that means this table has really nice data quality and we don't have to clean up anything but
16:50
still we have to follow our process we have to go and load it from the bronze to the silver even if we didn't transform anything so our job is really easy here we're going to go and say insert into silver dots Erp PX and so on and we're going to go and Define The
17:07
Columns so it's going to be the ID the category sub category maintenance so that's it let's go and insert the data now as usual what we're going to do we're going to go and check the data so silver Erp PX let's have a look all
17:23
right so we can see the IDS are here the categories the subcategories the maintenance and we have our meta column so everything is inserted correctly all right so now I have all those queries and the insert statements for all six tables and now what is important before
17:40
inserting any data we have to make sure that we are truncating and emptying the table because if you run this query twice what's going to happen you will be inserting duplicates so first truncate the data and then do a full load insert all data so we're going to have one step
17:56
before it's like the bronze layer we're going to say truncate table and then we will be truncating the silver customer info and only after that we have to go and insert the data and of course we can go and give this nice information at the start so first we are truncating the table and then inserting so if I go and
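The full-load pattern described here, as a minimal sketch with placeholder names:

-- empty the target first so a re-run does not create duplicates
TRUNCATE TABLE silver.crm_cust_info;

-- then do the full load of the cleaned data
INSERT INTO silver.crm_cust_info (cst_id, cst_key)
SELECT cst_id, cst_key   -- plus the remaining cleaned columns
FROM bronze.crm_cust_info;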
18:13
run the whole thing so let's go and do it it will be working so if I can run it again we will not have any duplicates so we have to go and add this tip before each insert so let's go and do that all right so I'm done with all tables so now let's go and run everything so let's go
18:30
and execute it and we can see in the messaging everything working perfectly so with that we made all tables empty and then we inserted the data so perfect with that we have a nice script that loads the silver layer but
18:46
of course like the bronze layer we're going to put everything in one stored procedure so let's go and do that we'll go to the beginning over here and say create or alter procedure and we're going to put it in the schema silver and using the naming convention load silver
19:02
and we're going to go over here and say begin and take the whole code it is a long one and give it one push with a tab and then at the end we're going to say end perfect so we have our stored procedure but we forgot here the AS with that we will not have any error let's go and
19:18
execute it so the stored procedure is created if you go to the programmability you will find two procedures load bronze and load silver so now let's go and try it out all what you have to do is now only to execute the silver load silver so let's execute the stored
19:35
procedure and with that we will get the same results this stored procedure now is responsible for loading the whole silver layer now of course the messaging here is not really good because as we have learned in the bronze layer we can go and add many things like handling the
19:51
error doing nice messaging catching the duration time so now your task is to pause the video take this stored procedure and go and transform it to be very similar to the bronze layer with the same messaging and all the add-ons that we have added so pause the video now I
20:07
will do it as well offline and I will see you soon okay so I hope you are done and I can show you the results it's like the bronze layer we have defined at the start a few variables in order to catch the
20:23
duration so we have the start time the end time batch start time and batch end time and then we are printing a lot of stuff in order to have like nice messaging in the output so at the start we are saying loading the silver layer and then we start splitting by The Source system so loading the CRM tables
20:40
and I'm going to show you only one table for now so we are setting the timer so we are saying start time get the dat date and time informations to it then we are doing the usual we are truncating the table and then we are inserting the new informations after cleaning it up and we have this nice message where we
20:56
say load duration where we are finding the differences between the start time and the end time using the function DATEDIFF and we want to show the result in seconds so we are just printing how long it took to load this table and we're going to go and repeat this
21:12
process for all the tables and of course we are putting everything in TRY and CATCH so SQL is going to go and try to execute the TRY part and if there are any issues SQL is going to go and execute the CATCH and here we are just printing a few informations like the error
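A condensed sketch of such a stored procedure, with illustrative names and only one table shown:

CREATE OR ALTER PROCEDURE silver.load_silver AS
BEGIN
    DECLARE @start_time DATETIME, @end_time DATETIME;
    BEGIN TRY
        PRINT 'Loading the silver layer';
        SET @start_time = GETDATE();
        TRUNCATE TABLE silver.crm_cust_info;
        INSERT INTO silver.crm_cust_info (cst_id, cst_key)
        SELECT cst_id, cst_key FROM bronze.crm_cust_info;   -- cleaned columns go here
        SET @end_time = GETDATE();
        PRINT 'Load duration: ' + CAST(DATEDIFF(SECOND, @start_time, @end_time) AS VARCHAR) + ' seconds';
        -- repeat the same block for the remaining tables
    END TRY
    BEGIN CATCH
        PRINT 'Error message: ' + ERROR_MESSAGE();
        PRINT 'Error number: '  + CAST(ERROR_NUMBER() AS VARCHAR);
    END CATCH
END;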
21:27
message the error number and the error state and we are following exactly the same standard as the bronze layer so let's go and execute the whole thing and with that we have updated the definition of the stored procedure let's go now and execute it so execute silver dot load
21:44
silver so let's go and do that it went very fast like less than 1 second again because we are working on a local machine loading the silver layer loading the CRM tables and we can see this nice messaging so it starts with truncating the table inserting the data and we are
22:00
getting the load duration for this table and you will see that everything is below 1 second and that's because we are working locally in a real project you will get of course more than 1 second so at the end we have the load duration of the whole silver layer and now I have one more thing for you let's
22:15
say that you are changing the design of this stored procedure for the silver layer you are adding different types of messaging or maybe creating logs and so on so now all those new ideas and redesigns that you are doing for the silver layer you have always to think about bringing the same changes as well
22:32
in the other stored procedure for the bronze layer so always try to keep your code following the same standards don't have like one idea in one stored procedure and an old idea in another one always try to maintain those scripts and to keep them all up to date following the
22:48
same standards otherwise it can be really hard for other developers to understand the code I know that needs a lot of work and commitment but this is your job to make everything follow the best practices and the same naming convention and standards that you put for your projects so guys
23:04
now we have two very nice ETL scripts one that loads the bronze layer and another one for the silver layer so now loading our data warehouse is very simple all what you have to do is to run first the bronze layer and with that we are taking all the data from the CSV files from the
23:20
source and we put it inside our data warehouse in the bronze layer and with that we are refreshing the whole bronze layer once it's done the next step is to run the stored procedure of the silver layer so once you execute it you are taking now all the data from the bronze
23:35
layer transforming it cleaning it up and then loading it to the silver layer and as you can see the concept is very simple we are just moving the data from one layer to another layer with different tasks all right guys so as you can see in the silver layer we have done a lot of data Transformations and we have
23:52
covered all the types that we have in the data cleansing so we remove duplicates data filtering handling missing data invalid data unwanted spaces casting the data types and so on and as well we have derived new columns we have done data enrichment and we have
24:07
normalized a lot of data so now of course what we have not done yet business rules and logic data aggregations and data integration this is for the next layer all right my friends so finally we are done cleaning up the data and checking the quality of our data so we can go and close those
24:24
two steps and now to the next step we have to go and extend the data flow diagram so let's go okay so now let's go and extend our data flow for the silver layer so what I'm going to do I'm just going to go and
24:40
copy the whole thing and put it side by side to the bronze layer and let's call it silver layer and the table names going to stay as before because we have like one to one like the bronze layer but what we're going to do we're going to go and change the coloring so I'm going to go and Mark
24:55
everything and make it gray like silver and of course what is very important is to make the lineage so I'm going to go now from the bronze and take an arrow and put it to the silver table and now with that we have like a lineage between three layers and if you are checking this
25:11
table the customer info you can understand aha this comes from the bronze layer from the customer info and as well this comes from the source system CRM so now you can see the lineage between different layers and without looking to any scripts and so on
25:26
in one picture you can understand the whole projects so I don't have to explain a lot of stuff by just looking to this picture you can understand how the data is Flowing between sources bronze layer silver layer and to the gold layer of course later so as you can see it looks really nice and clean all
25:44
right so with that we have updated the data flow next we're going to go and commit our work in the git repo so let's go okay so now let's go and commit our scripts we're going to go to the folder scripts and here we have a silver layer
26:00
if you don't have it of course you can go and create it so first we're going to go and put the ddl scripts for the silver layer so let's go and I will paste the code over here and as usual we have this comment at the header explaining the purpose of this script so let's go and commit our work and
26:17
we're going to do the same thing for the stored procedure that loads the silver layer so I'm going to go over here I have already a file for that so let's go and paste that so we have here our stored procedure and as usual at the start we have a header comment as well so this script is doing the ETL process where we load the
26:34
data from bronze into silver so the action is to truncate the table first and then insert transformed cleansed data from bronze to silver there are no parameters at all and this is how you can use the stored procedure okay so we're going to go and commit our work
26:50
and now one more thing that we want to commit in our project all those queries that you have built to check the quality of the silver layer so this time we will not put it in the scripts we're going to go to the tests and here we're going to go and make a new file called quality checks silver and inside it we're going
27:06
to go and paste all the queries that we have filled I just here reorganize them by the tables so here we can see all the checks that we have done during the course and at the header we have here nice comments so here we are just saying that this script is going to check the
27:21
quality of the silver layer and we are checking for nulls duplicates unwanted spaces invalid date ranges and so on so each time you come up with a new quality check I'm going to recommend you to share it with the project and with the other teams in order to make it part of the multiple checks that you do after
27:38
running the ETLs so that's it I'm going to go and put those checks in our repo and in case I come up with a new check I'm going to go and update it perfect so now we have our code in our repository all right so with that our code is safe and we are done with the whole epic so we
27:55
have build the silver layer now let's go and minimize it and now we come to my favorite layer the gold layer so we're going to go and build it the first step as usual we have to analyze and this time we're going to explore the business objects so let's
28:12
go all right so now we come to the big question how we going to build the gold layer as usual we start with analyzing so now what we're going to do here is to explore and understand what are the main business objects that are hidden inside our source system so as you can see we have two sources six files and here we
28:28
have to identify what are the business objects once we have this understanding then we can start coding and here the main transformation that we are doing is data integration and here usually I split it into three steps the first one we're going to go and build those business objects that we have identified
28:43
and after we have a business object we have to look at it and decide what is the type of this table is it a dimension is it a fact or is it like maybe a flat table so what type of table that we have built and the last step is of course we have now to rename all the columns into
28:59
something friendly and easy to understand so that our consumers don't struggle with technical names so once we have all those steps what we're going to do it's time to validate what we have created so what we have to do the new data model that we have created it should be connectable and we have to check that the data integration is done
29:15
correctly and once everything is fine we cannot skip the last step we have to document and as well commit our work in the git and here we will be introducing new type of documentations so we're going to have a diagram about the data model we're going to build a data dictionary where we going to describe
29:31
the data model and of course we can extend the data flow diagram so this is our process those are the main steps that we will do in order to build the gold layer okay so what is exactly data modeling usually usually the source
29:46
system is going to deliver for you raw data unorganized messy not very useful in its current state but now data modeling is the process of taking this raw data and then organizing it and structuring it in a meaningful way so what we are doing we are putting the data in
30:03
a new friendly and easy to understand objects like customers orders products each one of them is focused on specific information and what is very important is we're going to describe the relationship between those objects so by connecting them using lines so what you
30:19
have built on the right side we call it logical data model if you compare to the left side you can see the data model makes it really easy to understand our data and the relationship the processes behind them now in data modeling we have three different stages or let's say three different ways on how to draw a
30:34
data model the first stage is the conceptual data model here the focus is only on the entity so we have customers orders products and we don't go in details at all so we don't specify any columns or attributes inside those boxes we just want to focus what are the
30:50
entities that we have and as well the relationship between them so the conceptual data model don't focus at all on the details it just gives the big picture so the second data model that we can build is The Logical data model and here we start specifying what are the
31:05
different columns that we can find in each entity like we have the customer ID the first name last name and so on and we still draw the relationship between those entities and as well we make it clear which columns are the primary key and so on so as you can see we have here more details but one thing we don't
31:20
describe a lot of details for each column and we are not worry how exactly we going to store those tables in the database the third and last stage we have the physical data model this is where everything gets ready before creating it in the database so here you have to add all the technical details
31:37
like adding for each column the data types and the length of each data type and many other database techniques and details so again if if you look to the conceptual data model it gives us the big picture and in The Logical data model we dive into details of what data we need and the physical layer model
31:54
prepares everything for the implementation in the database and to be honest in my projects I only draw the conceptual and The Logical data model because drawing and building the physical data model needs a lot of efforts and time and there are many tools like in data bricks they
32:10
automatically generate those models so in this project what we're going to do we're going to draw The Logical data model for the gold layer all right so now for analytics and specially for data warehousing and business intelligence we need a special
32:26
data model that is optimized for reporting and analytics and it should be flexible scalable and as well easy to understand and for that we have two special data models the first type of data model we have the star schema it has a central fact table in the middle and surrounded by Dimensions the fact
32:43
table contains transactions events and the dimensions contains descriptive informations and the relationship between the fact table in the middle and the dimensions around it forms like a star shape and that's why we call it star schema and we have another data model called snowflake schema it looks
33:00
very similar to the star schema so we have again the fact in the middle and surrounded by Dimensions but the big difference is that we break the dimensions into smaller subdimensions and the shape of this data model as you are extending the dimensions it's going to look like a snowflake so now if you
33:17
compare them side by side you can see that the star schema looks easier right so it is usually easy to understand easy to query it is really perfect for analyzes but it has one issue with that the dimension might contain duplicates and your Dimensions get bigger with the time now if you compare to the snowflake
33:34
you can see the schema is more complex you so you need a lot of knowledge and efforts in order to query something from the snowflake but the main advantage here comes with the normalization as you are breaking those redundancies in small tables you can optimize the storage but to be honest who care about the storage
33:51
so for this project I have chosen to use the star schema because it is very commonly used perfect for reporting like for example if you're using Power BI and we don't have to worry about the storage so that's why we're going to adopt this model to build our gold
34:08
layer okay so now one more thing about those data models is that they contain two types of tables fact and dimensions so when I I say this is a fact table or a dimension table well the dimension contains descriptive informations or like categories that gives some context
34:23
to your data for example a product info you have product name category subcategories and so on this is like a table that is describing the product and this we call it Dimension but in the other hand we have facts they are events like transactions they contain three
34:39
important informations first you have multiple IDs from multiple dimensions then we have like the informations like when the transaction or the event did happen and the third type of information you're going to have like measures and numbers so if you see those three types
34:54
of data in one table then this is a fact so if you have a table that answers how much or how many then this is a fact but if you have a table that answers who what where then this is a dimension table so this is what dimension and fact
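To make the fact and dimension idea concrete, here is a tiny made-up example, not the project's actual gold objects:

-- who / what / where -> dimension
CREATE TABLE gold.dim_customers (
    customer_key INT,            -- surrogate key
    first_name   NVARCHAR(50),
    country      NVARCHAR(50)
);

-- IDs + dates + measures -> fact
CREATE TABLE gold.fact_sales (
    order_number NVARCHAR(50),
    customer_key INT,            -- link to dim_customers (who)
    order_date   DATE,           -- when it happened
    sales_amount INT             -- how much
);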
35:13
tables all right my friends so so far in the bronze layer and in the silver layer we didn't discuss anything about the business so the bronze and silver were very technical we are focusing on data ingestion we are focusing on cleaning up the data quality of the data but
35:28
still the tables are very oriented to the source system now comes the fun part in the gold layer where we're going to go and break the whole data model of the sources so we're going to create something completely new to our business that is easy to consume for business reporting and analysis and here it is
35:45
very very important to have a clear understanding of the business and the processes and if you don't know it already at this phase you have really to invest time by meeting maybe process experts the domain experts in order to have clear understanding what we are talking about in the data so now what
36:01
we're going to do we're going to try to detect what are the business objects that are hidden in the source systems so now let's go and explore that all right now in order to build a new data model I have to understand first the original data model what are the main business objects that we have how things are
36:17
related to each other. This is a very important step in building a new model. What I usually do is start giving labels to all those tables. So if you go to the shapes panel over here, search for a label, and under more icons I'm going to take this label, drag and drop it, and
36:34
then maybe increase the font size, so let's go with 20 and bold, just to make it a little bit bigger. Now, looking at this data model, we can see that we have product information in the CRM as well as in the ERP, and then we have customer information and a transactional
36:51
table. So let's focus on the product first. The product information is over here: we have the current and historical product information, and here we have the categories that belong to the products. So in our data model we have something called products, so let's create this label; it's going to be
37:07
products. Let's give it a color in the style, for example red, and move this label beneath this table over here, so that I have a label saying this table belongs to the business object
37:23
called products. Now I'm going to do the same thing for the other table over here and tag it with products as well, so that I can easily see which tables from the sources hold information about the product business object. All right, moving on,
37:38
we have here a table called customer information, so we have a lot of information about the customer, and we also have customer information in the ERP, where we have the birthday and the country. Those three tables have to do with the object customer, so we're going to label them accordingly. Let's call it customer, and I'm going
37:55
to pick a different color for that, let's go with green. I will tag this table like this, and the same for the other tables, so copy and tag the second table and the third table. Now it is very easy for me to see which table
38:11
belongs to which business object. Now we have the final table over here, and it's only one table about the sales and orders; in the ERP we don't have any information about that, so this one is going to be easy. Let's call it sales, move it over here, and
38:27
maybe change its color, for example to this one. This step is very important when building any data model in the gold layer: it gives you the big picture of the things that you are going to model. The next step is to go and build those objects step by step, so let's start with the first object,
38:44
our customers. Here we have three tables, and we're going to start with the CRM, so let's start with this table over here. All right, with that we know what our business objects are and this task is done. In the next step we're going to go back to SQL and start doing data integration and building
39:00
a completely new data model, so let's go and do that. Now let's have a quick look at the gold layer specifications. This is the final stage: we are going to provide data to be consumed by reporting and analytics, and this time we will not be
39:16
building tables, we will be using views. That means we will not have a stored procedure or any load process for the gold layer; all we are doing is data transformation, and the focus of that transformation is data integration, aggregation, business
39:31
logic, and so on. This time we're also going to introduce a new data model: we will be building a star schema. Those are the specifications for the gold layer, and this is our scope. This time we make sure that we are selecting data from the silver layer, not from the bronze, because the bronze
39:48
has bad data quality, while in the silver layer everything is prepared and cleaned up. So in order to build the gold layer we are going to target the silver layer. Let's start with SELECT * FROM, and we're going to go to the silver CRM customer info table, hit execute, and now
40:04
we're going to select the columns that we need to present in the gold layer. Let's start picking the columns that we want: we have the ID, the key, the first name, and so on.
40:19
I will not take the metadata columns; those only belong to the silver layer. Perfect. The next step is that I'm going to give this table an alias, so let's call it ci, and I'm going to make sure we are selecting from this alias, because later we're going to join this table with other tables,
40:36
so something like this. We're going to go with those columns. Now let's move to the second table: let's get the birthday information. We're going to jump to the other system, and we have to join the data by the cid together with the customer key. So now we have to join the data with another table,
40:52
and here I try to avoid using an INNER JOIN, because if the other table doesn't have information for all the customers, I might lose customers. So always start with the master table, and if you join it with any other table in order to get extra information, always try to
41:08
avoid the INNER JOIN, because the other source might not have all the customers, and with an INNER JOIN you would lose them. So I tend to start from the master table, and everything else is a LEFT JOIN. I'm going to say LEFT JOIN silver ERP customer az12, so
41:24
let's give it the alias ca, and now we have to join the tables: from the first table it's going to be the customer key, equal to the cid from ca. Now of course we're going to get matching data because we checked the silver layer, but if we hadn't prepared
41:41
the data in the silver layer, we would have to do a preparation step here in order to join the tables. But we don't have to do that, because that prep was done in the silver layer. Now you can see the systematic approach we have with bronze, silver, and gold. So after joining the tables, we have to go and pick the
41:56
information that we need from the second table, which is the birth date, bdate, and from this table there is another nice piece of information, the gender. That's all we need from the second table, so let's go and
42:11
check the third table. The third table is about the location information, the countries, and again we connect the tables by the cid and the key. So let's do that: we're going to say LEFT JOIN silver ERP location as well, and
42:26
I'm going to give it the alias la, and then we have to join on the keys, the same thing: the ci customer key equal to the la cid. Again, we prepared those IDs and keys in the silver layer, so the join should be working. Now we have to pick the
42:43
data from this table. What do we have over here? We have the ID, the country, and the metadata information, so let's just get the country. Perfect. With that we have joined all three tables and picked all the columns we want for this object.
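Put together, the customer join could look roughly like this; I'm writing the silver table and column names as I'm assuming them from this project (crm_cust_info, erp_cust_az12, erp_loc_a101 and so on), so treat it as a sketch and adjust it to your own schema.

SELECT
    ci.cst_id,
    ci.cst_key,
    ci.cst_firstname,
    ci.cst_lastname,
    ci.cst_marital_status,
    ci.cst_gndr,
    ci.cst_create_date,
    ca.bdate,            -- birth date from the ERP
    ca.gen,              -- gender from the ERP
    la.cntry             -- country from the ERP location table
FROM silver.crm_cust_info AS ci          -- master table: never lose customers
LEFT JOIN silver.erp_cust_az12 AS ca
    ON ci.cst_key = ca.cid
LEFT JOIN silver.erp_loc_a101 AS la
    ON ci.cst_key = la.cid;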
43:00
So again, looking at the model, we have joined this table with this one and this one, and with that we have collected all the customer information that we have from the two source systems. Okay, now let's query it to make sure everything is correct, and in order to understand whether your joins are correct, keep an eye on
43:17
those three columns that came from the joins: if you see that you are getting data, that means you are doing the joins correctly, but if you see a lot of NULLs or no data at all, that means your joins are incorrect. For me it looks like it is working. Another check that I do is
43:34
this: even if your first table has no duplicates, what could happen is that after doing multiple joins you start getting duplicates, because the relationship between those tables is not a clear one-to-one; you might have a one-to-many or even a many-to-many relationship. So the check that I
43:50
usually do at this stage is to make sure that the result has no duplicates, that we don't have multiple rows for the same customer. In order to do that, we do a quick GROUP BY: we group the data by the customer ID and then count
44:07
the rows, using the join as a subquery. So this is the whole subquery, and after it we say GROUP BY the customer ID and then HAVING COUNT greater than one. This query
44:25
actually tries to find out whether we have any duplicates in the primary key. Let's execute it: we don't have any duplicates, and that means joining all those tables with the customer info didn't cause any issues and didn't duplicate my data.
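As a sketch, the duplicate check wraps the join in a subquery and groups by the primary key (same assumed names as above); if it returns any rows, the joins have multiplied some customers.

SELECT cst_id, COUNT(*) AS duplicate_count
FROM (
    SELECT ci.cst_id
    FROM silver.crm_cust_info AS ci
    LEFT JOIN silver.erp_cust_az12 AS ca ON ci.cst_key = ca.cid
    LEFT JOIN silver.erp_loc_a101  AS la ON ci.cst_key = la.cid
) AS joined
GROUP BY cst_id
HAVING COUNT(*) > 1;   -- expect zero rows back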
44:42
This is a very important check to make sure you are on the right track. All right, so everything is fine regarding the duplicates, we don't have to worry about them. But we do have an integration issue here. Let's execute the query again: if you look at the data, we have two sources for the gender information,
44:58
one coming from the CRM and another coming from the ERP. So the question is, what are we going to do with this? Well, we have to do data integration, and let me show you how I do it. First I open a new query, then I remove everything else and
45:14
leave only those two columns, using DISTINCT just to focus on the integration. Let's execute it, and maybe add an ORDER BY as well, so let's do 1 and 2 and execute again. Now here we have all the scenarios, and we can see that sometimes
45:30
there is a match: from the first table we have female and from the other table we also have female. But sometimes we have an issue, like these two tables giving different information, and the same thing over here, so this is also an issue: different information. Another scenario is where, from the first
45:45
table, like here, we have female, but in the other table the value is not available. Well, this is not a problem, we can just take it from the first table. But we also have the exact opposite scenario, where in the first table the data is not available but it is available in the second table. And now you might
46:02
wonder why I'm getting a NULL over here; we handled all the missing data in the silver layer and replaced everything with 'not available', so why are we still getting a NULL? This NULL doesn't come directly from the tables, it comes from joining the tables. So
46:17
that means there are customers in the CRM table that are not available in the ERP table, and if there is no match, we get a NULL from SQL. This NULL means there was no match, and that's why we are
46:32
getting it; it is not coming from the content of the tables. This is of course an issue, but the big issue is what can happen in those two scenarios where we have data from both sides but it is different. Here again we have to ask the experts: what is the master system here, is it the CRM or the ERP? And
46:50
let's say their answer is that the master for the customer information is the CRM. That means the CRM information is more accurate than the ERP information, and this is only about the customers, of course. So for the scenario where we have female and
47:06
male, then the correct information is female, from the first source system. The same goes over here, and where we have male and female, the correct one is male, because this source system is the master. Okay, so now let's build this business rule. We're going to start, as usual, with a CASE WHEN.
47:23
The first and most important rule is: if we have gender information from the CRM system, from the master, then use it. So we're going to check the gender information from the CRM table: the customer gender is not equal to 'not available', which means we
47:40
have a value, male or female. Let me just put a comma here, like this. Then what happens? Use it: we're going to use the value from the master, since the CRM is the master for the gender info. Now,
47:55
otherwise, meaning it is not available in the CRM table, go and grab the information from the second table, so we're going to say ca gender. But now we have to be careful: this NULL over here has to be converted to 'not
48:11
available' as well, so we're going to use COALESCE: if this is a NULL, use 'not available' instead, like this. That's it, let's add an END, let me just move this over here, and let's give this new column a temporary name for now. Let's execute it.
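The integration rule could be sketched like this, assuming cst_gndr is the CRM gender, gen is the ERP gender, the alias new_gen is just a working name, and 'n/a' is whatever placeholder you standardized on in the silver layer:

SELECT DISTINCT
    ci.cst_gndr,
    ca.gen,
    CASE
        WHEN ci.cst_gndr != 'n/a' THEN ci.cst_gndr   -- CRM is the master for gender info
        ELSE COALESCE(ca.gen, 'n/a')                 -- ERP value; NULL means no match, so fall back
    END AS new_gen
FROM silver.crm_cust_info AS ci
LEFT JOIN silver.erp_cust_az12 AS ca
    ON ci.cst_key = ca.cid
ORDER BY 1, 2;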
48:28
Now let's check the different scenarios. For all those values over here we have data from the CRM system, and that is reflected in the new column. But for the second part we don't have data from the first system, so we are trying to get it from the second
48:44
system. For the first one it is not available, and when we try to get it from the second source system we are activating the ELSE; well, it is NULL, so the COALESCE kicks in and we replace the NULL with 'not available'. For the second scenario, again the first system doesn't have the
49:02
gender information, so we grab it from the second one, and with that we get female. The third one is the same: we don't have the information, but we get male from the second source system. And the last one is not available in both source systems, which is why we get 'not
49:17
available'. So as you can see, we have a perfect new column where we are integrating two different source systems into one, and this is exactly what we call data integration. This piece of information is way better than the source CRM and the source ERP alone: it
49:34
is richer and has more information, and this is exactly why we try to get data from different source systems, in order to get rich information into the data warehouse. So we have a nice logic, and as you can see it is much easier to build the logic in a separate query first
49:49
and then take it to the original query. So what I'm going to do is simply copy everything from here, go back to our query, delete the two gender columns, and put our new logic in their place, with a comma,
50:05
and execute. With that we have our nice new column. So now we have a very nice object: we don't have duplicates and we have integrated the data together, taking three tables and putting them into one object. The next step is to give the columns nice, friendly names. The rule in the gold
50:22
layer is to use friendly names and not to follow the names that we get from the source systems, and we have to make sure that we follow the naming conventions, so we are using snake_case. Let's do it step by step. The first one we'll call customer ID, and the
50:39
next one, getting rid of the word key, I'm going to call customer number, because those are customer numbers. The next one we're going to call first name, without any prefixes, and the next one last name, and we have here marital
50:58
status, so I will keep the exact name but without the prefix. This one we're just going to call gender, this one create date, this one birthdate, and the last one will be country. So let's execute
51:16
it. Now, as you can see, the names are really friendly: we have customer ID, customer number, first name, last name, marital status, gender, and they are easy to understand. The next step is to think about the order of those columns. The first two make sense
51:32
together, first name and last name, and then I think the country is a very important piece of information, so I'm going to take it and put it right after the last name, it's just nicer. Let's execute again: first name, last name, country. It's always nice to group related columns
51:48
together, right? Then we have the marital status, the gender, and so on, and then the create date and the birthdate. I think I'm going to swap the birthdate with the create date, since it is more important, like this, and not forget a comma. Execute again, and it looks wonderful. Now comes a
52:06
very important decision about this object: is it a fact table or a dimension? Well, as we learned, dimensions hold descriptive information about an object, and as you can see we have here descriptions of the customers. All those columns are describing the
52:22
customer, and we don't have transactions and events, and we don't have measures and so on, so we cannot say this object is a fact; it is clearly a dimension. That's why we're going to call this object the customer dimension. Now there is one more
52:37
thing: if you are creating a new dimension, you always need a primary key for it. Of course we could rely on the primary key that we get from the source system, but sometimes you have dimensions where you don't have a primary key that you can count on. So
52:53
what we have to do is generate a new primary key in the data warehouse, and those primary keys are called surrogate keys. A surrogate key is a system-generated unique identifier assigned to each record to make it unique. It is not a business key;
53:10
it has no meaning and no one in the business knows about it. We only use it to connect our data model, and this way we have more control over how we connect the model and we don't have to depend on the source systems. There are different ways to
53:25
generate surrogate keys, like defining them in the DDL or using the window function ROW_NUMBER. In this data warehouse I'm going with the simple solution of using the window function. So in order to generate a surrogate key for this dimension, what we're going to do is
53:42
very simple: we say ROW_NUMBER() OVER, and since we have to order by something, you can order by the create date, the customer ID, or the customer number, whatever you want; in this example I'm going to order by the
53:58
customer ID. We also have to follow the naming convention: all surrogate keys end with key as a suffix. Now let's query the data, and as you can see, at the start we have a customer key which is a simple sequence; of course we don't have
54:13
any duplicates here. This surrogate key is generated in the data warehouse, and we are going to use it to connect the data model. With that our query is ready, and the last step is to create the object. As we decided, all the objects
54:28
in the gold layer are going to be virtual, which means we're going to create a view. So we say CREATE VIEW gold.dim, following the naming convention where dim stands for dimension, then customers, and after that the AS. With that
54:45
everything is ready, so let's execute it. It was successful. Let's go to the views now, and you can see our first object: the dimension customers in the gold layer.
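For orientation, the finished view could look roughly like this. It is a sketch using the same assumed silver names as before; the placeholder value 'n/a' and the exact friendly column names are mine, so align them with your own conventions.

CREATE VIEW gold.dim_customers AS
SELECT
    ROW_NUMBER() OVER (ORDER BY ci.cst_id) AS customer_key,   -- surrogate key
    ci.cst_id                              AS customer_id,
    ci.cst_key                             AS customer_number,
    ci.cst_firstname                       AS first_name,
    ci.cst_lastname                        AS last_name,
    la.cntry                               AS country,
    ci.cst_marital_status                  AS marital_status,
    CASE
        WHEN ci.cst_gndr != 'n/a' THEN ci.cst_gndr            -- CRM is the master for gender
        ELSE COALESCE(ca.gen, 'n/a')
    END                                    AS gender,
    ca.bdate                               AS birthdate,
    ci.cst_create_date                     AS create_date
FROM silver.crm_cust_info AS ci
LEFT JOIN silver.erp_cust_az12 AS ca ON ci.cst_key = ca.cid
LEFT JOIN silver.erp_loc_a101  AS la ON ci.cst_key = la.cid;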
55:03
Now, as you know me, the next step is to check the quality of this new object, so let's open a new query: SELECT * FROM our view, the dim customers. We have to make sure everything is in the right position, like this, and now we can do different checks like uniqueness and so on, but I'm mostly worried about the gender information,
55:19
so let's do a DISTINCT of all its values. As you can see it is working perfectly: we have only female, male, and not available. So that's it, with that we have our first new dimension.
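A quick validation of the new view could be sketched like this:

-- Expect only the standardized gender values (e.g. female, male,
-- and whatever not-available placeholder you chose in the silver layer).
SELECT DISTINCT gender
FROM gold.dim_customers;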
55:36
Okay friends, so now let's go and build the second object: the products. As you can see, product information is available in both source systems. As usual, we're going to start with the CRM information and then join it with the other table in order to get the category information. Those are the columns
55:52
that we want from this table. Now we come to a big decision about this object: it contains historical information as well as the current information. Of course it depends on the requirements whether you have to analyze the historical information, but if you don't have such a requirement, we can stay with
56:09
only the current product information, so we don't have to include all the history in the object. And anyway, as we learned from the model over here, we are not using the primary key, we are using the product key. So what we have to do is filter out the historical data and keep only
56:26
the current data. We're going to add a WHERE condition, and in order to select the current data we're going to target the end date: if the end date is NULL, that means it is current data. Take this example over here: you can see we have three records for the same
56:42
product key, and for the first two records there is a value in the end date, because they are historical, but the last record has a NULL, and that's because this is the current information; it is still open and not closed yet. So in
56:58
order to select only the current information it is very simple: we say product end date IS NULL. If you execute it now, you will get only the current products, without any history, and of course we can add a comment to it: filter out all
57:15
historical data. This also means we don't need the end date in our selection, because it is always NULL. So with that we have only the current data.
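As a sketch, with the product columns named as I'm assuming them here (prd_id, prd_key, prd_end_dt and so on), the current-only selection is just:

SELECT
    prd_id,
    cat_id,
    prd_key,
    prd_nm,
    prd_cost,
    prd_line,
    prd_start_dt
FROM silver.crm_prd_info
WHERE prd_end_dt IS NULL;   -- filter out all historical data; only current products remain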
57:31
Now the next step is to join it with the product categories from the ERP, using the category ID. As usual, the master information is the CRM and everything else is secondary; that's why I use the LEFT JOIN, just to make sure I'm not losing or filtering out any data, because if there is no match we would lose data. So let's join
57:48
the silver ERP category table and call it pc. We're going to join using the key: from the CRM alias pn we have the category ID, equal to the pc ID. Now we have to pick columns from the
58:04
second table: from pc we take the category, very important, the subcategory, and the maintenance. Something like this. Let's query it, and with that we have all those
58:19
columns coming from the first table and these three coming from the second. So with that we have collected all the product information from the two source systems. The next step is to check the quality of these results, and of course what is very important is to check the uniqueness. So
58:37
what we're going to do is run the following query. I want to make sure that the product key is unique, because we're going to use it later to join the table with the sales. So FROM the join as a subquery, then GROUP BY
58:53
product key, and HAVING COUNT greater than one. Let's check: perfect, we don't have any duplicates. The second table didn't cause any duplicates in our join, and this also means we don't have historical data:
59:09
each product is only one record and there are no duplicates, so I'm really happy about that. Let's query it again. Now, of course, the next question is: do we have anything to integrate, do we have the same information twice? Well, we don't. The next
59:25
step is to group the related information together. So I'm going to put the product ID, the product key, and the product name together, all three of them, and after that we can put all the category information together:
59:41
the category ID, the category itself, and the subcategory. Let me just query and see the results: we have the product ID, key, and name, then the category ID, the category, and the subcategory, and maybe we also put the maintenance after the
59:58
subcategory, like this. And I think the product cost, the line, and the start date can stay at the end. Let me just check: those four pieces of information about the category, and then the cost, the line, and the start date. I'm really happy with that. The next step is to give friendly
00:14
names to those columns. Let's start with the first one: this is the product ID. The next one is going to be the product number, since we need the word key for the surrogate key later. Then we have the product name, and after that the category
00:31
ID and the category, and this is the subcategory. The next one is going to stay as it is, I don't have to rename it. The next one is going to be the cost, then the line, and the last one will be the start
00:47
date. Let's execute it. Now we can see all those friendly column names very nicely in the output, and it looks much nicer than before; I don't even have to describe the columns, the names describe themselves. Perfect. Now the next big decision is: what do we have
01:04
here, a fact or a dimension? What do you think? Well, as you can see, here again we have a lot of descriptions of the products, so all this information is describing the business object products. We don't have transactions, events, or lots of different
01:19
keys and IDs, so we don't really have a fact here, we have a dimension: each row describes exactly one object, one product, and that's why this is a dimension. Okay, since this is a dimension we have to create a primary key for it, or rather a
01:36
surrogate key, and as we did for the customers we're going to use the window function ROW_NUMBER in order to generate it: OVER, and then we have to sort the data. I will go with the start date, so let's order by the start date as well as the product key, and we're
01:53
going to give it the name product key, like this. Let's execute it: with that we have now generated a primary key for each product, and we're going to use it to connect our data model. All right, the next step
02:08
is to build the view. So we say CREATE VIEW gold dot dimension products, then the AS, and let's create our object. Now if you refresh the views you will see our second object, the second
02:25
dimension: in the gold layer we have the dimension products. As usual, we're going to have a look at this view just to make sure everything is fine, so dim products, let's execute it, and looking at the data everything looks nice.
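Put together, the products view could look roughly like this; again a sketch with assumed names, where I'm writing the ERP category table as silver.erp_px_cat_g1v2, so use whatever name you gave it in your own silver layer.

CREATE VIEW gold.dim_products AS
SELECT
    ROW_NUMBER() OVER (ORDER BY pn.prd_start_dt, pn.prd_key) AS product_key,  -- surrogate key
    pn.prd_id       AS product_id,
    pn.prd_key      AS product_number,
    pn.prd_nm       AS product_name,
    pn.cat_id       AS category_id,
    pc.cat          AS category,
    pc.subcat       AS subcategory,
    pc.maintenance  AS maintenance,
    pn.prd_cost     AS cost,
    pn.prd_line     AS product_line,
    pn.prd_start_dt AS start_date
FROM silver.crm_prd_info AS pn
LEFT JOIN silver.erp_px_cat_g1v2 AS pc
    ON pn.cat_id = pc.id
WHERE pn.prd_end_dt IS NULL;   -- current products only, no history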
02:41
So with that we now have two dimensions. All right friends, we have covered a lot of stuff: we have covered the customers and the products, and we are left with only one table, where we have the transactions, the sales.
02:57
For the sales information we only have data from the CRM, we don't have anything from the ERP, so let's go and build it. Okay, I have all this information, and of course since we have only one table we don't have to do any integration. Now we have to answer the big question: do we have here a dimension or a fact? Well, by
03:14
looking at those details we can see transactions, we can see events, we have a lot of date information, we have a lot of measures and metrics, and we also have a lot of IDs connecting multiple dimensions, and this is exactly the perfect setup for a fact.
03:31
So we're going to use this information as a fact, and of course, as we learned, a fact connects multiple dimensions, so we have to present in this fact the surrogate keys that come from the dimensions. Those two columns, the product key and the customer ID,
03:47
come from the source system, and as we learned, we want to connect our data model using the surrogate keys. So what we're going to do is replace those two columns with the surrogate keys that we have generated, and in order to do that we have to join the two
04:02
dimensions to get the surrogate keys. We call this process a data lookup: we are joining tables only to fetch one piece of information. So let's do that. We will of course go with the LEFT JOIN, so as not to lose any transactions. First we're going to
04:18
join on the product key. Now of course in the silver layer we don't have any surrogate keys, we have them in the gold layer, so that means for the fact table we will be joining the silver layer together with the gold layer: gold dot, and then the dimension
04:34
products, and I'm just going to call it pr. We're going to join sd using the product key together with the product number from the dimension, and the only information that we need from the dimension is the key, the surrogate key. So
04:51
we're going to go over here and say product key, and I'm going to remove this column from here because we don't need it: we don't need the original product key from the source system, we need the surrogate key that we generated ourselves in this data warehouse. The same thing
05:07
happens for the customer: gold dimension customers. Again we are doing a lookup here in order to get the information into sd, so we are joining using this ID over here, equal to
05:23
the customer ID, because this is a customer ID. And we do the same thing: we need the surrogate key, the customer key, and we delete the ID because we don't need it anymore now that we have the surrogate key. Let's execute it, and with that we
05:40
have in our fact table the two keys from the dimensions, and this lets us connect the data model, connecting the fact with the dimensions. This is a very necessary step in building the fact table: you have to put the surrogate keys from the dimensions into the fact. So that
05:55
was actually the hardest part of building the fact. The next step is simply to give friendly names, so we go over here and say order number; the surrogate keys are already friendly; and then this is the order
06:10
date, the next one is the shipping date, then the due date, and for the sales I'm going to say sales amount, then the quantity, and the final one is the price.
06:28
Now let's execute it and look at the results. As you can see, the columns look very friendly. And about the order of the columns, we use the following scheme: first in the fact table we have all the surrogate keys from the dimensions, second we have all the dates, and at the end you
06:45
group all the measures and metrics. So that's it for the fact query, and now we can go and build it: we say CREATE VIEW gold, and this time we use the fact underscore prefix and we're going
07:01
to call it sales, and don't forget the AS. That's it, let's create it. Perfect, now we can see the fact view, and with that we have three objects in the gold layer: two dimensions and one fact.
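The sales fact view could be sketched like this, assuming the silver sales columns are named roughly as shown (sls_ord_num, sls_prd_key, sls_cust_id and so on), so adjust them to your own schema.

CREATE VIEW gold.fact_sales AS
SELECT
    sd.sls_ord_num  AS order_number,
    pr.product_key,                       -- surrogate key looked up from gold.dim_products
    cu.customer_key,                      -- surrogate key looked up from gold.dim_customers
    sd.sls_order_dt AS order_date,
    sd.sls_ship_dt  AS shipping_date,
    sd.sls_due_dt   AS due_date,
    sd.sls_sales    AS sales_amount,
    sd.sls_quantity AS quantity,
    sd.sls_price    AS price
FROM silver.crm_sales_details AS sd
LEFT JOIN gold.dim_products  AS pr ON sd.sls_prd_key = pr.product_number
LEFT JOIN gold.dim_customers AS cu ON sd.sls_cust_id = cu.customer_id;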
07:18
Now of course the next step is to check the quality of the view, so let's do a simple SELECT from fact sales and execute it. Checking the result, you can see it is exactly like the result from the query and everything looks nice. Okay, now one more trick that I usually do
07:33
after building a fact is to try to connect the whole data model in order to find any issues. So let's do that: we will do a simple LEFT JOIN with the dimensions, gold dimension customers c, and we will join on the
07:50
keys, and then we say WHERE customer key IS NULL, meaning there is no match. Let's execute this, and as you can see in the results we are not getting anything back, which means everything is matching perfectly.
08:05
We can do the same thing with the products: LEFT JOIN gold dim products p ON the product key, connecting it with the fact's product key, and then we check the product key
08:22
from the dimension, like this. So we are checking whether we can connect the fact together with the dimension products. Let's check, and as you can see we are again not getting anything back, and that is all right. So with that we now have SQL code that is tested and that creates the gold layer.
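As a sketch, the referential-integrity check for the whole model could look like this; it should return no rows if every fact row finds its dimensions.

SELECT f.*
FROM gold.fact_sales AS f
LEFT JOIN gold.dim_customers AS c ON f.customer_key = c.customer_key
LEFT JOIN gold.dim_products  AS p ON f.product_key  = p.product_key
WHERE c.customer_key IS NULL    -- fact rows with no matching customer
   OR p.product_key  IS NULL;   -- fact rows with no matching product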
08:38
Now, in the next step, as you know, our requirements say we have to create clear documentation for the end users so they can use our data model. So let's go and draw the data model of the star schema. Let's draw our data
08:54
model: search for a table shape, and I'm going to take this one where I can mark what is the primary key and what is the foreign key. I'm going to change the design a little bit so it's rounded, change it to this color, and maybe go to
09:11
the size and make it 16, then select all the columns and make them 16 as well, just to increase the size, and then go to arrange and increase the width. Now let's zoom in a little bit for the first
09:26
table. Let's call it gold dimension customers and make it a little bit bigger, like this. Now we're going to define the primary key here, which is the customer key, and then we're going to list all the columns in the dimension; it's a little bit annoying, but the result is
09:42
going to be awesome. So what do we have: the customer ID, the customer number, and then the first name. In case you want new rows, you can hold Ctrl and Enter and add the other columns. So now pause the video, go and
09:59
create the two dimensions, the customers and the products, and add all the columns that you have built in the view. Welcome back. So now I have those two dimensions, and the third one is going to be the fact table. For the fact
10:16
table I'm going to go with a different color, for example blue, and put it in the middle, something like this. So we're going to say gold fact sales, and for this one we don't have a primary key, so we delete it, and I have to add all the columns of the fact: order
10:33
number, product key, customer key, and so on. All right, perfect. Now we can add the foreign key information: the product key is a foreign key to the products, so we say FK1, and the customer key is the foreign key to the
10:48
customers, so FK2, and of course you can increase the spacing for that. Okay, now that we have the tables, the next step in data modeling is to describe the relationships between them. This is very important for reporting and analytics, in order to understand how to use
11:05
the data model. We have different types of relationships, like one-to-one and one-to-many, and in a star schema the relationship between a dimension and the fact is one-to-many. That's because in the customers table we have, for a specific customer, only one record describing the
11:20
customer, but in the fact table the customer might appear in multiple records, because customers can order multiple times. That's why on the fact side it is many and on the dimension side it is one. Now, in order to see all those relationship arrows, we go to the menu on the left side, and as you can see
11:37
we have here entity relations, with different types of arrows: for example zero-to-many, one-to-many, one-to-one, and many other types of relations. So which one are we going to take? We're going to pick this one; it says one, mandatory, which means the customer
11:53
must exist in the dimension table, to many, but optional. Here we have three scenarios: the customer didn't order anything, the customer ordered only once, or the customer ordered many things, and that's why on the fact table side it is optional. So we're going to take
12:08
this one and place it over here: we connect the one part to the customer dimension and the many part to the fact; well, actually we have to do it on the customers side. With that we are describing the relationship between the dimension and the fact as one-to-many: one
12:25
is mandatory on the customer dimension and many is optional on the fact. We have the same story for the products: the many part goes to the fact and the one goes to the products, so it's going to look like this. Each time you are connecting a new dimension to the fact
12:41
table, it is usually a one-to-many relationship. You can also add anything you want to this model, for example a text note explaining something, like if you have complicated calculations. For example we can write over here sales
12:58
calculation, make it a little bit smaller, let's go with 18, and write the formula: sales equals quantity multiplied by price, and make this a little bit bigger. It is really nice info that we can
13:15
add to the data model, and we can even link it to the column: take this arrow, for example, and link it to the column, and with that you also have a nice explanation of the business rule or the calculation. You can add any descriptions you want to the
13:32
data model, just to make it clear for anyone who is using it. So with that you don't just have three tables in the database, you also have some documentation and explanation: at one glance you can see how the data model is built and how you can connect the tables together. It is
13:49
really amazing for all users of your data model. All right, with that we have a really nice data model, and in the next step we're going to quickly create a data catalog. All right, great, so we have a data model and we can say we have
14:05
something called a data product, and we will be sharing this data product with different types of users. And there is something that every data product absolutely needs, and that is the data catalog. It is a document that describes everything about your data model: the columns, the tables, and maybe the
14:23
relationships between the tables as well. With that you make your data product clear for everyone, and it will be much easier for them to derive insights and reports from it. And what is most important, it saves time, because if you don't do it, each
14:39
consumer, each user of your data product will keep asking you the same questions: what do you mean by this column, what is this table, how do I connect table A with table B, and you will keep repeating yourself and explaining things. Instead of that, you prepare a data catalog and a data model, and
14:55
you deliver everything together to the users, and with that you save a lot of time and stress. I know it is annoying to create a data catalog, but it is an investment and a best practice, so now let's go and create one. Okay, in order to do that I have created a new file called data catalog in the folder
15:11
documents, and what we're going to do here is very straightforward: we make a section for each table in the gold layer. For example, we have here the table dimension customers; what you have to do first is describe this table, so we are saying it stores customer details with demographic and geographic data. You
15:27
give a short description for the table, and then you list all the columns inside this table, maybe with the data types as well, but what is really important is the description for each column, so you give a very short description, like for example here, the gender of the customer.
15:43
Now, one of the best practices when describing a column is to give examples, because you can quickly understand the purpose of a column just by seeing an example, right? So here we see that inside it we can find male, female, and not available, and with that the consumer of your table can immediately understand,
15:58
ah, it will not be an M or an F, it's going to be a full, friendly value, without having to query the content of the table; they can quickly understand the purpose of the column. So with that we have a full description of all the columns of our dimension. We do the same thing for the
16:13
products, so again a description for the table and a description for each column, and the same thing for the fact. That's it: with that you have a data catalog for your data product in the gold layer, and the business user or the data analyst has a better and clearer understanding of the
16:30
content of your gold layer. All right my friends, that's all for the data catalog. In the next step we're going to go back to draw.io, where we're going to finalize the data flow diagram, so let's go. Okay, so now we're going to extend
16:46
our data flow diagram, but this time for the gold layer. Let's copy the whole thing from the silver layer and put it over here, side by side, and of course change the coloring to gold. Now we're going to rename things, so this is the
17:02
gold layer, but of course we cannot leave those tables as they are, since we have a completely new data model. What do we have? We have the fact sales, the dimension customers, and the dimension products. So what I'm
17:18
going to do is remove all those tables; we have only three, and let's put those three tables somewhere here in the center. Now what you have to do is start connecting things. I'm going to go with this arrow over here, a direct connection, and start connecting. So
17:34
the sales details go to the fact table, maybe put the fact table over here, and then we have the dimension customers: this comes from the CRM customer info, and we have two tables from the ERP, it comes from this table as well as the location
17:49
from the ERP. The same goes for the products: it comes from the product info and from the categories from the ERP. Now, as you can see, we have crossing arrows, so we can select everything and choose line jumps with a gap, which makes
18:06
the arrows a little easier to tell apart. So now, for example, if someone asks you where the data for the dimension products comes from, you can open this diagram and tell them: okay, this comes from the silver layer, we have two tables, the product info from the CRM and the categories from
18:23
the ERP, and those silver tables come from the bronze layer, where you can see the product info comes from the CRM and the category comes from the ERP. It is very simple: we have just created a full data lineage for our data warehouse, from the sources into the different layers of
18:38
our data warehouse. Data lineage is really amazing documentation that is going to help not only your users but also the developers. All right, so with that we have a very nice data flow diagram and a data lineage. We have completed the data flow; it really feels like progress, like an achievement, as we are
18:54
ticking off all those tasks. Now we come to the last task in building the data warehouse, where we're going to commit our work to the git repo. Okay, let's put our scripts in the project. We go to
19:09
the scripts folder over here; we have bronze and silver, but we don't have gold, so let's create a new file: gold/ and then ddl_gold.sql. Now we paste our views, so we have here our three views, and as usual at the
19:24
start we describe the purpose of the script: we are saying create gold views, this script creates the views for the gold layer, and the gold layer represents the final dimension and fact tables in the star schema; each view performs transformations and combines data from the silver layer to produce
19:40
business-ready data sets, and those views can be used for analytics and reporting. So that's it, let's commit it. Okay, with that, as you can see, we have the bronze and the silver, so we have all our ETL scripts in the repository, and now, as
19:56
well, for the gold layer we're going to add all those quality checks that we used to validate the dimensions and facts. So we go to the tests folder over here and create a new file; it's going to be quality checks gold, and the file type is SQL. Now let's
20:12
paste our quality checks, so we have the check for the fact, the two dimensions, and also an explanation about the script: we are validating the integrity and the accuracy of the gold layer, checking the uniqueness of the surrogate keys and whether we are able to connect the data
20:27
model. Let's put that in our git as well and commit the changes, and in case we come up with new quality checks we will add them to this script. Those checks are really important if you are modifying the ETLs, or if you want to make sure that these scripts run after each ETL load and so on;
20:43
it is like a quality gate to make sure that everything is fine in the gold layer. Perfect, so now we have our code in our repository. Okay friends, now what you have to do is finalize the git repo. For example, all the documentation that we have created during the project we can upload
21:01
them to the docs folder, so for example you can see here the data architecture, the data flow, data integration, data model, and so on. With that, each time you edit those pages you can commit your work and you have a versioned history of it. Another thing you can do is go to the readme; for example, over here I
21:17
have added the project overview, some important links, the data architecture, and a little description of the architecture. And of course don't forget to add a few words about yourself and links to your profiles on the different social media platforms. All right my friends, with that we have completed
21:32
our work and closed the last epic, building the gold layer, and with that we have completed all the phases of building a data warehouse. Everything is 100% done, and this feels really nice. All right my friends, if you're still here and you have built the data
21:49
warehouse with me, then I can say I'm really proud of you. You have built something really complex and amazing, because building a data warehouse is usually a very complex data project, and with that you have not only learned SQL, you have also learned how we do complex
22:04
data projects in the real world. So with that you have real knowledge and an amazing portfolio that you can share with others, whether you are applying for a job or showcasing that you have learned something new, and you have experienced the different roles in the project, what the data architects and the
22:19
data engineers do in complex data projects. That was really an amazing journey, even for me as I was creating this project. With that you have done the first type of data analytics project using SQL, the data warehousing. In the next step we're going to do another type of project, the
22:35
exploratory data analysis, EDA, where we're going to understand and explore our data sets. If you like this video and you want me to create more content like this, I would really appreciate it if you support the channel by subscribing, liking, sharing, and commenting; all of that helps the
22:51
channel with the YouTube algorithm and helps my content reach others. So thank you so much for watching, and I will see you in the next tutorial. Bye.