00:00
Good morning everyone, how's it going? Welcome back to the channel and welcome to this new video. In today's video I want to show you how to chat with a PDF and take into account the images, the tables, the plots, and everything else that can be in your
00:15
document for the generation of the response. It's going to look something like this: in essence, you're going to be querying your pipeline. Here I have a sample with the Attention Is All You Need paper
00:33
from Google, and I queried "What do the authors mean by attention?" As you can see, the retrieved part of the document was this section right here on attention, and we also have the images that
00:50
were retrieved. All of this is sent to the language model, including the images, so that the language model can give us a response based on it. Everything in here is going
01:05
to go into the language model as context, and it's going to give us an answer. In order to do this, we're going to explain the whole process with this diagram you see right here. We're going to be using unstructured for parsing our document into images, tables, and text, and we're of
01:23
course going to be using a language model that accepts multimodal input, that is to say a language model with vision. In this case we're going to be using GPT-4o mini to interpret the images, the tables, and the
01:38
text. And as you can see, we're going to be creating this multi-vector store using LangChain, which is very convenient and actually easier than you may think. We're going to be going through the entire process in this notebook, and
01:56
the notebook, of course, is available in the description. If you have any questions, don't hesitate to ask me. And if you don't know what RAG, retrieval augmented generation, is, you probably want to learn how to do regular RAG
02:12
without the images, only with text, first; you should probably watch that video, which is also in the description of this video, and then take a look at multimodal RAG. So there you go, without any further ado, let's get right
02:27
into the video. All right, so let's explain a little bit about the
02:42
method we're going to be introducing here. As I mentioned before, this is not the only method, and in this particular video I'm only going to explain the process; we're going to cover the code for how to do this in the next lesson. For now let's
02:58
just cover the process that's going on in this pipeline, which is represented very well in this diagram right here.
03:13
We're going to be using this library called unstructured. unstructured is an open source library that allows you to extract structured data from your unstructured documents. In other words, it
03:29
allows you to take your unstructured and semi-structured data, which can come from PDF files, HTML files like your websites, a CSV file, an Excel file, basically pretty much any
03:46
file format you can think of, as long as it is unstructured or semi-structured, and it will split it into different components. So you will have your PDF document, you will pass it through the
04:02
unstructured library that we're going to be using, and we're going to get one array with all of the images from your entire document, another array with all of the tables from your entire document, and
04:17
another array with all of the text of your entire document. This is going to be very convenient, because it will allow us to treat those types of elements differently: to embed them and to load them into our database differently depending on what they
04:34
actually are, what they need in order to be loaded, and what transformations we need to do in order to use them in our RAG pipeline. So that's the first step: the extraction.
04:50
Once you have extracted all of this, which is probably the most difficult part and the most magical one, because unstructured is just so amazing for this, we're going to be using a language
05:06
model to summarize the elements. So we're going to attach a summary to every single element that we have right here. What I did in this particular example is I used a regular language model for the tables and the text. Of course the text is already a
05:23
regular text representation, and for the tables, unstructured actually allows you to extract the HTML representation of a table that is within your PDF. So what we're going to be doing is
05:39
sending the HTML representation of all of our tables to a very quick language model, in this case a model from Groq, I think I was using Llama 3.1, to create a summary for each table. Then we're going to do
05:55
the same thing for every long piece of text: we're going to create a summary for that piece of text. That allows us to embed the summary instead of embedding the entire text, which helps with retrieval. That's usually a good technique
06:10
because it allows you to focus on the keywords that are actually relevant to the text that you're going to be embedding. And for the images we're going to do the same thing: we're going to summarize, or rather describe, each image. In order to do that
06:27
we're going to use a language model that has multimodal capabilities. In this example I was using GPT-4o mini, but feel free to use any language model that has multimodal input. You can use, for example,
06:43
Gemini 1.5 from Google, or you can use LLaVA if you want an open source model. Just to be clear, up to this point we have already extracted all of the images,
06:58
all of the tables, and all of the text, and we have also tagged each of them with a summary. It is the summary that we're going to embed using the embeddings model. So that is the next step: once you have extracted everything, we're going
07:14
to tie them together, the summary and the original element, using a doc ID. This is just a string with a unique, very specific ID that is going
07:32
to link the original document to its summary. So we're going to have all of our summaries linked: we're going to create documents from the summaries, and in the metadata of each one of our documents we're going to put the ID, and
07:49
the same for the original elements: the metadata of each one of these elements is going to have the same doc ID as its summary. These summaries are the ones that are going to go into our vector store,
08:05
and this right here is pretty much the same thing we have been doing so far: we're going to vectorize them using a text embeddings model and then load them into a vector database. In this case, if I remember correctly, I was using
08:21
ChromaDB. The original documents are not going to go into the vector store, because we are not going to embed them. As I told you before, that could be a possibility, and it is actually one way of doing this: if you want, you could embed the whole thing
08:37
using a multimodal embeddings model. But in our case we're not using a multimodal embeddings model, we're only using a text embeddings model, and we're only going to vectorize the summaries. So we're going to put the summaries into a vector
08:52
database, and the documents we're going to load into a different store that we'll call our document store. But remember that even if they are in two different places, they are still linked by this doc ID metadata
09:09
that we have assigned to them. This is very relevant, because now retrieval becomes possible: you can query your vector database like you would query a regular RAG pipeline, and the
09:26
vector database will return the most relevant documents for your query. So for example, let's say you're loading the research paper Attention
09:42
Is All You Need from Google, and you query something like "What is multi-head attention?". That is going to retrieve the summaries of the documents
09:57
that talk about multi-head attention. Those are the retrieved documents, but in this case remember that our vector database only contains the summaries, not the documents themselves. That is fine, because we have the doc ID metadata assigned to them, and
10:14
this doc ID is the one that we're going to use to fetch the documents that we're actually looking for from the document store. So then we're going to fetch those documents, and since those documents, remember, can be images, tables, and
10:31
text, we could actually get from the retrieved documents not only text but also images and tables. That is essentially what we're going to be doing and how multimodal retrieval works
10:46
in this particular approach. So in the end, as you can see, we have a very simple retrieval pipeline in which we just ask a question, send a query as text, and
11:03
as a return we get documents that can be images, tables, whatever we loaded into our vector store and whatever we were able to tag with a summary. We then use that to
11:19
generate an answer. If we get images as context, then we're probably going to have to use a language model that has multimodal input capabilities, if we want to use the images as context, which I suppose is what we want to
11:34
do. So that is the whole process. In the next lesson we're going to show the code for how all of this works. All right, so let's actually start with the code right now. The first
11:51
thing that I want you to do is to install the packages that we're going to need in order to run the dependencies. It depends on what OS you're running, but you're going to need Poppler, Tesseract, and libmagic. Here are the quick instructions
12:08
to install them for Mac or Linux. If you're using Windows, there are also instructions to build these; I'm probably going to post them under this video. Just get those installed; I already have them installed. The next thing to
12:25
do is to install the Python dependencies that we're going to need, and the main one in this case is unstructured. By the way, this notebook is of course available in this lesson; you can go right under this video, there is the link to open it,
12:42
and you will have access to the entire code that is right here, so that you can run it on your end and implement it in your own pipeline. So as I was telling you, we are going to be
12:58
installing unstructured, and we're also going to be installing Pillow and lxml (I listed Pillow twice because why not). In my case I'm going to be using ChromaDB, and we're going to need tiktoken installed
13:13
for the tokenization. We're going to be using LangChain, langchain-community, and langchain-openai, because I'm going to be using OpenAI as my vision model; I'm going to be using GPT-4o mini, if I
13:28
remember correctly. For the text-only language models I'm going to be using Llama 3, if I'm not mistaken, and then just python-dotenv for my environment
13:47
variables. So I'm going to be initializing a Groq API key to use the open source LLMs from the Groq API, and the OpenAI API key, of course,
14:04
which you can get from the platform dashboard. I also initialized LangChain tracing and the LangChain API key, which are the settings that we're going to use for LangSmith, because I want to be able to trace what's going on behind the scenes in this pipeline.
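For reference, a minimal sketch of that environment setup could look like the following; the exact variable names used in the notebook aren't shown on screen here, so treat them as assumptions (they follow common conventions), and the LangSmith variables are optional:

```python
# A minimal sketch of the environment setup described above, assuming a .env file
# with conventionally named keys; the notebook's exact cell may differ.
import os
from dotenv import load_dotenv

load_dotenv()  # loads GROQ_API_KEY, OPENAI_API_KEY, etc. from a local .env file

# Optional LangSmith tracing, so every chain run can be inspected later
os.environ["LANGCHAIN_TRACING_V2"] = "true"
# LANGCHAIN_API_KEY can also live in the .env file
```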
14:21
Once that is set up, I'm actually going to restart the kernel, because I realized I had already run the whole thing (I'm on run 101 right now), so
14:40
just to make sure we're all working with a similar environment, I'm going to rerun everything. Once you have installed everything, the first thing that we're going to do is partition our PDF. In
14:55
our case we're going to be dealing with the Attention Is All You Need PDF, which we have right here. This is just an example, of course; feel free to use any PDF you want, but I figured this was a
15:13
good example because, even if it's a bit short (it's just 15 pages), it has some images, it has a lot of text, and it has some equations as well, if I remember correctly. It's a multimodal document that is actually
15:29
more similar to the PDFs you would see in real life. Here we have a table as well; we're going to see how unstructured and our pipeline extract it. This is the file that we're going to be loading. So the first thing that we're going to do is
15:45
partition it, and in order to partition it we're going to be using unstructured. So let's get into how we're going to be using unstructured and what each parameter does. Before I forget to mention this,
16:01
this video is actually a lesson from the AI engineering cohort that I host, where I teach you how to go from beginner to this level of creating multimodal RAG applications, and from this level all the way to creating
16:17
multi-agent systems. This is a program that not only includes pre-recorded material with all of the content but also includes my personal help: if you get stuck on something, I can be there to answer your
16:33
questions. There are live sessions; it is a cohort course, which means that you will be interacting with me live. So be sure to join if you're interested in that and in joining the community. It is pretty fun and I can't wait to meet you there. If you're interested, just let me know if you have
16:48
any questions about that too. Okay, so on with the video. So what do I have right here? I'm calling this method from unstructured. Remember
17:04
that we have already installed unstructured, and I actually installed support for all kinds of documents. In this case I'm only going to be using PDFs, so I could technically have installed just the PDF extras, but
17:19
you're going to want to install all-docs if you're going to be parsing a lot of different formats. In this case I'm going to be using the partition_pdf method from unstructured. Something that might be important to
17:36
mention is that right here I am not using the loader from LangChain. LangChain actually has a loader for unstructured and it works pretty well; you can use it both locally and also with the serverless API. But I wanted to
17:53
show you a little bit more of the flexibility that you have with unstructured, so that we can see how all of these parameters work with the unstructured API, and so that you can actually see what's going on under the hood
18:09
when you're using unstructured. So the partition_pdf method, and pretty much any partitioning method from unstructured, basically takes the file path to the file that you want to partition and returns the partitioned elements. That's the only mandatory
18:28
parameter that you're going to pass, but there are other arguments available, so let's take a look at them. First of all we have the strategy, which can be hi_res or a more basic strategy, and in this case we're going
18:45
to be choosing hi_res because we're setting this other parameter, infer_table_structure, to True. That parameter essentially just controls whether you want to extract tables from your document or not, and if you want to
19:02
extract tables from your document, you're going to have to set it to True, and the high-resolution strategy is mandatory; you're going to have to choose it if you want to extract tables. So that's the first thing, and in our case we are
19:18
going to want to extract tables, so we're going to choose these settings. Now, something else is that you're going to want to choose the kind of images that you want to extract. In this case, I
19:34
don't have it here, so that's good. In previous examples of this tutorial, because I am basing this on a cookbook
19:50
from LangChain, they were using this parameter right here, extract_images_in_pdf, set to True. However, that parameter is actually deprecated, or on the way to being deprecated, so you don't need to add it anymore. I just added it here for
20:07
context, because you're probably going to find it in the wild or in other tutorials, so that you don't get confused by it. The updated way of doing this
20:23
is with this parameter right here, extract_image_block_types, and you're going to set it to "Image" if you want to extract the images from your PDF. If you want to extract the tables as images too, for example, you're
20:39
going to add "Table" as well. In our case we only want to extract the images. This is not going to skip the tables entirely; it's just not going to extract the tables as images.
20:56
It's still going to extract the tables, as I'm going to show you in a minute. And if you want to extract the images to an actual folder, an actual directory on your computer, you can enable this and pass in an
21:12
output path to save the images to. That's also a possibility. In my case I'm not going to enable it; I'm just going to leave it like this, because I don't want the images from my PDF downloaded to my computer, I just want to have them in the partitioned
21:29
elements. But feel free to enable it and play around with it, and you will see that it will create a new folder right here with all your images. The next thing that we have right here is extract_image_block_to_payload, and this
21:46
essentially means that we are going to be extracting the images, and each image is going to have a metadata field that contains the base64 representation of the image. If you
22:02
set this to False, you're not going to be extracting the base64 representation of your images, which would be terrible, because you're going to need the base64 representation of your images if you want to send them to your
22:17
language model. If you want to send an image to your language model, you're going to have to send it to the API using a base64 representation, so this is the way to do it.
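To make those parameters concrete, here is a rough sketch of the kind of call being described, without chunking yet; the file path is just a placeholder:

```python
from unstructured.partition.pdf import partition_pdf

# A sketch of the partitioning call described above (no chunking yet).
# "attention_is_all_you_need.pdf" is a placeholder path.
elements = partition_pdf(
    filename="attention_is_all_you_need.pdf",
    strategy="hi_res",                      # required for table extraction
    infer_table_structure=True,             # extract tables with their structure
    extract_image_block_types=["Image"],    # add "Table" to also get tables as images
    extract_image_block_to_payload=True,    # put the base64 of each image in element metadata
)
```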
22:34
Now, this next part, the chunking strategy, is actually very interesting, and I'm going to show you how this works without it first, because it's super powerful and super cool, but let me just show you how this works without it. Without it, I'm going to run this; it's probably going to take a few seconds,
22:50
but what is going to happen is that it is going to extract every single element from my PDF. So it's going to return this table, this paragraph, this other paragraph, this title right here, this other table, and it's going to return everything at once: all the elements of my entire
23:08
document are going to be returned in a single array. That's okay, that could be what you want to do, but the unstructured library actually allows you to do something super cool, which is to chunk the output
23:25
by a strategy that you can choose. You have by_title or basic. Chunking usually
23:40
means making things smaller, you might be used to thinking of it that way, but in this case, when you apply a chunking strategy in unstructured, it means you're putting elements together. So right now, without the chunking
23:57
strategy, if we look at what kinds of elements were returned to us, you can see that we have Title elements, NarrativeText elements, Footer elements, Image elements, etc. All of these are
24:12
the element types that are available to us. That's great, we have all the elements in a
24:27
single array, but you probably don't want that. Actually, let me just show you what this looks like: if I check the length of the result, you can see that we have 218 different elements for the document. So it split the entire PDF
24:44
into 218 different elements: this paragraph might be one, this one might be another one, this title might be another one, the table might be another one. You don't really want that; what you want is to have the
25:01
elements together that are related to each other, and that's where the chunking strategy comes into play. I will let you play around with this to see what it actually does, but if we enable chunking and rerun this, this time I'm going to set the
25:17
chunking strategy to by_title, a maximum chunk size of 10,000 characters, we're going to combine
25:34
elements when they are under 2,000 characters, and we're going to start a new chunk after 6,000 characters. You can take a look at the documentation if you want to dig into this a little more deeply.
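As a sketch, adding the chunking options described above to the same call would look roughly like this (parameter values taken from the walkthrough; the path is still a placeholder):

```python
from unstructured.partition.pdf import partition_pdf

# Same call as before, now with by_title chunking enabled (a sketch, not the exact notebook cell).
chunks = partition_pdf(
    filename="attention_is_all_you_need.pdf",   # placeholder path
    strategy="hi_res",
    infer_table_structure=True,
    extract_image_block_types=["Image"],
    extract_image_block_to_payload=True,
    chunking_strategy="by_title",        # group related elements under the same title
    max_characters=10000,                # hard cap per chunk
    combine_text_under_n_chars=2000,     # merge small sections together
    new_after_n_chars=6000,              # soft cap: start a new chunk after this many characters
)
```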
25:50
What this is essentially doing is taking the 218 elements that we have in the document and putting together those that are related inside the document. If you choose by_title, then it's going to go through our
26:06
document and say: okay, this is one title, so all of the elements under this title are going to be assigned to a single chunk (that's what they call it), and then all of the elements under this other title are going to be under a single
26:22
chunk as well. This is actually super useful for RAG, because in a document like this one, a titled chapter talks about one single topic; it has
26:38
cohesive meaning, and that's going to be super useful for RAG. You're going to be able to embed an entire chunk that is related; it's basically extracting the document chapter by chapter,
26:53
so that's pretty useful. It actually finished partitioning, and you will see that in this case we don't have 218 elements anymore, we only have 17, and in the same way we don't have all
27:11
of the different types, we only have these two types, which are CompositeElement and Table. Let me just show you what a CompositeElement looks like.
27:31
From chunks I'm going to look at the first one. The first one is a CompositeElement; I'm going to call to_dict on it, and you can see right
27:46
here the contents of this CompositeElement: it has a type, it has an ID, it has some text, it has some metadata, it says which page number it comes from, etc., which is very cool. But
28:04
interestingly, it has this property inside of its metadata. Let me show you:
28:19
I'm going to use this property right here, which is orig_elements, and that one actually contains all of the elements that are related, or associated, to this particular
28:37
chunk. So remember that we had 218 elements inside the entire PDF; for 15 pages, unstructured extracted 218
28:53
elements and then it associated them together using a by_title chunking strategy, and the 17 chunks that it returned to us are actually sets of those components.
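As a rough sketch of the inspection being done here (the attribute names match the unstructured API as far as I know, but treat this as illustrative):

```python
# A sketch of the inspection described above: look at the chunk types,
# then peek inside one CompositeElement's metadata.
print(len(chunks))                                    # e.g. 17 chunks after by_title chunking
print({type(chunk).__name__ for chunk in chunks})     # {"CompositeElement", "Table"}

first = chunks[0]
print(first.to_dict().keys())                         # type, element_id, text, metadata, ...

# The elements that were grouped into this chunk live in metadata.orig_elements
for el in first.metadata.orig_elements:
    print(type(el).__name__, "- page", el.metadata.page_number)
```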
29:11
Since we used the by_title technique, these are supposed to come one after another under the same section of the PDF. I'm going to show you how this looks in the actual PDF in just a moment, but start visualizing it.
29:28
Actually, I'm just going to show you right now how it looks. Where do we have this? I think it is here; I was just running some tests, but here, as you can see, is one chunk. I'm going to display one
29:44
chunk, and as you can see this chunk right here starts at this title called Attention and it goes all the way up to here. So it is kind of a chapter in the document, and
30:03
it contains 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13 elements inside of it, and they are all associated to the same title. That is because we used the chunking strategy, and that is also
30:22
the reason why only CompositeElements and Tables were extracted: the CompositeElements are the ones that contain all of the other elements inside of their metadata, under the key
30:39
orig_elements. So far so good, we have successfully extracted our documents. I'm going to erase this one because I had already shown it to you, but as you can see, inside of the metadata's
30:56
orig_elements key we have a Title, NarrativeText, a Footer, a couple of images (for this one in particular we have images), the title, etc. Now let me show you what the
31:12
images look like inside of an unstructured document. In this cell right here, what I'm doing is essentially just extracting the elements
31:28
that are of type Image. Here I basically listed all of the elements in chunk three (we have a Title, a NarrativeText, etc.), and what I'm doing right here is just extracting those elements that are
31:44
images. So I'm taking all of the elements from chunk three, extracting the images from that chunk, selecting the first one, and converting it to a dictionary, and here you can see the
32:01
representation of an image that was extracted by unstructured. You can see that it has a type of Image, it has some text (because unstructured is able to extract the text within the image), and it has the coordinates within the document itself,
32:17
which is going to be very useful later if you want to highlight where in the document this particular element is located. Then, very importantly, right here we have the image's base64 representation, which as you can see is super long, but that's
32:34
exactly what we want, and we are only getting it because, remember, we set extract_image_block_to_payload to True; that is the only reason why we are getting
32:51
this key right here. It is of course very important, because this is what we're going to be sending to our multimodal language model. So far so good: we have successfully extracted the elements, and we can now split them.
33:08
By the end of this splitting step we're going to have three different arrays: one of tables, one of texts, and one of images, just like we had in the diagram we showed before. So, for each chunk in chunks,
33:27
that is, for the 17 chunks that we extracted, we're going to append the element to tables if it's a Table, and we're going to append it to texts if
33:43
it's a CompositeElement. Now, this is technically a shortcut, because remember that inside a CompositeElement there are also images, but we're going to treat the images differently. Of course you can improve on this pipeline if you want and actually parse the images
33:59
within each CompositeElement as well; feel free to do that. But in this case I'm just going to extract the images and the CompositeElements and treat them as these two different kinds of elements. Then, third, I'm going to extract the images, and in order to
34:15
do that I'm going to extract them from the CompositeElements: I'm going to go into every single CompositeElement, and if within it I have an Image element, I'm going to add it to my images array. That way, by the
34:32
end, I have these three different arrays, one for tables, one for texts, and one for images, and we have successfully completed the partitioning, or extraction, part.
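As a sketch, the splitting loop just described looks roughly like this (a simplified version, not necessarily the exact notebook cell):

```python
# Separate the chunked output into tables, texts, and images (a sketch).
tables, texts, images = [], [], []

for chunk in chunks:
    if "Table" in str(type(chunk)):
        tables.append(chunk)
    if "CompositeElement" in str(type(chunk)):
        texts.append(chunk)
        # Images live inside the CompositeElement's metadata.orig_elements
        for el in chunk.metadata.orig_elements:
            if "Image" in str(type(el)):
                images.append(el.metadata.image_base64)  # base64 string of the image
```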
34:47
Now what we're going to do is transform these elements. Here I just have a very quick function that displays any image given its base64 string; it's a quick function made with ChatGPT, but essentially it just takes the base64 code of an image and
35:04
displays it. Here you can see that I take the first element of the images array we created and show it, and as you can see this is the first image that was extracted from the document, so it seems to be working pretty well.
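A minimal sketch of a display helper like the one described (the function name is illustrative):

```python
import base64
from IPython.display import Image, display

def display_base64_image(base64_code: str) -> None:
    """Decode a base64 string and render it inline in the notebook (a sketch)."""
    image_data = base64.b64decode(base64_code)
    display(Image(data=image_data))

# Example: show the first extracted image
display_base64_image(images[0])
```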
35:19
So now we have these three arrays, and it is time to go to the next part of this exercise: summarizing the data that we just extracted. We're going to be creating a summary for each image, each table, and each piece of
35:38
text. All right, so now it is time to start summarizing the data, and that's what I'm going to do right here. In order to summarize it I'm going to be using, first of all, a model from Groq; I'm going to be using Llama 3.1, if I
35:54
remember correctly. In order to do that I'm going to be installing the Groq integration, of course, and I'm going to be using ChatGroq, importing it from there; I also import ChatPromptTemplate and my regular string output parser to create a chain.
36:09
Now, the chain that is going to generate the summaries for my text elements is this one right here: "You are an assistant tasked with summarizing tables and text", etc. Just a very simple chain: it
36:26
pipes the prompt into the model and then into the output parser, and as you can see I'm initializing Llama 3.1 from ChatGroq. So I'm going to run this right here; I actually think I forgot to execute this one right here, there we go.
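For reference, a chain along these lines might look like the following sketch; the prompt wording and the exact Groq model id are approximations of what is shown on screen:

```python
from langchain_groq import ChatGroq
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# A sketch of the text/table summarization chain described above.
prompt_text = """You are an assistant tasked with summarizing tables and text.
Give a concise summary of the table or text.

Respond only with the summary, no additional comment.
Table or text chunk: {element}"""

prompt = ChatPromptTemplate.from_template(prompt_text)
model = ChatGroq(temperature=0.5, model="llama-3.1-8b-instant")  # model id is an assumption
summarize_chain = {"element": lambda x: x} | prompt | model | StrOutputParser()
```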
36:44
Now let me just show you what texts looks like, because remember that we split all of our elements into tables, texts, and images. The texts variable itself, remember, is a
37:01
collection of CompositeElements, right? What does that mean? Remember that I told you that in its metadata there are the original elements. Let me tap into the first
37:18
one just to show you: all of the elements are within orig_elements, like this. However, you can still print the chunk like
37:33
this, and it's going to show you all the elements of that collection, of that chunk. So here I have my scrollable element, and as you can see here's the collection of all the text for this
37:49
first chunk; it's just the title and the abstract, that's the first chunk. Now what we're going to do is use that in order to summarize it. We're going to
38:04
pass in every single one of those texts, every single one of those CompositeElements, and summarize them. I'll show you what it looks like on LangSmith a little bit later on if you want, but the idea right here is that it is going to take the entire contents of all of the elements
38:22
within each CompositeElement and batch-execute the summarize chain over them, which gives us the text summaries. The same thing is going to happen to the tables, but first let me show you something quick about the tables.
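The batch call being described would look roughly like this (max_concurrency is an illustrative setting):

```python
# Summarize every text chunk in one batched call (a sketch).
text_summaries = summarize_chain.batch(texts, {"max_concurrency": 3})
```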
38:38
So the tables basically look like this. We have four tables in the document, and let me show you what one looks like with to_dict. Actually, I'm going to
38:53
have to click on the first one right here, and you can see this is the first table that we have: we have the element ID, we have some text within it, and then it has this very convenient
39:09
property, which is text_as_html. That is essentially the extracted table, but in HTML format, and this is the only thing that we actually need to send to our language model in order to summarize it. We
39:25
don't really need its original elements, because if we try to tap into the original elements, for example metadata.orig_elements, let's see what it looks
39:41
like with to_dict... it doesn't look like anything, apparently there's no to_dict here. I don't know
39:58
why I have a base64 thing right here, but what I wanted to show you is that we have the HTML code inside of here, and this is what we're going to send to our language model in order to actually get the summary. Remember, if you have a
40:13
language model, you have to send it text (or an image, if it's multimodal), but you cannot send it just the raw text like this; it's probably not going to understand the divisions, there are no headings or anything right here. You want to send it the actual
40:29
markup language, so that it can understand where the header is, what a table cell is, etc. So these are the properties we're going to be tapping into, and that's what's actually going on right here: I say that the tables'
40:45
HTML is going to be the text_as_html property of each table in the tables array, and then I batch that too. Let me just execute this; it takes a few seconds.
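A sketch of that step, following the same pattern as the text summaries:

```python
# Summarize tables from their HTML representation (a sketch).
tables_html = [table.metadata.text_as_html for table in tables]
table_summaries = summarize_chain.batch(tables_html, {"max_concurrency": 3})
```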
41:00
Let me show you what it looks like; I have it right here. That's where the text summaries are: as you can see, we have a whole bunch of text summaries, one per CompositeElement, and here you have
41:16
every single summary. Then let's check the same thing for the tables: table_summaries. Here we have four tables; the first one compares four types of neural
41:32
network layers, self-attention, recurrent, etc. As you can see, these are the summaries that I am going to be vectorizing, embedding, and adding to my vector database. Now let's do the same thing for our images, and in
41:48
order to do that we're going to be using OpenAI, of course. So the first thing to do is to install it, there we go, and similarly to the previous examples we're going to be
42:04
creating a chain that is going to summarize each image. In this case, though, we're not going to be sending a regular text prompt like we did before; we're going to be loading the message with the image itself. You
42:20
can check the API documentation of whichever LLM you're using to see how you can send an image to it. In LangChain it is pretty uniform: you just send a
42:36
user message, you put whatever you want to send as text inside a dictionary with type "text", and then you create another dictionary with type "image_url" and send the image in base64,
42:51
so that the model understands the image. That's essentially how we're going to be sending it. As you can see, here we have a prompt template that takes only one variable, which is the image, and here we have
43:07
another template that is not taking any variables, so that's convenient. Then this part right here is going to be the base64 code of the image that we want to describe. In order to do that we just initialize our chain; in this case we're going to be
43:24
using GPT-4o mini, because we want a language model that has multimodal input, and then just batch the summaries. Let's execute that.
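A sketch of what an image-description chain like this can look like in LangChain; the prompt wording is illustrative, and the jpeg mime type is an assumption about the extracted images. The important part is embedding the base64 image in the message:

```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# A sketch of the image summarization chain described above.
prompt = ChatPromptTemplate.from_messages([
    ("user", [
        {"type": "text",
         "text": "Describe the image in detail. Be specific about graphs, "
                 "diagrams, and any text in the image."},
        {"type": "image_url",
         "image_url": {"url": "data:image/jpeg;base64,{image}"}},
    ])
])

image_chain = prompt | ChatOpenAI(model="gpt-4o-mini") | StrOutputParser()

# images is the list of base64 strings collected earlier
image_summaries = image_chain.batch(images)
```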
43:40
That actually takes a little bit of time when it is ingesting all of those images. You can see we have, I think, seven images right here, so let's see how long it takes. It took about 16 seconds to process all of those
43:58
images. Now let's see the summaries. Here we have all the summaries; let's print the fourth one. There we have it: "the image appears to illustrate the attention mechanism used in the Transformer architecture", then we
44:15
have the key elements: words and tokens, attention weights, highlighted tokens, etc. Let's actually take a look at that one to see what it looks like. What was the name of the function that I had up here that displayed the image? display_base64_image, so let's use it
44:33
right here, and let's display image number three. Here we have image number three. Let's check number one: it illustrates a key concept of the Transformer
44:50
architecture. Trying to find an easier one to visualize... let's see, this one was image zero. Image zero is of course still the one we saw before, and if we check the summary of that one we can see
45:07
that the overall structure is a diagram divided into two main sections, the encoder and the decoder, then some arrows and connections. So this is exactly what we want to embed; these are the summaries that we're going to be vectorizing and adding to
45:24
our vector database. I'm going to remove these two sample cells right here just to keep things tidy, but there we go, that was creating the summaries of all of our elements. As you can see, we already have the three arrays with
45:42
the images, the texts, and the tables, and then we have three other arrays with the corresponding summaries for the texts, the summaries for the tables, and the summaries for the images. Now what we're going to have to do, as we saw in the
45:59
diagram from before, is link them together using an ID; that's what we're going to be doing right now, and then we're going to be loading them into our vector database and into our document
46:15
store. Okay, so now it's time to talk about how to load those summaries and elements into our vector store and our document store. Doing this is
46:31
actually very simple: we're going to be using a LangChain abstraction called the multi-vector retriever. It's pretty straightforward, and what we're going to be doing with it is what we saw before: we're going to be creating an ID
46:46
for every single document, and we're going to add it as metadata to both our summary and our document. The
47:02
document is going to go to the document store, which is right here, and the summary is going to go to the vector store, and it is the summary which we are going to retrieve using semantic similarity search. Once we have retrieved the summary, we're
47:17
going to check the document ID that is in its metadata and go fetch the corresponding document in the document store that has the same ID. That's essentially all we're doing, and this is what the multi-vector
47:35
retriever does in LangChain. Of course, you can code this yourself if you want; you're not forced to use MultiVectorRetriever. I just feel like this is a good level of abstraction to stop at, because
47:51
what's going on under the hood is pretty self-explanatory, and this class works very simply: you just pass it the vector store, you pass it the document store, and you pass it the key that it's going to
48:07
add to the metadata to connect the two. That's essentially all we're doing: we're initializing a Chroma vector store, we're initializing a document store (in memory, in this case), and we're defining a metadata
48:23
ID key, which is going to be "doc_id", like this one right here. Then we're just creating this abstraction that is going to link them together, and this retriever is going to
48:38
return to us the documents that are relevant for a query; it's not even going to return the summaries, it's only going to return the documents. So let's execute this right here.
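A sketch of that setup, using the LangChain pieces named in the walkthrough; the collection name and the choice of OpenAI embeddings are illustrative:

```python
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain.storage import InMemoryStore
from langchain.retrievers.multi_vector import MultiVectorRetriever

# Vector store for the summaries, in-memory store for the original elements (a sketch).
vectorstore = Chroma(collection_name="multi_modal_rag",
                     embedding_function=OpenAIEmbeddings())
docstore = InMemoryStore()
id_key = "doc_id"   # metadata key linking a summary to its original element

retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    docstore=docstore,
    id_key=id_key,
)
```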
48:54
Now that we have created it, it is actually empty, so let's load everything that we want to load. The first
49:10
thing that we're going to load is the texts, which are the CompositeElements, and as I told you before, first we're going to have to create an ID for each one of them. So this line right here creates a UUID for every single element in
49:30
texts, and then we're going to add that ID to the metadata of every single document that we're going to be generating. This is essentially just a one-liner that creates a LangChain
49:46
Document for the summary of every single CompositeElement that was returned to us from unstructured (the texts variable). So this is creating the
50:02
summary documents for the texts; then we load those documents into our vector store, and then the actual
50:17
CompositeElement that we extracted from unstructured goes to the document store. That is the one that is going to be retrieved, not the summary; the summary is only used for finding it, but the one that we're actually going to get from
50:33
the retriever is this one right here. We're going to do exactly the same thing with the tables, creating a LangChain Document. In case you don't remember, I imported Document from langchain.schema, but actually I think that is old-school; I think
50:49
now it's from langchain_core.documents that we import Document. Yeah, this is old code, let me just fix it. From langchain_openai we're going to import
51:06
OpenAIEmbeddings (I don't know why the old import was there), and then
51:21
MultiVectorRetriever; actually, sorry, that is the only one that still comes from langchain.retrievers.multi_vector.
51:37
So there we go. Then we're going to add everything, doing exactly the same thing for the images: generating an ID for every single image, creating a Document for every single summary, and then adding the images themselves
51:53
to the document store. It is the images themselves, in base64, that are going to be retrieved. It's going to take a few seconds to load everything, and then we have everything within our retriever: a document store with a vector store assigned to it.
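The loading pattern described for the texts (and repeated for the tables and the images) looks roughly like this; a sketch, not the exact notebook cells, and it assumes the text_summaries list from the earlier summarization step:

```python
import uuid
from langchain_core.documents import Document

# Link each summary to its original element through a shared doc_id (sketch for the texts;
# the same pattern is repeated for tables and images).
doc_ids = [str(uuid.uuid4()) for _ in texts]

summary_docs = [
    Document(page_content=summary, metadata={id_key: doc_ids[i]})
    for i, summary in enumerate(text_summaries)
]

retriever.vectorstore.add_documents(summary_docs)       # summaries get embedded
retriever.docstore.mset(list(zip(doc_ids, texts)))      # originals go to the document store
```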
52:08
Now we can actually start testing it. If I query "What is multi-head attention?" against this retriever right here, remember that my retriever is a MultiVectorRetriever.
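The query itself is just a normal retriever call, something along these lines:

```python
# Query the multi-vector retriever (a sketch): similarity search runs over the summaries,
# but the original elements (text chunks, tables, base64 images) come back.
chunks = retriever.invoke("What is multi-head attention?")
```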
52:24
I can essentially just execute this, and now the retrieved chunks are going to be right here. Let's see: the first chunk is a CompositeElement, the second chunk
52:40
is actually a base64 string, which has to be an image, because that is what we added to our document store; here we have another CompositeElement and another CompositeElement. Pretty convenient. Now, right here there is some
52:56
extra code that you don't necessarily need; this is some code that was actually available on one of LangChain's documentation pages (I'll add a link in the description). It essentially just renders the
53:12
page and highlights whatever elements you send to it; I just had to update it a little bit. Let me just create the function and show you what each one of the chunks that we just retrieved
53:27
has inside of it. So these are our four chunks, and remember that the CompositeElements actually have more elements within them. We're going to tap into this first one; it is a CompositeElement,
53:43
and remember that it has a lot of things within it, so let's take a look at what it has inside. For every single element in the CompositeElement we're going to print its index, and we're going to
53:58
write down what type of element it is and which page it is on, just to see. You can see that the first chunk that it retrieved has a Title, NarrativeText, all the way to a ListItem, and it
54:14
spans from page four to page five, pretty good. Here we have chunk number two; let me just run this, a scrollable object: chunk number two has a Title, a NarrativeText, the Footer, all the way to a NarrativeText,
54:30
and it actually spans from page two all the way to page four, pretty good. Same for chunk three, pretty good. Now let's take a look at the first chunk right here; I think I forgot to do something right here,
54:48
chunks metadata... oh yeah, because this one right here is actually the image. Let's open the first one. All right, so in the first chunk, the first original element is a Title, just like we
55:03
saw before, and it's on page number four, there we go. Right here I just coded a couple of quick functions; I don't want to confuse you with them, but essentially all they do is extract which pages a given
55:20
chunk covers, because remember that a chunk contains a bunch of elements. So the first one contains, as we saw right here, the Title from page four all the way to a ListItem on page five, and chunk number two contains
55:37
everything from the Title on page number three to a NarrativeText on page number four. And here I coded two very quick functions that display the picture of the pages of whichever chunk you pass to them. So
55:54
here I'm going to pass the first chunk and... wait, what? This one was actually not useful. All right, there we go: the fourth chunk contains the
56:12
introduction, background, etc., so it's only one page, this one right here. Let's check chunk number two; it is supposed to span from page number three all the way to page number four, so let's check that one right here. Which one was it? Chunk
56:30
number two, so let's see that one. Chunk number two spans from here, from Attention, all the way to page number four, and these are the elements that
56:47
were retrieved. As you can see, this chunk is pretty self-contained, and it contains all of the information that is related to this particular topic. That is why it is so useful and so important to use
57:02
chunking by title in this kind of document. When you're using unstructured it becomes super easy, because the entire chunk is interconnected; it's not like it split the text randomly and stopped chunking here and then the next
57:19
chunk starts there. It is literally by title, which is very convenient, and it actually has the images here too. Now, in this particular example I am not extracting the images from the chunks themselves, because, as you may have noticed, so far I
57:35
am embedding the images separately, but you could just extract the images from here and that would work pretty well. So now that we have that, we can actually start creating the RAG pipeline, since we have all the elements: so far
57:50
we have the retriever, and it is working. The retriever is multimodal; actually, let me show you that it is multimodal. Where was the function that could display images? This one right
58:08
here. So this query retrieved four documents, and as you can see the second document is a base64 document, so let's take a look at it. We're going to check chunks, and we're
58:25
going to check the second one right here; this is going to tap into this element. Let's execute, and there we go, we have retrieved the image. We have retrieved all of the documents themselves, and now
58:40
we can essentially start creating our RAG pipeline. So let's get to it. In order to create our RAG pipeline, all we're doing here is
58:56
using a couple of helper functions. First of all we're importing RunnablePassthrough and RunnableLambda, which we're going to be using, and remember that this chain right here is supposed to give you an answer based on the
59:13
retrieved documents. That essentially means that, since some of the documents that will be retrieved are images, the chain has to include a language model that has multimodal input,
59:30
which is the case with GPT-4o mini right here; that's the model I am using for this one. So I created a couple of chains right here. The first one is the simpler
59:46
one, and probably the one that you would intuitively create. The idea right here is essentially just a very regular RAG pipeline, like the ones we have built before. The entire chain
00:03
takes just a single input, which is the query from the user, and that query is going to be passed through RunnablePassthrough as the question. For the context we're
00:19
going to be using the retriever; however, we're going to be parsing the retriever's output using the function parse_docs. As you can see, right here we have a RunnableLambda, and a RunnableLambda is essentially just the same
00:35
thing as creating a lambda function. Let me show you what that one does: essentially it just splits our retrieved documents by whether or not
00:51
they are an image. It tries to base64-decode each one, and if it cannot, that means it is not an image and it is text, so it appends it to my texts array; it returns images and texts,
01:07
basically, this function right here. The retriever will return just the retrieved documents, like this, and an object like that is going to be passed to my parse_docs
01:25
function. That parse_docs function is going to try to decode every single element in this array; if it can decode it, it's going to add it to the images array, and if it cannot decode it, it's going to be added to the texts
01:40
array. Within the texts we're going to have the tables and the text. Those texts are going to be right here, the images are going to be right there, and this is the dictionary that is going to be returned.
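A sketch of what a helper like parse_docs can look like, given that the document store holds both raw CompositeElements and base64 image strings; the helper name follows the walkthrough, and the validation trick is the one described (try to decode, treat failures as text):

```python
import base64

def parse_docs(docs):
    """Split retrieved items into base64 images and text/table elements (a sketch)."""
    images, texts = [], []
    for doc in docs:
        try:
            base64.b64decode(doc, validate=True)   # raises if doc is not valid base64
            images.append(doc)
        except Exception:
            texts.append(doc)
    return {"images": images, "texts": texts}
```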
01:57
That is pretty important, because our prompt is going to take these two keys, which are the ones that are going to be sent to this other RunnableLambda that we have right here. So by the
02:13
end of this thing, this function right here is going to get a dictionary with the keys context and question, and within the context key the value is going to be another
02:29
dictionary with those two keys (images and texts). So far so good. This function right here is going to take whatever is within the context and turn it into the prompt that we're creating.
02:44
Very simple: here we have the context text, and essentially we're just appending more context text if the
03:01
element in the context from our retriever is of type text. So we're tapping into the documents by type right here and appending those, and if it is not text, if it is an image, we're going to be appending this
03:19
part right here to the message, because remember that if you want to send an image to a language model, you're going to add it like this: besides the part of type "text" which we have right here, we're going to have a part of type "image_url", and then we
03:35
append the base64 image right there, and then we just create a human message with this prompt content. That's all we're doing, and if there are no images, then that part is simply skipped and we just append the composite
03:52
elements inside of the prompt. That's all we're going to be doing. So far so good, perfect. This can of course be improved; I feel like I'm not dealing with the tables perfectly, I feel like I
04:09
could structure this better to handle the tables a little more neatly, but this is just a quick example to show you how it works.
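Putting it together, the simple chain being described could look roughly like this; the prompt wording, the helper name build_prompt, and the jpeg mime type are illustrative, and the sketch reuses the parse_docs helper from above:

```python
from langchain_core.runnables import RunnableLambda, RunnablePassthrough
from langchain_core.messages import HumanMessage
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI

def build_prompt(kwargs):
    """Assemble a single multimodal HumanMessage from the parsed context (a sketch)."""
    context, question = kwargs["context"], kwargs["question"]
    context_text = "\n".join(str(t) for t in context["texts"])

    content = [{"type": "text",
                "text": f"Answer the question using only the following context.\n"
                        f"Context: {context_text}\nQuestion: {question}"}]
    for img_b64 in context["images"]:
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{img_b64}"}})
    return [HumanMessage(content=content)]

chain = (
    {"context": retriever | RunnableLambda(parse_docs),
     "question": RunnablePassthrough()}
    | RunnableLambda(build_prompt)
    | ChatOpenAI(model="gpt-4o-mini")
    | StrOutputParser()
)
```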
04:25
So this is what the chain looks like: after that we just pass everything to the language model and parse the string output, that's all. Then right here I have a very quick example of doing exactly the same thing but also returning the sources, so we're creating a RunnablePassthrough. Remember that if you run
04:41
assign on a RunnablePassthrough... all of this, by the way, is covered in the LangChain module in the previous part of the course, so if you're having trouble with RunnablePassthrough, there is a video about it in the previous module,
04:58
where we were covering LangChain. Essentially, this thing right here is going to return a three-key dictionary: context, question, and response, and the response is of course the response from the language model.
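A sketch of that variant, reusing the same helpers and the display_base64_image helper sketched earlier; chain_with_sources is the name used in the walkthrough, and the exact composition and metadata access are approximations:

```python
# Same pipeline, but keep the parsed context and question alongside the answer (a sketch).
chain_with_sources = (
    {"context": retriever | RunnableLambda(parse_docs),
     "question": RunnablePassthrough()}
    | RunnablePassthrough.assign(
        response=RunnableLambda(build_prompt) | ChatOpenAI(model="gpt-4o-mini") | StrOutputParser()
    )
)

result = chain_with_sources.invoke("What do the authors mean by attention?")
print(result["response"])
for text in result["context"]["texts"]:
    print(text, "- page", text.metadata.page_number)   # page number lives in the element metadata
for image_b64 in result["context"]["images"]:
    display_base64_image(image_b64)
```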
05:13
So here we have the first example, which is chain.invoke("What do the authors mean by attention?"), and it gives us the answer, but we don't really see what's going on. Let me just execute
05:31
this. There we go; it's going to take a little bit of time, because it is retrieving the images, and if it finds images it's going to retrieve them. "The authors define attention as a mechanism that maps a query and a set of key-value pairs...", etc. So here we have
05:48
the answer. In the next part we're going to be checking the LangSmith logs of these two queries, just to show you what's going on. Now let's try the same exact
06:07
question, but this time we're going to execute it with chain_with_sources, and this one is going to return a three-key dictionary with response, context,
06:24
and question. What I did right here is I basically just printed the response, then I print the context (I'm actually going to add another couple of empty lines right here), and then for each text inside of it
06:42
we're going to just print it, and print the page number as well, because the page number, if you remember, is inside of the metadata of the retrieved document, since we built the document using the unstructured
06:58
CompositeElements that were returned to us. Then, if it is an image, we're going to display it using that function we had up there. So let's execute
07:14
this again. It's going to take a few seconds, and there we go. Here we have the response: "In the context provided, attention refers to a mechanism used in neural networks", etc. Working pretty well. Then for the context, right here we have attention, so
07:30
it essentially fetched this part right here, the attention section, and then it fetched a couple of images. Let's see: if it did exactly the same thing, it should have returned this chunk, so let's see.
07:45
It did, and it fetched these two images right here. All right, not bad. It apparently fetched only two text
08:00
documents, one from page number three and the other one from page number 10, and I suppose that it actually fetched that second one kind of by mistake, because I mentioned authors: what did the authors mean
08:16
by this? And here's just a list of a lot of authors. Then it fetched a couple of images that I'm not sure actually helped it answer, but you can see how this is going to be useful: if your document contains a
08:33
bunch of charts, for example, this is going to be super useful, because in the response you're going to be able to get the image itself, and that is something you can display in the front end of your application. You don't
08:50
necessarily need to just cite it with a short reference to the page; you can even display the image, because it is being returned to you in base64, and you can
09:05
even use the image within the answer as well, if you allow the language model to print the base64 elements that it is returning. So that is very convenient, and I can't wait to see how you build
09:21
your own workflows for this kind of implementation. Essentially, this is going to allow you to work with more real-life scenarios where you have more
09:37
realistic PDFs with images, tables, graphs, plots, equations, etc., and it's going to be able to parse them. This is a rudimentary explanation, and there's of course a lot to go deeper into, but
09:54
I hope that it was clear, and of course there are multiple ways to improve on this particular code, but I feel like it'll give you a very good starting point so that you can start experimenting with it
10:09
and creating your own workflows. All right, so there you go, that was how to create a multimodal RAG pipeline, and I hope that you learned a lot from it. Now let's actually start putting this into action: the next
10:27
part is going to be implementing this thing right here into a front end; we can use Streamlit for that. So let's do that afterwards.