Category: AI Technology
Tags: AI, Database, Embeddings, RAG, Search
Entities: AI assistant, Chroma DB, CodeCloud, Flask, OpenAI, RAG, Sentence Transformers
00:00
So, your company has 500 gigabytes of documents on its server, and you're asked to connect an AI assistant, just like ChatGPT, to answer questions about these documents. You think to yourself, man, how am I supposed to get this done? From experience, you know that typical chat applications can't accept more than a dozen files.
00:16
So, you have to use a different method to allow the AI to search, read, and understand all of the files. But how?
Maybe you think you can create a clever algorithm to search the titles of the documents and their contents to rank them by relevance. But you soon realize that this means that every time the user
00:32
searches, you would need to scan the entire 500 GB of documents, and this is a very inefficient way to get it done.
So maybe you try something else: do some pre-processing work up front and preemptively summarize all the documents into searchable chunks. But you also realize that in this case it's
00:49
not likely to be an accurate way to get things done. Let's try a different method.
Why don't we merge these two ideas and get the best of both worlds? Starting with the large language model, we know that the core idea behind how LLMs take input is word embeddings,
meaning human language is turned into a numerical
01:05
representation, because computers don't think in words but in numbers. So is it possible that, instead of searching through the entire 500 GB of documents, we store these documents in a way that preserves their semantics, that is, the meaning of those words, as vector
01:21
embeddings, and store them in a database? And if we can do that, maybe we can retrieve them faster by splitting the content into chunks in the vector database, so that the AI assistant can fit them into its context window and generate output from them.
This method is called RAG, or retrieval-augmented
01:37
generation. Let's say one of the company's use cases for the AI assistant is to ask questions like, "Can you tell me about last year's service agreement with CodeCloud?"
In order to understand how RAG works, we need to break it down into three steps: retrieval, augmentation, and generation.
Starting with retrieval,
01:54
just like we converted the documents into vector embeddings to store them in the database, we do the exact same thing for the question: "Can you tell me about last year's service agreement with CodeCloud?" Once the embedding for the question is generated, it is compared against the embeddings of the
02:10
documents. This type of search is called semantic search: instead of matching static keywords to find relevant content, the meaning and context of the query are matched against the existing documents.
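To make that concrete, here's a minimal sketch of what that comparison could look like in Python with the sentence-transformers library; the model choice and the example chunks are illustrative, not the lab's actual data:

```python
from sentence_transformers import SentenceTransformer, util

# A small general-purpose embedding model (the same family the lab uses later).
model = SentenceTransformer("all-MiniLM-L6-v2")

question = "Can you tell me about last year's service agreement with CodeCloud?"
document_chunks = [
    "Service agreement between the company and CodeCloud, signed last year...",
    "Employee handbook: pet policy and remote work guidelines...",
    "Meeting notes from the quarterly product review...",
]

# Embed the question and the document chunks the same way.
question_vec = model.encode(question, convert_to_tensor=True)
chunk_vecs = model.encode(document_chunks, convert_to_tensor=True)

# Cosine similarity compares meaning, not keywords: higher score = more relevant.
scores = util.cos_sim(question_vec, chunk_vecs)[0]
best = scores.argmax().item()
print(f"Most relevant chunk ({scores[best].item():.2f}): {document_chunks[best]}")
```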
Moving on to augmentation. Augmentation in RAG refers to the
02:26
process where the retrieved data is injected into the prompt at runtime. And you might think, why is this all that special?
Typically, AI assistants rely on what they learned during pre-training, which is static knowledge that can become outdated really fast. Instead, our goal here is to have the AI
02:43
assistant rely on up-to-date information in the vector database. So, at runtime, we need to be able to provide the AI assistant with important details that could help answer a question like the one above.
In the case of RAG, the semantic search results are appended to the prompt and essentially serve as augmented
02:59
knowledge. So, for your company, the AI assistant is given details from your company's documents: a real, up-to-date, and private data set.
All of this happens without needing to fine-tune or otherwise modify the large language model.
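As a rough sketch of what that injection step can look like (the prompt template and variable names here are my own illustration, not the lab's exact code):

```python
def build_augmented_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Inject the retrieved document chunks into the prompt at runtime."""
    context = "\n\n".join(f"- {chunk}" for chunk in retrieved_chunks)
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

prompt = build_augmented_prompt(
    "Can you tell me about last year's service agreement with CodeCloud?",
    ["Service agreement between the company and CodeCloud, effective last year..."],
)
print(prompt)
```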
The final step of RAG is generation. This step is where the AI
03:17
assistant generates the response given the semantically relevant data retrieved from the vector database. And we have an upcoming video on vector databases soon.
So make sure to subscribe to be notified when the video is out. So for the initial prompt that says, "Can you tell me about last year's service agreement with
03:32
CodeCloud?", the AI assistant will now demonstrate its understanding of your company's knowledge base by using the documents that relate to service agreements with CodeCloud.
And since the initial prompt specifies the criterion of "last year", the generation step will use
03:48
its own reasoning to work through the data that was provided to get the best answer for the question.
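A minimal sketch of the generation call, assuming the OpenAI Python client that the lab installs; the model name and the shortened prompt are illustrative:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# The augmented prompt from the previous step (shortened here for illustration).
prompt = (
    "Answer the question using only the context below.\n\n"
    "Context:\n- Service agreement between the company and CodeCloud, effective last year...\n\n"
    "Question: Can you tell me about last year's service agreement with CodeCloud?"
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice; the video doesn't specify one
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```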
Now, RAG is a very powerful system that can instantly extend an assistant's knowledge beyond its training data. But just like any other system, learning how to calibrate it is an acquired skill that needs to be
04:04
learned to get better results. For example, knowing how to chunk your data before storing it in the vector database is a critical decision that will determine the efficacy of RAG.
In order to set up a RAG system, you have to employ different strategies: a chunking strategy, where you determine
04:20
the size and overlap of each chunk; an embedding strategy, to decide which embedding model to use to convert your documents into vector embeddings; and a retrieval strategy, where you control the threshold of how similar the matches need to be, as well as any additional filters
04:35
that you might want to apply to the data set. Setting up a RAG system will look different from one system to another, because it heavily depends on the data set that you're trying to store.
For example, legal documents will require a different chunking strategy than, say, customer support transcripts.
04:50
This is because legal documents often have long, structured paragraphs that need to be preserved, while conversational transcripts can be just fine with sentence-level chunking and high overlap to preserve context.
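As a rough illustration of how those strategies might differ per document type (the numbers here are assumptions for illustration, not values from the video):

```python
# Hypothetical chunking profiles for two kinds of documents.
chunking_profiles = {
    "legal_documents": {
        "chunk_size": 1000,  # keep long, structured paragraphs intact
        "overlap": 100,
    },
    "support_transcripts": {
        "chunk_size": 300,   # roughly sentence-level chunks
        "overlap": 150,      # high overlap to preserve conversational context
    },
}
```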
Now that we've covered the conceptual elements of RAG, let's look at what it looks like on a practical level. To better
05:06
understand this, we can look at this lab specifically geared towards how to use RAG. When I open the lab, I'm dropped right into a real-world mission:
500 GB of company docs need to be turned into instant, accurate answers through a RAG system. Access the labs using the
05:22
link in the description below and follow along with me. In the first question, we're asked to set up a development environment.
I create a Python virtual environment, activate it, install uv, and then pull in ChromaDB, Sentence Transformers, OpenAI, and Flask. A tiny
05:38
marker that says "ready" confirms that I'm set. The tests check that the venv exists,
uv is available, and all four packages are installed. Nice and clean.
In the next question, we're asked to review TechCorp's document vault. I skim the simulated repo of Markdown docs:
05:54
employee handbook, product specs, meeting notes, and frequently asked questions. The key takeaway is that we'll treat these like a real enterprise corpus and make them searchable by meaning, not just keywords.
Explore this question by yourself to get a real feel for the data set. In the following
06:10
question, we're asked to initialize our vector database. I spin up Chroma DB locally using a persistent client and create a collection named techcorp_docs.
The test verifies that a chroma_db directory exists and the initialization script is present. This is our AI brain's storage.
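A minimal sketch of that initialization with the chromadb client; the storage path is an assumption:

```python
import chromadb

# Persist vectors to a local directory so they survive restarts
# (the directory name here is an assumption).
client = chromadb.PersistentClient(path="./chroma_db")

# Create (or reuse) the collection that will hold the document chunks.
collection = client.get_or_create_collection(name="techcorp_docs")
print(f"Collection ready: {collection.name}")
```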
06:25
In the next question, we're asked to learn the chunking strategy.
I write a small script that chunks text with a size of 500 and an overlap of 100. This preserves context across chunk boundaries and improves retrieval quality.
It prints chunk stats and
06:41
writes a verification file with the chunk count. The test checks that the script exists and the output file is valid. Chunking is critical for accuracy.
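A minimal sketch of that kind of chunker, using the size and overlap from the lab (the implementation details are my own):

```python
def chunk_text(text: str, size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into overlapping character chunks so context is preserved
    across chunk boundaries."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + size]
        if chunk.strip():
            chunks.append(chunk)
    return chunks

sample = "lorem ipsum " * 200  # stand-in for a real document
chunks = chunk_text(sample, size=500, overlap=100)
print(f"{len(chunks)} chunks, first chunk is {len(chunks[0])} characters")
```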
In the next question, we're asked to understand embeddings.
I load all-MiniLM-L6-v2 from Sentence Transformers, encode a few
06:57
short sentences, and compute similarities. The test checks that a result file exists and contains similarity values.
The big idea here is that questions and documents both become vectors, so we can measure meaning, not just words.
So "dogs allowed" and "pets permitted" have a high similarity, while "remote work" does not.
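A small sketch of that check, using the phrases from the lab (the exact similarity values will vary a bit by model version):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

phrases = ["dogs allowed", "pets permitted", "remote work"]
vectors = model.encode(phrases, convert_to_tensor=True)

# Pairwise cosine similarities between the three phrases.
scores = util.cos_sim(vectors, vectors)
print(f"dogs allowed vs pets permitted: {scores[0][1].item():.2f}")  # high
print(f"dogs allowed vs remote work:    {scores[0][2].item():.2f}")  # low
```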
07:13
In the next question, we're asked to feed the AI brain.
This is where it all comes together. I iterate through TechCorp's documents, chunking each file with a size of 500 and a stride of 400 (that is, an overlap of 100), embed each chunk with all-MiniLM-L6-v2,
07:29
and store the vectors plus metadata in the techcorp_docs collection. It logs per-file progress and writes a summary of
how many documents were processed into how many total chunks. The tests confirm the ingest script exists,
the completion file is created, and the format is valid. This is our knowledge ingestion pipeline.
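A condensed sketch of that ingestion loop; the docs directory, ID scheme, and metadata fields are assumptions on my part:

```python
from pathlib import Path

import chromadb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(name="techcorp_docs")

total_chunks = 0
for doc_path in Path("docs").glob("*.md"):  # assumed location of the Markdown corpus
    text = doc_path.read_text(encoding="utf-8")
    # Chunk size 500 with a stride of 400 (i.e., 100 characters of overlap).
    chunks = [text[i:i + 500] for i in range(0, len(text), 400)]
    if not chunks:
        continue
    embeddings = model.encode(chunks).tolist()
    collection.add(
        ids=[f"{doc_path.stem}-{i}" for i in range(len(chunks))],
        documents=chunks,
        embeddings=embeddings,
        metadatas=[{"source": doc_path.name} for _ in chunks],
    )
    total_chunks += len(chunks)
    print(f"Ingested {doc_path.name}: {len(chunks)} chunks")

print(f"Done: {total_chunks} total chunks stored")
```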
07:46
In the next question, we're asked to activate semantic search.
I build a tiny search-engine script: load the collection, embed three CEO-style queries, and fetch the top results by semantic similarity. It writes the results to a file and prints structured output.
The tests ensure the
08:03
script exists, the result files exist, and all three queries ran. Explore this question by yourself to read the results and see how well meaning-based search works.
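A minimal sketch of that kind of query script; the three queries and the result count are my own stand-ins for the lab's CEO-style questions:

```python
import chromadb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(name="techcorp_docs")

queries = [
    "What's our pet policy?",
    "Tell me about last year's service agreement with CodeCloud",
    "How do I request vacation time?",
]

for query in queries:
    query_vec = model.encode(query).tolist()
    results = collection.query(query_embeddings=[query_vec], n_results=3)
    print(f"\nQuery: {query}")
    for doc, meta, dist in zip(
        results["documents"][0], results["metadatas"][0], results["distances"][0]
    ):
        # Lower distance means a closer semantic match.
        print(f"  [{meta['source']}] distance={dist:.2f}: {doc[:80]}...")
```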
In the following question, we're asked to launch a simple web interface. I launch a Flask app on port 5000, then write a
08:19
running marker. [Music]
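A bare-bones sketch of what such a Flask app could look like; the route, JSON shape, and the placeholder answer_question helper are assumptions, not the lab's actual interface:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def answer_question(question: str) -> str:
    """Placeholder for the full RAG pipeline: retrieve from Chroma,
    augment the prompt, and generate with the LLM (see the earlier steps)."""
    return f"(RAG answer for: {question})"

@app.route("/ask", methods=["POST"])
def ask():
    data = request.get_json() or {}
    question = data.get("question", "")
    return jsonify({"question": question, "answer": answer_question(question)})

if __name__ == "__main__":
    # Simple web interface on port 5000, as in the lab.
    app.run(port=5000)
```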
08:43
[Music] We're asked to test like the CEO. I open the app and try questions such as "What's the pet policy?"
I watch the RAG flow: retrieve, augment, generate, with sources. Then I mark the test as tested.
This is where the demo value shines: answers
09:00
grounded in our private docs. With retrieval plus augmentation plus generation in place, we're ready for a UI to ask questions.
We've got an end-to-end RAG system that's fast, grounded, and extensible. A few things I paid special attention to along the way:
The model is all-MiniLM-L6-v2, which is
09:18
compact and effective. For chunking, a size of 500 with an overlap of 100 in the test, and a stride of 400 in ingestion (the same spacing expressed as a step size).
Both preserve context for better recall. For storage, a Chroma persistent client with the techcorp_docs collection.
For the web interface, a simple Flask app on port 5000
09:36
for quick evaluation. For safety, a similarity threshold keeps low-quality matches out, reducing hallucination.
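One way such a threshold might look when filtering Chroma query results; the cutoff value is an assumption:

```python
DISTANCE_THRESHOLD = 0.8  # assumed cutoff; lower distance = more similar

def filter_results(results: dict, threshold: float = DISTANCE_THRESHOLD) -> list[str]:
    """Keep only chunks whose distance to the query is below the threshold,
    so weak matches never reach the prompt."""
    kept = []
    for doc, dist in zip(results["documents"][0], results["distances"][0]):
        if dist < threshold:
            kept.append(doc)
    return kept
```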
Before wrapping up, explore these questions by yourself and notice how retrieval quality and answer helpfulness change with small tuning adjustments. That's it.
We
09:51
went from zero to a working RAG system backed by real tests, clean structure, and a demo interface. Go and try it yourself.
[Music]