Course Title: Prompt Compression and Query Optimization
Course Link: https://www.deeplearning.ai/short-courses/prompt-compression-and-query-optimization/
Course Instructor: Richmond Alake
Lesson 0: Introduction
Lesson Link: https://learn.deeplearning.ai/courses/prompt-compression-and-query-optimization/lesson/c14k8/introduction
Welcome to Prompt Compression and Query Optimization, built in partnership with MongoDB and taught by Richmond Alake. Richmond is a developer advocate at MongoDB and has worked as a machine learning architect and taught AI and ML for many years. Thanks, Andrew. This course shows you how to combine features of a mature, established database with vector search to reduce the cost of serving a large RAG application. Say you're building a conversational RAG application that helps users select a rental property. A user might enter a text query for a one-level ranch on a quiet street. You can use semantic search to find a close match to the user's description, using an embedding of the user's request and searching a vector database for homes with descriptions that match. But the user may also have hard requirements like three bedrooms, two bathrooms, and maybe no swimming pool. These are better handled with more traditional retrieval, selecting data based on fields in the database that explicitly store the number of bedrooms, bathrooms, and so on. In this course, you learn to use the best of both worlds: a traditional database with an added vector index. In RAG applications, you retrieve results that are provided to an LLM for final processing. If the retrieved context is very long, this results in a very long prompt and can thus be costly. Suppose each retrieval were to return, say, 10,000 tokens. If you were to run a rental comparison website serving, say, a million queries per day, and if LLM input tokens cost $10 per million tokens, you could be spending over $36 million a year. So, to help you reduce costs, this course will also cover ways to keep the retrieved results as small and relevant as possible. Thanks, Andrew. Let me describe some of the techniques you will learn. Let's consider your rental app. Filtering on the number of bedrooms or bathrooms can be done with a pre-filter or a post-filter. Efficient pre-filtering is done at the database index creation stage. 
You build a new index of entries that match common queries. So for example, if you know you frequently get queries for bedroom units, you can build an index that includes the bedroom field. So that's pre-filtering. In contrast, post-filtering is done after a vector search query is performed, where you then apply a filter to the results to select the subset matching the required condition. Large-scale applications may use both of these techniques simultaneously. Another technique to minimize the size of the output is something called projection, which selects a subset of the fields returned from a query. For example, out of 15 fields of a potential rental, you may want to return only three of them: name, number of bedrooms, and price. Now, you could implement all of these operations directly in your application, but the database can optimize all of these operations for performance and enforce role-based access control. So they are best accomplished there. And another powerful technique is reranking the results of a search. For example, after using the text embeddings of the renter's description to perform a semantic search, you can rerank the results based on other data fields, such as average star rating or number of ratings, to move the more desired results higher up the list of results, in order to then generate better context for the LLM. One final technique is prompt compression. If the retrieved information is very lengthy, feeding all this context into an LLM prompt results in a very long prompt, which is expensive to process. To reduce this cost, you can use a small, low-cost LLM fine-tuned to compress prompts before sending them to the final LLM. There are many opportunities to improve relevance and save costs. Thank you, Andrew. You will learn all these techniques in the next few lessons. You will start this course by implementing a vanilla vector search and end by implementing prompt compression. Many people have worked to create this course. From MongoDB, 
I'd like to thank Apoorva Joshi, Pavel Duchovny, Prakul Agarwal, Jesse Hall, Rita Rodrigues, Henry Weller, and Shubham Ranjan. Esmaeil Gargari from DeepLearning.AI has also contributed to this course. I hope you enjoy this course. Please go to the next video and let's dive in.
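The cost figure quoted in the introduction can be checked with a few lines of arithmetic. The sketch below is a minimal illustration and is not part of the course notebooks: the token count, query volume, and price come from the introduction, while the 365-day year, the function name, and the reduced figure of 2,000 tokens per query are assumptions chosen only to show the effect of a smaller context.

```python
# Back-of-the-envelope LLM input cost, using the figures quoted in the introduction.
TOKENS_PER_QUERY = 10_000        # retrieved context per query (from the introduction)
QUERIES_PER_DAY = 1_000_000      # queries served per day (from the introduction)
USD_PER_MILLION_TOKENS = 10.0    # LLM input price (from the introduction)
DAYS_PER_YEAR = 365              # assumption: the service runs every day


def annual_input_cost(tokens_per_query, queries_per_day,
                      usd_per_million_tokens, days=DAYS_PER_YEAR):
    """Return the yearly spend in USD on LLM input tokens."""
    tokens_per_day = tokens_per_query * queries_per_day
    usd_per_day = tokens_per_day / 1_000_000 * usd_per_million_tokens
    return usd_per_day * days


baseline = annual_input_cost(TOKENS_PER_QUERY, QUERIES_PER_DAY, USD_PER_MILLION_TOKENS)

# Hypothetical scenario: filtering, projection, and compression shrink the
# context to 2,000 tokens per query.
reduced = annual_input_cost(2_000, QUERIES_PER_DAY, USD_PER_MILLION_TOKENS)

print(f"baseline: ${baseline:,.0f} per year")
print(f"reduced:  ${reduced:,.0f} per year")
```

Because the cost is linear in the number of input tokens, any technique that shortens the retrieved context lowers the bill by the same proportion.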
Lesson 1: Vanilla Vector Search
Lesson Link: https://learn.deeplearning.ai/courses/prompt-compression-and-query-optimization/lesson/y8g9n/vanilla-vector-search
This lesson covers vector search and expands on RAG implementation. You explore MongoDB and Pydantic, a Python library crucial for data integrity. Understanding these tools will elevate the quality of your AI projects. Let's dive in. You understand that there is a vast amount of data on the internet right now, and there are a few ways to compare the similarity of one data point to another, or how close they are. A common method is text search, where you match a query keyword with parts of the content of a data point to compute a match. This is information retrieval in the basic sense, where you input a keyword or a search term, match it against several data points, and detect whether the keyword is in the content. And now you are about to learn how to retrieve data based on its context or meaning. The first step is to gather data. This can be structured data, like data organized in tables or spreadsheets with defined columns, or unstructured data, such as audio and image data. The next step involves passing the data as input to an embedding model. The output of an embedding model is a vector. At this point, you can say that the initial data has been vectorized, and you're left with a numerical representation of the data that captures the context and semantics of the data. This is referred to as a vector embedding. In a high-dimensional space, referred to as a vector space, you can compare the distance between two or more embedding vectors to get an indication of how similar they are in semantics or context. So far, you understand that vector search is an information retrieval technique that uses numerical representations of data, known as vector embeddings, to search and retrieve information. You also understand that traditionally, information retrieval relies on keyword matching, which searches for direct matches between query text and the text within the data set. 
However, vector search makes use of these embeddings to enable advanced functionality such as semantic search, which understands the context of the query; recommendation systems, which predict user preferences; and Retrieval Augmented Generation, or RAG, which provides additional context for LLM inputs. These capabilities make vector search a powerful tool in various AI applications. Once data, both structured and unstructured, has been collected and encoded into vector embeddings, there is a requirement to store the vectorized data in a specialized data store, referred to as a vector database. Within a vector database, to ensure efficient retrieval of vector data based on vector search queries, it is best practice to index the vector data. A vector search index is a specialized structure that optimizes the storage and retrieval of vector embeddings, allowing for efficient similarity searches. So when a vector search operation is performed, the index facilitates the efficient matching of the query vector against the data set, reducing the time needed to find the most similar vectors. And that takes you down the road of search, specifically vector search in retrieval augmented generation systems. Retrieval augmented generation, or RAG, is a system design pattern that leverages information retrieval techniques, including vector search, and foundation models to provide accurate and relevant responses to user queries. RAG achieves this by retrieving semantically similar data to supplement user queries with additional context, and then combining the retrieved information with the original query as input into large language models. For example, a typical process using a chat interface would be that you enter your chat message and then you get a response from the LLM. This is not the ideal process, as it doesn't use any relevant data. 
In the ideal process, you add relevant domain-specific data to the input to the LLM, and the large language model can then provide a relevant and context-aware response to your query. Now that you have an understanding of RAG, let's get an overview of the key benefits of the RAG design pattern for LLM applications. Building AI applications that leverage the RAG system design pattern provides a number of benefits, such as grounding the LLM response in relevant and up-to-date information, which reduces the chances of hallucinations, where the LLM essentially provides wrong or irrelevant information. With retrieval augmented generation, you also have the benefit of reducing the amount of information that is passed as input into the LLM. This can reduce the context you pass into the context window. With RAG, you also remove the need for fine-tuning LLMs in some scenarios. But more specifically, using retrieval augmented generation, you can utilize your own private or domain-specific data to ensure that LLM responses meet your specific requirements and needs. Now that you know that LLMs give better answers when supplemented with relevant context, you may wonder where and how to store this data. You may also ask, "How do I implement vector search for information retrieval in the first place?" That's where MongoDB comes in. MongoDB is a developer data platform that offers a NoSQL database with vector search functionality. In your AI applications, MongoDB can act as a storage solution for vector data, acting as a vector database. MongoDB offers even more functionality to act as a data store for operational and transactional data, making it a robust solution as a memory provider for LLM and AI applications, which include RAG and agentic systems. You're likely familiar with traditional relational databases. Let's use storing data on a house to illustrate how a relational database works. 
In a typical relational database, you might have the information about the house, such as the number of rooms and bathrooms, in one table and the address information of the house in another. With the document model, you model data based on the interactions that happen in the application, and not the other way around. So what is a document in MongoDB? A document is a basic unit of data that is similar to JSON. Each document is a set of key-value pairs and is the MongoDB equivalent of a row in a relational database. Let's see this in the house example we talked about earlier, where we had the house details and its address attributes. In this example, we have all the attributes allocated to a house in one document, including its address. This is an example of a document in MongoDB. Documents are dynamic, meaning they can contain varied fields and structures within the same collection. And a collection in a non-relational database is similar to a table in a relational database. The document model uses a JSON schema, which is a core data model across layers of the tech stack. For example, JSON helps transfer data between website parts and REST APIs in the application layer, and it is used for function calling in tool definitions in the model layer when implementing an agent. MongoDB enables flexible fields and varied data storage, with the ability to store different data types. To ensure your documents are structured properly, you should consider data modeling. Data modeling involves designing the structure of documents and collections to effectively represent and organize data. It's about planning how to store and link data across documents to optimize for the performance, scalability, and specific data access patterns of your application. At times, the layout of the components of your application is dictated by the structure and format of your data in a database. In the diagram here, this is represented by the directional arrow coming from the data to the application layer. 
This represents implementing the application layer based on the information in the data layer. But ideally, you want to start with the needs of your application first and not the data itself. You have to ask, "How would I access my data?" And that should determine how you will model the structure of your data. MongoDB enables you to use a familiar understanding of pipelines, which are present in data processing and machine learning. You can apply the concept of pipelines within the database layer. When conducting queries using MongoDB, you construct an aggregation pipeline. You can think of an aggregation pipeline as a sequence of data processing stages, where each stage transforms the data as it passes through. This process allows for complex query composition within MongoDB, as various stages of data transformation occur within the pipeline. Here's an example of an aggregation pipeline query. By the way, a query is just a fancy way of describing how to tell the database to produce the specific information you're looking for. Let's say you're managing data from a social media application with a collection of user posts. You want to find the most popular posts, defined by the number of likes, in January 2021. And perhaps you're interested in summarizing the average number of comments and likes per post by category. This aggregation pipeline filters the posts from January 2021, groups them by category, calculates the average likes and comments, and sorts the results by the average likes in descending order. By using the aggregation pipeline, you can leverage your understanding of sequential operations from machine learning and AI pipelines and apply similar logic to managing and analyzing data in MongoDB, making complex queries quite understandable and manageable. In AI applications, there is a need for data validation and for ensuring that data conforms to a certain model. This reduces the likelihood of having errors in production systems. 
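The social media example described above can be written down as a three-stage aggregation pipeline. The following is a minimal sketch rather than the course's own code: the field names created_at, category, likes, and comments, and the collection name in the comment, are hypothetical, since the lesson only describes the query in words. The stages are plain Python dictionaries, so the sketch runs without a database.

```python
from datetime import datetime

# Hypothetical document fields for a collection of user posts:
# "created_at" (datetime), "category" (str), "likes" (int), "comments" (int).

# Stage 1: keep only posts created in January 2021.
match_stage = {
    "$match": {
        "created_at": {
            "$gte": datetime(2021, 1, 1),
            "$lt": datetime(2021, 2, 1),
        }
    }
}

# Stage 2: group by category and average the likes and comments.
group_stage = {
    "$group": {
        "_id": "$category",
        "avg_likes": {"$avg": "$likes"},
        "avg_comments": {"$avg": "$comments"},
    }
}

# Stage 3: sort categories by average likes, highest first.
sort_stage = {"$sort": {"avg_likes": -1}}

# A pipeline is an ordered list of stages; each stage consumes the
# output of the stage before it.
pipeline = [match_stage, group_stage, sort_stage]

# With a pymongo collection object (hypothetical name), this would run as:
#   results = list(posts_collection.aggregate(pipeline))
for stage in pipeline:
    print(next(iter(stage)))
```

Keeping each stage in its own variable makes it easy to add, remove, or reorder stages, which is the composability the lesson refers to.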
Pydantic is a Python library used for data validation, modeling, and management. Pydantic offers features that enable the creation of data schemas that include a definition of the object and its properties. Pydantic also ensures that data conforms to defined schemas, data types, formats, and constraints. If data doesn't meet the validation criteria of a schema, Pydantic handles the error by raising an exception that details the specific validation issues. Before we dive into coding, let's review the data set. It consists of 5000 Airbnb listings hosted on HuggingFace, featuring details like address, description, transportation, reviews, and comments. For this course, you will use it to build an Airbnb listing recommendation system using RAG techniques. Each record or data point includes image embeddings of the listing photos and text embeddings from the content of the space attribute. The information in the space attribute has been processed by the OpenAI text-embedding-ada-002 model. Here are the steps we're going to take in the coding section for this lesson. You are going to load the data from HuggingFace. Then you will set up a database connection to access the database and the collection, into which you will then insert, or ingest, the data. And then you will conduct a vector search query using a query embedding and the embeddings within the collection. The last step will handle the user query and visualize any responses. Let's dive in. Before we get to the steps that we outlined in the slides, let's see what you will build in this lesson. In this lesson, you will be building a RAG recommendation system that uses vector search to pull relevant results from a vector database to add as additional context to an LLM, as you can see on the screen. You will also observe the execution time of the vector search query, the user question, and the system response. 
The system response is the response from the LLM, and it will include a recommended listing from the data set that you provided as additional context, and the reason for choosing this recommendation. You will also observe a table of the attributes of the data that was used as additional context, which will include the name, accommodates, and address. And this will be shown for all information retrieved from the vector search query. Let's get started. These are the libraries you use for this notebook, which are pre-installed and available for you on the DeepLearning.AI platform. Here, you will import the os module and load the environment variables which have been set within your development environment. We will load the Mongo URI and the OpenAI API key. These have been previously set up for you in the development environment. The first step is to load the data set. Here, we'll import the load_dataset function from the datasets library from HuggingFace, which allows us to access data sets from the HuggingFace platform by specifying the path. You will also import the pandas library, aliased as pd, which allows you to conduct data modification and analysis. The first step is to call the load_dataset function, passing it the path to the data set. In this case, this is the Airbnb embedding data set we spoke about earlier, which contains the text embeddings. You will set streaming to True and use the training partition of this data set. The output of this operation will be assigned to the variable data set. By calling take on the data set object and specifying the number of data points you want to extract from the data set, you can load a specified number of data points into your environment. The next line converts the data set into a pandas dataframe. This allows for analysis and data modification. The final step is to view the first data points in this data set. 
As you can see on the screen, we can visualize the first five data points and their attributes, including the values. Pause the video here and take some time to familiarize yourself with the values of each data point. To continue with the visualization of our data set and its data points, we will visualize the attributes of each data point. Here, you can see the various attributes that are captured in each data point within the data set, including the text embeddings. The next step is to conduct document modeling using Pydantic. First, we'll import several modules from Pydantic and also the datetime module from Python. In this lesson, you explore the full extent of the code, but in the next lessons, you will shorten the code with a new tools function, where all the extensive code will be placed and which you can call within the notebooks. In the modeling step, the first step is to create a class Host that essentially represents or defines the creator of a listing. We have attributes such as the host ID, the name, location, and response time. The next models we will create are the location and the address. These are used to model the location and address data in our data set and ensure that they conform to the expected types and that the data is present. You would then create another model for the review. This model will essentially hold the date of a review, the listing ID assigned to the review, the review ID and name, and any comments. The final model you will create is the parent model. This will be the listing model, which will assign all the previously created models to attributes within this model. This model will also contain its own attributes such as name, summary, description, transit, and other attributes. This is the key model that holds the information of an Airbnb listing. Now that you have created the models for each data point in the data set to conform to, you will now convert them into the appropriate data types. 
This line converts each data point into a Python dictionary and assigns it to a variable called records. Now, records holds all your listings from the data set. To ensure there are no null values, you will conduct a sanity check and replace any null values with None. For the final step in the data modeling process, you will convert each listing data point into a dictionary and assign it to a listings variable. You'll also print out the first instance or element within the listings data set to observe the attributes for each listing. As you can see on the screen, each listing has a name, summary, space, and other attributes. Pause the video here and take some time to familiarize yourself with the attributes. The next step is to create your database and connect to your database cluster. This is a crucial step. For the database creation and connection step, the first step is to import the libraries. You will import MongoClient from pymongo and SearchIndexModel from the pymongo.operations module. MongoClient allows us to create a client instance, and SearchIndexModel allows us to define a vector search index in the appropriate format. For the next step, you assign the database and collection names. The database will be called airbnb_data_set, which should be assigned to the variable database name. The collection will be called Listings Review, which will be assigned to the variable collection name. Now, you define a function called get Mongo client, which takes in the Mongo URI string. This is a string that represents a connection to your cluster. The get Mongo client function uses the MongoClient constructor, taking in the Mongo URI as its argument along with the app name, to create an object that represents a connection to the database cluster. Once a successful connection is made, this function will return the client object. You have now created the get Mongo client function. 
In the next cell you will use the function, but first conduct a sanity check to ensure you have the Mongo URI within your development environment. You will pass the Mongo URI into the get Mongo client function. The result from get Mongo client will be assigned to a variable called Mongo client. The Mongo client object provides you with the method get_database, which returns a database object, which in turn allows you to access the collection by calling the method get_collection on the database object. Running the cell will show a successful MongoDB connection. The last step in the database creation and connection stage is to clean any existing collection. The first time you run this function, the result of this will be zero because the collection has just been created. In future lessons, you will need to clean the collection and you will see records being deleted. The next step is the data ingestion step. For data ingestion, MongoDB provides a function that makes ingesting data into a MongoDB collection a trivial process. Simply call the insert_many function on the collection object and pass in the list of listings. Once this cell is completed, you should get an indication that the data ingestion has been completed successfully. The next step is to create the vector search index. This is a crucial step. Remember, the index allows for efficient information retrieval from the vector database. First, assign to the variable text embedding field name the name text embeddings. Text embeddings is the field that holds the vector embedding of the space attribute within each document in the collection. Next, assign to the variable vector search index name text the string vector index text. Vector index text is the name of your vector search index, and this will be referenced every time you make a vector search query. Now, you can use the search index model to create an appropriate definition of the vector search index and assign it to the variable vector search index model. 
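A definition of this kind might look like the sketch below. It is an illustration under stated assumptions, not a copy of the course notebook: the underscore spellings text_embeddings and vector_index_text are guesses at how the spoken names are written in code, and cosine is an assumed similarity function, since the lesson does not name one. The dimension of 1536 is the output size of the text-embedding-ada-002 model. Only the dictionary is built here, so no database connection is needed.

```python
# Assumed spellings of the names spoken in the lesson.
text_embedding_field_name = "text_embeddings"
vector_search_index_name_text = "vector_index_text"

# Index definition in the Atlas Search "knnVector" mapping format.
index_definition = {
    "mappings": {
        # Also index fields that are not listed explicitly below.
        "dynamic": True,
        "fields": {
            text_embedding_field_name: {
                "dimensions": 1536,      # size of a text-embedding-ada-002 vector
                "similarity": "cosine",  # assumption: the lesson does not name the function
                "type": "knnVector",     # marks the field as a vector
            }
        },
    }
}

# In the lesson, a definition like this and the index name are passed to
# pymongo.operations.SearchIndexModel(definition=..., name=...), and the
# resulting model is handed to collection.create_search_index(...).
print(index_definition["mappings"]["fields"][text_embedding_field_name])
```

The number of dimensions must match the embedding model that produced the stored vectors, otherwise the query vectors cannot be compared against them.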
In this cell, you will be creating your vector search index using the search index model. The result will be assigned to the variable vector search index model. The SearchIndexModel constructor takes as its argument a definition of your vector search index. The mappings field specifies how the fields are going to be indexed within the database. The dynamic field tells the database to index new fields that appear in the documents. The fields attribute indicates which field in a document holds the vector embedding. The text embedding field name is the variable that holds the string name of the text embeddings field. The dimensions field holds the value that indicates the size of a single vector embedding within our documents. The similarity field indicates the distance function used to compute the similarity between two vectors. The type knnVector tells the database that the type of the data stored is a vector. The last argument passed into the constructor is the name. This will allow the database to identify the vector search index created by the given name, vector index text. In the next cell, you conduct a check to ensure that the vector search index name selected doesn't already exist. This is good practice before creating any vector search index. Now, you call the create_search_index function on the collection object to create the vector search index. This is done only if the index doesn't already exist. You will observe on the screen an indication that the index was created successfully. Before moving on to the next cell, you can wait a minute to allow the vector index to be initialized. The final step in this process is to define a function called get embedding. The function get embedding takes in a text, which is the user query that's entered into the recommendation engine. 
We conduct a sanity check to ensure the text entered into the get embedding function is a string, and then you call the embeddings.create function from the OpenAI client to generate an embedding for a single data point. The get embedding function returns a numerical vector representation of the text that was passed into the function. The next step is to compose a vector search query. You start by defining a function called vector search. This function will take in the user query, the database object, the collection object, and the name of the vector index. This vector index argument has a default value, which corresponds to the name of the vector index created earlier. The first step inside the vector search function is to transform the user query into a numerical vector representation and assign it to the query embedding variable. You conduct a sanity check to ensure the query embedding is not empty before moving on to other steps within this function. The next step is to define the vector search stage. This will be the stage responsible for conducting the vector search operation that compares vector embeddings and computes the distance. Assign to a variable called vector search stage a JSON document that represents the vector search query you are constructing. In MongoDB, operators are represented using the dollar sign. The operator here is $vectorSearch, so this document represents a query for a vector search operation. The index field points to the name of the vector index to use for the query. The queryVector field takes in the query embedding, which is the embedded user query and is used to compute the distance to other candidate vectors from the database. The path field specifies the field where the vector embedding is held within the documents. The numCandidates field is the number of documents you want the vector search operation to consider. The limit field constrains the vector search output to just 20 results. 
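The vector search stage described above can be sketched as a small helper that returns the stage as a dictionary. This is an illustrative sketch, not the notebook's code: the helper name build_vector_search_stage, the underscore spellings of the index and field names, and the numCandidates value of 150 are assumptions, while the limit of 20 comes from the lesson. The three-element query vector is a toy stand-in for a real embedding.

```python
def build_vector_search_stage(query_embedding,
                              index_name="vector_index_text",  # assumed spelling
                              path="text_embeddings",          # assumed spelling
                              num_candidates=150,              # assumed value
                              limit=20):                       # from the lesson
    """Return a $vectorSearch aggregation stage as a plain dictionary."""
    # Same sanity check as in the lesson: refuse an empty query embedding.
    if not query_embedding:
        raise ValueError("query_embedding must not be empty")
    return {
        "$vectorSearch": {
            "index": index_name,              # which vector index to use
            "queryVector": query_embedding,   # the embedded user query
            "path": path,                     # document field holding the vectors
            "numCandidates": num_candidates,  # documents considered by the search
            "limit": limit,                   # documents returned by the stage
        }
    }


# Toy stand-in for an embedding returned by an embedding model.
query_embedding = [0.1, 0.2, 0.3]

vector_search_stage = build_vector_search_stage(query_embedding)
pipeline = [vector_search_stage]

print(pipeline[0]["$vectorSearch"]["limit"])
```

A larger numCandidates value generally trades query speed for recall, so it is usually set higher than limit.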
The next step is to define our pipeline. In MongoDB, a pipeline can be constructed by using a Python list and passing in the stages that were defined earlier. To create the pipeline for the vector search function, you have the variable pipeline, which takes in a list that includes the vector search stage. The next step is to execute the aggregation pipeline. To do this, call the aggregate function on the collection object and pass in the pipeline created previously. The result of this pipeline will be assigned to the variable results. For the final step of this vector search function, you will compute how long it takes for the vector search operation to complete, in milliseconds. This is done by passing into the command method on the database object the pipeline, the collection name, and an indicator to explain the execution of the command. This will provide you with an object that includes the execution stats of the vector search operation. The next lines extract the key information and print it in the notebook. The final step in the vector search function is to return the list of results. This is the last step of this lesson, where you will handle the user query and call all the functions you've defined earlier. To ensure the search results, or the documents returned from the database, meet a specific format, you use Pydantic to define a search result item model. Each result will take on the name, accommodates, address, summary, and other specified attributes. To handle user queries, you define a function called handle user query. The handle user query function takes in the user query, the database object, and the collection object as its arguments. And now you get to use the vector search function. Call the vector search function, pass in the query, the database object, and the collection object, and assign the results to a variable called get knowledge. 
Get knowledge will hold the list of the documents retrieved through the vector search operation. You conduct a sanity check to ensure get knowledge is not empty. Once you've obtained the search results from the vector search operation, held in the get knowledge variable, you will convert them and ensure they meet the model defined in the search result item. The next step is to convert the search results into a data frame. This allows for efficient modification of the search results. In this step, you will pass the query and the search results to the LLM. In this course, you are using GPT-3.5 Turbo as the LLM for the RAG system. Here you are specifying to the system that it's an Airbnb listing recommendation system and passing in the query along with the additional context held in the search results data frame variable. The following step extracts the response from the LLM and assigns it to a variable called system response. Next, you will print out the user query and the system response to visualize and observe the process occurring. The final step here is to display the search results as a table, which holds the additional context passed into the LLM as input. The handle user query function will return the system response. This is the final step for this lesson, where you assign a string representing a query to a variable called query, and pass the query into the handle user query function along with the database and collection objects. The query you are using for this lesson and course is one that asks the system to recommend an Airbnb listing that is warm and friendly and not too far from restaurants. Now, you run handle user query. Here you can see that the vector search operation took 0.02 milliseconds, which is very fast. From the print statement, you can identify the query that was passed into the vector search operation, which was then embedded before the vector search was conducted. 
The system response can also be observed. It has recommended the coziness heart of Plateau listing in Canada, and it has provided a reason why. Pause the video here to observe the reason. In this lesson, you learned how to load your data into a development environment, model your data using Pydantic, ingest the data into a connected MongoDB database, and perform a vector search operation. You essentially built a RAG pipeline. In the next lesson, you will explore adding filtering to your vector search operation, including pre-filtering and post-filtering. See you in the next lesson!
Lesson 2: Filtering With Metadata
Lesson Link: https://learn.deeplearning.ai/courses/prompt-compression-and-query-optimization/lesson/gj6ye/filtering-with-metadata-
In this lesson, you will develop a multi-stage MongoDB aggregation pipeline. You will discover how to use metadata to refine and limit the search results returned from a database operation, enhancing efficiency and relevancy. All right, let's have some fun. Metadata is simply additional information about a specific piece of data. It's meant to supplement the key data point and provide more context. For example, imagine an image of the Mona Lisa. The image is the key data, which can be stored as the vector embedding of the image itself. Other data such as the title, artist name, location, and more can be added to the image embedding as metadata. Both the key data and the accompanying metadata can be stored in MongoDB as a single document. By pairing the vector embedding with metadata, the overall image data becomes more informative and useful. Let's understand how metadata can be useful within LLM applications and the RAG design pattern. Metadata can be used to add context to the embedding data for improved relevance and understanding. Metadata also improves the relevance of vector search queries by enabling filtering and sorting based on attributes associated with the embedding data. This can reduce the scope of the vector search operation or its result. To streamline the results of a vector search operation and improve relevance, you will use metadata within filtering stages composed together with a vector search stage in an aggregation pipeline. You will use MongoDB's aggregation pipeline to create composable queries, which makes it easier to think about and implement complex queries. Creating an aggregation pipeline with filtering stages enables the database operation to produce more relevant results. Let's see how. One filtering technique is post-filtering. This is where a vector search stage is conducted first, and the results are then reduced based on certain criteria, referred to as a filter.
Imagine you have a user query that has certain keywords such as seaside and restaurant, but it also contains constraints such as a specified number of rooms and a capacity requirement. You can examine how a post-filtering process occurs by first starting with the full data set, then applying a vector search operation on the full data set to get results that are semantically similar to the user query. Then, in a post-filtering operation, you apply the filter stage after the vector search stage to further reduce the returned results based on specified criteria. Now, let's see another filtering technique known as pre-filtering. Using the pre-filtering technique within vector search can produce different results at the end of the database operation. Let's see how. Start with the same user query and the same data set. But this time, the filter operation or stage is applied to the data set first to remove documents that don't meet the filter criteria. After this initial reduction, vector search is applied to the result of the filter stage. The key difference to take away is that pre-filtering applies a filter to the data set before conducting the vector search. This approach reduces the subset of data that the vector search will process for similarity measurement. Another key takeaway when comparing the two techniques is that post-filtering can shrink the set of documents you end up with after the semantic similarity search. This means there could be a loss of information: records that are semantically similar to the user query may not be returned because the filter stage removes them. Now let's see what you're going to build in the coding section. You're going to set up a simple RAG pipeline, then add a post-filtering stage and observe the result. You will then handle the user query accordingly.
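To make the difference between the two orderings concrete, here is a small, self-contained sketch in plain Python. It does not use MongoDB at all: the five toy listings, their two-dimensional vectors, and the helper names are invented purely for illustration.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy data: made-up listings with made-up 2-d embeddings.
listings = [
    {"name": "A", "country": "United States", "vec": [1.0, 0.0]},
    {"name": "B", "country": "Canada", "vec": [0.9, 0.1]},
    {"name": "C", "country": "Canada", "vec": [0.8, 0.2]},
    {"name": "D", "country": "United States", "vec": [0.5, 0.5]},
    {"name": "E", "country": "United States", "vec": [0.1, 0.9]},
]


def top_k(docs: List[Dict], query: List[float], k: int) -> List[Dict]:
    """Stand-in for a vector search: the k most similar documents."""
    ranked = sorted(docs, key=lambda d: cosine(d["vec"], query), reverse=True)
    return ranked[:k]


def in_united_states(doc: Dict) -> bool:
    """Stand-in for the metadata filter."""
    return doc["country"] == "United States"


query_vector = [1.0, 0.0]
k = 3

# Post-filtering: search the full data set, then filter the top-k results.
post_filtered = [d for d in top_k(listings, query_vector, k) if in_united_states(d)]

# Pre-filtering: filter the data set first, then search what remains.
pre_filtered = top_k([d for d in listings if in_united_states(d)], query_vector, k)

print([d["name"] for d in post_filtered])  # ['A']
print([d["name"] for d in pre_filtered])   # ['A', 'D', 'E']
```

In this toy case the post-filter keeps only one of the three search results, while the pre-filter still returns three documents, because the filter ran before the top-k cut.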
You will then add a pre-filter stage to observe how the result differs between a post-filter and a pre-filter, and you will handle the user query accordingly. Let's code. You'll start by importing custom_utils. This module has been created to streamline some of the code you used in lesson one. You'll notice where custom_utils methods are used, and an explanation will be provided. The first step is to load the data. You're loading the same data you used in lesson one. To load your data and ensure it conforms to the model you saw in lesson one, you will use the process records method within the custom_utils module. This method takes in the data set you loaded and conforms each data point to the model specified in lesson one. The result of this operation is a Python list of data points, where each data point is an Airbnb listing that conforms to the specified model. The next step is to make a connection to the database. Connecting to the database has been moved to the custom_utils module. Calling the function connect to database will connect to the MongoDB cluster and obtain objects representing the database and collection. You unpack the returned results, which gives you a database object and a collection object. This is the same step you did in lesson one. Here, we've streamlined it to a simple function call. The next step is to ensure you are working with a clean collection. In lesson one, you ingested some data into your MongoDB collection. In this lesson, calling delete_many on the collection object deletes the data ingested into the collection in lesson one. You will observe the number of data points deleted by the delete_many operation. Here, we have 100 data points deleted. The next step is to ingest the data.
In this step, you ingest all the listing data that was created earlier into your MongoDB collection. By calling the insert_many method on the collection object and passing in the listings as its argument, the ingestion process begins. On the screen, you will see a print statement indicating that the ingestion process is completed. The next step is to create a vector search index. You created a vector search index in lesson one. In lesson two, this extensive code has been streamlined into a function within the custom_utils module named "Set up Vector Search Index". It takes in the collection on which to create the index. This is an expected result. You've gotten a duplicate index error. Recall that you already created an index in lesson one. Don't worry, you can carry on with the lesson. The next step is to compose the vector search query. This is a step you also took in lesson one, and it was one of the most important parts of the process. You will just go over the code again. It's still the same as lesson one. Recall that the vector search function takes in the user query, the database, and the collection as arguments, converts the user query into an embedding, uses the embedding within a vector search operation, creates a pipeline with the vector search stage and any additional stages, and calls the aggregate method with the pipeline to get the result. You also print out the execution time of the vector search stage, and finally the result of the database operation is returned by the vector search function. The next step is to handle user queries. The code for this lesson is similar to the previous lesson. The only difference is that we have different attributes, and you are using custom_utils to get the address model. In a similar fashion to the previous lesson, the handle user query function is pretty much the same. It takes in the user query, the database object, and the collection object as arguments.
It also takes some default arguments for the stages and the vector index name. You start the process by getting some search results from the vector search operation. The handle user query function also conforms the results of the database operation to the search result item model specified in the previous cell. The search results are converted into a dictionary and passed into the LLM together with the query as additional context. Finally, you extract the system response and print it out in the notebook in a structured manner. The handle user query function returns the final system response. Now, the fun begins, where you implement the post-filtering process that is conducted after the vector search operation. In this cell, we are specifying the path address.country. Essentially, the filtering you'll be conducting mimics a scenario where your app user only wants to see listings in the United States. First, you specify the path where the country is located within the document. This is in the address field, specifically in the country field. You specify the path to this and assign it to the variable search path. To create a match stage, you use the $match operator with the search path field, which takes in the string to search for. You will notice you're also adding another filter on the documents, based on the capacity a listing can accommodate. In real-life scenarios, there are situations where a listing would only want to take a certain number of people, or a guest would need room for a certain number of people. You can mimic this in your query by specifying conditional operators within your query. For this match stage, you will limit the documents returned to listings that accommodate more than one person and fewer than five people, and this is the filtering condition you will be adding after the vector search operation. Finally, you can pass in the match stage and assign it to a variable called additional stage.
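The match stage described above could look roughly like the following sketch. The stage itself is only a Python dictionary here; the small helper that mirrors the same condition on plain dictionaries is hypothetical and is included only so the logic can be checked without a database. An exact string comparison is assumed for the country.

```python
from typing import Any, Dict

# Path to the country field inside the nested address document.
search_path = "address.country"

# Post-filter: United States listings that accommodate more than 1
# and fewer than 5 guests.
match_stage: Dict[str, Any] = {
    "$match": {
        search_path: "United States",
        "accommodates": {"$gt": 1, "$lt": 5},
    }
}

# Stages to run after the vector search stage.
additional_stages = [match_stage]


def passes_post_filter(doc: Dict[str, Any]) -> bool:
    """Hypothetical local mirror of the match condition, for illustration."""
    country = doc.get("address", {}).get("country")
    guests = doc.get("accommodates")
    return country == "United States" and guests is not None and 1 < guests < 5


sample = {"name": "demo", "address": {"country": "United States"}, "accommodates": 2}
print(passes_post_filter(sample))  # True
```

Because both conditions sit in the same match document, a listing has to satisfy the country and the capacity range together to be kept.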
This cell contains the user query. It's the same kind of user query as in the previous lesson, where the user wants to stay in a place that's warm and friendly and not too far from restaurants. The outputs will also be similar: you see the user query, the system response, and the documents that were used as additional context, shown in a table. You will observe that the documents returned are limited to the United States because you are passing the additional stage that contains the match filter. This is the prompt you're passing to the system. You also pass in the additional stage, which is the match stage created in the previous cell. You will observe that the vector search operation took a fraction of a millisecond. The system recommended the easy one bedroom in Chelsea listing. One thing I want you to do is pause this video and observe the location of the documents used as additional context. You will notice they're all from the United States. You will also observe that the accommodates value for each data point is between one and five. Here, you can see that the locations are all in the United States. Let's begin with the second half of the fun, which is adding a pre-filter to the vector search operation. Here, you are applying the filter before the vector search operation is conducted. To conduct pre-filtering efficiently, in this cell you are creating a new vector search index. This is similar to the vector search index created in the previous lesson. The difference here is that we are creating the index with the fields you are going to filter the results of the database operation on. These are specifically the accommodates field that was used in the match stage previously, and the bedrooms field. It's important to create an efficient vector search index for your collection. This allows retrieval of information or documents to be performant and not take too long. As previously done, you need to name your vector search index.
You will call this specific index "vector index with filter", distinguishing it from the index created in the previous lesson. To create the index, you call the create_search_index function on the collection object, which returns the created index. Again, you'll be defining a vector search function. But this vector search function is going to be different. Let me explain how. It is similar to the vector search function you implemented in a previous cell. The difference is that in the vector search stage you are adding a filter field. This filter field applies the filter operation before conducting the vector search operation. This filter is similar to what you implemented in the match stage, where you specified a condition based on the accommodates field. In this filter, you are adding an additional condition to limit the documents considered for the vector search operation to ones that have seven bedrooms or fewer. With MongoDB, you can add conditional operations and let the database handle the logic. Using the $and operator, you can pass in conditions that limit the documents returned from database operations. This is essentially the pre-filtering step. And then the vector search process is conducted. The rest of the code remains the same: you create the pipeline, and then execute the pipeline to obtain the result. You will also view the execution time for the vector search query. The final step is to handle the user query, where you will be using the same query as specified previously. The difference here is in what you pass as arguments into the handle user query function. Observe that there is no additional stage being passed into handle user query for the database operation. What you pass in is the new vector search index that was created with a filter, and the vector search function now contains a filter component.
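A minimal sketch of the two pieces described above, the index definition with filter fields and the vector search stage with a filter, might look like this. The embedding field name, the number of dimensions, the similarity metric, the underscore form of the index name, and the exact accommodates bounds are all assumptions for illustration. Only the bedrooms condition (less than or equal to seven) comes directly from the transcript.

```python
from typing import Any, Dict, List

INDEX_NAME = "vector_index_with_filter"  # assumed spelling of the index name

# Index definition: one vector field plus the fields used for pre-filtering.
index_definition: Dict[str, Any] = {
    "fields": [
        {
            "type": "vector",
            "path": "text_embeddings",  # assumed embedding field name
            "numDimensions": 1536,      # assumed embedding size
            "similarity": "cosine",     # assumed similarity metric
        },
        {"type": "filter", "path": "accommodates"},
        {"type": "filter", "path": "bedrooms"},
    ]
}

# With a live collection (not created here), the index could be created
# roughly as:
#   from pymongo.operations import SearchIndexModel
#   model = SearchIndexModel(definition=index_definition,
#                            name=INDEX_NAME, type="vectorSearch")
#   collection.create_search_index(model)


def build_prefilter_stage(query_embedding: List[float],
                          limit: int = 20) -> Dict[str, Any]:
    """Vector search stage whose filter runs before similarity search."""
    return {
        "$vectorSearch": {
            "index": INDEX_NAME,
            "queryVector": query_embedding,
            "path": "text_embeddings",
            "numCandidates": 150,  # assumed value
            "limit": limit,
            "filter": {
                "$and": [
                    {"accommodates": {"$gt": 1, "$lt": 5}},  # assumed bounds
                    {"bedrooms": {"$lte": 7}},
                ]
            },
        }
    }


pipeline = [build_prefilter_stage([0.1, 0.2, 0.3])]
```

Note that every field referenced in the filter also appears as a filter field in the index definition, which is what lets the database apply the conditions before the similarity search.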
As you can observe, the database operation, specifically the vector search operation, took a fraction of a millisecond. And now the recommendation of the system is very different, as it recommended a Sydney Hyde Park city apartment. This is different from the recommendation in the earlier process. Pause the video here and observe the results returned. One thing you will notice is that the returned results meet the specified filters, which include the limits on accommodates and the limit imposed on bedrooms. Let's compare the two methods, post-filtering and pre-filtering. With the pre-filter, you will observe that the database operation returned 20 records. This is because you pre-filtered the documents, and the vector search operation returned the number of documents you specified. This is where you specified to return 20 documents. You will notice a difference with post-filtering: you specified 20 documents in the vector search operation, but only four were returned. These are your post-filtering results, where you got just four documents from the database operation. This is because you conducted the vector search operation and then placed a match stage after it, which filtered the returned documents down to the smaller number that match your condition. All right. In this lesson, you implemented a RAG pipeline with vector search that has a post-filtering step. Then you added a vector search operation with a pre-filtering step. And you observed the difference. In the next lesson, you're going to see how to reduce the amount of data returned from the database operation. See you there.
Lesson 3: Projections
Lesson Link: https://learn.deeplearning.ai/courses/prompt-compression-and-query-optimization/lesson/sge50/projections
In this lesson, you'll learn how to streamline the outputs from a database operation by incorporating a projection stage into the MongoDB aggregation pipeline. This will effectively reduce the amount of data returned, optimizing performance and data handling. All right, let's get on with it. In the previous lesson, you modified the fields within the documents returned after vector search and other stage operations by using pydantic and specifying the attributes that you wanted in the pydantic model. In this scenario, you're not using all the fields in the documents returned by the aggregation pipeline, but you are leaving it to the application layer to handle the removal of unwanted attributes or fields. This can have disadvantages such as increased network traffic and processing time, as unwanted data must still be transmitted and then filtered out at the application layer. With a MongoDB database, the inclusion or exclusion of specific fields can be handled as another stage in the aggregation pipeline. This is done through a technique known as projection, which outputs the same number of documents as the stage before it, but reduces the fields returned in each document. The projection technique within MongoDB works by specifying fields to include in or exclude from the final documents. For example, the document representation of the Mona Lisa painting we used in the previous lesson can be reduced to a select few fields using the $project operator in MongoDB, which you will get to implement in the code section soon. There are several advantages to projection. With the inclusion of projection, the overall memory usage at the application layer is reduced, as less data is passed as results from database operations. This can also contribute to reduced query execution time. And there is the case of security and privacy. Take, for example, a finance application where personal information and sensitive data are stored in documents.
It can be useful to have the database handle the logic of removing sensitive information before it is sent to downstream processes. This provides an overall improved sense of security in the application. In the coding section, you will go through familiar steps to implement a RAG system, add a filter stage, then add an additional projection stage, and then you will proceed to handle the user query. Let's code. You will start by importing custom_utils as you did in the previous lesson. Then move on to downloading the dataset from Hugging Face, as you also did in the previous lesson. You also load the listings dataset by conforming it to the pydantic model defined in custom_utils, just like you did in the previous lesson, then move on to connect to your database and delete the records in the collection, and you observe the number of records deleted. Just like the previous lesson, you will insert a new batch of records. Here, you're using a vector search index with a filter, similar to the one you created in the last lesson. This code has been moved into the custom_utils module, and you load it by calling the set up vector search index with filter function and passing in the collection object. This will create a vector search index that is optimized to retrieve data using the accommodates and bedrooms attributes. You will start by defining a search result model similar to what you've done in the previous lesson, this time with some new attributes such as score and notes. In the next cell, you implement the handle user query function. This is similar to the function that you created in the previous lesson. The main difference for this function is that we're printing out the list of fields in the first document. You're doing this by accessing the first element returned from get knowledge and iterating through the keys.
You are doing this to observe the fields of the documents that are allowed through the projection stage before they are limited and passed into the search result item model. Now, you're going to implement a projection stage and add it to the additional stages that will be passed into the vector search function. You're going to define a variable called projection stage and assign it a projection document. A projection stage is executed in a database operation and indicated by the $project operator. This command takes in a document that represents the fields that are to be projected. One thing to note is that every document returned by the aggregation pipeline will include an _id field. This is returned automatically. You can exclude it by setting zero as the value for the field. This is an exclusion. To include a field in a projection, you mention the name of the field, such as accommodates, and assign it the value one. This is an inclusion pattern. As you can observe, we follow the same pattern for the fields we want to project. By including fields to project, you are automatically excluding fields that are not mentioned. Now, that is all you need to do for the projection stage. But also notice you are adding a score field and assigning the value of the vector search score to that field. This is a way to get the similarity score of the vector search operation into the documents returned from the database operation. Now, place the projection stage into a Python list and assign that to the additional stage variable. One more thing: these are all the fields and attributes we want in our pydantic model, so they should be included in any of your projection documents. This cell will look familiar, as you've used it in a previous lesson. Here, you have the user query that looks for places that are warm and friendly. And here are the main changes.
In the handle user query function, you'll pass in the additional stage, a list which includes the projection stage. The handle user query function will also print the fields of the first document before the documents are processed by the pydantic model. Also, you're using the vector index with filter for this vector search operation. After running this cell, you'll observe that the vector search operation was executed in a fraction of a millisecond. You'll also observe that the fields included in the document are the ones we included in the projection. We have the name, summary, space, and other fields that we wanted to be projected and included in the documents. The results are still the same, and we're still getting the same documents. I want to show you one thing. When conducting a projection, it's important that you maintain the same pattern for every field indicated. Meaning, if you're conducting an inclusion, use the one pattern throughout. If you're conducting an exclusion, use the zero pattern throughout. The only exception to this rule is the _id field. Let me show you an example. You can change this number to zero to represent an exclusion. What will happen is a database operation failure. As you can see, you got an operation failure, and scrolling further down you will see the reason for it. The reason for this failure is an invalid project document, which was caused because you can't exclude a field in an inclusion projection. To fix the operation failure error, simply use the value one to follow the inclusion pattern. Now, your results are back. One thing to note is that because we've included a score field that shows the vector similarity search score, we can see that in the results as well. Let's have a look. Here, you can see the score field and the associated vector search similarity score. The vector search similarity score is a value between 0 and 1, with 1 being a very close similarity measure.
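Here is a hedged sketch of a projection stage along the lines described above, together with a small local check for the inclusion and exclusion rule. The list of projected fields is an assumption based on the fields named in the lesson, and the checker is a simplified illustration written for this example, not MongoDB's own validation logic.

```python
from typing import Any, Dict

# Inclusion-style projection: listed fields are kept, _id is dropped,
# and the vector search similarity score is exposed as "score".
projection_stage: Dict[str, Any] = {
    "$project": {
        "_id": 0,  # excluding _id is the one allowed exception
        "name": 1,
        "accommodates": 1,
        "address": 1,
        "summary": 1,
        "space": 1,
        "score": {"$meta": "vectorSearchScore"},
    }
}

additional_stages = [projection_stage]


def projection_mode(spec: Dict[str, Any]) -> str:
    """Simplified local check: report whether a projection document is an
    inclusion or an exclusion, and reject a mix of the two.
    The _id field and computed fields (such as score) are ignored."""
    flags = {
        int(value)
        for field, value in spec.items()
        if field != "_id" and isinstance(value, (bool, int)) and value in (0, 1)
    }
    if flags == {0, 1}:
        raise ValueError("cannot mix inclusion (1) and exclusion (0) in one projection")
    return "exclusion" if flags == {0} else "inclusion"


print(projection_mode(projection_stage["$project"]))  # inclusion
```

Changing any one of the included fields to zero makes the local check raise an error, which mirrors the operation failure shown in the lesson.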
You can pause the video here to observe the scores of the documents that are returned from the database operation. That concludes this lesson. In this lesson, you went through the typical pattern of setting up a RAG pipeline with vector search indexes, and you also ingested data into a database collection. The new thing you've learned in this lesson is how to create and add a projection stage to the aggregation pipeline to limit the fields returned from the aggregation pipeline query. In the next lesson, you will see how you can add boosting and improve the relevance of the vector search operation by looking at qualitative and quantitative data and using it to affect the ranking of documents returned from the aggregation pipeline. See you in the next lesson.
Lesson 4: Boosting
Lesson Link: https://learn.deeplearning.ai/courses/prompt-compression-and-query-optimization/lesson/b3sle/boosting
In this lesson, you dive into the techniques of reordering documents to improve information retrieval relevance and quality. You'll learn how to use specific metadata values to determine the reordering position. Let's go. There are scenarios where a document can contain other fields that affect its position within search results. Take, for example, an Airbnb listing with rating and number of reviews fields. These fields indicate qualitative and quantitative measures that can contribute to the relevance of a document with respect to a user query and search criteria. Taking the values of these fields into consideration in order to affect the position of a document in the list of returned search results is referred to as boosting. Why should you consider adding a boosting technique to your search queries? Vector search is an effective method of ranking documents based on semantic similarity. Although vector search scores and ranking are effective, metadata values can also contribute to document relevance, which can affect the ordering within search results. Using additional qualitative and quantitative measures to rank documents helps ensure that database operation results are credible and relevant to user queries and their search criteria. Boosting can also be used to make sure results meet user-specific requirements, which introduces personalization within search results. In the coding section, you're going to go through some familiar steps. The first is to set up a RAG pipeline and add the relevant stages. Then, you add boosting logic, which will use some mathematical operators available within the MongoDB database. And as usual, you handle the user query and visualize the results. Let's code. Start by importing your custom_utils module like you've done in the previous lesson. Move on to load the data, also like you've done in previous lessons. You can also take some time to view the attributes of each data point.
Move on to the document modeling, which loads the listings into a conformed model. This is similar to the process that you've carried out in previous lessons. The next step is to get objects for your database and your collection. Then start with a clean collection by calling delete_many on the collection object. This is similar to the process that has been carried out in previous lessons. Go through the data ingestion process and move on to the vector search index definition process, all similar to the previous lesson. Now, you define a search result item model for the results shown in this lesson. The attributes for each result need to contain a combined score, number of reviews, and average review score. These new attributes will be explained later. Again, just like in the previous lesson, you have the handle user query function with the exact same code. Now, we can get to the main aspects of this lesson. You'll be implementing boosting logic and adding it to the vector search operation conducted in the aggregation pipeline. Here, you are assigning a stage to a variable named "review average stage". In this cell, you are adding two new fields to every document returned from the database operation. The first field is the average review score. This is the qualitative measure I was talking about. The average review score goes through every review component of a document and takes the average of the review components. Within every document we can see the accuracy, the cleanliness, the check-in, and other review attributes of a listing. With the dollar operators, specifically $add, which conducts the mathematical operation of addition, we can get the sum of all the review components, and then we can divide it by the number of review components, which for the listings in our data set is six. This gives us an idea of what the average rating of a listing is.
That explains the new field that we're adding to every document, called the average review score. The second field that is added to every document is the review count boost. This is a quantitative measure, and this field takes the value of the number of reviews attribute in each document. This is how you can pass the value of one field to a new field: simply use the dollar sign followed by the name of the field. To add these new fields to every document in the database operation, you can add this process as a new stage, specifically the $addFields stage. That concludes adding the qualitative measure and the quantitative measure. In the next step, you'll need to add weights and determine how the qualitative measure and the quantitative measure should each affect the ranking of a document after a vector search operation. This is done by adding a new stage to our pipeline. This is the weighting stage. The weighting stage comes right after the review average stage. So, the weighting stage can reference the average review score and the review count boost that were added to each document in the review average stage. This is how you can reference the values of these fields from the documents. To implement the weighting logic, you will use several operators provided by the MongoDB database to conduct mathematical operations: the $add operator and the $multiply operator. With the $multiply operator, you multiply the value of the average review score, which is the qualitative measure, by a weight. I'm using a number between 0 and 1 as the weight. Then do the same for the review count boost, which is the quantitative measure that will be considered to rank the documents after the vector search operation. You then use the $add operator to combine the two results from the different multiplication operations. Assign this new summed value to the field combined score.
The combined score is the combination of the two multiplied values, and we can add the combined score to each document within the database operation by using the $addFields operator. This is the weighting stage. There is one more stage to complete this process. The final stage is the sorting stage. The sorting stage is very simple. Using the $sort operator, we can rerank the documents based on their combined score or another field. In this case, you are using the combined score and you are reranking in descending order. This is indicated by minus one. Ascending order would be indicated by a one. Now that you have all the additional stages implemented to add to the vector search operation, you can create a new variable called additional stages that takes a list of all the defined stages. The first is the review average stage, where we conduct a mathematical operation to obtain the qualitative and quantitative measures and add them as new fields to the documents after the vector search operation. Then there is the weighting stage, and then there is the sorting stage. All the stages are executed sequentially after the vector search operation. Remember, the vector search operation we're using in this lesson is a pre-filtering vector search, similar to the ones you've created in previous lessons. Now it's time for you to see the results of the boosting logic. Using the same query from previous lessons and also the same handle user query function, you will pass in the additional stages and make sure to use the vector index with filter. Here, you can observe that the vector search operation stage was conducted in a fraction of a millisecond. Now, let's observe the documents that were returned from this operation, which included a combination of stages to simulate boosting logic. Here you can see the results of the database operation that included multiple stages.
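The three boosting stages described above could be assembled roughly as in the sketch below. The review component field names, the camel-case names of the new fields, and the default weights are assumptions for illustration. The small combined score function mirrors the same arithmetic locally so the effect of the weights can be seen without a database.

```python
from typing import Any, Dict, List

# Assumed names of the six review components inside each listing document.
REVIEW_COMPONENTS = [
    "review_scores_accuracy",
    "review_scores_cleanliness",
    "review_scores_checkin",
    "review_scores_communication",
    "review_scores_location",
    "review_scores_value",
]


def build_boosting_stages(avg_weight: float = 0.9,
                          count_weight: float = 0.1) -> List[Dict[str, Any]]:
    """Return the review average, weighting, and sorting stages in order."""
    review_average_stage = {
        "$addFields": {
            # Qualitative measure: sum of the components divided by their count.
            "averageReviewScore": {
                "$divide": [
                    {"$add": [f"$review_scores.{name}" for name in REVIEW_COMPONENTS]},
                    len(REVIEW_COMPONENTS),
                ]
            },
            # Quantitative measure: copy the value of an existing field.
            "reviewCountBoost": "$number_of_reviews",
        }
    }
    weighting_stage = {
        "$addFields": {
            "combinedScore": {
                "$add": [
                    {"$multiply": ["$averageReviewScore", avg_weight]},
                    {"$multiply": ["$reviewCountBoost", count_weight]},
                ]
            }
        }
    }
    sorting_stage = {"$sort": {"combinedScore": -1}}  # -1 means descending
    return [review_average_stage, weighting_stage, sorting_stage]


def combined_score(doc: Dict[str, Any], avg_weight: float = 0.9,
                   count_weight: float = 0.1) -> float:
    """Local mirror of the stage arithmetic, for illustration only."""
    scores = [doc["review_scores"][name] for name in REVIEW_COMPONENTS]
    average = sum(scores) / len(scores)
    return avg_weight * average + count_weight * doc["number_of_reviews"]


def make_listing(name: str, score: int, reviews: int) -> Dict[str, Any]:
    """Build a made-up listing with the same score for every component."""
    return {
        "name": name,
        "review_scores": {component: score for component in REVIEW_COMPONENTS},
        "number_of_reviews": reviews,
    }


additional_stages = build_boosting_stages()
listings = [
    make_listing("few reviews, top rating", 10, 5),
    make_listing("many reviews, good rating", 9, 100),
]
ranked = sorted(listings, key=combined_score, reverse=True)
print([doc["name"] for doc in ranked])
```

With these assumed weights, the listing with many reviews ranks first even though its average rating is lower, which is the kind of effect the lesson points out when it discusses the weighting logic.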
The average review score is included along with the number of reviews and the combined score. Remember, the combined score takes the weights into consideration. Now, the documents shown are ordered by the combined score. You can pause the video here and observe the combined score of the other documents. One thing to note is that, because of the weighting logic we added, this document is ranked lower than the documents above it despite having a high rating, because it has a lower number of reviews. This is the impact of adding weights to the components you're considering for your boosting logic. One more thing. You can play with the weights and adjust the numbers to see how it affects the results. To do this, simply go back to the weighting stage and adjust the weights. Now I'm giving a higher weight to the review count and a lower one to the average review score. Once you've changed the weights, you can observe the results again. As you can see from the results, because you've given more weight to the number of reviews, a document with a high number of reviews is ranked higher than one with a high rating. Pause the video here to observe the results. In this lesson, you implemented a typical RAG system that conducts vector search, and you added multiple stages to the aggregation pipeline to simulate a boosting logic, which adds more relevance and context to the ranking of your documents after a database operation. In the next lesson, you'll learn how you can use prompt compression to reduce the size of the prompts that are sent to large language models in order to reduce operational costs. See you then!
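The weighting and sorting stages from this lesson can be sketched as MongoDB aggregation stages written as Python dictionaries. This is a minimal sketch, not the course notebook: the field names (averageReviewScore, reviewCountBoost, number_of_reviews, combinedScore), the two weights, the sample documents, and the rank_locally helper are all assumptions for illustration. The review count stage below only copies the review count; the average review score is assumed to have been added already. The helper repeats the weighting and sorting arithmetic in plain Python so you can see how the weights change the order without a database.

```python
# Sketch of the boosting stages. Field names and weights are assumptions
# chosen for illustration, not values confirmed by the course notebook.
REVIEW_SCORE_WEIGHT = 0.9  # weight for the qualitative measure (assumed)
REVIEW_COUNT_WEIGHT = 0.1  # weight for the quantitative measure (assumed)

# Copies the value of one field into a new field using the "$" field reference.
review_count_stage = {"$addFields": {"reviewCountBoost": "$number_of_reviews"}}

# Multiplies each measure by its weight, then adds the two products together.
weighting_stage = {
    "$addFields": {
        "combinedScore": {
            "$add": [
                {"$multiply": ["$averageReviewScore", REVIEW_SCORE_WEIGHT]},
                {"$multiply": ["$reviewCountBoost", REVIEW_COUNT_WEIGHT]},
            ]
        }
    }
}

# -1 sorts in descending order; 1 would sort in ascending order.
sorting_stage = {"$sort": {"combinedScore": -1}}

additional_stages = [review_count_stage, weighting_stage, sorting_stage]


def rank_locally(docs, score_weight, count_weight):
    """Plain-Python stand-in for the weighting and sorting stages."""
    scored = [
        {
            **doc,
            "combinedScore": doc["averageReviewScore"] * score_weight
            + doc["reviewCountBoost"] * count_weight,
        }
        for doc in docs
    ]
    return sorted(scored, key=lambda doc: doc["combinedScore"], reverse=True)


# Hypothetical documents: A has the higher rating, B has far more reviews.
sample_docs = [
    {"name": "A", "averageReviewScore": 9.8, "reviewCountBoost": 3},
    {"name": "B", "averageReviewScore": 9.0, "reviewCountBoost": 120},
]
ranked = rank_locally(sample_docs, REVIEW_SCORE_WEIGHT, REVIEW_COUNT_WEIGHT)
print([doc["name"] for doc in ranked])  # B ranks above A with these weights
```

With these sample values, listing B outranks listing A even though A has the higher rating, because the unscaled review count dominates the sum. Setting the count weight to 0 flips the order, which is the kind of experiment the lesson suggests.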
Lesson 5: Prompt Compression
Lesson Link: https://learn.deeplearning.ai/courses/prompt-compression-and-query-optimization/lesson/ujs5z/prompt-compression
In this lesson, you implement a cost-saving strategy, prompt compression, which is particularly valuable for applications like RAG and agentic systems. You gain an intuition of what prompt compression is, how to use it, and the operational advantages it brings to an LLM application. Let's get on with it. Many prompting strategies have emerged over recent years, such as in-context learning, chain of thought, and ReAct prompting. Getting appropriate and quality responses from an LLM is an art form. Most of these prompting strategies involve composing extensive text as input to the LLM. LLMs with large context windows are becoming the new norm. It is now common to see LLMs that can take an input of over 100,000 tokens, and in some cases even a million tokens. That is like passing an entire novel into an LLM in one inference call. Although useful in some cases, utilizing the full context window when accessing LLMs provided through REST API calls can become very expensive. LLMs with large context windows have their place in real-world applications, but the operational costs of these models can skyrocket. Take, for example, paying $10 per 1 million tokens in an application such as Airbnb, which has several million users per day. You would have a huge operational expense just from the volume of interactions alone. Not to mention that there will be increased latency in the response, as the model has to process more input to extract the appropriate information to respond to user queries. As you continue to learn and build AI applications that use LLMs, you will come across the idea of prompt compression, sometimes referred to as token compression. You might think you wouldn't have that much volume during the initial development of your AI application, but building robust AI applications requires thinking ahead about scalability and solving issues that might become bottlenecks. 
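To make the cost concern concrete, here is a small back-of-the-envelope sketch. Only the price of $10 per million input tokens comes from the lesson; the prompt sizes, the daily query volume, and the yearly_input_cost helper are hypothetical numbers and names picked for illustration.

```python
def yearly_input_cost(tokens_per_prompt, prompts_per_day, usd_per_million_tokens):
    """Estimate the yearly spend on LLM input tokens in US dollars."""
    tokens_per_day = tokens_per_prompt * prompts_per_day
    return tokens_per_day / 1_000_000 * usd_per_million_tokens * 365


# Hypothetical workload: one million prompts per day at $10 per million tokens.
uncompressed = yearly_input_cost(10_000, 1_000_000, 10.0)  # 10,000-token prompts
compressed = yearly_input_cost(2_000, 1_000_000, 10.0)     # 2,000-token prompts

print(f"uncompressed: ${uncompressed:,.0f} per year")
print(f"compressed:   ${compressed:,.0f} per year")
```

Under these assumptions the input cost scales linearly with prompt length, so a prompt that is five times shorter costs one fifth as much to send.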
You will implement a prompt compression technique in the code section of this lesson, and observe firsthand how easy it is to implement prompt compression alongside existing RAG pipelines. So, prompt compression is the process of reducing the number of tokens in a prompt. Let's see what this looks like in an example. On the screen, you can observe an original uncompressed prompt that spans three long sentences. By using the LLMLingua package, which you will use in the coding section of this lesson, we are able to reduce the uncompressed prompt to two sentences that span two to three rows. This is the power of prompt compression, which you will see firsthand in the coding section. I have included a link on the slide to the paper that presents the prompt compression technique. Feel free to read the research paper after this lesson. In a few minutes, you will compress an extensive prompt of a few thousand tokens down to a few hundred tokens. Passing input into the prompt compression technique is very straightforward. Imagine having an uncompressed prompt of 50,000 tokens. By just passing in the uncompressed prompt and specifying a few parameters using the prompt compression library LLMLingua, you can reduce the uncompressed prompt down to 10,000 tokens. This is a five times reduction. You can then pass the compressed prompt straight into the LLM, as you would with the uncompressed prompt, and receive the same quality output as if the LLM had been provided with the uncompressed prompt. You are about to see this in code. In the coding section of this lesson, you will go through some familiar steps, which include setting up the RAG pipeline, adding the relevant MongoDB stages, and then implementing the compression logic. And as usual, you will handle the user query and observe the results. Let's code. Start by importing the custom utils module, as you've done in previous lessons. You move on to load your dataset. 
Here you can observe the attributes, similar to previous lessons. The next steps are covered in previous lessons, where you modeled your documents, connected to your database, obtained objects for your database and your collection, deleted existing records within the collection, ingested new data, and lastly, created your vector search index. Just like in previous lessons, you handle the user query. You start by creating the search result item model, which specifies the attributes you want from the documents returned from the database operation. For this case, you have the name, address, and other corresponding attributes. Just like in the previous lesson, you add the additional boosting stages. You have the review average stage, the weighting stage, and the sorting stage. And then, you add all the additional stages into a variable called additional stages. All of these are steps you took in previous lessons. Now, we're in the main part of this lesson, where we have a handle user query function similar to the one you've seen in the previous lesson, but you are printing out the uncompressed prompt for observation. This is done in these two new print statements. Now, you have the same user query used in the previous lesson, and also the same handle user query function, with the only difference in this lesson being the print statements that let us see the uncompressed prompt. From the output, you can observe the time it took for the vector search operation to execute, which is a fraction of a millisecond. You can also observe the uncompressed prompt. Here, you can see that the uncompressed prompt is extensive. And do note that it's been truncated to ensure it all fits on the screen. You can also view the full content of the prompt by looking at the documents returned from the vector search operation, listed in the table. Pause the video here to take in the sheer size of what is being passed into the LLM. You will also notice that the system has recommended the homely room in five star new condo. 
Remember this listing. Now, this is the fun part. We're going to look at a technique that allows us to reduce the extensive prompt you observed before down to a few hundred tokens. You start by importing the PromptCompressor constructor from the LLMLingua library. Using the PromptCompressor constructor, you can specify a smaller language model that has been fine-tuned for prompt compression to compress the uncompressed prompt. You will also opt in to the latest LLMLingua prompt compression logic by specifying true as the value for the argument use_llmlingua2. You'll also tell the prompt compression module to run on the CPU of your device. Now that we've set up our prompt compressor, specifically a smaller language model to do the prompt compression, you can define the compress query prompt function, which takes in the uncompressed prompt as the query for the function. The prompt compressor module requires the input to be structured in a certain way, and that is a component-based structure with three fields: demonstration, instruction, and question. I will go over what this means. The demonstration holds the context, which is the additional information that is passed in to the LLM with the user query. This is essentially the documents that have been returned from the database operation. The instruction holds a specific instruction that tells the smaller language model how to compress the prompt. Finally, the question is the user query itself. Now, you can call the compress prompt method on the LLMLingua compressor that we initialized earlier. I'm going to explain what each argument does. The first argument is the context, split into separate entries on the new line character. The second argument takes in the instruction. Then comes the question. 
Next, we specify the target token count we want the uncompressed prompt to be compressed down to. Next, there is a specification of the compression algorithm to use. You'll be using the latest compression algorithm from LLMLingua, specifically LongLLMLingua. Next, we'll specify the context budget, but allow the budget to overrun by 100 tokens. Then we'll specify the compression ratio. This ratio indicates how the compression logic assigns tokens to the context, which is the demonstration, and to the overall instruction and question. Finally, we enable the compressor to reorder the context using a sort algorithm. These are all the arguments for the compress prompt method. The result of the compress query prompt function is a JSON representation of the compressed prompt that includes information such as the token count of the original uncompressed prompt and the token count of the compressed prompt. You will see this in action in a second. Now that you've specified a method for compressing an uncompressed prompt, you'll define the handle user query with compression function, which takes in the user query and conducts the compression that was defined earlier. This function is similar to the handle user query function from previous lessons, but the key difference is that there's a new input specified as query info. This query info follows the structure of our compression logic, which has the demonstration, instruction, and question. Remember, the demonstration is just the result from the database operation. The instruction tells the compression module how we want the compression to be executed. And the question is simply the user query. This structure is passed into the compress query prompt function defined earlier, and the result is assigned to a variable called compressed prompt. 
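The compressor setup and the compression call described above can be sketched as follows. This is a hedged sketch rather than the notebook's code: the model name, the target_token and ratio values, the demonstration_str key, and the example query_info are assumptions, and the compressor is passed in as a parameter so the function can be read and tried without downloading a model. The keyword arguments are parameters of LLMLingua's compress_prompt method.

```python
import json


def build_compressor():
    """Create an LLMLingua-2 compressor on the CPU (imports llmlingua lazily)."""
    from llmlingua import PromptCompressor  # requires `pip install llmlingua`

    return PromptCompressor(
        # Assumed model choice; the lesson only says "a smaller fine-tuned model".
        model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
        use_llmlingua2=True,
        device_map="cpu",
    )


def compress_query_prompt(compressor, query_info, target_token=500):
    """Compress the structured prompt and return the result as a JSON string."""
    result = compressor.compress_prompt(
        query_info["demonstration_str"].split("\n"),  # one context entry per line
        instruction=query_info["instruction"],
        question=query_info["question"],
        target_token=target_token,              # assumed target size
        rank_method="longllmlingua",
        context_budget="+100",                  # allow a 100-token overrun
        dynamic_context_compression_ratio=0.4,  # assumed ratio
        reorder_context="sort",
    )
    return json.dumps(result, indent=4)


# Hypothetical input following the demonstration / instruction / question structure.
query_info = {
    "demonstration_str": "Listing one: cosy flat near cafes\nListing two: quiet loft",
    "instruction": "Recommend the listing that best matches the question.",
    "question": "I want a place close to restaurants.",
}

# Uncomment to run the real compression (downloads the model on first use):
# compressed_prompt = compress_query_prompt(build_compressor(), query_info)
```

Passing the compressor in as an argument is a design choice made for this sketch; it keeps the slow model load separate from the function that shapes the call.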
To visualize the result, you print out the compressed prompt in a structured manner. Finally, the handle user query with compression function returns the search results and the compressed prompt itself. The final function you implement in this lesson is handle system response. The handle system response function passes the query along with the compressed prompt as input into the LLM. For visualization, you print out the query and the system response. Now, you can use the handle user query with compression function defined earlier by passing in the query, the database object, the collection object, the additional stages, and the vector search index you're using for this lesson. Execute the cell. The execution of the cell might take a few minutes, as you are using a smaller language model to compress the prompt. Although we have increased latency, the overall operational cost will be reduced. Here is the result of the prompt compression technique. Here, we have a compressed prompt field that holds the compressed prompt. And as you can see, this is much shorter than the prompt we saw earlier. But more importantly, the original uncompressed prompt was 4284 tokens long, and the new compressed prompt is just 512 tokens. This is a compression ratio of roughly eight times. There is also an indication of the cost you save when using this prompt compression technique and passing the compressed prompt as input to a GPT-4 model. In this case, for this particular call, we're saving $0.2. If you think about this for a large-scale application such as Airbnb, where several million inference calls are made to APIs, the savings can be in the hundreds of thousands of dollars. The last step of this lesson is to pass the compressed prompt and the user query to the large language model to get a system response. Confirm you have a compressed prompt, and then pass the compressed prompt and the query into the handle system response function. 
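The compression ratio and savings quoted above can be checked with a few lines of arithmetic. The two token counts come from the lesson output; the price per 1,000 tokens is an assumption chosen so that the per-call saving lands near the $0.2 shown in the lesson, not a quoted GPT-4 price.

```python
origin_tokens = 4284      # token count of the uncompressed prompt (from the lesson)
compressed_tokens = 512   # token count of the compressed prompt (from the lesson)
ASSUMED_USD_PER_1K_TOKENS = 0.06  # assumption, not a price stated in the lesson

# How many times smaller the compressed prompt is.
ratio = origin_tokens / compressed_tokens

# Cost of the tokens that no longer need to be sent, per call and at scale.
saving_per_call = (origin_tokens - compressed_tokens) * ASSUMED_USD_PER_1K_TOKENS / 1000
saving_per_million_calls = saving_per_call * 1_000_000

print(f"compression ratio: {ratio:.1f}x")
print(f"saving per call: ${saving_per_call:.2f}")
print(f"saving per million calls: ${saving_per_million_calls:,.0f}")
```

Under this assumed price, a saving of roughly 23 cents per call adds up to a few hundred thousand dollars over a million calls, which is consistent with the scale mentioned in the lesson.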
Here we can observe the result. The compressed prompt has provided us with a recommendation that meets the user query closely. Specifically, it's in a warm, friendly neighborhood, which was included in the space description of the listing. And it's also next to restaurants, which is what was specified in the user query. We saved on the operational costs of the RAG pipeline and obtained a quality output. This output is not the same as the output from the uncompressed prompt, but it's of similar quality and meets the requirements specified in the user query. The difference between the result of the uncompressed prompt, which provided us with this recommendation, and the result of the compressed prompt, which provided us with this recommendation, is quite minimal. This shows that with a lower token count, you can get outputs of similar quality from a large language model. This concludes this lesson. In this lesson, you learned how to create a RAG pipeline, performed a vector search, and conducted prompt compression to get a quality output, as you would with an uncompressed prompt. Here are some additional resources I recommend you take a look at after completing this lesson. The first resource is the MongoDB Developer Center, which contains tutorials, articles, and videos covering a variety of topics related to AI. Next, you have the GenAI Showcase repo, a repository of code showcasing different use cases for RAG and agentic systems. And finally, there is the DeepLearning.AI forum, where you can ask questions regarding this course.
Lesson 6: Conclusion
Lesson Link: https://learn.deeplearning.ai/courses/prompt-compression-and-query-optimization/lesson/bgsip/conclusion
Congratulations on completing this course. In this course, you implemented vector search. You optimized RAG systems by using metadata and the MongoDB aggregation pipeline to improve system efficiency and output relevance. And finally, you learned how to reduce the operational costs of LLM applications by using prompt compression. I'm looking forward to seeing what you build on your own.