STATSML 700: Sentiment-Aware Product Review Search with Retrieval Augmented Generation (RAG)
This tutorial demonstrates how to implement a Retrieval-Augmented Generation (RAG) architecture using Python, LangChain and Hugging Face Transformers.
Overview
This tutorial illustrates how to prototype advanced AI systems locally using Hugging Face Transformers, FAISS, and Python, creating a structured framework for building, testing, and iterating on solutions that integrate retrieval-augmented generation (RAG) and sentiment analysis capabilities. By shifting to local processing, this approach significantly reduces costs, ensures privacy, and removes reliance on external APIs. Hugging Face's open-source models enable Data Distiller users to overcome complex implementation challenges and develop functional prototypes efficiently, all while keeping sensitive data within their infrastructure. This approach is particularly valuable for privacy-conscious organizations and cost-sensitive projects.
By leveraging Hugging Face’s modular tools and pretrained models, you can refine specific components of the system, such as document retrieval accuracy or sentiment-aware response generation, without starting from scratch. This accelerates the validation process, enabling iterative improvements and rapid feedback loops. Local prototyping with Hugging Face not only reduces reliance on external APIs, which often incur ongoing costs, but also provides greater control over data flow, ensuring compliance with privacy regulations.
The sentiment-aware RAG tutorial showcases how Python’s ecosystem and Hugging Face Transformers enable seamless integration of sentiment metadata into retrieval and response pipelines. This local-first solution fosters innovative applications across domains, from financial sentiment analysis to product reviews and customer feedback categorization. Hugging Face’s pretrained models make it easy to extend this framework to specific industries, unlocking new possibilities without significant investment in computational resources. With Hugging Face’s accessible tools and Python’s versatility, businesses can rapidly visualize, test, and deploy solutions that provide actionable insights while maintaining cost efficiency and data security.
Case Study
In the e-commerce industry, providing an intuitive and engaging product search experience is critical for customer satisfaction and conversion rates. Customers often rely on product reviews to make informed purchasing decisions but are overwhelmed by the volume of unstructured feedback. This case study demonstrates how a sentiment-aware Retrieval-Augmented Generation (RAG) system can transform the product search experience by enabling conversational, sentiment-driven insights directly on the website.
Customers exploring a product catalog often have specific questions that require dynamic and detailed answers. Traditional search solutions, like keyword-based search bars, fail to provide nuanced responses and leave users frustrated. For example:
A customer might ask, "What do customers think about the durability of this product?" but only receive a list of generic reviews without context.
Another user searching for negative reviews about battery life may struggle to filter out irrelevant or overly positive results.
Beginners looking for summarized feedback might find the sheer number of reviews overwhelming.
To address these pain points, we need a solution that can:
Retrieve relevant reviews quickly and efficiently.
Analyze and incorporate sentiment to prioritize or filter feedback.
Provide conversational, natural language responses that summarize customer insights.
RAG Setup and Architecture
Setup Phase (Steps 1-4): Preparing the Data
Generate Embeddings for Reviews: The reviews (text data) are passed through a pre-trained embedding model, such as all-MiniLM-L6-v2. This model converts the reviews into numerical vector representations, known as embeddings. These embeddings capture the meaning of the reviews in a way that enables comparison and similarity detection.
Store Embeddings in a FAISS (Facebook AI Similarity Search) Vector Database: The generated embeddings are stored in a FAISS vector database. FAISS indexes these embeddings to enable efficient similarity searches. Each embedding represents a review and is indexed by its unique ID.
Include Metadata for Reviews: Metadata, such as sentiment or an ID for each review, is paired with the review content to form documents. These documents are stored in an in-memory data store. This step ensures that each embedding in the FAISS database is linked to the corresponding review details.
Set Up a Link Between Embeddings and Metadata: A mapping is created between the FAISS vector index and the document store, ensuring that the vector representation (embeddings) can be matched with the original review content and metadata. This mapping enables retrieval of relevant context during a search.
RAG Phase (Steps 5-9): Processing a Query
Generate Embeddings for the Query: When a question (query) is asked, it is converted into an embedding using the same model (all-MiniLM-L6-v2). This step ensures the query is represented in the same vector space as the reviews, enabling effective comparison.
Find Similar Reviews: The query embedding is compared against the embeddings in the FAISS vector database. FAISS uses Euclidean distance to identify the most similar reviews. This step narrows down the search to the most relevant matches.
Retrieve Review Content: The IDs of the top matches from FAISS are used to fetch the corresponding documents (review content and metadata) from the InMemoryDocstore. This step ensures that the retrieved results include both the vectorized data and the human-readable review content.
Use an LLM to Generate an Answer: The retrieved reviews are passed to a language model (LLM) for contextual understanding. The LLM processes these documents, understands their content, and generates a response based on the question.
Deliver the Final Answer: The LLM outputs the final answer to the query. This answer is grounded in the context of the retrieved reviews, ensuring it is relevant and informed.
Prerequisites
Download the dataset and ensure it is located in the same working directory where your Python script is running.
Python installed based on
Install Hugging Face Transformers from the Terminal:
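The install command is the standard one for the transformers package (no version pin is assumed here):

```bash
pip install transformers
```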
If you have JupyterLab running, you will need to restart it so that it can recognize these libraries. Go to the Terminal window, press Ctrl+C to kill the process, and relaunch it by typing jupyter lab at the command prompt.
Hugging Face provides a robust ecosystem for working with machine learning models, particularly for natural language processing (NLP). It offers:
Pre-trained Models: Hugging Face hosts thousands of models (e.g., GPT-2, BERT, T5) for tasks like text generation, translation, sentiment analysis, and more.
Transformers Library: The transformers library simplifies loading and using these models with pre-built pipeline functions, so you can perform tasks with minimal code.
Flexibility: You can fine-tune models for specific use cases or use them as-is.
Make sure you have also installed the following from the Terminal:
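A typical set of install commands for the remaining dependencies looks like this (exact package names, such as langchain-community versus a single langchain package, depend on your LangChain version):

```bash
pip install langchain langchain-community faiss-cpu sentence-transformers
```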
LangChain is a framework designed for integrating language models into complex, multi-step workflows. It enables:
Chains: Sequences of tasks, such as retrieving documents, processing context, and generating responses.
Vector Stores: Storing and searching through text embeddings for efficient document retrieval.
Retrieval-Augmented Generation (RAG): Combining retrieval and generation, so models can answer queries using both context and generation capabilities.
Interoperability: LangChain wraps external tools (like Hugging Face models) into its ecosystem for seamless integration.
FAISS (Facebook AI Similarity Search) is a library designed to efficiently handle vector similarity searches and clustering of large datasets. When used with Hugging Face and LangChain, FAISS acts as the retrieval backbone for managing and searching through vector embeddings.
Hugging Face Transformers Library
Hugging Face Transformers is an open-source library that provides access to a wide variety of pretrained transformer models, including BERT, GPT, and T5, among others. It is a versatile tool for tasks such as text generation, classification, question answering, and embeddings, making it a powerful alternative to OpenAI's closed ecosystem.
One key advantage of Hugging Face Transformers is its cost-effectiveness; since models can be run locally without relying on APIs, businesses save on recurring cloud costs and avoid rate limits. Additionally, using Hugging Face Transformers locally ensures data privacy, as no sensitive information needs to leave the organization’s infrastructure. This feature is especially valuable for industries with strict compliance requirements, such as healthcare or finance.
Here are the key Hugging Face models ideal for marketing applications, such as customer sentiment analysis, personalized recommendations, and content creation:
GPT-2: Suited for text generation tasks.
BERT: Ideal for understanding tasks like question answering, sentiment analysis, and classification.
T5: Versatile for both text generation and understanding tasks, following a text-to-text framework.
GPT-2 (Generative Pre-trained Transformer 2)
Content Generation: Generate engaging ad copy, product descriptions, and blog posts.
Chatbots: Power conversational AI for customer service and lead nurturing.
Personalized Messaging: Craft tailored email content or social media posts.
We will be using the basic GPT-2 (117M parameter model) in this tutorial.
Give this a try:
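The original code cell is not reproduced here; a minimal sketch of a GPT-2 text-generation call, with an illustrative prompt of our own choosing, might look like this:

```python
from transformers import pipeline

# Load the 117M-parameter GPT-2 model for text generation.
generator = pipeline("text-generation", model="gpt2")

# The prompt below is just an example; replace it with your own.
prompt = "What are the top features customers look for in a laptop?"
result = generator(prompt, max_new_tokens=50, num_return_sequences=1)
print(result[0]["generated_text"])
```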
You should get the following:
GPT-2 is a powerful language model that excels in generating coherent text but has several limitations. It is computationally intensive, especially in larger versions, requiring significant memory and processing power, which can hinder deployment on resource-constrained devices. GPT-2 struggles with understanding long-term dependencies in extended texts, limiting its effectiveness with very long documents. Without proper fine-tuning, it may underperform in domain-specific tasks due to a lack of specialized vocabulary understanding. Additionally, GPT-2 can produce grammatically correct but factually incorrect or nonsensical outputs because it lacks true reasoning capabilities, and it may reflect biases present in its training data, necessitating careful evaluation and post-processing in sensitive applications.
BERT (Bidirectional Encoder Representations from Transformers)
Sentiment Analysis: Analyze customer reviews, social media sentiment, or survey responses.
Search Optimization: Improve product search by understanding query intent and context.
Customer Segmentation: Classify and cluster customers based on behavior or preferences.
Give this a try and see how the answer is different:
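As a sketch, a BERT question-answering pipeline needs both a question and a context passage to extract the answer from; the model checkpoint and the example texts below are assumptions:

```python
from transformers import pipeline

# A BERT model fine-tuned on SQuAD for extractive question answering.
qa_pipeline = pipeline(
    "question-answering",
    model="bert-large-uncased-whole-word-masking-finetuned-squad",
)

context = (
    "The laptop has a sturdy aluminum body and the battery easily lasts "
    "a full workday. Several customers praised its durability."
)
question = "What do customers think about the durability of this product?"

answer = qa_pipeline(question=question, context=context)
print(answer["answer"], answer["score"])
```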
BERT is not designed for open-ended text generation. It excels in understanding and processing existing text. For BERT to answer questions, it needs a context passage to extract the answer from.
T5 (Text-to-Text Transfer Transformer)
Versatility: Converts any NLP problem into a text-to-text task, enabling tasks like summarization, translation, and text generation.
Automated Summaries: Create concise summaries of customer feedback or lengthy reports.
Multi-lingual Content: Generate marketing content or summaries in different languages.
The T5 model in the snippet below requires more setup compared to GPT-2 in the snippet above because T5 is a task-specific sequence-to-sequence model designed to handle multiple NLP tasks, such as translation, summarization, and question answering. It requires a task-specific prefix like question: or translate: to specify the context, which is necessary for the model to understand the desired output format.
T5 also uses the SentencePiece tokenizer, which must encode the input text into token IDs compatible with its architecture, ensuring accurate processing of subword units. Additionally, T5 allows fine-grained control over text generation with parameters like temperature, top_k, and top_p, which determine randomness and diversity in output. In contrast, GPT-2, as shown earlier, is a simpler autoregressive model that doesn't require a prefix or task-specific setup. GPT-2 is quicker and easier to implement, though less flexible for structured multi-task scenarios like T5.
First, try installing the tokenizer library SentencePiece, which is widely used for models like T5 and Flan-T5:
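The usual install command is:

```bash
pip install sentencepiece
```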
T5 uses SentencePiece as its subword tokenizer. SentencePiece allows the tokenizer to handle a variety of languages and create subword representations effectively. The tokenizer models included with Hugging Face T5 checkpoints (like t5-base) depend on SentencePiece to load the tokenizer model.
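A minimal sketch of the T5 setup described above; the prompt and the generation parameters are illustrative choices, not values from the original tutorial:

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Loading the tokenizer requires the sentencepiece package installed above.
tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# T5 expects a task-specific prefix such as "question:" or "translate:".
input_text = "question: What are the benefits of a long battery life in a laptop?"
input_ids = tokenizer.encode(input_text, return_tensors="pt")

# Sampling with temperature, top_k, and top_p controls randomness and diversity.
outputs = model.generate(
    input_ids,
    max_length=64,
    do_sample=True,
    temperature=0.7,
    top_k=50,
    top_p=0.95,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```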
The T5 (Text-to-Text Transfer Transformer) model is computationally intensive, especially in larger versions like T5-Large or T5-3B, requiring substantial memory and processing power, which can make deployment on resource-constrained devices challenging. The model's fixed input and output lengths limit its ability to handle very long texts or generate extended outputs, affecting tasks that involve lengthy sequences.
Without proper fine-tuning, T5 may underperform in domain-specific applications, failing to capture specialized vocabulary or nuances inherent to specific fields.
Additionally, like other large language models, it can produce outputs that are grammatically correct but factually incorrect or nonsensical, especially in complex reasoning scenarios. Lastly, T5 may inadvertently incorporate biases present in its training data, leading to biased or unfair outputs, necessitating careful evaluation and potential post-processing when deployed in sensitive applications.
Models and Parameters
Model parameters in the context of machine learning models like GPT-2, BERT, and T5 refer to the internal variables or "knobs" that the model adjusts during training to learn from data.
Imagine a machine learning model as a complex musical instrument with millions or even billions of adjustable dials and switches (the parameters). Each dial controls a tiny aspect of the sound produced. When all the dials are set correctly, the instrument plays beautiful music (produces accurate predictions or generates coherent text).
During the training process, the model "listens" to a lot of example music (training data) and learns how to adjust its dials to reproduce similar sounds. Each parameter is adjusted slightly to reduce errors and improve performance. The more parameters a model has, the more finely it can tune its performance, allowing it to capture intricate patterns and nuances in the data.
Here's how it relates to the models we mentioned before:
GPT-2: This model has variants with different numbers of parameters, ranging from 117 million to 1.5 billion. More parameters allow the model to generate more coherent and contextually relevant text because it can model more complex language patterns.
BERT: With versions like BERT-base (110 million parameters) and BERT-large (340 million parameters), BERT uses its parameters to understand and process language, enabling tasks like answering questions and understanding context.
T5: This model treats all tasks as text-to-text transformations and comes in sizes from 60 million to 11 billion parameters. The larger models can perform a wide variety of language tasks with greater accuracy due to their increased capacity.
| Model | Variant | Hugging Face ID | Parameters |
|-------|---------|-----------------|------------|
| GPT-2 | Small | gpt2 | 117 million |
| GPT-2 | Medium | gpt2-medium | 345 million |
| GPT-2 | Large | gpt2-large | 774 million |
| GPT-2 | Extra Large (XL) | gpt2-xl | 1.5 billion |
| BERT | Base Uncased | bert-base-uncased | 110 million |
| BERT | Large Uncased | bert-large-uncased | 340 million |
| T5 | Small | t5-small | 60 million |
| T5 | Base | t5-base | 220 million |
| T5 | Large | t5-large | 770 million |
| T5 | 3 Billion (3B) | t5-3b | 3 billion |
| T5 | 11 Billion (11B) | t5-11b | 11 billion |
Perform Sentiment Analysis
The raw review data looks like:
VADER (Valence Aware Dictionary and sEntiment Reasoner)
The VADER (Valence Aware Dictionary and sEntiment Reasoner) Sentiment Analyzer is a tool designed to determine the sentiment expressed in text. It is particularly good at analyzing text that includes opinions, emotions, or casual language, such as product reviews, tweets, or comments.
At its core, VADER uses a pre-built dictionary of words and phrases, where each word is assigned a sentiment score based on its emotional intensity. For example:
Positive words like "amazing" or "great" have high positive scores.
Negative words like "terrible" or "awful" have high negative scores.
Neutral words like "book" or "laptop" have little to no sentiment score.
When analyzing a sentence, VADER looks at each word, sums up the sentiment scores, and adjusts for factors like punctuation, capitalization, and special phrases. For example:
Words in ALL CAPS (e.g., "AWESOME!") are treated as having stronger sentiment.
Punctuation like exclamation marks (!) also boosts emotional intensity.
It also accounts for:
Negation: Words like "not" or "never" can flip the sentiment of a phrase. For instance, "not great" is identified as negative.
Intensity Modifiers: Words like "very" or "extremely" amplify sentiment, while words like "slightly" or "barely" reduce it. For example, "very bad" is more negative than just "bad."
Emoticons and Slang: VADER recognizes common emoticons (e.g., ":)", ":( "), slang (e.g., "lol"), and abbreviations, making it ideal for social media or casual text.
Building a sentiment analyzer, like VADER, is achievable in Data Distiller using its integrated machine learning models and pipelines. Data Distiller allows you to create an end-to-end workflow for sentiment analysis by leveraging labeled sentiment data and custom ML models. Using transformers, you can preprocess text data by tokenizing, normalizing, and extracting features such as word embeddings or term frequencies. These features can be fed into machine learning models like Logistic Regression for sentiment classification.
Assign Sentiment Metadata
Analyze the sentiment of each review using the VADER sentiment analyzer and attach sentiment metadata.
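A minimal sketch of this step; the file name reviews.csv, the column name review, and the score thresholds are assumptions, so adapt them to the downloaded dataset:

```python
import pandas as pd
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
df = pd.read_csv("reviews.csv")  # hypothetical file name

def label_sentiment(text: str) -> str:
    # VADER's compound score ranges from -1 (most negative) to +1 (most positive).
    score = analyzer.polarity_scores(text)["compound"]
    if score >= 0.05:
        return "Positive"
    if score <= -0.05:
        return "Negative"
    return "Neutral"

# Attach the sentiment label as metadata and write the enriched CSV.
df["sentiment"] = df["review"].apply(label_sentiment)
df.to_csv("reviews_with_sentiment.csv", index=False)
```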
The output CSV file should look like this:
Introduction to Vector Embeddings
A vector embedding is a way of converting text (like product reviews) into a numerical representation (a vector) that computers can process and analyze. These embeddings capture the meaning and relationships between words in a mathematically useful format. For example, sentences like "This laptop is amazing!" and "Great laptop performance!" convey similar meanings. Embeddings convert these sentences into vectors that are close to each other in a mathematical space, facilitating tasks like sentiment analysis, clustering, and similarity comparisons.
We are using Hugging Face Sentence Transformers in this setup: embeddings are generated locally using a pre-trained transformer model like all-MiniLM-L6-v2. These embeddings play a central role in structuring and enabling efficient similarity searches for customer reviews:
Pretrained Knowledge: The Hugging Face model is trained on extensive datasets, allowing it to understand nuanced meanings. This enables handling domain-specific or complex queries effectively.
Contextual Understanding: The model produces embeddings that are context-aware, meaning it captures relationships between words. For instance, "battery" in "battery life" has a distinct embedding from "battery of tests."
Privacy and Cost Efficiency: Unlike cloud-based embeddings (e.g., OpenAI models), Hugging Face models run locally. This ensures data privacy and eliminates reliance on paid external APIs.
Customizability: The model can be fine-tuned with domain-specific data to improve accuracy and adaptability for tailored applications.
We will be using the Hugging Face model all-MiniLM-L6-v2, which creates high-quality embeddings for product reviews. These embeddings are stored in a FAISS vector database, enabling efficient similarity searches. Here's how the workflow comes together:
Load Reviews: Customer reviews and their metadata (e.g., sentiment) are loaded from a CSV file.
Generate Embeddings: The reviews are transformed into numerical embeddings using the Hugging Face model.
The most common embedding models, based on performance and community adoption, are:
all-MiniLM-L6-v2: This model is perfect for marketing tasks that require a balance between speed and accuracy. Whether you're conducting semantic search to match user queries with the most relevant product descriptions or performing customer review clustering to identify common themes, this model delivers reliable results. With its 384-dimensional embeddings, it's lightweight and efficient, making it ideal for real-time marketing applications in resource-constrained environments, such as on-device personalization.
all-mpnet-base-v2: For high-precision marketing tasks, this model excels at capturing semantic nuances. Its 768-dimensional embeddings make it the go-to choice for applications like paraphrase identification, ensuring consistent messaging across campaigns, or textual entailment, which helps determine whether user-generated content aligns with your brand's values. This precision is invaluable for tasks such as refining campaign strategies based on nuanced customer feedback.
multi-qa-MiniLM-L6-cos-v1: Designed for multilingual marketing, this model shines in global campaigns. Supporting multiple languages, it is optimized for question-answering tasks, enabling businesses to create smart search tools that instantly connect users to the right information. Its 384-dimensional embeddings make it highly effective in cross-lingual semantic search, allowing marketers to target diverse audiences with personalized and contextually accurate content, bridging language barriers seamlessly.
Different vector embeddings produce distinct representations because they are tailored to specific use cases. These variations stem from differences in model architecture, training data, and the intended application. For example, traditional embeddings like Word2Vec and GloVe emphasize word relationships through co-occurrence, while modern models like BERT or Hugging Face Sentence Transformers take context into account, generating richer and more nuanced representations.
The choice of training data significantly impacts the embedding's performance. Models trained on general-purpose datasets provide broad applicability across tasks, whereas domain-specific embeddings, such as those trained on legal, medical, or financial texts, excel in specialized applications. Furthermore, embeddings can be optimized for diverse goals, including semantic similarity, sentiment analysis, or intent recognition. This adaptability ensures that the selected embedding model aligns precisely with the requirements of a given use case, offering the flexibility to tackle a wide range of tasks effectively.
The dimensionality of vector embeddings—the number of components in each embedding vector—significantly impacts how well these embeddings capture the underlying characteristics of the data. Higher-dimensional embeddings have the capacity to represent more nuanced and complex relationships because they can encode more features and patterns present in the data. This can lead to better performance in tasks like semantic similarity, classification, or clustering. However, increasing the dimensionality isn't always beneficial; it can introduce challenges such as higher computational costs and the risk of overfitting, where the model learns noise instead of meaningful patterns. Conversely, embeddings with too few dimensions might oversimplify the data, failing to capture important details and leading to poorer performance. Therefore, the choice of embedding dimensions is a balance: enough to encapsulate the necessary information without becoming inefficient or prone to overfitting. The optimal dimensionality often depends on the complexity of the data and the specific requirements of the task at hand.
Introduction to FAISS (Facebook AI Similarity Search)
FAISS (Facebook AI Similarity Search) is a lightweight and efficient vector database optimized for local use, making it an excellent choice for fast and scalable similarity searches. Unlike cloud-native alternatives, FAISS is designed to run entirely on local hardware, making it a cost-effective solution for developers who prioritize privacy and control over their data. For marketing applications, FAISS enables real-time retrieval of semantically similar data, such as analyzing customer reviews to identify sentiments or finding related products based on specific customer preferences, such as "affordable smartphones with excellent camera quality."
FAISS is particularly well-suited for scenarios where lightweight and local infrastructure is needed. Its design minimizes resource consumption while maintaining high performance, allowing teams to run advanced similarity searches without the need for expensive cloud services. For example, marketers can store and search vector embeddings locally, ensuring data privacy and avoiding latency issues often associated with cloud solutions.
Unlike cloud-based solutions such as Pinecone, FAISS provides unparalleled control over indexing and searching, giving developers the flexibility to tune their workflows for specific needs. However, it lacks built-in support for metadata filtering, which requires manual integration with external tools like pandas or JSON files. For teams that require complete data ownership and are comfortable with some additional setup, FAISS is an excellent choice for building recommendation engines, designing targeted ad campaigns, and conducting in-depth sentiment analysis. With its simplicity and local-first architecture, FAISS empowers marketing teams to prototype and deploy sophisticated AI-driven applications efficiently and privately.
The choice of vector database matters significantly, as it impacts the performance, scalability, and functionality of our system. Vector databases are specifically designed to handle high-dimensional numerical data (embeddings), enabling tasks like similarity search and nearest neighbor retrieval. Different vector databases, such as FAISS, Weaviate, Pinecone, or Milvus, offer distinct features and optimizations that may suit specific use cases.
FAISS is optimized for speed and efficiency in handling very large datasets, making it ideal for applications where real-time similarity searches are critical.
Weaviate and Pinecone provide additional functionality, like metadata filtering and integrations with external systems, making them suitable for production environments where complex queries are needed.
The choice also depends on whether you prioritize on-premises solutions (e.g., FAISS) or managed cloud services (e.g., Pinecone). Moreover, the vector database's support for various indexing techniques, scalability, and ease of integration with your embedding generation pipeline can significantly influence the system's overall effectiveness. Thus, the vector database complements the embeddings and ensures that your application can efficiently retrieve the most relevant results based on similarity.
A vector database is fundamentally different from a traditional database in how it stores and retrieves data. Traditional databases are optimized for structured data, like rows and columns, where queries are based on exact matches or straightforward filtering (e.g., finding all products under $50). In contrast, a vector database is designed to handle unstructured data, such as text, images, or audio, by storing high-dimensional numerical representations called embeddings. Instead of exact matches, queries in a vector database focus on finding similar data based on proximity in a mathematical space. For example, in a product review system, a vector database can retrieve reviews similar in meaning to a user’s query, even if they don’t share the exact words. This capability makes vector databases ideal for applications like recommendation systems, natural language processing, and image recognition, where similarity and contextual understanding are more important than precise matches.
Besides FAISS (faiss), there are several other popular Python packages you could use for local vector similarity search and indexing. One notable alternative is Annoy (annoy), developed by Spotify, which is efficient in memory usage and provides fast approximate nearest neighbor searches, making it suitable for static datasets where the index doesn't require frequent updates. Another option is HNSWlib (hnswlib), which implements Hierarchical Navigable Small World graphs and excels in high-performance approximate nearest neighbor searches with dynamic updates, ideal for real-time applications that demand both speed and accuracy. NMSLIB (nmslib) is also widely used and offers flexibility by supporting various distance metrics and algorithms for fast approximate nearest neighbor search. While FAISS is highly regarded for its performance on large-scale, high-dimensional data and remains one of the most popular choices in the machine learning community, alternatives like Annoy and HNSWlib might be preferred depending on your specific project requirements, such as data size, need for dynamic updates, computational resources, and ease of integration.
Store in Vector Database FAISS for Retrieval
In this part of the tutorial, we're setting up a system that helps us find similar product reviews based on their meaning. Think of each review as being converted into a list of numbers (as an "embedding") that captures its essence. To organize and search through these numerical representations efficiently, we create something called an index using FAISS, a library designed for this purpose.
We start by telling the system how long each list of numbers is—this is the dimension of our embeddings (in this case, 384 numbers per review). Then, we initialize the index with a method called IndexFlatL2. The term "flat" means that the index will store all our embeddings in a simple, straightforward way without any complex structures. The "L2" refers to using the standard way of measuring distance between two points in space (like measuring the straight-line distance between two spots on a map).
By setting up the index this way, we're preparing a tool that can compare any new review to all existing ones by calculating how "far apart" their embeddings are. Reviews that are closer together in this numerical space are more similar in content. The variable index now holds this prepared system, and we're ready to add our embeddings to it. Once added, we can quickly search through our reviews to find ones that are most similar to any given piece of text.
Create Embeddings
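The original code cell is not reproduced here; the following minimal sketch, assuming the review texts have already been loaded into a Python list named reviews, follows the steps explained below (the HuggingFaceEmbeddings import path varies across LangChain versions):

```python
import faiss
import numpy as np
from langchain_community.embeddings import HuggingFaceEmbeddings  # from langchain.embeddings in older versions

# Load the local embedding model.
embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

# Convert each review into a 384-dimensional embedding.
embeddings = embedding_model.embed_documents(reviews)

# Convert to a NumPy array so we can inspect dimensions and pass it to FAISS.
embeddings_array = np.array(embeddings, dtype="float32")
dimension = embeddings_array.shape[1]

# Create a flat index that uses Euclidean (L2) distance, then add the embeddings.
search_index = faiss.IndexFlatL2(dimension)
search_index.add(embeddings_array)
```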
It is important for us to understand some of the key parts of the code above:
Load the Embedding Model: HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2") loads a pre-trained AI model called all-MiniLM-L6-v2. This model is specifically designed to transform text (customer reviews) into vector representations, i.e., embeddings.
Generate Embeddings: embedding_model.embed_documents(reviews) processes each review in the reviews list and converts it into a numerical representation called an embedding. Each embedding captures the essence or "summary" of the review in a format that AI systems can easily compare and analyze.
Convert Embeddings to a NumPy Array: The embeddings generated in the previous step are stored as a Python list. To work with them efficiently, np.array(embeddings) converts this list into a NumPy array (embeddings_array). NumPy arrays are faster and support advanced operations like getting dimensions.
Get the Size of Each Embedding: The shape attribute of the NumPy array tells us its structure. Here, embeddings_array.shape[1] retrieves the number of dimensions in each embedding (e.g., 384 for the all-MiniLM-L6-v2 model).
Create a FAISS Index: FAISS is a tool to efficiently search for similar embeddings. faiss.IndexFlatL2(dimension) creates a flat index that uses Euclidean distance to measure similarity between embeddings.
Add Embeddings to the Index: search_index.add(embeddings_array) adds the embeddings (as vectors) into the FAISS index, making it ready to perform similarity searches. For example, you can now search for reviews that are similar to a given review or query.
Data Preparation for Vector Store
The strategy for data preparation in the vector store involves restructuring the data to ensure all relevant elements—reviews, their metadata, and embeddings—are readily accessible and interlinked. Each review is paired with its associated metadata (such as sentiment or ID) to create Document objects, which provide context-rich units of information. These documents are stored in an InMemoryDocstore for quick retrieval, and a mapping is created between FAISS index IDs and the corresponding document IDs in the docstore. This approach integrates the raw text, structured metadata, and vector representations into a unified system, enabling efficient similarity searches while preserving the ability to trace results back to their original details. By organizing the data in this way, the vector store becomes a powerful tool for querying and retrieving meaningful insights.
Remember that one is a document representation and the other is a vector representation. Here's the explanation of the modular architecture:
Document Representation (InMemoryDocstore): The InMemoryDocstore stores the actual content of the documents (reviews in this case) along with their metadata, such as sentiment or any other associated details. It's essentially a structured repository that holds the human-readable information and contextual details.
Vector Representation (FAISS Index): FAISS stores the numerical embeddings (vector representations) of the reviews. These embeddings are mathematical representations of the textual content, capturing their semantic meaning. FAISS uses these vectors for similarity searches.
When you use a docstore with FAISS, it doesn't mean that the document content itself is stored in FAISS. Instead, it provides a way to link the vector representations in FAISS to their corresponding documents in the InMemoryDocstore:
Mapping with index_to_docstore_id: Each vector in the FAISS index is assigned an ID. The index_to_docstore_id mapping below connects these FAISS vector IDs to the IDs of the documents in the docstore.
Pointer Mechanism: When a similarity search is performed in FAISS, it retrieves vector IDs for the closest matches. These IDs are then used to look up the associated Document objects in the InMemoryDocstore.
This setup keeps FAISS optimized for fast numerical computations (vector searches) while delegating the task of managing document content and metadata to the docstore. It's a division of responsibilities:
FAISS handles efficient retrieval of relevant vectors.
The InMemoryDocstore enriches the retrieval process by adding contextual information from the original documents.
This approach ensures the system remains modular and efficient while providing comprehensive query responses.
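The code under discussion is not reproduced here; a minimal sketch is shown below. It assumes the reviews list, a parallel metadata list of dicts (with keys such as sentiment and id), and the embedding_model and search_index built earlier; import paths and the embedding_function argument vary across LangChain versions.

```python
from langchain_community.vectorstores import FAISS          # langchain.vectorstores in older versions
from langchain_community.docstore.in_memory import InMemoryDocstore
from langchain.docstore.document import Document

# Pair each review with its metadata to form Document objects.
documents = [
    Document(page_content=review, metadata=metadata[i])
    for i, review in enumerate(reviews)
]

# Store the documents in an in-memory docstore keyed by their index as a string.
docstore = InMemoryDocstore({str(i): doc for i, doc in enumerate(documents)})

# Map FAISS vector positions to docstore IDs.
index_to_docstore_id = {i: str(i) for i in range(len(documents))}

# Combine the embedding function, FAISS index, docstore, and mapping.
vector_store = FAISS(
    embedding_function=embedding_model,  # some versions expect embedding_model.embed_query here
    index=search_index,
    docstore=docstore,
    index_to_docstore_id=index_to_docstore_id,
)
```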
Let us now understand this code:
Create documents combining reviews and metadata:
A list of Document objects is created, where each Document contains a review (page_content) and its associated metadata (metadata[i]), such as sentiment or ID. This links each review to its additional details for better context during retrieval.
Set up an in-memory storage:
The documents are stored in an InMemoryDocstore, a temporary storage solution, where each document is assigned a unique string key (its index as a string). This allows for easy retrieval of the original reviews and their metadata during searches.
Create a mapping between index IDs and document IDs:
A dictionary called index_to_docstore_id is created, mapping each numerical index in the FAISS vector store to the corresponding document ID in the docstore. This ensures that when a match is found in the FAISS index, the correct document can be retrieved.
Combine everything into a unified vector store:
A FAISS object is created to integrate the embedding function (for summarizing new queries), the FAISS search index (for similarity searches), the docstore (for original reviews and details), and the index-to-docstore mapping. This unified tool simplifies the workflow, allowing queries to be processed, matched, and linked to their original content seamlessly.
Create a Retrieval-Augmented Generation (RAG) System
The Retrieval-Augmented Generation (RAG) concept addresses the limitations of standalone language models (LLMs) by incorporating external context to improve response relevance and accuracy. When an LLM is asked a question without context, it generates answers based solely on its pre-trained knowledge, which can result in randomness or hallucinations—plausible-sounding but incorrect responses.
RAG mitigates this by integrating a retriever mechanism that fetches relevant context (e.g., documents or specific knowledge) from a database or vector store based on the query. This retrieved context is then provided to the LLM alongside the query, grounding the generation process in more accurate, up-to-date, or domain-specific information.
Remember that RAG improves how LLM answers questions by giving it helpful context to work with, such as related documents or information from a database. This makes the responses more accurate and relevant. However, mistakes will still happen if the retrieved documents don’t have enough useful information or if the AI misunderstands the content. Even with these limitations, RAG is a powerful approach for getting more reliable and context-based answers, especially in areas where accuracy and relevance are important.
Understanding Retrievers
A retriever is a core component in information retrieval systems, designed to find and return relevant pieces of information based on a query. Conceptually, it acts as a bridge between a user's query and a large knowledge base, enabling efficient and targeted searches. Retrievers work by comparing the query to the stored representations of data, such as vector embeddings or indexed documents, to identify the most similar or relevant items.
Retrieving Similar Vectors:
The retriever uses the FAISS vector store for similarity search. When a query is made (e.g., "What do customers say about battery life?"), the query text is transformed into a vector embedding using the same embedding_function used during setup. FAISS searches the stored vectors in the search_index to find the k most similar vectors to the query embedding, based on Euclidean distance or another similarity metric.
Connecting to the docstore:
FAISS returns the IDs of the top k closest vector embeddings. These IDs are mapped to their corresponding document IDs using the index_to_docstore_id dictionary.
Fetching Documents:
The docstore is then queried using these document IDs. It retrieves the actual document content (e.g., the original review) and metadata (e.g., sentiment, ID) associated with each retrieved vector.
Returning Results:
The retriever compiles the matching documents, including their metadata, into a format that can be used by downstream components (e.g., question-answering pipelines like RAG).
Retrieval with Generation - The RAG Pipeline
Let us now analyze the code that wires these pieces together. It sets up a Retrieval-Augmented Generation (RAG) pipeline that combines a retriever with a language model to generate context-aware responses.
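The original listing is not shown; a minimal sketch of the pipeline described below, reusing the retriever created above (import paths vary slightly across LangChain versions):

```python
from transformers import pipeline
from langchain_community.llms import HuggingFacePipeline  # from langchain.llms in older versions
from langchain.chains import RetrievalQA

# Hugging Face text-generation pipeline (GPT-2).
text_generator = pipeline("text-generation", model="gpt2", max_new_tokens=50)

# Wrap the Hugging Face pipeline so LangChain can use it as an LLM.
llm = HuggingFacePipeline(pipeline=text_generator)

# Combine the retriever and the LLM; "stuff" concatenates all retrieved context.
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    chain_type="stuff",
)

result = qa_chain.invoke({"query": "What do customers say about battery life?"})
print(result["result"])
```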
Text Generation Setup:
The pipeline("text-generation", model="gpt2", max_new_tokens=50) call creates a text generation model (GPT-2) capable of generating text based on the input. The pipeline function is not a LangChain function but comes from Hugging Face's transformers library.
The HuggingFacePipeline wrapper comes from LangChain. It acts as a bridge to integrate Hugging Face's models into LangChain's ecosystem, allowing Hugging Face models to be used seamlessly in LangChain workflows, like RetrievalQA or other chain-based pipelines.
We wrap text_generator with HuggingFacePipeline to make it work with LangChain. Hugging Face's pipeline generates text on its own, but LangChain needs models to follow its format to work well with tools like retrievers and chains. The HuggingFacePipeline acts like a translator, connecting the text generator to LangChain, so everything works together smoothly in the retrieval and question-answering process.
Retriever Role:
The retriever is already connected to the vector_store created earlier, which maps query vectors to relevant documents. When a query is provided to the RAG pipeline, the retriever first identifies the most relevant documents (or text chunks) from the vector database by comparing the query's embedding with stored embeddings.
Combining Retrieval with Generation:
The RetrievalQA.from_chain_type method combines the retriever and the LLM (llm) into a unified pipeline. The retriever fetches the most relevant context (e.g., product reviews or document snippets) based on the query. This retrieved context is then fed to the language model, which uses it to generate a more informed and contextually accurate response.
Chain Type: The chain_type="stuff" argument specifies how the retrieved documents are handled. In this case, all retrieved context is concatenated ("stuffed") into a single input for the language model.
RetrievalQA is a class in the LangChain framework designed to enable Retrieval-Augmented Generation (RAG) workflows. Its primary purpose is to combine a retriever (for finding relevant documents or data) with a language model (LLM) to produce accurate and context-aware responses to user queries. The retrieved documents are prepared (e.g., concatenated or summarized) based on the chain_type.
In LangChain's RetrievalQA, the chain_type determines how retrieved documents are processed and presented to the language model (LLM) to generate a response, offering flexibility for various use cases.
The stuff chain type, as mentioned earlier, concatenates all retrieved documents into a single input and sends it to the LLM, making it simple and efficient for small sets of concise documents, though it may exceed token limits for larger contexts.
The map_reduce chain processes each document independently to generate partial responses in the "map" step and combines them into a final answer in the "reduce" step, ideal for contexts too large to fit into a single call.
The refine chain handles documents iteratively, refining the answer with each additional document, ensuring thorough consideration of all retrieved data, which is useful for in-depth analyses.
Lastly, the map_rerank chain scores each document for relevance during the "map" step and selects the most relevant one to generate the response, making it effective for scenarios with numerous retrieved documents requiring prioritization.
This setup ensures that the model's responses are grounded in the most relevant information retrieved by the retriever, reducing hallucination and making the output more reliable and context-aware. The retriever ensures that the LLM works with targeted, high-quality data rather than relying solely on its pre-trained knowledge.
Incoherent Results
The results are quite disappointing
The incoherence in the response is likely due to the combination of several factors:
Model Choice (GPT-2): The gpt2 model is a general-purpose language model and is not specifically fine-tuned for tasks like summarization or retrieval-augmented question answering. It might struggle to provide coherent responses when fed raw retrieved contexts without fine-tuning or adaptation for the task.
"Stuff" Chain Type: The chain_type="stuff" setting concatenates all retrieved contexts into a single input before passing it to the LLM. If the retrieved documents contain repetitive or slightly mismatched information, the model might not handle this well and generate confusing responses. For example, repeated statements like "Battery drains quickly" can confuse the LLM's summarization process.
Quality of Retrieved Context: If the documents retrieved by FAISS contain irrelevant or overly similar content, the LLM's ability to generate a cohesive answer diminishes. This happens because the model is trying to summarize redundant or poorly aligned input.
Token Limit and Truncation: If the combined context exceeds the model's token limit, parts of the context may be truncated. This can lead to partial or incomplete information being passed to the model, resulting in incoherence.
Absence of Explicit Instruction to the LLM: Without explicit prompts or instructions on how to format the response, the LLM might generate an answer that mixes context with the response, as seen in the output. GPT-2 works better when given very clear prompts.
Data Quality Issues in Retrieved Contexts: If the retrieved documents themselves contain incomplete, repetitive, or poorly structured text, the final response will reflect those issues. The model can only work as well as the data it is provided with.
Coherency and Relevance
We make the following changes:
Better Model (flan-t5-base): Replace gpt2 with flan-t5-base, which is fine-tuned for tasks like summarization and QA. This ensures more accurate and coherent answers.
Improved Chain Type (refine): Switch from stuff to refine. This chain type ensures that each retrieved document is processed iteratively, allowing the model to refine its answer with each step.
Cleaner and Clearer Prompt: Update the query to explicitly ask for a summary: "Summarize what customers say about battery life in the reviews."
Maximum Token Limit Increased: Increased max_new_tokens to 100 to give the model more flexibility in generating coherent answers.
Let us run this code"
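A sketch incorporating these changes; the Hugging Face hub ID google/flan-t5-base and the text2text-generation task are the usual way to load this model, and the rest reuses the retriever from earlier:

```python
from transformers import pipeline
from langchain_community.llms import HuggingFacePipeline
from langchain.chains import RetrievalQA

# flan-t5-base is a seq2seq model, so it uses the text2text-generation task.
text_generator = pipeline(
    "text2text-generation",
    model="google/flan-t5-base",
    max_new_tokens=100,
)
llm = HuggingFacePipeline(pipeline=text_generator)

# The refine chain processes retrieved documents iteratively.
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    chain_type="refine",
)

query = "Summarize what customers say about battery life in the reviews."
result = qa_chain.invoke({"query": query})
print(result["result"])
```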
The response is
Dynamic Sentiment Filtering
In this section, the goal is to enhance the context provided to the Language Model (LLM) by enriching it with additional metadata extracted from the relevant documents. This process involves gathering all the documents that are related to the query and compiling their content, along with their metadata, to create a richer, more detailed context. The metadata can include supplementary information such as sentiment, review IDs, or other attributes that add depth and specificity to the query. By combining these documents and their associated metadata, the input sent to the LLM becomes more comprehensive, enabling it to generate more accurate, informed, and contextually relevant responses to the user's question. This step ensures that the LLM has access to all the necessary details to answer the query effectively.
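The code being described is not reproduced here; the sketch below follows the steps explained under "Understanding the Code", reusing the retriever and the flan-t5 text_generator from the previous section. The question list, sentiment labels, and prompt wording are illustrative assumptions.

```python
# Each entry pairs a question with an optional sentiment filter (None = no filter).
questions = [
    ("What do customers say about battery life?", "Negative"),
    ("What do customers think about the camera quality?", "Positive"),
]

for question, sentiment_filter in questions:
    # Find the reviews most relevant to the question.
    docs = retriever.get_relevant_documents(question)

    # Optionally keep only reviews whose sentiment metadata matches the filter.
    if sentiment_filter:
        filtered = [d for d in docs if d.metadata.get("sentiment") == sentiment_filter]
        if not filtered:
            print(f"No documents found with sentiment '{sentiment_filter}'")
            filtered = docs
        docs = filtered

    # Combine the retrieved reviews and their metadata into a single context block.
    context = "\n".join(
        f"{d.page_content} (sentiment: {d.metadata.get('sentiment')})" for d in docs
    )

    # Ask the LLM to answer the question based on the combined reviews.
    prompt = (
        "Here are the customer reviews:\n"
        f"{context}\n"
        f"Based on these reviews, answer the following question: {question}"
    )
    response = text_generator(prompt)[0]["generated_text"]

    print(f"Question: {question}")
    print(f"Answer: {response}\n")
```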
The response will be:
Understanding the Code
Loop Through Questions: There are a few questions (like "What do customers say about battery life?") and an optional sentiment filter (e.g., "Positive" or "Negative"). The loop goes through each question one by one.
Find Relevant Reviews: For each question, the program looks for reviews that are related to the question using a "retriever." Think of this as finding the most relevant reviews from a library.
Filter by Sentiment: If you're only interested in reviews with a specific sentiment (e.g., only positive reviews), it will filter the results to include only those matching your preference. If no matching reviews are found, it will print a message saying, "No documents found with sentiment 'Positive'" and fall back to using all the reviews.
Combine Relevant Reviews: Once the relevant reviews (filtered or unfiltered) are ready, it combines their content into a single block of text. This is like creating a summarized "cheat sheet" of what customers are saying.
Ask the LLM to Generate an Answer: Using the combined reviews, the program creates a "prompt" (a detailed question) for the AI along the lines of: "Here are the customer reviews. Based on these reviews, answer the following question." The question (like "What do customers say about battery life?") is included in the prompt.
Generate the Response: The AI reads the prompt, processes the reviews, and writes an answer to the question.
Display the Answer: Finally, it prints the question and the AI's response.
Going to Production
LLM Model Hosting
Model hosting is a critical component of deploying machine learning models like Hugging Face Transformers in production. You have two main options: hosting the model locally or using managed services. Managed hosting solutions, such as Hugging Face Inference API, AWS SageMaker, or Google Cloud AI Platform, simplify infrastructure management by providing pre-configured environments and scalable endpoints for inference. For example, AWS SageMaker allows you to deploy pre-trained models with minimal effort, enabling your backend to call these endpoints for generating responses. If you host the model locally, it can run alongside a FAISS index for efficient similarity searches, but this approach requires managing server resources and scaling manually. Managed services, on the other hand, ensure consistent performance during high traffic by leveraging cloud infrastructure, making them ideal for applications with fluctuating demand.
Doc Store Hosting
Local Hosting: The InMemoryDocstore used in development can be directly hosted on your server alongside the application. It is suitable for small-scale use cases or prototyping but not ideal for production where persistence and scalability are needed.
Managed Databases: Migrate the doc store to cloud-hosted NoSQL databases like MongoDB Atlas, AWS DynamoDB, or Firestore. These services allow you to persist metadata (e.g., review details and sentiment) and ensure scalability and durability.
FAISS Index Hosting
Local Hosting: Host the FAISS index on the same machine as the model and application backend.
This works well if your index size is manageable and you do not expect high traffic or scalability issues.
Cloud Hosting:
Custom VM Instances: Deploy FAISS on cloud services like AWS EC2, Google Cloud Compute Engine, or Azure VMs.
These instances can handle larger datasets and high query throughput.
Serverless Functions: For smaller FAISS indexes, services like AWS Lambda or Google Cloud Functions can be configured to load and query the FAISS index on-demand.
Docker/Kubernetes: Containerize the FAISS index with tools like Docker and deploy it on Kubernetes clusters (e.g., AWS EKS, Google Kubernetes Engine).
FAISS on Managed Services: Tools like Pinecone or Weaviate offer vector search as a managed service, abstracting the infrastructure for FAISS-like functionality. These services handle indexing, scaling, and querying vectors, removing the need for manual FAISS management.
Pipeline Infrastructure
The pipeline is primarily implemented in a backend service that handles:
Query Processing:
Vectorizing the user query with Hugging Face Embeddings.
Searching the FAISS index for relevant documents.
Optional Filtering:
Filtering retrieved documents based on metadata, such as sentiment.
Context Creation:
Preparing the context (e.g., concatenating retrieved reviews) for the LLM.
Response Generation:
Using an LLM (e.g., Hugging Face Transformers) to generate enriched responses.
This backend service can be built using frameworks like FastAPI, Flask, or Django for Python, which allows for easy integration with the vector search and the LLM.
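As an illustrative sketch only, a FastAPI endpoint wrapping the query-processing steps above might look like the following; the /search route, the payload shape, and the reuse of the retriever and text_generator objects built earlier are assumptions, not part of the original tutorial.

```python
from typing import Optional

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class SearchRequest(BaseModel):
    question: str
    sentiment: Optional[str] = None  # e.g., "Positive" or "Negative"

@app.post("/search")
def search(request: SearchRequest):
    # 1. Query processing: vectorize the query and retrieve similar reviews.
    docs = retriever.get_relevant_documents(request.question)

    # 2. Optional filtering by sentiment metadata.
    if request.sentiment:
        docs = [d for d in docs if d.metadata.get("sentiment") == request.sentiment] or docs

    # 3. Context creation: concatenate the retrieved reviews.
    context = "\n".join(d.page_content for d in docs)

    # 4. Response generation with the LLM.
    prompt = (
        f"Here are the customer reviews:\n{context}\n"
        f"Based on these reviews, answer the following question: {request.question}"
    )
    answer = text_generator(prompt)[0]["generated_text"]
    return {"question": request.question, "answer": answer}
```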