Exploring RAGs 2

Improving Semantic Similarity for Real-World Applications

In the first part of this series, we explored a simple approach to finding semantic similarity between text snippets. While that initial code worked in some cases, it's not robust enough for real-world production software.

To build a more practical solution, we'll dive into using a powerful transformer model and cosine similarity to accurately measure the semantic relationships between sentences. This technique is widely used in modern natural language processing applications, from chatbots to search engines, to surface the most relevant information for users.

By the end of this tutorial, you'll have a production-ready system that can intelligently match user queries to a corpus of relevant content, laying the groundwork for building sophisticated language understanding capabilities.

A better version

In this part, we will use a transformer model to encode the corpus into embeddings, then use the cosine similarity measure to find the semantic similarity between the resulting vectors.

Semantic similarity essentially means how close the meanings of the sentences are to each other.

Cosine similarity measures the angle between vector representations: even if two points are far apart because their vectors have different magnitudes, the angle between them will be small whenever their meanings are similar, regardless of size.

Thus the smaller the angle, the more similar the semantics are.

One thing to note is that a higher cosine similarity score is better. The score is the cosine of the angle between the vectors, so it rises as the angle shrinks.
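As a rough illustration (the vectors and the helper function below are made up for this example and are not part of the tutorial code), cosine similarity only cares about the direction of the vectors, not their magnitude:

import numpy as np

def cosine_sim(a, b):
  # cos(theta) = (a . b) / (|a| * |b|)
  return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction as a, twice the magnitude
c = np.array([3.0, 0.0, -1.0])  # points in a different direction

print(cosine_sim(a, b))  # 1.0 -> angle of 0 degrees, maximally similar
print(cosine_sim(a, c))  # 0.0 -> 90 degree angle, no similarity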

In the code below, the auto-download feature of sentence-transformers might not work at times, so you can get the model directly from this link:

https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/v0.2/
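If you do download the model manually, unzip it and point SentenceTransformer at the local folder instead of the model name (the path below is only a placeholder for wherever you extracted it):

from sentence_transformers import SentenceTransformer

# Load the model from a local directory instead of downloading it by name
model = SentenceTransformer('./models/all-MiniLM-L6-v2')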

Code

# Description: This file contains the code for the RAG model
# Example: reuse your existing OpenAI setup
from openai import OpenAI
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import pickle
import os

model = SentenceTransformer('all-MiniLM-L6-v2')

corpus_of_documents = []
doc_embeddings = []

try:
  if os.path.exists("embeddings.pkl"):
    with open("embeddings.pkl", "rb") as fIn:
      print("Importing embeddings")
      stored_data = pickle.load(fIn)
      corpus_of_documents = stored_data['sentences']
      doc_embeddings = stored_data['embeddings']
  else:
    # Now we use the transformer model to make embeddings 
    # load the corpus text file
    print("Creating embeddings")
    with open('corpus.txt', 'r') as file:
      content = file.read()
      # The corpus file stores each document as a comma-separated entry
      corpus_of_documents = [sentence.strip() for sentence in content.split(',')]
    doc_embeddings = model.encode(corpus_of_documents)
    with open("embeddings.pkl", "wb") as fOut:
      pickle.dump({'sentences': corpus_of_documents, 'embeddings': doc_embeddings}, fOut)
except Exception as e:
  print(f"An error occurred: {e}")

first_prompt = "Chatbot: What is a leisure activity that you like?"
user_input = input(first_prompt+"\n")
query_embedding = model.encode([user_input])

similarities = cosine_similarity(query_embedding, doc_embeddings)

# After we have found the similarity scores, we have to sort it in descending order
# To find the most similar values 

# Pair each document index with its similarity score: [(index, score), ...]
indexed = list(enumerate(similarities[0]))
# Sort by the score (element 1 of each tuple), highest first, to get descending order
sorted_index = sorted(indexed, key=lambda x: x[1], reverse=True)

# Now that we have our similarity scores, we can build the final response

threshold = 0.3 # This decides whether a similarity is high enough or not.

recommended_documents = []
for index, score in sorted_index:
    if score > threshold:
      # formatted_score = "{:.2f}".format(score)
      # print(f"{formatted_score} => {corpus_of_documents[index]}")
      recommended_documents.append(corpus_of_documents[index])

# The above lines take the most relevant documents and then we pass this to the prompt

system_prompt = """
You are a bot that makes recommendations for activities. You answer in short sentences
These are the potential activities:
    {relevant_document}
The user input is: {user_input}
Use less linebreaks
"""
if not recommended_documents:
  print("No relevant lines found")
else:  
  system_prompt= system_prompt.format(relevant_document=recommended_documents, user_input=user_input)
  user_prompt = "Based on the potential activities and the user input, recommend me some activities"
  # Point to the local server
  client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

  completion = client.chat.completions.create(
    messages=[
      {"role": "system", "content": system_prompt},
      {"role": "user", "content": user_prompt},
    ],
    model="llama-3.2-3b-instruct",
    temperature=0.7,
  )
  output = "".join(completion.choices[0].message.content)
  print("chatbot:  " + output)

Does it work?

Yes it does.

This code runs against a separate text file, which has the same kind of lines as the first test, but with many more of them to make things more interesting.

If the embeddings file already exists, the script uses it directly. If it does not, it creates one in the same directory.

I had to add some extra logic here, because when the queries are simple and about generic things like leisure activities, the LLM will try to answer on its own even if no relevant lines are found. That is not a problem for casual use like this, but you really do not want your chatbot generating unsupported answers when important queries are being made against sensitive documentation.

When asked something that is within the corpus, the following happens:

This shows all the tasks within the corpus that scored above the threshold. And once we remove the lines that print them, we get the final response.

What’s next?

To make this more useful in a production environment, you might want to write a script that takes your documents, converts them into a standard text or JSON format, and uses that as the corpus instead.
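As a rough sketch of what such a preprocessing script could look like (assuming plain .txt files in a "docs" folder; the folder name, splitting rule, and JSON layout here are illustrative, not part of the tutorial code):

import json
import os

corpus = []
for name in os.listdir("docs"):
  if name.endswith(".txt"):
    with open(os.path.join("docs", name), "r", encoding="utf-8") as f:
      text = f.read()
    # One corpus entry per paragraph; adjust the splitting rule to fit your documents
    for paragraph in text.split("\n\n"):
      paragraph = paragraph.strip()
      if paragraph:
        corpus.append({"source": name, "text": paragraph})

# Write the whole corpus out as a single JSON file
with open("corpus.json", "w", encoding="utf-8") as out:
  json.dump(corpus, out, indent=2)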

In fact, I have done exactly that for a later chatbot project of mine, which incidentally does not use vector embeddings, instead relying on a summarization system to retrieve relevant content.

In the next RAG chapter, I will show the summarization system, and in the one after that I will rewrite it to use a proper embedding system.

Stay tuned.

This post is adapted from: https://learnbybuilding.ai/tutorials/rag-from-scratch-part-2-semantics-and-cosine-similarity

PS: I found a weird alternative way to do these embeddings. Though it's technically not embeddings but some kind of ratio? Just check this video out.