
How to improve output quality of your RAG-LLM chatbot with AIMon, Milvus and LlamaIndex 🤖💬📈

You can also open this recipe in Google Colab.

Overview​

In this tutorial, we'll build a retrieval-augmented chatbot that answers questions using AIMon's official documentation. You will learn to:

  • Set up a LlamaIndex-based indexing pipeline that crawls AIMon's documentation website and adds it to a vector store.
  • Build an LLM application that answers a user's query related to AIMon's documentation.
  • Leverage AIMon to monitor quality issues and fix them in the LLM application.

Data​

The data used in this example is every page on AIMon's official documentation website.

Tech Stack​

Vector Database​

For this application, we will use Milvus. For a detailed comparison between Vector Databases, refer to this blog post.

Choice of GenAI Framework​

We will use LlamaIndex for this tutorial: it offers a good amount of flexibility with lower-level API abstractions, supports both Python and TypeScript, and is optimized for data retrieval and querying.

Continuous monitoring for quality​

We will leverage AIMon to continuously monitor the LLM application for quality issues.

Architecture​

There are two main components to be aware of: ingestion and inference. The ingestion pipeline crawls the documents from AIMon's official documentation website, processes them and stores them in the vector database. The inference pipeline processes a user query by first retrieving the relevant documents from the vector store; these documents are then used as grounding documents for the LLM to generate its response. We also leverage AIMon to continuously monitor the application for quality issues such as hallucination, context quality problems, instruction adherence, conciseness, toxicity and completeness.
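For orientation, here is a minimal sketch of how the two stages fit together. The function names mirror the ones defined later in this notebook; build_vector_index is a hypothetical placeholder for the Milvus indexing step shown in Section 1.

# Minimal sketch of the two stages (illustrative only; the concrete
# implementations appear in the sections below).

def run_ingestion(sitemap_url="https://docs.aimon.ai/sitemap.xml"):
    urls = fetch_from_sitemap(sitemap_url)           # crawl the docs site
    documents = extract_and_create_documents(urls)   # clean HTML into Documents
    nodes = generate_embeddings_for_docs(documents)  # chunk and embed
    return build_vector_index(nodes)                 # hypothetical helper: store in Milvus

def run_inference(user_query, user_instructions):
    # Retrieval, generation and AIMon monitoring all happen inside ask_and_validate.
    return ask_and_validate(user_query, user_instructions)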


Prerequisites

  1. Install the dependencies.
%pip install aimon requests llama-index llama-index-vector-stores-milvus "pymilvus>=2.4.2" --quiet
  2. You will need an OpenAI API key and an AIMon API key. Once you have obtained these keys, configure OPENAI_API_KEY and AIMON_API_KEY in your Google Colab secrets and grant the notebook access to them. We will use OpenAI for the LLM and embedding models, and AIMon for continuous monitoring of quality issues.

  3. Load the OpenAI API key into an environment variable using the following cell.

import os

# Import Colab Secrets userdata module.
from google.colab import userdata

os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')

We are now ready to build the data ingestion pipeline and the inference pipeline. Let's start by building the data ingestion pipeline, which crawls the website, extracts the HTML pages, generates embeddings and stores them in a vector store.

1. Ingestion Pipeline

Fetch the HTML pages from the sitemap and save them as plain text documents. We use Beautiful Soup to convert the crawled HTML documents into plain text.

NOTE: The documentation website that is being crawled here needs to have a sitemap.xml file for this ingestion pipeline to work.
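As a quick optional check (an addition for illustration, not part of the original pipeline), you can confirm that the sitemap is reachable before running the crawler:

import requests

# Optional: verify the sitemap exists before crawling.
resp = requests.get("https://docs.aimon.ai/sitemap.xml", timeout=10)
print("sitemap.xml reachable:", resp.status_code == 200)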

import os
import json
import requests
import pandas as pd
from bs4 import BeautifulSoup
import xml.etree.ElementTree as ET
from llama_index.core import Document

## Function to crawl the sitemap.
def fetch_from_sitemap(sitemap_url="https://docs.aimon.ai/sitemap.xml"):
    response = requests.get(sitemap_url)
    response.raise_for_status()
    parser = ET.fromstring(response.content)
    list_of_urls = [element.text for element in parser.findall('.//{*}loc')]
    return list_of_urls


## Scraping text from a URL using BeautifulSoup.
def extract_text_from_url(url):
    response = requests.get(url)
    response.raise_for_status()
    soup = BeautifulSoup(response.content, 'html.parser')
    text = soup.get_text(separator="\n", strip=True)
    return text

## Function to preprocess text.
def preprocess_text(text):
    text = " ".join(text.split())
    return text

## Function to process all URLs and create LlamaIndex Documents.
def extract_and_create_documents(list_of_urls, partial=False):

    documents = []

    # This list does not include information on toxicity detectors.
    # This omission is intentional to demonstrate a hallucination that
    # we will detect using AIMon.
    if partial:
        list_of_urls.remove('https://docs.aimon.ai/category/detectors')
        list_of_urls.remove('https://docs.aimon.ai/detectors/toxicity')

    for url in list_of_urls:
        try:
            raw_text = extract_text_from_url(url)
            clean_text = preprocess_text(raw_text)
            doc = Document(text=clean_text, metadata={"url": url})
            documents.append(doc)
        except Exception as e:
            print(f"Failed to process {url}: {str(e)}")

    return documents


list_of_urls = fetch_from_sitemap(sitemap_url="https://docs.aimon.ai/sitemap.xml")

documents = extract_and_create_documents(list_of_urls, partial=True)

Set up an OpenAI-based embedding model. We will use the text-embedding-3-small model here.

from llama_index.embeddings.openai import OpenAIEmbedding
embedding_model = OpenAIEmbedding(model="text-embedding-3-small", embed_batch_size=100, max_retries = 3)
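Optionally, you can sanity-check that the embedding dimensionality matches the dim=1536 we will pass to the Milvus collection below (this check is an illustrative addition, not part of the original recipe).

# Optional sanity check: text-embedding-3-small produces 1536-dimensional vectors,
# which must match the `dim` configured for the Milvus collection later on.
test_embedding = embedding_model.get_text_embedding("hello AIMon")
print(len(test_embedding))  # expected: 1536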

Split the documents into nodes and generate their embeddings.

def generate_embeddings_for_docs(documents):

    # Using the LlamaIndex SentenceSplitter, parse the documents into text chunks.
    from llama_index.core.node_parser import SentenceSplitter

    text_parser = SentenceSplitter()

    text_chunks = []
    doc_idxs = []
    for doc_idx, doc in enumerate(documents):
        cur_text_chunks = text_parser.split_text(doc.text)
        text_chunks.extend(cur_text_chunks)
        doc_idxs.extend([doc_idx] * len(cur_text_chunks))

    ## Construct nodes from the text chunks.
    from llama_index.core.schema import TextNode

    nodes = []
    for idx, text_chunk in enumerate(text_chunks):
        node = TextNode(text=text_chunk)
        src_doc = documents[doc_idxs[idx]]
        node.metadata = src_doc.metadata
        nodes.append(node)

    ## Generate embeddings for each TextNode.
    for node in nodes:
        node_embedding = embedding_model.get_text_embedding(
            node.get_content(metadata_mode="all"))
        node.embedding = node_embedding

    return nodes

nodes = generate_embeddings_for_docs(documents)
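Before indexing, it can be helpful to take a quick look at what was produced (an optional check added for illustration):

# Optional inspection of the generated nodes.
print(f"{len(documents)} documents -> {len(nodes)} nodes")
print(nodes[0].metadata)        # e.g. {'url': 'https://docs.aimon.ai/...'}
print(len(nodes[0].embedding))  # should equal the Milvus `dim` below (1536)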

Insert the nodes with embeddings into the Milvus Vector Store.

from llama_index.core import VectorStoreIndex, StorageContext
from llama_index.vector_stores.milvus import MilvusVectorStore

vector_store = MilvusVectorStore(uri="./aimon_docs.db", collection_name="aimondocs", dim=1536, overwrite=True)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex(nodes=nodes, storage_context=storage_context)

DEBUG:pymilvus.milvus_client.milvus_client:Created new connection using: c1e48d2f2da64ed1883c3a302e7d095e
DEBUG:pymilvus.milvus_client.milvus_client:Successfully created collection: aimondocs
DEBUG:pymilvus.milvus_client.milvus_client:Successfully created an index on collection: aimondocs

2. LLM application with RAG

Instantiate a retriever object and define the system prompt.

from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine

retriever = VectorIndexRetriever(index=index, similarity_top_k=5)

# In this prompt, we ask the LLM to answer the user's question even if the context
# does not contain an answer to the user's query.
system_prompt = """
Please be professional and polite.
Answer the user's question in a single line.
Even if the context lacks information to answer the question, make
sure that you answer the user's question based on your own knowledge.

Example:

Context: "AIMon provides the hallucination and toxicity detector."
Query: "Give me the full set of labels that the toxicity detector provides"
Answer: "AIMon likely provides 'toxic', 'abuse' and 'offensive' labels."

Notice how the answer was not present in the context, but because the question was
about toxicity labels, it is very likely that the labels include 'toxic', 'abuse' and 'offensive'
"""

Configure a Large Language Model (LLM). Here we choose OpenAI's gpt-4o-mini model with a temperature setting of 0.1.

## OpenAI's LLM
from llama_index.llms.openai import OpenAI
llm = OpenAI(model="gpt-4o-mini", temperature=0.1, system_prompt = system_prompt)

Instantiate a LlamaIndex query engine.

from llama_index.core.query_engine import RetrieverQueryEngine
query_engine = RetrieverQueryEngine.from_args(retriever, llm)

Integrate AIMon​

Configure the AIMon detectors that you would like to use. AIMon provides easy-to-use decorators that annotate your LLM functions to add monitoring and real-time detection capabilities.

The Detect decorator supports both synchronous and asynchronous continuous monitoring of your LLM application. This helps you check whether your application is performing optimally for every request. You can see the results in the web application at https://app.aimon.ai.

The detector below is configured to run synchronously for every query. You can set async_mode=True to run the detector asynchronously; an asynchronous variant is sketched after the configuration below. Refer to the API reference documentation for more details.

import os
from aimon import Detect

aimon_config = {
    'hallucination': {'detector_name': 'hdm-1'},
    'instruction_adherence': {'detector_name': 'default'},
    'conciseness': {'detector_name': 'default'},
    'completeness': {'detector_name': 'default'},
    'toxicity': {'detector_name': 'default'},
}

values_returned = ['context', 'user_query', 'instructions', 'generated_text']

detect = Detect(
    values_returned=values_returned,
    api_key=userdata.get('AIMON_API_KEY'),
    config=aimon_config,
    publish=True,
    application_name="my_chatbot_app",
    model_name="OpenAI-gpt-4o-mini"
)
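If you prefer not to block each request on the detectors, the same configuration can be run asynchronously. The sketch below assumes the async_mode flag mentioned above; consult the API reference for its exact semantics.

# Asynchronous variant (sketch): identical configuration, but detections are
# published in the background rather than blocking each request.
# `async_mode` is the flag referenced in the text above; see the AIMon API reference.
detect_async = Detect(
    values_returned=values_returned,
    api_key=userdata.get('AIMON_API_KEY'),
    config=aimon_config,
    publish=True,
    async_mode=True,
    application_name="my_chatbot_app",
    model_name="OpenAI-gpt-4o-mini"
)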

The ask_and_validate function uses the query_engine to return a response for a given user_query. Internally, LlamaIndex calls the vector store to retrieve the relevant documents and then calls the OpenAI LLM with these documents to construct a response. The documents retrieved from the vector store can be extracted using the get_source_docs function below.

import logging

@detect
def ask_and_validate(user_query, user_instructions, query_engine=query_engine):

    response = query_engine.query(user_query)

    ## Nested function to retrieve context and relevance scores from the LLM response.
    def get_source_docs(chat_response):
        contexts = []
        relevance_scores = []
        if hasattr(chat_response, 'source_nodes'):
            for node in chat_response.source_nodes:
                if (hasattr(node, 'node') and hasattr(node.node, 'text')
                        and hasattr(node, 'score') and node.score is not None):
                    contexts.append(node.node.text)
                    relevance_scores.append(node.score)
                elif hasattr(node, 'text') and hasattr(node, 'score') and node.score is not None:
                    contexts.append(node.text)
                    relevance_scores.append(node.score)
                else:
                    logging.info("Node does not have required attributes.")
        else:
            logging.info("No source_nodes attribute found in the chat response.")
        return contexts, relevance_scores

    context, relevance_scores = get_source_docs(response)
    return context, user_query, user_instructions, response.response

Now that we have all the building blocks in place, let's test the application with a user query.

user_query = "How does AIMon solve the problem of hallucinations?"
# This is an explicit set of "verifiable" instructions that you want to check
# the LLM followed when generating its output.
user_instructions = "1. Use technical terms appropriately but explain them. 2. Limit the response to under 500 words."

Call the ask_and_validate function to obtain the AIMon response. This decorated function also returns the context, user_query, user_instructions and the llm_response.

context, user_query, user_instructions, llm_response, aimon_response = ask_and_validate(user_query, user_instructions)
print("LLM Response: {} \n".format(llm_response))

print("AIMon Detect Response: {} \n".format(aimon_response.detect_response))
    LLM Response: AIMon addresses hallucinations through its proprietary Hallucination Detector (HDM-1), which outperforms other commercial detectors on industry-standard benchmarks. 

AIMon Detect Response: InferenceDetectResponseItem(result=None, completeness={'reasoning': "The generated answer is relevant and captures the main aspect of how AIMon solves the problem of hallucinations by mentioning the Hallucination Detector (HDM-1). However, it could provide more depth about what hallucinations are, why they're a problem, and how the detector works or its significance compared to other methods.", 'score': 0.75}, conciseness={'reasoning': "The generated answer is concise and directly addresses the user's query about how AIMon solves the problem of hallucinations, specifically mentioning the key component, the Hallucination Detector (HDM-1), and its performance compared to others. It includes the necessary details without any unnecessary elaboration.", 'score': 0.875}, hallucination={'is_hallucinated': 'False', 'score': 0.07959}, instruction_adherence={'results': [{'adherence': False, 'detailed_explanation': "While the response mentions the 'Hallucination Detector (HDM-1),' it fails to explain what hallucinations are in the context of AI and why a detector is necessary. The use of the term is appropriate, but without explanation, it does not meet the instruction adequately.", 'instruction': 'Use technical terms appropriately but explain them.'}, {'adherence': True, 'detailed_explanation': 'The response is concise and well under the 500-word limit, meeting the requirement perfectly.', 'instruction': 'Limit the response to under 500 words.'}], 'score': 0.5}, toxicity={'results': {'generated_text': {'detected_labels': {'identity_hate': 0.13537251949310303, 'insult': 0.18159158527851105, 'obscene': 0.14576633274555206, 'severe_toxic': 0.04887620359659195, 'threat': 0.1119876280426979, 'toxic': 0.3764057457447052}, 'text': 'AIMon addresses hallucinations through its proprietary Hallucination Detector (HDM-1), which outperforms other commercial detectors on industry-standard benchmarks.'}}, 'score': 0.3764057457447052})

You can also view these metrics in the LLM Apps -> Production tab by logging into the AIMon dashboard. The dashboard shows a history of all the metrics configured for your application along with the ability to debug issues with the application.


Detecting Hallucinations in this LLM application

A hallucination is a factual inaccuracy or a complete fabrication of information that does not exist in the context. Let's test a query that produces a hallucination. Below we also show how AIMon detects this hallucination.

query = "What is the full set of labels that AIMon's toxicity detector generates?"
instructions = "1. Limit the response to under 300 words."

context, user_query, user_instructions, llm_response, aimon_response = ask_and_validate(query, instructions)
print("LLM Response: {} \n".format(llm_response))
print("AIMon Hallucination Detection: {} \n".format(json.dumps(aimon_response.detect_response.hallucination, indent=4)))
    LLM Response: AIMon's toxicity detector likely generates labels such as 'toxic', 'severe_toxic', 'obscene', 'threat', 'insult', and 'identity_hate'. 


AIMon Hallucination Detection: {
    "is_hallucinated": "True",
    "score": 0.92537
}

Notice the AIMon response for detecting a hallucination here. This response contains an is_hallucinated field, indicating whether the response is hallucinated (True) or not (False). Additionally, it includes a hallucination score, which is a probability between 0.0 and 1.0. The closer the score is to 1.0, the more likely it is that the response text is hallucinated.

This response is flagged as hallucinated because the output produced by the LLM does not exist in the input context (we excluded the toxicity detector pages during ingestion above).
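In practice, you can use this score to gate what reaches the user. The snippet below is a minimal illustration; the 0.5 threshold is an arbitrary choice for this example, not an AIMon recommendation.

# Gate the answer on the hallucination score (0.5 is an illustrative threshold).
hallucination = aimon_response.detect_response.hallucination
if hallucination["score"] > 0.5:
    final_answer = ("I couldn't find this in the AIMon documentation. "
                    "Please check https://docs.aimon.ai.")
else:
    final_answer = llm_response
print(final_answer)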

Fixing the hallucination by supplying additional context

In this section, we show how to fix the above hallucination by adding the missing toxicity detector documentation to the vector store.

## Add additional documents to the existing vector database

additional_documents = extract_and_create_documents(
    ['https://docs.aimon.ai/category/detectors',
     'https://docs.aimon.ai/detectors/toxicity'],
    partial=False)

vector_store = MilvusVectorStore(uri="./aimon_docs.db", collection_name="aimondocs", dim=1536, overwrite=False)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(additional_documents, storage_context=storage_context, embed_model=embedding_model)

retriever = VectorIndexRetriever(index=index, similarity_top_k=5)

DEBUG:pymilvus.milvus_client.milvus_client:Created new connection using: 8ec1fb478bf54a6695c35df97bc081f4

## Re-assemble the query engine

query_engine = RetrieverQueryEngine.from_args(retriever, llm)
## Repeat the same query that produced the hallucination above

query = "What is the full set of labels that AIMon's toxicity detector generates?"
instructions = "1. Limit the response to under 300 words."

context, user_query, user_instructions, llm_response_2, aimon_response_2 = ask_and_validate(query, instructions, query_engine)
print("LLM Response: {} \n".format(llm_response_2))
print("AIMon Hallucination Detection: {} \n".format(json.dumps(aimon_response_2.detect_response.hallucination, indent=4)))
    LLM Response: The full set of labels that AIMon's toxicity detector generates includes 'identity_hate', 'toxic', 'severe_toxic', 'obscene', 'threat', and 'insult'. 


AIMon Hallucination Detection: {
    "is_hallucinated": "False",
    "score": 0.15246
}

🎉 Hallucination fixed!

Notice that the hallucination score from AIMon is low (less than 0.5), indicating that a hallucination is unlikely.

As a recap, in this notebook we did the following things:

  • Created a documentation chat application using LlamaIndex
  • Integrated the application with AIMon to continuously monitor the quality of the output at low cost and low latency.
  • Used AIMon to detect and fix a hallucination produced by an LLM.