
How to improve the accuracy of your RAG-LLM chatbot with AIMon [Python, ApertureDB, LlamaIndex]


Overview

In this tutorial, we'll help you build a retrieval-augmented chatbot that answers questions over AIMon's official documentation.

In this tutorial you will learn to:

  • Set up a LlamaIndex-based indexing pipeline that crawls AIMon's documentation website and indexes the documents into ApertureDB, a multi-modal vector database.
  • Build an LLM application that answers a user's query related to AIMon's documentation.
  • Leverage AIMon to monitor quality issues and fix them in the LLM application.

Data

The data used in this example is every page on AIMon's official documentation website.

Tech Stack

Vector Database

For this application, we will use ApertureDB, a specialized database designed to manage multimodal data, including images, videos, documents, feature vectors (embeddings), and associated metadata such as annotations. For a detailed comparison of vector databases, refer to this blog post.

LLM Framework

LlamaIndex is a data framework for connecting, managing, and optimizing data for use with large language models (LLMs). We will use LlamaIndex in this tutorial because it offers a good amount of flexibility, exposes lower-level API abstractions, supports both Python and TypeScript, and is optimized for data retrieval and querying.

LLM Output Quality Evaluation

AIMon offers proprietary detectors for Hallucination, Context Quality issues, and Instruction Adherence, among others. We will use AIMon to continuously monitor the LLM application for quality issues.

Architecture

There are two main components to be aware of: ingestion and RAG-based Q&A. The ingestion pipeline crawls the documents from AIMon's official documentation website, processes them, and stores them in the vector database. The RAG Q&A pipeline processes a user query by first retrieving the relevant documents from the vector store. These documents are then used as grounding documents for the LLM to generate its response. We also leverage AIMon to continuously monitor the application for quality issues such as hallucination, context quality problems, instruction adherence, conciseness, toxicity, and completeness.

(Architecture diagram: the ingestion pipeline and the RAG-based Q&A pipeline, with AIMon monitoring.)

Prerequisites

  1. Install the dependencies.
%%capture
%%shell
pip install udocker
udocker --allow-root install
nohup udocker --allow-root run -p 55555:55555 aperturedata/aperturedb-community &
%%capture
%pip install aimon requests llama-index --quiet
# Install ApertureDB-LlamaIndex integration
%%capture
%pip install "git+https://github.com/aperture-data/llama_index.git@add_aperturedb_vector_store#egg=llama-index-vector-stores-ApertureDB&subdirectory=llama-index-integrations/vector_stores/llama-index-vector-stores-ApertureDB" --quiet
  2. Configure the ApertureDB community edition.
%%capture
%%shell
adb config create local --no-interactive --overwrite
  3. Create a connection to the ApertureDB Vector Store.
from aperturedb.CommonLibrary import create_connector
from aperturedb.Utils import Utils
client = create_connector()
utils = Utils(client)
utils.get_schema()
  4. Get the API keys.

AIMon: Instructions available here.

OpenAI API key: obtain one from your OpenAI account.

Once you have obtained these keys, configure your OPENAI_API_KEY and AIMON_API_KEY in Google Colab secrets and grant the notebook access to them. We will use OpenAI for the LLM and embedding models, and AIMon for continuous monitoring of quality issues.

  5. Load the OpenAI API key into an environment variable using the following cell.
import os

# Import Colab Secrets userdata module.
from google.colab import userdata

os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')

We are now ready to build the data ingestion pipeline and the inference pipeline. Let's start with building the data ingestion pipeline that crawls the website, extracts the HTML pages, generates embeddings from them and stores them into a Vector store.

1. Ingestion Pipeline

Configure AIMon to detect conflicting information in the documents that you are going to index. AIMon provides easy-to-use decorators that annotate your LLM functions to add monitoring and real-time detection capabilities.

The Detect decorator supports both synchronous and asynchronous continuous monitoring of your LLM application.

import os
from aimon import Detect

aimon_config = {
    'context_classification': {'detector_name': 'default'}
}

values_returned = ['context', 'generated_text']

application_name = 'aimon_llamaindex_chatbot_app_112124_103'
model_name = 'gpt-4o-mini'

context_quality_detector = Detect(
    values_returned=values_returned,
    api_key=userdata.get('AIMON_API_KEY'),
    config=aimon_config,
    publish=True,
    application_name=application_name,
    model_name=model_name
)

Import dependencies and define the utility functions to fetch URLs from a sitemap, extract text from those URLs, and create LlamaIndex documents.

import os
import json
import requests
import pandas as pd
from bs4 import BeautifulSoup
import xml.etree.ElementTree as ET
from llama_index.core import Document

## Function to crawl the sitemap.
def fetch_from_sitemap(sitemap_url="https://docs.aimon.ai/sitemap.xml"):
    response = requests.get(sitemap_url)
    response.raise_for_status()
    parser = ET.fromstring(response.content)
    list_of_urls = [element.text for element in parser.findall('.//{*}loc')]
    return list_of_urls

## Scraping text from a URL using BeautifulSoup.
def extract_text_from_url(url):
    response = requests.get(url)
    response.raise_for_status()
    soup = BeautifulSoup(response.content, 'html.parser')
    text = soup.get_text(separator="\n", strip=True)
    return text

## Function to preprocess text.
def preprocess_text(text):
    text = " ".join(text.split())
    return text

## Function to process all URLs and create LlamaIndex Documents.
def extract_and_create_documents(list_of_urls, partial=False):

    documents = []

    # This list does not include information on toxicity detectors.
    # This omission is intentional to demonstrate a hallucination that
    # we will detect using AIMon.
    if partial:
        list_of_urls.remove('https://docs.aimon.ai/category/checker-models')
        list_of_urls.remove('https://docs.aimon.ai/detectors/toxicity')

    for url in list_of_urls:
        try:
            raw_text = extract_text_from_url(url)
            clean_text = preprocess_text(raw_text)
            doc = Document(text=clean_text, metadata={"url": url})
            documents.append(doc)
        except Exception as e:
            print(f"Failed to process {url}: {str(e)}")

    return documents

Fetch the HTML pages from the sitemap and save them as LlamaIndex documents (plain text + metadata).

NOTE: The documentation website that is being crawled here needs to have a sitemap.xml file for this ingestion pipeline to work.
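Before pointing this pipeline at a different site, a quick sanity check (not part of the original pipeline, just a hedged sketch) is to request the sitemap directly and confirm it responds:

# Optional check: confirm the target site exposes a sitemap before ingesting.
import requests

sitemap_response = requests.get("https://docs.aimon.ai/sitemap.xml", timeout=10)
print("Sitemap available:", sitemap_response.ok)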

list_of_urls = fetch_from_sitemap(sitemap_url="https://docs.aimon.ai/sitemap.xml")

## Removing the recipes from the list_of_urls
list_of_urls.remove("https://docs.aimon.ai/recipes/fixing_hallucinations_in_a_documentation_chatbot")
list_of_urls.remove("https://docs.aimon.ai/recipes/fixing_hallucinations_in_a_documentation_chatbot_TS")
list_of_urls.remove("https://docs.aimon.ai/recipes/fixing_hallucinations_in_a_documentation_chatbot_aperturedb")

## Injecting conflicting information to demonstrate AIMon's context quality detector
list_of_urls.append("https://gist.github.com/pjoshi30/4aadb3b5582b22d7ec5deb90bce5c06a")

documents = extract_and_create_documents(list_of_urls, partial=True)

Check the quality of the documents that will be indexed

# For demonstration purposes, we will run AIMon on only the last document.
@context_quality_detector
def check_for_conflicts():
    return documents[-1].text, "placeholder"

context, generated_text, aimon_response = check_for_conflicts()

Notice that the last document contains conflicting information. To fix this issue, one option is to remove it from the list; the other option is to correct the document and add it back. Here, we choose the first option (a sketch of the alternative follows the code below).


# Remove the last document from the array

documents = documents[:-1]
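For completeness, a minimal sketch of the second option is shown below. It is not used in this tutorial; replace_conflicting_document and corrected_text are hypothetical names introduced only for illustration, reusing the Document class and preprocess_text helper defined earlier.

# Hypothetical alternative (not executed here): swap in a corrected version
# of a document while preserving its metadata, instead of dropping it.
def replace_conflicting_document(documents, index, corrected_text):
    documents[index] = Document(text=preprocess_text(corrected_text),
                                metadata=documents[index].metadata)
    return documents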

Set up an OpenAI-based embedding model. Any embedding model can be used in this step.

from llama_index.embeddings.openai import OpenAIEmbedding
embedding_model = OpenAIEmbedding(model="text-embedding-3-small", embed_batch_size=100, max_retries = 3)
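As an example of swapping in a different embedding model, here is a hedged sketch using a local HuggingFace model. This is an assumption, not part of the tutorial: it requires installing the llama-index-embeddings-huggingface package, and the vector store dimensions would need to match the model's output size.

# Hypothetical alternative (requires: pip install llama-index-embeddings-huggingface).
# Note: bge-small-en-v1.5 produces 384-dimensional vectors, so the ApertureDBVectorStore
# created later would need dimensions=384 instead of 1536.
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
local_embedding_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")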

Split documents into nodes and generate their embeddings.

def generate_embeddings_for_docs(documents):

    # Using the LlamaIndex SentenceSplitter, parse the documents into text chunks.
    from llama_index.core.node_parser import SentenceSplitter

    text_parser = SentenceSplitter()

    text_chunks = []
    doc_idxs = []
    for doc_idx, doc in enumerate(documents):
        cur_text_chunks = text_parser.split_text(doc.text)
        text_chunks.extend(cur_text_chunks)
        doc_idxs.extend([doc_idx] * len(cur_text_chunks))

    ## Construct nodes from the text chunks.
    from llama_index.core.schema import TextNode

    nodes = []
    for idx, text_chunk in enumerate(text_chunks):
        node = TextNode(text=text_chunk)
        src_doc = documents[doc_idxs[idx]]
        node.metadata = src_doc.metadata
        nodes.append(node)

    ## Generate embeddings for each TextNode.
    for node in nodes:
        node_embedding = embedding_model.get_text_embedding(
            node.get_content(metadata_mode="all"))
        node.embedding = node_embedding

    return nodes

nodes = generate_embeddings_for_docs(documents)

Insert the nodes with embeddings into the ApertureDB Vector Store.

from llama_index.core import VectorStoreIndex, StorageContext
from llama_index.vector_stores.ApertureDB import ApertureDBVectorStore

vector_store = ApertureDBVectorStore(dimensions=1536, descriptor_set="aimondocs", overwrite=True)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex(nodes=nodes, storage_context=storage_context)

So far, we have built the first half of the architecture: the ingestion pipeline.


2. RAG-based Q&A

This section demonstrates how to build an LLM application that leverages the vector store above for retrieval augmented generation (RAG).

Instantiate a retriever object and define the system prompt.

from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine

retriever = VectorIndexRetriever(index=index, similarity_top_k=5)

# In this prompt, we ask the LLM to answer the user's question even if the context
# does not contain an answer to the user's query.
system_prompt = """
Please be professional and polite.
Answer the user's question in a single line.
Even if the context lacks information to answer the question, make
sure that you answer the user's question based on your own knowledge.

Example:

Context: "AIMon provides the hallucination and toxicity detector."
Query: "Give me the full set of labels that the toxicity detector provides"
Answer: "AIMon likely provides 'toxic', 'abuse' and 'offensive' labels."

Notice how the answer was not present in the context, but because the question was
about toxicity labels, it is very likely that the labels include 'toxic', 'abuse' and 'offensive'
"""

Configure a Large Language Model (LLM). Here we choose OpenAI's GPT-4o-mini model with a temperature setting of 0.1.

## OpenAI's LLM
from llama_index.llms.openai import OpenAI
llm = OpenAI(model="gpt-4o-mini", temperature=0.1, system_prompt = system_prompt)

Instantiate a LlamaIndex query engine.

from llama_index.core.query_engine import RetrieverQueryEngine
query_engine = RetrieverQueryEngine.from_args(retriever, llm)

Integrate AIMon with the RAG Q&A Pipeline

Configure the AIMon detectors that you would like to use. AIMon provides easy-to-use decorators that annotate your LLM functions to add monitoring and real-time detection capabilities.

The Detect decorator supports both synchronous and asynchronous continuous monitoring of your LLM application. This helps you check whether your application is performing optimally for every request. You can see the results in the AIMon web application: https://app.aimon.ai.

The detector below is configured to run synchronously for every query. You can set async_mode=True to run the detector asynchronously. Refer to the API reference documentation for more details.

import os
from aimon import Detect

aimon_config = {'hallucination': {'detector_name': 'hdm-1'},
                'instruction_adherence': {'detector_name': 'default'},
                'conciseness': {'detector_name': 'default'},
                'completeness': {'detector_name': 'default'},
                'toxicity': {'detector_name': 'default'}}

values_returned = ['context', 'user_query', 'instructions', 'generated_text']

detect = Detect(
    values_returned=values_returned,
    api_key=userdata.get('AIMON_API_KEY'),
    config=aimon_config,
    publish=True,
    application_name="my_chatbot_app",
    model_name="OpenAI-gpt-4o-mini"
)
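As mentioned above, the detectors can also run asynchronously via async_mode=True. A minimal sketch of that variant is shown below; it is an assumption based on the option referenced above and is not used in the rest of this tutorial.

# Hypothetical asynchronous variant of the detector above (not used below).
detect_async = Detect(
    values_returned=values_returned,
    api_key=userdata.get('AIMON_API_KEY'),
    config=aimon_config,
    async_mode=True,  # run detections asynchronously instead of blocking each request
    application_name="my_chatbot_app",
    model_name="OpenAI-gpt-4o-mini"
)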

The ask_and_validate function uses the query_engine to return a response for a given user_query. Internally, LlamaIndex calls the vector store to retrieve the relevant documents and then calls the OpenAI LLM with these documents to construct a response. The documents retrieved from the vector store can be extracted using the get_source_docs function below.

import logging

@detect
def ask_and_validate(user_query, user_instructions, query_engine=query_engine):

    response = query_engine.query(user_query)

    ## Nested function to retrieve context and relevance scores from the LLM response.
    def get_source_docs(chat_response):
        contexts = []
        relevance_scores = []
        if hasattr(chat_response, 'source_nodes'):
            for node in chat_response.source_nodes:
                if (hasattr(node, 'node') and hasattr(node.node, 'text')
                        and hasattr(node, 'score') and node.score is not None):
                    contexts.append(node.node.text)
                    relevance_scores.append(node.score)
                elif hasattr(node, 'text') and hasattr(node, 'score') and node.score is not None:
                    contexts.append(node.text)
                    relevance_scores.append(node.score)
                else:
                    logging.info("Node does not have required attributes.")
        else:
            logging.info("No source_nodes attribute found in the chat response.")
        return contexts, relevance_scores

    context, relevance_scores = get_source_docs(response)
    return context, user_query, user_instructions, response.response

Now that we have all the building blocks in place, let's test the application with a user query.

user_query = "How does AIMon solve the problem of hallucinations?"
# These are an explicit set of "verifiable" instructions that you want to check
# whether the LLM followed when generating its output.
user_instructions = "1. Use technical terms appropriately but explain them. 2. Limit the response to under 500 words."

Call the ask_and_validate function to obtain the AIMon response. This decorated function also returns the context, user_query, user_instructions, and the llm_response.

context, user_query, user_instructions, llm_response, aimon_response = ask_and_validate(user_query, user_instructions)
print("LLM Response: {} \n".format(llm_response))

print("AIMon Detect Response: {} \n".format(aimon_response.detect_response))
    LLM Response: AIMon addresses hallucinations through its proprietary Hallucination Detector (HDM-1), which outperforms other commercial detectors on industry-standard benchmarks. 

AIMon Detect Response: InferenceDetectResponseItem(result=None, completeness={'reasoning': 'The generated answer is relevant and correctly identifies that AIMon uses the Hallucination Detector (HDM-1) to address hallucinations, outperforming other detectors. However, it lacks additional context or details about how the detector functions or its implications for improving LLM outputs, which would enhance the completeness of the answer.', 'score': 0.756}, conciseness={'reasoning': "The generated answer is concise and directly addresses the user query by highlighting AIMon's Hallucination Detector, its effectiveness, and its comparative performance against other detectors. It focuses on the critical aspect of how AIMon solves the problem of hallucinations without unnecessary elaboration.", 'score': 0.895}, hallucination={'is_hallucinated': 'False', 'score': 0.07959, 'sentences': [{'score': 0.07959, 'text': 'AIMon addresses hallucinations through its proprietary Hallucination Detector (HDM-1), which outperforms other commercial detectors on industry-standard benchmarks.'}]}, instruction_adherence={'results': [{'adherence': False, 'detailed_explanation': "The response mentions 'Hallucination Detector (HDM-1)' as a technical term but does not explain what 'hallucinations' are in the context of AI or what the detector does, which is essential for understanding.", 'instruction': 'Use technical terms appropriately but explain them.'}, {'adherence': True, 'detailed_explanation': 'The response is concise and well under the 500-word limit, consisting of only one sentence.', 'instruction': 'Limit the response to under 500 words.'}], 'score': 0.5}, toxicity={'results': {'generated_text': {'detected_labels': {'identity_hate': 0.13537251949310303, 'insult': 0.18159158527851105, 'obscene': 0.14576633274555206, 'severe_toxic': 0.04887620359659195, 'threat': 0.1119876280426979, 'toxic': 0.3764057457447052}, 'text': 'AIMon addresses hallucinations through its proprietary Hallucination Detector (HDM-1), which outperforms other commercial detectors on industry-standard benchmarks.'}}, 'score': 0.3764057457447052})

You can also view these metrics in the LLM Apps -> Production tab by logging into the AIMon dashboard. The dashboard shows a history of all the metrics configured for your application along with the ability to debug issues with the application.

(Screenshot: the AIMon dashboard's LLM Apps -> Production tab showing the configured metrics.)

Congratulations 🎉 We have built the entire stack.

Detecting Hallucinations in this LLM application

A hallucination is a factual inaccuracy or a complete fabrication of information that does not exist in the context. Let's test a query that produces a hallucination. Below we also show how AIMon detects this hallucination.

query = "What is the full set of labels that AIMon's toxicity detector generates?"
instructions = "1. Limit the response to under 300 words."

context, user_query, user_instructions, llm_response, aimon_response = ask_and_validate(query, instructions)
print("LLM Response: {} \n".format(llm_response))
print("AIMon Hallucination Detection: {} \n".format(json.dumps(aimon_response.detect_response.hallucination, indent=4)))
    LLM Response: AIMon's toxicity detector likely generates labels such as 'toxic', 'abuse', 'offensive', and 'hate speech'. 

AIMon Hallucination Detection: {
    "is_hallucinated": "True",
    "score": 0.93529,
    "sentences": [
        {
            "score": 0.93529,
            "text": "AIMon's toxicity detector likely generates labels such as 'toxic', 'abuse', 'offensive', and 'hate speech'."
        }
    ]
}

Notice the AIMon response for detecting a hallucination here. This response contains an is_hallucinated boolean variable, indicating whether the response is hallucinated (True) or not (False). Additionally, it includes a hallucination score, which is a probability between 0.0 and 1.0. The closer the score is to 1.0, the more likely it is that the response text is hallucinated.

This response is flagged as hallucinated because the output produced by the LLM is not supported by the input context (we excluded the toxicity detector pages during ingestion above).
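If you want to act on this signal programmatically, a minimal sketch is shown below; the 0.5 threshold and the print messages are hypothetical choices for illustration, not AIMon recommendations.

# Hypothetical post-processing step: gate the response on the hallucination score.
HALLUCINATION_THRESHOLD = 0.5  # assumption: tune this for your application

hallucination_result = aimon_response.detect_response.hallucination
if hallucination_result["score"] >= HALLUCINATION_THRESHOLD:
    print("Potential hallucination detected; consider adding context and regenerating.")
else:
    print("Response appears grounded in the retrieved context.")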

Fixing the hallucination by supplying additional context

In this section, we show how to fix the above hallucination by adding the missing toxicity detectors to the vector store.

## Add additional documents to the existing vector database collection

additional_documents = extract_and_create_documents(['https://docs.aimon.ai/category/checker-models',
                                                     'https://docs.aimon.ai/detectors/toxicity'],
                                                    partial=False)

vector_store = ApertureDBVectorStore(dimensions=1536, descriptor_set="aimondocs", overwrite=False)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(additional_documents, storage_context=storage_context, embed_model=embedding_model)

retriever = VectorIndexRetriever(index=index, similarity_top_k=5)
## Re-assemble the engine

query_engine = RetrieverQueryEngine.from_args(retriever, llm)
## Repeat the same query that produced the hallucination above

query = "What is the full set of labels that AIMon's toxicity detector generates?"
instructions = "1. Limit the response to under 300 words."

context, user_query, user_instructions, llm_response_2, aimon_response_2 = ask_and_validate(query, instructions, query_engine)
print("LLM Response: {} \n".format(llm_response_2))
print("AIMon Hallucination Detection: {} \n".format(json.dumps(aimon_response_2.detect_response.hallucination, indent=4)))
    LLM Response: The full set of labels that AIMon's toxicity detector generates includes: 'identity_hate', 'toxic', 'severe_toxic', 'obscene', 'threat', and 'insult'. 

AIMon Hallucination Detection: {
    "is_hallucinated": "False",
    "score": 0.19274,
    "sentences": [
        {
            "score": 0.19274,
            "text": "The full set of labels that AIMon's toxicity detector generates includes: 'identity_hate', 'toxic', 'severe_toxic', 'obscene', 'threat', and 'insult'."
        }
    ]
}

🎉 Hallucination fixed!


Notice that the hallucination score from AIMon is low (less than 0.5), indicating that there is less likelihood of a hallucination.

As a recap, in this notebook we did the following things:

  • Created a documentation chat application using LlamaIndex
  • Integrated the application with AIMon to continuously monitor the quality of the output at low cost and low latency.
  • Used AIMon to detect and fix a hallucination produced by an LLM.