How to improve the accuracy of your RAG-LLM chatbot with AIMon [TypeScript, LlamaIndex]
Overview
In this tutorial, we'll build a retrieval-augmented chatbot that answers questions using AIMon's official documentation. You will learn to:
- Set up a LlamaIndex-based indexing pipeline that crawls AIMon's documentation website and adds it to a vector store.
- Build an LLM application that answers a user's query related to AIMon's documentation.
- Leverage AIMon to monitor quality issues and fix them in the LLM application.
Data
The data used in this example is every page on AIMon's official documentation website.
Tech Stack
Choice of GenAI Framework
We will use LlamaIndex for this tutorial because it offers a good amount of flexibility through its lower-level API abstractions, supports both Python and TypeScript, and is optimized for data retrieval and querying.
Continuous monitoring for quality
We will leverage AIMon to continuously monitor the LLM application for quality issues.
Vector Database
To keep this application simple, we will use the LlamaIndex in-memory vector database.
Architecture
There are two main components to be aware of: ingestion and inference. The ingestion pipeline crawls the documents from AIMon's official documentation website, processes them, and stores them in the vector database. The inference pipeline processes a user query by first retrieving the relevant documents from the vector store; these documents are then used as grounding documents for the LLM to generate its response. We also leverage AIMon to continuously monitor the application for quality issues such as hallucination, context quality problems, instruction adherence, conciseness, toxicity, and completeness.
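To make the shape of the system concrete before diving in, here is a minimal, illustrative sketch of the two entry points. The function names and bodies below are our own placeholders (they are not part of LlamaIndex or AIMon); the rest of the tutorial fills in each step with real code.
// Illustrative outline only; the real implementations follow in the sections below.
import { Document, VectorStoreIndex } from "llamaindex";
// Ingestion pipeline: crawl the docs site, turn pages into Documents, embed and index them.
async function ingest(sitemapUrl: string): Promise<VectorStoreIndex> {
    const documents: Document[] = []; // filled by the crawler defined in Section 1
    return VectorStoreIndex.fromDocuments(documents);
}
// Inference pipeline: retrieve context, generate a grounded answer, then monitor it with AIMon.
async function answerQuery(index: VectorStoreIndex, query: string): Promise<string> {
    // 1. retrieve top-k documents from the index
    // 2. generate a response grounded in those documents
    // 3. call aimon.detect(...) with the response, context and query
    return "..."; // placeholder
}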
Prerequisites
- Install the dependencies.
npm install axios cheerio llamaindex aimon xml2js dotenv
npm install --save-dev typescript@latest @types/node @types/dotenv @types/xml2js
- You will need an OpenAI API key and an AIMon API key. Once you have obtained these keys, store them in a .env file as OPENAI_API_KEY=your_api_key and AIMON_API_KEY=your_api_key. We will use OpenAI for the LLM and the embedding model, and AIMon for continuous monitoring of quality issues.
- Set up the global configurations.
import "dotenv/config";
import Client from "aimon";
import {OpenAI, OpenAIEmbedding, SentenceSplitter, Settings} from "llamaindex";
const OPENAI_API_KEY = process.env.OPENAI_API_KEY;
const aimon = new Client({authHeader: `Bearer ${process.env.AIMON_API_KEY}`});
Settings.nodeParser = new SentenceSplitter();
Settings.llm = new OpenAI({ model: "gpt-4o-mini", temperature: 0.1 });
Settings.embedModel = new OpenAIEmbedding({model: "text-embedding-3-small"});
We are now ready to build the data ingestion pipeline and the inference pipeline. Let's start by building the data ingestion pipeline, which crawls the website, extracts the HTML pages, generates embeddings from them, and stores them in a vector store.
1. Ingestion Pipeline
NOTE: The documentation website that is being crawled here needs to have a sitemap.xml file for this ingestion pipeline to work.
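For reference, the crawler defined below expects the standard urlset sitemap layout; with xml2js's parseStringPromise and default options, the parsed object has roughly the following shape (the type name here is just for illustration, not something exported by xml2js):
// Approximate shape of a parsed sitemap as produced by xml2js with default options
interface ParsedSitemap {
    urlset: {
        // one entry per <url> element; <loc> values end up as single-element string arrays
        url: { loc: string[] }[];
    };
}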
Define utility functions to fetch URLs from a sitemap and extract text from the URLs.
// Importing dependencies
import axios from "axios";
import * as cheerio from "cheerio";
import {Document} from "llamaindex";
import { parseStringPromise } from "xml2js";
// Function to extract the URLs from a sitemap
async function fetch_from_sitemap(sitemapUrl: string): Promise<string[]> {
try {
// Step 1: Fetch the sitemap XML
const response = await axios.get(sitemapUrl);
// Step 2: Parse the XML into a JavaScript object
const parsedXml = await parseStringPromise(response.data);
// Step 3: Extract URLs from the parsed XML structure
const urls: string[] = [];
if (parsedXml.urlset && Array.isArray(parsedXml.urlset.url)) {
parsedXml.urlset.url.forEach((urlEntry: { loc: string[] }) => {
if (urlEntry.loc && urlEntry.loc[0]) {
urls.push(urlEntry.loc[0]);
}
});
}
// Return the extracted URLs
return urls;
} catch (error) {
console.error('Error fetching or parsing sitemap:', error);
throw new Error('Failed to fetch or parse sitemap');
}
}
// Function to extract text from a URL
async function extract_text_from_url(url: string): Promise<Document> {
try {
// Fetch the HTML content of the page
const response = await axios.get(url);
const html = response.data;
// Load the HTML into Cheerio for parsing
const $ = cheerio.load(html);
// Extract the text content from the body (you can adjust this selector if needed)
const textContent = $('body').text();
// Clean up the text if necessary (trim whitespace, remove unnecessary characters, etc.)
const cleanText = textContent.trim();
// Create and return a Document object
return new Document({ text: cleanText, id_: url });
} catch (error) {
console.error(`Error fetching or parsing the URL: ${url}`, error);
// Return an empty Document object in case of an error
return new Document({ text: '', id_: url });
}
}
Fetch the HTML pages from the sitemap and save them as LlamaIndex documents.
// Fetch the URLs from sitemap
let aimon_urls = await fetch_from_sitemap("https://docs.aimon.ai/sitemap.xml");
// Remove the recipes from the list of URLs
const urlsToRemove: string[] = ["https://docs.aimon.ai/recipes/fixing_hallucinations_in_a_documentation_chatbot",
"https://docs.aimon.ai/recipes/fixing_hallucinations_in_a_documentation_chatbot_TS",
"https://docs.aimon.ai/recipes/fixing_hallucinations_in_a_documentation_chatbot_aperturedb",];
aimon_urls = aimon_urls.filter(url => !urlsToRemove.includes(url));
// Extract text from the URLs and create LlamaIndex documents
const documents: Document[] = [];
for (let i = 0; i < aimon_urls.length; i++) {
const url = aimon_urls[i];
// Intentionally not including information on toxicity detectors to demonstrate a hallucination that we will detect using AIMon.
if(url=='https://docs.aimon.ai/category/checker-models' || url=='https://docs.aimon.ai/detectors/toxicity'){
continue;
}
try {
const document = await extract_text_from_url(url);
documents.push(document);
}
catch (error) {
console.error(`Failed to extract text from ${url}:`, error);
}
}
Create embeddings and store them in the in-memory Vector Store
import {VectorStoreIndex} from "llamaindex";
const index = await VectorStoreIndex.fromDocuments(documents);
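The index above lives purely in memory, so the pages are re-crawled and re-embedded on every run. If you want to avoid that, LlamaIndex.TS can persist the index to disk through a storage context. The sketch below follows the LlamaIndex.TS documentation as we recall it; verify the exact API against the version of llamaindex you have installed.
// Optional: persist the index to disk so the crawl/embedding step can be skipped on later runs
import { storageContextFromDefaults } from "llamaindex";
const storageContext = await storageContextFromDefaults({ persistDir: "./storage" });
const persistedIndex = await VectorStoreIndex.fromDocuments(documents, { storageContext });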
2. LLM application with RAG
Instantiate a retriever object and define the system prompt.
const retriever = index.asRetriever({similarityTopK: 5});
const system_prompt = " Please be professional and polite.\
Answer the user's question in a single line.\
Even if the context lacks information to answer the question, make\
sure that you answer the user's question based on your own knowledge.\
\
Example:\
\
Context: AIMon provides the hallucination and toxicity detector.\
Query: Give me the full set of labels that the toxicity detector provides\
Answer: 'AIMon likely provides 'toxic', 'abuse' and 'offensive' labels.'\
\
Notice how the answer was not present in the context, but because the question was\
about toxicity labels, it is very likely that the labels include 'toxic', 'abuse' and 'offensive'\
"
Instantiate a LlamaIndex ContextChatEngine.
import {ContextChatEngine} from "llamaindex";
const chatbot = new ContextChatEngine({ retriever, systemPrompt: system_prompt});
Now that we have all the building blocks in place, let's test the application with a user query.
// Define user query
const query = "How many detectors does AIMon provide to the end users?";
// Define instructions that you would like to check the LLM response against for adherence
const instructions = "1. Limit the response to under 300 words. 2. The response should be in English only.";
// Get LLM response
const response = await chatbot.chat({message:query});
console.log(`\nLLM response: ${response}`)
LLM response: AIMon provides multiple detectors, including a Hallucination Detector and an Instruction Adherence model, among others.
Integrate AIMon
Define a function get_source_documents to retrieve context documents from an LLM response.
function get_source_documents(response: any): [string[], number[]] {
    const contexts: string[] = [];
    const relevance_scores: number[] = [];
    if (response.sourceNodes) {
        for (const node of response.sourceNodes) {
            if (node.node && node.node.text && node.score != null) {
                contexts.push(node.node.text);
                relevance_scores.push(node.score);
            }
            else if (node.text && node.score != null) {
                contexts.push(node.text);
                relevance_scores.push(node.score);
            }
            else {
                console.log("Node does not have the required attributes.");
            }
        }
    }
    else {
        console.log("No sourceNodes attribute found in the chat response.");
    }
    return [contexts, relevance_scores];
}
Retrieve the context used by the LLM to generate the response.
const [context, relevance_scores] = get_source_documents(response);
Configure AIMon detectors that you would like to use.
const detectors = { hallucination: {detector_name: "hdm-1"}};
The detect API is capable of both synchronous and asynchronous continuous monitoring of your LLM application. This helps you check whether your application is performing optimally for every request. You can see the results on this web application: https://app.aimon.ai.
Call the AIMon detect API. The call below is configured to run synchronously for every query; pass true for the async_mode argument to run the detectors asynchronously. Refer to the API reference documentation for more details.
const aimonResponse = await aimon.detect( response.response,
context,
query,
detectors,
instructions,
false, // async_mode [boolean]
true, // publish [boolean]
"TS_HDM1_test", // application name [string]
"gpt-4o-mini" // LLM name [string]
);
console.log(`\nAIMon response: ${JSON.stringify(aimonResponse)}`);
AIMon response: [{"hallucination":{"is_hallucinated":"False","score":0.09823,"sentences":[{"score":0.09823,"text":"AIMon provides multiple detectors, including a Hallucination Detector and an Instruction Adherence model, among others."}]}}]
You can also view the metrics from AIMon response in the LLM Apps -> Production tab by logging into the AIMon dashboard. The dashboard shows a history of all the metrics configured for your application along with the ability to debug issues with the application.
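If you prefer not to block the request path while the detectors run, the same call can be made asynchronously by flipping the async_mode argument of the call shown above. We are only changing that boolean here; see the AIMon API reference for what the call returns in async mode.
// Same detect call as above, but with async_mode set to true
const aimonAsyncResponse = await aimon.detect( response.response,
    context,
    query,
    detectors,
    instructions,
    true, // async_mode [boolean]: run the detectors asynchronously
    true, // publish [boolean]
    "TS_HDM1_test", // application name [string]
    "gpt-4o-mini" // LLM name [string]
);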
Detecting Hallucinations in this LLM application
A hallucination is a factual inaccuracy or a complete fabrication of information that does not exist in the context. Let's test a query that produces a hallucination. Below we also show how AIMon detects this hallucination.
const query_1 = "What is the full set of labels that AIMon's toxicity detector generates?";
const instructions_1 = "1. Limit the response to under 300 words.";
const llm_response_1 = await chatbot.chat({message: query_1});
const [context_1, relevance_scores_1] = get_source_documents(llm_response_1);
const aimon_response_1 = await aimon.detect( llm_response_1.response,
    context_1,
    query_1,
    detectors,
    instructions_1,
    false, // async_mode [boolean]
    false, // publish [boolean]
    "my_chatbot_app", // application name [string]
    "OpenAI-gpt-4o-mini" // LLM name [string]
);
console.log(`\nLLM response: ${llm_response_1}`);
console.log(`\nAIMon response: ${JSON.stringify(aimon_response_1)}`);
LLM response: AIMon's toxicity detector likely generates labels such as 'toxic', 'abusive', 'offensive', and 'hateful'.
AIMon response: [{"hallucination":{"is_hallucinated":"True","score":0.91627,"sentences":[{"score":0.91627,"text":"AIMon's toxicity detector likely generates labels such as 'toxic', 'abusive', 'offensive', and 'hateful'."}]}}]
Notice the AIMon response for detecting a hallucination here. This response contains an is_hallucinated boolean field, indicating whether the response is hallucinated (True) or not (False). Additionally, it includes a hallucination score, which is a probability between 0.0 and 1.0; the closer the score is to 1.0, the more likely it is that the response text is hallucinated.
The reason this response is hallucinated is that the output produced by the LLM does not exist in the input context (since we excluded the toxicity detector pages during ingestion above).
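If you want to act on this signal programmatically rather than just log it, you can read the score out of the detect response and apply a threshold. Below is a minimal sketch based on the response shape printed above; the 0.5 cutoff is our own choice, not an AIMon recommendation.
// Minimal sketch: gate the answer on the hallucination score returned by detect
const hallucinationScore = aimon_response_1[0]?.hallucination?.score ?? 0;
if (hallucinationScore > 0.5) {
    // e.g. fall back to a safe answer, retrieve more context, or flag the response for review
    console.warn(`Possible hallucination (score=${hallucinationScore}); withholding this answer.`);
} else {
    console.log(llm_response_1.response);
}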
Fixing the hallucination by supplying additional context
In this section, we show how to fix the above hallucination by adding the missing toxicity detectors to the vector store.
// Add additional documents to the existing vector database
const document_on_detectors = await extract_text_from_url('https://docs.aimon.ai/category/checker-models');
const document_on_toxicity = await extract_text_from_url('https://docs.aimon.ai/detectors/toxicity');
// Generate updated index
const updatedIndex = await VectorStoreIndex.fromDocuments([...documents, document_on_detectors, document_on_toxicity])
// Re-assemble the ContextChatEngine
const chatbot_2 = new ContextChatEngine({ retriever: updatedIndex.asRetriever({similarityTopK: 5}), systemPrompt: system_prompt});
Repeat the same query that produced the hallucination above, this time against the chatbot backed by the updated index.
const query_2 = "What is the full set of labels that AIMon's toxicity detector generates?";
const instructions_2 = "1. Limit the response to under 300 words.";
const llm_response_2 = await chatbot_2.chat({message: query_2});
const [context_2, relevance_scores_2] = get_source_documents(llm_response_2);
const aimon_response_2 = await aimon.detect( llm_response_2.response,
    context_2,
    query_2,
    detectors,
    instructions_2,
    false, // async_mode [boolean]
    false, // publish [boolean]
    "my_chatbot_app", // application name [string]
    "OpenAI-gpt-4o-mini" // LLM name [string]
);
console.log(`\nLLM response: ${llm_response_2}`);
console.log(`\nAIMon response: ${JSON.stringify(aimon_response_2)}`);
LLM response: AIMon's toxicity detector generates the following labels: 'identity_hate', 'toxic', 'severe_toxic', 'obscene', 'threat', and 'insult'.
AIMon response: [{"hallucination":{"is_hallucinated":"False","score":0.21688,"sentences":[{"score":0.21688,"text":"AIMon's toxicity detector generates the following labels: 'identity_hate', 'toxic', 'severe_toxic', 'obscene', 'threat', and 'insult'."}]}}]
🎉 Hallucination fixed!
Notice that the hallucination score from AIMon is now low (less than 0.5), indicating a low likelihood of hallucination.
As a recap, in this notebook we did the following things:
- Created a documentation chat application using LlamaIndex
- Integrated the application with AIMon to continuously monitor the quality of the output at low cost and low latency.
- Used AIMon to detect and fix a hallucination produced by an LLM.