Evaluation and Continuous Monitoring

With AIMon, you can evaluate and continuously monitor the quality of the generated text. In this section, we will demonstrate how to quickly and easily set up evaluation and/or monitoring for your LLM applications. Let's start with some basic concepts that are useful to understand before instrumenting your LLM application.

Model

A model is a generative model, typically an LLM, that generates text based on an input query, context, and user-provided instructions. The model can be a vanilla model, a fine-tuned model, or a prompt-engineered model.

Application

An application is a specific use case or task associated with a model, for example, a summarization application. Each application is versioned, i.e., each version of the application is associated with a particular model. When you use a different model for the same application, AIMon automatically creates a new version of the application.
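As a minimal sketch, this pairing can be expressed with the Application and Model constructors that appear in the examples later in this section; using the same application name with a different model is what produces a new application version:

from aimon import Application, Model

# Same application name, two different models. Per the versioning behavior
# described above, switching the model results in a new version of
# "my_first_llm_app" being created by AIMon.
app = Application("my_first_llm_app")
model_v1 = Model("my_first_model", "GPT-4o")
model_v2 = Model("my_best_model", "Llama3")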

Evaluation

Evaluation is the process of assessing the quality of the generated text, typically in an offline setup. This can be done using the various detectors provided by the AIMon platform. AIMon adopts a "batteries included" approach, i.e., you do not have to use a third-party API for the various detectors.

Before deploying the application to production, it is a good idea to test it with either a curated golden dataset or a snapshot of production traffic. In this section, we will demonstrate how AIMon can help you perform these tests.

Evaluation Dataset

AIMon can manage datasets for you. The dataset should be a CSV file with these columns:

  • "prompt": The prompt used for the LLM
  • "user_query": The query specified by the user
  • "context_docs": Context documents that are either retrieved through RAG or obtained through other methods. For tasks like summarization, these documents could be specified directly by the user.

A dataset can be created using the AIMon client as follows:

from aimon import Client
import json

aimon_client = Client(auth_header="Bearer <AIMON API KEY>")

# Create a new dataset
file_path = "evaluation_dataset.csv"

dataset = json.dumps({
    "name": "evaluation_dataset.csv",
    "description": "This is a golden dataset"
})

with open(file_path, 'rb') as file1:
    aimon_dataset = aimon_client.datasets.create(
        file=file1,
        json_data=dataset
    )

Dataset Collection

You can group a collection of evaluation datasets into a dataset collection for ease of use. A dataset collection can be created as follows:

# aimon_dataset1 and aimon_dataset2 are two datasets created as shown above
dataset_collection = aimon_client.datasets.collection.create(
    name="my_first_dataset_collection",
    dataset_ids=[aimon_dataset1.sha, aimon_dataset2.sha],
    description="This is a collection of two datasets."
)

This allows you to then access records from the dataset collection as follows:

# Get all records from the datasets in this collection
dataset_collection_records = []
for dataset_id in dataset_collection.dataset_ids:
    dataset_records = aimon_client.datasets.records.list(sha=dataset_id)
    dataset_collection_records.extend(dataset_records)

Creating an Evaluation

An evaluation is associated with a specific dataset collection and a particular version of an application (and its corresponding model).

Running an Evaluation

A "run" is an instance of an evaluation that you would like to track metrics against. You could have multiple runs of the same evaluation. This is typically done is a CI/CD context where the same evaluation would run at regular intervals. Since LLMs are probabilistic in nature, they could produce different outputs for the same query and context. It is a good idea to run the evaluations regularly to understand the variations of outputs produced by your LLMs. In addition, runs give you the ability to choose different metrics for each run.

Detectors can be specified using the config parameter in the payload as shown below. The keys indicate the type of metric computed and the value of detector_name is the specific algorithm used to compute those metrics. For most cases, we recommend using the default algorithm for each detector.

config = {
    'hallucination': {'detector_name': 'default'},
    'toxicity': {'detector_name': 'default'},
    'conciseness': {'detector_name': 'default'},
    'completeness': {'detector_name': 'default'}
}

You can also tag a particular run. Tags allow you to specify metadata like the application commit SHA or other key-value pairs that you want to insert for analytics purposes.
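As a sketch of both ideas, a different run can use a smaller config to compute only a subset of metrics, and tags are ordinary key-value pairs (the keys and values below are hypothetical); both are supplied to a run through the config parameter and tagging mechanism described above:

# A lighter config for a separate run that only computes the hallucination
# metric, illustrating that metrics can differ from run to run.
nightly_config = {
    'hallucination': {'detector_name': 'default'}
}

# Example tags: arbitrary key-value metadata attached to a run for analytics,
# such as the application commit SHA. These values are hypothetical.
tags = {
    'commit_sha': 'a1b2c3d',
    'environment': 'ci'
}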

Example

Setting up and running an evaluation can be done using the higher-level Analyze decorator in Python or the lower-level API that offers more control.

Here is an example of running an evaluation:

from aimon import AnalyzeEval, Application, Model

analyze_eval = AnalyzeEval(
    Application("my_first_llm_app"),
    Model("my_first_model", "GPT-4o"),
    evaluation_name="your_first_evaluation",
    dataset_collection_name="my_first_dataset_collection",
)

# LangChain app example
from langchain.text_splitter import CharacterTextSplitter
from langchain.docstore.document import Document
from langchain.llms.openai import OpenAI
from langchain.chains.summarize import load_summarize_chain

openai_api_key = "<OPENAI API KEY>"

# The analyze_eval decorator will automatically stream through
# records in the specified dataset collection and run each of them
# against this function. The function signature must contain
# context_docs, user_query and prompt as its first 3 arguments.
@analyze_eval
def run_application_eval_mode(context_docs=None, user_query=None, prompt=None):
    # Split the source text
    text_splitter = CharacterTextSplitter()
    texts = text_splitter.split_text(context_docs)

    # Create Document objects for the texts
    docs = [Document(page_content=t) for t in texts[:3]]

    # Initialize the OpenAI module, load and run the summarize chain
    llm = OpenAI(temperature=0, openai_api_key=openai_api_key)
    chain = load_summarize_chain(llm, chain_type="map_reduce")
    return chain.run(docs)

# This will automatically run an evaluation using records from the specified dataset collection asynchronously.
aimon_eval_res = run_application_eval_mode()
print(aimon_eval_res)

Continuous Monitoring

Once your application is ready for production, you can set up continuous monitoring to track the quality of the generated text. This can be done using the AnalyzeProd decorator in Python or the analyze.production API in TypeScript, as shown below:

from aimon import AnalyzeProd, Application, Model

analyze_prod = AnalyzeProd(
    Application("my_first_llm_app"),
    Model("my_best_model", "Llama3"),
    values_returned=["context", "generated_text"],
)

# LangChain app example
from langchain.text_splitter import CharacterTextSplitter
from langchain.docstore.document import Document
from langchain.llms.openai import OpenAI
from langchain.chains.summarize import load_summarize_chain

openai_api_key = "<OPENAI API KEY>"

@analyze_prod
def run_application(context_docs=None, user_query=None, prompt=None):
    # Split the source text
    text_splitter = CharacterTextSplitter()
    texts = text_splitter.split_text(context_docs)

    # Create Document objects for the texts
    docs = [Document(page_content=t) for t in texts[:3]]

    # Initialize the OpenAI module, load and run the summarize chain
    llm = OpenAI(temperature=0, openai_api_key=openai_api_key)
    chain = load_summarize_chain(llm, chain_type="map_reduce")
    return context_docs, chain.run(docs)

source_text = "Sample document to summarize"

# This will automatically run the application against the source text and prompt. It will also
# asynchronously run detections for the quality of the generated text.
context, res, aimon_res = run_application(source_text, prompt="LangChain based summarization of documents")
print(aimon_res)

Lower level API

If you need more control over the evaluation or continuous monitoring process, you can use the lower-level API described in this notebook.