Offline Evaluation
This page explains how to quickly evaluate your dataset of LLM prompts, contexts, and responses using AIMon detectors (hallucination, toxicity, and others).
Uploading the Evaluation Dataset
Before evaluating on a dataset, create a dataset CSV and upload it to the AIMon platform. A dataset is a CSV file containing one or more of the supported columns listed below. A dataset is immutable once created.
The supported columns are:
- "prompt": The system prompt used for the LLM
- "user_query": The query specified by the user
- "context_docs": Context documents, either retrieved from a RAG pipeline or supplied through other methods. For tasks like summarization, these documents may be specified directly by the user.
- "output": The text generated by the LLM
- "instructions": The instructions provided to the LLM
Depending on the detector being used, you may not need all the columns. For example, the hallucination detector requires only the "context_docs" and "output" columns, while the "context_classification" detector needs only the "context_docs" column. The dataset creation API is designed to fail fast, giving you immediate feedback on the required columns for the detector you are using.
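If you want to catch missing columns before uploading, you can run a quick local check. A minimal sketch follows; the required-column sets are taken from the examples in this paragraph and are illustrative only (the API's fail-fast errors remain the source of truth):

```python
import csv

# Required columns per detector -- taken from the examples above.
# Other detectors may require different columns, so treat this as illustrative.
REQUIRED_COLUMNS = {
    "hallucination": {"context_docs", "output"},
    "context_classification": {"context_docs"},
}

def check_dataset(path: str, detector: str) -> None:
    """Raise ValueError if the CSV header lacks the detector's required columns."""
    with open(path, newline="") as f:
        header = set(next(csv.reader(f)))
    missing = REQUIRED_COLUMNS[detector] - header
    if missing:
        raise ValueError(f"{detector} requires missing columns: {sorted(missing)}")
```

Running this before upload saves a round trip to the API when a column was accidentally dropped.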
Here is an example dataset, `evaluation_dataset.csv`:
prompt,user_query,context_docs,instructions,output
"Please provide information on the latest version of the Acme python client, including its features and release date.","""What is the latest version of Acme python client?""","[""KB_Article_1: Acme supports Python, Javascript, and Java. The latest version of the python library is v2.1, which was launched in March 2024"", ""KB_Article_2: Acme has deep integrations with the Python ecosystem where the Python client has shown to add value to developers"", ""KB_Article_3: The Acme python client version 2.1 introduces new features like async support and improved error handling.""]","1. Ensure that the response is under 500 words,
2. Ensure that there is no mention of the word ""Typescript""",The latest version is 2.1 and has async support. It was launched in March 2024
Could you explain how to configure the Acme python client for a new project?,"""How do I configure the Acme python client?""","[""KB_Article_4: Configuring the Acme python client involves setting up the environment variables first, followed by installing the necessary dependencies."", ""KB_Article_5: Detailed configuration steps for the Acme client can be found in the official documentation. It covers both basic and advanced setups.""]","1. Ensure that the response is under 500 words
2. Ensure that there is no mention of the word ""Typescript""
3. Ensure the response is in English","Set up the environment variables, install dependencies, and follow the official documentation for configuration"
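Note that fields such as "context_docs" embed commas and quotes, so they are CSV-quoted, and the "context_docs" value itself is a JSON-encoded list of strings. Here is a quick sketch of reading such a file with Python's standard csv module; the row below is a trimmed single-row version of the example above:

```python
import csv
import io
import json

# A trimmed single-row version of evaluation_dataset.csv from above.
raw = '''prompt,user_query,context_docs,output
"Please provide information on the latest version of the Acme python client.","""What is the latest version of Acme python client?""","[""KB_Article_1: The latest version of the python library is v2.1""]","The latest version is 2.1"
'''

rows = list(csv.DictReader(io.StringIO(raw)))

# CSV quoting preserves the embedded commas and doubled quotes...
print(rows[0]["user_query"])   # "What is the latest version of Acme python client?"

# ...and context_docs decodes to a Python list of document strings.
docs = json.loads(rows[0]["context_docs"])
print(docs[0])                 # KB_Article_1: ...
```

Any CSV writer that follows standard quoting rules (doubled inner quotes, quoted fields for embedded commas and newlines) will produce a file the platform can ingest.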
Upload the dataset using the AIMon client as follows:
- Python
- TypeScript
from aimon import Client
import json

aimon_client = Client(auth_header="Bearer <AIMON API KEY>")

# Create a new dataset
file_path = "evaluation_dataset.csv"
dataset_args = json.dumps({
    "name": "evaluation_dataset.csv",
    "description": "This is a golden dataset"
})

with open(file_path, 'rb') as file1:
    aimon_dataset = aimon_client.datasets.create(
        file=file1,
        json_data=dataset_args
    )
import Client, { fileFromPath } from "aimon";

const aimon_client = new Client({
  authHeader: `Bearer API_KEY`,
});

// Creates a new dataset from a local CSV file
const createDataset = async (
  path: string,
  datasetName: string,
  description: string
): Promise<Client.Dataset> => {
  const file = await fileFromPath(path);
  const json_data = JSON.stringify({
    name: datasetName,
    description: description,
  });
  const params = {
    file: file,
    json_data: json_data,
  };
  const dataset: Client.Dataset = await aimon_client.datasets.create(params);
  return dataset;
};
Combining Datasets into a Dataset Collection
Group evaluation datasets into a collection for ease of use:
- Python
- TypeScript
dataset_collection = aimon_client.datasets.collection.create(
    name="my_first_dataset_collection",
    dataset_ids=[aimon_dataset.sha],
    description="This is a collection containing just one dataset."
)
const dataset1 = await createDataset(
  "/path/to/file/filename_1.csv",
  "filename1.csv",
  "description"
);
const dataset2 = await createDataset(
  "/path/to/file/filename_2.csv",
  "filename2.csv",
  "description"
);

let datasetCollection: Client.Datasets.CollectionCreateResponse | undefined;

// Ensure that dataset1.sha and dataset2.sha are defined
if (dataset1.sha && dataset2.sha) {
  // Create the dataset collection
  datasetCollection = await aimon_client.datasets.collection.create({
    name: "my_first_dataset_collection",
    dataset_ids: [dataset1.sha, dataset2.sha],
    description: "This is a collection of two datasets.",
  });
} else {
  throw new Error("Dataset sha is undefined");
}
Running an Evaluation
An evaluation is associated with a specific dataset collection and a particular version of an application (and its corresponding model). You could evaluate the same application multiple times at different points in time. For example, this makes sense to do in a CI/CD context after any changes to the application or the model.
Once you have uploaded the dataset collection, you can use the `evaluate` function to run an evaluation, as shown in the example below. Detectors are specified using the `config` parameter. For each metric, specify the name of the AIMon detector to invoke in the `detector_name` field. We recommend leaving this as `default` if you are new to the platform.
- Python
- TypeScript
from aimon import evaluate
import os

eval_config = {
    'hallucination': {'detector_name': 'default'},
    'toxicity': {'detector_name': 'default'},
    'conciseness': {'detector_name': 'default'},
    'completeness': {'detector_name': 'default'}
}

res = evaluate(
    dataset_collection_name="my_first_dataset_collection",  # use the same name you specified in client.datasets.collection.create
    headers=['context_docs', 'user_query', 'prompt', 'instructions', 'output'],  # columns of your dataset used in the evaluation
    application_name="llm_marketing_summarization_app_v5",
    model_name="meta-llama/Llama-3.2-1B_finetuned_oct_4",  # name of the LLM that generated the dataset responses
    evaluation_name="simple_eval_with_output_oct_17",
    api_key=os.getenv("AIMON_API_KEY"),
    aimon_client=aimon_client,
    config=eval_config,
)

print(res[0].response)
# AnalyzeCreateResponse(message='Data successfully sent to AIMon.', status=200)
const config = {
  hallucination: { detector_name: "default" },
  toxicity: { detector_name: "default" },
  conciseness: { detector_name: "default" },
  completeness: { detector_name: "default" },
};

for (const record of datasetCollectionRecords) {
  // Analyze the quality of the generated output using AIMon
  const aimonResponse: Client.AnalyzeCreateResponse =
    await aimon_client.analyze.create([
      {
        application_id: application.id,
        version: application.version,
        prompt: record.prompt !== null ? record.prompt : "",
        user_query: record.userQuery !== null ? record.userQuery : "",
        context_docs: record.contextDocs,
        output: record.output,
        evaluation_id: evaluationRun.evaluation_id,
        evaluation_run_id: evaluationRun.id,
      },
    ]);
}
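The `evaluate` call above returns one result per dataset record, and each result's `response` carries the status shown in the printed output. A minimal sketch of verifying that every record was accepted; the stand-in classes below only mimic the printed shape (`message`/`status` on `res[i].response`) and are not the SDK's own types:

```python
from dataclasses import dataclass

# Stand-ins that mirror only the printed output above; NOT the SDK's classes.
@dataclass
class AnalyzeCreateResponse:
    message: str
    status: int

@dataclass
class EvalRecordResult:
    response: AnalyzeCreateResponse

# In real use, `res` is the list returned by evaluate().
res = [EvalRecordResult(AnalyzeCreateResponse("Data successfully sent to AIMon.", 200))]

# Fail fast if the platform rejected any record.
failed = [r for r in res if r.response.status != 200]
assert not failed, f"{len(failed)} record(s) were not accepted by AIMon"
print("all records accepted")
```

A check like this is useful in CI/CD, where a non-200 status should fail the pipeline rather than be silently ignored.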
Lower-level API
If you need more control over the evaluation or continuous monitoring process, you can use the lower-level API described in this notebook.
Glossary
Evaluation
Before deploying an LLM application to production, it is a good idea to test it with either a curated golden dataset or a snapshot of production traffic. The AIMon platform provides detectors to assess the quality of the generated text in your dataset. AIMon adopts a "batteries included" approach: you do not have to use another third-party API.
Model
A model is a generative model, typically an LLM, that generates text based on an input query, context, and user-provided instructions. It can be a vanilla, fine-tuned, or prompt-engineered model. When evaluating on a dataset, you simply tag your evaluation with a model name.
Application
An application is a specific use case or task associated with a model, for example a summarization application. Each application is versioned: each version of an application is associated with a particular model. When you use a different model for the same application, AIMon automatically creates a new version of the application.