
Offline Evaluation

This page explains how to quickly and easily evaluate your dataset of LLM prompts, contexts and responses using AIMon detectors (hallucination detector, toxicity detector, and others).

Uploading the Evaluation Dataset

Before running an evaluation, you should create a dataset CSV and upload it to the AIMon platform. A dataset is a CSV file that contains one or more of the supported columns listed below. A dataset is immutable once created.

The supported columns are:

  • "context_docs": These are context documents that are either retrieved from a RAG or through other methods. For tasks like summarization, these documents could be directly specified by the user.
  • "prompt": This is the system prompt used for the LLM
  • "instructions": These are the instructions provided to the LLM in the system prompt This field is a substring of the system prompt that is used to gauge instruction adherence.
  • "user_query": This the query specified by the user
  • "output": This is the generated text by the LLM

Depending on the detector being used, you may not need all the columns. For example, the hallucination detector requires only the "context_docs" and "output" columns, while the "context_classification" detector needs only the "context_docs" column. The dataset creation API is designed to fail fast, giving you immediate feedback on the columns required by the detector you are using.
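For instance, a dataset intended only for the hallucination detector can be as small as two columns. The row below is purely illustrative:

context_docs,output
"[""KB_Article_1: The latest version of the Acme python client is v2.1""]","The latest Acme python client version is v2.1"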

Here is an example dataset evaluation_dataset.csv:

prompt,user_query,context_docs,instructions,output
"Please provide information on the latest version of the Acme python client, including its features and release date.","""What is the latest version of Acme python client?""","[""KB_Article_1: Acme supports Python, Javascript, and Java. The latest version of the python library is v2.1, which was launched in March 2024"", ""KB_Article_2: Acme has deep integrations with the Python ecosystem where the Python client has shown to add value to developers"", ""KB_Article_3: The Acme python client version 2.1 introduces new features like async support and improved error handling.""]","1. Ensure that the response in under 500 words,
2. Ensure that there is no mention of the word ""Typescript""",The latest version is 2.1 and has async support. It was launched in March 2024
Could you explain how to configure the Acme python client for a new project?,"""How do I configure the Acme python client?""","[""KB_Article_4: Configuring the Acme python client involves setting up the environment variables first, followed by installing the necessary dependencies."", ""KB_Article_5: Detailed configuration steps for the Acme client can be found in the official documentation. It covers both basic and advanced setups.""]","1. Ensure that the response in under 500 words
2. Ensure that there is no mention of the word ""Typescript""
3. Ensure the response is in english"," Setup the environment variables, install dependencies and follow the official documentation for configuration"""
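If you build the dataset programmatically, a sketch like the following avoids CSV-escaping mistakes: the csv module doubles embedded quotes for you, and json.dumps produces the list-of-strings format that the "context_docs" column expects. The row contents here are illustrative.

import csv
import json

# Each row maps column names to values; context_docs is a JSON-encoded list of strings
rows = [{
    "prompt": "Please provide information on the latest version of the Acme python client.",
    "user_query": "What is the latest version of Acme python client?",
    "context_docs": json.dumps([
        "KB_Article_1: The latest version of the python library is v2.1"
    ]),
    "instructions": "1. Ensure that the response is under 500 words",
    "output": "The latest version is 2.1",
}]

with open("evaluation_dataset.csv", "w", newline="") as f:
    writer = csv.DictWriter(
        f, fieldnames=["prompt", "user_query", "context_docs", "instructions", "output"]
    )
    writer.writeheader()  # column header row
    writer.writerows(rows)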

Upload the dataset using the AIMon client as follows:

from aimon import Client
import json

aimon_client = Client(auth_header="Bearer <AIMON API KEY>")

# Create a new dataset
file_path = "evaluation_dataset.csv"

dataset_args = json.dumps({
    "name": "evaluation_dataset.csv",
    "description": "This is a golden dataset"
})

with open(file_path, 'rb') as file1:
    aimon_dataset = aimon_client.datasets.create(
        file=file1,
        json_data=dataset_args
    )
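The create call returns a dataset object whose sha field identifies the uploaded dataset; it is used below to reference the dataset when building a collection.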

Combining Datasets into a Dataset Collection

Group evaluation datasets into a collection for ease of use:

dataset_collection = aimon_client.datasets.collection.create(
    name="my_first_dataset_collection",
    dataset_ids=[aimon_dataset.sha],
    description="This is a collection containing just one dataset."
)

Running an Evaluation

An evaluation is associated with a specific dataset collection and a particular version of an application (and its corresponding model). You can evaluate the same application multiple times at different points in time; for example, this makes sense in a CI/CD context after any change to the application or the model.

Once you have uploaded the dataset collection, you can use the evaluate function to run an evaluation, as shown in the example below.

Detectors are specified using the config parameter as shown below. For each metric, you specify the name of the AIMon detector to invoke in the detector_name field. We recommend leaving this set to default if you are new to the platform.

from aimon import evaluate
import os

eval_config = {
    'hallucination': {'detector_name': 'default'},
    'toxicity': {'detector_name': 'default'},
    'conciseness': {'detector_name': 'default'},
    'completeness': {'detector_name': 'default'}
}

res = evaluate(
    dataset_collection_name="my_first_dataset_collection",  # use the same name you specified in client.datasets.collection.create
    headers=['context_docs', 'user_query', 'prompt', 'instructions', 'output'],  # columns of your dataset used in the evaluation
    application_name="llm_marketing_summarization_app_v5",
    model_name="meta-llama/Llama-3.2-1B_finetuned_oct_4",  # name of your LLM which generated the dataset responses
    evaluation_name="simple_eval_with_output_oct_17",
    api_key=os.getenv("AIMON_API_KEY"),
    aimon_client=aimon_client,
    config=eval_config,
)

print(res[0].response)
# AnalyzeCreateResponse(message='Data successfully sent to AIMon.', status=200)
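The evaluate call returns one result per evaluated record. Assuming each entry exposes the same response attribute shown above, you can spot-check the whole run with a simple loop:

# Print the AIMon response for every evaluated record
for record in res:
    print(record.response)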

Lower-level API

If you need more control over the evaluation or continuous monitoring process, you can use the lower-level API described in this notebook.

Glossary

Evaluation

Before deploying an LLM application to production, it is a good idea to test it with either a curated golden dataset or a snapshot of production traffic. The AIMon platform provides detectors to assess the quality of the generated text in your dataset. AIMon adopts a "batteries included" approach, i.e., you do not have to use a separate third-party API.

Model

A model is a generative model, typically an LLM, that generates text based on an input query, context, and user-provided instructions. The model can be a vanilla model, a fine-tuned model, or a prompt-engineered model. When evaluating on a dataset, you simply tag your evaluation with a model name.

Application

An application is a specific use case or task that is associated with a model, for example, a summarization application. Each application is versioned, i.e., each version of an application is associated with a particular model. When you use a different model for the same application, AIMon automatically creates a new version of the application.