
Offline Evaluation

This page explains how to quickly and easily evaluate your dataset of LLM prompts, contexts and responses using AIMon detectors (hallucination detector, toxicity detector, and others).

Uploading the Evaluation Dataset

Before running an evaluation, you should create a dataset CSV and upload it to the AIMon platform. A dataset is a CSV file that contains one or more of the supported columns listed below. A dataset is immutable once created.

The supported columns are:

  • "context_docs": These are context documents that are either retrieved from a RAG or through other methods. For tasks like summarization, these documents could be directly specified by the user.
  • "prompt": This is the system prompt used for the LLM
  • "instructions": These are the instructions provided to the LLM in the system prompt This field is a substring of the system prompt that is used to gauge instruction adherence.
  • "user_query": This the query specified by the user
  • "output": This is the generated text by the LLM

Depending on the detector being used, you may not need all the columns. For example, the hallucination detector requires only the "context_docs" and "output" columns, while the "context_classification" detector needs only the "context_docs" column. The dataset creation API is designed to fail fast, giving you immediate feedback on the columns required by the detector you are using.
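For instance, a dataset intended only for the hallucination detector can be as small as two columns. The row below is purely illustrative:

context_docs,output
"[""KB_Article_1: The latest version of the Acme python client is v2.1""]","The latest Acme python client version is v2.1"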

Here is an example dataset evaluation_dataset.csv:

prompt,user_query,context_docs,instructions,output
"Please provide information on the latest version of the Acme python client, including its features and release date.","""What is the latest version of Acme python client?""","[""KB_Article_1: Acme supports Python, Javascript, and Java. The latest version of the python library is v2.1, which was launched in March 2024"", ""KB_Article_2: Acme has deep integrations with the Python ecosystem where the Python client has shown to add value to developers"", ""KB_Article_3: The Acme python client version 2.1 introduces new features like async support and improved error handling.""]","1. Ensure that the response in under 500 words,
2. Ensure that there is no mention of the word ""Typescript""",The latest version is 2.1 and has async support. It was launched in March 2024
Could you explain how to configure the Acme python client for a new project?,"""How do I configure the Acme python client?""","[""KB_Article_4: Configuring the Acme python client involves setting up the environment variables first, followed by installing the necessary dependencies."", ""KB_Article_5: Detailed configuration steps for the Acme client can be found in the official documentation. It covers both basic and advanced setups.""]","1. Ensure that the response in under 500 words
2. Ensure that there is no mention of the word ""Typescript""
3. Ensure the response is in english"," Setup the environment variables, install dependencies and follow the official documentation for configuration"""
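If you build the dataset programmatically, a sketch like the following avoids CSV-escaping mistakes: the csv module doubles embedded quotes for you, and json.dumps produces the list-of-strings format that the "context_docs" column expects. The row contents here are illustrative.

import csv
import json

# Each row maps column names to values; context_docs is a JSON-encoded list of strings
rows = [{
    "prompt": "Please provide information on the latest version of the Acme python client.",
    "user_query": "What is the latest version of Acme python client?",
    "context_docs": json.dumps([
        "KB_Article_1: The latest version of the python library is v2.1"
    ]),
    "instructions": "1. Ensure that the response is under 500 words",
    "output": "The latest version is 2.1",
}]

with open("evaluation_dataset.csv", "w", newline="") as f:
    writer = csv.DictWriter(
        f, fieldnames=["prompt", "user_query", "context_docs", "instructions", "output"]
    )
    writer.writeheader()  # column header row
    writer.writerows(rows)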

Upload the dataset using the AIMon client as follows:

from aimon import Client
import json

aimon_client = Client(auth_header="Bearer <AIMON API KEY>")

# Create a new dataset
file_path = "evaluation_dataset.csv"

dataset_args = json.dumps({
    "name": "evaluation_dataset.csv",
    "description": "This is a golden dataset"
})

with open(file_path, 'rb') as file1:
    aimon_dataset = aimon_client.datasets.create(
        file=file1,
        json_data=dataset_args
    )
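The create call returns a dataset object whose sha field identifies the uploaded dataset; it is used below to reference the dataset when building a collection.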

Combining Datasets into a Dataset Collection

Group evaluation datasets into a collection for ease of use:

dataset_collection = aimon_client.datasets.collection.create(
    name="my_first_dataset_collection",
    dataset_ids=[aimon_dataset.sha],
    description="This is a collection containing just one dataset."
)

Running an Evaluation

An evaluation is associated with a specific dataset collection and a particular version of an application (and its corresponding model). You can evaluate the same application multiple times at different points in time; for example, this makes sense in a CI/CD context after any change to the application or the model.

Once you have uploaded the dataset collection, you can use the evaluate function to run an evaluation, as shown in the example below.

Detectors are specified using the config parameter as shown below. For each metric, you specify the name of the AIMon detector to invoke in the detector_name field. We recommend leaving this set to default if you are new to the platform.

from aimon import evaluate
import os

eval_config = {
    'hallucination': {'detector_name': 'default'},
    'toxicity': {'detector_name': 'default'},
    'conciseness': {'detector_name': 'default'},
    'completeness': {'detector_name': 'default'}
}

res = evaluate(
    dataset_collection_name="my_first_dataset_collection",  # use the same name you specified in client.datasets.collection.create
    headers=['context_docs', 'user_query', 'prompt', 'instructions', 'output'],  # columns of your dataset used in the evaluation
    application_name="llm_marketing_summarization_app_v5",
    model_name="meta-llama/Llama-3.2-1B_finetuned_oct_4",  # name of your LLM which generated the dataset responses
    evaluation_name="simple_eval_with_output_oct_17",
    api_key=os.getenv("AIMON_API_KEY"),
    aimon_client=aimon_client,
    config=eval_config,
)

print(res[0].response)
# AnalyzeCreateResponse(message='Data successfully sent to AIMon.', status=200)
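The evaluate call returns one result per evaluated record. Assuming each entry exposes the same response attribute shown above, you can spot-check the whole run with a simple loop:

# Print the AIMon response for every evaluated record
for record in res:
    print(record.response)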

Lower-level API

If you need more control over the evaluation or continuous monitoring process, you can use the lower-level API described in this notebook.

Glossary

Evaluation

Before deploying an LLM application to production, it is a good idea to test it with either a curated golden dataset or a snapshot of production traffic. The AIMon platform provides detectors to assess the quality of the generated text in your dataset. AIMon adopts a "batteries included" approach, i.e., you do not have to use a separate third-party API.

Model

A model is a generative model, typically an LLM, that generates text based on an input query, context, and user-provided instructions. The model can be a vanilla model, a fine-tuned model, or a prompt-engineered model. When evaluating on a dataset, you simply tag your evaluation with a model name.

Application

An application is a specific use case or task that is associated with a model, for example, a summarization application. Each application is versioned, i.e., each version of an application is associated with a particular model. When you use a different model for the same application, AIMon automatically creates a new version of the application.