Evaluation and Continuous Monitoring

With AIMon, you can evaluate and continuously monitor the quality of the generated text. In this section, we will demonstrate how to quickly and easily set up evaluation and/or monitoring for your LLM applications. Let's start with some basic concepts that will be useful to understand before instrumenting your LLM application.

Model

A model is a generative model, typically an LLM, that generates text based on an input query, context, and user-provided instructions. The model can be a vanilla model, a fine-tuned model, or a prompt-engineered model.

Application

An application is a specific use case or task that is associated with a model, for example, a summarization application. Each application is versioned, i.e., each version of the application is associated with a particular model. When you use a different model for the same application, AIMon automatically creates a new version of the application.
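
As an illustration, the snippet below uses the Application and Model classes from the examples later in this guide; pairing the same application name with a different model is what causes AIMon to create a new application version (the versioning itself happens on the AIMon side, not in your code).

from aimon import Application, Model

# One logical application...
app = Application("my_first_llm_app")

# ...instrumented with one model corresponds to one version of the application.
model_a = Model("my_first_model", "GPT-4o")

# Instrumenting the same application with a different model later on
# results in AIMon creating a new application version automatically.
model_b = Model("my_best_model", "Llama3")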

Evaluation

Evaluation is the process of assessing the quality of the generated text, typically in an offline setting. This can be done using the various detectors provided by the AIMon platform. AIMon adopts a "batteries included" approach, i.e., you do not have to rely on additional third-party APIs for the various detectors.

Before deploying the application to production, it is a good idea to test it with either a curated golden dataset or a snapshot of production traffic. In this section, we will demonstrate how AIMon can help you perform these tests.

Evaluation Dataset

AIMon can manage datasets for you. The dataset should be a CSV file with these columns:

  • "prompt": This is the prompt used for the LLM
  • "user_query": This is the query specified by the user
  • "context_docs": These are the context documents, either retrieved by a RAG pipeline or obtained through other methods. For tasks like summarization, these documents can be specified directly by the user.
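
As a concrete, purely illustrative example, the following snippet uses Python's csv module to write a minimal evaluation_dataset.csv with these three columns (the row contents here are placeholders, not real data):

import csv

# Placeholder rows; a real dataset would contain your curated prompts, queries and contexts.
rows = [
    {
        "prompt": "You are a helpful assistant that summarizes documents.",
        "user_query": "Summarize the attached document.",
        "context_docs": "AIMon helps you evaluate and monitor the quality of LLM-generated text.",
    },
]

with open("evaluation_dataset.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["prompt", "user_query", "context_docs"])
    writer.writeheader()
    writer.writerows(rows)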

A dataset can be created using the AIMon client as follows:

from aimon import Client
import json

aimon_client = Client(auth_header="Bearer <AIMON API KEY>")

# Create a new dataset from a local CSV file
file_path = "evaluation_dataset.csv"

dataset = json.dumps({
    "name": "evaluation_dataset.csv",
    "description": "This is a golden dataset"
})

with open(file_path, 'rb') as file1:
    aimon_dataset = aimon_client.datasets.create(
        file=file1,
        json_data=dataset
    )

Dataset Collection

You can group a collection of evaluation datasets into a dataset collection for ease of use. A dataset collection can be created as follows:

# Assumes two datasets (aimon_dataset1, aimon_dataset2) were created as shown above
dataset_collection = aimon_client.datasets.collection.create(
    name="my_first_dataset_collection",
    dataset_ids=[aimon_dataset1.sha, aimon_dataset2.sha],
    description="This is a collection of two datasets."
)

This allows you to then access records from the dataset collection as follows:

# Get all records from the datasets in this collection
dataset_collection_records = []
for dataset_id in dataset_collection.dataset_ids:
    dataset_records = aimon_client.datasets.records.list(sha=dataset_id)
    dataset_collection_records.extend(dataset_records)

Creating an Evaluation

An evaluation is associated with a specific dataset collection and a particular version of an application (and its corresponding model).

Running an Evaluation

A "run" is an instance of an evaluation that you would like to track metrics against. You can have multiple runs of the same evaluation; this is typically done in a CI/CD context where the same evaluation runs at regular intervals. Since LLMs are probabilistic in nature, they can produce different outputs for the same query and context, so it is a good idea to run evaluations regularly to understand the variation in the outputs produced by your LLMs. In addition, runs give you the ability to choose different metrics for each run.

Detectors can be specified using the config parameter in the payload as shown below. The keys indicate the type of metric computed and the value of detector_name is the specific algorithm used to compute those metrics. For most cases, we recommend using the default algorithm for each detector.

config = {
    'hallucination': {'detector_name': 'default'},
    'toxicity': {'detector_name': 'default'},
    'conciseness': {'detector_name': 'default'},
    'completeness': {'detector_name': 'default'}
}

You can also tag a particular run. Tags allow you to attach metadata, such as the application commit SHA or other key-value pairs, for analytics purposes.
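
For example, a run in a CI/CD pipeline might carry tags like the ones below; the keys and values are entirely up to you, and the exact parameter through which they are passed depends on whether you use the decorator or the lower level API, so treat the names below as placeholders.

# Illustrative tags for a nightly evaluation run; all keys and values are placeholders.
tags = {
    "git_commit": "9f2c1ab",        # application commit SHA
    "ci_pipeline": "nightly-eval",  # pipeline that triggered this run
    "dataset_version": "v2",
}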

Example

Setting up and running an evaluation can be done using the higher level AnalyzeEval decorator in Python or the lower level API, which offers more control.

Here is an example of running an evaluation:

from aimon import AnalyzeEval, Application, Model

analyze_eval = AnalyzeEval(
    Application("my_first_llm_app"),
    Model("my_first_model", "GPT-4o"),
    evaluation_name="your_first_evaluation",
    dataset_collection_name="my_first_dataset_collection",
)

# LangChain app example
from langchain.text_splitter import CharacterTextSplitter
from langchain.docstore.document import Document
from langchain.llms.openai import OpenAI
from langchain.chains.summarize import load_summarize_chain

# The analyze_eval decorator will automatically stream through the
# records in the specified dataset collection and run each of them
# against this function. The function signature must have
# context_docs, user_query and prompt as its first 3 arguments.
@analyze_eval
def run_application_eval_mode(context_docs=None, user_query=None, prompt=None):
    # Split the source text
    text_splitter = CharacterTextSplitter()
    texts = text_splitter.split_text(context_docs)

    # Create Document objects for the texts
    docs = [Document(page_content=t) for t in texts[:3]]

    # Initialize the OpenAI module, load and run the summarize chain
    # (openai_api_key is assumed to be defined elsewhere, e.g., read from the environment)
    llm = OpenAI(temperature=0, openai_api_key=openai_api_key)
    chain = load_summarize_chain(llm, chain_type="map_reduce")
    return chain.run(docs)

# This will automatically run an evaluation using records from the specified dataset collection asynchronously.
aimon_eval_res = run_application_eval_mode()
print(aimon_eval_res)

Continuous Monitoring

Once your application is ready for production, you can set up continuous monitoring to track the quality of the generated text. This can be done using the AnalyzeProd decorator in Python or the analyze.production API in TypeScript, as shown below:

from aimon import AnalyzeProd, Application, Model

analyze_prod = AnalyzeProd(
    Application("my_first_llm_app"),
    Model("my_best_model", "Llama3"),
    values_returned=["context", "generated_text"],
)

# LangChain app example
from langchain.text_splitter import CharacterTextSplitter
from langchain.docstore.document import Document
from langchain.llms.openai import OpenAI
from langchain.chains.summarize import load_summarize_chain

@analyze_prod
def run_application(context_docs=None, user_query=None, prompt=None):
    # Split the source text
    text_splitter = CharacterTextSplitter()
    texts = text_splitter.split_text(context_docs)

    # Create Document objects for the texts
    docs = [Document(page_content=t) for t in texts[:3]]

    # Initialize the OpenAI module, load and run the summarize chain
    # (openai_api_key is assumed to be defined elsewhere, e.g., read from the environment)
    llm = OpenAI(temperature=0, openai_api_key=openai_api_key)
    chain = load_summarize_chain(llm, chain_type="map_reduce")
    return context_docs, chain.run(docs)

source_text = "Sample document to summarize"

# This will automatically run the application against the source text and prompt. It will also
# asynchronously run detections for the quality of the generated text.
context, res, aimon_res = run_application(source_text, prompt="LangChain based summarization of documents")
print(aimon_res)

Lower level API

If you need more control over the evaluation or continuous monitoring process, you can use the lower level API described in this notebook.