Evaluation and Continuous Monitoring
With AIMon, you can evaluate and continuously monitor the quality of the generated text. In this section, we will demonstrate how to quickly and easily set up evaluation and/or monitoring for your LLM applications. Let's start with some basic concepts that are useful to understand before instrumenting your LLM application.
Model
A model is a generative model, typically an LLM, that generates text based on an input query, context, and user-provided instructions. The model can be a vanilla model, a fine-tuned model, or a prompt-engineered model.
Application
An application is a specific use case or task that is associated with a model, for example, a summarization application. Each application is versioned, i.e., each version of the application is associated with a particular model. When you use a different model for the same application, AIMon automatically creates a new version of the application.
Evaluation
Evaluation is the process of assessing the quality of the generated text, typically in an offline setup. This can be done using the various detectors provided by the AIMon platform. AIMon adopts a "batteries included" approach, i.e., you do not have to use a third-party API for the various detectors.
Before deploying the application to production, it is a good idea to test it with either a curated golden dataset or a snapshot of production traffic. In this section, we will demonstrate how AIMon can help you perform these tests.
Evaluation Dataset
AIMon can manage datasets for you. The dataset should be a CSV file with the columns listed below; a minimal example follows the list.
- "prompt": This is the prompt used for the LLM
- "user_query": This is the query specified by the user
- "context_docs": These are context documents that are either retrieved by a RAG pipeline or obtained through other methods. For tasks like summarization, these documents can be specified directly by the user.
A dataset can be created using the AIMon client as follows:
- Python
- TypeScript
from aimon import Client
import json

aimon_client = Client(auth_header="Bearer <AIMON API KEY>")

# Create a new dataset
file_path = "evaluation_dataset.csv"
dataset = json.dumps({
    "name": "evaluation_dataset.csv",
    "description": "This is a golden dataset"
})

with open(file_path, 'rb') as file1:
    aimon_dataset = aimon_client.datasets.create(
        file=file1,
        json_data=dataset
    )
import Client from "aimon";
const aimon_client = new Client({
authHeader: `Bearer API_KEY`,
});
// Creates a new dataset from the local path csv file
const createDataset = async (
path: string,
datasetName: string,
description: string
): Promise<Client.Dataset> => {
const file = await fileFromPath(path);
const json_data = JSON.stringify({
name: datasetName,
description: description,
});
const params = {
file: file,
json_data: json_data,
};
const dataset: Client.Dataset = await aimon.datasets.create(params);
return dataset;
};
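Each dataset returned by these calls carries a sha identifier (aimon_dataset.sha in Python, dataset.sha in TypeScript), which is what you pass when grouping datasets into a collection below.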
Dataset Collection
You can group evaluation datasets into a dataset collection for ease of use. A dataset collection can be created as follows:
- Python
- TypeScript
dataset_collection = aimon_client.datasets.collection.create(
    name="my_first_dataset_collection",
    dataset_ids=[aimon_dataset1.sha, aimon_dataset2.sha],
    description="This is a collection of two datasets."
)
const dataset1 = await createDataset(
  "/path/to/file/filename_1.csv",
  "filename1.csv",
  "description"
);

const dataset2 = await createDataset(
  "/path/to/file/filename_2.csv",
  "filename2.csv",
  "description"
);

let datasetCollection: Client.Datasets.CollectionCreateResponse | undefined;

// Ensures that dataset1.sha and dataset2.sha are defined
if (dataset1.sha && dataset2.sha) {
  // Creates dataset collection
  datasetCollection = await aimon.datasets.collection.create({
    name: "my_first_dataset_collection",
    dataset_ids: [dataset1.sha, dataset2.sha],
    description: "This is a collection of two datasets.",
  });
} else {
  throw new Error("Dataset sha is undefined");
}
This allows you to then access records from the dataset collection as follows:
- Python
- TypeScript
# Get all records from the datasets in this collection
dataset_collection_records = []
for dataset_id in dataset_collection.dataset_ids:
    dataset_records = aimon_client.datasets.records.list(sha=dataset_id)
    dataset_collection_records.extend(dataset_records)
const datasetCollectionRecords: any[] = [];
for (const datasetId of datasetCollection.dataset_ids) {
  const datasetRecords = await aimon.datasets.records.list({ sha: datasetId });
  datasetCollectionRecords.push(...datasetRecords);
}
Creating an Evaluation
An evaluation is associated with a specific dataset collection and a particular version of an application (and its corresponding model).
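If you use the AnalyzeEval decorator shown in the example below, the evaluation is created for you from its evaluation_name and dataset_collection_name arguments, so you do not need to create it separately.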
Running an Evaluation
A "run" is an instance of an evaluation that you would like to track metrics against. You could have multiple runs of the same evaluation. This is typically done is a CI/CD context where the same evaluation would run at regular intervals. Since LLMs are probabilistic in nature, they could produce different outputs for the same query and context. It is a good idea to run the evaluations regularly to understand the variations of outputs produced by your LLMs. In addition, runs give you the ability to choose different metrics for each run.
Detectors can be specified using the config parameter in the payload, as shown below. The keys indicate the type of metric computed, and the value of detector_name is the specific algorithm used to compute that metric. For most cases, we recommend using the default algorithm for each detector.
- Python
- TypeScript
config = {
    'hallucination': {'detector_name': 'default'},
    'toxicity': {'detector_name': 'default'},
    'conciseness': {'detector_name': 'default'},
    'completeness': {'detector_name': 'default'}
}
const config = {
  hallucination: { detector_name: "default" },
  toxicity: { detector_name: "default" },
  conciseness: { detector_name: "default" },
  completeness: { detector_name: "default" },
};
You can also tag a particular run. Tags allow you to specify metadata, such as the application commit SHA or other key-value pairs, that you want to attach for analytics purposes.
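For illustration only, run tags could look like the simple key-value mapping below; the exact argument that accepts them depends on the SDK call you use to create the run, so treat this as a hypothetical sketch rather than the definitive API.

# Hypothetical tag values; check the SDK reference for the exact
# parameter that accepts tags when creating a run.
run_tags = {
    "commit_sha": "9f8e7d6",    # application commit SHA
    "environment": "staging",   # any other key-value metadata for analytics
}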
Example
Setting up and running an evaluation can be done using the higher-level AnalyzeEval decorator in Python or the lower-level API that offers more control. Here is an example of running an evaluation:
- Python
- TypeScript
from aimon import AnalyzeEval, Application, Model

analyze_eval = AnalyzeEval(
    Application("my_first_llm_app"),
    Model("my_first_model", "GPT-4o"),
    evaluation_name="your_first_evaluation",
    dataset_collection_name="my_first_dataset_collection",
)

# Langchain app example
from langchain.text_splitter import CharacterTextSplitter
from langchain.docstore.document import Document
from langchain.llms.openai import OpenAI
from langchain.chains.summarize import load_summarize_chain

openai_api_key = "<OPENAI_API_KEY>"

# The analyze_eval decorator will automatically stream through
# records in the specified dataset collection and run each record
# against this function. The signature of this function must contain
# context_docs, user_query and prompt as the first 3 arguments.
@analyze_eval
def run_application_eval_mode(context_docs=None, user_query=None, prompt=None):
    # Split the source text
    text_splitter = CharacterTextSplitter()
    texts = text_splitter.split_text(context_docs)
    # Create Document objects for the texts
    docs = [Document(page_content=t) for t in texts[:3]]
    # Initialize the OpenAI module, load and run the summarize chain
    llm = OpenAI(temperature=0, openai_api_key=openai_api_key)
    chain = load_summarize_chain(llm, chain_type="map_reduce")
    return chain.run(docs)

# This will automatically run an evaluation using records from the
# specified dataset collection asynchronously.
aimon_eval_res = run_application_eval_mode()
print(aimon_eval_res)
import Client from "aimon";
import { OpenAI } from "@langchain/openai";
import { loadSummarizationChain } from "langchain/chains";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
import { fileFromPath } from "formdata-node/file-from-path";
// Create the AIMon client. You would need an API Key (that can be retrieved from the UI in your user profile).
const aimon = new Client({
authHeader: 'Bearer: <AIMON_API_KEY>',
});
// Initialize OpenAI configuration
const openaiApiKey = "OPENAI_API_KEY";
// Analyzes the dataset record and model output offline.
const runApplication: any = async (
application: any,
sourceText: any,
prompt: string | null = null,
userQuery: string | null = null,
evaluationRun: any = null
) => {
// Split the source text
const textSplitter = new RecursiveCharacterTextSplitter({ chunkSize: 1000 });
const docs = await textSplitter.createDocuments([sourceText]);
const contextDocs = docs.map((doc) => doc.pageContent);
// Summarize the texts
const llm = new OpenAI({ temperature: 0, openAIApiKey: openaiApiKey });
const chain = loadSummarizationChain(llm, { type: "map_reduce" });
const output = await chain.invoke({
input_documents: docs,
});
// Analyze quality of the generated output using AIMon
const aimonResponse: Client.AnalyzeCreateResponse =
await aimon.analyze.create([
{
application_id: application.id,
version: application.version,
prompt: prompt !== null ? prompt : "",
user_query: userQuery !== null ? userQuery : "",
context_docs: contextDocs,
output: output.text,
evaluation_id: evaluationRun.evaluation_id,
evaluation_run_id: evaluationRun.id,
},
]);
};
// myApplication and newEvaluationRun are assumed to have been created
// beforehand (for example, via the lower-level API described at the end of this page).
for (const record of datasetCollectionRecords) {
  await runApplication(
    myApplication,
    record.context_docs,
    record.prompt,
    record.user_query,
    newEvaluationRun
  );
}
Continuous Monitoring
Once your application is ready for production, you can set up continuous monitoring to track the quality of the generated text. This can be done using the AnalyzeProd decorator in Python or the analyze.production API in TypeScript, as shown below:
- Python
- TypeScript
from aimon import AnalyzeProd, Application, Model

analyze_prod = AnalyzeProd(
    Application("my_first_llm_app"),
    Model("my_best_model", "Llama3"),
    values_returned=["context", "generated_text"],
)

# Langchain app example
from langchain.text_splitter import CharacterTextSplitter
from langchain.docstore.document import Document
from langchain.llms.openai import OpenAI
from langchain.chains.summarize import load_summarize_chain

openai_api_key = "<OPENAI_API_KEY>"

@analyze_prod
def run_application(context_docs=None, user_query=None, prompt=None):
    # Split the source text
    text_splitter = CharacterTextSplitter()
    texts = text_splitter.split_text(context_docs)
    # Create Document objects for the texts
    docs = [Document(page_content=t) for t in texts[:3]]
    # Initialize the OpenAI module, load and run the summarize chain
    llm = OpenAI(temperature=0, openai_api_key=openai_api_key)
    chain = load_summarize_chain(llm, chain_type="map_reduce")
    return context_docs, chain.run(docs)

source_text = "Sample document to summarize"

# This will automatically run the application against the source text and prompt. It will also
# asynchronously run detections on the quality of the generated text.
context, res, aimon_res = run_application(source_text, prompt="Langchain-based summarization of documents")
print(aimon_res)
import Client from "aimon";
import { OpenAI } from "@langchain/openai";
import { loadSummarizationChain } from "langchain/chains";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
import { fileFromPath } from "formdata-node/file-from-path";
// Create the AIMon client. You would need an API Key (that can be retrieved from the UI in your user profile).
const aimon = new Client({
authHeader: 'Bearer: <AIMON_API_KEY>',
});
// Initialize OpenAI configuration
const openaiApiKey = "OPENAI_API_KEY";
const runApplication: any = async (
applicationName: string,
modelName: string,
sourceText: any,
prompt: string | null = null,
userQuery: string | null = null,
) => {
// Split the source text
const textSplitter = new RecursiveCharacterTextSplitter({ chunkSize: 1000 });
const docs = await textSplitter.createDocuments([sourceText]);
const contextDocs = docs.map((doc) => doc.pageContent);
// Summarize the texts
const llm = new OpenAI({ temperature: 0, openAIApiKey: openaiApiKey });
const chain = loadSummarizationChain(llm, { type: "map_reduce" });
const output = await chain.invoke({
input_documents: docs,
});
const payload = {
context_docs: contextDocs,
output: String(output.text),
prompt: prompt ?? "",
user_query: userQuery ?? "",
instructions: "These are the instructions",
};
const config = {
hallucination: { detector_name: "default" },
conciseness: { detector_name: "default" },
completeness: { detector_name: "default" },
instruction_adherence: { detector_name: "default" },
};
// Analyze quality of the generated output using AIMon
const response: Client.AnalyzeCreateResponse = await aimon.analyze.production(
applicationName,
modelName,
payload,
config
);
};
Lower-level API
If you need more control over the evaluation or continuous monitoring process, you can use the lower-level API described in this notebook.
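As a rough orientation only: the lower-level flow resembles the TypeScript aimon.analyze.create call used in the evaluation example above. The Python sketch below assumes the Python client exposes an analogous analyze.create method and uses hypothetical placeholder IDs; the field names are taken from that example, but the exact signature may differ, so treat the notebook as the authoritative reference.

from aimon import Client

aimon_client = Client(auth_header="Bearer <AIMON API KEY>")

# A sketch only: assumes the Python client mirrors the TypeScript analyze.create
# call shown earlier. All IDs below are hypothetical placeholders.
response = aimon_client.analyze.create([
    {
        "application_id": "<APPLICATION_ID>",
        "version": "<APPLICATION_VERSION>",
        "prompt": "prompt used for the LLM",
        "user_query": "user query",
        "context_docs": ["context document 1"],
        "output": "generated text to evaluate",
        "evaluation_id": "<EVALUATION_ID>",
        "evaluation_run_id": "<EVALUATION_RUN_ID>",
    }
])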