Context Relevance

Context relevance is a critical aspect of evaluating LLM-based applications, including those built on models like GPT-4o. The relevance of retrieved context data strongly influences the accuracy and quality of the responses an LLM generates, yet assessing that relevance is often complicated and subjective. To address these challenges, AIMon provides purpose-built tools: a customizable context relevance evaluator and a re-ranker model that together offer a more efficient and accurate way to assess and improve the relevance of context data in LLM-based applications.

Challenges with Traditional LLM Evaluation Methods

Traditional methods of evaluating context relevance in LLM-based applications have several limitations, including:

  • Variance and inconsistency in results.
  • Subjectiveness of evaluations.
  • Cost inefficiency of using large off-the-shelf LLMs.

Read more about the pros and cons of LLM judges here.

AIMon's Approach to Context Relevance

AIMon's purpose-built and customizable relevance graders are designed to address the limitations of traditional evaluation methods. These tools enable users to assess the relevance of context data more accurately and efficiently, leading to improved performance and reliability of LLM-based applications. By using AIMon's relevance graders, users can customize the evaluation criteria to suit their specific needs and requirements, resulting in more precise and consistent evaluations.

Customizations made during evaluation can be applied immediately to the re-ranking phase of the LLM application, bubbling the most relevant documents up to the top.

[Diagram: how the AIMon re-ranker works]

See more about the re-ranker model here.
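Conceptually, re-ranking sorts retrieved documents by their relevance scores so that the best candidates reach the LLM first. The sketch below is a generic illustration of that idea, not the AIMon re-ranker itself; the documents and scores are hypothetical.

```python
# Hypothetical documents with relevance scores (0-100), as a relevance
# grader might assign them for a shipping-related user query.
scored = [
    {"doc": "Shipping policy: delivery in 5-7 business days.", "relevance_score": 92.1},
    {"doc": "Our company was founded in 2010.", "relevance_score": 11.4},
    {"doc": "Expedited shipping arrives in 2 business days.", "relevance_score": 88.7},
]

def rerank(docs, top_k=2):
    """Sort documents by relevance score, highest first, and keep the top_k."""
    return sorted(docs, key=lambda d: d["relevance_score"], reverse=True)[:top_k]

top_docs = rerank(scored)
print([d["relevance_score"] for d in top_docs])  # → [92.1, 88.7]
```

The low-scoring, off-topic document is dropped entirely, so only the most relevant context is passed to the LLM.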

Task Definition

The task definition allows users to specify the domain or objective of the re-ranker, ensuring that context is evaluated based on the intended use case. By providing a structured task definition, users can improve context relevance assessments, enabling more precise ranking of retrieved documents for applications such as summarization, question answering, and retrieval-augmented generation (RAG).
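To make this concrete, the snippet below sketches task definitions for a few use cases and shows where the `task_definition` field sits in a request payload. The wording of these strings is illustrative, not prescribed by AIMon.

```python
# Illustrative task_definition strings (hypothetical wording) for different
# use cases. The task_definition tells the relevance grader / re-ranker what
# the retrieved context will be used for.
task_definitions = {
    "question_answering": "Answer the user's question based on the provided context.",
    "summarization": "Summarize the key points of the provided support documents.",
    "rag": "Retrieve passages that directly support answering logistics queries.",
}

# A single payload item using one of the task definitions above
payload_item = {
    "user_query": "What is the capital city of France?",
    "context": ["The capital of France is Paris."],
    "task_definition": task_definitions["question_answering"],
}
print(payload_item["task_definition"])
```

Swapping in a different task definition changes how the same context is scored, since relevance is judged against the stated objective.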

Example

Pre-requisites

Before running, ensure that you have an AIMon API key. Refer to the Quickstart guide for more information.
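The SDK example below reads the API key from the `AIMON_API_KEY` environment variable, so one way to prepare is to export it in your shell first (the value shown is a placeholder):

```shell
# Store your AIMon API key in an environment variable (placeholder value)
export AIMON_API_KEY="your-api-key-here"

# Confirm the variable is set before running the examples below
echo "${AIMON_API_KEY:+key is set}"
```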

Code Examples

Like other evaluations, context relevance can be evaluated using the AIMon SDK. Here is an example of how to evaluate the relevance of context data:

API Request & Response Example

[
  {
    "context": ["The capital of France is Paris.", "France is a country in Europe."],
    "user_query": "What is the capital city of France?",
    "task_definition": "Answer the user's question based on the provided context.",
    "config": {
      "retrieval_relevance": {
        "detector_name": "default"
      }
    }
  }
]

Python SDK Example

# Synchronous example

import os
import json

from aimon import Client

# Initialize the AIMon client using the API key from the environment
client = Client(auth_header=f"Bearer {os.environ['AIMON_API_KEY']}")

# Construct the detection payload
payload = [{
    "generated_text": "The product ships within 2 business days.",
    "user_query": "How long will it take for my order to arrive?",
    "context": ["Our shipping policy states delivery in 5–7 business days."],
    "task_definition": "Answer the user's question based on logistics data.",
    "config": {
        "retrieval_relevance": {
            "detector_name": "default"
        }
    },
    "publish": False
}]

# Call the synchronous detect endpoint
response = client.inference.detect(body=payload)

# Print the retrieval relevance result
print(json.dumps(response[0].retrieval_relevance, indent=2))

Here is an example of how the metrics appear in the AIMon UI: [Image: retrieval_relevance.png]