Metrics Overview
Evaluating LLM and RAG applications often involves trade-offs between accuracy, reliability, and speed. Several frameworks exist, such as RAGAs, TruLens, and DeepEval, but they can be overwhelming due to inconsistent metrics, complex setup, and reliance on subjective LLM-based evaluation.
To simplify and standardize this process, we’ve organized the AIMon Labs metrics into four actionable categories that reflect real-world deployment priorities: Output Quality, Output Safety, RAG & Data, and Custom Metrics.
Metric Categories
Output Quality
Metrics that assess how well the LLM follows instructions, stays grounded in context, and produces complete, concise, and relevant responses.
Topic | Description |
---|---|
Hallucination | Does the LLM output introduce fabricated information that wasn’t in the context? |
Instruction Adherence | Does the output follow the prompt’s instructions accurately and thoroughly? |
Completeness | Does the response include all necessary information relevant to the query? |
Conciseness | Is the response brief and free of unnecessary repetition or verbosity? |
Groundedness | Is the output properly supported by the source context documents? |
Output Relevance | Is the LLM's answer aligned with the user’s query without digression or off-topic content? |
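To make these concrete, here is a minimal sketch of how a single response might be scored against the output-quality metrics above. The endpoint URL, payload fields, and metric keys are illustrative assumptions, not the documented AIMon API.

```python
import requests

# Hypothetical payload shape: the fields and metric keys below are
# illustrative assumptions, not a documented schema.
payload = {
    "context": [
        "Acme's Q3 revenue was $12.4M, up 8% year over year."
    ],
    "generated_text": "Acme's Q3 revenue was $12.4M, an 8% increase from last year.",
    "user_query": "How did Acme do in Q3?",
    "metrics": ["hallucination", "instruction_adherence", "completeness",
                "conciseness", "groundedness", "output_relevance"],
}

# Placeholder endpoint and key; substitute your actual deployment details.
response = requests.post(
    "https://example.com/v1/evaluate",
    json=payload,
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    timeout=30,
)
scores = response.json()
print(scores)  # e.g. {"hallucination": 0.02, "groundedness": 0.97, ...}
```

In practice you would log these scores per request and alert on regressions, rather than inspecting them one response at a time.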
Output Safety
Metrics that detect unsafe, biased, or privacy-violating content to ensure your LLM application meets safety, legal, and brand standards.
Note: Some safety metrics operate on the user’s query (pre-generation), while others evaluate the LLM output (post-generation).
- Pre-Response metrics: `sql_prevention`, `code_injection_detection`, `jailbreak`, `prompt_injection`
- Post-Response metrics: `toxicity`, `personal_harm`, `unsafe_stereotypes`, `cbrn`, `pii`
Topic | Description |
---|---|
Toxicity | Has the LLM generated any harmful, toxic, or offensive language? |
Prompt Injection | Is the user prompt attempting to override system behavior or leak instructions? |
CBRN | Is the output referencing or explaining harmful chemical, biological, radiological, or nuclear topics? |
Personal Harm | Could the content enable, promote, or encourage self-harm or violence toward others? |
Unsafe Stereotypes | Does the output perpetuate biased, stereotypical, or discriminatory views? |
Code Injection Detection | Is the user query attempting to inject unsafe executable code? |
SQL Prevention | Is the user query attempting SQL injection or requesting unsafe database operations? |
Jailbreak | Has the user prompt tried to bypass system-level constraints or filters? |
PII | Is any personally identifiable information (e.g. names, emails, phone numbers) being leaked in the response? |
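A common way to apply the pre-/post-response split is to gate the pipeline: screen the user's query before generation, and screen the model's output before returning it. The sketch below uses stub check functions and an assumed 0.5 threshold; the metric identifiers match the lists above, but the function signatures are placeholders for illustration.

```python
PRE_RESPONSE = ["sql_prevention", "code_injection_detection", "jailbreak", "prompt_injection"]
POST_RESPONSE = ["toxicity", "personal_harm", "unsafe_stereotypes", "cbrn", "pii"]

def run_checks(text: str, metrics: list[str]) -> dict[str, float]:
    """Stub standing in for a real safety-detection call; returns a score per metric."""
    return {m: 0.0 for m in metrics}  # placeholder: 0.0 means no issue detected

def guarded_generate(user_query: str, generate_fn) -> str:
    # 1. Pre-response: screen the query before it ever reaches the model.
    pre_scores = run_checks(user_query, PRE_RESPONSE)
    if any(score > 0.5 for score in pre_scores.values()):  # assumed threshold
        return "Sorry, I can't help with that request."

    # 2. Generate the response.
    output = generate_fn(user_query)

    # 3. Post-response: screen the output before returning it to the user.
    post_scores = run_checks(output, POST_RESPONSE)
    if any(score > 0.5 for score in post_scores.values()):
        return "The generated response was withheld by safety filters."
    return output

# Example usage with a trivial generator:
print(guarded_generate("What's the capital of France?", lambda q: "Paris."))
```

Failing closed, as this sketch does, is usually the safer default: if a check cannot be completed, withhold the response rather than return unscreened output.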
RAG & Data
Metrics that assess retrieval relevance and data quality in RAG pipelines — critical for grounding LLMs in the right context.
Topic | Description |
---|---|
Query-Context Relevance | Are the retrieved context chunks actually relevant to the user's query? |
Indexing Data Quality | Are the source documents coherent, accurate, and free from noise, duplicates, or truncation? |
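As a rough, self-contained proxy for Query-Context Relevance (not the AIMon detector itself), you can rank retrieved chunks by TF-IDF cosine similarity to the query; chunks that score near zero are likely padding the context without helping ground the answer. The sketch assumes scikit-learn is installed and the example texts are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

query = "What does the hallucination metric measure?"
retrieved_chunks = [
    "The hallucination metric flags statements not supported by the provided context.",
    "Our office is closed on public holidays.",  # likely irrelevant
    "Groundedness measures whether the output is supported by the source documents.",
]

# Fit TF-IDF over the query plus the chunks, then score each chunk against the query.
vectorizer = TfidfVectorizer().fit([query] + retrieved_chunks)
query_vec = vectorizer.transform([query])
chunk_vecs = vectorizer.transform(retrieved_chunks)
scores = cosine_similarity(query_vec, chunk_vecs)[0]

for chunk, score in sorted(zip(retrieved_chunks, scores), key=lambda x: -x[1]):
    print(f"{score:.2f}  {chunk[:60]}")
```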
Custom Metrics
Metrics that reflect your specific business goals or domain-specific requirements — ideal for cases where standard benchmarks fall short.
Topic | Description |
---|---|
Use-case Alignment | Does the output reflect the internal guidelines or regulatory constraints specific to your industry? |
Domain Accuracy | Is the output factually correct within a specialized domain (e.g. finance, healthcare, legal)? |
Brand Voice Matching | Does the LLM mimic the tone and style expected by your brand or communication playbook? |
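Custom metrics are usually expressed as a natural-language rubric that an LLM judge scores against. The sketch below shows one plausible shape for such a definition; the field names and the judge call are placeholders, not a documented AIMon interface.

```python
# A custom metric expressed as a rubric plus a scoring scale. The field names
# here are illustrative assumptions, not a documented schema.
brand_voice_metric = {
    "name": "brand_voice_matching",
    "instructions": (
        "Score how well the response matches our brand voice: "
        "friendly, concise, no jargon, always addresses the customer directly."
    ),
    "scale": {"min": 0.0, "max": 1.0},
}

def judge(metric: dict, response_text: str) -> float:
    """Placeholder for an LLM-as-judge call that scores response_text against the rubric."""
    raise NotImplementedError("wire this to your evaluation backend")

# Example usage (raises until judge() is implemented):
# score = judge(brand_voice_metric, "Hi Sam! Your refund is on its way and should arrive in 2-3 days.")
```

Keeping the rubric in data rather than code makes it easy to version, review, and reuse the same metric across different applications.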