Metrics Overview
Evaluating LLM and RAG applications often involves trade-offs between accuracy, reliability, and speed. Several frameworks exist, such as Ragas, TruLens, and DeepEval, but they can be overwhelming: metrics are inconsistent across tools, setup is complex, and many scores rely on subjective LLM-based judges.
To simplify and standardize this process, we propose the SCORE framework — a set of principles to guide metric design and evaluation.
The SCORE Principles
SCORE stands for:
- Simple – Easy to understand and implement
- Consistent – Delivers reproducible results
- Objective – Based on measurable, verifiable signals
- Reliable – Works across use cases and datasets
- Efficient – Low setup and runtime cost
These principles help you evaluate not just LLM outputs, but also RAG pipelines and data quality.
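To make these principles concrete, here is a minimal sketch of a metric that satisfies all five: a deterministic exact-match scorer. The function name and normalization rules are our own illustration, not part of any existing framework.

```python
import re

def exact_match(prediction: str, reference: str) -> float:
    """Score 1.0 if the prediction matches the reference after light
    normalization, else 0.0. Deterministic (Consistent), directly
    verifiable (Objective), and free of model calls (Efficient)."""
    def normalize(s: str) -> str:
        # Lowercase and collapse whitespace so trivial formatting
        # differences don't affect the score.
        return re.sub(r"\s+", " ", s.strip().lower())
    return 1.0 if normalize(prediction) == normalize(reference) else 0.0

# Average the per-example scores over a small batch.
predictions = ["Paris", "  berlin "]
references = ["paris", "Madrid"]
scores = [exact_match(p, r) for p, r in zip(predictions, references)]
print(sum(scores) / len(scores))  # 0.5
```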
We’ve organized the metrics into three categories:
Output Quality
Metrics that assess how well the LLM follows instructions, stays accurate, and provides complete, concise responses.
| Topic | Description |
|---|---|
| Inaccuracy or Hallucinations | Is the output factually correct? Are there any fabricated facts? |
| Output Relevance | Is the output in line with the query, or is the LLM serving irrelevant or off-topic information? |
| Instruction Adherence | Are the instructions in the prompt followed in the output? |
| Completeness | Especially for summarization use cases: how well does the LLM capture all the facts in the context documents? |
| Conciseness | Is the LLM keeping its responses brief and to the point? |
| Custom Metrics | Are my business objectives being met? |
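Some of these signals can be approximated without an LLM judge. The sketch below shows two crude, deterministic proxies: token recall against the context as a stand-in for Completeness, and a length ratio as a stand-in for Conciseness. The function names and heuristics are our own illustration, not a prescribed implementation.

```python
def token_recall(output: str, context: str) -> float:
    """Fraction of unique context tokens that appear in the output.
    A rough proxy for Completeness: how much of the source material
    the generated summary actually covers."""
    ctx = set(context.lower().split())
    out = set(output.lower().split())
    return len(ctx & out) / len(ctx) if ctx else 0.0

def length_ratio(output: str, context: str) -> float:
    """Output length relative to context length, a rough Conciseness
    signal: values near 0 suggest a terse summary, values near 1
    suggest the output is almost as long as its source."""
    return len(output.split()) / max(len(context.split()), 1)

context = "The quarterly report shows revenue grew 12 percent while costs fell."
summary = "Revenue grew 12 percent and costs fell."
print(f"completeness ~ {token_recall(summary, context):.2f}")
print(f"conciseness  ~ {length_ratio(summary, context):.2f}")
```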
Output Safety
Metrics that detect harmful, toxic, or privacy-violating content, ensuring outputs align with safety and brand guidelines.
| Topic | Description |
|---|---|
| PII, PCI, or PHI | Is the output leaking any unintended personal, financial, or health information? |
| Toxicity | Has the LLM generated any toxic or unsafe language? |
| Off-brand | Is the LLM mentioning competitor brands or companies? Is it using language that strays from the brand's tone? |
| Bias | Does the LLM exhibit any gender, cultural, or political bias? |
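As a starting point for PII checks, a rule-based scan is cheap and deterministic. The sketch below uses a few hypothetical regex patterns of our own; real deployments typically layer dedicated detectors (for example, NER models) on top of rules like these.

```python
import re

# Illustrative patterns for a few common PII types; coverage here is
# deliberately narrow and not production-grade.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "us_phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def detect_pii(text: str) -> dict[str, list[str]]:
    """Return every match per PII category found in the text."""
    hits = {name: pat.findall(text) for name, pat in PII_PATTERNS.items()}
    return {name: found for name, found in hits.items() if found}

output = "Contact Jane at jane.doe@example.com or 555-867-5309."
print(detect_pii(output))
# {'email': ['jane.doe@example.com'], 'us_phone': ['555-867-5309']}
```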
RAG & Data
Metrics focused on the relevance and integrity of context used in retrieval-augmented generation pipelines.
| Topic | Description |
|---|---|
| Query-Context Relevance | Are the results from the RAG system relevant to the query the system is serving? |
| Data Quality | Is the data retrieved from the data sources coherent, or does it contain conflicting information, noisy data, incomplete sentences, etc.? |
| Indexing Data Quality | Is the data being ingested and indexed cleanly, or do chunking and formatting issues degrade what can be retrieved? |
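Query-context relevance can be estimated objectively without an LLM judge. The sketch below scores each retrieved chunk against the query with TF-IDF cosine similarity, assuming scikit-learn is available; embedding-based similarity is the usual upgrade when lexical overlap is too coarse.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def query_context_relevance(query: str, contexts: list[str]) -> list[float]:
    """Score each retrieved chunk against the query using TF-IDF
    cosine similarity. Lexical overlap is a rough stand-in for
    semantic relevance, but it is fast, cheap, and reproducible."""
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform([query] + contexts)
    # Row 0 is the query; the remaining rows are the chunks.
    return cosine_similarity(matrix[0:1], matrix[1:])[0].tolist()

query = "What is the refund policy for damaged items?"
chunks = [
    "Damaged items can be refunded within 30 days of delivery.",
    "Our headquarters are located in Austin, Texas.",
]
for chunk, score in zip(chunks, query_context_relevance(query, chunks)):
    print(f"{score:.2f}  {chunk}")
```

A low score for every retrieved chunk is a useful early warning that the retriever, not the generator, is the weak link in the pipeline.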