Metrics Overview

Evaluating LLM and RAG applications often involves trade-offs between accuracy, reliability, and speed. Several evaluation frameworks exist, such as Ragas, TruLens, and DeepEval, but they can be overwhelming: their metrics are defined inconsistently, setup can be complex, and many rely on subjective LLM-based evaluation.

To simplify and standardize this process, we’ve organized the AIMon Labs metrics into four actionable categories that reflect real-world deployment priorities: Output Quality, Output Safety, RAG & Data, and Custom Metrics.

Metric Categories

Output Quality

Metrics that assess how well the LLM follows instructions, stays grounded in context, and produces complete, concise, and relevant responses.

View Output Quality Metrics

| Topic | Description |
| --- | --- |
| Hallucination | Does the LLM output introduce fabricated information that wasn't in the context? |
| Instruction Adherence | Does the output follow the prompt's instructions accurately and thoroughly? |
| Completeness | Does the response include all necessary information relevant to the query? |
| Conciseness | Is the response brief and free of unnecessary repetition or verbosity? |
| Groundedness | Is the output properly supported by the source context documents? |
| Output Relevance | Is the LLM's answer aligned with the user's query without digression or off-topic content? |
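
To make the category concrete, here is a minimal sketch of an evaluation record scored against the Output Quality metrics above. The field names, metric keys, and config shape are illustrative assumptions that simply mirror the table, not a documented AIMon payload; consult the SDK or API reference for the exact request format.

```python
# Illustrative only: the field names and metric keys below mirror the table
# above; this is NOT a documented AIMon request payload.
record = {
    "user_query": "What is the refund window for annual plans?",
    "context": ["Annual plans can be refunded within 30 days of purchase."],
    "generated_text": "Annual plans can be refunded within 30 days.",
}

# A single record can be scored against several Output Quality metrics at once.
output_quality_metrics = [
    "hallucination",
    "instruction_adherence",
    "completeness",
    "conciseness",
    "groundedness",
    "output_relevance",
]

# Placeholder config: one entry per metric, to be filled with detector options.
detect_config = {metric: {} for metric in output_quality_metrics}
print(detect_config)
```

The point is that one record carrying the user query, the retrieved context, and the generated text is enough to compute every metric in this category.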

Output Safety

Metrics that detect unsafe, biased, or privacy-violating content to ensure your LLM application meets safety, legal, and brand standards.

Note: Some safety metrics run on the user's query before a response is generated (pre-response), while others evaluate the LLM output after it is generated (post-response). A sketch of where each set runs in the request path follows the list below.

  • Pre-Response metrics: sql_prevention, code_injection_detection, jailbreak, prompt_injection

  • Post-Response metrics: toxicity, personal_harm, unsafe_stereotypes, cbrn, pii
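
The sketch below is a hypothetical request path (none of these helpers are AIMon SDK calls) showing where the two groups run: pre-response metrics screen the user query before any generation happens, and post-response metrics screen the model output before it is returned.

```python
# Hypothetical request path showing where pre- and post-response safety
# metrics would run. None of these helpers are AIMon SDK calls.
PRE_RESPONSE = ["sql_prevention", "code_injection_detection", "jailbreak", "prompt_injection"]
POST_RESPONSE = ["toxicity", "personal_harm", "unsafe_stereotypes", "cbrn", "pii"]

def run_checks(text: str, metrics: list[str]) -> dict[str, bool]:
    # Stand-in: pretend every check passes. A real implementation would call
    # a detection service and return one pass/fail (or score) per metric.
    return {metric: True for metric in metrics}

def call_llm(query: str) -> str:
    return f"(model answer to: {query})"  # placeholder generation

def answer(query: str) -> str:
    pre = run_checks(query, PRE_RESPONSE)       # screen the user query first
    if not all(pre.values()):
        return "Request blocked by pre-response safety checks."

    output = call_llm(query)

    post = run_checks(output, POST_RESPONSE)    # then screen the model output
    if not all(post.values()):
        return "Response withheld by post-response safety checks."
    return output

print(answer("Summarize our travel policy."))
```

The useful property of this split is that pre-response metrics can block a request before any tokens are generated, while post-response metrics act as a final gate on what the user actually sees.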

View Safety Metrics

| Topic | Description |
| --- | --- |
| Toxicity | Has the LLM generated any harmful, toxic, or offensive language? |
| Prompt Injection | Is the user prompt attempting to override system behavior or leak instructions? |
| CBRN | Is the output referencing or explaining harmful chemical, biological, radiological, or nuclear topics? |
| Personal Harm | Could the content enable, promote, or encourage self-harm or violence toward others? |
| Unsafe Stereotypes | Does the output perpetuate biased, stereotypical, or discriminatory views? |
| Code Injection Detection | Is the user query attempting to inject unsafe executable code? |
| SQL Prevention | Is the query vulnerable to SQL injection or suggesting unsafe database usage? |
| Jailbreak | Has the user prompt tried to bypass system-level constraints or filters? |
| PII | Is any personally identifiable information (e.g. names, emails, phone numbers) being leaked in the response? |

RAG & Data

Metrics that assess retrieval relevance and data quality in RAG pipelines — critical for grounding LLMs in the right context.

View RAG and Data Metrics

| Topic | Description |
| --- | --- |
| Query-Context Relevance | Are the retrieved context chunks actually relevant to the user's query? |
| Indexing Data Quality | Are the source documents coherent, accurate, and free from noise, duplicates, or truncation? |
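
As a rough illustration of Query-Context Relevance (a simple cosine-similarity proxy, not the AIMon detector), the sketch below ranks retrieved chunks by embedding similarity to the user query, assuming the sentence-transformers package is available.

```python
# Rough baseline for Query-Context Relevance (not the AIMon detector):
# rank retrieved chunks by embedding cosine similarity to the user query.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "What is the refund window for annual plans?"
chunks = [
    "Annual plans can be refunded within 30 days of purchase.",
    "Our support team is available Monday through Friday.",
]

query_emb = model.encode(query, convert_to_tensor=True)
chunk_embs = model.encode(chunks, convert_to_tensor=True)
scores = util.cos_sim(query_emb, chunk_embs)[0]  # one score per chunk

# Low-scoring chunks are candidates for retriever or chunking fixes.
for chunk, score in sorted(zip(chunks, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.2f}  {chunk}")
```

Chunks that consistently score low against the queries they are retrieved for usually point to retriever, chunking, or indexing problems rather than generation problems.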

Custom Metrics

Metrics that reflect your specific business goals or domain-specific requirements — ideal for cases where standard benchmarks fall short.

View Custom Metric Examples

| Topic | Description |
| --- | --- |
| Use-case Alignment | Does the output reflect the internal guidelines or regulatory constraints specific to your industry? |
| Domain Accuracy | Are LLMs generating factually correct content in a specialized domain (e.g. finance, healthcare, legal)? |
| Brand Voice Matching | Does the LLM mimic the tone and style expected by your brand or communication playbook? |
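
A custom metric can be as simple as a deterministic rule. The hypothetical check below scores use-case alignment for a finance assistant by requiring a disclaimer and flagging forbidden phrases; the rule set and scoring are placeholders for whatever your own guidelines require.

```python
# Hypothetical custom metric: a deterministic "use-case alignment" check that a
# financial answer carries the required disclaimer and avoids forbidden phrases.
REQUIRED_DISCLAIMER = "this is not financial advice"
FORBIDDEN_PHRASES = ("guaranteed returns", "risk-free")

def use_case_alignment(generated_text: str) -> dict:
    text = generated_text.lower()
    has_disclaimer = REQUIRED_DISCLAIMER in text
    violations = [phrase for phrase in FORBIDDEN_PHRASES if phrase in text]
    # Score 1.0 only when the disclaimer is present and nothing forbidden appears.
    score = 1.0 if has_disclaimer and not violations else 0.0
    return {"score": score, "has_disclaimer": has_disclaimer, "violations": violations}

print(use_case_alignment(
    "Index funds have historically grown over time, but this is not financial advice."
))
```

Rule-based checks like this work well for hard constraints; fuzzier goals such as brand voice matching typically need a rubric-driven or model-assisted evaluator instead.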