
Metrics Overview

Evaluating LLM and RAG applications often involves trade-offs between accuracy, reliability, and speed. Several frameworks exist for this, such as Ragas, TruLens, and DeepEval, but they can be overwhelming: their metrics are inconsistent with one another, setup is complex, and many rely on subjective LLM-based evaluation.

To simplify and standardize this process, we propose the SCORE framework — a set of principles to guide metric design and evaluation.

The SCORE Principles

SCORE stands for:

  • Simple – Easy to understand and implement
  • Consistent – Delivers reproducible results
  • Objective – Based on measurable, verifiable signals
  • Reliable – Works across use cases and datasets
  • Efficient – Low setup and runtime cost

These principles help you evaluate not just LLM outputs, but also RAG pipelines and data quality.

We’ve organized the metrics into three categories:

Output Quality

Metrics that assess how well the LLM follows instructions, stays accurate, and provides complete, concise responses.

View Output Quality Metrics

| Topic | Description |
| --- | --- |
| Inaccuracy or Hallucinations | Is the output factually correct? Are there any fabricated facts? |
| Output Relevance | Is the output in line with the query, or is the LLM serving irrelevant or off-topic information? |
| Instruction Adherence | Does the output follow the instructions given in the prompt? |
| Completeness | Especially for summarization use cases: how well does the LLM capture all the facts in the context documents? |
| Conciseness | Is the LLM keeping its generation appropriately brief? |
| Custom Metrics | Are my business objectives being met? |
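To make the Objective and Efficient principles concrete, below is a minimal sketch of how two of these checks could be computed as deterministic signals. The function names and the lexical-overlap heuristic are illustrative assumptions, not a reference implementation; a production completeness metric would use more robust fact extraction.

```python
import re
from collections import Counter

# A tiny stopword list for illustration; a real metric would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "are", "for",
             "on", "that", "it", "as", "with"}

def _terms(text: str) -> Counter:
    """Lowercased word counts with common stopwords removed."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)

def completeness(context: str, output: str, top_k: int = 20) -> float:
    """Fraction of the top-k most frequent context terms covered by the output.

    A crude lexical proxy for "did the summary capture the facts?":
    1.0 means every salient term appears in the output, 0.0 means none do.
    """
    salient = [w for w, _ in _terms(context).most_common(top_k)]
    if not salient:
        return 0.0
    covered = _terms(output)
    return sum(1 for w in salient if w in covered) / len(salient)

def conciseness(context: str, output: str) -> float:
    """Word-count ratio of output to context; lower means more concise."""
    return len(output.split()) / max(1, len(context.split()))
```

Both checks are deterministic and model-free, so repeated runs give identical scores at negligible cost, which is what the Consistent and Efficient principles ask for.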

Output Safety

Metrics that detect harmful, toxic, or privacy-violating content, ensuring outputs align with safety and brand guidelines.

View Output Safety Metrics

| Topic | Description |
| --- | --- |
| PII, PCI, or PHI | Is the output leaking any unintended personal, financial, or health information? |
| Toxicity | Has the LLM generated any toxic or unsafe language? |
| Off-brand | Is the LLM talking about other brands or companies that are your competitors? Is it using language that does not capture the brand's tone? |
| Bias | Does the LLM exhibit any gender, cultural, or political bias? |
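PII leakage is one safety check that can be made fully objective with pattern matching. The patterns below are a deliberately small, illustrative sketch; real PII, PCI, or PHI detection needs far broader coverage (names, addresses, medical codes) and often a trained recognizer.

```python
import re

# Illustrative patterns only, covering a handful of common formats.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "phone": re.compile(r"\b(?:\+?1[ -]?)?\(?\d{3}\)?[ -]?\d{3}[ -]?\d{4}\b"),
}

def pii_findings(output: str) -> dict[str, list[str]]:
    """Return matched spans per PII category; an empty dict means no leaks found."""
    findings = {}
    for label, pattern in PII_PATTERNS.items():
        matches = pattern.findall(output)
        if matches:
            findings[label] = matches
    return findings

print(pii_findings("Contact me at jane.doe@example.com or 555-123-4567."))
# {'email': ['jane.doe@example.com'], 'phone': ['555-123-4567']}
```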

RAG & Data

Metrics focused on the relevance and integrity of context used in retrieval-augmented generation pipelines.

View RAG and Data Metrics

| Topic | Description |
| --- | --- |
| Query-Context Relevance | Are the results retrieved by the RAG system relevant to the query the system is serving? |
| Data Quality | Is the retrieved data coherent, or does it contain conflicting information, noise, or incomplete sentences? |
| Indexing Data Quality | Is the data being indexed coherent, or does it contain those same issues before it ever enters the retrieval store? |
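Query-context relevance is straightforward to approximate with embedding similarity. The sketch below assumes the sentence-transformers library and the all-MiniLM-L6-v2 model purely for illustration; any embedding model works, and the suggested 0.5 threshold is an arbitrary starting point to tune per dataset.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def query_context_relevance(query: str, chunks: list[str]) -> list[float]:
    """Cosine similarity between the query and each retrieved chunk.

    Chunks scoring below a chosen threshold (e.g. 0.5) can be flagged
    as irrelevant to the query the system is serving.
    """
    query_emb = model.encode(query, convert_to_tensor=True)
    chunk_embs = model.encode(chunks, convert_to_tensor=True)
    return util.cos_sim(query_emb, chunk_embs)[0].tolist()

scores = query_context_relevance(
    "What is our refund window?",
    [
        "Refunds are accepted within 30 days of purchase.",
        "Our headquarters are located in Berlin.",
    ],
)
# The first chunk should score well above the second.
```

Because the score is a plain cosine similarity, the same query and chunks always produce the same number, keeping the check consistent and cheap to run across a whole evaluation set.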