Metrics Overview
Evaluating LLM and RAG applications often involves trade-offs between accuracy, reliability, and speed. Several frameworks exist, such as Ragas, TruLens, and DeepEval, but they can be overwhelming: metrics are inconsistent across tools, setup is complex, and many scores rely on subjective LLM-based judges.
To simplify and standardize this process, we propose the SCORE framework — a set of principles to guide metric design and evaluation.
The SCORE Principles
SCORE stands for:
- Simple – Easy to understand and implement
- Consistent – Delivers reproducible results
- Objective – Based on measurable, verifiable signals
- Reliable – Works across use cases and datasets
- Efficient – Low setup and runtime cost
These principles help you evaluate not just LLM outputs, but also RAG pipelines and data quality.
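To make these principles concrete, here is a minimal sketch of a metric that satisfies all five: a deterministic exact-match scorer. The function name and normalization rules are our own illustration, not part of any existing framework.

```python
import re

def exact_match(prediction: str, reference: str) -> float:
    """Score 1.0 if the prediction matches the reference after light
    normalization, else 0.0. Deterministic (Consistent), directly
    verifiable (Objective), and free of model calls (Efficient)."""
    def normalize(s: str) -> str:
        # Lowercase and collapse whitespace so trivial formatting
        # differences don't affect the score.
        return re.sub(r"\s+", " ", s.strip().lower())
    return 1.0 if normalize(prediction) == normalize(reference) else 0.0

# Average the per-example scores over a small batch.
predictions = ["Paris", "  berlin "]
references = ["paris", "Madrid"]
scores = [exact_match(p, r) for p, r in zip(predictions, references)]
print(sum(scores) / len(scores))  # 0.5
```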
We’ve organized the metrics into three categories:
Output Quality
Metrics that assess how well the LLM follows instructions, stays accurate, and provides complete, concise responses.
| Topic | Description |
|---|---|
| Inaccuracy or Hallucinations | Is the output factually correct? Are there any fabricated facts? |
| Output Relevance | Is the output in line with the query, or is the LLM serving irrelevant or off-topic information? |
| Instruction Adherence | Are the instructions in the prompt followed in the output? |
| Completeness | Especially for summarization use cases: how well does the LLM capture all the facts in the context documents? |
| Conciseness | Is the LLM keeping its responses brief and to the point? |
| Custom Metrics | Are my business objectives being met? |
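Some of these signals can be approximated without an LLM judge. The sketch below shows two crude, deterministic proxies: token recall against the context as a stand-in for Completeness, and a length ratio as a stand-in for Conciseness. The function names and heuristics are our own illustration, not a prescribed implementation.

```python
def token_recall(output: str, context: str) -> float:
    """Fraction of unique context tokens that appear in the output.
    A rough proxy for Completeness: how much of the source material
    the generated summary actually covers."""
    ctx = set(context.lower().split())
    out = set(output.lower().split())
    return len(ctx & out) / len(ctx) if ctx else 0.0

def length_ratio(output: str, context: str) -> float:
    """Output length relative to context length, a rough Conciseness
    signal: values near 0 suggest a terse summary, values near 1
    suggest the output is almost as long as its source."""
    return len(output.split()) / max(len(context.split()), 1)

context = "The quarterly report shows revenue grew 12 percent while costs fell."
summary = "Revenue grew 12 percent and costs fell."
print(f"completeness ~ {token_recall(summary, context):.2f}")
print(f"conciseness  ~ {length_ratio(summary, context):.2f}")
```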
Output Safety
Metrics that detect harmful, toxic, or privacy-violating content, ensuring outputs align with safety and brand guidelines.
| Topic | Description |
|---|---|
| PII, PCI, or PHI | Is the output leaking any unintended personal, financial, or health information? |
| Toxicity | Has the LLM generated any toxic or unsafe language? |
| Off-brand | Is the LLM mentioning competitor brands or companies? Is it using language that strays from the brand's tone? |
| Bias | Does the LLM exhibit any gender, cultural, or political bias? |
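As a starting point for PII checks, a rule-based scan is cheap and deterministic. The sketch below uses a few hypothetical regex patterns of our own; real deployments typically layer dedicated detectors (for example, NER models) on top of rules like these.

```python
import re

# Illustrative patterns for a few common PII types; coverage here is
# deliberately narrow and not production-grade.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "us_phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def detect_pii(text: str) -> dict[str, list[str]]:
    """Return every match per PII category found in the text."""
    hits = {name: pat.findall(text) for name, pat in PII_PATTERNS.items()}
    return {name: found for name, found in hits.items() if found}

output = "Contact Jane at jane.doe@example.com or 555-867-5309."
print(detect_pii(output))
# {'email': ['jane.doe@example.com'], 'us_phone': ['555-867-5309']}
```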
RAG & Data
Metrics focused on the relevance and integrity of context used in retrieval-augmented generation pipelines.
| Topic | Description |
|---|---|
| Query-Context Relevance | Are the results from the RAG system relevant to the query the system is serving? |
| Data Quality | Is the data retrieved from the data sources coherent, or does it contain conflicting information, noisy data, incomplete sentences, etc.? |
| Indexing Data Quality | Is the data being ingested and indexed cleanly, or do chunking and formatting issues degrade what can be retrieved? |
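Query-context relevance can be estimated objectively without an LLM judge. The sketch below scores each retrieved chunk against the query with TF-IDF cosine similarity, assuming scikit-learn is available; embedding-based similarity is the usual upgrade when lexical overlap is too coarse.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def query_context_relevance(query: str, contexts: list[str]) -> list[float]:
    """Score each retrieved chunk against the query using TF-IDF
    cosine similarity. Lexical overlap is a rough stand-in for
    semantic relevance, but it is fast, cheap, and reproducible."""
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform([query] + contexts)
    # Row 0 is the query; the remaining rows are the chunks.
    return cosine_similarity(matrix[0:1], matrix[1:])[0].tolist()

query = "What is the refund policy for damaged items?"
chunks = [
    "Damaged items can be refunded within 30 days of delivery.",
    "Our headquarters are located in Austin, Texas.",
]
for chunk, score in zip(chunks, query_context_relevance(query, chunks)):
    print(f"{score:.2f}  {chunk}")
```

A low score for every retrieved chunk is a useful early warning that the retriever, not the generator, is the weak link in the pipeline.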