SCORE Metrics
The Challenge of Evaluations: The Status Quo
Organizations are adopting various LLM and RAG evaluation frameworks, including Ragas, TruLens, and DeepEval. While these provide comprehensive metrics, developers often struggle to choose the right ones and end up spending more time stabilizing evaluators than building their applications. The frameworks also differ in how they define and compute metrics.
Limitations of Evaluation Frameworks
While these frameworks offer significant benefits for evaluating and improving LLM applications, they come with some drawbacks and limitations worth considering:
- Dependency on LLMs:
  - Added latency makes LLM judges a poor fit for real-time monitoring and guardrails
  - Off-the-shelf LLMs aren't trained graders
  - Metrics graded by an LLM are inherently subjective
  - The evaluating LLM might itself hallucinate
- Computational Costs
- Inconsistency among different evaluation frameworks
- Complexity of Setup and Integration Overhead
SCORE Principles
We propose a set of principles, followed by a set of metrics, that you can use to start and transform your LLM evaluation journey.
We call these principles for RAG and LLM evaluation SCORE, which stands for Simple, Consistent, Objective, Reliable, and Efficient. Let's dive into what each component means.
- Simple: A straightforward approach to evaluating outputs
- Consistent: Provides uniform results across evaluations
- Objective: Based on measurable criteria rather than subjective opinions
- Reliable: Produces dependable and reproducible results
- Efficient: Requires minimal resources and time to implement
SCORE Metrics
LLM Output Quality Metrics
| Topic | Description |
|---|---|
| Inaccuracy or Hallucinations | Is the output factually correct? Are there any fabricated facts? |
| Output Relevance | Is the output in line with the query, or is the LLM serving irrelevant or off-topic information? |
| Instruction Adherence | Are the instructions in the prompt followed in the output? |
| Completeness | Especially for summarization use cases: how well is the LLM capturing all the different facts in the context documents? |
| Conciseness | Is the LLM being terse in its generation, or is it padding the output? |
| Custom Metrics | Are my business objectives being met? |
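Several of these checks can be run deterministically, without an LLM judge, which keeps them aligned with the SCORE principles above. Here is a minimal sketch of two such metrics: completeness as token recall against the source context, and conciseness as a length ratio. The regex tokenizer and function names are illustrative assumptions, not part of any framework.

```python
import re

def tokenize(text: str) -> set[str]:
    """Lowercase word tokens; a crude stand-in for a real tokenizer."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def completeness(context: str, output: str) -> float:
    """Fraction of context tokens that survive into the output:
    a rough proxy for 'did the summary capture the facts?'"""
    ctx, out = tokenize(context), tokenize(output)
    return len(ctx & out) / len(ctx) if ctx else 1.0

def conciseness(context: str, output: str) -> float:
    """Length ratio of output to context; lower means terser."""
    return len(output.split()) / max(len(context.split()), 1)

if __name__ == "__main__":
    context = "The Q3 report shows revenue grew 12% while costs fell 3%."
    summary = "Revenue grew 12% in Q3; costs fell 3%."
    print(f"completeness: {completeness(context, summary):.2f}")
    print(f"conciseness:  {conciseness(context, summary):.2f}")
```

Because both scores are pure functions of the text, they return the same result on every run, which is exactly the consistency and reproducibility that LLM judges struggle to provide.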
LLM Output Safety Metrics
| Topic | Description |
|---|---|
| PII, PCI, or PHI | Is the output leaking any unintended privacy, financial, or health information? |
| Toxicity | Has the LLM generated any toxic or unsafe language? |
| Off-brand | Is the LLM talking about other brands or companies that are your competitors? Is the LLM using language that does not capture the brand's tone? |
| Bias | Does the LLM exhibit any gender bias, cultural bias, or political bias? |
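A PII check is a good example of a safety metric that can stay simple, objective, and fast enough for real-time guardrails, sidestepping the latency concern raised earlier. The sketch below uses a handful of illustrative regex patterns; a production detector would need far broader coverage (names, addresses, PHI identifiers, and so on).

```python
import re

# Illustrative patterns only; real PII detection needs much broader coverage.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "us_phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def detect_pii(output: str) -> dict[str, list[str]]:
    """Return every PII-like span found in the LLM output, keyed by type."""
    hits = {name: pat.findall(output) for name, pat in PII_PATTERNS.items()}
    return {name: found for name, found in hits.items() if found}

if __name__ == "__main__":
    text = "Contact Jane at jane.doe@example.com or 555-867-5309."
    print(detect_pii(text))  # {'email': [...], 'us_phone': [...]}
```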
RAG Metrics
| Topic | Description |
|---|---|
| Query-Context Relevance | Are the results from the RAG system relevant to the query the system is serving? |
| Data Quality | Is the data retrieved from the data sources coherent, or does it have any conflicting information, noisy data, incomplete sentences, etc.? |
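Query-context relevance can likewise be approximated cheaply. The sketch below scores each retrieved chunk against the query using a bag-of-words cosine similarity, a crude, dependency-free stand-in for embedding-based similarity; low-scoring chunks flag off-topic retrieval.

```python
import math
import re
from collections import Counter

def vectorize(text: str) -> Counter:
    """Bag-of-words term frequencies; a rough stand-in for embeddings."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def query_context_relevance(query: str, chunks: list[str]) -> list[float]:
    """Score each retrieved chunk against the query; low scores
    suggest the retriever served off-topic context."""
    q = vectorize(query)
    return [cosine(q, vectorize(c)) for c in chunks]

if __name__ == "__main__":
    scores = query_context_relevance(
        "How do I reset my password?",
        ["To reset your password, open Settings.", "Our Q3 revenue grew 12%."],
    )
    print([f"{s:.2f}" for s in scores])
```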
Data Metrics
| Topic | Description |
|---|---|
| Indexing Data Quality | Is the data being indexed coherent, or does it contain any conflicting information, noisy data, incomplete sentences, etc.? |
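Indexing-time data quality can be screened with simple heuristics before documents ever reach the index. The checks below (minimum length, trailing punctuation, repeated-token noise) are illustrative assumptions, not an exhaustive quality suite.

```python
import re

def quality_issues(chunk: str, min_words: int = 5) -> list[str]:
    """Flag common indexing-time data problems in a document chunk."""
    issues = []
    text = chunk.strip()
    if len(text.split()) < min_words:
        issues.append("too short")
    if text and text[-1] not in ".!?\"'":
        issues.append("likely incomplete sentence")
    # A word repeated four or more times in a row usually signals noisy data.
    if re.search(r"(\b\w+\b)(\s+\1\b){3,}", text, re.IGNORECASE):
        issues.append("repeated-token noise")
    return issues

if __name__ == "__main__":
    for chunk in ["Refunds are processed within 5 business days.",
                  "the the the the invoice",
                  "and then we"]:
        print(chunk[:30], "->", quality_issues(chunk) or "ok")
```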