SCORE Metrics
The Challenge of Evaluations: The Status Quo
Organizations are adopting various LLM and RAG evaluation frameworks, including Ragas, TruLens, and DeepEval. While these provide comprehensive metrics, developers often struggle to choose the right ones and end up spending more time stabilizing evaluators than building their applications. The frameworks also differ in how they define and compute metrics.
Limitations of Evaluation Frameworks
While these frameworks offer significant benefits for evaluating and improving LLM applications, they come with some drawbacks and limitations worth considering:
- Dependency on LLMs:
  - Added latency makes LLM judges a poor fit for real-time monitoring and guardrails
  - Off-the-shelf LLMs aren't trained graders
  - Metrics graded by an LLM are inherently subjective
  - The evaluating LLM might itself hallucinate
- Computational Costs
- Inconsistency among different evaluation frameworks
- Complexity of Setup and Integration Overhead
SCORE Principles
We propose a set of principles, followed by a set of metrics, that you can use to start and transform your LLM evaluation journey.
We call these principles for RAG and LLM evaluation SCORE, which stands for Simple, Consistent, Objective, Reliable, and Efficient. Let's dive into what each component means.
- Simple: A straightforward approach to evaluating outputs
- Consistent: Provides uniform results across evaluations
- Objective: Based on measurable criteria rather than subjective opinions
- Reliable: Produces dependable and reproducible results
- Efficient: Requires minimal resources and time to implement
SCORE Metrics
LLM Output Quality Metrics
| Topic | Description |
|---|---|
| Inaccuracy or Hallucinations | Is the output factually correct? Are there any fabricated facts? |
| Output Relevance | Is the output in line with the query, or is the LLM serving irrelevant or off-topic information? |
| Instruction Adherence | Are the instructions in the prompt followed in the output? |
| Completeness | Especially for summarization use cases: how well is the LLM capturing all the different facts in the context documents? |
| Conciseness | Is the LLM being terse in its generation, or is it padding the output? |
| Custom Metrics | Are my business objectives being met? |
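Several of these checks can be run deterministically, without an LLM judge, which keeps them aligned with the SCORE principles above. Here is a minimal sketch of two such metrics: completeness as token recall against the source context, and conciseness as a length ratio. The regex tokenizer and function names are illustrative assumptions, not part of any framework.

```python
import re

def tokenize(text: str) -> set[str]:
    """Lowercase word tokens; a crude stand-in for a real tokenizer."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def completeness(context: str, output: str) -> float:
    """Fraction of context tokens that survive into the output:
    a rough proxy for 'did the summary capture the facts?'"""
    ctx, out = tokenize(context), tokenize(output)
    return len(ctx & out) / len(ctx) if ctx else 1.0

def conciseness(context: str, output: str) -> float:
    """Length ratio of output to context; lower means terser."""
    return len(output.split()) / max(len(context.split()), 1)

if __name__ == "__main__":
    context = "The Q3 report shows revenue grew 12% while costs fell 3%."
    summary = "Revenue grew 12% in Q3; costs fell 3%."
    print(f"completeness: {completeness(context, summary):.2f}")
    print(f"conciseness:  {conciseness(context, summary):.2f}")
```

Because both scores are pure functions of the text, they return the same result on every run, which is exactly the consistency and reproducibility that LLM judges struggle to provide.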
LLM Output Safety Metrics
| Topic | Description |
|---|---|
| PII, PCI, or PHI | Is the output leaking any unintended privacy, financial, or health information? |
| Toxicity | Has the LLM generated any toxic or unsafe language? |
| Off-brand | Is the LLM talking about other brands or companies that are your competitors? Is the LLM using language that does not capture the brand's tone? |
| Bias | Does the LLM exhibit any gender bias, cultural bias, or political bias? |
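A PII check is a good example of a safety metric that can stay simple, objective, and fast enough for real-time guardrails, sidestepping the latency concern raised earlier. The sketch below uses a handful of illustrative regex patterns; a production detector would need far broader coverage (names, addresses, PHI identifiers, and so on).

```python
import re

# Illustrative patterns only; real PII detection needs much broader coverage.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "us_phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def detect_pii(output: str) -> dict[str, list[str]]:
    """Return every PII-like span found in the LLM output, keyed by type."""
    hits = {name: pat.findall(output) for name, pat in PII_PATTERNS.items()}
    return {name: found for name, found in hits.items() if found}

if __name__ == "__main__":
    text = "Contact Jane at jane.doe@example.com or 555-867-5309."
    print(detect_pii(text))  # {'email': [...], 'us_phone': [...]}
```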
RAG Metrics
| Topic | Description |
|---|---|
| Query-Context Relevance | Are the results from the RAG system relevant to the query the system is serving? |
| Data Quality | Is the data retrieved from the data sources coherent, or does it have any conflicting information, noisy data, incomplete sentences, etc.? |
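Query-context relevance can likewise be approximated cheaply. The sketch below scores each retrieved chunk against the query using a bag-of-words cosine similarity, a crude, dependency-free stand-in for embedding-based similarity; low-scoring chunks flag off-topic retrieval.

```python
import math
import re
from collections import Counter

def vectorize(text: str) -> Counter:
    """Bag-of-words term frequencies; a rough stand-in for embeddings."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def query_context_relevance(query: str, chunks: list[str]) -> list[float]:
    """Score each retrieved chunk against the query; low scores
    suggest the retriever served off-topic context."""
    q = vectorize(query)
    return [cosine(q, vectorize(c)) for c in chunks]

if __name__ == "__main__":
    scores = query_context_relevance(
        "How do I reset my password?",
        ["To reset your password, open Settings.", "Our Q3 revenue grew 12%."],
    )
    print([f"{s:.2f}" for s in scores])
```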
Data Metrics
| Topic | Description |
|---|---|
| Indexing Data Quality | Is the data being indexed coherent, or does it contain any conflicting information, noisy data, incomplete sentences, etc.? |
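Indexing-time data quality can be screened with simple heuristics before documents ever reach the index. The checks below (minimum length, trailing punctuation, repeated-token noise) are illustrative assumptions, not an exhaustive quality suite.

```python
import re

def quality_issues(chunk: str, min_words: int = 5) -> list[str]:
    """Flag common indexing-time data problems in a document chunk."""
    issues = []
    text = chunk.strip()
    if len(text.split()) < min_words:
        issues.append("too short")
    if text and text[-1] not in ".!?\"'":
        issues.append("likely incomplete sentence")
    # A word repeated four or more times in a row usually signals noisy data.
    if re.search(r"(\b\w+\b)(\s+\1\b){3,}", text, re.IGNORECASE):
        issues.append("repeated-token noise")
    return issues

if __name__ == "__main__":
    for chunk in ["Refunds are processed within 5 business days.",
                  "the the the the invoice",
                  "and then we"]:
        print(chunk[:30], "->", quality_issues(chunk) or "ok")
```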