SCORE Metrics

The Challenge of Evaluations: The Status Quo

Organizations are adopting various LLM and RAG evaluation frameworks, including RAGAs, TruLens, and DeepEval. While these frameworks provide comprehensive metrics, developers often struggle to choose the appropriate ones and spend more time stabilizing their evaluators than developing their applications. The frameworks also differ in how they define and compute metrics.

Limitations of Evaluation Frameworks

While these evaluation frameworks offer significant benefits for evaluating and enhancing LLM applications, there are some potential drawbacks or limitations to consider:

  • Dependency on LLMs:
    • Added latency makes LLM judges a poor fit for real-time monitoring and guardrails
    • Off-the-shelf LLMs are not trained graders
    • Judge scores are subjective rather than grounded in measurable criteria
    • The evaluating LLM may itself hallucinate
    • Every evaluation call adds computational cost
  • Inconsistency among different evaluation frameworks
  • Complexity of Setup and Integration Overhead

SCORE Principles

We propose a set of principles, followed by a set of metrics, that you can use to start and then transform your LLM evaluation journey.

We call these principles for RAG and LLM evaluation SCORE, which stands for Simple, Consistent, Objective, Reliable, and Efficient. Let us dive into what each component means.

  • Simple: A straightforward approach to evaluating outputs
  • Consistent: Provides uniform results across evaluations
  • Objective: Based on measurable criteria rather than subjective opinions
  • Reliable: Produces dependable and reproducible results
  • Efficient: Requires minimal resources and time to implement
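
To make these principles concrete, here is a minimal sketch of what a SCORE-style evaluator can look like: deterministic string checks in place of an LLM judge. The function names and thresholds below are illustrative assumptions, not part of any SCORE specification.

```python
# A minimal sketch of a SCORE-style evaluator: deterministic string checks
# instead of an LLM judge. Function names and thresholds are illustrative
# assumptions, not part of any SCORE specification.

def word_count_adherence(output: str, max_words: int) -> bool:
    """Objective check: does the output respect a word-count instruction?"""
    return len(output.split()) <= max_words

def contains_required_terms(output: str, terms: list[str]) -> bool:
    """Objective check: are all required terms present in the output?"""
    lowered = output.lower()
    return all(term.lower() in lowered for term in terms)

if __name__ == "__main__":
    answer = "SCORE favors simple, deterministic checks over LLM judges."
    print(word_count_adherence(answer, max_words=20))          # True
    print(contains_required_terms(answer, ["simple", "LLM"]))  # True
```

Because the checks are deterministic, re-running them on the same output always yields the same score, which is exactly the Consistent and Reliable behavior the acronym calls for.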

SCORE Metrics

LLM Output Quality Metrics

  • Inaccuracy or Hallucinations: Is the output factually correct? Are there any fabricated facts?
  • Output Relevance: Is the output in line with the query, or is the LLM serving irrelevant or off-topic information?
  • Instruction Adherence: Are the instructions in the prompt followed in the output?
  • Completeness: Especially for summarization use cases, this answers: "How well is the LLM capturing all the different facts in the context documents?" (A scoring sketch follows this list.)
  • Conciseness: Is the LLM keeping its generation brief, or is it unnecessarily verbose?
  • Custom Metrics: Are my business objectives being met?
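
As a rough illustration of how some of these quality metrics can be approximated without an LLM judge, the sketch below scores completeness and conciseness for a summarization use case using simple token overlap. The heuristics and function names are assumptions made for illustration; production systems would typically combine stronger signals.

```python
# A rough sketch of completeness and conciseness scoring for summarization,
# using token overlap as an objective proxy. The heuristics are assumptions
# for illustration, not a definitive implementation.
import re

def tokens(text: str) -> set[str]:
    """Lowercased word tokens, ignoring punctuation."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def completeness(summary: str, context: str) -> float:
    """Fraction of the context vocabulary that the summary covers (0..1)."""
    context_tokens = tokens(context)
    if not context_tokens:
        return 0.0
    return len(tokens(summary) & context_tokens) / len(context_tokens)

def conciseness(summary: str, context: str) -> float:
    """Compression ratio: summaries short relative to the context score higher."""
    context_len = max(len(context.split()), 1)
    return max(0.0, 1.0 - len(summary.split()) / context_len)

if __name__ == "__main__":
    ctx = "The SCORE principles are Simple, Consistent, Objective, Reliable, Efficient."
    summ = "SCORE stands for Simple, Consistent, Objective, Reliable, Efficient."
    print(f"completeness: {completeness(summ, ctx):.2f}")
    print(f"conciseness:  {conciseness(summ, ctx):.2f}")
```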

LLM Output Safety Metrics

  • PII, PCI, or PHI: Is the output leaking any unintended personal, payment, or health information? (A regex-based screening sketch follows this list.)
  • Toxicity: Has the LLM generated any toxic or unsafe language?
  • Off-brand: Is the LLM talking about other brands or companies that are your competitors? Is the LLM using language that does not capture the brand's tone?
  • Bias: Does the LLM exhibit any gender, cultural, or political bias?
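
PII screening is a good example of a safety check that can stay simple, objective, and efficient. The sketch below uses regular expressions instead of an LLM judge; the patterns shown (email, US phone number, SSN) are illustrative assumptions and far from exhaustive.

```python
# A minimal, deterministic PII screen in the SCORE spirit: regex patterns
# instead of an LLM judge. The patterns below are illustrative assumptions
# and not an exhaustive PII detector.
import re

PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "us_phone": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def find_pii(output: str) -> dict[str, list[str]]:
    """Return any PII-like matches found in the model output."""
    return {
        name: pattern.findall(output)
        for name, pattern in PII_PATTERNS.items()
        if pattern.search(output)
    }

if __name__ == "__main__":
    print(find_pii("Contact jane.doe@example.com or 555-867-5309."))
```

Because the check involves no model call, it is fast enough to run inline as a guardrail on every response.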

RAG Metrics

  • Query-Context Relevance: Are the results from the RAG system relevant to the query the system is serving? (A similarity sketch follows this list.)
  • Data Quality: Is the data retrieved from the data sources coherent, or does it have any conflicting information, noisy data, incomplete sentences, etc.?
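
One way to approximate query-context relevance deterministically is bag-of-words cosine similarity between the query and each retrieved chunk. Real deployments often use embedding models instead; the dependency-free version below is an assumption chosen to keep the example self-contained.

```python
# A sketch of query-context relevance scoring with bag-of-words cosine
# similarity. Embedding-based similarity is the more common choice in
# practice; this pure-Python version is for illustration only.
import math
import re
from collections import Counter

def bow(text: str) -> Counter:
    """Bag-of-words term frequencies."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))

def cosine_relevance(query: str, context: str) -> float:
    """Cosine similarity between query and retrieved context (0..1)."""
    q, c = bow(query), bow(context)
    dot = sum(q[t] * c[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(
        sum(v * v for v in c.values())
    )
    return dot / norm if norm else 0.0

if __name__ == "__main__":
    query = "What does SCORE stand for?"
    chunk = "SCORE stands for Simple, Consistent, Objective, Reliable, Efficient."
    print(f"relevance: {cosine_relevance(query, chunk):.2f}")
```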

Data Metrics

  • Indexing Data Quality: Is the data being indexed coherent, or does it contain any conflicting information, noisy data, incomplete sentences, etc.?
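
A data-quality screen can run before documents are indexed. The heuristics below (truncated sentences, low alphabetic ratio, repeated-character runs) are illustrative assumptions, not a complete definition of indexing data quality.

```python
# A rough data-quality screen for documents before indexing. The heuristics
# here are illustrative assumptions, not a complete quality definition.
import re

def quality_flags(doc: str) -> list[str]:
    """Return heuristic flags for noisy or incomplete indexing data."""
    flags = []
    stripped = doc.strip()
    if stripped and stripped[-1] not in ".!?\"'":
        flags.append("possibly truncated: no closing sentence punctuation")
    alpha_ratio = sum(ch.isalpha() for ch in stripped) / max(len(stripped), 1)
    if alpha_ratio < 0.6:
        flags.append("noisy: low ratio of alphabetic characters")
    if re.search(r"(.)\1{5,}", stripped):
        flags.append("noisy: long runs of repeated characters")
    return flags

if __name__ == "__main__":
    print(quality_flags("Refund policy: items can be returned within 30"))
    print(quality_flags("#### $$$$ ---- 1234 ////"))
```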