Tool Call Check

The Tool Call Check metric evaluates whether a model’s generated output accurately reflects the usage and results of tools such as APIs or function calls. This metric is designed for tool-using LLM agents and ensures that the final response honors what was actually executed in the tool trace.

This evaluation is critical in tool-augmented systems where the model must not fabricate tool-based information or ignore tool call results. It helps verify that tool usage aligns with the user query and that results are reflected faithfully in the output.

When to Use

Use this metric when:

  • You want to check whether the tool selected matches the user's intent.
  • Your system relies on tools (functions, APIs, plugins) to respond to user queries.
  • You need to ensure the model includes accurate tool results in its final response.
  • You want to prevent hallucinated tool usage or misreporting of tool execution status.

Labels Evaluated

The API uses a set of instruction checks to evaluate the alignment of tool usage and tool output. These include:

  • Tool relevance: Ensures the tool used is appropriate for the user query.
  • Result fidelity: Checks that key results from successful tool calls are reflected in the output.
  • Status consistency: Ensures that failed or pending tools are not treated as successful in the response.
  • No hallucination: Verifies that the model does not refer to tools that were never called.

Required Fields

To evaluate this metric, the following fields must be present:

  • user_query: The original user request that triggered the tool use.
  • generated_text: The model’s final response to the user.
  • tool_trace: An array representing the trace of tool calls made during the interaction. Each entry should include tool name, execution status, timestamps, and optionally payloads and retries.
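
For orientation, a request skeleton using only these required fields might look like the following. Values are placeholders, and the full set of fields a tool_trace entry can carry is shown in the schema below:

[
  {
    "user_query": "...",
    "generated_text": "...",
    "tool_trace": [
      {
        "order": 1,
        "tool_name": "example_tool",
        "status": "success"
      }
    ]
  }
]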

Response Format

The response contains a tool_call_check object with the following structure:

  • score (float, 0.0 to 1.0):

    A final score that reflects tool alignment. It is equal to the lowest follow_probability across all evaluated instructions; for example, if the individual follow probabilities are 0.98, 0.91, and 0.62, the score is 0.62.

  • instructions_list (array):

    A list of individual checks applied to the generated output. Each entry contains:

    • instruction: The behavioral rule being evaluated.
    • label: Whether the output followed this rule (true) or violated it (false).
    • follow_probability: Confidence that the instruction was followed.
    • explanation: Human-readable explanation for why the rule was or was not followed.

Schema for tool_trace

{
  "tool_trace": [
    {
      "order": 1,
      "tool_name": "tool.name",
      "tool_version": "v1.0.0",
      "start_time": "2025-01-09T17:23:46Z",
      "end_time": "2025-01-09T17:23:47Z",
      "status": "success",
      "error": null,
      "request_payload": {},
      "response_payload": {},
      "retries": 0
    }
  ]
}

API Request & Response Example

[
  {
    "user_query": "What is the current weather in New York?",
    "generated_text": "I'll check the current weather in New York for you. Let me use a weather tool to get the latest information.\n\nThe current weather in New York is 72°F with partly cloudy skies. The humidity is 65% and there's a light breeze from the west at 8 mph.",
    "tool_trace": [
      {
        "order": 1,
        "tool_name": "weather_api",
        "status": "success",
        "start_time": "2024-01-15T10:30:00Z",
        "end_time": "2024-01-15T10:30:02Z",
        "retries": 0,
        "request_payload": {
          "location": "New York, NY",
          "units": "imperial"
        },
        "response_payload": {
          "temperature": 72,
          "condition": "partly cloudy",
          "humidity": 65,
          "wind_speed": 8,
          "wind_direction": "west"
        }
      }
    ],
    "config": {
      "tool_call_check": {
        "detector_name": "default",
        "explain": true
      }
    }
  }
]
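
An illustrative response for the request above might look like the following. The instruction wording, probabilities, and explanations are examples of the shape documented under Response Format, not actual API output; note that the score equals the lowest follow_probability in the list:

[
  {
    "tool_call_check": {
      "score": 0.92,
      "instructions_list": [
        {
          "instruction": "The tool used must be relevant to the user's query.",
          "label": true,
          "follow_probability": 0.98,
          "explanation": "The weather_api tool matches the user's request for current weather in New York."
        },
        {
          "instruction": "Key results from successful tool calls must be reflected in the response.",
          "label": true,
          "follow_probability": 0.92,
          "explanation": "The temperature, condition, humidity, and wind details from the tool output appear in the generated text."
        }
      ]
    }
  }
]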

Code Examples

# Synchronous example

import os
import json

from aimon import Client

# Initialize client
client = Client(auth_header=f"Bearer {os.environ['AIMON_API_KEY']}")

# Construct payload
payload = [
    {
        "user_query": "What is the current weather in New York?",
        "generated_text": "I'll check the current weather in New York for you. Let me use a weather tool to get the latest information.\n\nThe current weather in New York is 72°F with partly cloudy skies. The humidity is 65% and there's a light breeze from the west at 8 mph.",
        "tool_trace": [
            {
                "order": 1,
                "tool_name": "weather_api",
                "status": "success",
                "start_time": "2024-01-15T10:30:00Z",
                "end_time": "2024-01-15T10:30:02Z",
                "retries": 0,
                "request_payload": {
                    "location": "New York, NY",
                    "units": "imperial"
                },
                "response_payload": {
                    "temperature": 72,
                    "condition": "partly cloudy",
                    "humidity": 65,
                    "wind_speed": 8,
                    "wind_direction": "west"
                }
            }
        ],
        "config": {
            "tool_call_check": {
                "detector_name": "default",
                "explain": True
            }
        }
    }
]

# Call sync detect
response = client.inference.detect(body=payload)

# Print result
print(json.dumps(response[0].tool_call_check, indent=2))
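
In practice you will usually act on the result rather than just print it. The sketch below assumes tool_call_check is returned as a dictionary with the score and instructions_list fields described under Response Format; the 0.8 threshold is an arbitrary example value, not an API default.

# Inspect the result (assumes the dictionary structure described under Response Format)
result = response[0].tool_call_check

# Flag low-scoring outputs; the 0.8 threshold is an illustrative choice, not an API default
if result["score"] < 0.8:
    print(f"Tool call check flagged this response (score: {result['score']})")

# List every instruction the output violated
for item in result["instructions_list"]:
    if not item["label"]:
        print(f"Violated: {item['instruction']}")
        print(f"Reason: {item['explanation']}")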