Toxicity
The Toxicity API evaluates whether a generated response contains harmful, unsafe, or offensive language. Given a context, the generated text, and optionally a user query or reference text, the API returns a structured assessment based on common categories of toxic behavior, indicating whether the content is appropriate for a general audience.
This evaluation is useful in applications such as content moderation, user-facing chatbots, and public summarization systems, where safe language is critical.
Labels Evaluated
The API assesses toxicity using the following labels:
- identity_hate: Hateful language targeting a person or group based on identity.
- toxic: Generally unsafe or inappropriate content.
- severe_toxic: Extremely harmful or abusive language.
- obscene: Content that is offensive by moral or societal standards.
- threat: Language containing explicit or implied threats.
- insult: Disrespectful, scornful, or abusive language.
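As an illustration of how these labels are typically consumed downstream, the sketch below maps flagged labels to a moderation action. The severity tiers, the block/review split, and the function name are assumptions made for this example, not part of the API.

# Minimal sketch of a hypothetical label-based moderation policy.
# The tiers and actions are illustrative assumptions, not API behavior.

TOXICITY_LABELS = {
    "identity_hate",
    "toxic",
    "severe_toxic",
    "obscene",
    "threat",
    "insult",
}

# Labels that, in this hypothetical policy, block a response outright.
BLOCK_LABELS = {"identity_hate", "severe_toxic", "threat"}

def moderation_action(flagged_labels: set[str]) -> str:
    """Map the set of flagged labels to a hypothetical moderation action."""
    unknown = flagged_labels - TOXICITY_LABELS
    if unknown:
        raise ValueError(f"Unexpected labels: {unknown}")
    if flagged_labels & BLOCK_LABELS:
        return "block"
    if flagged_labels:
        return "review"
    return "allow"

print(moderation_action({"insult"}))           # review
print(moderation_action({"threat", "toxic"}))  # block
print(moderation_action(set()))                # allow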
Response Format
The response contains a toxicity object with the following fields:

- score (float, 0.0 – 1.0): The mean risk level calculated across all detected labels. A score of 0.0 indicates the response is entirely safe.
- instructions_list (array): A list of individual safety checks applied to the generated output. Each entry contains:
  - instruction: The toxic behavior being checked.
  - label: Indicates whether the behavior was detected (true) or not (false).
  - follow_probability: The model's confidence that the behavior occurred.
  - explanation: A short, human-readable justification of the label.
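The sketch below shows one way a client could read these fields from a parsed response. It assumes the response has already been deserialized into a Python dictionary named result that contains the toxicity object described above; the variable name and the printed summary format are assumptions for illustration.

# Minimal sketch: reading the toxicity object out of an already-parsed
# response dictionary. `result` is an assumed variable name.

def summarize_toxicity(result: dict) -> None:
    toxicity = result["toxicity"]

    # Mean risk level across all detected labels; 0.0 means entirely safe.
    print(f"overall score: {toxicity['score']:.3f}")

    # Each entry is one safety check applied to the generated output.
    for check in toxicity["instructions_list"]:
        if check["label"]:  # True when the toxic behavior was detected
            print(
                f"- {check['instruction']} "
                f"(probability {check['follow_probability']:.2f}): "
                f"{check['explanation']}"
            )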
API Request & Response Example
Request:
[
  {
    "generated_text": "Paul Graham has worked in several key areas throughout his career: IBM 1401: He began programming on the IBM 1401 during his school years, specifically in 9th grade. In addition, he has also been involved in writing essays and sharing his thoughts on technology, startups, and programming.",
    "config": {
      "toxicity": {
        "detector_name": "default",
        "explain": true
      }
    }
  }
]
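For reference, a request with this payload can be sent with any HTTP client. The sketch below uses the requests library; the endpoint URL and the authentication header are placeholders, not the real values, and should be replaced with those from your deployment.

import requests

# Placeholder endpoint and API key: substitute the actual values for your
# deployment. The payload mirrors the request example above (truncated text).
API_URL = "https://example.com/v1/detect"  # assumed, not the real endpoint
API_KEY = "YOUR_API_KEY"

payload = [
    {
        "generated_text": "Paul Graham has worked in several key areas ...",
        "config": {
            "toxicity": {
                "detector_name": "default",
                "explain": True,
            }
        },
    }
]

response = requests.post(
    API_URL,
    json=payload,
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=30,
)
response.raise_for_status()

# The body is expected to contain the toxicity object described above.
print(response.json())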