It's important to compare how closely the textual response generated by your AI system matches the response you would expect. The expected response is called the ground truth.
Use an LLM-judge metric like Similarity to focus on the semantic similarity between the generated response and the ground truth. Or use metrics from the field of natural language processing (NLP), including F1 score, BLEU, GLEU, ROUGE, and METEOR, to focus on the overlap of tokens or n-grams between the two.
Similarity
Similarity measures the degree of semantic similarity between the generated text and its ground truth with respect to a query. Compared to other text-similarity metrics that require ground truths, this metric focuses on the semantics of a response instead of simple overlap in tokens or n-grams, and it also considers the broader context of a query.
F1 score
F1 score measures similarity through the tokens shared between the generated text and the ground truth, balancing precision and recall. The number of words that the model's generation and the ground truth answer have in common is the basis of the score:
- Precision is the ratio of the number of shared words to the total number of words in the generation.
- Recall is the ratio of the number of shared words to the total number of words in the ground truth.
- The F1 score is the harmonic mean of precision and recall.
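As a rough illustration only (not the builtin.f1_score implementation, which handles tokenization and repeated tokens differently), a simplified token-level F1 can be computed like this:

```python
# Rough sketch of token-level F1; the builtin.f1_score evaluator handles
# tokenization and repeated tokens differently, so treat this as illustrative only.
def simple_tokens(text):
    return [word.strip(".,!?").lower() for word in text.split()]

response = "Paris is the largest city in France."
ground_truth = "The largest city in France is Paris."

response_tokens = simple_tokens(response)
truth_tokens = simple_tokens(ground_truth)
shared = set(response_tokens) & set(truth_tokens)

precision = len(shared) / len(response_tokens)  # shared words / words in the generation
recall = len(shared) / len(truth_tokens)        # shared words / words in the ground truth
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```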
BLEU score
BLEU score computes the Bilingual Evaluation Understudy (BLEU) score, which is commonly used in natural language processing and machine translation. It measures how closely the generated text matches the reference text. It's well suited for machine translation quality assessment, where exact n-gram matches between the system output and human reference translations are a meaningful proxy for quality.
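For intuition, here's a sketch of a sentence-level BLEU computation using NLTK. This is an assumption-laden stand-in, not the builtin.bleu_score implementation, which may tokenize and smooth differently:

```python
# Sketch using NLTK's sentence-level BLEU (assumes the nltk package is installed);
# builtin.bleu_score may differ in tokenization and smoothing.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "The largest city in France is Paris.".split()
candidate = "Paris is the largest city in France.".split()

# sentence_bleu takes a list of reference token lists and one candidate token list.
score = sentence_bleu([reference], candidate, smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")  # 0-1; higher means closer n-gram overlap with the reference
```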
GLEU score
GLEU score computes the Google-BLEU (GLEU) score. It measures similarity by the n-grams shared between the generated text and the ground truth and, like the BLEU score, considers both precision and recall. It addresses drawbacks of the BLEU score by using a per-sentence reward objective, which makes it useful for sentence-level evaluation tasks where BLEU's corpus-level aggregation can mask per-sentence quality differences.
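A comparable sketch using NLTK's sentence-level GLEU, again as an illustration rather than the builtin.gleu_score implementation:

```python
# Sketch using NLTK's sentence-level GLEU (assumes the nltk package is installed);
# builtin.gleu_score may preprocess text differently.
from nltk.translate.gleu_score import sentence_gleu

reference = "The largest city in France is Paris.".split()
candidate = "Paris is the largest city in France.".split()

score = sentence_gleu([reference], candidate)
print(f"GLEU: {score:.3f}")  # 0-1; balances n-gram precision and recall per sentence
```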
ROUGE score
ROUGE score computes Recall-Oriented Understudy for Gisting Evaluation (ROUGE) scores, a set of metrics used to evaluate automatic summarization and machine translation. It measures the overlap between the generated text and reference summaries, with a focus on recall. Unlike the other evaluators, which return a single score, builtin.rouge_score returns three separate 0-1 float scores: precision, recall, and F1.
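To see the precision/recall/F1 breakdown concretely, here's a sketch using the open-source rouge-score package. It isn't the builtin.rouge_score implementation, but it exposes the same kind of breakdown:

```python
# Sketch using the open-source rouge-score package (pip install rouge-score);
# builtin.rouge_score reports a similar precision/recall/F1 breakdown but may differ in details.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(
    "The largest city in France is Paris.",  # reference (ground truth)
    "Paris is the largest city in France.",  # candidate (generated response)
)
r1 = scores["rouge1"]
print(f"rouge1 precision={r1.precision:.2f} recall={r1.recall:.2f} f1={r1.fmeasure:.2f}")
```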
METEOR score
METEOR score measures similarity by the n-grams shared between the generated text and the ground truth. Like the BLEU score, it considers both precision and recall, but it addresses limitations of exact-match metrics such as BLEU by also accounting for synonyms, stemming, and paraphrasing when aligning content.
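A sketch of a METEOR computation using NLTK, assuming the nltk package and its WordNet data are available. It illustrates the idea only; builtin.meteor_score may use different parameters and preprocessing:

```python
# Sketch using NLTK's METEOR implementation (assumes nltk and its WordNet data are installed);
# builtin.meteor_score may use different parameters and preprocessing.
import nltk
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)  # METEOR uses WordNet for synonym matching

reference = "The largest city in France is Paris.".split()
candidate = "Paris is the largest city in France.".split()

score = meteor_score([reference], candidate)
print(f"METEOR: {score:.3f}")  # 0-1; rewards synonym and stem matches, not just exact tokens
```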
Configure and run evaluators
Use builtin.similarity when semantic meaning matters — for example, when paraphrases or synonyms should score as equivalent. Use the algorithmic evaluators (F1, BLEU, GLEU, ROUGE, METEOR) when exact token overlap is the criterion, such as machine translation quality assessment, fixed-answer retrieval, or NLP benchmarking. All six evaluators are generally available (GA).
Textual similarity evaluators compare generated responses against ground truth text using different algorithms:
- Similarity - LLM-based semantic similarity evaluation
- F1, BLEU, GLEU, ROUGE, METEOR - Algorithmic token/n-gram overlap metrics
The following table summarizes each evaluator:
| Evaluator | What it measures | Required inputs | Required parameters | Output | Default threshold |
|---|---|---|---|---|---|
| builtin.similarity | Semantic similarity to ground truth | query, response, ground_truth | deployment_name | 1-5 integer | 3 |
| builtin.f1_score | Token overlap using precision and recall | ground_truth, response | (none) | 0-1 float | 0.5 |
| builtin.bleu_score | N-gram overlap (machine translation metric) | ground_truth, response | (none) | 0-1 float | 0.5 |
| builtin.gleu_score | Per-sentence reward variant of BLEU | ground_truth, response | (none) | 0-1 float | 0.5 |
| builtin.rouge_score | Recall-oriented n-gram overlap | ground_truth, response | rouge_type | 3 floats: precision, recall, F1 | 0.5 per score |
| builtin.meteor_score | Weighted alignment with synonyms | ground_truth, response | (none) | 0-1 float | 0.5 |
Example input
Your test dataset should contain the fields referenced in your data mappings:
{"query": "What is the largest city in France?", "response": "Paris is the largest city in France.", "ground_truth": "The largest city in France is Paris."}
{"query": "Explain machine learning.", "response": "Machine learning is a subset of AI that enables systems to learn from data.", "ground_truth": "Machine learning is an AI technique where computers learn patterns from data."}
Configuration example
Data mapping syntax:
{{item.field_name}} references fields from your test dataset (for example, {{item.response}}).
testing_criteria = [
{
"type": "azure_ai_evaluator",
"name": "Similarity",
"evaluator_name": "builtin.similarity",
"initialization_parameters": {"deployment_name": model_deployment},
"data_mapping": {
"query": "{{item.query}}",
"response": "{{item.response}}",
"ground_truth": "{{item.ground_truth}}",
},
},
{
"type": "azure_ai_evaluator",
"name": "BLEUScore",
"evaluator_name": "builtin.bleu_score",
"data_mapping": {
"ground_truth": "{{item.ground_truth}}",
"response": "{{item.response}}",
},
},
{
"type": "azure_ai_evaluator",
"name": "ROUGEScore",
"evaluator_name": "builtin.rouge_score",
"initialization_parameters": {"rouge_type": "rouge1"},
"data_mapping": {
"ground_truth": "{{item.ground_truth}}",
"response": "{{item.response}}",
},
},
]
See Run evaluations from the SDK for details on running evaluations and configuring data sources.
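As a rough, non-authoritative sketch of that flow, assuming the OpenAI-compatible evals client surface, key-based authentication, and the placeholder environment variables and file name below (endpoint, API version, and exact parameters may differ in your environment):

```python
# Hedged sketch only: assumes the OpenAI-compatible evals API surface and key-based auth;
# see Run evaluations from the SDK for the authoritative setup.
import os
from openai import OpenAI

client = OpenAI(
    base_url=os.environ["AZURE_AI_PROJECT_ENDPOINT"],  # assumption: project endpoint in an env var
    api_key=os.environ["AZURE_AI_API_KEY"],            # assumption: key-based auth for brevity
)

# Upload the JSONL test dataset created earlier.
data_file = client.files.create(
    file=open("textual_similarity_test_data.jsonl", "rb"), purpose="evals"
)

# Create the eval from the testing_criteria list above, then run it against the uploaded file.
evaluation = client.evals.create(
    name="textual-similarity-eval",
    data_source_config={
        "type": "custom",
        "item_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string"},
                "response": {"type": "string"},
                "ground_truth": {"type": "string"},
            },
            "required": ["query", "response", "ground_truth"],
        },
    },
    testing_criteria=testing_criteria,
)
run = client.evals.runs.create(
    eval_id=evaluation.id,
    name="baseline-run",
    data_source={"type": "jsonl", "source": {"type": "file_id", "id": data_file.id}},
)
print(run.id, run.status)
```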
Note
The rouge_type parameter accepts: rouge1 (unigram overlap), rouge2 (bigram overlap), rouge3, rouge4, rouge5, and rougeL (longest common subsequence). Use rouge1 for general-purpose summarization evaluation.
Example output
LLM-based evaluators like builtin.similarity use a 1-5 Likert scale. Most algorithmic evaluators output a single 0-1 float score. builtin.rouge_score is an exception — it returns three separate scores. All evaluators output pass or fail based on their thresholds. Key output fields:
{
"type": "azure_ai_evaluator",
"name": "Similarity",
"metric": "similarity",
"score": 4,
"label": "pass",
"reason": "The response accurately conveys the same meaning as the ground truth.",
"threshold": 3,
"passed": true
}
Note
builtin.rouge_score returns three separate scores rather than one: rouge_precision, rouge_recall, and rouge_f1_score, each with its own pass/fail result. The threshold of 0.5 applies independently to each score.
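For illustration only, and extrapolating from the single-score example above (the score values are made up and the exact output shape might differ), the three ROUGE results could look like this:

```json
[
  {
    "type": "azure_ai_evaluator",
    "name": "ROUGEScore",
    "metric": "rouge_precision",
    "score": 0.86,
    "label": "pass",
    "threshold": 0.5,
    "passed": true
  },
  {
    "type": "azure_ai_evaluator",
    "name": "ROUGEScore",
    "metric": "rouge_recall",
    "score": 0.86,
    "label": "pass",
    "threshold": 0.5,
    "passed": true
  },
  {
    "type": "azure_ai_evaluator",
    "name": "ROUGEScore",
    "metric": "rouge_f1_score",
    "score": 0.86,
    "label": "pass",
    "threshold": 0.5,
    "passed": true
  }
]
```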