[Bug] RubricBasedEvaluator silently drops all rubrics when judge paraphrases text (Japanese kanji↔hiragana etc.) — beyond #6080 normalization scope #6171

@zettaittenani

🔴 Required Information

Describe the Bug:

When RubricBasedEvaluator (and the multi-turn / tool-use variants) is used with non-English rubrics (Japanese in our case), the rubric matching layer drops every rubric, even when the judge produces correct Yes/No verdicts. The root cause is that convert_auto_rater_response_to_score looks up rubrics by _normalize_text exact equality against the judge's Property: ... line, and Japanese LLMs consistently rewrite the property in a more natural script form (e.g. ユーザ → ユーザー, 等 → など, 含めて → ふくめて). The judge isn't skipping or reordering: it faithfully produces one verdict per rubric, but in a slightly different surface form. Lower/strip-only normalization can't bridge this.

Related: #6072 / #6080 propose stronger text normalization (NFKC, smart-quote translation, substring fallback). Those changes are orthogonal and don't fix this class because none of these Japanese variants are unicode-equivalent or each other's substrings. Filing as a separate issue (rather than commenting on #6072) so the design discussion here doesn't interrupt the in-flight review on #6080.

Also potentially related: #3111 (umbrella: "Eval fails for non-English languages") and #4122 (response_match_score / ROUGE-1 not effective in Japanese). This issue can be read as one specific instance under the broader non-English eval friction those track.

Steps to Reproduce:

  1. Define a Rubric with a text_property in Japanese, e.g. "ユーザの個別状況を確認しないまま用量を断定的に案内している箇所がない。".
  2. Build an EvalCase with a conversation_scenario, attach the rubric, and an EvalConfig whose criterion is rubric_based_multi_turn_trajectory_quality_v1 (any rubric-based metric reproduces).
  3. Run AgentEvaluator.evaluate_eval_set(...) against a small Japanese chatbot agent with gemini-2.5-flash as the judge.
  4. Observe a "... not found in the rubrics provided to the metric." warning per rubric and an "Expected 0.7, but got None." assertion failure at the end.

Expected Behavior:

Judge verdicts get attached to the original rubric (by id) so the score aggregates correctly.

Observed Behavior:

All rubric verdicts are produced by the judge but discarded by convert_auto_rater_response_to_score because _normalize_text(judge.Property) != _normalize_text(rubric.text_property). The overall score is therefore None, and the evaluation fails its assertion.
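The lookup failure can be illustrated with a minimal sketch. The `normalize_text` function below is a stand-in for a lower/strip-only normalizer and the rubric id `r1` is hypothetical; neither is copied from ADK's source.

```python
def normalize_text(text: str) -> str:
    """Stand-in for a lower/strip-only normalizer (an assumption, not ADK code)."""
    return text.strip().lower()


# Rubric text as sent to the judge, keyed by a hypothetical rubric id.
rubrics = {"r1": "ユーザの個別状況を確認しないまま用量を断定的に案内している箇所がない。"}
lookup = {normalize_text(text): rubric_id for rubric_id, text in rubrics.items()}

# The judge echoes the property with a long-vowel mark inserted (ユーザ -> ユーザー).
judge_property = "ユーザーの個別状況を確認しないまま用量を断定的に案内している箇所がない。"

# Exact-equality lookup on the normalized text finds nothing, so the verdict is dropped.
matched_id = lookup.get(normalize_text(judge_property))
print(matched_id)  # None
```

Lowercasing and stripping leave Japanese text unchanged, so a single inserted character is enough to break the match.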

Example diffs (sent vs returned by judge):

| Sent | Returned | Why it differs |
| --- | --- | --- |
| ユーザの個別状況 | ユーザーの個別状況 | inserted 長音符 (chōonpu) |
| 妊娠等 | 妊娠など | kanji → hiragana |
| 含めて | ふくめて | open-form preference |

Environment Details:

  • ADK Library Version: google-adk 2.3.0 (with [eval] extra)
  • Desktop OS: macOS 25.0.0 (Darwin)
  • Python Version: 3.11.13

Model Information:

  • Are you using LiteLLM: No
  • Which model is being used: gemini-2.5-flash (both as agent under test and as judge)

🟡 Optional Information

Regression:

N/A — same behavior on ADK 2.3.0; haven't tested earlier versions but the affected _normalize_text path has been unchanged for several releases (it's the same path discussed in #6072).

Logs:

Rubric ユーザーの個別状況を確認しないまま、用量や服薬可否を断定的に案内している箇所がない。 not found in the rubrics provided to the metric.
Rubric 妊娠中のNSAIDsリスクと産科受診を明示的に案内している。 not found in the rubrics provided to the metric.
... (one per rubric per sample) ...
AssertionError: ...
rubric_based_multi_turn_trajectory_quality_v1 for ... Failed. Expected 0.7, but got None.

Additional Context:

This is related to, but not the same as, #6072. The fix proposed in #6080 (NFKC + smart-quote translation + uniqueness-guarded substring fallback) handles formatting-level garbling (markdown bullets, smart quotes, double spaces) but does not handle semantic paraphrasing because:

  1. Japanese variants like ユーザ vs ユーザー are not Unicode-equivalent → NFKC won't collapse them.
  2. The full property lines are not substrings of each other, because the variation sits mid-sentence → uniqueness-guarded substring fallback misses too.
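Both points can be checked with the standard library alone. The sketch below uses the sent/returned fragments from this report; the helper names are illustrative and are not taken from #6080.

```python
import unicodedata

# (sent, returned) fragments from the example diffs in this report.
PAIRS = [
    ("ユーザの個別状況", "ユーザーの個別状況"),
    ("妊娠等", "妊娠など"),
    ("含めて", "ふくめて"),
]


def nfkc_equal(a: str, b: str) -> bool:
    """True if the two strings are identical after NFKC normalization."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)


def either_is_substring(a: str, b: str) -> bool:
    """True if one string is contained in the other."""
    return a in b or b in a


results = [(nfkc_equal(s, r), either_is_substring(s, r)) for s, r in PAIRS]
for (sent, returned), (same, contained) in zip(PAIRS, results):
    print(f"{sent} / {returned}: nfkc_equal={same}, substring={contained}")
```

Every pair reports `False` for both checks, so neither NFKC folding nor a substring fallback would recover the match for these fragments.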

Other languages with multiple acceptable orthographies for the same morpheme (Chinese traditional↔simplified, some Arabic variants, etc.) would likely hit the same class of failure.

Suggested direction: rubric-id round-trip in the prompt template. Embed [id: <rubric_id>] <text> per property, instruct the judge to echo ID: <rubric_id> alongside Property:, match by id. This survives both judge paraphrasing AND skipping/reordering (missing id is diagnosable rather than silently misaligned). It can coexist with #6080's normalization work as a fallback path: try id first, fall back to normalized-text matching.
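A rough sketch of what the id round-trip could look like, independent of ADK's actual prompt template and parser. The function names, the `ID:` and `Verdict:` line shapes, and the rubric ids are all assumptions made for illustration; only the `Property:` line is confirmed by the behavior described above.

```python
import re

# Assumed line shapes for the judge's reply (hypothetical, not ADK's format).
_ID_LINE = re.compile(r"^ID:\s*(\S+)$")
_VERDICT_LINE = re.compile(r"^Verdict:\s*(yes|no)$", re.IGNORECASE)


def format_rubrics_for_prompt(rubrics: dict[str, str]) -> str:
    """Render each rubric as '[id: <rubric_id>] <text>' for the judge prompt."""
    return "\n".join(f"[id: {rid}] {text}" for rid, text in rubrics.items())


def match_verdicts_by_id(response_text: str, rubrics: dict[str, str]):
    """Attach verdicts to rubrics by echoed id, ignoring the Property text."""
    verdicts: dict[str, float] = {}
    unknown: list[str] = []
    current_id = None
    for raw_line in response_text.splitlines():
        line = raw_line.strip()
        if m := _ID_LINE.match(line):
            current_id = m.group(1)
            if current_id not in rubrics:
                unknown.append(current_id)  # judge echoed an id we never sent
                current_id = None
        elif (m := _VERDICT_LINE.match(line)) and current_id is not None:
            verdicts[current_id] = 1.0 if m.group(1).lower() == "yes" else 0.0
            current_id = None
    # Rubrics with no verdict are reported instead of being silently misaligned.
    missing = [rid for rid in rubrics if rid not in verdicts]
    return verdicts, missing, unknown


rubrics = {
    "r1": "ユーザの個別状況を確認しないまま用量を断定的に案内している箇所がない。",
    "r2": "妊娠等のリスクを案内している。",
    "r3": "受診の目安を含めて案内している。",
}
# The judge reorders, paraphrases the property text, and skips r3.
judge_response = "\n".join([
    "ID: r2",
    "Property: 妊娠などのリスクを案内している。",
    "Verdict: yes",
    "ID: r1",
    "Property: ユーザーの個別状況を確認しないまま用量を断定的に案内している箇所がない。",
    "Verdict: no",
])
verdicts, missing, unknown = match_verdicts_by_id(judge_response, rubrics)
print(verdicts, missing, unknown)
```

In this sketch the paraphrased and reordered verdicts still land on `r1` and `r2`, and the skipped rubric shows up in `missing`, which is the diagnosability property described above. A normalized-text match could still run as a fallback for responses that omit the id.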

Minimal Reproduction Code:

Local workaround we used: a subclass that aligns rubric responses to get_effective_rubrics_list() by index, trusting that the judge replies in prompt order. This loses the original design's defense against skipping/reordering but is enough to unblock Japanese evals:

from google.adk.evaluation.eval_rubrics import RubricScore
from google.adk.evaluation.llm_as_judge import AutoRaterScore
from google.adk.evaluation.llm_as_judge_utils import (
    get_average_rubric_score,
    get_text_from_content,
)
from google.adk.evaluation.rubric_based_multi_turn_trajectory_evaluator import (
    RubricBasedMultiTurnTrajectoryEvaluator,
)

class IndexBasedRubricMultiTurnEvaluator(RubricBasedMultiTurnTrajectoryEvaluator):
    """Aligns judge verdicts to rubrics by position instead of by property text."""

    def convert_auto_rater_response_to_score(self, auto_rater_response):
        response_text = get_text_from_content(auto_rater_response.content)
        if not response_text:
            # Empty judge response: nothing to score.
            return AutoRaterScore(score=None, rubric_scores=[])

        rubric_responses = self._auto_rater_response_parser.parse(response_text)
        effective_rubrics = self.get_effective_rubrics_list()

        # Pair the i-th parsed verdict with the i-th rubric, assuming the judge
        # answers in prompt order. zip() stops at the shorter sequence, so a
        # skipped or extra verdict misaligns or drops entries without warning.
        rubric_scores = [
            RubricScore(
                rubric_id=rubric.rubric_id,
                rationale=resp.rationale,
                score=resp.score,
            )
            for rubric, resp in zip(effective_rubrics, rubric_responses)
        ]

        return AutoRaterScore(
            score=get_average_rubric_score(rubric_scores),
            rubric_scores=rubric_scores,
        )

Plug it in via DEFAULT_METRIC_EVALUATOR_REGISTRY.register_evaluator(...). In an end-to-end run with three Japanese personas, this produced zero "not found" warnings and the expected scores.

If maintainers agree the rubric-id round-trip is the right direction, I'd be happy to follow up with a PR once the design is settled here — and to align on whether it lands on top of #6080 or both get folded into one design pass.

How often has this issue occurred?:

  • Always (100%) — fully reproducible with any Japanese rubric text and gemini-2.5-flash as judge.

Labels

eval[Component] This issue is related to evaluation
