🔴 Required Information
Describe the Bug:
When RubricBasedEvaluator (and the multi-turn / tool-use variants) is used with non-English rubrics and judge output (Japanese in this report), the rubric matching layer drops every rubric, even when the judge produces correct Yes/No verdicts. The root cause is that convert_auto_rater_response_to_score looks up rubrics by exact equality of _normalize_text output against the judge's Property: ... line, and with Japanese rubrics the judge LLM consistently rewrites the property in a more natural surface form (e.g. ユーザ → ユーザー, 等 → など, 含めて → ふくめて). The judge isn't skipping or reordering: it faithfully produces one verdict per rubric, but in a slightly different surface form. Lower/strip-only normalization can't bridge this.
Related: #6072 / #6080 propose stronger text normalization (NFKC, smart-quote translation, substring fallback). Those changes are orthogonal and don't fix this class because none of these Japanese variants are unicode-equivalent or each other's substrings. Filing as a separate issue (rather than commenting on #6072) so the design discussion here doesn't interrupt the in-flight review on #6080.
Also potentially related: #3111 (umbrella: "Eval fails for non-English languages") and #4122 (response_match_score / ROUGE-1 not effective in Japanese). This issue can be read as one specific instance under the broader non-English eval friction those track.
Steps to Reproduce:
- Define a Rubric with a text_property in Japanese, e.g. "ユーザの個別状況を確認しないまま用量を断定的に案内している箇所がない。".
- Build an EvalCase with a conversation_scenario, attach the rubric, and add an EvalConfig whose criterion is rubric_based_multi_turn_trajectory_quality_v1 (any rubric-based metric reproduces).
- Run AgentEvaluator.evaluate_eval_set(...) against a small Japanese chatbot agent with gemini-2.5-flash as the judge.
- Observe a "... not found in the rubrics provided to the metric." warning per rubric and an "Expected 0.7, but got None." assertion at the end.
Expected Behavior:
Judge verdicts get attached to the original rubric (by id) so the score aggregates correctly.
Observed Behavior:
All rubric verdicts are produced by the judge but discarded by convert_auto_rater_response_to_score because _normalize_text(judge.Property) != _normalize_text(rubric.text_property). Overall score is None, evaluation asserts as a failure.
Example diffs (sent vs returned by judge):
| Sent | Returned | Why it differs |
| --- | --- | --- |
| ユーザの個別状況 | ユーザーの個別状況 | inserted long-vowel mark ー (長音符, chōonpu) |
| 妊娠等 | 妊娠など | kanji → hiragana |
| 含めて | ふくめて | open-form preference (kanji written out in hiragana) |
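The drop can be reproduced without ADK. The sketch below is a stand-in, not the library's code: `normalize` assumes the lower/strip behavior described above for `_normalize_text`, and `match_verdicts` is a hypothetical helper that mimics an exact-equality lookup keyed on normalized rubric text. It is fed the three sent/returned pairs listed above.

```python
def normalize(text: str) -> str:
    """Stand-in for lower/strip-only normalization (assumption, not ADK code)."""
    return text.lower().strip()


def match_verdicts(rubrics: dict[str, str], judge_lines: list[tuple[str, str]]):
    """Hypothetical exact-match lookup; rubrics maps rubric_id -> rubric text.

    Returns (matched, dropped): matched maps rubric_id -> verdict, dropped
    lists the judge properties that matched no rubric text.
    """
    by_text = {normalize(text): rid for rid, text in rubrics.items()}
    matched, dropped = {}, []
    for prop, verdict in judge_lines:
        rid = by_text.get(normalize(prop))
        if rid is None:
            dropped.append(prop)  # verdict exists but is discarded
        else:
            matched[rid] = verdict
    return matched, dropped


# Rubric ids r1..r3 are made up for illustration.
rubrics = {"r1": "ユーザの個別状況", "r2": "妊娠等", "r3": "含めて"}
judge_lines = [("ユーザーの個別状況", "yes"), ("妊娠など", "yes"), ("ふくめて", "no")]

matched, dropped = match_verdicts(rubrics, judge_lines)
print("matched:", matched)
print("dropped:", dropped)
```

With these inputs nothing is matched and all three verdicts land in `dropped`, which is the same shape as the "not found in the rubrics" warnings followed by a `None` score.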
Environment Details:
- ADK Library Version: google-adk 2.3.0 (with [eval] extra)
- Desktop OS: macOS 25.0.0 (Darwin)
- Python Version: 3.11.13
Model Information:
- Are you using LiteLLM: No
- Which model is being used: gemini-2.5-flash (both as agent under test and as judge)
🟡 Optional Information
Regression:
N/A — same behavior on ADK 2.3.0; haven't tested earlier versions but the affected _normalize_text path has been unchanged for several releases (it's the same path discussed in #6072).
Logs:
Rubric ユーザーの個別状況を確認しないまま、用量や服薬可否を断定的に案内している箇所がない。 not found in the rubrics provided to the metric.
Rubric 妊娠中のNSAIDsリスクと産科受診を明示的に案内している。 not found in the rubrics provided to the metric.
... (one per rubric per sample) ...
AssertionError: ...
rubric_based_multi_turn_trajectory_quality_v1 for ... Failed. Expected 0.7, but got None.
Additional Context:
This is related to, but distinct from, #6072. The fix proposed in #6080 (NFKC + smart-quote translation + uniqueness-guarded substring fallback) handles formatting-level garbling (markdown bullets, smart quotes, double spaces) but does not handle semantic paraphrasing because:
- Japanese variants like ユーザ vs ユーザー are not Unicode-equivalent → NFKC won't collapse them.
- The full sent and returned rubric texts are not substrings of each other → the uniqueness-guarded substring fallback misses too.
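Both claims can be checked with the standard library alone. This is a minimal sketch using the example pairs from this report; `nfkc_equal` and `substring_related` are illustrative helper names, not functions from ADK or from #6080.

```python
import unicodedata

# (sent, returned) pairs taken from the example diffs in this report.
PAIRS = [
    ("ユーザの個別状況", "ユーザーの個別状況"),
    ("妊娠等", "妊娠など"),
    ("含めて", "ふくめて"),
]


def nfkc_equal(a: str, b: str) -> bool:
    """True if the two strings are identical after NFKC normalization."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)


def substring_related(a: str, b: str) -> bool:
    """True if either string contains the other."""
    return a in b or b in a


for sent, returned in PAIRS:
    print(sent, returned, nfkc_equal(sent, returned), substring_related(sent, returned))
```

Every pair reports `False` for both checks, so neither NFKC nor a substring fallback can reconnect the judge's text to the rubric it was given.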
Other languages with multiple acceptable orthographies for the same morpheme (Chinese traditional↔simplified, some Arabic variants, etc.) would likely hit the same class of failure.
Suggested direction: rubric-id round-trip in the prompt template. Embed [id: <rubric_id>] <text> per property, instruct the judge to echo ID: <rubric_id> alongside Property:, match by id. This survives both judge paraphrasing AND skipping/reordering (missing id is diagnosable rather than silently misaligned). It can coexist with #6080's normalization work as a fallback path: try id first, fall back to normalized-text matching.
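A rough sketch of the round-trip, to make the proposal concrete. Everything here is hypothetical: the `[id: ...]` prompt line, the `ID:` / `Property:` / `Verdict:` response layout, and the helper names are assumptions for illustration, not ADK's current prompt template or parser.

```python
import re


def build_rubric_block(rubrics: dict[str, str]) -> str:
    """Render one '[id: <rubric_id>] <text>' line per rubric for the judge prompt."""
    return "\n".join(f"[id: {rid}] {text}" for rid, text in rubrics.items())


# Assumed judge output: an 'ID:' line, then 'Property:', then 'Verdict: yes|no'.
_ENTRY = re.compile(
    r"ID:\s*(?P<id>\S+)\s*\n"
    r"Property:\s*(?P<prop>.*)\n"
    r"Verdict:\s*(?P<verdict>yes|no)",
    re.IGNORECASE,
)


def match_by_id(rubrics: dict[str, str], response: str):
    """Match verdicts to rubrics by echoed id, ignoring the Property text.

    Returns (verdicts, missing_ids, unknown_ids).
    """
    verdicts: dict[str, bool] = {}
    unknown: list[str] = []
    for m in _ENTRY.finditer(response):
        rid = m.group("id")
        if rid in rubrics:
            verdicts[rid] = m.group("verdict").lower() == "yes"
        else:
            unknown.append(rid)  # judge echoed an id that was never sent
    missing = [rid for rid in rubrics if rid not in verdicts]
    return verdicts, missing, unknown


rubrics = {"r1": "ユーザの個別状況", "r2": "妊娠等", "r3": "含めて"}
response = (
    "ID: r1\nProperty: ユーザーの個別状況\nVerdict: yes\n\n"
    "ID: r3\nProperty: ふくめて\nVerdict: no\n"
)
verdicts, missing, unknown = match_by_id(rubrics, response)
print(build_rubric_block(rubrics))
print(verdicts, missing, unknown)
```

In this example the paraphrased properties still resolve to `r1` and `r3`, and the skipped rubric shows up explicitly as a missing id (`r2`) rather than being silently misaligned. A normalized-text fallback for entries without a usable id could be layered on top of this.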
Minimal Reproduction Code:
Local workaround we used (subclass that aligns rubric responses to get_effective_rubrics_list() by index, trusting the judge replies in prompt order — this loses the original design's defense against skipping/reordering but is enough to unblock Japanese evals):
    from google.adk.evaluation.eval_rubrics import RubricScore
    from google.adk.evaluation.llm_as_judge import AutoRaterScore
    from google.adk.evaluation.llm_as_judge_utils import (
        get_average_rubric_score,
        get_text_from_content,
    )
    from google.adk.evaluation.rubric_based_multi_turn_trajectory_evaluator import (
        RubricBasedMultiTurnTrajectoryEvaluator,
    )


    class IndexBasedRubricMultiTurnEvaluator(RubricBasedMultiTurnTrajectoryEvaluator):
        """Aligns judge verdicts to rubrics by position instead of by text."""

        def convert_auto_rater_response_to_score(self, auto_rater_response):
            response_text = get_text_from_content(auto_rater_response.content)
            if not response_text:
                return AutoRaterScore(score=None, rubric_scores=[])

            rubric_responses = self._auto_rater_response_parser.parse(response_text)
            effective_rubrics = self.get_effective_rubrics_list()

            # Pair the i-th verdict with the i-th rubric. zip() stops at the
            # shorter list, so a skipped verdict shifts or truncates silently.
            rubric_scores = [
                RubricScore(
                    rubric_id=rubric.rubric_id,
                    rationale=resp.rationale,
                    score=resp.score,
                )
                for rubric, resp in zip(effective_rubrics, rubric_responses)
            ]
            return AutoRaterScore(
                score=get_average_rubric_score(rubric_scores),
                rubric_scores=rubric_scores,
            )
Plug in via DEFAULT_METRIC_EVALUATOR_REGISTRY.register_evaluator(...). End-to-end with three Japanese personas: zero "not found" warnings, scores produced as expected.
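One cheap hardening for anyone borrowing the workaround: since the index-based pairing trusts the judge to return exactly one entry per rubric, a count check makes a skipped verdict fail loudly instead of shifting scores onto the wrong rubric. `align_by_index` is a hypothetical helper, not part of ADK, and it only detects a count mismatch; it cannot detect reordering.

```python
def align_by_index(rubric_ids: list[str], responses: list) -> list[tuple]:
    """Pair rubric ids with parsed responses by position.

    Raises ValueError when the counts differ, because positional pairing
    is only meaningful if the judge answered every rubric exactly once.
    """
    if len(rubric_ids) != len(responses):
        raise ValueError(
            f"expected {len(rubric_ids)} rubric responses, got {len(responses)}"
        )
    return list(zip(rubric_ids, responses))


# Illustrative ids and verdicts only.
pairs = align_by_index(["r1", "r2", "r3"], ["yes", "yes", "no"])
print(pairs)
```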
If maintainers agree the rubric-id round-trip is the right direction, I'd be happy to follow up with a PR once the design is settled here — and to align on whether it lands on top of #6080 or both get folded into one design pass.
How often has this issue occurred?:
- Always (100%) — fully reproducible with any Japanese rubric text and gemini-2.5-flash as judge.