[Bug] RubricBasedEvaluator silently drops all rubrics when judge paraphrases text (Japanese kanji↔hiragana etc.) — beyond #6080 normalization scope

## 🔴 Required Information

**Describe the Bug:**

When `RubricBasedEvaluator` (and the multi-turn / tool-use variants) is used with non-English rubrics (Japanese in our case), the rubric matching layer drops every rubric, even when the judge produces correct Yes/No verdicts. The root cause is that `convert_auto_rater_response_to_score` looks up rubrics by `_normalize_text` exact equality against the judge's `Property: ...` line, and on Japanese text the judge LLM consistently rewrites the property in a more natural-looking orthography (e.g. `ユーザ` → `ユーザー`, `等` → `など`, `含めて` → `ふくめて`). The judge isn't skipping or reordering: it faithfully produces one verdict per rubric, but in a slightly different surface form. Lowercase/strip-only normalization can't bridge this.

Related: #6072 / #6080 propose stronger text normalization (NFKC, smart-quote translation, substring fallback). Those changes are orthogonal and don't fix this class of failure, because the sent and returned property strings are neither Unicode-equivalent nor substrings of each other. Filing as a separate issue (rather than commenting on #6072) so the design discussion here doesn't interrupt the in-flight review on #6080.

Also potentially related: #3111 (umbrella: "Eval fails for non-English languages") and #4122 (`response_match_score` / ROUGE-1 not effective in Japanese). This issue can be read as one specific instance under the broader non-English eval friction those track.

**Steps to Reproduce:**

1. Define a `Rubric` with a `text_property` in Japanese, e.g. `"ユーザの個別状況を確認しないまま用量を断定的に案内している箇所がない。"`.
2. Build an `EvalCase` with a `conversation_scenario`, attach the rubric, and create an `EvalConfig` whose criterion is `rubric_based_multi_turn_trajectory_quality_v1` (any rubric-based metric reproduces).
3. Run `AgentEvaluator.evaluate_eval_set(...)` against a small Japanese chatbot agent with `gemini-2.5-flash` as the judge.
4. Observe one `... not found in the rubrics provided to the metric.` warning per rubric and an `Expected 0.7, but got None.` assertion failure at the end.

**Expected Behavior:**

Judge verdicts get attached to the original rubric (by id) so the score aggregates correctly.

**Observed Behavior:**

All rubric verdicts are produced by the judge but discarded by `convert_auto_rater_response_to_score` because `_normalize_text(judge.Property) != _normalize_text(rubric.text_property)`. The overall score is therefore `None`, and the evaluation fails its assertion.

Example diffs (sent vs returned by judge):

| Sent | Returned | Why it differs |
|---|---|---|
| `ユーザの個別状況` | `ユーザーの個別状況` | inserted 長音符 (chōonpu) |
| `妊娠等` | `妊娠など` | kanji → hiragana |
| `含めて` | `ふくめて` | open-form preference |

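The drop can be reproduced in isolation. The sketch below is not ADK's code: `normalize_text` and `lookup_verdicts` are hypothetical stand-ins that assume the lookup is a lowercase/strip exact match as described above, fed with the three fragment pairs from the table.

```python
# Hypothetical stand-ins, not ADK's implementation: they mimic a
# lowercase/strip exact-match lookup keyed on rubric text.


def normalize_text(text: str) -> str:
    return text.strip().lower()


def lookup_verdicts(rubrics, judged):
    """Attach judged (property, score) pairs to rubric ids by exact text match."""
    by_text = {normalize_text(text): rid for rid, text in rubrics.items()}
    kept, dropped = {}, []
    for prop, score in judged:
        rid = by_text.get(normalize_text(prop))
        if rid is None:
            dropped.append(prop)  # corresponds to the "not found" warning
        else:
            kept[rid] = score
    # No matched rubric means there is nothing to average.
    overall = sum(kept.values()) / len(kept) if kept else None
    return kept, dropped, overall


rubrics = {"r1": "ユーザの個別状況", "r2": "妊娠等", "r3": "含めて"}
judged = [("ユーザーの個別状況", 1.0), ("妊娠など", 1.0), ("ふくめて", 0.0)]

kept, dropped, overall = lookup_verdicts(rubrics, judged)
print(kept, len(dropped), overall)
```

Every verdict lands in `dropped`, `kept` stays empty, and the overall score is `None`, which matches the symptom reported here.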
**Environment Details:**

- ADK Library Version: `google-adk 2.3.0` (with `[eval]` extra)
- Desktop OS: macOS (Darwin 25.0.0)
- Python Version: 3.11.13

**Model Information:**

- Are you using LiteLLM: No
- Which model is being used: `gemini-2.5-flash` (both as agent under test and as judge)

---

## 🟡 Optional Information

**Regression:**

N/A — same behavior on ADK 2.3.0; haven't tested earlier versions but the affected `_normalize_text` path has been unchanged for several releases (it's the same path discussed in #6072).

**Logs:**

```text
Rubric ユーザーの個別状況を確認しないまま、用量や服薬可否を断定的に案内している箇所がない。 not found in the rubrics provided to the metric.
Rubric 妊娠中のNSAIDsリスクと産科受診を明示的に案内している。 not found in the rubrics provided to the metric.
... (one per rubric per sample) ...
AssertionError: ...
rubric_based_multi_turn_trajectory_quality_v1 for ... Failed. Expected 0.7, but got None.
```

**Additional Context:**

This is related to, but distinct from, #6072. The fix proposed in #6080 (NFKC + smart-quote translation + uniqueness-guarded substring fallback) handles formatting-level garbling (markdown bullets, smart quotes, double-spaces) but does **not** handle meaning-preserving paraphrasing by the judge because:

1. Japanese variants like `ユーザ` vs `ユーザー` are not Unicode-equivalent, so NFKC won't collapse them.
2. The full property strings are not substrings of each other, because the change lands mid-sentence (`ユーザの…` vs `ユーザーの…`), so the uniqueness-guarded substring fallback misses too.
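Both points can be checked with the standard library. This is a minimal sketch using only the three fragment pairs from the table above; `nfkc` is a hypothetical helper approximating the stronger normalization, not the code from #6080.

```python
import unicodedata


def nfkc(text: str) -> str:
    # Assumed approximation: NFKC plus the existing lowercase/strip step.
    return unicodedata.normalize("NFKC", text).strip().lower()


PAIRS = [
    ("ユーザの個別状況", "ユーザーの個別状況"),
    ("妊娠等", "妊娠など"),
    ("含めて", "ふくめて"),
]

results = []
for sent, returned in PAIRS:
    a, b = nfkc(sent), nfkc(returned)
    equal_after_nfkc = a == b
    substring_either_way = a in b or b in a
    results.append((equal_after_nfkc, substring_either_way))
    print(sent, returned, equal_after_nfkc, substring_either_way)
```

All three pairs come out unequal after NFKC and fail the substring test in both directions. Note that the bare word `ユーザ` is a prefix of `ユーザー`; the substring test only fails once the following text is included, as it always is in a full rubric sentence.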

Other languages with multiple acceptable orthographies for the same morpheme (Chinese traditional↔simplified, some Arabic variants, etc.) would likely hit the same class of failure.

**Suggested direction:** rubric-id round-trip in the prompt template. Embed `[id: <rubric_id>] <text>` per property, instruct the judge to echo `ID: <rubric_id>` alongside `Property:`, match by id. This survives both judge paraphrasing AND skipping/reordering (missing id is diagnosable rather than silently misaligned). It can coexist with #6080's normalization work as a fallback path: try id first, fall back to normalized-text matching.
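A rough sketch of the round-trip, to make the proposal concrete. Everything here is an assumption for discussion: the helper names, the `ID:` / `Property:` / `Verdict:` response layout, and the sample judge output are illustrative, not ADK's current prompt template or parser.

```python
def render_rubrics(rubrics: dict[str, str]) -> str:
    """Render each rubric as '[id: <rubric_id>] <text>' for the judge prompt."""
    return "\n".join(f"[id: {rid}] {text}" for rid, text in rubrics.items())


def parse_verdicts(response: str) -> dict[str, bool]:
    """Collect one verdict per echoed id; the Property line is ignored."""
    verdicts: dict[str, bool] = {}
    current_id = None
    for raw in response.splitlines():
        line = raw.strip()
        lowered = line.lower()
        if lowered.startswith("id:"):
            current_id = line[len("id:"):].strip()
        elif lowered.startswith("verdict:") and current_id is not None:
            answer = lowered[len("verdict:"):].strip()
            verdicts[current_id] = answer.startswith("yes")
            current_id = None
    return verdicts


def match_by_id(rubrics: dict[str, str], verdicts: dict[str, bool]):
    """Return (matched verdicts, ids the judge never echoed)."""
    matched = {rid: verdicts[rid] for rid in rubrics if rid in verdicts}
    missing = [rid for rid in rubrics if rid not in verdicts]
    return matched, missing


rubrics = {"r1": "ユーザの個別状況", "r2": "妊娠等", "r3": "含めて"}
prompt_section = render_rubrics(rubrics)

# Made-up judge output: paraphrased properties, reordered, r3 skipped.
judge_response = """\
ID: r2
Property: 妊娠など
Verdict: no

ID: r1
Property: ユーザーの個別状況
Verdict: yes
"""

matched, missing = match_by_id(rubrics, parse_verdicts(judge_response))
print(matched, missing)
```

In this sketch the paraphrased and reordered verdicts still attach to `r1` and `r2`, and the skipped rubric surfaces as `missing == ["r3"]` instead of shifting the alignment.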

**Minimal Reproduction Code:**

Local workaround we used (subclass that aligns rubric responses to `get_effective_rubrics_list()` by index, trusting the judge replies in prompt order — this loses the original design's defense against skipping/reordering but is enough to unblock Japanese evals):

```python
from google.adk.evaluation.eval_rubrics import RubricScore
from google.adk.evaluation.llm_as_judge import AutoRaterScore
from google.adk.evaluation.llm_as_judge_utils import (
    get_average_rubric_score,
    get_text_from_content,
)
from google.adk.evaluation.rubric_based_multi_turn_trajectory_evaluator import (
    RubricBasedMultiTurnTrajectoryEvaluator,
)

class IndexBasedRubricMultiTurnEvaluator(RubricBasedMultiTurnTrajectoryEvaluator):
    """Aligns judge verdicts to rubrics by position instead of by property text."""

    def convert_auto_rater_response_to_score(self, auto_rater_response):
        response_text = get_text_from_content(auto_rater_response.content)
        if not response_text:
            # Empty judge reply: nothing to score.
            return AutoRaterScore(score=None, rubric_scores=[])

        rubric_responses = self._auto_rater_response_parser.parse(response_text)
        effective_rubrics = self.get_effective_rubrics_list()

        # Pair the i-th verdict with the i-th rubric, trusting prompt order.
        # zip() stops at the shorter list, so extra or missing verdicts are
        # dropped silently rather than reported.
        rubric_scores = [
            RubricScore(
                rubric_id=rubric.rubric_id,
                rationale=resp.rationale,
                score=resp.score,
            )
            for rubric, resp in zip(effective_rubrics, rubric_responses)
        ]

        return AutoRaterScore(
            score=get_average_rubric_score(rubric_scores),
            rubric_scores=rubric_scores,
        )
```

Plug in via `DEFAULT_METRIC_EVALUATOR_REGISTRY.register_evaluator(...)`. End-to-end with three Japanese personas: zero "not found" warnings, scores produced as expected.
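One caveat with the workaround: `zip` stops at the shorter input, so a judge that skips or adds a verdict would be misaligned without any warning. A possible guard, sketched with a hypothetical helper `pair_in_prompt_order` (not part of ADK), is to fail loudly on a count mismatch:

```python
from typing import Sequence, TypeVar

R = TypeVar("R")
V = TypeVar("V")


def pair_in_prompt_order(
    rubrics: Sequence[R], responses: Sequence[V]
) -> list[tuple[R, V]]:
    """Pair rubrics with responses by index, refusing mismatched counts."""
    if len(rubrics) != len(responses):
        raise ValueError(
            f"expected {len(rubrics)} rubric responses, got {len(responses)}"
        )
    return list(zip(rubrics, responses))


pairs = pair_in_prompt_order(["r1", "r2"], ["yes", "no"])

# A skipped verdict is reported instead of silently truncating the pairing.
try:
    pair_in_prompt_order(["r1", "r2"], ["yes"])
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

This only detects a wrong count; a judge that reorders verdicts while keeping the count would still be misaligned, which is why matching by id looks like the more robust direction.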

If maintainers agree the rubric-id round-trip is the right direction, I'd be happy to follow up with a PR once the design is settled here — and to align on whether it lands on top of #6080 or both get folded into one design pass.

**How often has this issue occurred?:**

- Always (100%) — fully reproducible with any Japanese rubric text and gemini-2.5-flash as judge.

