Expose evaluationReferenceInputs (ground truth) on EvaluatorInput for code-based evaluators #539

@jariy17

Summary

EvaluatorInput (used by @custom_code_based_evaluator()) does not expose the evaluationReferenceInputs field from the Lambda event. The code-based evaluator contract now delivers ground-truth reference inputs, but evaluator functions cannot access them through the typed input model.

Background

Per the code-based evaluators docs, the Lambda event includes a top-level evaluationReferenceInputs list when ground truth is configured:

{
    "schemaVersion": "1.0",
    "evaluationLevel": "TRACE",
    "evaluationInput": { "sessionSpans": [...] },
    "evaluationReferenceInputs": [
        {
            "context": { "spanContext": { "sessionId": "...", "traceId": "..." } },
            "expectedResponse": { "text": "..." }
        }
    ],
    "evaluationTarget": { "traceIds": ["trace123"], "spanIds": ["span123"] }
}

The service filters these by evaluation level (SESSION → all; TRACE → session + matching traceId; TOOL_CALL → session + matching spanId).
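To make the filtering rule concrete, here is a rough Python sketch of how it reads. It is not service code. The function name filter_reference_inputs is hypothetical, and two points are assumptions: that a reference with neither a traceId nor a spanId in its spanContext counts as session-scoped, and that spanContext can carry a spanId (the sample event above only shows sessionId and traceId).

```python
from typing import Any, Dict, List, Sequence


def filter_reference_inputs(
    refs: List[Dict[str, Any]],
    level: str,
    trace_ids: Sequence[str] = (),
    span_ids: Sequence[str] = (),
) -> List[Dict[str, Any]]:
    """Illustrative guess at the per-level filtering described above."""
    if level == "SESSION":
        # SESSION: every reference input is delivered.
        return list(refs)

    kept = []
    for ref in refs:
        span_ctx = ref.get("context", {}).get("spanContext", {})
        trace_id = span_ctx.get("traceId")
        span_id = span_ctx.get("spanId")  # assumed field, not shown in the sample event

        # Assumption: no traceId and no spanId means session-scoped.
        if trace_id is None and span_id is None:
            kept.append(ref)
        # TRACE: session-scoped references plus those whose traceId matches.
        elif level == "TRACE" and trace_id in trace_ids:
            kept.append(ref)
        # TOOL_CALL: session-scoped references plus those whose spanId matches.
        elif level == "TOOL_CALL" and span_id in span_ids:
            kept.append(ref)
    return kept
```

Whether a reference that carries both a traceId and a spanId should also be delivered at TRACE level is not stated in this issue; the sketch keeps it on a traceId match.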

Current behavior

EvaluatorInput only carries evaluation_level, session_spans, target_trace_id, target_span_id, and schema_version:

https://github.com/aws/bedrock-agentcore-sdk-python/blob/main/src/bedrock_agentcore/evaluation/custom_code_based_evaluators/models.py

The decorator parses the raw event but drops evaluationReferenceInputs:

https://github.com/aws/bedrock-agentcore-sdk-python/blob/main/src/bedrock_agentcore/evaluation/custom_code_based_evaluators/decorator.py

As a result, evaluator functions that need ground truth (e.g. expected-response comparisons, exact-match scoring) have no typed access to it. They cannot fall back to the raw event either, because the decorator does not pass it through.
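For illustration, this is the kind of exact-match scoring that needs the reference inputs. The sketch works on plain dicts shaped like the evaluationReferenceInputs entries shown in Background and does not use the SDK. The helper names expected_texts and exact_match_score are hypothetical, and the whitespace and case normalization is an arbitrary choice for the example.

```python
from typing import Any, Dict, List, Optional


def expected_texts(reference_inputs: List[Dict[str, Any]]) -> List[str]:
    """Collect expectedResponse.text values, skipping entries without one."""
    texts = []
    for ref in reference_inputs:
        text = (ref.get("expectedResponse") or {}).get("text")
        if isinstance(text, str):
            texts.append(text)
    return texts


def exact_match_score(
    actual: str, reference_inputs: List[Dict[str, Any]]
) -> Optional[float]:
    """Return 1.0 on a match, 0.0 on a mismatch, None without ground truth."""
    expected = expected_texts(reference_inputs)
    if not expected:
        # Nothing to compare against, so no score can be produced.
        return None
    normalized = actual.strip().casefold()
    matched = any(normalized == e.strip().casefold() for e in expected)
    return 1.0 if matched else 0.0
```

Returning None when no ground truth is present keeps "not scorable" distinct from "scored zero".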

Proposed change

  1. Add an optional reference_inputs: List[Dict] = [] field to EvaluatorInput.
  2. Populate it in the decorator from event.get("evaluationReferenceInputs") or [].

This is backward compatible — existing evaluators ignore the new field.
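A minimal sketch of the proposed change, using a standalone dataclass as a stand-in for the real model. EvaluatorInputSketch and build_input are hypothetical names. How the existing fields are derived from the event (for example, taking the first entry of traceIds) is an assumption made only to keep the example runnable, not a description of the current decorator.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class EvaluatorInputSketch:
    """Stand-in for EvaluatorInput with the proposed field added."""

    evaluation_level: str
    session_spans: List[Dict[str, Any]]
    target_trace_id: Optional[str] = None
    target_span_id: Optional[str] = None
    schema_version: str = "1.0"
    # Proposed field: defaults to an empty list, so existing evaluators are unaffected.
    reference_inputs: List[Dict[str, Any]] = field(default_factory=list)


def build_input(event: Dict[str, Any]) -> EvaluatorInputSketch:
    """Build the typed input from a raw Lambda event (field mapping is assumed)."""
    target = event.get("evaluationTarget") or {}
    trace_ids = target.get("traceIds") or []
    span_ids = target.get("spanIds") or []
    return EvaluatorInputSketch(
        evaluation_level=event["evaluationLevel"],
        session_spans=(event.get("evaluationInput") or {}).get("sessionSpans") or [],
        target_trace_id=trace_ids[0] if trace_ids else None,
        target_span_id=span_ids[0] if span_ids else None,
        schema_version=event.get("schemaVersion", "1.0"),
        # "or []" covers both a missing key and an explicit null.
        reference_inputs=event.get("evaluationReferenceInputs") or [],
    )
```

One detail on the default: a literal `= []` is safe if EvaluatorInput is a Pydantic model, since Pydantic copies mutable defaults per instance. A plain dataclass rejects a list default and needs `field(default_factory=list)`, as used here.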

Motivation

This also unblocks consolidating third-party evaluator integrations (e.g. PR #528's DeepEvalHandler) onto the standard @custom_code_based_evaluator() contract. That handler currently reinvents event parsing and output serialization partly because EvaluatorInput cannot surface evaluationReferenceInputs (needed to build expected_output for metrics like ContextualPrecision/Recall).
