Summary
EvaluatorInput (used by @custom_code_based_evaluator()) does not expose the evaluationReferenceInputs field from the Lambda event. The code-based evaluator contract now delivers ground-truth reference inputs, but evaluator functions cannot access them through the typed input model.
Background
Per the code-based evaluators docs, the Lambda event includes a top-level evaluationReferenceInputs list when ground truth is configured:
{
  "schemaVersion": "1.0",
  "evaluationLevel": "TRACE",
  "evaluationInput": { "sessionSpans": [...] },
  "evaluationReferenceInputs": [
    {
      "context": { "spanContext": { "sessionId": "...", "traceId": "..." } },
      "expectedResponse": { "text": "..." }
    }
  ],
  "evaluationTarget": { "traceIds": ["trace123"], "spanIds": ["span123"] }
}
The service filters these by evaluation level (SESSION → all; TRACE → session + matching traceId; TOOL_CALL → session + matching spanId).
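The filtering rule can be read as follows. This is only an illustrative sketch of that rule, not the service's implementation: filter_reference_inputs is a hypothetical helper, and it assumes that a session-scoped reference carries neither traceId nor spanId in its spanContext and that span-scoped references use a spanId key (the sample event above only shows sessionId and traceId).

```python
from typing import Any, Dict, Iterable, List, Optional, Tuple


def _scope(ref: Dict[str, Any]) -> Tuple[str, Optional[str]]:
    """Classify a reference input by the most specific id in its spanContext.

    Assumption: "spanId" is the key used for tool-call references.
    """
    ctx = (ref.get("context") or {}).get("spanContext") or {}
    if ctx.get("spanId"):
        return "TOOL_CALL", ctx["spanId"]
    if ctx.get("traceId"):
        return "TRACE", ctx["traceId"]
    return "SESSION", None


def filter_reference_inputs(
    refs: List[Dict[str, Any]],
    level: str,
    trace_ids: Iterable[str] = (),
    span_ids: Iterable[str] = (),
) -> List[Dict[str, Any]]:
    """Hypothetical restatement of the per-level filtering described above."""
    if level == "SESSION":
        return list(refs)  # SESSION: everything
    if level == "TRACE":
        wanted = set(trace_ids)
    elif level == "TOOL_CALL":
        wanted = set(span_ids)
    else:
        raise ValueError(f"unknown evaluation level: {level!r}")

    kept = []
    for ref in refs:
        scope, ident = _scope(ref)
        # Keep session-scoped references plus those matching the target id.
        if scope == "SESSION" or (scope == level and ident in wanted):
            kept.append(ref)
    return kept
```

Since the service already applies this filter, an evaluator should not need to repeat it; the sketch only makes explicit which references to expect at each level.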
Current behavior
EvaluatorInput only carries evaluation_level, session_spans, target_trace_id, target_span_id, and schema_version:
https://github.com/aws/bedrock-agentcore-sdk-python/blob/main/src/bedrock_agentcore/evaluation/custom_code_based_evaluators/models.py
The decorator parses the raw event but drops evaluationReferenceInputs:
https://github.com/aws/bedrock-agentcore-sdk-python/blob/main/src/bedrock_agentcore/evaluation/custom_code_based_evaluators/decorator.py
As a result, evaluator functions that need ground truth (e.g. expected-response comparisons, exact-match scoring) have no typed access to it. Reading the raw event is not an option either, because the decorator does not pass it through.
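To make the gap concrete, here is a minimal sketch of the scoring logic such an evaluator would run if it could reach the reference list. It operates on plain dicts in the evaluationReferenceInputs shape shown in Background. expected_texts and exact_match_score are hypothetical names, not SDK API, and the whitespace- and case-insensitive comparison is a choice made for this example only.

```python
from typing import Any, Dict, List, Optional


def expected_texts(reference_inputs: List[Dict[str, Any]]) -> List[str]:
    """Collect every expectedResponse.text string from the reference inputs."""
    texts = []
    for ref in reference_inputs:
        text = (ref.get("expectedResponse") or {}).get("text")
        if isinstance(text, str):
            texts.append(text)
    return texts


def exact_match_score(
    actual: str, reference_inputs: List[Dict[str, Any]]
) -> Optional[float]:
    """Return 1.0 if `actual` matches any expected text, 0.0 if none match.

    Returns None when no ground truth is available, so callers can tell
    "not scored" apart from "scored as a miss".
    """
    expected = expected_texts(reference_inputs)
    if not expected:
        return None
    normalized = actual.strip().casefold()
    matched = any(normalized == e.strip().casefold() for e in expected)
    return 1.0 if matched else 0.0
```

Without a way to obtain the reference list inside the decorated function, logic like this has nothing to compare against.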
Proposed change
- Add an optional reference_inputs: List[Dict] = [] field to EvaluatorInput.
- Populate it in the decorator from event.get("evaluationReferenceInputs") or [].
This is backward compatible: the field is optional and defaults to an empty list, so existing evaluators can ignore it.
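The two steps above could look roughly like this. The sketch uses a stdlib dataclass as a stand-in for the real EvaluatorInput, whose base class and field types may differ. build_evaluator_input is a hypothetical name for the decorator's parsing step, and the mapping of target_trace_id and target_span_id to the first entry of evaluationTarget.traceIds and spanIds is an assumption, not taken from the SDK source.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class EvaluatorInput:
    """Stand-in for the SDK model; only reference_inputs is new."""

    evaluation_level: str
    session_spans: List[Dict[str, Any]]
    target_trace_id: Optional[str] = None
    target_span_id: Optional[str] = None
    schema_version: str = "1.0"
    # Proposed field: empty when no ground truth is configured.
    reference_inputs: List[Dict[str, Any]] = field(default_factory=list)


def build_evaluator_input(event: Dict[str, Any]) -> EvaluatorInput:
    """Hypothetical parsing step: raw Lambda event -> typed input."""
    target = event.get("evaluationTarget") or {}
    trace_ids = target.get("traceIds") or []
    span_ids = target.get("spanIds") or []
    return EvaluatorInput(
        evaluation_level=event["evaluationLevel"],
        session_spans=(event.get("evaluationInput") or {}).get("sessionSpans") or [],
        target_trace_id=trace_ids[0] if trace_ids else None,  # assumed mapping
        target_span_id=span_ids[0] if span_ids else None,  # assumed mapping
        schema_version=event.get("schemaVersion", "1.0"),
        # "or []" covers both a missing key and an explicit null.
        reference_inputs=event.get("evaluationReferenceInputs") or [],
    )
```

The dataclass stand-in needs default_factory because dataclasses reject a bare mutable default. If the real model is a Pydantic model, a plain [] default is copied per instance and would behave the same way.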
Motivation
This also unblocks consolidating third-party evaluator integrations (e.g. PR #528's DeepEvalHandler) onto the standard @custom_code_based_evaluator() contract. That handler currently reinvents event parsing and output serialization partly because EvaluatorInput cannot surface evaluationReferenceInputs (needed to build expected_output for metrics like ContextualPrecision/Recall).