Add tutorial for evaluating and comparing models with AgentOps #339

## Background

AgentOps is an accelerator for continuous evaluation, safety testing, observability, and release readiness of Microsoft Foundry agents. The current documentation frames the product around Evaluate, Ship, Observe, and Own. We have tutorial patterns for prompt-agent and hosted-agent scenarios, but we should add a third tutorial style focused on model evaluation and comparison.

This tutorial should help users answer a practical release question: given the same prompt or agent scenario, can a candidate model deployment replace the current baseline without regressing on quality, safety, cost, latency, or consistency?

## Proposed scope

Add a tutorial that walks through comparing multiple model deployments with AgentOps using the same scenario, dataset, and evaluators. The tutorial should focus primarily on the Evaluate pillar, with a clear path to using the results as PR gate criteria or release readiness criteria.

A suggested outline:

1. Define the model-selection decision
   - Identify the baseline model deployment and candidate model deployments.
   - State the decision being made, for example: promote a cheaper, faster, newer, or safer model if it meets pre-defined criteria.

2. Keep the comparison fair
   - Use the same prompt or agent configuration for each run.
   - Keep tools, dataset, evaluator configuration, and test environment constant except for the model deployment.
   - Define pass/fail thresholds before looking at the results.
   - Include a note on nondeterministic model output, repeated runs, and how to measure run-to-run consistency where relevant.

3. Run the same evaluation set against each model
   - Use one dataset across all candidate model deployments.
   - Apply the same quality and safety evaluators.
   - Capture operational signals such as cost, latency, and failure rate when available.

4. Compare outcomes
   - Compare aggregate quality, safety, cost, latency, and consistency metrics.
   - Inspect per-case failures, regressions, and outliers, not only averages.
   - Call out examples where a candidate has better average scores but worse behavior on important cases.

5. Make a release decision
   - Decide whether the candidate model can replace the baseline, needs prompt or configuration work, or should be rejected.
   - Show how the selected thresholds can become PR gate criteria or release criteria.
   - Connect the comparison workflow back to AgentOps release readiness.
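
To make steps 4 and 5 of the outline concrete, the tutorial could include something like the following minimal sketch. It is illustrative only and does not use any AgentOps API: `CaseResult`, `find_regressions`, `decide`, the case names, and the scores are all hypothetical. It shows a candidate with a better average score that still regresses on a case marked as important, which is the situation the outline asks the tutorial to call out.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class CaseResult:
    """One evaluated test case. All field names here are hypothetical."""
    case_id: str
    quality: float          # evaluator score, higher is better
    critical: bool = False  # marks cases that must not regress


def find_regressions(baseline, candidate, tolerance=0.0):
    """Return case_ids where the candidate scores lower than the baseline
    by more than `tolerance`."""
    base_by_id = {r.case_id: r for r in baseline}
    regressions = []
    for cand in candidate:
        base = base_by_id[cand.case_id]  # same dataset, so every case matches
        if base.quality - cand.quality > tolerance:
            regressions.append(cand.case_id)
    return regressions


def decide(baseline, candidate, min_mean_quality, tolerance=0.0):
    """Return (decision, regressed_case_ids) using a threshold fixed up front."""
    regressed = find_regressions(baseline, candidate, tolerance)
    critical_ids = {r.case_id for r in baseline if r.critical}
    if mean(r.quality for r in candidate) < min_mean_quality:
        return "keep baseline", regressed
    if any(case_id in critical_ids for case_id in regressed):
        return "needs work", regressed
    return "replace baseline", regressed


# Hypothetical scores for the same three-case dataset.
baseline = [
    CaseResult("refund-policy", 4.0, critical=True),
    CaseResult("greeting", 3.0),
    CaseResult("order-status", 4.0),
]
candidate = [
    CaseResult("refund-policy", 3.0, critical=True),
    CaseResult("greeting", 5.0),
    CaseResult("order-status", 5.0),
]

decision, regressed = decide(baseline, candidate, min_mean_quality=4.0)
print(decision, regressed)
```

Here the candidate's mean quality is higher than the baseline's, but the decision is not "replace baseline" because the critical `refund-policy` case regressed. The three return values map onto the three outcomes in step 5: replace the baseline, require more work, or reject the candidate.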

## Acceptance criteria

- A new tutorial exists for evaluating and comparing models with AgentOps, distinct from prompt-agent and hosted-agent tutorials.
- The tutorial uses one prompt or agent scenario, multiple candidate model deployments, the same dataset, and the same evaluator configuration.
- The tutorial explains how to compare quality, safety, cost, latency, and consistency signals.
- The tutorial includes guidance for fair comparisons: hold non-model variables constant, define thresholds before looking at the results, and inspect per-case failures in addition to aggregate metrics.
- The tutorial ends with a model-selection decision: keep the baseline, replace it with a candidate, or require more work before promotion.
- The tutorial shows how comparison outcomes can feed PR gates or release readiness criteria without making the docs too implementation-specific.
- The writing is practical, beginner-friendly, and aligned with the existing Evaluate, Ship, Observe, Own framing.
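
As a rough illustration of how pre-defined thresholds and repeated runs could feed a PR gate, the tutorial might show a sketch along these lines. Everything in it is an assumption made for illustration: `Thresholds`, `gate_violations`, the threshold values, and the run data are hypothetical and are not part of AgentOps. The point is only that a candidate can meet the average quality bar and still fail the gate on consistency.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass(frozen=True)
class Thresholds:
    """Pass/fail criteria, fixed before any results are inspected (hypothetical)."""
    min_quality: float
    max_latency_ms: float
    max_quality_stdev: float  # run-to-run consistency limit


def gate_violations(quality_runs, latency_runs_ms, thresholds):
    """Return human-readable violations; an empty list means the gate passes."""
    violations = []
    if mean(quality_runs) < thresholds.min_quality:
        violations.append("quality below minimum")
    if mean(latency_runs_ms) > thresholds.max_latency_ms:
        violations.append("latency above maximum")
    if pstdev(quality_runs) > thresholds.max_quality_stdev:
        violations.append("quality inconsistent across runs")
    return violations


thresholds = Thresholds(min_quality=4.0, max_latency_ms=2000.0, max_quality_stdev=0.5)

# Hypothetical aggregate scores from three repeated runs of one candidate.
quality_runs = [4.5, 4.5, 3.0]
latency_runs_ms = [1200.0, 1300.0, 1250.0]

violations = gate_violations(quality_runs, latency_runs_ms, thresholds)
exit_code = 1 if violations else 0  # a CI step could pass this to sys.exit()
print(violations, exit_code)
```

In this example the mean quality equals the minimum and latency is within budget, yet the gate fails because one of the three runs scored much lower than the others.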

## Suggested labels

- documentation
- tutorial
- evaluation
- agentops
