
Add tutorial for evaluating and comparing models with AgentOps #339

Description

@placerda

Background

AgentOps is an accelerator for continuous evaluation, safety testing, observability, and release readiness of Microsoft Foundry agents. The current documentation frames the product around Evaluate, Ship, Observe, and Own. We have tutorial patterns for prompt-agent and hosted-agent scenarios, but we should add a third tutorial style focused on model evaluation and comparison.

This tutorial should help users answer a practical release question: given the same prompt or agent scenario, can a candidate model deployment replace the current baseline without regressing on quality, safety, cost, latency, or consistency?

Proposed scope

Add a tutorial that walks through comparing multiple model deployments with AgentOps using the same scenario, dataset, and evaluators. The tutorial should focus primarily on the Evaluate pillar, with a clear path to using the results as PR gate or release readiness criteria.

A suggested outline:

  1. Define the model-selection decision

    • Identify the baseline model deployment and candidate model deployments.
    • State the decision being made, for example: promote a cheaper, faster, newer, or safer model if it meets pre-defined criteria.
  2. Keep the comparison fair

    • Use the same prompt or agent configuration for each run.
    • Keep tools, dataset, evaluator configuration, and test environment constant except for the model deployment.
    • Define pass/fail thresholds before looking at the results.
    • Include a note on randomization, repeated runs, and consistency where relevant.
  3. Run the same evaluation set against each model

    • Use one dataset across all candidate model deployments.
    • Apply the same quality and safety evaluators.
    • Capture operational signals such as cost, latency, and failure rate when available.
  4. Compare outcomes

    • Compare aggregate quality, safety, cost, latency, and consistency metrics.
    • Inspect per-case failures, regressions, and outliers, not only averages.
    • Call out examples where a candidate has better average scores but worse behavior on important cases.
  5. Make a release decision

    • Decide whether the candidate model can replace the baseline, needs prompt or configuration work, or should be rejected.
    • Show how the selected thresholds can become PR gate or release criteria.
    • Connect the comparison workflow back to AgentOps release readiness.
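
To make the outline more concrete, here is a minimal sketch of steps 2 to 5 in plain Python. It does not call AgentOps; it assumes per-case results for each model deployment have already been collected somewhere. Every name in it (`CaseResult`, `THRESHOLDS`, `summarize`, `per_case_regressions`, `decide`), the threshold values, and the sample data are hypothetical and only illustrate the shape of the comparison. The mapping from failed checks to a decision is also an assumption, not an AgentOps rule.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class CaseResult:
    """One evaluated dataset case for one model deployment (hypothetical shape)."""
    case_id: str
    quality: float    # evaluator score in [0, 1], higher is better
    safe: bool        # True if the safety evaluator passed
    latency_s: float
    cost_usd: float


# Thresholds are fixed before looking at any results (illustrative values).
THRESHOLDS = {
    "max_quality_drop": 0.02,      # allowed drop in mean quality vs. baseline
    "min_safety_pass_rate": 1.0,   # every case must pass the safety evaluator
    "max_latency_ratio": 1.10,     # candidate mean latency <= 110% of baseline
    "max_cost_ratio": 1.00,        # candidate total cost <= baseline total cost
    "max_case_regression": 0.20,   # allowed quality drop on any single case
}


def summarize(results):
    """Aggregate metrics over all cases of one run."""
    return {
        "quality": mean(r.quality for r in results),
        "safety_pass_rate": mean(1.0 if r.safe else 0.0 for r in results),
        "latency_s": mean(r.latency_s for r in results),
        "cost_usd": sum(r.cost_usd for r in results),
    }


def per_case_regressions(baseline, candidate, max_drop):
    """Case ids where the candidate scores much worse than the baseline."""
    base = {r.case_id: r for r in baseline}
    return [
        c.case_id
        for c in candidate
        if c.case_id in base and base[c.case_id].quality - c.quality > max_drop
    ]


def decide(baseline, candidate, t=THRESHOLDS):
    """Return (decision, failed_checks) for one candidate against the baseline."""
    b, c = summarize(baseline), summarize(candidate)
    failures = []
    if b["quality"] - c["quality"] > t["max_quality_drop"]:
        failures.append("aggregate quality")
    if c["safety_pass_rate"] < t["min_safety_pass_rate"]:
        failures.append("safety")
    if c["latency_s"] > b["latency_s"] * t["max_latency_ratio"]:
        failures.append("latency")
    if c["cost_usd"] > b["cost_usd"] * t["max_cost_ratio"]:
        failures.append("cost")
    regressed = per_case_regressions(baseline, candidate, t["max_case_regression"])
    if regressed:
        failures.append("per-case regressions: " + ", ".join(regressed))

    # Assumed policy: a safety failure rejects, any other failure needs work.
    if not failures:
        return "replace", failures
    if "safety" in failures:
        return "reject", failures
    return "needs work", failures


# Made-up data: same three cases, same evaluators, only the model differs.
baseline = [CaseResult(cid, 0.80, True, 1.0, 0.010) for cid in ("a", "b", "c")]
candidate = [
    CaseResult("a", 0.98, True, 0.8, 0.005),
    CaseResult("b", 0.98, True, 0.8, 0.005),
    CaseResult("c", 0.50, True, 0.8, 0.005),  # better average, worse on this case
]

decision, failed = decide(baseline, candidate)
print(decision, failed)
```

With this sample data the candidate is cheaper, faster, and has a higher average quality score, yet it is flagged as needing work because case `c` regresses beyond the per-case limit. A PR gate could then fail the check whenever the decision is anything other than `replace`.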

Acceptance criteria

  • A new tutorial exists for evaluating and comparing models with AgentOps, distinct from prompt-agent and hosted-agent tutorials.
  • The tutorial uses one prompt or agent scenario, multiple candidate model deployments, the same dataset, and the same evaluator configuration.
  • The tutorial explains how to compare quality, safety, cost, latency, and consistency signals.
  • The tutorial includes guidance for fair comparisons: hold non-model variables constant, define thresholds before looking at the results, and inspect per-case failures in addition to aggregate metrics.
  • The tutorial ends with a model-selection decision: keep the baseline, replace it with a candidate, or require more work before promotion.
  • The tutorial shows how comparison outcomes can feed PR gates or release readiness criteria without making the docs too implementation-specific.
  • The writing is practical, beginner-friendly, and aligned with the existing Evaluate, Ship, Observe, Own framing.
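
For the consistency signal named in the criteria above, one possible sketch is to repeat the same evaluation several times per model deployment and look at the spread of the aggregate score. The function names, the `max_spread` default, and the sample scores below are hypothetical; the population standard deviation is just one reasonable choice of spread measure.

```python
from statistics import mean, pstdev


def consistency(run_scores):
    """Summarize aggregate quality scores from repeated runs of one model."""
    return {
        "mean": mean(run_scores),
        "spread": pstdev(run_scores),  # population standard deviation
        "worst": min(run_scores),
    }


def is_consistent(run_scores, max_spread=0.05):
    """True if repeated runs stay within the allowed spread (assumed limit)."""
    return pstdev(run_scores) <= max_spread


# Made-up aggregate scores from three repeated runs per deployment.
stable_runs = [0.80, 0.80, 0.80]
noisy_runs = [0.50, 0.90, 0.70]

print(consistency(stable_runs), is_consistent(stable_runs))
print(consistency(noisy_runs), is_consistent(noisy_runs))
```

Reporting the worst run alongside the mean helps show when a model's average looks acceptable but individual runs do not.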

Suggested labels

  • documentation
  • tutorial
  • evaluation
  • agentops
