Automate MinMax Scenario Runs and Leaderboard #378

@ChathurangiShyalika

Description

Run all MinMax scenarios automatically and generate a leaderboard.

Scenario set:

Methods to evaluate: directLLM, sttiup_agent

Tasks

  1. Combine all 65 scenarios into one scenario registry/file.
  2. Add an automated runner for MinMax scenarios.
  3. Run each scenario with directLLM and sttiup_agent.
  4. Save outputs, final answers, and trajectories where applicable.
  5. Score each run using the same evaluation process as before.
  6. Aggregate results by method.
  7. Generate the leaderboard in CSV/Markdown.
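
Tasks 1 to 4 could be sketched roughly as below. This is only an illustration under stated assumptions: the registry is assumed to be a single JSON file holding a list of scenario objects with an `id` field, each method is assumed to be a callable that takes a scenario dict and returns a dict with `final_answer` and an optional `trajectory`, and the names `load_registry` and `run_all` are hypothetical, not existing code in this repository.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List

# Hypothetical method signature: takes a scenario dict and returns a dict
# with "final_answer" and, where applicable, "trajectory".
Method = Callable[[dict], dict]


def load_registry(path: Path) -> List[dict]:
    """Load the combined registry (assumed: a JSON list of objects with an "id")."""
    scenarios = json.loads(path.read_text(encoding="utf-8"))
    ids = [s["id"] for s in scenarios]
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate scenario ids in registry")
    return scenarios


def run_all(scenarios: List[dict], methods: Dict[str, Method], out_dir: Path) -> List[dict]:
    """Run every scenario with every method and save one JSON file per run."""
    records = []
    for name, method in methods.items():
        method_dir = out_dir / name
        method_dir.mkdir(parents=True, exist_ok=True)
        for scenario in scenarios:
            record = {"scenario_id": scenario["id"], "method": name}
            try:
                result = method(scenario)
                record.update(
                    status="ok",
                    final_answer=result.get("final_answer"),
                    trajectory=result.get("trajectory"),  # None for methods without one
                )
            except Exception as exc:
                # Record the failure and continue so one bad run does not stop the batch.
                record.update(status="error", error=repr(exc))
            out_file = method_dir / f"{scenario['id']}.json"
            out_file.write_text(json.dumps(record, indent=2), encoding="utf-8")
            records.append(record)
    return records
```

A call would pass a mapping such as `{"directLLM": ..., "sttiup_agent": ...}`, with the values being whatever entry points the two methods actually expose. Writing one file per run keeps partial results usable if the batch is interrupted.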

Expected Flow
scenarios → automated runs → saved outputs/trajectories → scoring → leaderboard

Acceptance Criteria

  • All 65 scenarios run automatically.
  • Both methods are evaluated.
  • Scores are generated per scenario.
  • Final leaderboard compares directLLM vs sttiup_agent.
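
The aggregation and leaderboard steps, together with a check for the criteria above, might look like the following sketch. It assumes per-scenario scores arrive as dicts with `method`, `scenario_id`, and a numeric `score`, and it ranks methods by mean score. Both the record shape and the choice of mean are assumptions, since the issue only says to reuse the existing evaluation process. The function names are hypothetical.

```python
import csv
from collections import defaultdict
from pathlib import Path
from typing import Dict, List

COLUMNS = ["method", "scenarios", "mean_score"]


def build_leaderboard(scores: List[dict], expected_ids: List[str], methods: List[str]) -> List[dict]:
    """Aggregate per-scenario scores into one row per method, best mean first."""
    by_method: Dict[str, Dict[str, float]] = defaultdict(dict)
    for row in scores:
        by_method[row["method"]][row["scenario_id"]] = float(row["score"])

    rows = []
    for method in methods:
        # Fail loudly if any scenario was not scored for this method.
        missing = sorted(set(expected_ids) - set(by_method[method]))
        if missing:
            raise ValueError(f"{method} is missing scores for: {missing}")
        values = [by_method[method][sid] for sid in expected_ids]
        rows.append({
            "method": method,
            "scenarios": len(values),
            "mean_score": round(sum(values) / len(values), 4),  # assumed metric
        })
    rows.sort(key=lambda r: r["mean_score"], reverse=True)
    return rows


def write_csv(rows: List[dict], path: Path) -> None:
    """Write the leaderboard rows to a CSV file with a header line."""
    with path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(rows)


def to_markdown(rows: List[dict]) -> str:
    """Render the leaderboard rows as a Markdown table."""
    lines = [
        "| " + " | ".join(COLUMNS) + " |",
        "| " + " | ".join("---" for _ in COLUMNS) + " |",
    ]
    for r in rows:
        lines.append("| " + " | ".join(str(r[c]) for c in COLUMNS) + " |")
    return "\n".join(lines)
```

Raising on missing scores means a leaderboard is only produced when every expected scenario has been scored for both `directLLM` and `sttiup_agent`, which matches the first three criteria.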
