Description
Run all MinMax scenarios automatically and generate a leaderboard.
Scenario set:
Methods to evaluate: directLLM, sttiup_agent
Tasks
- Combine all 65 scenarios into one scenario registry/file.
- Add automated runner for MinMax scenarios.
- Run each scenario with directLLM and sttiup_agent.
- Save outputs, final answers, and trajectories where applicable.
- Score each run using the same evaluation process as before.
- Aggregate results by method.
- Generate leaderboard in CSV/Markdown.
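A minimal sketch of how the tasks above could fit together, offered as a starting point and not as the project's actual runner. Everything here is an assumption for illustration: the `runners` mapping (one callable per method), the `score_fn` callable standing in for the existing evaluation process, the record field names, and the choice of mean score as the aggregate.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

METHODS = ["directLLM", "sttiup_agent"]


def run_all(scenarios, runners, score_fn):
    """Run every scenario with every method; return one record per run."""
    records = []
    for scenario in scenarios:
        for method in METHODS:
            # runners[method] is a hypothetical callable returning a dict
            # with an "answer" and, where applicable, a "trajectory".
            output = runners[method](scenario)
            records.append({
                "scenario_id": scenario["id"],
                "method": method,
                "answer": output.get("answer"),
                "trajectory": output.get("trajectory"),  # None if not applicable
                "score": score_fn(scenario, output),     # stand-in for existing scoring
            })
    return records


def aggregate(records):
    """Group scores by method; mean score is an assumed aggregate."""
    by_method = defaultdict(list)
    for record in records:
        by_method[record["method"]].append(record["score"])
    rows = [
        {"method": m, "runs": len(s), "mean_score": round(mean(s), 4)}
        for m, s in by_method.items()
    ]
    rows.sort(key=lambda row: row["mean_score"], reverse=True)
    return rows


def to_csv(rows):
    """Render the leaderboard rows as CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=["method", "runs", "mean_score"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def to_markdown(rows):
    """Render the leaderboard rows as a Markdown table."""
    lines = ["| method | runs | mean_score |", "| --- | --- | --- |"]
    for row in rows:
        lines.append(f"| {row['method']} | {row['runs']} | {row['mean_score']} |")
    return "\n".join(lines)
```

Keeping per-run records separate from the aggregated rows means the per-scenario scores stay available for inspection after the leaderboard is built.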
Expected Flow
scenarios → automated runs → saved outputs/trajectories → scoring → leaderboard
Acceptance Criteria
- All 65 scenarios run automatically.
- Both methods are evaluated.
- Scores are generated per scenario.
- Final leaderboard compares directLLM vs sttiup_agent.
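The first three criteria can be checked mechanically. The sketch below assumes each run is stored as a dict with hypothetical `scenario_id`, `method`, and `score` keys; it reports every scenario/method pair that has no score yet.

```python
def missing_runs(records, scenario_ids, methods):
    """Return the (scenario_id, method) pairs that have no scored run."""
    expected = {(s, m) for s in scenario_ids for m in methods}
    scored = {
        (r["scenario_id"], r["method"])
        for r in records
        if r.get("score") is not None
    }
    return sorted(expected - scored)
```

An empty result means every scenario was run and scored with both methods.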