Skip to content

[P0][Phase 1][Docs] Goal vs node criteria, audit auto-correction, improvement gaps #10

Description

@pravinmishra672

Overview

Document how EngineX evaluates success at each layer — and audit what already exists in code.

Every agent has:

  • A Goal — final checklist for the whole job
  • Steps (nodes) — each step can auto-retry when AI output is incomplete
  • Human review — a person approves in the dashboard when required

These layers are often conflated. This ticket covers documentation (with diagrams), tests, and gap analysis.

Assignee: @P00rkavi
Supersedes: #1 (closed — scope merged here)


Ticket metadata

Field Value
Phase Phase 1 — Pilot / GTM
Priority P0 — Critical path
Type Documentation + platform audit
Blocks Onboarding, sales conversations
Related #2 Hourly Tracking (closed — v1 shipped PR #9)

Conceptual model

Think of an agent as a factory line:

  • Goal = Final QA on the finished product
  • Each step (node) = One station on the line
  • Judge RETRY = Same station sends work back to AI to fix
  • Graph loop = Route to a fix step, then back to validation
  • Human review = Person must sign off before the line continues
flowchart TB
 subgraph whole_job [Whole job — GOAL level]
 G[Goal: mission + final checklist + rules]
 end

 subgraph one_step [One step — NODE level]
 N[Node: do one piece of work]
 O[Required outputs — output_keys]
 J[Judge: good enough?]
 end

 G --> N
 N --> O --> J
 J -->|RETRY| N
 J -->|ACCEPT| Next[Next step]
Loading

Common misconceptions

Misconception Actual behavior
Every node has its own success criteria list Usually no — checklist is on the Goal in agent.py
Goal criteria auto-retry the whole agent No — they score/report at the end only
ESCALATE = send to human No — human review = pause_nodes
Node criteria = Goal criteria Different layers (see below)

The three layers

Layer 1 — Goal (whole job checklist)

  • Where: examples/templates/<agent>/agent.py
  • Code: core/engine/graph/goal.py, core/engine/runtime/outcome_aggregator.py
  • Role: Final evaluation — not the retry mechanism for individual steps

Layer 2 — Node outputs (did this step finish?)

  • Where: NodeSpec.output_keys in nodes/__init__.py
  • Behavior: Missing output → Judge RETRY → AI tries again (limit: loop_config.max_iterations)
  • Code: core/engine/graph/event_loop/node.py_evaluate()

Layer 3 — Node success_criteria (optional quality rubric)

  • Where: optional NodeSpec.success_criteria
  • Behavior: Second LLM quality check via conversation_judge.py
  • Status: Supported in code, rarely used in templates today

Auto-correction — existing implementation

Do not rebuild — document, test, and fill gaps.

Capability Code location
Step judge (missing output_keys → RETRY) event_loop/node.py_evaluate()
Feedback to LLM on retry [Judge feedback]: ... via add_user_message()
Optional Level 2 quality judge conversation_judge.py + success_criteria
Per-step retry limit loop_config.max_iterations
Between-step loops (validate → fix) Conditional EdgeSpec in graph/edge.py
Retry telemetry (partial) ExecutionResult.total_retries, runtime logs
Whole-job Goal scorecard OutcomeAggregator — tracks only, no full-agent auto-retry

Not built: separate EvaluationNode — judge runs inside each event_loop step.

flowchart TD
 A[EventLoopNode] --> B{_evaluate / judge}
 B -->|ACCEPT| C[Next step]
 B -->|RETRY| A
 B -->|max iterations| D[Step fails]
Loading

All four feedback mechanisms

How correction, routing, approval, and scoring fit together in one agent run:

flowchart TB
 subgraph goal_layer [End of job — measurement only]
 G[Goal checklist in agent.py]
 OA[OutcomeAggregator — final score / KPIs]
 G --> OA
 end

 subgraph step_judge [Inside one step — Judge RETRY]
 N[EventLoopNode: AI work]
 J{output_keys complete?}
 RF["[Judge feedback] → retry"]
 N --> J
 J -->|no| RF --> N
 J -->|yes| OUT[Step outputs to shared memory]
 J -->|max iterations| FAIL[Step fails]
 end

 subgraph graph_loop [Between steps — validate → fix loop]
 V[Validate step]
 FX[Fix / remap step]
 V -->|fail| FX --> V
 V -->|pass| NEXT[Continue graph]
 end

 subgraph human [Human review — pause_nodes]
 P[Execution PAUSED]
 APP[Approver in web dashboard]
 INJ[inject_input → resume]
 P --> APP --> INJ
 end

 OUT --> V
 NEXT --> goal_layer
 V -->|needs approver| P
 INJ --> V
Loading

Retry vs human review

Mechanism Who acts When
Judge RETRY AI, same step Missing outputs
Graph loop Another step validate → fix edges
Human pause Person in dashboard pause_nodes / approval
Goal criteria Measurement only End of run / KPIs

ESCALATE (judge) = step fails. Not equivalent to human review.


Deliverables

Docs

  • docs/GOALS.md — overview + diagrams (this issue is the spec)
  • Link from README
  • Validate → fix graph loop documented in examples/templates/hourly_tracking/ + docs/ENGINEX_COMPLETE_GUIDE.md Section 6–8

Tests

  • RETRY when output_keys missing — test_event_loop_missing_output_keys_retried, test_event_loop_node.py
  • Max iterations exhausted → step fails cleanly
  • Feedback injected into conversation on RETRY

Examples

  • Level 2 NodeSpec.success_criteriameeting_scheduler, agreement_analysis nodes

Optional (P1)

  • Expose retry count in dashboard if not visible
  • Improvement gaps — docs/ENGINEX_COMPLETE_GUIDE.md Section 22

Out of scope: new EvaluationNode unless audit identifies a gap.


Reference templates


Definition of done

  • docs/GOALS.md published and linked from README
  • Team can explain: step judge vs Goal vs graph loops vs human review
  • Tests pass for retry + max-limit failure
  • At least one Level 2 success_criteria example in templates
  • Improvement gaps section with prioritized recommendations

Metadata

Metadata

Assignees

No one assigned

    Labels

    phase-1Phase 1 — pilot / GTMpriority-p0P0 — critical path / do nowtype-docsDocumentation

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions