[P0][Phase 1][Docs] Goal vs node criteria, audit auto-correction, improvement gaps

## Overview

Document how EngineX evaluates success at each layer — and audit what already exists in code.

Every agent has:
- A **Goal** — final checklist for the whole job
- **Steps (nodes)** — each step can auto-retry when AI output is incomplete
- **Human review** — a person approves in the dashboard when required

These mechanisms are often conflated. This ticket covers documentation (with diagrams), tests, and gap analysis.

**Assignee:** @P00rkavi
**Supersedes:** #1 (closed — scope merged here)

---

## Ticket metadata

| Field | Value |
|-------|-------|
| **Phase** | Phase 1 — Pilot / GTM |
| **Priority** | **P0** — Critical path |
| **Type** | Documentation + platform audit |
| **Blocks** | Onboarding, sales conversations |
| **Related** | [#2 Hourly Tracking](https://github.com/EngineXV/engineX/issues/2) (closed — v1 shipped PR #9) |

---

## Conceptual model

Think of an agent as a factory line:

- **Goal** = Final QA on the finished product
- **Each step (node)** = One station on the line
- **Judge RETRY** = Same station sends work back to AI to fix
- **Graph loop** = Route to a fix step, then back to validation
- **Human review** = Person must sign off before the line continues

```mermaid
flowchart TB
 subgraph whole_job [Whole job — GOAL level]
 G[Goal: mission + final checklist + rules]
 end

 subgraph one_step [One step — NODE level]
 N[Node: do one piece of work]
 O[Required outputs — output_keys]
 J[Judge: good enough?]
 end

 G --> N
 N --> O --> J
 J -->|RETRY| N
 J -->|ACCEPT| Next[Next step]
```

---

## Common misconceptions

| Misconception | Actual behavior |
|---------------|-----------------|
| Every node has its own success criteria list | Usually **no** — checklist is on the **Goal** in `agent.py` |
| Goal criteria auto-retry the whole agent | **No** — they score/report at the end only |
| ESCALATE = send to human | **No** — human review = **`pause_nodes`** |
| Node criteria = Goal criteria | **Different layers** (see below) |

---

## The three layers

### Layer 1 — Goal (whole job checklist)

- **Where:** `examples/templates/<agent>/agent.py`
- **Code:** `core/engine/graph/goal.py`, `core/engine/runtime/outcome_aggregator.py`
- **Role:** Final evaluation — not the retry mechanism for individual steps
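
To illustrate "final evaluation, not a retry mechanism", here is a minimal sketch of a Goal scorecard. `Criterion`, `score_goal`, and the weights are hypothetical names invented for this example; they are not the actual `Goal` or `OutcomeAggregator` API, which should be checked in `goal.py` and `outcome_aggregator.py`.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Criterion:
    """One line of the Goal checklist (hypothetical shape, for illustration)."""
    name: str
    weight: float
    check: Callable[[dict], bool]  # inspects the final outputs of the run


def score_goal(criteria: list[Criterion], final_outputs: dict) -> dict:
    """Score a finished run. Reports only; it never triggers a retry."""
    met = [c.name for c in criteria if c.check(final_outputs)]
    total = sum(c.weight for c in criteria)
    earned = sum(c.weight for c in criteria if c.name in met)
    return {"score": earned / total if total else 0.0, "met": met}


criteria = [
    Criterion("report_written", 0.5, lambda out: bool(out.get("report"))),
    Criterion("no_errors", 0.5, lambda out: not out.get("errors")),
]
result = score_goal(criteria, {"report": "done", "errors": ["timeout"]})
```

The function returns a score and the list of criteria met, and nothing in it can send work back to a step. That is the distinction this layer is meant to capture.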

### Layer 2 — Node outputs (did this step finish?)

- **Where:** `NodeSpec.output_keys` in `nodes/__init__.py`
- **Behavior:** Missing output → **Judge RETRY** → AI tries again (limit: `loop_config.max_iterations`)
- **Code:** `core/engine/graph/event_loop/node.py` → `_evaluate()`
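
The missing-output check can be sketched as below. This is an assumption-laden illustration, not the body of `_evaluate()`: `Verdict` and `evaluate_outputs` are hypothetical names, and treating a `None` value as missing is a choice made for this example.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    action: str          # "ACCEPT" or "RETRY"
    feedback: str = ""   # text sent back to the LLM on RETRY


def evaluate_outputs(output_keys: list[str], outputs: dict) -> Verdict:
    """Return RETRY with feedback if any required key is absent or None."""
    missing = [k for k in output_keys if outputs.get(k) is None]
    if missing:
        return Verdict("RETRY", f"Missing required outputs: {', '.join(missing)}")
    return Verdict("ACCEPT")


# The step declared two output keys but the model only produced one.
verdict = evaluate_outputs(["summary", "risk_level"], {"summary": "ok"})
```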

### Layer 3 — Node success_criteria (optional quality rubric)

- **Where:** optional `NodeSpec.success_criteria`
- **Behavior:** Second, LLM-based quality check via `conversation_judge.py` (the "Level 2" judge referenced elsewhere in this ticket)
- **Status:** Supported in code, rarely used in templates today
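
A rough sketch of how an optional rubric could sit on top of the required-output check. The names `judge_step`, `QualityJudge`, and `stub_judge` are hypothetical, the LLM call is replaced by a plain function, and the ordering (keys first, rubric second) is an assumption to verify against `conversation_judge.py`.

```python
from typing import Callable, Optional

# Hypothetical judge signature: (success_criteria, outputs) -> (passed, feedback).
QualityJudge = Callable[[str, dict], tuple[bool, str]]


def judge_step(
    output_keys: list[str],
    outputs: dict,
    success_criteria: Optional[str] = None,
    quality_judge: Optional[QualityJudge] = None,
) -> tuple[str, str]:
    """Stage 1: required keys. Stage 2: rubric, only if one is configured."""
    missing = [k for k in output_keys if k not in outputs]
    if missing:
        return "RETRY", f"Missing outputs: {missing}"
    if success_criteria and quality_judge:
        passed, feedback = quality_judge(success_criteria, outputs)
        if not passed:
            return "RETRY", feedback
    return "ACCEPT", ""


def stub_judge(criteria: str, outputs: dict) -> tuple[bool, str]:
    """Stand-in for an LLM call: requires at least three proposed slots."""
    ok = len(outputs.get("slots", [])) >= 3
    return ok, ("" if ok else f"Rubric not met: {criteria}")


rubric = "Propose at least three meeting slots"
too_few = judge_step(["slots"], {"slots": ["Mon 10:00"]}, rubric, stub_judge)
no_rubric = judge_step(["slots"], {"slots": ["Mon 10:00"]})
```

With no `success_criteria` set, the same outputs are accepted. This matches the "optional" framing: most templates would only exercise the first stage.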

---

## Auto-correction — existing implementation

Do not rebuild — document, test, and fill gaps.

| Capability | Code location |
|------------|---------------|
| Step judge (missing `output_keys` → RETRY) | `event_loop/node.py` → `_evaluate()` |
| Feedback to LLM on retry | `[Judge feedback]: ...` via `add_user_message()` |
| Optional Level 2 quality judge | `conversation_judge.py` + `success_criteria` |
| Per-step retry limit | `loop_config.max_iterations` |
| Between-step loops (validate → fix) | Conditional `EdgeSpec` in `graph/edge.py` |
| Retry telemetry (partial) | `ExecutionResult.total_retries`, runtime logs |
| Whole-job Goal scorecard | `OutcomeAggregator` — tracks only, no full-agent auto-retry |

**Not built:** a separate `EvaluationNode`. The judge runs inside each `EventLoopNode` step.

```mermaid
flowchart TD
 A[EventLoopNode] --> B{_evaluate / judge}
 B -->|ACCEPT| C[Next step]
 B -->|RETRY| A
 B -->|max iterations| D[Step fails]
```
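
The loop in the diagram can be sketched as follows. `run_step` and `fake_llm` are hypothetical, the conversation is a plain list of strings rather than the real message objects, and the real loop appends feedback through `add_user_message()` rather than `list.append`.

```python
from typing import Callable


def run_step(
    generate: Callable[[list[str]], dict],  # stand-in for one LLM turn
    output_keys: list[str],
    max_iterations: int,
) -> tuple[str, dict, list[str]]:
    """Loop until outputs are complete or the iteration budget runs out."""
    conversation: list[str] = []
    for _ in range(max_iterations):
        outputs = generate(conversation)
        missing = [k for k in output_keys if k not in outputs]
        if not missing:
            return "ACCEPT", outputs, conversation
        # Feed the judge's complaint back so the next turn can correct it.
        conversation.append(f"[Judge feedback]: missing {', '.join(missing)}")
    return "FAILED", {}, conversation


def fake_llm(conversation: list[str]) -> dict:
    """Forgets 'total' until it has seen judge feedback once."""
    out = {"rows": [1, 2]}
    if conversation:
        out["total"] = 3
    return out


status, outputs, conversation = run_step(fake_llm, ["rows", "total"], max_iterations=3)
```

Here the second attempt succeeds because the feedback message is visible to the model. A generator that never produces the key exhausts `max_iterations` and the step fails.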

---

## All four feedback mechanisms

How correction, routing, approval, and scoring fit together in one agent run:

```mermaid
flowchart TB
 subgraph goal_layer [End of job — measurement only]
 G[Goal checklist in agent.py]
 OA[OutcomeAggregator — final score / KPIs]
 G --> OA
 end

 subgraph step_judge [Inside one step — Judge RETRY]
 N[EventLoopNode: AI work]
 J{output_keys complete?}
 RF["[Judge feedback] → retry"]
 N --> J
 J -->|no| RF --> N
 J -->|yes| OUT[Step outputs to shared memory]
 J -->|max iterations| FAIL[Step fails]
 end

 subgraph graph_loop [Between steps — validate → fix loop]
 V[Validate step]
 FX[Fix / remap step]
 V -->|fail| FX --> V
 V -->|pass| NEXT[Continue graph]
 end

 subgraph human [Human review — pause_nodes]
 P[Execution PAUSED]
 APP[Approver in web dashboard]
 INJ[inject_input → resume]
 P --> APP --> INJ
 end

 OUT --> V
 NEXT --> goal_layer
 V -->|needs approver| P
 INJ --> V
```
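
The between-step validate → fix loop differs from Judge RETRY in that routing is decided by edges, not by the step itself. A minimal sketch, assuming edges are `(source, condition, target)` tuples evaluated in order; `run_graph` and this edge shape are hypothetical and do not reflect the real `EdgeSpec` fields.

```python
from typing import Callable

# An edge is (source, condition, target); the first matching condition wins.
Edge = tuple[str, Callable[[dict], bool], str]


def run_graph(
    steps: dict[str, Callable[[dict], None]],
    edges: list[Edge],
    start: str,
    memory: dict,
    max_hops: int = 10,
) -> list[str]:
    """Run steps against shared memory, following conditional edges."""
    path: list[str] = []
    current = start
    for _ in range(max_hops):  # guard against an endless validate/fix cycle
        path.append(current)
        steps[current](memory)
        target = next((t for s, cond, t in edges if s == current and cond(memory)), None)
        if target is None:
            break
        current = target
    return path


def validate(mem: dict) -> None:
    mem["valid"] = mem["hours"] <= 24


def fix(mem: dict) -> None:
    mem["hours"] = 24  # clamp an impossible daily total


def done(mem: dict) -> None:
    mem["done"] = True


steps = {"validate": validate, "fix": fix, "done": done}
edges = [
    ("validate", lambda m: not m["valid"], "fix"),
    ("fix", lambda m: True, "validate"),
    ("validate", lambda m: m["valid"], "done"),
]
memory = {"hours": 30}
path = run_graph(steps, edges, "validate", memory)
```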

---

## Retry vs human review

| Mechanism | Who acts | When |
|-----------|----------|------|
| Judge RETRY | AI, same step | Missing outputs |
| Graph loop | Another step | validate → fix edges |
| Human pause | Person in dashboard | `pause_nodes` / approval |
| Goal criteria | Measurement only | End of run / KPIs |

**ESCALATE** (judge) = step fails. Not equivalent to human review.
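
To make the contrast concrete, the sketch below models a human pause as a run state that stops before a node listed in `pause_nodes` and resumes only when input is injected. The `Run` class and the `inject_input` signature shown here are hypothetical; only the names `pause_nodes` and `inject_input` come from this ticket.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    """Hypothetical run state; not the real executor."""
    nodes: list[str]
    pause_nodes: set[str]
    position: int = 0
    status: str = "RUNNING"
    memory: dict = field(default_factory=dict)

    def advance(self) -> None:
        """Execute nodes in order, stopping before any node awaiting sign-off."""
        while self.position < len(self.nodes):
            node = self.nodes[self.position]
            if node in self.pause_nodes and f"{node}.approved" not in self.memory:
                self.status = "PAUSED"
                return
            self.position += 1
        self.status = "COMPLETED"

    def inject_input(self, approved: bool) -> None:
        """Record the approver's decision for the paused node and resume.

        A real flow would branch on rejection; this sketch only records it.
        """
        node = self.nodes[self.position]
        self.memory[f"{node}.approved"] = approved
        self.status = "RUNNING"
        self.advance()


run = Run(nodes=["extract", "approve", "send"], pause_nodes={"approve"})
run.advance()
paused_status = run.status
run.inject_input(approved=True)
```

Nothing here counts as a failure: the run is waiting, not broken. An ESCALATE verdict, by contrast, ends the step.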

---

## Deliverables

### Docs
- [ ] `docs/GOALS.md` — overview + diagrams (this issue is the spec)
- [ ] Link from README
- [x] Validate → fix graph loop documented in `examples/templates/hourly_tracking/` + `docs/ENGINEX_COMPLETE_GUIDE.md` Sections 6–8

### Tests
- [x] RETRY when `output_keys` missing: `test_event_loop_missing_output_keys_retried` in `test_event_loop_node.py`
- [x] Max iterations exhausted → step fails cleanly
- [x] Feedback injected into conversation on RETRY

### Examples
- [x] Level 2 `NodeSpec.success_criteria` — `meeting_scheduler`, `agreement_analysis` nodes

### Optional (P1)
- [ ] Expose retry count in dashboard if not visible
- [x] Improvement gaps — `docs/ENGINEX_COMPLETE_GUIDE.md` Section 22

**Out of scope:** new `EvaluationNode` unless audit identifies a gap.

---

## Reference templates

- `agreement_analysis` — HITL + judge RETRY on extract
- `log_monitor` — timer + conditional edges + human review
- `hourly_tracking` (#2) — validate → fix graph loop

---

## Definition of done

- [ ] `docs/GOALS.md` published and linked from README
- [ ] Team can explain: step judge vs Goal vs graph loops vs human review
- [x] Tests pass for retry + max-limit failure
- [x] At least one Level 2 `success_criteria` example in templates
- [x] Improvement gaps section with prioritized recommendations


