
[INFO][DevOps][Integration] Complete client wiring — real APIs, IT sign-off, hypercare #15

Description

@mishrapravin114

Overview

Type: INFO — Client integration & go-live playbook
Audience: Delivery engineers, client DevOps, client IT/security, CS hypercare
Related: #13 Deployment Case Study · #14 Storage Architecture · engineX-internal #5 Onboarding runbook · engineX-internal #6 Security brief

This ticket is the complete client integration guide for taking a pilot from a deployed EngineX instance to production against the client’s real systems, not mocks. It covers:

  1. Custom agent wiring — real DB/APIs, tools, credentials
  2. Client IT sign-off — VPN, TLS, network, security review
  3. Hypercare — first 2–4 weeks after go-live

Use alongside #13 (install) and #14 (storage).


Ticket metadata

| Field | Value |
|---|---|
| Phase | Phase 1: pilot delivery |
| Priority | P0: required for every signed client |
| Blocks | Design partner go-live, repeatable onboarding |
| Out of scope | Building Engine Cloud (#4), platform multi-tenancy |

1. Integration lifecycle (end-to-end)

```mermaid
flowchart LR
    K["Kickoff<br/>scope + systems"] --> D["Deploy EngineX<br/>#13"]
    D --> W["Wire agent<br/>real DB/APIs"]
    W --> I["IT sign-off<br/>VPN/TLS/firewall"]
    I --> G["Go-live"]
    G --> H["Hypercare<br/>2–4 weeks"]
    H --> S["Steady state<br/>client owns ops"]
```
| Phase | Owner | Duration |
|---|---|---|
| Kickoff + discovery | EngineX + client sponsor | Week 0 |
| Deploy + base config | Client DevOps + EngineX | Week 1 |
| Custom agent wiring | EngineX engineering | Weeks 1–2 |
| IT / security review | Client IT | Weeks 1–2 (parallel) |
| UAT + HITL training | Client ops + EngineX | Week 2 |
| Go-live | Joint | End of week 2–3 |
| Hypercare | EngineX CS + client ops | Weeks 3–6 |

2. Custom agent wiring — real DB/APIs (not mocks)

2.1 What “mock → real” means

Shipped templates often include demo tools that return hardcoded JSON. Production pilots replace or extend these with tools that call the client’s systems.

| Template | Mock today | Production wiring |
|---|---|---|
| hourly_tracking | fetch_broker_transactions(), fetch_investor_logs() return static JSON | Client DB/API/ETL for broker + investor feeds |
| log_monitor | Ready for real Grafana/Slack if env vars are set | `GRAFANA_*`, `SLACK_*`; usually minimal code change |
| agreement_analysis | LLM-only extraction | Document source: S3, SharePoint API, file drop |
| support_triage | JSON --input | CRM/ticket webhook or API poll (Zendesk, Salesforce) |
| deep_research | Demo web search fallback | BRAVE_SEARCH_API_KEY or client-approved search API |

Rule: EngineX platform state stays in ~/.engine/ (#14). The client’s database is reached only through agent tools and is never used as the EngineX platform store.
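Because a real tool replaces its mock inside the same node, it should return the same top-level JSON keys. A minimal sketch of a key-diff check to run against staging data before cutover, assuming each tool can be called as a zero-argument Python function returning a dict; `contract_diff` and the sample keys in the mock are hypothetical, not part of EngineX.

```python
from typing import Any, Callable, Dict, Set

# Assumption: a tool is a zero-argument callable returning a JSON-like dict.
ToolFn = Callable[[], Dict[str, Any]]


def contract_diff(mock: ToolFn, real: ToolFn) -> Dict[str, Set[str]]:
    """Compare the top-level output keys of a mock tool and its real replacement.

    "missing": keys the mock returns but the real tool does not (likely breaks the node).
    "extra": keys only the real tool returns (usually harmless, but worth reviewing).
    """
    mock_keys = set(mock().keys())
    real_keys = set(real().keys())
    return {"missing": mock_keys - real_keys, "extra": real_keys - mock_keys}


# Hypothetical mock output, shaped like a static-JSON demo tool.
def mock_fetch_broker_transactions() -> Dict[str, Any]:
    return {"transactions": [{"id": "t-1", "amount": 100.0}], "window": "1h"}
```

For example, a real implementation that returns only `{"transactions": [...]}` yields `{"missing": {"window"}, "extra": set()}`, which should block cutover until the shapes agree.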

2.2 Integration patterns

```mermaid
flowchart TB
    subgraph ClientVPC["Client VPC"]
        RT["EngineX runtime"]
        Tools["Agent tools.py / MCP"]
        RT --> Tools
    end

    subgraph ClientSystems["Client systems"]
        DB["Postgres / warehouse"]
        API["REST / GraphQL APIs"]
        SaaS["Salesforce / Grafana / Slack"]
        Files["S3 / SFTP / file drop"]
    end

    Tools --> DB
    Tools --> API
    Tools --> SaaS
    Tools --> Files
```
| Pattern | When to use | Example |
|---|---|---|
| Env + API key | Simple REST, webhooks | Grafana, Slack, PagerDuty |
| OAuth (dashboard Connect) | SaaS with user consent | HubSpot, Google Calendar, Zoho |
| Custom @tool in tools.py | Client-specific SQL/HTTP | Hourly tracking → Postgres read |
| MCP server | Reusable connector, stdio/HTTP | Calendar, internal microservice |
| Read-only DB user | Reporting / reconciliation | SELECT on replica, not primary write |
| API gateway in front of DB | Client security preference | EngineX calls internal API, not raw SQL |

2.3 Custom tool implementation checklist

For each data source the agent needs:

  • Discovery: schema, rate limits, auth method, PII fields, retention policy
  • Credential: stored in ~/.engine/credentials/ or client secret manager → env injection
  • Tool contract: input/output JSON shape matches node output_keys
  • Read vs write: prefer read-only for pilots; explicit approval for writes
  • Error handling: timeouts, retries, idempotency documented
  • Network: outbound allowlist from EngineX subnet to endpoint (see Section 3)
  • Validate: ./engine validate <agent> passes with new tools registered
  • Test run: ./engine run <agent> --input '...' against staging data first
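The error-handling item can be made concrete with a small retry wrapper. This is a minimal sketch assuming the tool body raises a hypothetical `TransientError` for retryable failures (timeouts, HTTP 5xx, dropped connections); the attempt count and delays are placeholders to agree against the client's rate limits, and `call_with_retries` is not an EngineX API.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for retryable failures (timeouts, HTTP 5xx, dropped connections)."""


def call_with_retries(fn: Callable[[], T], attempts: int = 3, base_delay: float = 0.5,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn, retrying on TransientError with exponential backoff plus jitter.

    Other exceptions propagate immediately; after the last failed attempt the
    TransientError is re-raised so the run fails visibly instead of silently.
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(attempts):
        try:
            return fn()
        except TransientError:
            if attempt == attempts - 1:
                raise
            delay = base_delay * (2 ** attempt)          # 0.5s, 1s, 2s, ...
            sleep(delay + random.uniform(0, delay / 2))  # jitter avoids synchronized retries
    raise AssertionError("unreachable")
```

Retrying is only safe for idempotent calls, which is one more reason to keep pilot tools read-only; a write needs an idempotency key or should not be retried automatically.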

Code locations:

  • Agent-local tools: examples/templates/<agent>/tools.py (@tool decorator)
  • MCP config: examples/templates/<agent>/mcp_servers.json
  • Credential wizard: ./engine setup-credentials examples/templates/<agent>
  • Integration index: examples/templates/integrations/README.md

2.4 Example: hourly_tracking → real Postgres (illustrative)

Replace the mock fetch tools with read-only SQL against the client’s read replica:

```python
# examples/templates/hourly_tracking/tools.py (client fork)
from typing import Any

# `tool` is the EngineX @tool decorator; import it the same way the template's tools.py does.

@tool(description="Fetch broker transactions from the last hour.")
def fetch_broker_transactions() -> dict[str, Any]:
    # Credentials from env: CLIENT_DB_URL (injected by client secret manager)
    # SELECT ... WHERE created_at > now() - interval '1 hour'
    ...
```

Client provides:

  • Read-only DB user + connection string (or internal HTTP API wrapping the query)
  • VPN/peering if DB is not reachable from EngineX subnet
  • Sample row schema + test dataset for UAT
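A runnable sketch of the last-hour read, using Python's built-in `sqlite3` as a stand-in for the client replica so it runs without a database server. The table and column names (`broker_transactions`, `broker`, `amount`, `created_at`) are hypothetical until the client supplies the schema, and timestamps are assumed to be stored as UTC ISO-8601 strings. Against Postgres, a DB-API driver uses its own placeholder style (`%s` rather than `?`), and the connection would come from the read-only user behind `CLIENT_DB_URL`.

```python
import sqlite3
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, Optional

# Hypothetical schema; replace with the client's real table and columns.
LAST_HOUR_SQL = (
    "SELECT id, broker, amount, created_at FROM broker_transactions "
    "WHERE created_at > ? ORDER BY created_at"
)


def fetch_last_hour(conn: sqlite3.Connection, now: Optional[datetime] = None) -> Dict[str, Any]:
    """Return rows created in the hour before `now` (UTC), as JSON-friendly dicts."""
    now = now or datetime.now(timezone.utc)
    cutoff = (now - timedelta(hours=1)).isoformat()
    cur = conn.cursor()
    try:
        # Parameterized query: the cutoff is bound, never string-formatted into SQL.
        cur.execute(LAST_HOUR_SQL, (cutoff,))
        columns = [d[0] for d in cur.description]
        rows = [dict(zip(columns, row)) for row in cur.fetchall()]
    finally:
        cur.close()
    return {"since": cutoff, "transactions": rows}
```

Passing `now` explicitly makes the window reproducible in UAT against the client's sample dataset.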

2.5 Example: log_monitor (minimal wiring)

Often no custom code — configure env only:

```bash
GRAFANA_URL=https://grafana.client.internal
GRAFANA_API_TOKEN=...
GRAFANA_DATASOURCE_UID=...
SLACK_WEBHOOK_URL=...
```

Deploy with systemd: examples/templates/log_monitor/deploy/engine-log-monitor.service
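Because log_monitor is configured purely through environment variables, a pre-deploy check that they are all present catches the most common wiring gap. A minimal sketch, assuming all four variables from the example above are required for this deployment; `missing_env` is a hypothetical helper, not an EngineX command.

```python
import os
from typing import List, Mapping, Optional, Sequence

# Variable names from the log_monitor env example.
LOG_MONITOR_ENV = (
    "GRAFANA_URL",
    "GRAFANA_API_TOKEN",
    "GRAFANA_DATASOURCE_UID",
    "SLACK_WEBHOOK_URL",
)


def missing_env(required: Sequence[str], env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the required variables that are unset or blank, in the order given."""
    source = os.environ if env is None else env
    return [name for name in required if not source.get(name, "").strip()]
```

Run it with the same environment the service will receive, and fail the deploy if the returned list is non-empty.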

2.6 Per-integration credential matrix

| Integration | Client secret | How EngineX loads it |
|---|---|---|
| LLM (Anthropic/OpenAI) | API key | .env or ~/.engine/credentials |
| Grafana | API token | Agent env file |
| Slack | Webhook or bot token | Env / credentials store |
| HubSpot / Zoho | OAuth client ID + secret | Dashboard Connect + encrypted vault |
| Google Calendar | OAuth | Dashboard Connect + MCP |
| Client Postgres | Connection string | Secret manager → env (never in git) |
| Internal REST API | Bearer / mTLS cert | Env or mounted cert volume |

3. Client IT sign-off — VPN, TLS, network

3.1 Architecture with security boundary

```mermaid
flowchart TB
    subgraph Internet["Internet / corporate network"]
        Users["Analysts / approvers"]
    end

    subgraph Edge["Client edge"]
        VPN["VPN / Zero Trust"]
        Proxy["Reverse proxy<br/>TLS + SSO"]
    end

    subgraph Private["Private subnet"]
        Eng["EngineX<br/>:8787 internal only"]
        Vol["~/.engine volume"]
        Eng --- Vol
    end

    subgraph Data["Client data plane"]
        DB["DB / APIs"]
        SaaS["External SaaS"]
    end

    Users --> VPN --> Proxy --> Eng
    Eng --> DB
    Eng --> SaaS
```

Principle: :8787 is never exposed directly to the public internet in production.

3.2 IT security review checklist (client IT)

Provide this to client security team for sign-off:

Network

  • EngineX runs in private subnet (no public IP on runtime, or restricted SG)
  • Inbound: only from VPN / Zero Trust / corporate IP range to reverse proxy
  • Outbound: allowlist to LLM APIs + agreed integration endpoints
  • No inbound rules needed from client data systems to EngineX: EngineX initiates all connections (for example, to the client DB) outbound only
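The outbound allowlist is enforced by security groups or firewall rules; it can also help to check configured integration URLs against the agreed list during review, so a typo or unapproved endpoint is caught before IT sees blocked traffic. A sketch under that assumption: the hosts in `OUTBOUND_ALLOWLIST` are placeholders (the two LLM hosts are the public Anthropic and OpenAI API endpoints), `is_allowed` is a hypothetical helper, and the final list is whatever client IT approves.

```python
from typing import AbstractSet
from urllib.parse import urlsplit

# Placeholder allowlist: replace with the endpoints client IT actually approves.
OUTBOUND_ALLOWLIST = frozenset({
    "api.anthropic.com",
    "api.openai.com",
    "grafana.client.internal",
})


def is_allowed(url: str, allowlist: AbstractSet[str] = OUTBOUND_ALLOWLIST) -> bool:
    """True only if the URL's host exactly matches an allowlisted host (no wildcards)."""
    host = urlsplit(url).hostname  # lowercased by urllib; None for malformed/relative URLs
    return host is not None and host in allowlist
```

Exact host matching is deliberately strict: `https://api.openai.com.attacker.example/` does not match `api.openai.com`.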

Dashboard access (./engine serve)

  • TLS terminated at reverse proxy (nginx, ALB, Cloudflare, etc.)
  • SSO / SAML / OIDC in front of dashboard (client IdP) or VPN-only access
  • Session cookies / headers per client security policy
  • Optional: IP allowlist for :8787 backend

Example reverse proxy (nginx sketch)

```nginx
server {
    listen 443 ssl;
    server_name enginex.client.internal;

    ssl_certificate     /etc/ssl/certs/client.crt;
    ssl_certificate_key /etc/ssl/private/client.key;

    location / {
        proxy_pass http://127.0.0.1:8787;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}
```

EngineX bind (internal only):

```bash
./engine serve --host 127.0.0.1 --port 8787
# or --host 0.0.0.0 only if SG restricts to proxy subnet
```
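To make the bind rule checkable in a deploy script, the `--host` value can be classified with Python's standard `ipaddress` module. A sketch only: `bind_class` is a hypothetical helper, and it accepts IP literals (a hostname such as `localhost` raises `ValueError`).

```python
import ipaddress


def bind_class(host: str) -> str:
    """Classify a --host value for ./engine serve (IP literals only)."""
    addr = ipaddress.ip_address(host)
    if addr.is_loopback:
        return "loopback"        # reachable only from the same machine, e.g. a local nginx
    if addr.is_unspecified:
        return "all-interfaces"  # 0.0.0.0 or ::; acceptable only behind a proxy-only security group
    if addr.is_private:
        return "private"         # private subnet address; still restrict :8787 to the proxy
    return "public"              # should fail review: :8787 must not face the internet
```

The unspecified check runs before the private check on purpose, because Python also counts 0.0.0.0 as a private address.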

Secrets & data

  • Secrets from client secret manager (AWS SM, GCP SM, Vault) — not in image/git
  • ENGINE_CREDENTIAL_KEY set; ~/.engine on encrypted volume (#14)
  • Backup/snapshot policy for ~/.engine documented
  • LLM data handling reviewed (prompts may contain client data — DPA with LLM vendor)
  • No platform database — client data in their VPC on their disk

Access & audit

  • Named HITL approvers with dashboard access
  • Run history / ops console retention acceptable to client
  • Incident contact list (EngineX + client ops)

Sign-off artifact

  • Client IT written approval (email or security ticket closure) before production go-live
  • Reference engineX-internal #6 security brief when available

3.3 Headless-only clients (no dashboard)

If pilot is headless only (./engine run --daemon):

  • No :8787 exposure required
  • IT review focuses on outbound network + secrets + volume encryption
  • HITL may be terminal-based or deferred — document explicitly

4. Hypercare — first 2–4 weeks after go-live

4.1 Hypercare model

```mermaid
gantt
    title Pilot hypercare timeline
    dateFormat YYYY-MM-DD
    section Go-live
    Production cutover :milestone, m1, 2026-01-15, 0d
    section Hypercare
    Daily check-ins (week 1) :a1, 2026-01-15, 7d
    Every-other-day (week 2) :a2, after a1, 7d
    Weekly (weeks 3–4) :a3, after a2, 14d
    section Steady state
    Client-owned ops :milestone, m2, after a3, 0d
```
| Week | EngineX involvement | Client involvement |
|---|---|---|
| 1 | Daily 15-min standup; monitor runs/alerts; hotfix tools | Approvers available; report issues same day |
| 2 | Every-other-day check-in; tune prompts/thresholds | Ops owns dashboard; IT on call for network |
| 3–4 | Weekly review; document runbook handoff | Client primary on-call |
| After | SLA per contract; escalation path only | Full ownership |

4.2 Hypercare daily checklist (EngineX)

  • Agent runs completed successfully (daemon/cron logs clean)
  • No stuck HITL sessions blocking pipeline
  • Integration errors (API 4xx/5xx, DB timeouts) triaged
  • Checkpoint/resume tested if client uses dashboard
  • ~/.engine disk usage within limits
  • LLM cost / rate limits normal
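The disk-usage item can be scripted with the standard library. A sketch assuming an 80% alert threshold (a placeholder to agree with client ops) and that ~/.engine sits on its own volume; `engine_disk_check` is a hypothetical helper, and `shutil.disk_usage` reports the whole filesystem containing the path, not the size of the directory itself.

```python
import shutil
from pathlib import Path
from typing import Optional, Tuple


def engine_disk_check(path: Optional[Path] = None,
                      max_used_pct: float = 80.0) -> Tuple[bool, float]:
    """Return (within_limit, used_percent) for the filesystem that holds `path`.

    Defaults to ~/.engine; the 80% threshold is a placeholder.
    """
    target = path if path is not None else Path.home() / ".engine"
    usage = shutil.disk_usage(target)  # raises FileNotFoundError if target is missing
    used_pct = 100.0 * usage.used / usage.total
    return used_pct <= max_used_pct, round(used_pct, 1)
```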

4.3 Hypercare daily checklist (client)

  • HITL queue reviewed (approvers acted on pending items)
  • Business outcome spot-check (e.g. reconciliations correct, alerts meaningful)
  • Escalate to EngineX with logs / session ID if anomaly

4.4 Common hypercare issues

| Symptom | Likely cause | Fix |
|---|---|---|
| Agent can’t reach DB | SG / VPN / wrong connection string | IT opens path; rotate creds |
| OAuth Connect fails | Redirect URI / client ID mismatch | Fix OAuth app config |
| HITL never resumes | Approver not trained / pause node stuck | Training; dashboard inject |
| Mock data still appearing | Old tools.py not deployed | Redeploy client agent fork |
| Dashboard 502 | Proxy misconfig / serve not running | Check nginx → :8787 |
| LLM errors | Key expired / model access | Rotate key; check quota |
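For the first row, a plain TCP connect from the EngineX host separates a network problem (security group, VPN, routing) from a credential problem: if the port does not open, rotating credentials will not help. A sketch using only the standard library; `can_reach` is a hypothetical helper.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        # create_connection resolves the name and tries each address it gets back.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers DNS failure, refusal, timeout, unreachable network
        return False
```

For example, `can_reach("db-replica.client.internal", 5432)` (placeholder host; 5432 is the Postgres default port) returning `False` points to the network path, while `True` followed by an authentication error points to the connection string or credentials.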

4.5 Hypercare exit criteria (handoff to steady state)

  • 14 days minimum hypercare completed (extend to 28 for finance clients)
  • 7 consecutive days without P1 incident
  • Client ops can: restart service, read logs, approve HITL, rotate env secrets
  • Runbook delivered (deploy, rollback, contacts, escalation)
  • Success metrics vs kickoff goals documented (see Section 5)
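The two time-based exit criteria are easy to get wrong by a day, so here is a sketch of the arithmetic. It assumes "14 days completed" means at least 14 calendar days since go-live (28 for finance clients) and "7 consecutive days without P1" means no P1 incident in the 7 calendar days ending today; `hypercare_exit_ready` is hypothetical, and the remaining criteria (ops skills, runbook, metrics) still need human sign-off.

```python
from datetime import date, timedelta
from typing import Iterable


def hypercare_exit_ready(go_live: date, today: date, p1_dates: Iterable[date],
                         finance_client: bool = False) -> bool:
    """Check only the time-based exit criteria; the other items need human sign-off."""
    min_days = 28 if finance_client else 14
    if (today - go_live).days < min_days:
        return False
    # The 7 calendar days ending today, inclusive: today-6 .. today.
    window_start = today - timedelta(days=6)
    return not any(window_start <= d <= today for d in p1_dates)
```

With the example go-live of 2026-01-15 from the timeline in Section 4.1, `hypercare_exit_ready(date(2026, 1, 15), date(2026, 1, 29), [])` is `True` for a standard client and `False` for a finance client.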

5. Kickoff discovery worksheet (complete before wiring)

Collect these answers in the kickoff call; custom integration work is blocked until they are complete:

| # | Question | Answer (client) |
|---|---|---|
| 1 | Primary agent / workflow? | |
| 2 | Success metric at 30 days? | |
| 3 | Data sources (DB name, API docs, owner)? | |
| 4 | Read-only or read-write? | |
| 5 | Staging environment available? | |
| 6 | HITL approvers (names, roles)? | |
| 7 | Headless, dashboard, or hybrid? | |
| 8 | How analysts reach dashboard (VPN/SSO URL)? | |
| 9 | LLM provider allowed (Anthropic/OpenAI/Ollama in VPC)? | |
| 10 | IT security contact for sign-off? | |
| 11 | Hypercare window (2 vs 4 weeks)? | |
| 12 | Go-live date target? | |

6. Go-live checklist (joint)

Pre-go-live

  • ./engine validate <agent> passes in staging
  • Real integration tested against staging data (not mocks)
  • Production secrets injected (not staging keys)
  • ~/.engine volume mounted + backup verified
  • IT sign-off complete (Section 3)
  • HITL approvers trained on dashboard
  • Rollback plan documented (stop systemd/cron; restore volume snapshot)

Go-live day

  • Enable production schedule / daemon
  • EngineX + client on shared bridge call
  • First production run monitored end-to-end
  • HITL test case executed (if applicable)

Post-go-live

  • Hypercare schedule confirmed (Section 4)
  • Incident channel live (Slack/Teams/email)

7. Deliverables (this ticket)

Documentation

  • This issue = canonical integration reference on GitHub
  • Link from #13 Section 12 Support / Handoff
  • Update engineX-internal #5 — point hypercare + wiring here
  • Kickoff worksheet (Section 5) as copy-paste template for CS

Per-client (each pilot)

  • Completed discovery worksheet
  • Custom tools/MCP deployed (mock → real)
  • IT sign-off record
  • Hypercare log (daily notes weeks 1–4)
  • Handoff runbook at exit

Engineering (as needed per vertical)

  • Client agent fork or export with real tools.py
  • Staging + prod env file templates (no secrets in repo)

8. Definition of done

Ticket done (playbook published):

  • Team can execute mock → real wiring using Section 2
  • IT checklist (Section 3) sendable to client security unchanged
  • Hypercare schedule (Section 4) used on every pilot

Per-client integration done:

  • Real systems connected; no mock tools in production path
  • IT written sign-off
  • 2–4 weeks hypercare completed with exit criteria met
  • Client ops owns steady-state runbook

9. Related links


Version: 2026-06-28 — Complete client integration, IT sign-off, and hypercare


Labels: documentation, phase-1, priority-p0, type-docs
