Add offsite export archives and restore evidence for production data safety #58

Description

@alexeygrigorev

Status: in progress
Tags: enhancement, backend, work-engine, infra, data, testing, docs, P0
Depends on: #48 (closed)
Blocks: None

Scope

Implement the production-ready offsite archive lane for DataOps V1 execution
data.

#48 already delivered the local portable export, export validation, dry-run
import, scheduled local export route, and restore-drill documentation. This
issue extends that foundation so production execution data can be exported to
durable offsite storage and later proven restorable without relying on files
stored locally in the Lambda runtime.

Affected areas:

  • work-engine/ TypeScript export and cron/admin export path.
  • lambda-functions/template.full.yaml and related deployment configuration
    for SAM-owned export storage, environment variables, and least-privilege IAM.
  • Restore/export documentation in docs/v1-execution-data-safety.md and
    docs/restore-drill.md.
  • Tests for archive writing, S3/offsite storage behavior, restore evidence, and
    production safety gates.

The implementation must stay aligned with the current V1 runtime architecture:
the public Python portal remains the only public entry point, WorkEngineFunction
stays private, runtime execution state stays in SAM-owned DynamoDB tables, and
portable exports remain application-level JSONL snapshots independent of
DynamoDB PK/SK internals.
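
A minimal sketch of what "application-level JSONL independent of PK/SK internals" could look like in the work-engine. The names `StoredItem`, `toPortableRecord`, and `toJsonl` are hypothetical and not taken from the existing export code; the sketch only assumes that table items carry `PK` and `SK` attributes that should not appear in the portable files.

```typescript
// Hypothetical shape of an item as read from a runtime table.
type StoredItem = Record<string, unknown>;

// Storage-only attributes assumed for this sketch; the real tables may use others.
const STORAGE_ONLY_KEYS = new Set(["PK", "SK"]);

// Drop storage-only attributes so the exported record does not depend on table layout.
function toPortableRecord(item: StoredItem): StoredItem {
  const portable: StoredItem = {};
  for (const [key, value] of Object.entries(item)) {
    if (!STORAGE_ONLY_KEYS.has(key)) portable[key] = value;
  }
  return portable;
}

// Serialize records as JSONL: one JSON object per line, each line newline-terminated.
function toJsonl(items: StoredItem[]): string {
  return items.map((item) => JSON.stringify(toPortableRecord(item)) + "\n").join("");
}

const jsonl = toJsonl([
  { PK: "TASK#1", SK: "META", id: "1", title: "Publish newsletter" },
  { PK: "TASK#2", SK: "META", id: "2", title: "Review backlog" },
]);
console.log(jsonl);
```

Keeping the key stripping in one function means a later change to the table key design would not change the export format.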

Acceptance Criteria

  • SAM/CloudFormation declares the offsite export storage needed for V1, or
    wires to an explicitly named existing bucket through parameters, with
    server-side encryption, public access blocked, versioning enabled,
    production retain/deletion safety, lifecycle or retention rules, and tags
    suitable for backup selection.
  • WorkEngineFunction receives export archive configuration through stack
    parameters/environment variables and has least-privilege IAM for only the
    required export archive prefix/actions. No production bucket, credential,
    account ID, or secret is hardcoded.
  • The scheduled/admin export path can write a timestamped portable export
    archive to offsite storage. The archive contains manifest.json and the
    current portable JSONL entity files, including artifacts.jsonl,
    assistant_jobs.jsonl, and audit_events.jsonl when those entities are
    emitted by the current export implementation. Entity omission must remain
    explicit in the manifest when an entity is not implemented.
  • Archive object keys include environment and generation time, are stable
    enough for retention/audit review, and avoid leaking private data in the
    key name. The route/command response includes archive URI/key,
    generated_at, schema/export format version, entity counts, and checksum
    summary, but does not return secrets, signed URLs, session tokens, or
    private credentials.
  • Export archives preserve the portable export safety rules from
    docs/v1-execution-data-safety.md: no password hashes, live sessions,
    API keys, OAuth tokens, cookies, signed temporary URLs, private
    credentials, raw binary payloads, or DynamoDB-only key dependency in
    normal exports.
  • The implementation writes restore evidence for a non-production drill:
    source archive URI/key, app git SHA, export generated_at, manifest
    checksum summary, validation result, dry-run import counts, skipped/invalid
    record counts, target environment, timestamp, and smoke-check checklist
    result. Test and local drill artifacts must live under project-local
    .tmp/exports/ paths.
  • A restore/drill command or documented workflow can fetch or read an
    archive, extract the portable export, run
    validate:export, run dry-run:import, and produce the restore evidence
    report without writing production data.
  • Production restore/import/write behavior is human-gated: automated cron,
    admin export, validation, and dry-run paths must not mutate production
    DynamoDB tables; any production restore/write action requires an explicit
    human-run command or documented manual approval step before it can write.
  • The existing local export path remains usable:
    npm --prefix work-engine run export:data -- <export-dir>,
    validate:export, and dry-run:import still work for local/test archives.
  • Documentation explains the offsite archive location/prefix, retention
    expectations, restore evidence format, safe local .tmp/exports/ drill
    path, and the human gate for production restore/write operations.
  • [HUMAN] A production operator verifies in AWS that the deployed export
    archive bucket/prefix is encrypted, versioned, private, retained/lifecycled
    as specified, and receives at least one scheduled/admin export archive.
  • [HUMAN] Before production execution data is treated as critical, a human
    runs the documented restore drill against a staging or isolated target and
    attaches the restore evidence report to the issue.
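
The archive key criterion above can be sketched as a small pure function. Everything here is an assumption for illustration: the `buildArchiveKey` name, the `<prefix>/<environment>/<YYYY>/<MM>/<DD>/export-<timestamp>` layout, and the `.tar.gz` extension are not specified by this issue. The point is only that the key is derived from the environment and generation time and from nothing else.

```typescript
interface ArchiveKeyInput {
  prefix: string;       // assumed to come from a stack parameter or environment variable
  environment: string;  // e.g. "staging"
  generatedAt: Date;
}

// Build a key from environment and generation time only, so no user or task data
// can leak into the object name.
function buildArchiveKey({ prefix, environment, generatedAt }: ArchiveKeyInput): string {
  if (!/^[a-z0-9-]+$/.test(environment)) {
    throw new Error(`unexpected environment name: ${environment}`);
  }
  // toISOString() is always UTC, e.g. 2025-01-31T04:05:06.000Z
  const iso = generatedAt.toISOString();
  const day = iso.slice(0, 10).replace(/-/g, "/");                   // 2025/01/31
  const stamp = iso.replace(/[-:]/g, "").replace(/\.\d{3}Z$/, "Z"); // 20250131T040506Z
  const cleanPrefix = prefix.replace(/^\/+|\/+$/g, "");              // trim slashes
  return [cleanPrefix, environment, day, `export-${stamp}.tar.gz`]
    .filter((part) => part !== "")
    .join("/");
}

const key = buildArchiveKey({
  prefix: "exports/",
  environment: "staging",
  generatedAt: new Date("2025-01-31T04:05:06Z"),
});
console.log(key); // exports/staging/2025/01/31/export-20250131T040506Z.tar.gz
```

A date-partitioned key sorts chronologically within an environment, which should make retention and audit review easier, and an IAM policy can then be scoped to the configured prefix.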

Test Scenarios

Scenario: scheduled export uploads an offsite archive

Given: the work-engine has export archive configuration and a mocked or local
S3-compatible storage client.
When: the scheduled/admin export path runs.
Then: it writes one timestamped archive under the configured environment/prefix,
returns archive metadata and manifest summary, and does not expose secrets or
signed URLs.
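
One way to make this scenario testable without AWS is to put the storage client behind a narrow interface and substitute an in-memory fake. The sketch below is an assumption throughout: `ArchiveStore`, `InMemoryArchiveStore`, `runArchiveExport`, and the summary field names are hypothetical, and the archive bytes and checksum summary are taken as inputs rather than produced here.

```typescript
// Narrow storage interface; a real implementation would wrap an S3 client.
interface ArchiveStore {
  put(key: string, body: Uint8Array): Promise<void>;
}

// In-memory fake for tests.
class InMemoryArchiveStore implements ArchiveStore {
  readonly objects = new Map<string, Uint8Array>();
  async put(key: string, body: Uint8Array): Promise<void> {
    this.objects.set(key, body);
  }
}

// Metadata returned to the caller: no secrets, no signed URLs.
interface ExportSummary {
  archiveKey: string;
  generatedAt: string;
  formatVersion: string;
  entityCounts: Record<string, number>;
  checksumSummary: string;
}

async function runArchiveExport(
  store: ArchiveStore,
  config: { prefix: string; environment: string },
  archive: { body: Uint8Array; entityCounts: Record<string, number>; checksumSummary: string },
  now: Date,
): Promise<ExportSummary> {
  const generatedAt = now.toISOString();
  const stamp = generatedAt.replace(/[-:.]/g, ""); // e.g. 20250131T040506000Z
  const archiveKey = `${config.prefix}/${config.environment}/export-${stamp}.tar.gz`;
  await store.put(archiveKey, archive.body);
  return {
    archiveKey,
    generatedAt,
    formatVersion: "1", // placeholder value for the sketch
    entityCounts: archive.entityCounts,
    checksumSummary: archive.checksumSummary,
  };
}

const store = new InMemoryArchiveStore();
const pending = runArchiveExport(
  store,
  { prefix: "exports", environment: "staging" },
  { body: new Uint8Array([1, 2, 3]), entityCounts: { tasks: 2 }, checksumSummary: "sha256:<placeholder>" },
  new Date("2025-01-31T04:05:06Z"),
);
pending.then((summary) => console.log(summary.archiveKey));
```

A focused test can then assert that exactly one object was written under the configured prefix and that the returned summary contains only the listed metadata fields.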

Scenario: archive contents stay portable and redacted

Given: exported users, tasks, bundles, templates, recurring configs, files,
notifications, artifacts, assistant jobs, and audit events with realistic
relationships and sensitive fields.
When: the archive is extracted and validated.
Then: all emitted entity files pass validate:export, relationship checks pass,
normal exports exclude sensitive fields, and any omitted entity type is listed
explicitly in the manifest.
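
The redaction half of this scenario could be checked with a recursive scan over each JSONL line. This is a hedged sketch: `FORBIDDEN_FIELD_PATTERN`, `findForbiddenFields`, and `scanJsonl` are hypothetical names, and the deny-list is a coarse guess at field names. The authoritative rules are the ones in docs/v1-execution-data-safety.md, and a name-based pattern like this can produce false positives (for example a legitimate field whose name contains "token").

```typescript
// Hypothetical deny-list of field-name fragments; tune against the real schema.
const FORBIDDEN_FIELD_PATTERN =
  /(password|session|api[_-]?key|token|cookie|secret|credential|signed[_-]?url)/i;

// Walk a parsed record and collect the paths of any field whose name looks sensitive.
function findForbiddenFields(value: unknown, path = "", found: string[] = []): string[] {
  if (Array.isArray(value)) {
    value.forEach((item, i) => {
      findForbiddenFields(item, `${path}[${i}]`, found);
    });
  } else if (value !== null && typeof value === "object") {
    for (const [key, child] of Object.entries(value as Record<string, unknown>)) {
      const childPath = path ? `${path}.${key}` : key;
      if (FORBIDDEN_FIELD_PATTERN.test(key)) found.push(childPath);
      findForbiddenFields(child, childPath, found);
    }
  }
  return found;
}

// Scan a JSONL document and report offending paths with 1-based line numbers.
function scanJsonl(jsonl: string): string[] {
  const problems: string[] = [];
  jsonl.split("\n").forEach((line, index) => {
    if (line.trim() === "") return; // ignore blank lines and the trailing newline
    for (const fieldPath of findForbiddenFields(JSON.parse(line))) {
      problems.push(`line ${index + 1}: ${fieldPath}`);
    }
  });
  return problems;
}

const problems = scanJsonl(
  '{"id":"u1","email":"a@example.com"}\n' +
  '{"id":"u2","profile":{"password_hash":"x"}}\n',
);
console.log(problems); // [ 'line 2: profile.password_hash' ]
```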

Scenario: restore evidence is generated without production writes

Given: a previously written archive.
When: the restore drill workflow validates the archive and runs dry-run import.
Then: it writes a restore evidence report with validation status, dry-run
would-write counts, skipped/invalid counts, checksum summary, target environment,
and smoke-check checklist, while making no writes to production DynamoDB tables.
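
The restore evidence report can be modelled as a plain typed object that is serialized to JSON under `.tmp/exports/`. The sketch below is illustrative only: the `RestoreEvidence` field names, `buildRestoreEvidence`, `evidenceReportPath`, and the set of names treated as production are all assumptions, and the values in the usage example are placeholders rather than real drill output.

```typescript
interface DryRunCounts {
  wouldWrite: Record<string, number>; // per-entity would-write counts
  skipped: number;
  invalid: number;
}

interface RestoreEvidence {
  sourceArchiveKey: string;
  appGitSha: string;
  exportGeneratedAt: string;
  manifestChecksumSummary: string;
  validationResult: "passed" | "failed";
  dryRun: DryRunCounts;
  targetEnvironment: string;
  timestamp: string;
  smokeChecks: Record<string, boolean>;
}

// Assumed names for the production environment; adjust to the real stack naming.
const PRODUCTION_NAMES = new Set(["prod", "production"]);

// Assemble the report; refuses production targets because drills are non-production only.
function buildRestoreEvidence(input: Omit<RestoreEvidence, "timestamp">, now: Date): RestoreEvidence {
  if (PRODUCTION_NAMES.has(input.targetEnvironment.toLowerCase())) {
    throw new Error("restore drills must target a non-production environment");
  }
  return { ...input, timestamp: now.toISOString() };
}

// Project-local report path under .tmp/exports/.
function evidenceReportPath(now: Date): string {
  const stamp = now.toISOString().replace(/[-:.]/g, "");
  return `.tmp/exports/restore-evidence-${stamp}.json`;
}

const drillTime = new Date("2025-01-31T04:05:06Z");
const evidence = buildRestoreEvidence(
  {
    sourceArchiveKey: "exports/staging/export-20250131T040506000Z.tar.gz", // placeholder
    appGitSha: "<git-sha>",                                                // placeholder
    exportGeneratedAt: "2025-01-31T04:05:06.000Z",
    manifestChecksumSummary: "sha256:<placeholder>",
    validationResult: "passed",
    dryRun: { wouldWrite: { tasks: 2 }, skipped: 0, invalid: 0 },
    targetEnvironment: "staging",
    smokeChecks: { "portal loads": true },
  },
  drillTime,
);
console.log(evidenceReportPath(drillTime));
console.log(JSON.stringify(evidence, null, 2));
```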

Scenario: production restore remains human-gated

Given: production archive configuration is present.
When: cron/admin export, validation, or dry-run restore code runs automatically.
Then: the code cannot restore, import, overwrite, or delete production execution
records unless a separate explicit human-approved restore/write path is invoked.
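
A possible shape for the human gate is a single guard that every mutating path must call before writing. This is a sketch under assumptions, not the project's design: the `Operation` and `Invoker` values, `assertWriteAllowed`, and the typed confirmation phrase are all hypothetical. It also gates only the write path; the export, validation, and dry-run paths still have to be written so they never call the write code at all.

```typescript
type Operation = "export" | "validate" | "dry-run-import" | "restore-write";
type Invoker = "cron" | "admin-route" | "human-cli";

interface GateInput {
  operation: Operation;
  invoker: Invoker;
  targetEnvironment: string;
  confirmation?: string; // typed by the operator, e.g. "restore production"
}

// Operations that never mutate execution data and so need no approval.
const READ_ONLY_OPERATIONS: ReadonlySet<Operation> = new Set<Operation>([
  "export",
  "validate",
  "dry-run-import",
]);

// Assumed names for the production environment.
const PRODUCTION_ENVIRONMENTS = new Set(["prod", "production"]);

// Throws unless the requested operation is allowed for this invoker and target.
function assertWriteAllowed(input: GateInput): void {
  if (READ_ONLY_OPERATIONS.has(input.operation)) return;

  // Any write must come from an explicit human-run command, never cron or the admin route.
  if (input.invoker !== "human-cli") {
    throw new Error(`${input.operation} is not allowed from ${input.invoker}`);
  }

  // Production writes additionally require a typed confirmation naming the environment.
  const isProduction = PRODUCTION_ENVIRONMENTS.has(input.targetEnvironment.toLowerCase());
  if (isProduction && input.confirmation !== `restore ${input.targetEnvironment}`) {
    throw new Error("production restore requires explicit human confirmation");
  }
}

// Allowed: scheduled export never writes execution data.
assertWriteAllowed({ operation: "export", invoker: "cron", targetEnvironment: "production" });

// Allowed: human-run restore with matching confirmation.
assertWriteAllowed({
  operation: "restore-write",
  invoker: "human-cli",
  targetEnvironment: "production",
  confirmation: "restore production",
});
```

A no-write safety test could pair this guard with a fake table client that fails the test on any write call while cron, admin export, validation, and dry-run code runs.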

Required Verification

  • npm --prefix work-engine test

  • npm --prefix work-engine run typecheck

  • npm --prefix work-engine run build

  • Focused tests for archive upload, archive extraction/validation, restore
    evidence generation, and production no-write safety.

  • sam validate --template-file lambda-functions/template.full.yaml --lint
    or the repository's equivalent SAM validation command for the touched
    template.

  • Docs link validation if restore/data-safety docs are edited:

    uv run --project lambda-functions --extra search python -m lambda_functions.validate_docs_links \
      --repo-root . \
      --content-root content

  • uv run --project lambda-functions --extra search --with pytest python -m pytest tests/docs_app
    if Python portal broker, deployment template assumptions, or docs-app behavior
    are touched.

  • Attach or summarize the local restore evidence report path under
    .tmp/exports/ in the Tester handoff.

Out of Scope

  • Full Postgres or other target-database import tooling.
  • Destructive production restore, table replacement, table rename, or production
    data repair automation.
  • Binary backup/storage for uploaded files or generated artifacts beyond the
    portable export metadata archive. S3/object-storage backup for artifact
    binaries remains a separate production storage issue.
  • UI dashboards for browsing export history, unless a minimal status endpoint is
    needed for verification.
  • Changes to source repositories outside DataTalksClub/dataops, including
    ../dtc-operations, ../datatasks, and ../podcast-assistant.

Labels

  • P0: Must have
  • backend: Backend/API
  • data: Data model, migration, storage
  • docs: Documentation or process docs work
  • enhancement: New or improved functionality
  • human: Code done or issue blocked on human verification
  • infra: Deployment and infrastructure
  • testing: Tests and QA
  • work-engine: DataTasks task execution engine
