Define artifact model and storage policy #29

Description

@alexeygrigorev

Implement V1 artifact model and storage policy

Status: pending
Tags: enhancement, portal, work-engine, assistant, frontend, backend, infra, data, testing, P0
Depends on: None
Blocks: #9, #30, future artifact search/export/migration work

Scope

Implement the V1 artifact contract that lets workflow/task proof, assistant outputs, files, and external links live in one coherent model without storing binaries or secrets in DynamoDB or Git.

This issue should make artifacts usable by the work-engine and visible enough in the V1 operator flow for later Podcast and assistant slices. It is broader than the existing FileRecord upload metadata and narrower than the full assistant job lifecycle from #30.

The implementation should cover:

  • A first-class artifact metadata entity for generated or operational outputs such as podcast prep documents, transcripts, reports, event pages, public URLs, draft assistant outputs, and reviewed files.
  • The boundary between task proof fields, bundle links, file metadata, artifact records, assistant output records, and canonical Git-backed process docs.
  • Storage policy for s3://, Dropbox/Google Drive/external URLs, Git/GitHub content, and local-dev-only file paths.
  • Metadata-only DynamoDB storage for artifacts and files: stable IDs, storage URI, provider, checksum/size where available, review status, relationship IDs, and timestamps. DynamoDB must not store binary payloads, large text outputs, signed URLs, secrets, or assistant raw logs.
  • Work-engine API support to create, update, list, attach, approve/reject/archive, and export artifact metadata through the existing authenticated /work/api/* surface.
  • Task completion/proof behavior for proofRequirement.type = artifact, requiresFile, requiredLinkName, and assistant-generated outputs.
  • Portable export and validation updates so artifacts.jsonl becomes an exported entity once artifact records exist, while existing files.jsonl stays migration-safe.
  • V1 UI implications: the operator can see task/bundle artifacts, link or register an external artifact, understand review status, and see why proof-gated completion is blocked.

Use the current V1 runtime boundary from docs/v1-runtime-architecture.md: the Python portal remains the public entry point, work-engine stays private behind /work/api/*, runtime metadata lives in DynamoDB, Git stores reviewed process knowledge, and private/bulky binaries live in S3 or existing private external systems.

Required Model

Add or document the runtime artifact shape in TypeScript and export form. Runtime fields may use camelCase; portable exports must use snake_case.

Minimum artifact fields:

  • artifactId / artifact_id
  • type: documented string such as podcast-doc, transcript, recording, report, invoice, event-page, assistant-output, external-link, or other
  • title
  • description optional
  • status: draft, needs-review, approved, rejected, archived, or superseded
  • storageProvider: s3, dropbox, google-drive, github, external-url, local-dev, or unknown
  • storageUri / storage_uri
  • filename optional
  • contentType / content_type optional
  • checksum optional, required when DataOps owns the binary and can compute it
  • sizeBytes / size_bytes optional
  • visibility or dataClass: at least distinguish public, internal, private, and sensitive
  • taskId, bundleId, assistantJobId, and fileId optional relationship IDs
  • sourceType: manual-link, manual-upload, assistant-output, import, migration, or system
  • createdBy, reviewedBy, createdAt, updatedAt, reviewedAt where available
  • tags optional
  • small structured metadata optional, with redaction rules applied
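
The field list above could be sketched as a TypeScript type plus a small helper that produces the snake_case export form. This is a minimal sketch, not the work-engine's actual code: the names `ArtifactRecord`, `toSnakeCase`, and `toExportRecord` are hypothetical, `dataClass` is picked over `visibility` arbitrarily, and timestamps are assumed to be ISO-8601 strings.

```typescript
// Hypothetical runtime shape. Field names follow the list above; the type
// names and the required/optional split are assumptions for illustration.
type ArtifactStatus =
  | "draft" | "needs-review" | "approved" | "rejected" | "archived" | "superseded";
type StorageProvider =
  | "s3" | "dropbox" | "google-drive" | "github" | "external-url" | "local-dev" | "unknown";
type DataClass = "public" | "internal" | "private" | "sensitive";
type SourceType =
  | "manual-link" | "manual-upload" | "assistant-output" | "import" | "migration" | "system";

interface ArtifactRecord {
  artifactId: string;
  type: string; // documented string such as "podcast-doc" or "other"
  title: string;
  description?: string;
  status: ArtifactStatus;
  storageProvider: StorageProvider;
  storageUri: string;
  filename?: string;
  contentType?: string;
  checksum?: string;
  sizeBytes?: number;
  dataClass: DataClass;
  taskId?: string;
  bundleId?: string;
  assistantJobId?: string;
  fileId?: string;
  sourceType: SourceType;
  createdBy?: string;
  reviewedBy?: string;
  createdAt: string; // assumed ISO-8601
  updatedAt: string; // assumed ISO-8601
  reviewedAt?: string;
  tags?: string[];
  metadata?: Record<string, string | number | boolean>;
}

// Convert one camelCase key to snake_case, e.g. "storageUri" -> "storage_uri".
function toSnakeCase(key: string): string {
  return key.replace(/[A-Z]/g, (ch) => "_" + ch.toLowerCase());
}

// Map top-level keys to the portable export form and drop undefined fields.
// Keys nested inside `metadata` are left untouched in this sketch.
function toExportRecord(artifact: ArtifactRecord): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(artifact)) {
    if (value !== undefined) out[toSnakeCase(key)] = value;
  }
  return out;
}
```

Keeping the mapping mechanical means a new runtime field appears in the export without a second hand-written list, though a real exporter would probably also apply the redaction rules before writing.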

The existing task and bundle artifactRefs remain lightweight references for fast context, but the artifact table/API is the durable source of artifact metadata. A ref alone is not enough to satisfy an artifact proof requirement unless the referenced artifact record exists and has an accepted proof status.

Storage Policy

Implement or document these boundaries in code comments/docs and enforce the parts that affect runtime behavior:

  • Git/GitHub: canonical SOPs, workflow definitions, reviewed templates, assistant prompts, and small reviewed process assets only. Do not commit private runtime artifacts, bulky generated files, raw assistant logs, podcast recordings, invoices, receipts, statements, or temporary assistant outputs as durable V1 storage.
  • DynamoDB: metadata only for files, artifacts, assistant job links, task proof state, bundle links, and audit references. No binaries, large generated documents, secrets, signed URLs, OAuth tokens, cookies, or raw assistant logs.
  • S3: preferred DataOps-owned storage for private/bulky uploaded binaries and generated artifacts when DataOps owns the file. If S3 upload/storage is introduced in this issue, the bucket must be SAM-managed, versioned, private, and referenced by s3://bucket/key plus checksum/size metadata. If S3 binary upload is not introduced, production binary upload must remain clearly disabled or guarded rather than silently writing to Lambda local disk.
  • Dropbox/Google Drive: acceptable external private systems for existing podcast/audio/document workflows. Store stable private-system URLs or provider URIs as metadata. Do not treat temporary signed download URLs as durable storageUri values.
  • Public URLs: Luma, Meetup, YouTube, website pages, Spotify, Apple Podcasts, and similar deliverables may be stored as task link, bundle link, and/or artifact metadata depending on whether they are proof for one task or reusable workflow output.
  • Local filesystem paths: allowed only for local development/test fixtures. Production Lambda local filesystem must not be the durable artifact store.
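
The runtime-enforceable parts of this policy could be checked when an artifact is registered. The sketch below is a heuristic under stated assumptions: `checkStorageUri`, `StoragePolicyInput`, and the `production` flag are hypothetical names, and the signed-URL test only looks for a few well-known query parameters rather than covering every provider.

```typescript
type StorageProvider =
  | "s3" | "dropbox" | "google-drive" | "github" | "external-url" | "local-dev" | "unknown";

interface StoragePolicyInput {
  storageProvider: StorageProvider;
  storageUri: string;
  production: boolean; // hypothetical flag supplied by the caller
}

// Query parameters that mark common temporary signed URLs: S3 SigV4 presigned
// URLs, Google Cloud Storage signed URLs, and older S3/CloudFront-style URLs.
// This list is a heuristic, not an exhaustive rule.
const SIGNED_URL_PARAMS = ["X-Amz-Signature", "X-Goog-Signature", "Signature"];

function tryParseUrl(value: string): URL | null {
  try {
    return new URL(value);
  } catch {
    return null;
  }
}

// Returns a list of policy problems; an empty list means the URI is acceptable.
function checkStorageUri(input: StoragePolicyInput): string[] {
  const problems: string[] = [];
  if (input.storageProvider === "local-dev") {
    if (input.production) {
      problems.push("local-dev storage is not allowed in production");
    }
    return problems;
  }
  if (input.storageProvider === "s3") {
    if (!/^s3:\/\/[^/]+\/.+$/.test(input.storageUri)) {
      problems.push("s3 storageUri must look like s3://bucket/key");
    }
    return problems;
  }
  // Other providers may use URLs or provider URIs; only parseable URLs are
  // inspected for signed-URL parameters.
  const url = tryParseUrl(input.storageUri);
  if (url !== null && SIGNED_URL_PARAMS.some((name) => url.searchParams.has(name))) {
    problems.push("storageUri looks like a temporary signed URL");
  }
  return problems;
}
```

Returning a list of problems rather than throwing lets the API report every violation in a single response.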

Acceptance Criteria

  • Work-engine has a typed artifact model and persistence path for artifact metadata with stable application IDs and explicit relationships to task, bundle, assistant job, and file records where present.
  • Production table/resource handling is explicit: either a SAM-managed DynamoDB artifacts table plus DATAOPS_ARTIFACTS_TABLE is added, or the implementation documents why artifacts are intentionally stored in an existing table without weakening export/migration safety.
  • Artifact API routes exist under /api/artifacts and work through the existing /work/api/* broker. They support create/register, list/filter by task/bundle/assistant job/status/type, get by ID, update metadata/status, attach to task/bundle, and archive. They do not expose a second public endpoint.
  • Artifact records can represent public links, private external-system links, S3-owned objects, assistant outputs, and local-dev-only files without changing the schema.
  • Artifact status/review semantics are enforced: draft and needs-review outputs are visible but do not satisfy required artifact proof; approved artifacts can satisfy proof; rejected, archived, and superseded artifacts do not satisfy proof.
  • Task completion blocks when proofRequirement.type = artifact and no approved artifact is attached to the task or its bundle according to the documented lookup rule.
  • Existing requiredLinkName, requiresFile, and proofRequirement.type = file/url/comment/external-status behavior continues to work and has regression coverage.
  • Existing file metadata is migration-safe: exports use storage_uri, provider/checksum/size fields when present, and remain backward compatible with legacy storagePath in local data.
  • Production file/artifact storage does not silently use Lambda local disk. Local filesystem storage is either guarded to local/test mode or replaced by a storage adapter that clearly separates local-dev from production storage.
  • Portable export writes artifacts.jsonl when artifacts are implemented, removes artifacts from omitted_entities, includes checksums/counts in manifest.json, and excludes binaries and secrets.
  • Export validation checks artifact required fields, duplicate IDs, parseable timestamps, enum/status values, redaction compliance, task/bundle/file relationships, and assistant-job relationships when assistant jobs are exported. Until #30 (Implement assistant job model and lifecycle) implements assistant jobs, assistant_job_id may be nullable or treated as an opaque optional field with a documented validator rule.
  • Docs are updated where needed: docs/v1-workflow-data-model.md, docs/v1-execution-state-schema.md, docs/v1-execution-data-safety.md, docs/v1-runtime-architecture.md, and docs/local-development.md if commands or local storage behavior change.
  • The V1 frontend shows task/bundle artifacts in the workflow context, distinguishes link/file/artifact proof, shows review status, and provides an operator path to register or attach an external artifact URL without uploading a binary.
  • Assistant outputs from assistants/podcast/ are represented as artifact metadata when attached to workflow/task context; the local assistants/podcast/documents/, inbox/, and heru_runs/ folders remain local runtime/dev storage, not durable production artifact storage.
  • Automated tests cover model validation, API routes, proof blocking, export/validate behavior, legacy file compatibility, and frontend artifact visibility/blocked-completion states.

Test Scenarios

Scenario: Register a public link artifact as workflow proof

Given: a workflow bundle has a task that requires an approved artifact proof
When: the operator registers a public URL artifact, attaches it to the task or bundle, and marks it approved
Then: the artifact appears in the task/bundle context and the task can be completed.

Scenario: Draft assistant output is visible but not accepted proof

Given: Podcast Assistant produced a draft document artifact with status needs-review
When: the operator views the related workflow task
Then: the output is visible with review status but cannot satisfy required artifact proof.
When: the operator approves the artifact
Then: it can satisfy the proof requirement.
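
The review semantics in this scenario can be sketched as a single check that returns the reason completion is blocked, or null when it is not. The names `artifactProofBlocker`, `AttachedArtifact`, and `ProofGatedTask` are hypothetical, and the lookup rule used here (an approved artifact attached to the task or to its bundle) is only an assumed reading of the documented rule.

```typescript
type ArtifactStatus =
  | "draft" | "needs-review" | "approved" | "rejected" | "archived" | "superseded";

interface AttachedArtifact {
  artifactId: string;
  status: ArtifactStatus;
  taskId?: string;
  bundleId?: string;
}

interface ProofGatedTask {
  taskId: string;
  bundleId?: string;
  proofType: string; // value of proofRequirement.type
}

// Returns a human-readable blocker, or null when artifact proof is satisfied
// or does not apply. Only "approved" counts as accepted proof.
function artifactProofBlocker(
  task: ProofGatedTask,
  artifacts: AttachedArtifact[],
): string | null {
  if (task.proofType !== "artifact") return null; // other proof types are checked elsewhere
  const attached = artifacts.filter(
    (a) =>
      a.taskId === task.taskId ||
      (task.bundleId !== undefined && a.bundleId === task.bundleId),
  );
  if (attached.some((a) => a.status === "approved")) return null;
  if (attached.length === 0) {
    return "No artifact is attached to this task or its bundle.";
  }
  return "Attached artifacts exist, but none is approved.";
}
```

The returned message doubles as the explanation the operator sees when proof-gated completion is blocked.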

Scenario: Private or bulky artifact is metadata-only

Given: a podcast recording, transcript, invoice, or report is stored in S3, Dropbox, or Google Drive
When: an artifact record is created
Then: DynamoDB stores provider, storageUri, relationship IDs, size/checksum when available, and review metadata, but not the binary payload or signed temporary URL.

Scenario: Required file and required artifact stay distinct

Given: one task has requiresFile = true and another has proofRequirement.type = artifact
When: the operator attaches a file metadata record only
Then: the file-gated task can complete but the artifact-gated task remains blocked until an approved artifact record exists.

Scenario: Export preserves artifact relationships

Given: exported data includes users, tasks, bundles, files, artifacts, and notifications
When: export:data and validate:export run
Then: artifacts.jsonl is present, manifest counts/checksums match, every artifact relationship points to an exported entity or follows the documented #30 compatibility rule, and no binary content or secret appears in the export.
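
A validator for this scenario might look roughly like the sketch below. It covers required fields, duplicate IDs, status values, timestamps, and relationships, and treats assistant_job_id as an opaque optional string as one possible form of the #30 compatibility rule. The function name `validateArtifactsJsonl` and the `ExportedIds` shape are hypothetical; redaction and manifest checks are left out.

```typescript
const ARTIFACT_STATUSES = new Set([
  "draft", "needs-review", "approved", "rejected", "archived", "superseded",
]);
const REQUIRED_STRING_FIELDS = [
  "artifact_id", "type", "title", "status", "storage_provider", "storage_uri",
];
const TIMESTAMP_FIELDS = ["created_at", "updated_at", "reviewed_at"];

// IDs collected from the other exported files (hypothetical shape).
interface ExportedIds {
  taskIds: Set<string>;
  bundleIds: Set<string>;
  fileIds: Set<string>;
}

function parseRow(line: string): Record<string, unknown> | null {
  try {
    const value: unknown = JSON.parse(line);
    const isObject = typeof value === "object" && value !== null && !Array.isArray(value);
    return isObject ? (value as Record<string, unknown>) : null;
  } catch {
    return null;
  }
}

// Returns one message per problem; an empty list means the file passed.
function validateArtifactsJsonl(jsonl: string, ids: ExportedIds): string[] {
  const errors: string[] = [];
  const seen = new Set<string>();
  const relations: Array<[string, Set<string>]> = [
    ["task_id", ids.taskIds], ["bundle_id", ids.bundleIds], ["file_id", ids.fileIds],
  ];
  const rows = jsonl.split("\n").filter((line) => line.trim() !== "");
  rows.forEach((line, index) => {
    const where = `artifacts.jsonl row ${index + 1}`;
    const row = parseRow(line);
    if (row === null) {
      errors.push(`${where}: not a JSON object`);
      return;
    }
    for (const field of REQUIRED_STRING_FIELDS) {
      const value = row[field];
      if (typeof value !== "string" || value === "") errors.push(`${where}: missing ${field}`);
    }
    const id = row["artifact_id"];
    if (typeof id === "string") {
      if (seen.has(id)) errors.push(`${where}: duplicate artifact_id ${id}`);
      seen.add(id);
    }
    const status = row["status"];
    if (typeof status === "string" && !ARTIFACT_STATUSES.has(status)) {
      errors.push(`${where}: unknown status ${status}`);
    }
    for (const field of TIMESTAMP_FIELDS) {
      const value = row[field];
      if (value === undefined || value === null) continue;
      if (typeof value !== "string" || isNaN(Date.parse(value))) {
        errors.push(`${where}: unparseable ${field}`);
      }
    }
    for (const [field, known] of relations) {
      const value = row[field];
      if (value === undefined || value === null) continue;
      if (typeof value !== "string" || !known.has(value)) {
        errors.push(`${where}: ${field} does not point to an exported entity`);
      }
    }
    // Assumed #30 compatibility rule: opaque optional string, not resolved.
    const jobId = row["assistant_job_id"];
    if (jobId !== undefined && jobId !== null && typeof jobId !== "string") {
      errors.push(`${where}: assistant_job_id must be a string or null`);
    }
  });
  return errors;
}
```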

Scenario: Legacy local file metadata remains portable

Given: existing local/test file records use storagePath
When: the portable export runs
Then: files.jsonl emits a migration-safe storage_uri value and validation still passes.
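
One way to keep legacy records portable is to derive storage_uri at export time instead of rewriting local data. In this sketch the `local-dev://` scheme, the `LegacyFileRecord` shape, and the `exportStorageUri` helper are all assumptions; the issue does not specify what a migration-safe value for a legacy storagePath looks like.

```typescript
// Hypothetical shape covering both new records and legacy local records.
interface LegacyFileRecord {
  fileId: string;
  storageUri?: string;
  storagePath?: string; // legacy local/test field
}

// Prefer an explicit storageUri; otherwise derive one from the legacy path.
// The "local-dev://" scheme is an illustrative assumption, not a documented format.
function exportStorageUri(file: LegacyFileRecord): string | null {
  if (file.storageUri) return file.storageUri;
  if (file.storagePath) {
    const normalized = file.storagePath.replace(/\\/g, "/").replace(/^\/+/, "");
    return "local-dev://" + normalized;
  }
  return null; // caller decides whether a missing location is a validation error
}
```

For example, a record with only `storagePath: "uploads/a.pdf"` would export as `local-dev://uploads/a.pdf`, while a record that already has a `storageUri` passes through unchanged.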

Scenario: Production local filesystem storage is guarded

Given: the work-engine runs in a production-like environment
When: an upload or artifact operation would write to Lambda local filesystem as durable storage
Then: the operation is rejected with a clear error or routed through the configured production storage adapter.
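
The guard in this scenario could be a small resolver that chooses a storage mode from the environment and fails loudly instead of falling back to local disk. `AWS_LAMBDA_FUNCTION_NAME` is set by the Lambda runtime; `DATAOPS_ARTIFACTS_BUCKET`, `DATAOPS_LOCAL_STORAGE_DIR`, the default directory, and the `resolveStorageMode` name are hypothetical and used only for illustration.

```typescript
type StorageMode =
  | { kind: "s3"; bucket: string }
  | { kind: "local-dev"; rootDir: string };

// Decide where binaries may be written. On Lambda without a configured
// bucket the function throws rather than silently using local disk.
function resolveStorageMode(env: Record<string, string | undefined>): StorageMode {
  const onLambda = env["AWS_LAMBDA_FUNCTION_NAME"] !== undefined;
  const bucket = env["DATAOPS_ARTIFACTS_BUCKET"]; // hypothetical variable name
  if (bucket) return { kind: "s3", bucket };
  if (onLambda) {
    throw new Error(
      "Binary storage is disabled: no production storage adapter is configured",
    );
  }
  // Local development and tests only; the default directory is an assumption.
  return { kind: "local-dev", rootDir: env["DATAOPS_LOCAL_STORAGE_DIR"] ?? ".data/files" };
}
```

Passing the environment as a parameter, for example `resolveStorageMode(process.env)` at startup, keeps the guard easy to cover in unit tests.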

Out of Scope

  • Full assistant job queue, retry, log, and approval lifecycle; #30 (Implement assistant job model and lifecycle) owns that.
  • Raw intake inbox modeling; #31 (Implement raw intake inbox for operational inputs) owns that.
  • Unified search across artifacts and assistant output; #32 (Implement unified operator search across docs and work context) owns that.
  • Migrating all historical Trello/Spreadsheet/Dropbox/Google Drive artifacts.
  • Automating external systems such as Dropbox, Google Drive, Luma, Meetup, YouTube, Spotify, Apple Podcasts, Airtable, Slack, or Mailchimp.
  • Building a rich binary preview/editor experience for every artifact type.
  • Moving canonical process docs or workflow templates out of Git.
  • Committing assistant-generated files from local runtime directories as durable workflow storage.
  • A destructive production restore or live migration.

Dependencies

No issue must close before this starts, but implementation must coordinate with:

Verification Commands

Run the relevant full workflow for the changed surface. Expected minimum for this issue:

npm --prefix work-engine test
npm --prefix work-engine run typecheck
npm --prefix work-engine run build
npm --prefix work-engine run test:e2e
uv run --project lambda-functions --extra search --with pytest python -m pytest tests/docs_app
cd lambda-functions
sam validate --template-file template.full.yaml

If docs/content metadata or search-facing docs are changed, also run:

cd lambda-functions
uv run --extra search python -m lambda_functions.build_search_index \
  --docs-dir ../content \
  --output ../.tmp/dataops-content-search.index

If assistants/podcast/** behavior is changed, also run:

uv run --project assistants/podcast pytest

Metadata

Assignees

No one assigned

Labels

  • P0: Must have
  • assistant: Assistant modules and jobs
  • backend: Backend/API
  • data: Data model, migration, storage
  • docs: Documentation or process docs work
  • enhancement: New or improved functionality
  • frontend: Frontend UI
  • infra: Deployment and infrastructure
  • portal: Shared portal shell and UX
  • testing: Tests and QA
  • work-engine: DataTasks task execution engine
