Define artifact model and storage policy

# Implement V1 artifact model and storage policy

Status: pending
Tags: `enhancement`, `portal`, `work-engine`, `assistant`, `frontend`, `backend`, `infra`, `data`, `testing`, `P0`
Depends on: None
Blocks: #9, #30, future artifact search/export/migration work

## Scope

Implement the V1 artifact contract that lets workflow/task proof, assistant outputs, files, and external links live in one coherent model without storing binaries or secrets in DynamoDB or Git.

This issue should make artifacts usable by the work-engine and visible enough in the V1 operator flow for later Podcast and assistant slices. It is broader than the existing `FileRecord` upload metadata and narrower than the full assistant job lifecycle from #30.

The implementation should cover:

- A first-class artifact metadata entity for generated or operational outputs such as podcast prep documents, transcripts, reports, event pages, public URLs, draft assistant outputs, and reviewed files.
- The boundary between task proof fields, bundle links, file metadata, artifact records, assistant output records, and canonical Git-backed process docs.
- Storage policy for `s3://`, Dropbox/Google Drive/external URLs, Git/GitHub content, and local-dev-only file paths.
- Metadata-only DynamoDB storage for artifacts and files: stable IDs, storage URI, provider, checksum/size where available, review status, relationship IDs, and timestamps. DynamoDB must not store binary payloads, large text outputs, signed URLs, secrets, or assistant raw logs.
- Work-engine API support to create, update, list, attach, approve/reject/archive, and export artifact metadata through the existing authenticated `/work/api/*` surface.
- Task completion/proof behavior for `proofRequirement.type = artifact`, `requiresFile`, `requiredLinkName`, and assistant-generated outputs.
- Portable export and validation updates so `artifacts.jsonl` becomes an exported entity once artifact records exist, while existing `files.jsonl` stays migration-safe.
- V1 UI implications: the operator can see task/bundle artifacts, link or register an external artifact, understand review status, and see why proof-gated completion is blocked.

Use the current V1 runtime boundary from `docs/v1-runtime-architecture.md`: the Python portal remains the public entry point, work-engine stays private behind `/work/api/*`, runtime metadata lives in DynamoDB, Git stores reviewed process knowledge, and private/bulky binaries live in S3 or existing private external systems.

## Required Model

Add or document the runtime artifact shape in TypeScript and export form. Runtime fields may use camelCase; portable exports must use snake_case.

Minimum artifact fields:

- `artifactId` / `artifact_id`
- `type`: documented string such as `podcast-doc`, `transcript`, `recording`, `report`, `invoice`, `event-page`, `assistant-output`, `external-link`, or `other`
- `title`
- `description` optional
- `status`: `draft`, `needs-review`, `approved`, `rejected`, `archived`, or `superseded`
- `storageProvider`: `s3`, `dropbox`, `google-drive`, `github`, `external-url`, `local-dev`, or `unknown`
- `storageUri` / `storage_uri`
- `filename` optional
- `contentType` / `content_type` optional
- `checksum`: optional in general, but required when DataOps owns the binary and can compute it
- `sizeBytes` / `size_bytes` optional
- `visibility` or `dataClass`: at least distinguish `public`, `internal`, `private`, and `sensitive`
- `taskId`, `bundleId`, `assistantJobId`, and `fileId` optional relationship IDs
- `sourceType`: `manual-link`, `manual-upload`, `assistant-output`, `import`, `migration`, or `system`
- `createdBy`, `reviewedBy`, `createdAt`, `updatedAt`, `reviewedAt` where available
- `tags` optional
- small structured `metadata` optional, with redaction rules applied
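
The minimum fields above could be sketched in TypeScript roughly as follows. This is an illustration, not the final work-engine type: it assumes `dataClass` is chosen over `visibility`, that `createdAt` and `updatedAt` are always set, and that `metadata` holds only flat primitive values. The `toSnakeCase` and `toExportRecord` helpers are hypothetical names used to show the camelCase-to-snake_case export rule.

```typescript
type ArtifactStatus = "draft" | "needs-review" | "approved" | "rejected" | "archived" | "superseded";
type StorageProvider = "s3" | "dropbox" | "google-drive" | "github" | "external-url" | "local-dev" | "unknown";
type DataClass = "public" | "internal" | "private" | "sensitive";
type SourceType = "manual-link" | "manual-upload" | "assistant-output" | "import" | "migration" | "system";

interface Artifact {
  artifactId: string;
  type: string; // documented string such as "podcast-doc", "transcript", or "external-link"
  title: string;
  description?: string;
  status: ArtifactStatus;
  storageProvider: StorageProvider;
  storageUri: string;
  filename?: string;
  contentType?: string;
  checksum?: string;
  sizeBytes?: number;
  dataClass: DataClass; // assumption: `dataClass` rather than `visibility`
  taskId?: string;
  bundleId?: string;
  assistantJobId?: string;
  fileId?: string;
  sourceType: SourceType;
  createdBy?: string;
  reviewedBy?: string;
  createdAt: string; // assumption: ISO 8601 string, always present
  updatedAt: string; // assumption: ISO 8601 string, always present
  reviewedAt?: string;
  tags?: string[];
  metadata?: Record<string, string | number | boolean>; // assumption: small, flat, already redacted
}

// Convert one camelCase key to snake_case, e.g. "storageUri" -> "storage_uri".
function toSnakeCase(key: string): string {
  return key.replace(/[A-Z]/g, (ch) => "_" + ch.toLowerCase());
}

// Shallow conversion for portable export. Undefined fields are dropped;
// keys nested inside `metadata` are left untouched (assumption).
function toExportRecord(artifact: Artifact): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(artifact)) {
    if (value !== undefined) out[toSnakeCase(key)] = value;
  }
  return out;
}
```

Keeping the status, provider, and source values as string-literal unions lets the compiler reject unknown values at the API boundary, while `type` stays a plain string so new documented artifact types do not require a schema change.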

The existing task and bundle `artifactRefs` remain lightweight references for fast context, but the artifact table/API is the durable source of artifact metadata. A ref alone does not satisfy an artifact proof requirement: the referenced artifact record must exist and must have a status that satisfies proof (`approved`).
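
The ref-versus-record rule can be illustrated with a small check. The `ArtifactRef` shape (a ref carrying an `artifactId`) and the `satisfiesArtifactProof` helper are assumptions made for this sketch; the real `artifactRefs` shape may differ.

```typescript
type ArtifactStatus = "draft" | "needs-review" | "approved" | "rejected" | "archived" | "superseded";

// Hypothetical lightweight reference stored on a task or bundle.
interface ArtifactRef {
  artifactId: string;
  label?: string;
}

// Only the fields needed for the proof decision.
interface ArtifactRecord {
  artifactId: string;
  status: ArtifactStatus;
}

// A ref counts as proof only when its artifact record exists and is approved.
function satisfiesArtifactProof(
  refs: ArtifactRef[],
  recordsById: Map<string, ArtifactRecord>,
): boolean {
  return refs.some((ref) => recordsById.get(ref.artifactId)?.status === "approved");
}
```

A dangling ref (no matching record) and a ref to a `needs-review` record are treated the same way: neither satisfies proof.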

## Storage Policy

Implement or document these boundaries in code comments/docs and enforce the parts that affect runtime behavior:

- Git/GitHub: canonical SOPs, workflow definitions, reviewed templates, assistant prompts, and small reviewed process assets only. Do not commit private runtime artifacts, bulky generated files, raw assistant logs, podcast recordings, invoices, receipts, statements, or temporary assistant outputs as durable V1 storage.
- DynamoDB: metadata only for files, artifacts, assistant job links, task proof state, bundle links, and audit references. No binaries, large generated documents, secrets, signed URLs, OAuth tokens, cookies, or raw assistant logs.
- S3: preferred DataOps-owned storage for private/bulky uploaded binaries and generated artifacts when DataOps owns the file. If S3 upload/storage is introduced in this issue, the bucket must be SAM-managed, versioned, private, and referenced by `s3://bucket/key` plus checksum/size metadata. If S3 binary upload is not introduced, production binary upload must remain clearly disabled or guarded rather than silently writing to Lambda local disk.
- Dropbox/Google Drive: acceptable external private systems for existing podcast/audio/document workflows. Store stable private-system URLs or provider URIs as metadata. Do not treat temporary signed download URLs as durable `storageUri` values.
- Public URLs: Luma, Meetup, YouTube, website pages, Spotify, Apple Podcasts, and similar deliverables may be stored as task `link`, bundle link, and/or artifact metadata depending on whether they are proof for one task or reusable workflow output.
- Local filesystem paths: allowed only for local development/test fixtures. Production Lambda local filesystem must not be the durable artifact store.
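
The parts of this policy that affect runtime behavior could be enforced with a check along these lines when an artifact is registered. This is a sketch under stated assumptions: `checkStorageUri` is a hypothetical helper, and the signed-URL detection is a heuristic that only looks for a few query parameters commonly present on temporary signed links (`X-Amz-Signature`, `X-Goog-Signature`, `Signature`). It is not an exhaustive detector.

```typescript
type StorageProvider = "s3" | "dropbox" | "google-drive" | "github" | "external-url" | "local-dev" | "unknown";

interface StoragePolicyResult {
  ok: boolean;
  reason?: string;
}

// Heuristic: query parameters commonly seen on temporary signed URLs.
const SIGNED_URL_PATTERN = /[?&](X-Amz-Signature|X-Goog-Signature|Signature)=/i;

function checkStorageUri(
  provider: StorageProvider,
  uri: string,
  isProduction: boolean,
): StoragePolicyResult {
  // Local filesystem paths are only valid for local development and tests.
  if (provider === "local-dev" && isProduction) {
    return { ok: false, reason: "local-dev storage is not allowed in production" };
  }
  // DataOps-owned objects are referenced as s3://bucket/key.
  if (provider === "s3" && !/^s3:\/\/[^/]+\/.+/.test(uri)) {
    return { ok: false, reason: "S3 artifacts must use the s3://bucket/key form" };
  }
  // Temporary signed links must not be stored as durable storageUri values.
  if (SIGNED_URL_PATTERN.test(uri)) {
    return { ok: false, reason: "temporary signed URLs are not durable storageUri values" };
  }
  return { ok: true };
}
```

Returning a `reason` alongside the boolean gives the API something specific to report when a registration is rejected.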

## Acceptance Criteria

- [x] Work-engine has a typed artifact model and persistence path for artifact metadata with stable application IDs and explicit relationships to task, bundle, assistant job, and file records where present.
- [x] Production table/resource handling is explicit: either a SAM-managed DynamoDB artifacts table plus `DATAOPS_ARTIFACTS_TABLE` is added, or the implementation documents why artifacts are intentionally stored in an existing table without weakening export/migration safety.
- [x] Artifact API routes exist under `/api/artifacts` and work through the existing `/work/api/*` broker. They support create/register, list/filter by task/bundle/assistant job/status/type, get by ID, update metadata/status, attach to task/bundle, and archive. They do not expose a second public endpoint.
- [x] Artifact records can represent public links, private external-system links, S3-owned objects, assistant outputs, and local-dev-only files without changing the schema.
- [x] Artifact status/review semantics are enforced: `draft` and `needs-review` outputs are visible but do not satisfy required artifact proof; `approved` artifacts can satisfy proof; `rejected`, `archived`, and `superseded` artifacts do not satisfy proof.
- [x] Task completion blocks when `proofRequirement.type = artifact` and no approved artifact is attached to the task or its bundle according to the documented lookup rule.
- [x] Existing `requiredLinkName`, `requiresFile`, and `proofRequirement.type = file/url/comment/external-status` behavior continues to work and has regression coverage.
- [x] Existing file metadata is migration-safe: exports use `storage_uri`, provider/checksum/size fields when present, and remain backward compatible with legacy `storagePath` in local data.
- [x] Production file/artifact storage does not silently use Lambda local disk. Local filesystem storage is either guarded to local/test mode or replaced by a storage adapter that clearly separates local-dev from production storage.
- [x] Portable export writes `artifacts.jsonl` when artifacts are implemented, removes `artifacts` from `omitted_entities`, includes checksums/counts in `manifest.json`, and excludes binaries and secrets.
- [x] Export validation checks artifact required fields, duplicate IDs, parseable timestamps, enum/status values, redaction compliance, task/bundle/file relationships, and assistant-job relationships when assistant jobs are exported. Until #30 implements assistant jobs, `assistant_job_id` may be nullable or treated as an opaque optional field with a documented validator rule.
- [x] Docs are updated where needed: `docs/v1-workflow-data-model.md`, `docs/v1-execution-state-schema.md`, `docs/v1-execution-data-safety.md`, `docs/v1-runtime-architecture.md`, and `docs/local-development.md` if commands or local storage behavior change.
- [x] The V1 frontend shows task/bundle artifacts in the workflow context, distinguishes link/file/artifact proof, shows review status, and provides an operator path to register or attach an external artifact URL without uploading a binary.
- [x] Assistant outputs from `assistants/podcast/` are represented as artifact metadata when attached to workflow/task context; the local `assistants/podcast/documents/`, `inbox/`, and `heru_runs/` folders remain local runtime/dev storage, not durable production artifact storage.
- [x] Automated tests cover model validation, API routes, proof blocking, export/validate behavior, legacy file compatibility, and frontend artifact visibility/blocked-completion states.

## Test Scenarios

### Scenario: Register a public link artifact as workflow proof

Given: a workflow bundle has a task that requires an approved artifact proof
When: the operator registers a public URL artifact, attaches it to the task or bundle, and marks it `approved`
Then: the artifact appears in the task/bundle context and the task can be completed.

### Scenario: Draft assistant output is visible but not accepted proof

Given: Podcast Assistant produced a draft document artifact with status `needs-review`
When: the operator views the related workflow task
Then: the output is visible with review status but cannot satisfy required artifact proof.
When: the operator approves the artifact
Then: it can satisfy the proof requirement.

### Scenario: Private or bulky artifact is metadata-only

Given: a podcast recording, transcript, invoice, or report is stored in S3, Dropbox, or Google Drive
When: an artifact record is created
Then: DynamoDB stores provider, `storageUri`, relationship IDs, size/checksum when available, and review metadata, but not the binary payload or signed temporary URL.

### Scenario: Required file and required artifact stay distinct

Given: one task has `requiresFile = true` and another has `proofRequirement.type = artifact`
When: the operator attaches a file metadata record only
Then: the file-gated task can complete but the artifact-gated task remains blocked until an approved artifact record exists.
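
A minimal sketch of how the two gates could stay independent while also producing the "why is completion blocked" messages for the operator. `TaskProofState` and `completionBlockers` are hypothetical; the sketch assumes the caller has already resolved which files and artifact records are attached to the task or its bundle.

```typescript
interface TaskProofState {
  requiresFile?: boolean;
  proofRequirement?: { type: "artifact" | "file" | "url" | "comment" | "external-status" };
  attachedFileIds: string[];
  // Statuses of artifact records attached to the task or its bundle.
  attachedArtifactStatuses: string[];
}

// Returns human-readable reasons; an empty array means the task may complete.
function completionBlockers(task: TaskProofState): string[] {
  const blockers: string[] = [];
  if (task.requiresFile && task.attachedFileIds.length === 0) {
    blockers.push("A file must be attached before this task can be completed.");
  }
  if (
    task.proofRequirement?.type === "artifact" &&
    task.attachedArtifactStatuses.indexOf("approved") === -1
  ) {
    blockers.push("An approved artifact is required before this task can be completed.");
  }
  return blockers;
}
```

Because the file check never looks at artifacts and the artifact check never looks at files, attaching a file record alone cannot unblock an artifact-gated task.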

### Scenario: Export preserves artifact relationships

Given: exported data includes users, tasks, bundles, files, artifacts, and notifications
When: `export:data` and `validate:export` run
Then: `artifacts.jsonl` is present, manifest counts/checksums match, every artifact relationship points to an exported entity or follows the documented #30 compatibility rule, and no binary content or secret appears in the export.
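
For illustration, some of the `validate:export` checks on `artifacts.jsonl` might look like the sketch below. The function name, the exact required-field list, and the restriction to `task_id` relationships are assumptions made to keep the example short; a real validator would also cover bundle, file, and assistant-job relationships, redaction, and manifest checksums.

```typescript
const ARTIFACT_STATUSES = ["draft", "needs-review", "approved", "rejected", "archived", "superseded"];
const REQUIRED_FIELDS = ["artifact_id", "type", "title", "status", "storage_uri"]; // assumed minimum
const TIMESTAMP_FIELDS = ["created_at", "updated_at", "reviewed_at"];

interface ValidationIssue {
  record: number; // 1-based index among non-empty lines
  message: string;
}

function validateArtifactsJsonl(jsonl: string, exportedTaskIds: Set<string>): ValidationIssue[] {
  const issues: ValidationIssue[] = [];
  const seenIds = new Set<string>();
  const rows = jsonl.split("\n").filter((row) => row.trim() !== "");
  rows.forEach((row, index) => {
    const record = index + 1;
    let parsed: unknown;
    try {
      parsed = JSON.parse(row);
    } catch {
      issues.push({ record, message: "not valid JSON" });
      return;
    }
    if (typeof parsed !== "object" || parsed === null) {
      issues.push({ record, message: "not a JSON object" });
      return;
    }
    const artifact = parsed as Record<string, unknown>;
    for (const field of REQUIRED_FIELDS) {
      if (typeof artifact[field] !== "string" || artifact[field] === "") {
        issues.push({ record, message: `missing required field ${field}` });
      }
    }
    const id = artifact["artifact_id"];
    if (typeof id === "string") {
      if (seenIds.has(id)) issues.push({ record, message: `duplicate artifact_id ${id}` });
      seenIds.add(id);
    }
    const status = artifact["status"];
    if (typeof status === "string" && ARTIFACT_STATUSES.indexOf(status) === -1) {
      issues.push({ record, message: `unknown status ${status}` });
    }
    for (const field of TIMESTAMP_FIELDS) {
      const value = artifact[field];
      if (value === undefined || value === null) continue;
      if (typeof value !== "string" || Number.isNaN(Date.parse(value))) {
        issues.push({ record, message: `unparseable timestamp ${field}` });
      }
    }
    // Relationship check: a task_id must point at an exported task.
    const taskId = artifact["task_id"];
    if (typeof taskId === "string" && !exportedTaskIds.has(taskId)) {
      issues.push({ record, message: `task_id ${taskId} is not in the export` });
    }
  });
  return issues;
}
```

Collecting every issue instead of failing on the first one makes a single validation run more useful when an export has several unrelated problems.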

### Scenario: Legacy local file metadata remains portable

Given: existing local/test file records use `storagePath`
When: the portable export runs
Then: `files.jsonl` emits a migration-safe `storage_uri` value and validation still passes.
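
One way the legacy fallback could work is sketched below. Everything here is an assumption for illustration: the `LegacyFileRecord` fields are not the actual `FileRecord` definition, and the `local-dev:` prefix is a hypothetical convention for marking a legacy local path, not a format this issue prescribes.

```typescript
// Hypothetical subset of file metadata; the real FileRecord may differ.
interface LegacyFileRecord {
  fileId: string;
  filename: string;
  storageUri?: string;
  storagePath?: string; // legacy local-only field
  storageProvider?: string;
  checksum?: string;
  sizeBytes?: number;
}

interface ExportedFile {
  file_id: string;
  filename: string;
  storage_uri: string;
  storage_provider: string;
  checksum?: string;
  size_bytes?: number;
}

function toExportedFile(file: LegacyFileRecord): ExportedFile {
  let storageUri = file.storageUri;
  let provider = file.storageProvider;
  // Fall back to the legacy path only when no storageUri exists.
  if (storageUri === undefined && file.storagePath !== undefined) {
    storageUri = `local-dev:${file.storagePath}`; // assumed convention
    provider = "local-dev";
  }
  if (storageUri === undefined) {
    throw new Error(`file ${file.fileId} has neither storageUri nor storagePath`);
  }
  const exported: ExportedFile = {
    file_id: file.fileId,
    filename: file.filename,
    storage_uri: storageUri,
    storage_provider: provider ?? "unknown",
  };
  // Optional fields are emitted only when present.
  if (file.checksum !== undefined) exported.checksum = file.checksum;
  if (file.sizeBytes !== undefined) exported.size_bytes = file.sizeBytes;
  return exported;
}
```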

### Scenario: Production local filesystem storage is guarded

Given: the work-engine runs in a production-like environment
When: an upload or artifact operation would write to Lambda local filesystem as durable storage
Then: the operation is rejected with a clear error or routed through the configured production storage adapter.
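
The guard in this scenario could be as small as the selection function below. `StorageConfig`, `artifactsBucket`, and `selectStorageKind` are hypothetical names; how the work-engine actually detects a production-like environment is not specified here.

```typescript
type StorageKind = "s3" | "local-dev";

interface StorageConfig {
  environment: "local" | "test" | "production";
  artifactsBucket?: string; // hypothetical setting holding the SAM-managed bucket name
}

// Decide which storage adapter handles binary writes.
// Production never falls back to local disk.
function selectStorageKind(config: StorageConfig): StorageKind {
  if (config.artifactsBucket) return "s3";
  if (config.environment === "production") {
    throw new Error(
      "Binary storage is disabled in production: no S3 bucket is configured " +
        "and Lambda local disk is not durable storage.",
    );
  }
  return "local-dev";
}
```

Failing at adapter selection, rather than at write time, means a misconfigured production deployment rejects the first upload with a clear message instead of writing to Lambda local disk.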

## Out of Scope

- Full assistant job queue, retry, log, and approval lifecycle; #30 owns that.
- Raw intake inbox modeling; #31 owns that.
- Unified search across artifacts and assistant output; #32 owns that.
- Migrating all historical Trello/Spreadsheet/Dropbox/Google Drive artifacts.
- Automating external systems such as Dropbox, Google Drive, Luma, Meetup, YouTube, Spotify, Apple Podcasts, Airtable, Slack, or Mailchimp.
- Building a rich binary preview/editor experience for every artifact type.
- Moving canonical process docs or workflow templates out of Git.
- Committing assistant-generated files from local runtime directories as durable workflow storage.
- A destructive production restore or live migration.

## Dependencies

No issue must close before this starts, but implementation must coordinate with:

- #9: the Podcast end-to-end slice should consume this artifact/proof contract if this issue (#29) lands first.
- #30: assistant jobs should reference `output_artifact_ids` and reuse this artifact model when that issue is groomed/implemented.
- #31: raw intake attachments should later register files/artifacts using this storage policy.
- #32: unified search should later index safe artifact metadata, not binaries or private contents.
- `docs/v1-runtime-architecture.md`, `docs/v1-workflow-data-model.md`, `docs/v1-execution-state-schema.md`, and `docs/v1-execution-data-safety.md` are the baseline architecture/data-safety contracts.

## Verification Commands

Run the full verification workflow for each surface you changed. The expected minimum for this issue is:

```bash
npm --prefix work-engine test
npm --prefix work-engine run typecheck
npm --prefix work-engine run build
npm --prefix work-engine run test:e2e
uv run --project lambda-functions --extra search --with pytest python -m pytest tests/docs_app
(cd lambda-functions && sam validate --template-file template.full.yaml)
```

If docs/content metadata or search-facing docs are changed, also run:

```bash
cd lambda-functions
uv run --extra search python -m lambda_functions.build_search_index \
  --docs-dir ../content \
  --output ../.tmp/dataops-content-search.index
```

If `assistants/podcast/**` behavior is changed, also run:

```bash
uv run --project assistants/podcast pytest
```

