Extend process docs with stable IDs #33

Description

@alexeygrigorev

Extend process docs with stable IDs

Status: pending
Tags: enhancement, docs, process-docs, backend, work-engine, testing, data, P0
Depends on: None
Blocks: #34, #36, #37, #38, #39

Scope

Add a stable document identity policy and the first implementation slice for process documents that are used by DataOps workflows.

The current docs conventions already describe frontmatter IDs, and the docs registry can generate fallback IDs from paths. That fallback is not enough for V1 workflow execution because task templates and runtime tasks need process-document references that survive file renames, repo moves, and the future dataops-knowledge split described in docs/decisions/dataops-knowledge-repository.md.

This issue should make stable IDs an explicit contract for workflow-critical process docs and migrate the first priority set. The implementer should inspect and update the relevant documentation and code paths, especially:

  • .goal-v1.md
  • _docs/PROCESS.md
  • _docs/MERGE_PLAN.md
  • PROJECT_PLAN.md
  • PORTAL_ANALYSIS.md
  • docs/STRUCTURE.md
  • docs/sop-format.md
  • docs/local-development.md
  • docs/decisions/dataops-knowledge-repository.md
  • docs/v1-workflow-data-model.md
  • content/tasks/templates/*.md
  • Podcast workflow process docs referenced by content/tasks/templates/podcast.md and work-engine/scripts/seed-templates.ts
  • lambda-functions/src/lambda_functions/doc_registry.py
  • lambda-functions/src/lambda_functions/docs_index.py
  • docs metadata/search tests under tests/docs_app/
  • work-engine template/task metadata code and tests when sourceDocIds or instructionDocId mappings are changed

The first migration target is:

  1. Every Markdown task template in content/tasks/templates/*.md has an explicit, stable frontmatter id, plus aliases, doc_type: task-template, schema_version: 1, source, systems, tags, and related_docs where applicable.
  2. Every Podcast process doc referenced by the Podcast task template or Podcast seed-template instructionDocId has an explicit stable frontmatter id, or the reference is corrected to the stable ID already present on the target document.
  3. Workflow template records that already expose sourceDocIds and task definitions that expose instructionDocId use these stable document IDs rather than path-derived or Google Docs-only references.
  4. The policy docs explain how IDs, aliases, source metadata, task/workflow references, and repository boundaries work together.

Metadata Format

Use the existing frontmatter style and make it normative for workflow-critical process docs:

---
id: sop.media.podcast.create-podcast-document
aliases:
  - sop.media.podcast.create-a-podcast-document
title: "Create a podcast document"
summary: "Short operator-facing summary."
doc_type: sop
schema_version: 1
source: "Processes/Podcast/Create a podcast document.docx"
systems:
  - github
tags:
  - podcast
related_docs:
  - task-template.tasks.podcast
---

Rules to preserve or document:

  • id is the canonical stable identity. It must use lowercase letters, numbers, dots, dashes, and underscores only.
  • Prefer namespaces by type and domain: sop.<domain>.<area>.<slug>, template.<domain>.<area>.<slug>, reference.<domain>.<area>.<slug>, playbook.<domain>.<area>.<slug>, prompt.<domain>.<area>.<slug>, and task-template.tasks.<workflow>.
  • aliases contains old IDs, old generated IDs, or old repo-relative paths that should still resolve after a rename or migration.
  • related_docs, task-template sourceDocIds, and task/task-definition instructionDocId should prefer stable IDs over paths and external Google Docs URLs.
  • source remains provenance for imported docs. It is not the runtime identity.
  • instructionsUrl may remain for external Google Docs links during migration, but it must not be the only process-doc identity when an in-repo process doc exists.
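
The ID syntax rule above can be checked mechanically. The following is a minimal sketch, not the actual code in doc_registry.py: the helper name is_valid_doc_id is hypothetical, and requiring a non-empty ID is an assumption. The rule as written only constrains the character set, so questions such as leading or consecutive dots are left open here.

```python
import re

# Allowed characters per the rule above: lowercase letters, digits, dots,
# dashes, and underscores. Requiring at least one character is an assumption.
_DOC_ID_RE = re.compile(r"[a-z0-9._-]+")


def is_valid_doc_id(doc_id: str) -> bool:
    """Return True if doc_id uses only the allowed character set."""
    return _DOC_ID_RE.fullmatch(doc_id) is not None


print(is_valid_doc_id("sop.media.podcast.create-podcast-document"))  # True
print(is_valid_doc_id("Processes/Podcast/Create a podcast document.docx"))  # False
```

The second call shows why source cannot double as the runtime identity: a provenance path contains uppercase letters, spaces, and slashes, all of which the ID rule rejects.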

Migration Strategy

Do this as a narrow first batch, not a repo-wide blind rewrite.

  1. Update the documentation contract first so future agents know when a stable ID is required and how to choose it.
  2. Audit the current workflow-critical docs and list any generated fallback IDs that are being relied on today.
  3. Add explicit IDs to all content/tasks/templates/*.md, preserving existing generated IDs as aliases when the explicit ID differs from the current generated ID.
  4. Add explicit IDs to the Podcast docs referenced by content/tasks/templates/podcast.md and work-engine/scripts/seed-templates.ts where missing.
  5. Update related_docs, sourceDocIds, and instructionDocId values only when needed to point at the canonical stable ID.
  6. Keep source repositories read-only. Do not edit ../dtc-operations, ../datatasks, or ../podcast-assistant.
  7. Do not split content/ into DataTalksClub/dataops-knowledge in this issue. The ADR remains a future boundary decision.
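
Step 3 of the strategy above (keep the old generated ID as an alias when the explicit ID differs) could look roughly like this. It is an illustrative sketch only: with_explicit_id is a hypothetical helper, and the generated ID in the example is a made-up placeholder because the registry's real fallback format is not described in this issue.

```python
def with_explicit_id(explicit_id, generated_id, aliases=()):
    """Return (id, aliases), keeping generated_id resolvable if it differs.

    The canonical ID is never listed as its own alias, and the generated ID
    is appended only once.
    """
    merged = [alias for alias in aliases if alias != explicit_id]
    if generated_id != explicit_id and generated_id not in merged:
        merged.append(generated_id)
    return explicit_id, merged


# "old-generated-id-for-podcast" is a placeholder, not a real fallback ID.
doc_id, doc_aliases = with_explicit_id(
    "task-template.tasks.podcast", "old-generated-id-for-podcast"
)
print(doc_id, doc_aliases)
```

When the explicit ID already equals the generated one, the alias list is returned unchanged, which matches the "only when the explicit ID differs" condition in step 3.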

Validation And Search Impact

The implementation must keep the docs registry and search behavior deterministic:

  • Registry validation must fail loudly for duplicate id values, duplicate aliases, aliases that conflict with another document ID/path, invalid ID syntax, and broken related_docs references.
  • Workflow-critical docs migrated in this issue must report stable_id: true and id_source: frontmatter from the docs API/registry.
  • Search index generation must still succeed after the migration and must index the stable id field for migrated docs.
  • Existing path and alias resolution must keep working for migrated docs.
  • Generated fallback IDs may remain for legacy docs outside this issue's migration batch, but tests should make clear that workflow-critical docs are not allowed to rely on fallback IDs.
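
The validation rules above can be sketched as a single pass over registry records. This is a hedged illustration, not the existing doc_registry.py implementation: the function name validate_registry and the record shape (dicts with path, id, aliases, related_docs) are assumptions, as is the choice to let related_docs resolve through IDs, aliases, or paths.

```python
import re
from collections import defaultdict


def validate_registry(records):
    """Return a list of error strings; an empty list means no problems found."""
    errors = []
    id_paths = defaultdict(list)      # id -> paths that declare it
    alias_paths = defaultdict(list)   # alias -> paths that declare it
    for rec in records:
        if not re.fullmatch(r"[a-z0-9._-]+", rec["id"]):
            errors.append(f"invalid id syntax {rec['id']!r} in {rec['path']}")
        id_paths[rec["id"]].append(rec["path"])
        for alias in rec.get("aliases", []):
            alias_paths[alias].append(rec["path"])
    all_paths = {rec["path"] for rec in records}

    # Duplicate IDs: name every conflicting path so the error is actionable.
    for doc_id, paths in id_paths.items():
        if len(paths) > 1:
            errors.append(f"duplicate id {doc_id!r}: {', '.join(sorted(paths))}")

    for alias, owners in alias_paths.items():
        if len(owners) > 1:
            errors.append(f"duplicate alias {alias!r}: {', '.join(sorted(owners))}")
        for owner in owners:
            # An alias may not shadow another document's ID or current path.
            others = sorted(p for p in id_paths.get(alias, []) if p != owner)
            if alias in all_paths and alias != owner:
                others.append(alias)
            for other in others:
                errors.append(f"alias {alias!r} in {owner} conflicts with {other}")

    # related_docs must point at something the registry can resolve.
    known = set(id_paths) | set(alias_paths) | all_paths
    for rec in records:
        for ref in rec.get("related_docs", []):
            if ref not in known:
                errors.append(f"broken related_docs reference {ref!r} in {rec['path']}")
    return errors


# Hypothetical records; the paths are placeholders.
records = [
    {"path": "content/a.md", "id": "sop.media.podcast.create-podcast-document"},
    {"path": "content/b.md", "id": "sop.media.podcast.create-podcast-document",
     "related_docs": ["task-template.tasks.missing"]},
]
for error in validate_registry(records):
    print(error)
```

Collecting all errors before failing, rather than raising on the first one, is a design choice made here so that a single run reports every conflict; the issue itself only requires that validation fail loudly.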

Acceptance Criteria

  • docs/STRUCTURE.md, docs/sop-format.md, and relevant process/development docs explain the stable-ID requirement for workflow-critical process docs, aliases, source provenance, related_docs, sourceDocIds, instructionDocId, and the future knowledge-repo boundary.
  • All content/tasks/templates/*.md files have explicit stable frontmatter IDs using the task-template.tasks.<workflow> namespace and are covered by tests.
  • Podcast workflow docs referenced by content/tasks/templates/podcast.md and Podcast seed-template instructionDocId values have explicit stable IDs or references are corrected to existing explicit IDs.
  • related_docs, sourceDocIds, and instructionDocId values for the migrated batch resolve through the document registry by stable ID.
  • Registry/search behavior preserves aliases and path resolution for migrated docs, and duplicate/conflicting IDs fail with actionable validation errors.
  • Search index build succeeds and includes stable IDs for the migrated task templates and Podcast process docs.
  • Work-engine template/task tests still prove that sourceDocIds and instructionDocId persist, export, and instantiate correctly.
  • No source repositories outside DataTalksClub/dataops are modified.

Test Scenarios

Scenario: Task template has stable identity

Given: each Markdown file under content/tasks/templates/
When: the docs registry indexes the content tree
Then: each template record has id_source: frontmatter, stable_id: true, doc_type: task-template, and a unique task-template.tasks.<workflow> ID.
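
A test for this scenario might reduce to a check like the one below. It is a sketch under assumptions: find_unstable_templates is a hypothetical helper, and the records are assumed to be plain dicts carrying the id_source, stable_id, doc_type, and id fields named in the scenario.

```python
def find_unstable_templates(records):
    """Return (path, reasons) pairs for template records that miss the contract."""
    problems = []
    seen_ids = set()
    for rec in records:
        reasons = []
        if rec.get("id_source") != "frontmatter":
            reasons.append("id_source is not frontmatter")
        if rec.get("stable_id") is not True:
            reasons.append("stable_id is not true")
        if rec.get("doc_type") != "task-template":
            reasons.append("doc_type is not task-template")
        doc_id = rec.get("id", "")
        if not doc_id.startswith("task-template.tasks."):
            reasons.append("id is outside task-template.tasks.<workflow>")
        if doc_id in seen_ids:
            reasons.append("id is not unique")
        seen_ids.add(doc_id)
        if reasons:
            problems.append((rec.get("path"), reasons))
    return problems
```

An empty result means every indexed template satisfies the scenario; a non-empty result lists the offending paths with the reasons, which keeps the test failure actionable.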

Scenario: Podcast task opens process doc by ID

Given: a Podcast task definition with instructionDocId: sop.media.podcast.create-podcast-document
When: the frontend or docs API resolves the instruction document
Then: the stable ID resolves to the expected Markdown document, even if that document also has an alias or a generated fallback ID.

Scenario: Duplicate ID fails loudly

Given: two Markdown docs define the same id
When: registry validation or search-index build runs
Then: validation fails with a message naming both conflicting paths.

Scenario: Alias preserves old references

Given: a migrated doc has an alias for its previous generated/path reference
When: /docs/resolve?ref=<alias> or registry resolution is used
Then: it resolves to the canonical document record and reports the canonical id.
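
Alias resolution as described in this scenario can be sketched with one lookup table. This is an assumption-laden illustration rather than the registry's real code: build_lookup and resolve are hypothetical names, the file path in the example is a placeholder, and the sketch presumes the registry has already passed validation so that no two documents claim the same reference.

```python
def build_lookup(records):
    """Map every id, alias, and path to its canonical record."""
    lookup = {}
    for rec in records:
        for ref in (rec["id"], rec["path"], *rec.get("aliases", [])):
            lookup[ref] = rec
    return lookup


def resolve(lookup, ref):
    """Return the canonical record for ref, or None if nothing matches."""
    return lookup.get(ref)


doc = {
    "path": "content/example/create-podcast-document.md",  # placeholder path
    "id": "sop.media.podcast.create-podcast-document",
    "aliases": ["sop.media.podcast.create-a-podcast-document"],
}
lookup = build_lookup([doc])
record = resolve(lookup, "sop.media.podcast.create-a-podcast-document")
print(record["id"])  # sop.media.podcast.create-podcast-document
```

Because every reference maps to the same record object, a caller that resolves by alias or path still sees the canonical id, which is the behavior the scenario asks for.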

Scenario: Search indexes migrated IDs

Given: migrated task-template and Podcast process docs
When: the search index is built
Then: the index build succeeds, migrated docs include their stable id keyword field, and existing search tests still pass.

Scenario: Work-engine keeps process-doc identity

Given: a template with sourceDocIds and task definitions with instructionDocId
When: templates are created, updated, instantiated into tasks, and exported
Then: the stable process-doc IDs are preserved and no code falls back to Google Docs URLs as the only identity for migrated in-repo docs.
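
One way to spot the failure mode in this scenario is to flag task definitions that carry an instructionsUrl but no instructionDocId. The sketch below is illustrative only: url_only_definitions is a hypothetical helper, the task definitions are assumed to be plain dicts, and the check cannot tell whether an in-repo process doc actually exists, so a flagged entry still needs a manual look.

```python
def url_only_definitions(task_definitions):
    """Return definitions whose only process-doc identity is an external URL."""
    return [
        definition
        for definition in task_definitions
        if definition.get("instructionsUrl") and not definition.get("instructionDocId")
    ]


# Hypothetical task definitions; the titles and URL are placeholders.
definitions = [
    {"title": "Create document",
     "instructionDocId": "sop.media.podcast.create-podcast-document",
     "instructionsUrl": "https://docs.google.com/document/d/EXAMPLE"},
    {"title": "Unmapped step",
     "instructionsUrl": "https://docs.google.com/document/d/EXAMPLE"},
]
for definition in url_only_definitions(definitions):
    print(definition["title"])  # Unmapped step
```

A flagged definition is acceptable when the referenced process doc is missing from content/**, since the issue says to keep the external instructionsUrl in that case and record the missing mapping.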

Out of Scope

Dependencies

  • This issue is foundational for "Add internal link and related-doc validation" (#34) and for the workflow mapping issues, because they should rely on stable document identities rather than generated path IDs.
  • No external credentials or human-only checks are expected.
  • If a referenced Podcast process doc does not exist in content/**, document the missing mapping in the implementation notes and keep the current external instructionsUrl; do not invent a document or edit a source repository.

Labels

Use labels: enhancement, docs, process-docs, backend, work-engine, testing, data, P0.

Remove the needs grooming label after this body is applied.

Verification Commands

Run the relevant focused checks during implementation, and the full relevant workflow before handoff:

# Run everything from the repository root.
uv run --project lambda-functions --extra search --with pytest python -m pytest tests/docs_app/test_backend_docs_metadata.py tests/docs_app/test_frontend_routes_and_links.py
uv run --project lambda-functions --extra search --with pytest python -m pytest tests/docs_app
# The subshell keeps the cd from affecting the npm commands below.
(cd lambda-functions && uv run --extra search python -m lambda_functions.build_search_index \
  --docs-dir ../content \
  --output ../.tmp/dataops-content-search.index)
npm --prefix work-engine test
npm --prefix work-engine run typecheck
npm --prefix work-engine run build

Metadata

Assignees

No one assigned

    Labels

    P0 (Must have), backend (Backend/API), data (Data model, migration, storage), docs (Documentation or process docs work), enhancement (New or improved functionality), process-docs (SOPs, templates, references, playbooks), testing (Tests and QA), work-engine (DataTasks task execution engine)
