Scope
Add deterministic validation for internal process-document references so broken documentation links, missing images, invalid related_docs, and stale workflow-template document IDs fail in local checks and CI before they reach the deployed docs portal or work-engine seeds.
This issue builds on #33. #33 defines the stable-ID contract and migrates the first workflow-critical documents. This issue should not redo that migration; it should enforce the contract for current and future changes.
The validator should cover these reference surfaces:
Stable document IDs and aliases exposed by lambda-functions/src/lambda_functions/doc_registry.py.
related_docs frontmatter on indexed Markdown documents.
Wiki-style internal links such as [[doc-id]] and [[doc-id|label]].
doc: references used in Markdown links, for example [Open](doc:sop.media.podcast.create-podcast-document).
Repo-relative Markdown links between local docs, including links from content/**, _docs/**, docs/**, .goal-v1.md, PROJECT_PLAN.md, and PORTAL_ANALYSIS.md when those files are part of process documentation.
Local Markdown image targets under content/** and process-doc paths, including SOP screenshot references.
Task-template document references from content/tasks/templates/*.md frontmatter and workflow template data in work-engine/scripts/seed-templates.ts, especially sourceDocIds and task instructionDocId values.
Search/index behavior that depends on the docs registry, especially stable id keyword fields for indexed content.
Implementation should preserve the current registry behavior that resolves canonical IDs, aliases, paths, and wiki refs. It should add clear validation coverage rather than replacing the registry with ad hoc string checks. Structured parsing is preferred where the codebase already provides it; plain Markdown scanning is acceptable for links/images when it is tested against realistic examples.
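As an illustration of the plain Markdown scanning mentioned above, the sketch below pulls document IDs out of wiki-style refs and doc: links. The regular expressions and the `extract_doc_refs` name are assumptions made for this example; the real registry in doc_registry.py may parse these references differently and should stay the source of truth for resolution.

```python
import re

# Hypothetical patterns for illustration only.
# Matches [[doc-id]] and [[doc-id|label]], capturing the ID part.
WIKI_REF = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")
# Matches [label](doc:doc-id), capturing the ID up to ')' or '#'.
DOC_LINK = re.compile(r"\[[^\]]*\]\(doc:([^)\s#]+)[^)]*\)")


def extract_doc_refs(markdown: str) -> list[str]:
    """Return document IDs referenced via wiki refs or doc: links."""
    refs = [m.group(1).strip() for m in WIKI_REF.finditer(markdown)]
    refs += [m.group(1) for m in DOC_LINK.finditer(markdown)]
    return refs
```

The extracted IDs would then be passed to the registry for resolution rather than compared as raw strings.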
Validation boundaries:
content/** is the current operational-documentation tree and should be registry/search validated.
docs/**, _docs/**, .goal-v1.md, PROJECT_PLAN.md, and PORTAL_ANALYSIS.md are repo/process documentation. They should be checked for broken local Markdown links/images, but they should not be forced into the public content search index unless an existing design already indexes them.
../dtc-operations, ../datatasks, and ../podcast-assistant remain read-only source systems and must not be modified.
The future DataTalksClub/dataops-knowledge split remains out of scope. The validator should be written so roots can be made configurable later, but it should validate the current dataops layout now.
The implementation should expose one local validation command that agents and CI can run directly. A module command under lambda-functions, a small root script, or a documented wrapper is acceptable if it is deterministic and works from a fresh checkout.
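One possible shape for such a command is a small argparse entry point that prints sorted errors and returns a non-zero exit code on failure. This is only a sketch: the `--repo-root` and `--content-root` flags mirror the example command under Verification Commands, while `main`, `parse_args`, and the `check` callback are hypothetical names.

```python
import argparse
from pathlib import Path


def parse_args(argv: list[str]) -> argparse.Namespace:
    parser = argparse.ArgumentParser(prog="validate_docs_links")
    parser.add_argument("--repo-root", type=Path, default=Path("."))
    parser.add_argument("--content-root", type=Path, default=Path("content"))
    return parser.parse_args(argv)


def main(argv: list[str], check=lambda repo_root, content_root: []) -> int:
    """Run `check` (a hypothetical validator callback) and report errors."""
    args = parse_args(argv)
    errors = check(args.repo_root, args.repo_root / args.content_root)
    # Sort so the output is deterministic across runs and machines.
    for message in sorted(errors):
        print(message)
    return 1 if errors else 0
```

Keeping the roots as arguments is one way to leave them configurable later while defaulting to the current dataops layout.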
Acceptance Criteria
A local validation command exists for docs/process-doc link validation and is documented in the relevant local-development or process documentation.
The validator fails with actionable messages for duplicate/invalid stable IDs, duplicate/conflicting aliases, and broken related_docs references. Existing registry errors should remain specific enough to name the offending source path and missing/conflicting reference.
The validator finds broken wiki references, doc: references, and repo-relative Markdown links in content/** and the current process-doc surfaces listed in Scope.
The validator finds missing local Markdown image targets, including SOP screenshots under content/images/**, while ignoring external HTTP(S), mailto:, anchor-only, and intentionally non-local links.
Anchor handling is explicit: either validate same-file/local heading anchors with tests, or document that anchor validation is deferred while still validating the target document path/ID.
content/tasks/templates/*.md references resolve through the document registry by stable ID or accepted alias/path where applicable.
Workflow template sourceDocIds and task instructionDocId values in work-engine/scripts/seed-templates.ts are validated against the registry after #33 (Extend process docs with stable IDs) lands. External instructionsUrl values may remain, but for migrated in-repo process docs they must not be the only identity when a stable ID is present.
Search index build still fails if registry validation fails and still indexes stable id keyword fields for content documents.
.github/workflows/validate-dataops-content.yml runs the new validation for relevant content/process-doc changes, and .github/workflows/deploy-dataops-v1.yml continues to cover the broader app workflow without duplicate brittle logic.
Tests cover positive and negative cases for related_docs, wiki refs, doc: refs, Markdown links, images, task-template refs, and work-engine template document IDs.
Process Curator expectations are documented: when this issue changes SOP/template/reference quality rules or reports real broken process knowledge, a Process Curator review should confirm the messages and remediation path are useful for future agents.
No source repositories outside DataTalksClub/dataops are modified.
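The duplicate-ID and conflicting-alias criterion could be checked along the following lines. This is a minimal sketch that assumes each document is available as a dict with `path`, `id`, and `aliases` keys; that shape and the function name are illustrative and not taken from doc_registry.py.

```python
from collections import defaultdict


def find_identity_conflicts(docs: list[dict]) -> list[str]:
    """Report any ID or alias claimed by more than one source path."""
    owners: dict[str, list[str]] = defaultdict(list)
    for doc in docs:
        for name in [doc["id"], *doc.get("aliases", [])]:
            owners[name].append(doc["path"])
    return [
        f"{name!r} is claimed by multiple documents: {', '.join(sorted(set(paths)))}"
        for name, paths in sorted(owners.items())
        if len(set(paths)) > 1
    ]
```

Each message names both the conflicting reference and every offending source path, which is the level of detail the criteria ask existing registry errors to keep.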
Test Scenarios
Scenario: Related docs still resolve through registry
Given: a content document has related_docs entries using a stable ID, an alias, and a relative Markdown path
When: the validation command runs
Then: all valid references resolve to canonical document records and the command exits successfully.
Scenario: Broken related doc fails clearly
Given: a content document has related_docs: [missing.process.doc]
When: the validation command or search-index build runs
Then: validation fails with a message naming the source Markdown path and missing.process.doc.
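The two related_docs scenarios above can be pictured with a toy registry. `ToyRegistry` is an illustrative stand-in, not the real doc_registry API, and it matches paths by exact string only; relative-path normalization is deliberately left out.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyRegistry:
    """Illustrative stand-in for the document registry."""
    by_id: dict = field(default_factory=dict)    # stable ID -> source path
    aliases: dict = field(default_factory=dict)  # alias -> stable ID

    def resolve(self, ref: str) -> Optional[str]:
        """Return the canonical ID for an ID, alias, or path, else None."""
        if ref in self.by_id:
            return ref
        if ref in self.aliases:
            return self.aliases[ref]
        for doc_id, path in self.by_id.items():
            if path == ref:
                return doc_id
        return None


def check_related_docs(source: str, related: list, registry: ToyRegistry) -> list:
    """Return one message per unresolved reference, naming source and ref."""
    return [
        f"{source}: related_docs reference {ref!r} not found in registry"
        for ref in related
        if registry.resolve(ref) is None
    ]
```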
Scenario: Wiki and doc refs are validated
Given: Markdown body text contains [[sop.media.podcast.create-podcast-document]], [[missing.doc|Missing]], and [Open](doc:missing.doc)
When: the validation command runs
Then: the valid wiki ref passes and both missing refs fail with path and reference details.
Scenario: Repo-relative Markdown link is checked
Given: a process doc links to an existing local Markdown file and another link points at a missing local Markdown file
When: the validation command runs
Then: the existing target passes and the missing target is reported with the source path and link target.
Scenario: Missing SOP image is checked
Given: an SOP screenshot references ../../images/podcast/missing.png
When: the validation command runs
Then: validation fails with a message naming the source SOP and missing image path.
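A hedged sketch of the image check: scan a Markdown file for image syntax, skip targets with a URL scheme, and resolve the rest relative to the source file. The regular expression and the `missing_images` name are assumptions for this example and do not cover every Markdown edge case (reference-style images, for instance).

```python
import re
from pathlib import Path

# Hypothetical pattern: ![alt](target "optional title")
IMAGE = re.compile(r"!\[[^\]]*\]\(([^)\s]+)[^)]*\)")
SCHEME = re.compile(r"[a-zA-Z][a-zA-Z0-9+.-]*:")


def missing_images(source: Path) -> list[str]:
    """Return messages for local image targets that do not exist on disk."""
    errors = []
    for target in IMAGE.findall(source.read_text(encoding="utf-8")):
        if SCHEME.match(target):
            continue  # external URL or other non-local scheme
        relative = target.split("#", 1)[0]
        if not (source.parent / relative).resolve().exists():
            errors.append(f"{source}: missing image {target}")
    return errors
```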
Scenario: External links are not treated as local files
Given: a Markdown file contains links to Google Docs, GitHub, Loom, mailto:, and an anchor-only section link
When: the validation command runs
Then: those links do not require local file existence and do not produce false failures.
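To avoid the false failures described in this scenario, a validator needs some rule for deciding which link targets are local at all. One simple option, sketched here with a hypothetical `is_local_target` helper, is to treat anything with a URL scheme, a protocol-relative prefix, or only a fragment as non-local.

```python
from urllib.parse import urlsplit


def is_local_target(target: str) -> bool:
    """True when a Markdown link target should be checked as a local file."""
    if not target or target.startswith("#"):
        return False  # empty or anchor-only link
    if target.startswith("//"):
        return False  # protocol-relative URL
    # https:, mailto:, doc:, and similar schemes are not local file paths.
    return urlsplit(target).scheme == ""
```

Under this rule doc: references are also classified as non-local, on the assumption that they are resolved through the registry instead of the file system.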
Scenario: Task-template source docs resolve
Given: content/tasks/templates/podcast.md has stable frontmatter metadata and related process docs
When: validation runs after #33
Then: the template's related_docs resolve through the registry and invalid template doc references fail.
Scenario: Work-engine instruction docs resolve
Given: work-engine/scripts/seed-templates.ts contains sourceDocIds and task instructionDocId values
When: validation runs after #33
Then: each in-repo stable document ID resolves through the registry, and missing IDs fail before templates are seeded.
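Because this check must not require a live service, one option is to read the template source as text and extract the IDs statically. The sketch below assumes seed-templates.ts writes `sourceDocIds` as an inline array of string literals and `instructionDocId` as a single string literal; the actual file layout is not shown in this issue, so the patterns are hypothetical and a structured TypeScript parse may be preferable.

```python
import re

# Hypothetical patterns; they assume simple inline string literals.
SOURCE_DOC_IDS = re.compile(r"sourceDocIds\s*:\s*\[([^\]]*)\]")
INSTRUCTION_DOC_ID = re.compile(r"instructionDocId\s*:\s*(['\"])([^'\"]+)\1")
STRING_LITERAL = re.compile(r"(['\"])([^'\"]+)\1")


def extract_template_doc_ids(ts_source: str) -> set[str]:
    """Collect document IDs referenced by workflow templates."""
    ids: set[str] = set()
    for match in SOURCE_DOC_IDS.finditer(ts_source):
        ids.update(s.group(2) for s in STRING_LITERAL.finditer(match.group(1)))
    ids.update(m.group(2) for m in INSTRUCTION_DOC_ID.finditer(ts_source))
    return ids
```

Each extracted ID would then be resolved through the registry, and any miss reported before templates are seeded.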
Scenario: Search build respects registry failures
Given: the content tree contains a broken stable-ID reference
When: lambda_functions.build_search_index runs
Then: the build fails before writing a usable index, preserving the existing registry-first search behavior.
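The registry-first behavior in this scenario amounts to validating before any output is written. The sketch below shows that ordering with a JSON file standing in for the real index format; `build_index` and the `validate` callback are hypothetical and do not describe how lambda_functions.build_search_index is actually implemented.

```python
import json
import os
import tempfile
from pathlib import Path


def build_index(docs: list, output: Path, validate) -> None:
    """Write the index only if `validate` reports no errors."""
    errors = validate(docs)
    if errors:
        # Fail before touching the output path at all.
        raise ValueError("registry validation failed:\n" + "\n".join(errors))
    # Write to a temp file, then rename, so readers never see a partial index.
    fd, tmp_name = tempfile.mkstemp(dir=output.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(docs, handle)
    os.replace(tmp_name, output)
```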
Dependencies
The implementation should reuse or extend lambda-functions/src/lambda_functions/doc_registry.py, lambda-functions/src/lambda_functions/docs_index.py, and existing docs-app tests under tests/docs_app/ where possible.
Work-engine reference validation must inspect the same template source that seeds runtime templates, currently work-engine/scripts/seed-templates.ts, and should not require a live DynamoDB or deployed service.
CI changes should account for the existing path filters in .github/workflows/validate-dataops-content.yml and .github/workflows/deploy-dataops-v1.yml.
Labels
Use labels: enhancement, docs, process-docs, backend, work-engine, testing, data, P1.
Remove needs grooming after this body is applied.
Add internal link and related-doc validation
Status: pending
Tags: enhancement, docs, process-docs, backend, work-engine, testing, data, P1
Depends on: #33
Blocks: None
Out of Scope
[...] content/**.
[...] content/ into DataTalksClub/dataops-knowledge.
[...] ../dtc-operations, ../datatasks, or ../podcast-assistant.
[...] [HUMAN] in a separate issue.
Verification Commands
Run the focused docs validation command added by this issue. The exact command may differ if the implementer chooses a different module name, but the issue is not complete until a documented command equivalent to this exists:

    uv run --project lambda-functions --extra search python -m lambda_functions.validate_docs_links \
      --repo-root . \
      --content-root content

Run docs app tests:
Build the search index:

    cd lambda-functions
    uv run --extra search python -m lambda_functions.build_search_index \
      --docs-dir ../content \
      --output ../.tmp/dataops-content-search.index

Run work-engine checks because workflow-template sourceDocIds and instructionDocId validation touches template seed behavior:

    npm --prefix work-engine test
    npm --prefix work-engine run typecheck
    npm --prefix work-engine run build

Before handoff, include: