Add internal link and related-doc validation #34

@alexeygrigorev

Status: pending
Tags: enhancement, docs, process-docs, backend, work-engine, testing, data, P1
Depends on: #33
Blocks: None

Scope

Add deterministic validation for internal process-document references so broken documentation links, missing images, invalid related_docs, and stale workflow-template document IDs fail in local checks and CI before they reach the deployed docs portal or work-engine seeds.

This issue builds on #33. #33 defines the stable-ID contract and migrates the first workflow-critical documents. This issue should not redo that migration; it should enforce the contract for current and future changes.

The validator should cover these reference surfaces:

  • Stable document IDs and aliases exposed by lambda-functions/src/lambda_functions/doc_registry.py.
  • related_docs frontmatter on indexed Markdown documents.
  • Wiki-style internal links such as [[doc-id]] and [[doc-id|label]].
  • doc: references used in Markdown links, for example [Open](doc:sop.media.podcast.create-podcast-document).
  • Repo-relative Markdown links between local docs, including links from content/**, _docs/**, docs/**, .goal-v1.md, PROJECT_PLAN.md, and PORTAL_ANALYSIS.md when those files are part of process documentation.
  • Local Markdown image targets under content/** and process-doc paths, including SOP screenshot references.
  • Task-template document references from content/tasks/templates/*.md frontmatter and workflow template data in work-engine/scripts/seed-templates.ts, especially sourceDocIds and task instructionDocId values.
  • Search/index behavior that depends on the docs registry, especially stable id keyword fields for indexed content.

Implementation should preserve the current registry behavior that resolves canonical IDs, aliases, paths, and wiki refs. It should add clear validation coverage rather than replacing the registry with ad hoc string checks. Structured parsing is preferred where the codebase already provides it; plain Markdown scanning is acceptable for links/images when it is tested against realistic examples.
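
To make the duplicate-ID and conflicting-alias checks concrete, here is a minimal in-memory sketch. It is not the API of doc_registry.py: the DocRecord class, the build_lookup function, and the message format are all hypothetical and only illustrate how canonical IDs and aliases could share one lookup table, so that a conflict is reported with both source paths.

```python
from dataclasses import dataclass, field


@dataclass
class DocRecord:
    """Hypothetical record shape; the real registry may store more fields."""
    doc_id: str
    path: str
    aliases: list[str] = field(default_factory=list)


def build_lookup(records: list[DocRecord]) -> tuple[dict[str, DocRecord], list[str]]:
    """Map every stable ID and alias to its record, collecting conflicts."""
    lookup: dict[str, DocRecord] = {}
    errors: list[str] = []
    for rec in records:
        for key in [rec.doc_id, *rec.aliases]:
            other = lookup.get(key)
            if other is not None and other is not rec:
                # Name both paths so the message is actionable.
                errors.append(f"{rec.path}: '{key}' conflicts with {other.path}")
            else:
                lookup[key] = rec
    return lookup, errors
```

The first record to claim a key keeps it, so the error always names the later file as the offender. A real implementation would probably reuse the registry's existing error types instead of plain strings.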

Validation boundaries:

  • content/** is the current operational-documentation tree and should be registry/search validated.
  • docs/**, _docs/**, .goal-v1.md, PROJECT_PLAN.md, and PORTAL_ANALYSIS.md are repo/process documentation. They should be checked for broken local Markdown links/images, but they should not be forced into the public content search index unless an existing design already indexes them.
  • ../dtc-operations, ../datatasks, and ../podcast-assistant remain read-only source systems and must not be modified.
  • The future DataTalksClub/dataops-knowledge split remains out of scope. The validator should be written so roots can be made configurable later, but it should validate the current dataops layout now.

The implementation should expose one local validation command that agents and CI can run directly. A module command under lambda-functions, a small root script, or a documented wrapper is acceptable if it is deterministic and works from a fresh checkout.

Acceptance Criteria

  • A local validation command exists for docs/process-doc link validation and is documented in the relevant local-development or process documentation.
  • The validator fails with actionable messages for duplicate/invalid stable IDs, duplicate/conflicting aliases, and broken related_docs references. Existing registry errors should remain specific enough to name the offending source path and missing/conflicting reference.
  • The validator finds broken wiki references, doc: references, and repo-relative Markdown links in content/** and the current process-doc surfaces listed in Scope.
  • The validator finds missing local Markdown image targets, including SOP screenshots under content/images/**, while ignoring external HTTP(S), mailto:, anchor-only, and intentionally non-local links.
  • Anchor handling is explicit: either validate same-file/local heading anchors with tests, or document that anchor validation is deferred while still validating the target document path/ID.
  • content/tasks/templates/*.md references resolve through the document registry by stable ID or accepted alias/path where applicable.
  • Workflow template sourceDocIds and task instructionDocId values in work-engine/scripts/seed-templates.ts are validated against the registry after #33 (Extend process docs with stable IDs) lands. External instructionsUrl values may remain, but an instructionsUrl must not be the only identifier for a migrated in-repo process doc that has a stable ID.
  • Search index build still fails if registry validation fails and still indexes stable id keyword fields for content documents.
  • .github/workflows/validate-dataops-content.yml runs the new validation for relevant content/process-doc changes, and .github/workflows/deploy-dataops-v1.yml continues to cover the broader app workflow without duplicate brittle logic.
  • Tests cover positive and negative cases for related_docs, wiki refs, doc: refs, Markdown links, images, task-template refs, and work-engine template document IDs.
  • Process Curator expectations are documented: when this issue changes SOP/template/reference quality rules or reports real broken process knowledge, a Process Curator review should confirm the messages and remediation path are useful for future agents.
  • No source repositories outside DataTalksClub/dataops are modified.

Test Scenarios

Scenario: Related docs still resolve through registry

Given: a content document has related_docs entries using a stable ID, an alias, and a relative Markdown path
When: the validation command runs
Then: all valid references resolve to canonical document records and the command exits successfully.

Scenario: Broken related doc fails clearly

Given: a content document has related_docs: [missing.process.doc]
When: the validation command or search-index build runs
Then: validation fails with a message naming the source Markdown path and missing.process.doc.
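
A minimal sketch of the failure message this scenario expects, assuming the frontmatter has already been parsed into a dict and that known_refs holds every stable ID, alias, and path the registry accepts. Both the function name and the message wording are hypothetical.

```python
def check_related_docs(
    source_path: str, frontmatter: dict, known_refs: set[str]
) -> list[str]:
    """Return one error per related_docs entry that does not resolve."""
    related = frontmatter.get("related_docs") or []
    if isinstance(related, str):
        # Tolerate a single scalar value instead of a list.
        related = [related]
    return [
        f"{source_path}: related_docs entry '{ref}' does not resolve"
        for ref in related
        if ref not in known_refs
    ]
```

For example, a document at content/sops/a.md with related_docs: [missing.process.doc] would yield one message containing both the path and missing.process.doc.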

Scenario: Wiki and doc refs are validated

Given: Markdown body text contains [[sop.media.podcast.create-podcast-document]], [[missing.doc|Missing]], and [Open](doc:missing.doc)
When: the validation command runs
Then: the valid wiki ref passes and both missing refs fail with path and reference details.
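
The extraction step for this scenario could be sketched with two regular expressions. The patterns are assumptions about the reference syntax described in Scope, not the parsing that the registry actually performs, and they ignore edge cases such as references inside fenced code blocks.

```python
import re

# [[doc-id]] or [[doc-id|label]]; the label is discarded.
WIKI_REF = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")
# [label](doc:doc-id), stopping at ")", whitespace, or a "#" anchor.
DOC_REF = re.compile(r"\]\(doc:([^)\s#]+)")


def extract_internal_refs(markdown: str) -> list[tuple[str, str]]:
    """Return (kind, doc_id) pairs for wiki refs, then doc: refs."""
    refs = [("wiki", m.group(1).strip()) for m in WIKI_REF.finditer(markdown)]
    refs += [("doc", m.group(1).strip()) for m in DOC_REF.finditer(markdown)]
    return refs
```

Each extracted ID would then be resolved through the registry, and unresolved IDs reported together with the source path.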

Scenario: Repo-relative Markdown link is checked

Given: a process doc links to an existing local Markdown file and another link points at a missing local Markdown file
When: the validation command runs
Then: the existing target passes and the missing target is reported with the source path and link target.

Scenario: Missing SOP image is checked

Given: an SOP screenshot references ../../images/podcast/missing.png
When: the validation command runs
Then: validation fails with a message naming the source SOP and missing image path.
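
A sketch of how the image check might resolve targets relative to the source file. The regular expression is a simplification that handles only inline image syntax, and the function name and message format are hypothetical.

```python
import re
from pathlib import Path

# Inline image syntax ![alt](target "optional title"); captures the target.
IMAGE = re.compile(r"!\[[^\]]*\]\(([^)\s]+)")
# Anything with a URL scheme (https:, data:, ...) is not a local file.
HAS_SCHEME = re.compile(r"[a-zA-Z][a-zA-Z0-9+.-]*:")


def missing_images(source: Path, repo_root: Path) -> list[str]:
    """Report image targets that do not exist relative to the source file."""
    errors: list[str] = []
    text = source.read_text(encoding="utf-8")
    for match in IMAGE.finditer(text):
        target = match.group(1)
        if HAS_SCHEME.match(target):
            continue
        resolved = (source.parent / target).resolve()
        if not resolved.is_file():
            rel = source.relative_to(repo_root).as_posix()
            errors.append(f"{rel}: missing image {target}")
    return errors
```

Resolving against source.parent matches how a relative target such as ../../images/podcast/missing.png is interpreted when the Markdown is rendered from its own directory.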

Scenario: External links are not treated as local files

Given: a Markdown file contains links to Google Docs, GitHub, Loom, mailto:, and an anchor-only section link
When: the validation command runs
Then: those links do not require local file existence and do not produce false failures.
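
The distinction between local and non-local targets can be sketched with urllib.parse.urlsplit from the standard library. The is_local_target helper is a hypothetical name; it treats any target with a scheme or network location as non-local, which also excludes doc: references (those would go through the registry instead).

```python
from urllib.parse import urlsplit


def is_local_target(target: str) -> bool:
    """True when a link target should be checked as a file in the repo."""
    target = target.strip()
    if not target or target.startswith("#"):
        # Empty or anchor-only links never point at another file.
        return False
    parts = urlsplit(target)
    # https:, mailto:, doc:, and protocol-relative //host links are skipped.
    return parts.scheme == "" and parts.netloc == ""
```

A target such as ../sops/a.md#step still counts as local; the fragment would be stripped before the path is checked, whether or not anchors themselves are validated.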

Scenario: Task-template source docs resolve

Given: content/tasks/templates/podcast.md has stable frontmatter metadata and related process docs
When: validation runs after #33
Then: the template's related_docs resolve through the registry and invalid template doc references fail.

Scenario: Work-engine instruction docs resolve

Given: work-engine/scripts/seed-templates.ts contains sourceDocIds and task instructionDocId values
When: validation runs after #33
Then: each in-repo stable document ID resolves through the registry, and missing IDs fail before templates are seeded.
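
If the validator stays in Python, one low-dependency option is to scan the TypeScript seed file as text. The sketch below assumes the IDs appear as quoted string literals next to the sourceDocIds and instructionDocId keys. The actual structure of seed-templates.ts is not shown in this issue, so the patterns are assumptions and would miss IDs built from constants or imports.

```python
import re

INSTRUCTION_ID = re.compile(r"instructionDocId\s*:\s*['\"]([^'\"]+)['\"]")
SOURCE_IDS = re.compile(r"sourceDocIds\s*:\s*\[([^\]]*)\]")
QUOTED = re.compile(r"['\"]([^'\"]+)['\"]")


def template_doc_ids(ts_source: str) -> set[str]:
    """Collect document IDs referenced as literals in the seed source."""
    ids = set(INSTRUCTION_ID.findall(ts_source))
    for array_body in SOURCE_IDS.findall(ts_source):
        # Each array body may hold several quoted IDs.
        ids.update(QUOTED.findall(array_body))
    return ids
```

Every collected ID would then be resolved through the registry. A text scan keeps the check independent of DynamoDB and of any deployed service, at the cost of being less robust than evaluating the template module in the work-engine test suite.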

Scenario: Search build respects registry failures

Given: the content tree contains a broken stable-ID reference
When: lambda_functions.build_search_index runs
Then: the build fails before writing a usable index, preserving the existing registry-first search behavior.

Out of Scope

  • Migrating stable IDs onto task templates or Podcast process docs. That belongs to #33 (Extend process docs with stable IDs).
  • Adding stable IDs to every imported SOP, reference, playbook, template, prompt, or archive under content/**.
  • Refactoring all Markdown rendering or building a new docs frontend.
  • Validating external HTTP(S) availability, Google Docs permissions, Loom links, GitHub issue URLs, or other network resources.
  • Moving content/ into DataTalksClub/dataops-knowledge.
  • Editing ../dtc-operations, ../datatasks, or ../podcast-assistant.
  • Changing production DynamoDB data, deployed docs cache state, or external process documents.
  • Requiring human-only checks for this validation; any real external-account verification should be marked [HUMAN] in a separate issue.

Dependencies

  • #33 (Extend process docs with stable IDs) must land first. Otherwise, this issue must explicitly handle partially migrated docs by downgrading failures to warnings wherever #33 has not yet created stable IDs.
  • The implementation should reuse or extend lambda-functions/src/lambda_functions/doc_registry.py, lambda-functions/src/lambda_functions/docs_index.py, and existing docs-app tests under tests/docs_app/ where possible.
  • Work-engine reference validation must inspect the same template source that seeds runtime templates, currently work-engine/scripts/seed-templates.ts, and should not require a live DynamoDB or deployed service.
  • CI changes should account for the existing path filters in .github/workflows/validate-dataops-content.yml and .github/workflows/deploy-dataops-v1.yml.

Labels

Use labels: enhancement, docs, process-docs, backend, work-engine, testing, data, P1.

Remove the needs grooming label after this body is applied.

Verification Commands

Run the focused docs validation command added by this issue. The exact command may differ if the implementer chooses a different module name, but the issue is not complete until a documented command equivalent to this exists:

uv run --project lambda-functions --extra search python -m lambda_functions.validate_docs_links \
  --repo-root . \
  --content-root content

Run docs app tests:

uv run --project lambda-functions --extra search --with pytest python -m pytest tests/docs_app

Build the search index:

cd lambda-functions
uv run --extra search python -m lambda_functions.build_search_index \
  --docs-dir ../content \
  --output ../.tmp/dataops-content-search.index

Run work-engine checks because workflow-template sourceDocIds and instructionDocId validation touches template seed behavior:

npm --prefix work-engine test
npm --prefix work-engine run typecheck
npm --prefix work-engine run build

Before handoff, include:

git diff --check
