Add dataops-knowledge migration scaffold and template validation #59

Description

@alexeygrigorev

Status: pending
Tags: enhancement, docs, migration, process-docs, testing, data, P1
Depends on: #47 (closed)
Blocks: Future repository creation, content migration, template conversion, template sync/loading, portal edit commits, and deployed cache refresh issues.

Scope

Build the first safe implementation slice for the future DataTalksClub/dataops-knowledge repository without creating that repository, moving production content, or changing the operator-facing read path.

This issue should add a migration scaffold and validation contract inside DataTalksClub/dataops so later agents have an exact, testable target for the knowledge repository layout and workflow-template YAML migration. The deployed portal, work-engine, and operator UX must continue to use the existing in-repo content/ fallback during this slice.

Expected implementation shape:

  • Add a repo-template/scaffold directory for the future dataops-knowledge repository, using a clear name such as templates/dataops-knowledge/, scaffolds/dataops-knowledge/, or another project-consistent path.
  • The scaffold must represent the target layout from docs/decisions/dataops-knowledge-repository.md: content/, workflow-templates/, assistant-prompts/, assistant-process/, examples/, images/, indexes/, schemas/, scripts/, and tests/.
  • Include concise README/guidance in the scaffold stating that it is a migration target only and is not yet the live source for the portal or work-engine.
  • Add strict validation assets for the future workflow-template YAML format, preferably a JSON Schema under the scaffold's schemas/ directory.
  • Add a machine-readable migration inventory/manifest that maps every current content/tasks/templates/*.md file to its future workflow-templates/*.yaml target and stable template ID. The manifest must cover exactly the current 11 templates: book-of-the-week, course, maven-ll, newsletter, office-hours, oss, podcast, social-media, tax-report, webinar, and workshop.
  • Add a validation command in the existing DataOps validation/tooling stack, preferably under lambda-functions/src/lambda_functions/, that checks the scaffold, schema, and migration manifest from the current repo. It should fail on missing scaffold directories, invalid schema JSON, missing or duplicate template mappings, missing current Markdown template files, missing stable IDs, non-task-template doc types for current templates, and target filenames outside workflow-templates/*.yaml.
  • Add focused automated tests for the new validator, including at least one passing fixture and failure coverage for missing template mappings, duplicate target IDs or paths, and invalid target path/schema shape.
  • Wire the new validation into the appropriate content/planning CI path so future edits to the scaffold, manifest, schemas, task templates, or validator are checked automatically.
  • Preserve the current operator experience: no deployed portal route, docs search, work-engine seed, runtime task creation, or portal edit behavior should start reading from the scaffold or future repo in this slice.
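
The manifest checks described above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the required implementation: the manifest shape (a list of entries with source, target, and template_id keys) and the function name validate_manifest are hypothetical. Only the 11 template names and the two path patterns come from this issue.

```python
import re
from collections import Counter

# The 11 current templates named in this issue.
EXPECTED_TEMPLATES = [
    "book-of-the-week", "course", "maven-ll", "newsletter", "office-hours",
    "oss", "podcast", "social-media", "tax-report", "webinar", "workshop",
]

# Targets must sit directly under workflow-templates/ and end in .yaml.
TARGET_PATTERN = re.compile(r"^workflow-templates/[^/]+\.yaml$")


def validate_manifest(entries, existing_sources):
    """Return actionable error messages; an empty list means the manifest is valid.

    entries: list of dicts with the hypothetical keys source, target, template_id.
    existing_sources: set of Markdown paths that exist in the repository.
    """
    errors = []
    expected = {f"content/tasks/templates/{name}.md" for name in EXPECTED_TEMPLATES}
    sources = [entry.get("source", "") for entry in entries]

    for source in sorted(expected - set(sources)):
        errors.append(f"missing mapping for {source}")
    for source in sorted(expected - set(existing_sources)):
        errors.append(f"source file not found: {source}")
    for source in sorted(set(sources) - expected):
        errors.append(f"unexpected source in manifest: {source!r}")

    # Sources, targets, and stable IDs must each be unique.
    for key in ("source", "target", "template_id"):
        counts = Counter(entry.get(key, "") for entry in entries)
        for value, count in sorted(counts.items()):
            if value and count > 1:
                errors.append(f"duplicate {key}: {value}")

    for entry in entries:
        if not entry.get("template_id"):
            errors.append(f"missing stable template_id for {entry.get('source', '')!r}")
        if not TARGET_PATTERN.match(entry.get("target", "")):
            errors.append(f"invalid target path: {entry.get('target', '')!r}")
    return errors
```

Returning a list of messages instead of raising on the first problem lets one run report every broken mapping at once. The real stable IDs may differ from the file stems, so the manifest should state them explicitly rather than derive them.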

Process Curator requirements:

  • Keep stable document IDs as the migration boundary. File paths may change later; current task-template IDs and future template IDs must remain explicit and testable.
  • Keep content/tasks/templates/*.md as the transitional canonical source for this issue.
  • Do not copy SOPs, images, prompts, assistant process files, examples, or generated indexes into the scaffold except for placeholder README files, .gitkeep-style placeholders, schemas, tests, scripts, or minimal synthetic fixtures needed to validate the scaffold.
  • If the implementation includes an example workflow-template YAML file, it must be clearly synthetic or fixture-only unless the issue explicitly explains why copying production template content is safe. Production template conversion belongs to a later issue.

Acceptance Criteria

  • A future dataops-knowledge scaffold exists in dataops with the target top-level directories from the accepted ADR and guidance that the scaffold is not yet live production content.
  • The scaffold includes a strict workflow-template schema covering the core contract from the ADR: stable id, runtime type, name, schema_version, trigger model, bundle links, phases/stage mapping, tasks with stable task IDs, scheduling offsets or rules, required proof/link declarations, default assignee references by stable role/user ID, instruction_doc_id/source document references, and optional migration source metadata.
  • A machine-readable migration manifest maps exactly all current content/tasks/templates/*.md files to future workflow-templates/*.yaml targets with stable IDs, with no duplicates or missing current files.
  • A repository validation command checks the scaffold, schema, and manifest and exits non-zero with actionable messages for missing directories, invalid JSON schema, missing/duplicate mappings, missing Markdown template files, invalid template frontmatter, and invalid target YAML paths.
  • Automated tests cover the validator success path and the required failure cases without depending on production secrets, GitHub writes, or the future external repository.
  • CI validates the new scaffold/schema/manifest/validator on relevant path changes.
  • Current portal/content behavior remains unchanged: Lambda/docs search still reads from content/, work-engine runtime templates are not loaded from the scaffold, and portal edit commits are not redirected to a new repository.
  • No files are moved out of content/, assistants/, work-engine/, or source repos outside dataops.
  • No DataTalksClub/dataops-knowledge repository is created and no GitHub write/token/branch-protection automation is added in this slice.
  • Source repos outside dataops, including ../dtc-operations, ../datatasks, and ../podcast-assistant, remain read-only.
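
To make the schema criterion more concrete, here is a hypothetical draft of a strict workflow-template schema, written as a Python dict that can be dumped to JSON. Only schema_version and instruction_doc_id are spelled as identifiers in this issue. Every other property name (type, trigger, bundle_links, phases, offset_days, and so on) is an illustrative assumption, and the ADR remains the authority on the actual contract.

```python
import json

# Hypothetical schema draft; property names other than schema_version and
# instruction_doc_id are assumptions made for illustration.
WORKFLOW_TEMPLATE_SCHEMA = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "type": "object",
    "additionalProperties": False,  # strict: unknown top-level keys are rejected
    "required": ["id", "type", "name", "schema_version", "trigger", "phases", "tasks"],
    "properties": {
        "id": {"type": "string", "minLength": 1},
        "type": {"type": "string"},
        "name": {"type": "string", "minLength": 1},
        "schema_version": {"type": "integer", "minimum": 1},
        "trigger": {"type": "object"},
        "bundle_links": {"type": "array", "items": {"type": "object"}},
        "phases": {"type": "array", "items": {"type": "object"}},
        "tasks": {
            "type": "array",
            "minItems": 1,
            "items": {
                "type": "object",
                "additionalProperties": False,
                "required": ["id", "title"],
                "properties": {
                    "id": {"type": "string", "minLength": 1},
                    "title": {"type": "string"},
                    "phase": {"type": "string"},
                    "offset_days": {"type": "integer"},
                    "required_links": {"type": "array", "items": {"type": "string"}},
                    "default_assignee_id": {"type": "string"},
                    "instruction_doc_id": {"type": "string"},
                },
            },
        },
        "migration_source": {"type": "object"},
    },
}


def missing_required(document, schema=WORKFLOW_TEMPLATE_SCHEMA):
    """List required top-level keys absent from a candidate template."""
    return [key for key in schema["required"] if key not in document]


# The schema must serialize as valid JSON before it is written under schemas/.
schema_text = json.dumps(WORKFLOW_TEMPLATE_SCHEMA, indent=2)
```

The helper only checks required top-level keys. Enforcing types, nested task fields, and additionalProperties would need a real JSON Schema validator in the validation command.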

Test Scenarios

Scenario: Scaffold defines the future repository without changing runtime behavior

Given the accepted ADR and the current content/ fallback
When the validation command runs against the repository
Then it confirms the future layout, schemas, and migration manifest while the portal and work-engine still use the current in-repo content paths.
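
The layout part of this scenario might be checked along these lines. The directory names are the ones listed in this issue; the function name missing_scaffold_dirs and the idea of passing the scaffold root as an argument are assumptions, since the scaffold path has not been chosen yet.

```python
import tempfile
from pathlib import Path

# Target top-level directories listed in this issue.
REQUIRED_DIRS = [
    "content", "workflow-templates", "assistant-prompts", "assistant-process",
    "examples", "images", "indexes", "schemas", "scripts", "tests",
]


def missing_scaffold_dirs(scaffold_root):
    """Return the required directories that are absent under scaffold_root."""
    root = Path(scaffold_root)
    return [name for name in REQUIRED_DIRS if not (root / name).is_dir()]


# Demonstration against a throwaway directory, not the real repository.
with tempfile.TemporaryDirectory() as tmp:
    for name in REQUIRED_DIRS[:-1]:  # deliberately skip "tests"
        (Path(tmp) / name).mkdir()
    print(missing_scaffold_dirs(tmp))  # prints ['tests']
```

The check only reads the filesystem under the scaffold root, so it cannot affect the paths the portal and work-engine use at runtime.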

Scenario: Current Markdown templates are fully inventoried

Given the 11 current files under content/tasks/templates/
When the migration manifest is validated
Then every current file is mapped to one unique workflow-templates/*.yaml target and one stable template ID, with no missing or duplicate mappings.

Scenario: Bad migration mappings fail clearly

Given a fixture manifest with a missing template, duplicate target path or ID, invalid target path, or missing source Markdown file
When the validator runs in tests
Then it exits with a useful error that tells the next agent what to fix.
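
One possible way to turn collected problems into the non-zero exit this scenario expects is sketched below. The --repo-root flag mirrors the existing validate_docs_links invocation quoted in this issue, but its use here is an assumption, as are the names main and collect_errors. The placeholder check stands in for the real scaffold, schema, and manifest checks.

```python
import argparse
import sys


def collect_errors(repo_root):
    """Placeholder for the real scaffold, schema, and manifest checks."""
    return []


def main(argv=None, check=collect_errors):
    """Run the checks and return a process exit code (0 = valid, 1 = errors)."""
    parser = argparse.ArgumentParser(description="Validate the knowledge scaffold (sketch).")
    parser.add_argument("--repo-root", default=".")
    args = parser.parse_args(argv)

    errors = check(args.repo_root)
    for message in errors:
        # One line per problem so the next agent can see what to fix.
        print(f"ERROR: {message}", file=sys.stderr)
    return 1 if errors else 0

# A real module would end by passing the result to sys.exit, e.g. sys.exit(main()).
```

Injecting the check function keeps the exit-code behavior testable with fixture errors and without touching the real repository.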

Scenario: Workflow-template schema is enforceable before conversion

Given a minimal valid fixture workflow template and invalid fixtures missing required fields
When schema validation runs in tests
Then the valid fixture passes and invalid fixtures fail before any production template conversion begins.

Scenario: Operator UX stays unified during the transition

Given the scaffold exists in the repo
When an operator opens the current portal or a workflow task references process docs
Then the app continues resolving docs from content/ and existing stable document IDs; the scaffold is not exposed as a disconnected second operator tool.

Required Verification

Software Engineer should run and report exact commands and exit codes for:

  • git diff --check
  • Focused tests for the new validator, for example uv run --project lambda-functions --extra search --with pytest python -m pytest tests/docs_app/test_validate_knowledge_repo.py
  • The new validation command against the real repository, using its documented CLI invocation.
  • Existing docs/content link validation: uv run --project lambda-functions --extra search python -m lambda_functions.validate_docs_links --repo-root . --content-root content
  • Search-index build to prove current content remains readable: cd lambda-functions && uv run --extra search python -m lambda_functions.build_search_index --docs-dir ../content --output ../.tmp/dataops-content-search.index
  • Full docs-app test workflow unless the implementation only adds isolated schema files and a narrowly scoped validator; if narrowed, explain why the focused validator tests plus content validation prove the acceptance criteria.

Tester should rerun the relevant verification independently. Screenshots are not required unless the implementation changes portal UI, routes, or rendered content.
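
Reporting exact commands with their exit codes could be scripted roughly as below. This is a convenience sketch, not part of the required deliverable: the helper name run_and_report is hypothetical, and the command list is intentionally partial. The search-index step changes directory first, so it would need the cwd argument rather than a shell "cd".

```python
import shlex
import subprocess
import sys


def run_and_report(commands, cwd=None):
    """Run each command, print it with its exit code, and return the results."""
    results = []
    for command in commands:
        completed = subprocess.run(shlex.split(command), cwd=cwd)
        print(f"exit {completed.returncode}: {command}")
        results.append((command, completed.returncode))
    return results


# Commands quoted from the list above; run them from the repository root.
COMMANDS = [
    "git diff --check",
    "uv run --project lambda-functions --extra search python -m "
    "lambda_functions.validate_docs_links --repo-root . --content-root content",
]
```

Calling run_and_report(COMMANDS) from the repository root prints one "exit N: command" line per step, which can be pasted into the verification report as-is.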

Out of Scope

  • Creating DataTalksClub/dataops-knowledge or changing GitHub repository settings, branch protection, repository visibility, secrets, or tokens.
  • Moving, copying, or deleting production SOPs, images, prompts, assistant process files, examples, generated indexes, or task templates into a new canonical location.
  • Converting the 11 Markdown task templates into production canonical YAML files.
  • Implementing work-engine template loading/sync from Git-backed YAML, template version tracking, or source commit tracking.
  • Changing Lambda/portal configuration to read from dataops-knowledge.
  • Implementing portal edit commits to the knowledge repository.
  • Refreshing deployed portal cache/search index from an external knowledge repository.
  • Performing the data-safety review for content, images, prompts, assistant knowledge, examples, or private artifacts.
  • Migrating runtime state, DynamoDB data, assistant outputs, recordings, transcripts, invoices, receipts, sponsor/client records, or other private/bulky files.
  • Modifying ../dtc-operations, ../datatasks, or ../podcast-assistant.

Dependencies

  • Issue #47, "Decide separate repository strategy for operations documents", is the accepted decision baseline. Follow docs/decisions/dataops-knowledge-repository.md, especially the rule to keep content/ in dataops until read/sync/edit/refresh support exists.
  • Use docs/repository-structure-recommendation.md, docs/STRUCTURE.md, and docs/README.md for current content structure, frontmatter, stable ID, and repo-meta conventions.
  • The next implementation owner is Software Engineer.
  • Later issues must handle repository creation, data-safety review, production YAML conversion, work-engine sync/load, portal config, portal edit commits, external-repo CI, deployed cache refresh, assistant knowledge migration, and private artifact handling.
