Hydrate content images in production docs cache #75

Description

@alexeygrigorev

Status: pending
Tags: bug, backend, portal, process-docs, infra, P0
Depends on: None
Blocks: None

Scope

Production Process Quality reports broken-asset-reference warnings for SOP screenshots used by seeded daily-task documentation even though the same Markdown/image references pass local validation and the referenced files exist under content/images.

Fix the production docs cache hydration path so Lambda has the non-Markdown content assets needed to validate and render process docs consistently with the repository. The primary suspected area is lambda-functions/src/lambda_functions/github_store.py: GitHubStore.sync_markdown() calls download_markdown_tarball(), whose tarball loop currently keeps only repo paths that start with content/ and end with .md. In production, lambda-functions/src/lambda_functions/full_app_handler.py sets CONTENT_ROOT to /tmp/dataops/content, and GET /docs/process-quality builds the report from that cache via api_handler.get_process_quality() and process_quality.build_report(CONTENT_ROOT.parent, CONTENT_ROOT). Because content/images/** is not hydrated, process-quality sees missing local image targets.

The implementation should make the Lambda cache include process-doc image assets required by local Markdown references while keeping the cache bounded to safe repo content paths. The fix should cover both cold sync_markdown() hydration and /admin/refresh/internal refresh behavior, because refresh resets /tmp/dataops and rehydrates the cache.
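One way to express the widened filter is a single path predicate applied to each tarball member. The sketch below is illustrative only: `is_cacheable_content_path` and `ALLOWED_SUFFIXES` are hypothetical names, not the existing code in github_store.py, and the extension list is taken from the acceptance criteria. The real change would also need to keep going through GitHubStore.local_path() / normalize_repo_path().

```python
import posixpath

# Assumption: extension list copied from the acceptance criteria in this issue.
ALLOWED_SUFFIXES = (".md", ".png", ".jpg", ".jpeg", ".gif", ".webp", ".svg")


def is_cacheable_content_path(repo_path: str) -> bool:
    """Return True if a repo-relative path may be hydrated into the cache.

    Hypothetical helper for illustration, not the existing GitHubStore filter.
    """
    # Reject empty paths, absolute paths, and Windows-style separators outright.
    if not repo_path or repo_path.startswith("/") or "\\" in repo_path:
        return False
    normalized = posixpath.normpath(repo_path)
    # Any path that changes under normalization ("a/../b", "a//b") is rejected.
    if normalized != repo_path or normalized.startswith(".."):
        return False
    # Stay bounded to content/ and to the allowed file types.
    if not normalized.startswith("content/"):
        return False
    return normalized.lower().endswith(ALLOWED_SUFFIXES)
```

With this shape, `content/images/ops/example.png` and `content/ops/sops/example.md` pass, while `README.md`, `content/notes.txt`, and `content/../secrets.md` are rejected.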

Concrete production examples from the raw intake:

  • content/community/book-of-the-week/sops/invite-people-to-slack-from-the-airtable-form.md references ../../../images/book-of-the-week-events/invite-people-to-slack-from-the-airtable-form/media/image1.jpg through image5.jpg/png; those files exist locally under content/images/book-of-the-week-events/invite-people-to-slack-from-the-airtable-form/media/.
  • content/internal-admin/trello/sops/how-to-create-an-event-through-trello.md references ../../../images/other-process-documents/how-to-create-an-event-through-trello/media/image1.jpg through image3.jpg; those files exist locally under content/images/other-process-documents/how-to-create-an-event-through-trello/media/.
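To make the examples above concrete, the following sketch resolves a relative image reference against the SOP's repo path using plain POSIX path arithmetic. `resolve_image_target` is a hypothetical helper, not a function from the repository. It shows that the `../../../images/...` references land under content/images/, which is exactly the subtree a Markdown-only hydration leaves empty.

```python
import posixpath


def resolve_image_target(markdown_repo_path: str, image_ref: str) -> str:
    """Resolve a relative Markdown image reference to a repo-relative path."""
    base_dir = posixpath.dirname(markdown_repo_path)
    return posixpath.normpath(posixpath.join(base_dir, image_ref))


sop = "content/community/book-of-the-week/sops/invite-people-to-slack-from-the-airtable-form.md"
ref = "../../../images/book-of-the-week-events/invite-people-to-slack-from-the-airtable-form/media/image1.jpg"

target = resolve_image_target(sop, ref)
print(target)
```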

Acceptance Criteria

  • Production cache hydration downloads Markdown docs and the referenced image/static assets needed for docs validation from content/**, including content/images/** files with common image extensions used by current SOPs (.png, .jpg, .jpeg, .gif, .webp, .svg).
  • Cache hydration does not download unrelated repository files outside content/, and it preserves the existing path traversal protections already enforced by GitHubStore.local_path() / normalize_repo_path().
  • GET /docs/process-quality no longer reports broken-asset-reference findings for the two known SOPs listed in Scope when the image files exist in GitHub and the Lambda cache has just been hydrated.
  • /admin/refresh and internal GitHub Actions refresh still reset and rebuild the cache/search index successfully with the expanded content asset set.
  • Existing docs APIs, search behavior, and static /content/... asset serving continue to work. Lazy STORE.ensure_file() behavior for static content must not regress.
  • Automated tests cover the GitHub tarball/cache filter behavior so a future Markdown-only regression would fail locally.
  • The production deploy smoke test includes a Process Quality check after main is deployed. It either confirms that the targeted false-positive broken-asset-reference warnings are gone or documents the exact remaining findings if they are real content defects. This step is [HUMAN] only if production credentials or external access are required.

Test Scenarios

Scenario: Tarball sync includes docs images

Given: a mocked or fixture tarball containing content/ops/sops/example.md, content/images/ops/example.png, and unrelated paths such as README.md or non-content build artifacts
When: GitHubStore.sync_markdown() / tarball hydration runs against an empty cache
Then: the cache contains the Markdown file and allowed content/images/** asset, and it does not contain unrelated non-content files.
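A fixture for this scenario can be built entirely in memory with the standard library. The sketch below is an assumption-laden stand-in, not the real test: `build_fixture_tarball`, `hydrate_from_tarball`, and `allow_docs_and_images` are hypothetical names, and the single top-level directory mimics the layout of GitHub tarball archives. An actual test would drive GitHubStore.sync_markdown() with the download mocked.

```python
import io
import tarfile
import tempfile
from pathlib import Path
from typing import Callable


def build_fixture_tarball(files: dict[str, bytes], top_dir: str = "repo-abc123") -> bytes:
    """Build an in-memory .tar.gz whose members sit under one top-level directory."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for repo_path, data in files.items():
            info = tarfile.TarInfo(name=f"{top_dir}/{repo_path}")
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


def hydrate_from_tarball(
    tar_bytes: bytes, cache_root: Path, allow: Callable[[str], bool]
) -> list[str]:
    """Write allowed regular files into cache_root; return the repo paths written."""
    written = []
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue
            # Drop the archive's top-level directory to get the repo-relative path.
            _, _, repo_path = member.name.partition("/")
            if not repo_path or ".." in repo_path.split("/") or not allow(repo_path):
                continue
            source = archive.extractfile(member)
            if source is None:
                continue
            destination = cache_root / repo_path
            destination.parent.mkdir(parents=True, exist_ok=True)
            destination.write_bytes(source.read())
            written.append(repo_path)
    return sorted(written)


def allow_docs_and_images(repo_path: str) -> bool:
    # Simplified stand-in for the production filter.
    return repo_path.startswith("content/") and repo_path.endswith((".md", ".png"))


fixture = build_fixture_tarball({
    "content/ops/sops/example.md": b"![shot](../../images/ops/example.png)\n",
    "content/images/ops/example.png": b"\x89PNG fake bytes",
    "README.md": b"# readme\n",
    "dist/bundle.js": b"console.log(1)",
})

with tempfile.TemporaryDirectory() as tmp:
    cached = hydrate_from_tarball(fixture, Path(tmp), allow_docs_and_images)
    readme_cached = (Path(tmp) / "README.md").exists()
```

Swapping `allow_docs_and_images` for a Markdown-only predicate makes the image disappear from `cached`, which is the regression the automated test is meant to catch.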

Scenario: Process Quality uses hydrated image assets

Given: a temporary repo/cache with a Markdown SOP that references an existing image under content/images/**
When: process_quality.build_report(cache_root, cache_root / "content") or the /docs/process-quality handler runs after cache hydration
Then: no broken-asset-reference finding is emitted for that valid image reference.

Scenario: Missing image still reports real defect

Given: a Markdown SOP references an image path that is not present in content/images/**
When: docs link validation or Process Quality runs
Then: the existing broken-asset-reference warning still appears with a useful source path and remediation action.
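The two Process Quality scenarios above can share one fixture: a temporary content tree with one valid and one missing image reference. The checker below is a simplified, hypothetical stand-in (`find_broken_image_refs` is not the process_quality API, and the regex only handles inline `![alt](target)` images). A real test would call process_quality.build_report() and inspect its findings instead.

```python
import re
import tempfile
from pathlib import Path

# Simplified pattern for inline Markdown images: ![alt](target)
IMAGE_REF = re.compile(r"!\[[^\]]*\]\(([^)\s]+)")
URL_SCHEME = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*:")


def find_broken_image_refs(markdown_file: Path, content_root: Path) -> list[str]:
    """Return local image references that do not resolve to a file under content_root."""
    broken = []
    root = content_root.resolve()
    for ref in IMAGE_REF.findall(markdown_file.read_text(encoding="utf-8")):
        if URL_SCHEME.match(ref):
            continue  # external URLs such as https://... are not local assets
        target = (markdown_file.parent / ref).resolve()
        inside_root = root in target.parents
        if not (inside_root and target.is_file()):
            broken.append(ref)
    return broken


with tempfile.TemporaryDirectory() as tmp:
    content = Path(tmp) / "content"
    sop = content / "ops" / "sops" / "example.md"
    image = content / "images" / "ops" / "example.png"
    sop.parent.mkdir(parents=True)
    image.parent.mkdir(parents=True)
    image.write_bytes(b"fake image bytes")
    sop.write_text(
        "![ok](../../images/ops/example.png)\n"
        "![gone](../../images/ops/missing.png)\n",
        encoding="utf-8",
    )
    broken_refs = find_broken_image_refs(sop, content)
```

Only the reference to `missing.png` is reported, so the hydrated image stops producing a false positive while a real content defect still surfaces.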

Scenario: Refresh rebuilds expanded cache

Given: the production full app store has been populated once
When: /admin/refresh or the internal refresh event calls refresh_from_github()
Then: the cache is reset, Markdown and allowed content image assets are rehydrated, search rebuild still completes, and docs/process-quality endpoints keep returning successful responses.

Scenario: Static asset serving remains compatible

Given: a browser or docs renderer requests /content/images/... for an image that is already cached or can be fetched through the GitHub tree
When: full_app_handler.serve_content() calls STORE.ensure_file()
Then: the response returns the correct binary body/content type and existing lazy fetch behavior still handles cache misses.
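For the content-type part of this scenario, a regression test could assert on the response built for a cached image. The sketch below assumes the handler returns a Lambda proxy-style response dictionary with a base64-encoded body. That response shape and the `build_asset_response` name are assumptions for illustration; the actual behavior of full_app_handler.serve_content() and STORE.ensure_file() is not shown in this issue.

```python
import base64
import mimetypes
import tempfile
from pathlib import Path


def build_asset_response(asset_path: Path) -> dict:
    """Build a proxy-style response for a cached binary asset (hypothetical shape)."""
    content_type, _ = mimetypes.guess_type(asset_path.name)
    body = asset_path.read_bytes()
    return {
        "statusCode": 200,
        # Fall back to a generic binary type when the extension is unknown.
        "headers": {"Content-Type": content_type or "application/octet-stream"},
        "isBase64Encoded": True,
        "body": base64.b64encode(body).decode("ascii"),
    }


with tempfile.TemporaryDirectory() as tmp:
    asset = Path(tmp) / "image1.jpg"
    asset.write_bytes(b"\xff\xd8\xff fake jpeg")
    response = build_asset_response(asset)
```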

Out of Scope

  • Rewriting SOP Markdown image references or moving existing image files.
  • Changing the Process Quality model to suppress valid broken-asset-reference findings.
  • Broadly mirroring the entire repository into Lambda cache.
  • Replacing GitHubStore with a different storage provider.
  • Manual app deployment outside this repo's GitHub Actions CI/CD flow.
  • Modifying source repos listed in AGENTS.md such as ../dtc-operations, ../datatasks, or ../podcast-assistant.

Dependencies

  • Production app deploy remains through this repo's GitHub Actions OIDC deployment after main is pushed.
  • No CloudFormation update is expected for this bug unless the implementation discovers a Lambda package/runtime limit or IAM permission issue. If ../aws-infra/sandbox/dataops/template.github-actions.yaml changes, a credentialed AWS operator must apply the dataops-github-actions stack; committing it alone is not sufficient.
  • Use the DataOps process gates: Software Engineer implements without commit, Tester verifies with real commands/evidence, PM accepts from the operator perspective, then orchestrator merges/pushes and On-Call monitors CI/CD.
  • Recommended verification commands for the engineer/tester:
    • uv run --project lambda-functions --extra search pytest ../tests/docs_app/test_process_quality.py ../tests/docs_app/test_validate_docs_links.py
    • Add/run focused tests for GitHubStore tarball hydration behavior.
    • uv run --project lambda-functions --extra search python -m lambda_functions.validate_docs_links --repo-root . --content-root content
    • uv run --project lambda-functions --extra search python -m lambda_functions.process_quality --repo-root . --content-root content --output .tmp/process-quality-report.json
