GitHub Trees API truncation silently drops files from index runs #55

@GoodbyePlanet

Problem

File discovery (list_repo_files in server/indexer/github_source.py:112-155) fetches the entire repo tree in a single GitHub Trees API call with recursive=1:

tree = await _gh_get(
    c,
    f"{_GITHUB_API}/repos/{repo}/git/trees/{ref}",
    token,
    {"recursive": "1"},
    timeout=30,
)

The GitHub Trees API caps the size of a recursive response (GitHub's REST documentation gives the limit as 100,000 entries and 7 MB). When the tree exceeds that cap, GitHub sets truncated: true and returns only a partial tree. semcode detects this but only logs a warning; it does not retry, walk subtrees individually, or fall back to another listing method:

if tree.get("truncated"):
    logger.warning(
        "GitHub trees response truncated for %s@%s — repo is very large; "
        "some files may be missing from the index.",
        repo,
        ref,
    )

It then iterates over whatever partial tree.get("tree", []) contains.

Impact

For very large repositories, an unknown subset of files is missing from the index run. Those files are never parsed, embedded, or upserted, so search cannot return them. No error is raised; the only signal is a warning log line that is easy to miss.

Suggested fix

Handle the truncated case instead of logging and continuing. Options:

  • When truncated is true, walk the tree one directory at a time: fetch git/trees/<sha> without recursive=1 for each subtree and descend into its child trees. GitHub's documentation recommends this approach for truncated responses.
  • Or fall back to the Contents API or a git pack-based listing for the affected repo.
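
A minimal sketch of the first option, to make the shape of the fallback concrete. It is not the existing semcode code: walk_tree and the fetch_tree callback are hypothetical names, and fetch_tree is assumed to return the decoded JSON of a non-recursive git/trees/<sha> request (for example, a thin wrapper around _gh_get with the recursive parameter left out).

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical callback: given a tree SHA (or ref), return the decoded JSON of
# GET /repos/{repo}/git/trees/{sha} requested WITHOUT recursive=1.
FetchTree = Callable[[str], Awaitable[dict]]


async def walk_tree(fetch_tree: FetchTree, root: str) -> list[str]:
    """Return every blob path under `root`, fetching one directory per request."""
    paths: list[str] = []
    stack: list[tuple[str, str]] = [(root, "")]  # (tree sha, path prefix)
    while stack:
        sha, prefix = stack.pop()
        tree = await fetch_tree(sha)
        if tree.get("truncated"):
            # Defensive: if a single directory level is still reported as
            # truncated, fail loudly instead of indexing a partial listing.
            raise RuntimeError(f"tree {sha} ({prefix or '/'}) is still truncated")
        for entry in tree.get("tree", []):
            path = f"{prefix}{entry['path']}"
            if entry["type"] == "tree":
                # Subdirectory: queue it for its own non-recursive fetch.
                stack.append((entry["sha"], f"{path}/"))
            elif entry["type"] == "blob":
                paths.append(path)
            # Entries of type "commit" are submodules and are skipped here.
    return sorted(paths)
```

In list_repo_files this would only run when the recursive call comes back with truncated set, so normal-sized repos keep the single-request path. The trade-off is one API request per directory, which matters for rate limits on very large repos; fetching sibling subtrees concurrently would reduce wall-clock time but not the request count.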

References

  • server/indexer/github_source.py:112-155
  • docs/ingestion.md:40
