Problem
File discovery (list_repo_files in server/indexer/github_source.py:112-155) fetches the entire repo tree in a single GitHub Trees API call with recursive=1:
    tree = await _gh_get(
        c,
        f"{_GITHUB_API}/repos/{repo}/git/trees/{ref}",
        token,
        {"recursive": "1"},
        timeout=30,
    )
The GitHub Trees API caps the size of a recursive response (GitHub's documentation gives the limit as 100,000 entries or 7 MB). When the response exceeds the cap, GitHub sets truncated: true and returns only a partial tree. semcode detects this but only logs a warning; it does not paginate, retry, or fall back:
    if tree.get("truncated"):
        logger.warning(
            "GitHub trees response truncated for %s@%s — repo is very large; "
            "some files may be missing from the index.",
            repo,
            ref,
        )
It then iterates over whatever partial tree.get("tree", []) contains.
Impact
For very large repositories, an unknown subset of files is silently absent from the index run. Those files are never parsed, embedded, or upserted, so they simply can't be found via search — with no error, just a log line that's easy to miss.
Suggested fix
Handle the truncated case instead of logging and continuing. Options:
- Walk the tree one directory at a time when truncated is true: fetch
git/trees/<sha> non-recursively for each directory and descend into its subtrees.
- Or fall back to the Contents API / git pack-based listing for the affected repo.
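A minimal sketch of the first option, to make the shape of the fallback concrete. It is not the existing semcode code: list_blob_paths and the fetch_tree callable are hypothetical names, and fetch_tree is assumed to return the decoded JSON of GET /repos/{repo}/git/trees/{sha}, with or without recursive=1. The entry fields it reads (path, type, sha) are the ones the Trees API returns.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical helper signature: (tree SHA or ref, recursive flag) -> decoded
# JSON body of GET /repos/{repo}/git/trees/{sha}.
FetchTree = Callable[[str, bool], Awaitable[dict]]


async def list_blob_paths(fetch_tree: FetchTree, ref: str) -> list[str]:
    """List every blob path, walking per directory if the recursive call is truncated."""
    tree = await fetch_tree(ref, True)
    if not tree.get("truncated"):
        # Fast path: the single recursive call returned the whole tree.
        return sorted(
            e["path"] for e in tree.get("tree", []) if e.get("type") == "blob"
        )

    paths: list[str] = []
    # Each pending item is (tree SHA or ref, path prefix for that tree's entries).
    pending: list[tuple[str, str]] = [(ref, "")]
    while pending:
        sha, prefix = pending.pop()
        subtree = await fetch_tree(sha, False)
        if subtree.get("truncated"):
            # Assumption: a single directory this large is treated as an error
            # rather than indexed partially.
            raise RuntimeError(f"tree {sha} ({prefix or '/'}) is still truncated")
        for entry in subtree.get("tree", []):
            # In a non-recursive response, paths are relative to that tree.
            path = prefix + entry["path"]
            if entry.get("type") == "blob":
                paths.append(path)
            elif entry.get("type") == "tree":
                pending.append((entry["sha"], path + "/"))
            # Entries of type "commit" (submodules) are skipped.
    return sorted(paths)
```

In list_repo_files, fetch_tree could presumably be a thin wrapper around _gh_get that passes {"recursive": "1"} only when the flag is set. The walk costs one request per directory, so it is only worth taking when the recursive call reports truncation, and very large repos may need rate-limit handling on top of it.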
References
server/indexer/github_source.py:112-155
docs/ingestion.md:40