Bulk/projection accessors for the Python facade to avoid N+1 reconstruction on the Neo4j backend #180

Description

@rahlk

Is your feature request related to a problem? Please describe.

The Neo4j-backed Python facade is slow for whole-application enumeration. PythonAnalysis.get_methods() → get_all_methods_in_application() walks get_symbol_table(), which does one query for modules (good) but then reconstructs each module faithfully via an N+1 fan-out:

  • _module_full(module) → per-module queries for classes, functions, module-vars, imports
  • _class_full(class) → queries for methods, attributes, inner classes (recurses)
  • _callable_full(callable) → queries for call-sites, declared callables, declared classes, declared vars (recurses)

On a large app (odoo: ~1028 modules, ~1100 classes, ~7102 callables) this is tens of thousands of serialized Bolt round-trips → ~110s for a single get_methods(). It's a classic N+1 reconstruction — deliberately faithful (rebuilds identically to the in-memory PyCodeanalyzer), with fidelity bought in round-trips.
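As a rough sanity check on the "tens of thousands" figure, the fan-out can be estimated from the counts quoted above. This is a back-of-envelope sketch, not a measurement: it assumes each sub-item listed for _module_full, _class_full and _callable_full costs exactly one query per entity, which the real helpers may not match.

```python
# Back-of-envelope query count for one full get_symbol_table() walk.
# ASSUMPTION: every sub-item listed in the issue costs one query per entity.
MODULES, CLASSES, CALLABLES = 1028, 1100, 7102  # odoo figures quoted in the issue

QUERIES_PER_MODULE = 4    # classes, functions, module-vars, imports
QUERIES_PER_CLASS = 3     # methods, attributes, inner classes
QUERIES_PER_CALLABLE = 4  # call-sites, declared callables, declared classes, declared vars


def estimate_queries(modules: int, classes: int, callables: int) -> int:
    """Return the estimated number of round-trips for a full reconstruction."""
    return (
        1  # the single module-listing query
        + modules * QUERIES_PER_MODULE
        + classes * QUERIES_PER_CLASS
        + callables * QUERIES_PER_CALLABLE
    )


total = estimate_queries(MODULES, CLASSES, CALLABLES)
ms_per_query = 110_000 / total  # ~110 s observed wall time, spread evenly
print(total, round(ms_per_query, 1))  # 35821 3.1
```

Under these assumptions the walk issues about 36k queries, the same order of magnitude as the ~30k cited below, and the observed ~110s works out to roughly 3 ms per round-trip. The callables term dominates, which is why projecting callables in one query removes most of the cost.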

Two aggravating factors:

  • PyNeo4jBackend._run opens a fresh session() per call (neo4j_backend.py:147-150), so every one of those ~30k queries also pays session-acquisition overhead, not just round-trip latency.
  • get_methods_with_decorators() raises NotImplementedError (python_analysis.py:944) and its docstring points callers at "manually filter get_methods()" — i.e. the slow path.

The root mismatch: agent workloads (catalog/extract/heap/reach) need set-at-a-time, field-projected reads ("give me {signature, decorators} for all callables"; "give me code for these 600 signatures"), but the SDK offers only one-at-a-time, fully-reconstructed reads. Neo4j excels at the former; the N+1 reconstruction defeats it. Consumers work around it by hand-writing Cypher, which leaks graph schema into agent prompts.

Describe the solution you'd like

Add a small set of bulk, projected, single-round-trip accessors to the PythonAnalysisBackend ABC, implemented on both the Neo4j and in-process backends (parity, so the facade stays backend-agnostic) and surfaced on the PythonAnalysis facade. Return typed Pydantic models (matching cldk.models.python conventions), not dicts. Ranked by impact:

  1. get_callables_overview() -> List[CallableOverview] (the big one) — one round-trip, a lightweight projection per callable instead of full reconstruction:
    { signature, class_signature | None, kind, file, start_line, end_line, decorators: list[str], is_entrypoint_hint? }.
    Replaces get_methods() for enumeration; callers body-inspect only the few that need it via the existing get_method(...). Turns ~110s into one query.
    Cypher shape: MATCH (c:PyCallable) WHERE c._module IN $mods RETURN c.signature, c.decorators, ....

  2. get_method_bodies(signatures: list[str]) -> Dict[str, str] — batch body fetch for a known frontier:
    MATCH (c:PyCallable) WHERE c.signature IN $sigs RETURN c.signature, c.code. One round-trip for N bodies (serves body-embedding at scale).

  3. get_callsites_for(signatures: list[str]) -> Dict[str, List[PyCallSite]] — batch call-sites keyed by owner signature, off the existing PY_HAS_CALLSITE edges, avoiding the per-callable _callable_full fan-out.

  4. get_decorated_callables(markers: list[str]) -> List[CallableOverview] — fills the get_methods_with_decorators stub:
    MATCH (c:PyCallable) WHERE any(d IN c.decorators WHERE d IN $markers) RETURN .... Makes framework-entrypoint detection one query instead of a full scan.
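To make the intended semantics concrete, here is a minimal in-process sketch of three of the accessors. It is illustrative only: a dataclass stands in for the proposed Pydantic CallableOverview model to keep the example dependency-free, ToyInProcessBackend and the sample rows are hypothetical, the optional is_entrypoint_hint field is left out, and get_callsites_for is omitted because it depends on the existing PyCallSite model.

```python
from dataclasses import dataclass, field, fields
from typing import Dict, List, Optional


@dataclass(frozen=True)
class CallableOverview:
    # Stand-in for the proposed Pydantic model; field names follow the issue text.
    signature: str
    class_signature: Optional[str]
    kind: str
    file: str
    start_line: int
    end_line: int
    decorators: List[str] = field(default_factory=list)


class ToyInProcessBackend:
    """Hypothetical backend over pre-analyzed rows (one dict per callable)."""

    def __init__(self, rows: List[dict]):
        self._rows = rows

    def get_callables_overview(self) -> List[CallableOverview]:
        # Project only the overview fields; bodies and call-sites are never touched.
        names = [f.name for f in fields(CallableOverview)]
        return [CallableOverview(**{n: row[n] for n in names}) for row in self._rows]

    def get_method_bodies(self, signatures: List[str]) -> Dict[str, str]:
        # Batch lookup: one pass for N signatures instead of N get_method() calls.
        wanted = set(signatures)
        return {r["signature"]: r["code"] for r in self._rows if r["signature"] in wanted}

    def get_decorated_callables(self, markers: List[str]) -> List[CallableOverview]:
        # Same predicate as the Cypher shape: any decorator is a member of markers.
        wanted = set(markers)
        return [o for o in self.get_callables_overview() if wanted & set(o.decorators)]


rows = [
    {"signature": "app.views.index", "class_signature": None, "kind": "function",
     "file": "app/views.py", "start_line": 10, "end_line": 14,
     "decorators": ["route"], "code": "def index(): ..."},
    {"signature": "app.models.User.save", "class_signature": "app.models.User",
     "kind": "method", "file": "app/models.py", "start_line": 30, "end_line": 42,
     "decorators": [], "code": "def save(self): ..."},
]
backend = ToyInProcessBackend(rows)
overview = backend.get_callables_overview()
entrypoints = backend.get_decorated_callables(["route"])
bodies = backend.get_method_bodies(["app.models.User.save"])
```

A Neo4j implementation would return the same shapes from a single query each, so callers see identical results regardless of backend.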

Plus an orthogonal quick win (separate commit): make _run (or the reconstruction helpers) reuse a single session/transaction instead of one session per query — speeds up the existing get_methods()/get_symbol_table() path without changing fetch shape.
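The session-reuse idea can be sketched without a live database. FakeDriver and FakeSession below are hypothetical stand-ins that only mirror the `driver.session()` context manager and `session.run(...)` call shape of the Neo4j Python driver; the two helper functions are illustrative and are not the actual `_run` implementation.

```python
from contextlib import contextmanager


class FakeSession:
    """Stand-in session: echoes the query instead of hitting a database."""

    def run(self, query, **params):
        return [{"query": query, "params": params}]


class FakeDriver:
    """Stand-in driver that counts how many sessions get opened."""

    def __init__(self):
        self.sessions_opened = 0

    @contextmanager
    def session(self):
        self.sessions_opened += 1
        yield FakeSession()


def run_per_call(driver, queries):
    """Shape described in the issue: a fresh session for every query."""
    results = []
    for query in queries:
        with driver.session() as session:
            results.append(session.run(query))
    return results


def run_shared(driver, queries):
    """Proposed shape: one session reused across the whole reconstruction."""
    with driver.session() as session:
        return [session.run(query) for query in queries]


queries = ["MATCH (m:PyModule) RETURN m", "MATCH (c:PyClass) RETURN c", "MATCH (c:PyCallable) RETURN c"]
per_call, shared = FakeDriver(), FakeDriver()
run_per_call(per_call, queries)
run_shared(shared, queries)
print(per_call.sessions_opened, shared.sessions_opened)  # 3 1
```

The query count is unchanged in both variants; only the session-acquisition overhead drops, which is why this is complementary to the projected accessors rather than a substitute. The PyModule and PyClass labels in the sample queries are placeholders, not confirmed schema names.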

Describe alternatives you've considered

  • Hand-written Cypher in consumers — what's happening today; leaks graph schema into agent prompts and isn't backend-agnostic. The point of these accessors is that the SDK owns the query and return shape.
  • Speeding up the existing reconstruction only (session reuse, batching the fan-out) — helps, but doesn't address over-fetch: enumeration still rebuilds call-sites, inner callables, and locals that catalog throws away. Projection is the real fix; session reuse is complementary.

Additional context

Priority: items 1 and 4 first (they share the CallableOverview model; together they unblock catalog's enumeration and entrypoint scan, which is the slow path), then item 2 (heap-phase bodies). Item 3 optimizes an already-workable path. Implementation to land on feat/issue-XXX-granular-accessors, separate from the TypeScript work in #179.
