StateClient.get() scans the whole collection on every call -> unbounded memory growth #56

Description

@atsmsmr

Summary

StateClient.get(max_count=N, metadata=...) materializes the entire matching history on every call (just to sort by timestamp and slice the newest N). Called frequently on a large collection, this causes unbounded process-memory growth.

Observed in the plantbot exhibition deployment: the agent became unresponsive (and the Raspberry Pi host unreachable) ~1h after every boot. main.py grew to ~2.3 GB RSS (~1.9 GB anonymous) and exhausted RAM + swap, leading to thrashing / hang. All modules run as threads in one process (Agent.start), so they share that heap.

Root cause

get() first calls collection.get(include=['metadatas'], where=metadata) with no limit, which materializes the ids + metadata of every matching row. It then sorts client-side and fetches the full data for only the top N. The first pass is therefore O(collection size) per call. conversation_prompter calls get() every 1 s for 3 kinds; the collection holds 8k+ rows and grows monotonically (it persists across restarts), so each call materializes an ever-larger set and RSS ratchets up (Python / glibc do not return the freed memory to the OS).
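A minimal sketch of the two-pass pattern described above, to make the cost visible. It does not use chromadb: FakeCollection is a hypothetical in-memory stand-in that only mimics the shape of collection.get() and counts how many rows it has to build. The function name get_newest_full_scan and the metadata keys "kind" and "timestamp" are assumptions for illustration, not taken from the StateClient source.

```python
class FakeCollection:
    """Hypothetical in-memory stand-in for a Chroma collection (not the real API)."""

    def __init__(self):
        self._rows = []             # (id, metadata, document), in insertion order
        self.rows_materialized = 0  # rows that get() has had to build so far

    def add(self, id_, metadata, document):
        self._rows.append((id_, metadata, document))

    def get(self, ids=None, where=None, limit=None, offset=None,
            include=("metadatas", "documents")):
        rows = self._rows
        if ids is not None:
            wanted = set(ids)
            rows = [r for r in rows if r[0] in wanted]
        if where:
            rows = [r for r in rows
                    if all(r[1].get(k) == v for k, v in where.items())]
        start = offset or 0
        stop = None if limit is None else start + limit
        rows = rows[start:stop]
        self.rows_materialized += len(rows)
        result = {"ids": [r[0] for r in rows]}
        if "metadatas" in include:
            result["metadatas"] = [r[1] for r in rows]
        if "documents" in include:
            result["documents"] = [r[2] for r in rows]
        return result


def get_newest_full_scan(collection, max_count, where=None):
    """Two-pass pattern: scan every matching row, then fetch only the top N."""
    # Pass 1: no limit, so ids + metadata of every matching row are materialized.
    scan = collection.get(include=["metadatas"], where=where)
    ranked = sorted(zip(scan["ids"], scan["metadatas"]),
                    key=lambda pair: pair[1]["timestamp"], reverse=True)
    top_ids = [id_ for id_, _ in ranked[:max_count]]
    # Pass 2: full data, but only for the newest max_count rows.
    return collection.get(ids=top_ids, include=["metadatas", "documents"])


# Same row count as the collection in this issue, spread over 3 assumed kinds.
collection = FakeCollection()
for i in range(8059):
    collection.add(f"s{i}", {"kind": "abc"[i % 3], "timestamp": i}, f"state {i}")

newest = get_newest_full_scan(collection, max_count=10, where={"kind": "a"})
print(len(newest["ids"]), "rows returned,",
      collection.rows_materialized, "rows materialized")
```

The point of the counter is that the number of rows built per call tracks the number of matching rows, not max_count, so the per-call cost keeps rising as the collection grows.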

This continues the work in ccf4313 (optimize state get function), which removed the heavy embeddings/documents from the first pass but left the full metadata scan.

Evidence (chromadb 0.5.23, collection = 8059 rows)

  • Looping the current get() (3 kinds x 150 = 450 calls) -> +67 MB, monotonic, no plateau (~0.15 MB/call, i.e. ~0.45 MB per 3-kind polling cycle). At 1 Hz that is ~27 MB/min ~= 1.6 GB/hr, which is consistent with the ~1h-to-OOM timeline.
  • gc.collect() reclaims most of it -> allocation churn outpacing GC, not a C-level leak.
  • A bounded fetch (limit=10) -> +0 MB.
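One way to reproduce measurements like these is a small loop that samples the process RSS before the calls, after them, and again after gc.collect(). The sketch below is not the script used for the numbers above. It assumes a Linux host (it reads /proc/self/statm, which matches the Raspberry Pi deployment), and measure_growth, rss_mb, and the dummy fake_get callable are hypothetical names for illustration. In a real run, call would wrap StateClient.get for one kind.

```python
import gc
import os


def rss_mb():
    """Current resident set size in MB, read from /proc (Linux only)."""
    with open("/proc/self/statm") as f:
        resident_pages = int(f.read().split()[1])  # 2nd field: resident pages
    return resident_pages * os.sysconf("SC_PAGE_SIZE") / (1024 * 1024)


def measure_growth(call, kinds, rounds):
    """Call `call(kind)` for every kind, `rounds` times, and report RSS deltas."""
    before = rss_mb()
    for _ in range(rounds):
        for kind in kinds:
            call(kind)
    after_loop = rss_mb()
    gc.collect()  # separates collectable churn from memory that stays held
    after_gc = rss_mb()
    return {
        "calls": rounds * len(kinds),
        "growth_mb": after_loop - before,
        "growth_after_gc_mb": after_gc - before,
    }


def fake_get(kind):
    # Placeholder workload; replace with the real get() call under test.
    return [{"kind": kind, "timestamp": i} for i in range(1000)]


report = measure_growth(fake_get, kinds=["a", "b", "c"], rounds=150)
print(report)
```

Comparing growth_mb with growth_after_gc_mb is what distinguishes allocation churn from a leak in native code: if most of the growth disappears after gc.collect(), the memory was still reachable only through garbage awaiting collection.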

Proposed fix

When max_count is set, avoid the full scan. States are appended in time order, so the newest rows sit at the tail of the insertion order: fetch a small tail window via offset/limit (a few x max_count rows), then sort by timestamp and slice. Validated against this collection: this returns exactly the same newest-N as the current implementation, and drops growth from +67 MB to +0.3 MB over the same 450 calls.
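The tail-window idea could look roughly like the sketch below. It is an illustration under stated assumptions, not the actual patch: FakeCollection is a hypothetical in-memory stand-in exposing only count() and get(limit=, offset=), the function name get_newest_tail_window and the window_factor parameter are invented here, and the metadata keys "kind" and "timestamp" are assumed. It also relies on the premise from this issue that rows come back in insertion order. Because count() reports all rows rather than only those matching a filter, the sketch fetches an unfiltered tail and filters client-side, doubling the window when too few matches are found.

```python
class FakeCollection:
    """Hypothetical in-memory stand-in for a Chroma collection (not the real API)."""

    def __init__(self, rows):
        self._rows = list(rows)     # (id, metadata) pairs, in insertion order
        self.rows_materialized = 0  # rows that get() has had to build so far

    def count(self):
        return len(self._rows)

    def get(self, limit=None, offset=None, include=("metadatas",)):
        start = offset or 0
        stop = None if limit is None else start + limit
        rows = self._rows[start:stop]
        self.rows_materialized += len(rows)
        return {"ids": [r[0] for r in rows], "metadatas": [r[1] for r in rows]}


def get_newest_tail_window(collection, max_count, where=None, window_factor=4):
    """Return ids of the newest max_count matching rows, newest first."""
    total = collection.count()
    window = max(1, max_count * window_factor)
    while True:
        offset = max(0, total - window)
        tail = collection.get(include=["metadatas"], limit=window, offset=offset)
        matches = [
            (id_, meta)
            for id_, meta in zip(tail["ids"], tail["metadatas"])
            if not where or all(meta.get(k) == v for k, v in where.items())
        ]
        # Stop once enough matches are found or the window covers everything.
        if len(matches) >= max_count or offset == 0:
            break
        window *= 2
    # Sort the small window by timestamp, as the full-scan version does.
    matches.sort(key=lambda pair: pair[1]["timestamp"], reverse=True)
    return [id_ for id_, _ in matches[:max_count]]


rows = [(f"s{i}", {"kind": "abc"[i % 3], "timestamp": i}) for i in range(8059)]
collection = FakeCollection(rows)
newest_ids = get_newest_tail_window(collection, max_count=10, where={"kind": "a"})
print(newest_ids[:3], collection.rows_materialized, "rows materialized")
```

The returned ids would then feed the existing second pass that fetches full data for the top N. The widening loop is a design choice for sparse kinds: in the worst case it degrades to a full scan, but for kinds that are written regularly the work per call stays proportional to max_count rather than to the collection size.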

Related (separate issue, not this PR)

Each StateClient builds its own DefaultEmbeddingFunction (all-MiniLM ONNX) in-process (~842 MB measured); multiple modules in one process multiply this. Will file separately.
