Skip to content

web_capture URL cache: dedupe concurrent in-flight identical captures #128

@RFAI-Seer

Description

@RFAI-Seer

Follow-up from #127 (opt-in URL dedup cache, Codex review round-5).

Current behavior: the cache dedupes against already-persisted captures. Two identical /web/capture requests that arrive before either persists doc-index.json both miss the lookup and run separate browser captures (producing two doc_ids). This is correct (no wrong data) but the dedup is ineffective for the concurrent multi-agent burst — degrading to the pre-cache behavior.

Proposal: serialize cache misses per composite key (url + content_type + extract_tables + lang) with a per-key async lock + recheck-under-lock before launching the browser, mirroring the docreader's per-doc_id WeakValueDictionary lock registry.

Why deferred: sequential repeats (capture now, re-request later) — the common case — already hit the cache; the tight concurrent-burst window is narrow, and a correct keyed-lock registry held across the slow browse+persist is non-trivial. Not blocking the opt-in v1.

Scope: services/browser/mantisfetch_browser/__init__.py capture() cache-miss path.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions