All coordinates crossing an IPC boundary follow this contract:
| Direction | Source Space | Conversion | Target Space |
|---|---|---|---|
| Overlay → Main (dot-selected) | CSS/DIP | × scaleFactor | physical screen pixels |
| Main → Overlay (regions) | physical screen pixels | ÷ scaleFactor | CSS/DIP |
| Main → Click injection | physical screen pixels | (none — native) | physical screen pixels |
| UIA bounds (from .NET host) | physical screen pixels | (none — native) | physical screen pixels |
- `scaleFactor` is `screen.getPrimaryDisplay().scaleFactor` (e.g. 1.25 at 125% DPI).
- `denormalizeRegionsForOverlay(regions, sf)` in `index.js` handles all Main → Overlay conversions.
- The `dot-selected` handler in `index.js` adds `physicalX`/`physicalY` to every selection event.
- Region bounds stored in `inspectService` are always in physical screen pixels.
- The overlay renderer operates entirely in CSS/DIP; it never needs to know about physical pixels.
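A minimal sketch of the two conversions implied by this contract. Names are illustrative (the real conversions live in src/main/index.js), and `scaleFactor` would come from Electron's `screen.getPrimaryDisplay().scaleFactor` in the main process:

```javascript
// Overlay → Main: convert a CSS/DIP point to physical screen pixels
// before it is used for UIA queries or click injection.
function toPhysicalPoint({ x, y }, scaleFactor) {
  return {
    physicalX: Math.round(x * scaleFactor),
    physicalY: Math.round(y * scaleFactor),
  };
}

// Main → Overlay: convert physical-pixel region bounds back to CSS/DIP
// so the overlay renderer never has to know about physical pixels.
function denormalizeRegionForOverlay(region, scaleFactor) {
  return {
    ...region,
    x: region.x / scaleFactor,
    y: region.y / scaleFactor,
    width: region.width / scaleFactor,
    height: region.height / scaleFactor,
  };
}
```

At 125% DPI (`scaleFactor = 1.25`), a dot selected at CSS (100, 80) maps to physical (125, 100); the inverse conversion brings region bounds back for rendering.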
Deliver a DevTools-like overlay + automation loop where:
- The overlay stays up while you keep interacting with background apps.
- The system can explicitly control window layering (front/back/minimize/restore/maximize) and reliably target UI elements for interaction.
- Behavior is grounded in the `System.Windows.Automation` (UI Automation) API surface (WindowsDesktop 11.0) rather than ad-hoc assumptions.
- Extracted .NET API reference (from the attached PDF)
- Codebase modules to align
- Overlay: src/renderer/overlay/overlay.js
- Main orchestration: src/main/index.js
- Inspect pipeline: src/main/inspect-service.js
- Watcher pipeline: src/main/ui-watcher.js
- System action executor: src/main/system-automation.js
- UI automation toolkit: src/main/ui-automation/index.js
- Window control: src/main/ui-automation/window/manager.js
- UIA .NET host(s): src/native/windows-uia-dotnet/Program.cs, src/native/windows-uia/Program.cs
- Overlay is already implemented as a transparent always-on-top window with click-through forwarding; inspect regions are rendered and can be refreshed.
- Explicit window operations already exist across UI layer + system actions + CLI:
- z-order/state: front/back/minimize/restore/maximize
- flexible window target resolution (by hwnd/title/process/class)
Second-pass priority (Vision + Overlay-grounded Actions)

This repo already contains major building blocks for “AI vision”, but they aren’t yet unified into a tight loop where the AI reliably sees what the user sees and can target actions using overlay/region semantics.
What exists today (ground truth):
- Screen/region capture (with “hide overlay before capture” safeguards):
- src/main/index.js
- Chat IPC entrypoints: src/renderer/chat/preload.js
- Visual context buffering + provider-specific multimodal message formatting:
- “Visual awareness” analysis primitives (OCR + UIA element discovery + point hit-testing + diffing):
- Overlay can already render “actionable regions” and hover-test them:
- Inspect data contracts already support `source: accessibility | ocr | heuristic`.
What’s missing (advancement features to add):
- A first-class vision grounding loop that ties together capture → analyze → regions → prompt context → action targeting.
- Multi-monitor/virtual-desktop correctness for both capture and overlay (current capture is primary-display oriented).
- Region-targeted actions (e.g., “click region #12”) so the AI can act using the same structures the overlay draws, instead of only raw coordinates.
- ROI (region-of-interest) capture as the default for “what am I looking at?” so the AI gets high-resolution detail where it matters without sending the entire screen every time.
This plan focuses on what the PDF implies we should harden/extend next.
UIA surfaces like `AutomationElement.BoundingRectangle`, `AutomationElement.FromPoint(Point)`, and the clickable-point APIs specify physical screen coordinates. Bounding rectangles can include non-clickable areas; `FromPoint` does not imply clickability.
Implication for this repo:
- Overlay renderer coordinates (CSS/DIP) must be converted to physical screen coordinates before they are used for UIA or input injection.
- Region modeling should treat bounding rectangles as “visual bounds”, and a separate “click point” (if available) as the preferred click target.
Relevant implementation touchpoints:
- Overlay mouse handling: src/renderer/overlay/overlay.js
- Click injection expects real screen coordinates: src/main/ui-automation/mouse/click.js
- Existing point-based UIA query in visual awareness: src/main/visual-awareness.js
The PDF explicitly notes that `AutomationElement.SetFocus()` does not necessarily bring an element/window to the foreground or make it visible.
Implication:
- Keep Win32 foreground/z-order primitives for `front`/`back`.
- Treat UIA `SetFocus()` as “keyboard focus within the already-visible UI”. Use it as a complement before pattern actions (Value/Invoke/etc.), not as the mechanism for “bring to front”.
Relevant code touchpoints:
- Window primitives: src/main/ui-automation/window/manager.js
- Agent action executor focus path: src/main/system-automation.js
The PDF surfaces the standard interaction patterns:
- Invoke, Value, Scroll, ExpandCollapse, Toggle, Selection/SelectionItem, Text, WindowPattern, etc.
Implication:
- Prefer pattern-based interaction (Invoke/Value/Scroll/ExpandCollapse/Toggle/SelectionItem) over “click center of rectangle”.
- When mouse fallback is required, prefer `TryGetClickablePoint` over rect-center whenever possible.
Relevant code touchpoints:
- Element click pipeline: src/main/ui-automation/interactions/element-click.js
- System action dispatcher: src/main/system-automation.js
UIA event APIs (`Automation.AddAutomationFocusChangedEventHandler`, `AddStructureChangedEventHandler`, `AddAutomationPropertyChangedEventHandler`, plus `TextPattern.*` events via `AddAutomationEventHandler`) require long-lived registrations.
Implication:
- The current polling-based PowerShell watcher cannot be “made event-driven” with small tweaks; event subscriptions need to run inside a persistent .NET process.
- The repo already has .NET UIA programs; they are the natural place to add an event-stream mode.
Relevant code touchpoints:
- Polling watcher today: src/main/ui-watcher.js
- Existing .NET hosts: src/native/windows-uia-dotnet/Program.cs, src/native/windows-uia/Program.cs
The PDF calls out that `AutomationElement.GetSupportedPatterns()` can be expensive.
Implication:
- Avoid calling `GetSupportedPatterns()` in hot paths (poll loops / frequent updates).
- When snapshots are needed, consider UIA `CacheRequest`/`GetUpdatedCache(...)` patterns in the managed host.
Why (high priority): This is the shortest path to “AI can see what users see” using existing primitives, and it directly enables safer, more reliable action selection from the overlay.
Work items:
- Standardize “visual context” as a typed artifact
  - Define a shared schema for a visual frame that always includes:
    - `dataURL` (or base64), `width`, `height`, `timestamp`
    - origin/offsets (`x`, `y`) when capturing a region
    - `coordinateSpace` (physical screen pixels)
  - Ensure the same schema is used for:
    - Full screen captures (`capture-screen`)
    - ROI captures (`capture-region`)
    - Optional window/element captures using the existing UI automation screenshot module: src/main/ui-automation/screenshot.js
- Make `{"type":"screenshot"}` a scoped capture request (not just “some screenshot”)
  - The action executor already supports a `screenshot` action as a control signal.
  - Extend the action schema to support (without adding new UX):
    - `scope: "screen" | "region" | "window" | "element"`
    - `region: { x, y, width, height }` (physical coordinates)
    - `hwnd` / window criteria (for window capture)
    - Element criteria (for element capture)
  - This lets the AI request exactly the pixels it needs for reasoning and verification.
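As a sketch of the typed artifact and a scoped capture request, assuming the field names above (they are illustrative, not a final schema):

```javascript
// Build a "visual frame" artifact in the shared schema. Every frame
// carries its size, capture time, origin offsets, and coordinate space.
function makeVisualFrame({ dataURL, width, height, x = 0, y = 0 }) {
  return {
    dataURL,                     // or a raw base64 payload
    width,
    height,
    timestamp: Date.now(),
    origin: { x, y },            // non-zero when capturing a region
    coordinateSpace: 'physical', // always physical screen pixels
  };
}

// A scoped screenshot action the executor could accept:
const exampleAction = {
  type: 'screenshot',
  scope: 'region', // 'screen' | 'region' | 'window' | 'element'
  region: { x: 1920, y: 0, width: 800, height: 600 }, // physical coordinates
};
```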
- ROI-first capture for overlay selection + inspect
- When the user selects an inspect region (or hovered region), capture a tight ROI around it and store it as visual context.
- Use ROI capture as the default for “describe this area” / “what is this control?” prompts.
- Wire “visual awareness” analysis into inspect regions (OCR + UIA + heuristics)
  - Run `visualAwareness.analyzeScreen(...)` on the latest visual frame (or ROI) to produce:
    - OCR text blobs
    - UIA element candidates
    - Active window context
  - Convert these into `InspectRegion` objects (source `ocr`/`accessibility`/`heuristic`) and push them through the existing region merge logic.
  - Feed the merged regions into the overlay’s existing `update-inspect-regions` path.
- Add region-grounded action targeting (AI acts like a human pointing)
  - Extend the action contract so the AI can target by:
    - `targetRegionId` (stable) or `targetRegionIndex` (as displayed by overlay)
    - Optional `targetClickPoint` if provided by UIA (`TryGetClickablePoint`)
  - Resolve those targets in main using inspect-service’s region registry, then execute via existing safe click paths.
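A hedged sketch of the resolver, assuming a region registry shaped like the coordinate contract above (field names are illustrative; the real registry lives in inspect-service):

```javascript
// Resolve a region-grounded action target to a concrete click point.
// All coordinates here are physical screen pixels per the contract.
function resolveActionTarget(action, regions) {
  const region =
    action.targetRegionId != null
      ? regions.find((r) => r.id === action.targetRegionId)
      : regions[action.targetRegionIndex]; // index as displayed by overlay
  if (!region) return null;

  // Prefer an explicit or UIA-provided clickable point; fall back to
  // the rect center only when no better point is known.
  const point = action.targetClickPoint ?? region.clickPoint ?? {
    x: region.x + region.width / 2,
    y: region.y + region.height / 2,
  };
  return { region, point };
}
```

The result feeds the existing safe click paths, so “click region #N” never requires the model to guess raw coordinates.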
- Make visual context inclusion deterministic (not keyword-heuristic)
  - Today, `includeVisualContext` is enabled by keyword heuristics and/or existing visual history.
  - For overlay-driven interactions and region-based actions, force `includeVisualContext: true` with the corresponding ROI frame.
- Ensure multimodal calls always use a vision-capable model
- The AI layer already supports vision-capable models and builds provider-specific image message payloads.
- Keep (and make explicit in the plan) the invariant: if a message contains images, route to a vision-capable model automatically (fallback as needed).
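The routing invariant can be sketched as a small pure function; `isVisionCapable` is a stand-in for whatever capability table the AI layer already keeps, and the model names are placeholders:

```javascript
// If any outgoing message carries images, force a vision-capable model;
// otherwise keep the preferred model.
function pickModel(messages, { preferred, visionFallback }) {
  const hasImages = messages.some(
    (m) =>
      Array.isArray(m.content) &&
      m.content.some((part) => part.type === 'image')
  );
  if (!hasImages) return preferred;
  return isVisionCapable(preferred) ? preferred : visionFallback;
}

function isVisionCapable(model) {
  // Stand-in check; the real capability lookup lives in the AI layer.
  return /vision|multimodal/i.test(model);
}
```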
Acceptance criteria:
- After the user captures the screen once, the AI can answer “what’s on screen?” with visual grounding (not just Live UI State).
- When the user selects a region, the AI receives an ROI image of that region and can propose actions referencing it.
- The AI can execute an action like “click region #N” without guessing coordinates.
Primary files:
- Capture + storage: src/main/index.js, src/main/ai-service.js
- Analysis: src/main/visual-awareness.js
- Region registry: src/main/inspect-service.js
- Overlay render + hit-test: src/renderer/overlay/overlay.js
Why: UIA + input injection both assume physical screen coordinates; today overlay coordinates are not explicitly converted and the overlay is sized to the primary display.
Work items:
- Define a single coordinate contract for actions and regions
- Add a clear contract document section (in this file or a short follow-up doc) stating:
- Region bounds are in physical screen coordinates.
- Optional `clickPoint` is also in physical screen coordinates.
- Every region/action includes the coordinate space.
- Convert overlay pointer coordinates to physical screen coordinates before action execution
- Implement conversion in the overlay→main IPC boundary.
- Ensure “screenX/screenY” is not used for unconverted values.
- Make overlay cover the virtual desktop (union of all displays)
- Replace primary-only sizing with a union-of-displays rectangle.
- Ensure regions on a non-primary monitor render and are clickable.
- Make capture cover the virtual desktop too
- Current capture paths are primary-display sized and positioned (x=0,y=0).
- Update capture to support:
- Multi-display captures (one per display) with per-display offsets
- Or a stitched virtual-desktop capture with correct origin
- Ensure ROI cropping uses the same coordinate basis as overlay regions.
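The union-of-displays rectangle can be computed as a pure function over the array returned by Electron's `screen.getAllDisplays()` (each display exposes a `bounds` rect in DIP coordinates). A sketch, not the final sizing code:

```javascript
// Compute the virtual-desktop bounding rectangle as the union of all
// display bounds. Displays left of or above the primary have negative
// coordinates, which the min/max handles naturally.
function virtualDesktopBounds(displays) {
  const left = Math.min(...displays.map((d) => d.bounds.x));
  const top = Math.min(...displays.map((d) => d.bounds.y));
  const right = Math.max(...displays.map((d) => d.bounds.x + d.bounds.width));
  const bottom = Math.max(...displays.map((d) => d.bounds.y + d.bounds.height));
  return { x: left, y: top, width: right - left, height: bottom - top };
}
```

The overlay window would then be sized with something like `overlayWindow.setBounds(virtualDesktopBounds(screen.getAllDisplays()))` instead of the primary display's bounds.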
Acceptance criteria:
- Clicking a point selected on the overlay lands on the correct pixel on 100% and scaled (125%/150%) displays.
- Regions on monitor 2 can be selected and clicked with no offset.
Primary files:
Why: DevTools-style interaction depends on reliable hit-testing and re-targeting without fragile “re-find by Name” logic.
Work items:
- Add a point-based element resolver using `AutomationElement.FromPoint(Point)`
- Input: physical screen coordinates.
- Output: element payload with bounding rectangle and key identity fields.
- Add runtimeId to element payloads
- Include `AutomationElement.GetRuntimeId()` in element results where feasible.
- Use runtimeId as a session-scoped stable identity (better than AutomationId-only).
- Add clickable point support
- Prefer `TryGetClickablePoint(out Point)` and store `clickPoint` when available.
Acceptance criteria:
- Given a screen point, the system returns an element with bounding rectangle + (when available) clickable point + runtimeId.
- The element can be “re-resolved” later in the same session without relying on Name-only matching.
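One possible payload shape the point-based resolver could return, assuming the managed host surfaces the raw UIA fields (all names here are illustrative):

```javascript
// Normalize a raw UIA element record into the payload shape implied by
// the acceptance criteria: visual bounds, optional click point, and a
// session-scoped runtimeId identity.
function toElementPayload(uia) {
  return {
    runtimeId: uia.runtimeId.join('.'),     // GetRuntimeId() as a stable key
    name: uia.name,
    controlType: uia.controlType,
    automationId: uia.automationId,
    bounds: uia.boundingRectangle,          // physical pixels; visual bounds only
    clickPoint: uia.clickablePoint ?? null, // from TryGetClickablePoint, may be absent
  };
}
```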
Primary files:
Why: Bounding rectangles are not guaranteed clickable; patterns are the intended automation surface.
Work items:
- Add ValuePattern-based set value
- New high-level operation: set value on a target element.
- Prefer `ValuePattern.SetValue(string)`.
- Fallback: focus + typing only when ValuePattern is not supported.
- Add ScrollPattern-based scrolling
- New operation: scroll a specific element/container.
- Prefer `ScrollPattern.Scroll(...)` or `SetScrollPercent(...)`.
- Fallback: mouse wheel simulation.
- Add ExpandCollapsePattern operations
- Expand/collapse tree/menu items without coordinate clicking.
- Add TextPattern read support (inspection)
- New inspection feature: read text content via `TextPattern.DocumentRange` where supported.
Acceptance criteria:
- For a control that supports a pattern, actions succeed without mouse injection.
- For a control that does not, the system returns a structured “pattern unsupported” result and falls back only when safe/appropriate.
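A sketch of the pattern-first flow for set-value with the structured fallback described above; `io` stands in for the real executor surface, and all method names are assumptions:

```javascript
// Try ValuePattern first; fall back to focus + typing only when allowed;
// otherwise return a structured "pattern unsupported" result.
async function setValuePatternFirst(element, text, io) {
  if (element.patterns.includes('ValuePattern')) {
    await io.setValue(element, text);      // ValuePattern.SetValue(string)
    return { ok: true, via: 'ValuePattern' };
  }
  if (io.allowFallback) {
    await io.focus(element);               // SetFocus: keyboard focus only
    await io.type(text);                   // synthesized keystrokes
    return { ok: true, via: 'focus+type' };
  }
  return { ok: false, error: 'pattern unsupported', pattern: 'ValuePattern' };
}
```

The same shape generalizes to Scroll/ExpandCollapse/Toggle: check the pattern, act through it, and only then consider mouse or keyboard simulation.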
Primary files:
Why: Polling is coarse and expensive; UIA events can provide fast deltas, but only with a persistent host.
Work items:
- Extend the .NET UIA host to support an “event stream” mode
- Register focus changed handler (system-wide) only when inspect mode is enabled.
- On focus changes, attach structure/property-changed handlers to the focused window subtree.
- Emit JSON deltas over stdout.
- Update Node watcher to support “event backend”
- Spawn the managed host; translate deltas into the existing overlay region update format.
- Keep polling as a fallback/recovery mechanism.
Acceptance criteria:
- With inspect mode enabled, regions update within <250ms after UI changes without full rescans.
- The pipeline recovers gracefully when elements disappear (no crashes; falls back to re-snapshot).
Primary files:
Window z-order/state primitives exist, but the PDF suggests we should treat UIA window semantics as first-class for validation and state constraints.
Work items:
- Unify “bring to front” implementation across CLI and agent actions so they behave consistently under foreground-lock constraints.
- Optionally consult `WindowPattern` for capability checks (`CanMinimize`/`CanMaximize`) and state confirmation, while still using Win32 for actual foreground/z-order.
Primary files:
- This plan file (you are reading it).
- A small set of targeted PRs, ideally one per phase:
- Phase 1: coordinate contract + virtual desktop overlay
- Phase 2: point picking + runtimeId + clickable points
- Phase 3: pattern-first actions (value/scroll/expand/text)
- Phase 4: optional event-host + event backend
- Extend existing script-based tests under scripts/ where feasible.
- Add manual smoke steps:
- Multi-monitor: verify overlay regions render on all displays and clicks land correctly.
- DPI: verify click offsets at 125%/150% scale.
- Pattern actions: verify ValuePattern/ScrollPattern/ExpandCollapse behave without mouse.
- Watcher: verify inspect-mode gating of system-wide focus event subscriptions.