# Advancing Features (PDF-grounded Implementation Plan) ## Coordinate Contract (Phase 1 — enforced) All coordinates crossing an IPC boundary follow this contract: | Direction | Source Space | Conversion | Target Space | |-----------|-------------|-----------|-------------| | Overlay → Main (`dot-selected`) | CSS/DIP | `× scaleFactor` | physical screen pixels | | Main → Overlay (regions) | physical screen pixels | `÷ scaleFactor` | CSS/DIP | | Main → Click injection | physical screen pixels | (none — native) | physical screen pixels | | UIA bounds (from .NET host) | physical screen pixels | (none — native) | physical screen pixels | - `scaleFactor` is `screen.getPrimaryDisplay().scaleFactor` (e.g. 1.25 at 125% DPI). - `denormalizeRegionsForOverlay(regions, sf)` in `index.js` handles all Main → Overlay conversions. - `dot-selected` handler in `index.js` adds `physicalX`/`physicalY` to every selection event. - Region bounds stored in `inspectService` are always in **physical screen pixels**. - The overlay renderer operates entirely in CSS/DIP; it never needs to know about physical pixels. ## Goal Deliver a DevTools-like overlay + automation loop where: - The overlay stays up while you keep interacting with background apps. - The system can explicitly control window layering (front/back/minimize/restore/maximize) **and** reliably target UI elements for interaction. - Behavior is grounded in the `System.Windows.Automation` (UI Automation) API surface (WindowsDesktop 11.0) rather than ad-hoc assumptions. ## Sources of truth - Extracted .NET API reference (from the attached PDF) - [docs/pdf/system.windows.automation-windowsdesktop-11.0.txt](docs/pdf/system.windows.automation-windowsdesktop-11.0.txt) - [docs/pdf/system.windows.automation-windowsdesktop-11.0.index.txt](docs/pdf/system.windows.automation-windowsdesktop-11.0.index.txt) - Extractor: [scripts/extract-pdf-text.py](scripts/extract-pdf-text.py) - Codebase modules to align - Overlay: [src/renderer/overlay/overlay.js](src/renderer/overlay/overlay.js) - Main orchestration: [src/main/index.js](src/main/index.js) - Inspect pipeline: [src/main/inspect-service.js](src/main/inspect-service.js) - Watcher pipeline: [src/main/ui-watcher.js](src/main/ui-watcher.js) - System action executor: [src/main/system-automation.js](src/main/system-automation.js) - UI automation toolkit: [src/main/ui-automation/index.js](src/main/ui-automation/index.js) - Window control: [src/main/ui-automation/window/manager.js](src/main/ui-automation/window/manager.js) - UIA .NET host(s): - [src/native/windows-uia-dotnet/Program.cs](src/native/windows-uia-dotnet/Program.cs) - [src/native/windows-uia/Program.cs](src/native/windows-uia/Program.cs) ## Current state (baseline) - Overlay is already implemented as a transparent always-on-top window with click-through forwarding; inspect regions are rendered and can be refreshed. - Explicit window operations already exist across UI layer + system actions + CLI: - z-order/state: front/back/minimize/restore/maximize - flexible window target resolution (by hwnd/title/process/class) **Second-pass priority (Vision + Overlay-grounded Actions)** This repo already contains major building blocks for “AI vision”, but they aren’t yet unified into a tight loop where the AI reliably sees what the user sees **and** can target actions using overlay/region semantics. What exists today (ground truth): - Screen/region capture (with “hide overlay before capture” safeguards): - [src/main/index.js](src/main/index.js) - Chat IPC entrypoints: [src/renderer/chat/preload.js](src/renderer/chat/preload.js) - Visual context buffering + provider-specific multimodal message formatting: - [src/main/ai-service.js](src/main/ai-service.js) - “Visual awareness” analysis primitives (OCR + UIA element discovery + point hit-testing + diffing): - [src/main/visual-awareness.js](src/main/visual-awareness.js) - Overlay can already render “actionable regions” and hover-test them: - [src/renderer/overlay/overlay.js](src/renderer/overlay/overlay.js) - Inspect data contracts already support `source: accessibility|ocr|heuristic`: - [src/shared/inspect-types.js](src/shared/inspect-types.js) What’s missing (advancement features to add): - A first-class **vision grounding loop** that ties together capture → analyze → regions → prompt context → action targeting. - Multi-monitor/virtual-desktop correctness for *both* capture and overlay (current capture is primary-display oriented). - Region-targeted actions (e.g., “click region #12”) so the AI can act using the same structures the overlay draws, instead of only raw coordinates. - ROI (region-of-interest) capture as the default for “what am I looking at?” so the AI gets high-resolution detail where it matters without sending the entire screen every time. This plan focuses on what the PDF implies we should harden/extend next. --- ## Key PDF-driven findings to incorporate ### 1) Coordinate systems are **physical screen coordinates** UIA surfaces like `AutomationElement.BoundingRectangle`, `AutomationElement.FromPoint(Point)`, and clickable point APIs specify *physical screen coordinates*. Bounding rectangles can include non-clickable areas; `FromPoint` does not imply clickability. Implication for this repo: - Overlay renderer coordinates (CSS/DIP) must be converted to physical screen coordinates before they are used for UIA or input injection. - Region modeling should treat bounding rectangles as “visual bounds”, and a separate “click point” (if available) as the preferred click target. Relevant implementation touchpoints: - Overlay mouse handling: [src/renderer/overlay/overlay.js](src/renderer/overlay/overlay.js) - Click injection expects real screen coordinates: [src/main/ui-automation/mouse/click.js](src/main/ui-automation/mouse/click.js) - Existing point-based UIA query in visual awareness: [src/main/visual-awareness.js](src/main/visual-awareness.js) ### 2) Foreground (Win32) vs focus (UIA) are not the same The PDF explicitly notes `AutomationElement.SetFocus()` does **not** necessarily bring an element/window to the foreground or make it visible. Implication: - Keep Win32 foreground/z-order primitives for `front/back`. - Treat UIA `SetFocus()` as “keyboard focus within the already-visible UI”. Use it as a complement before pattern actions (Value/Invoke/etc.), not as the mechanism for “bring to front”. Relevant code touchpoints: - Window primitives: [src/main/ui-automation/window/manager.js](src/main/ui-automation/window/manager.js) - Agent action executor focus path: [src/main/system-automation.js](src/main/system-automation.js) ### 3) UIA patterns are the reliable interaction API (use mouse as fallback) The PDF surfaces the standard interaction patterns: - Invoke, Value, Scroll, ExpandCollapse, Toggle, Selection/SelectionItem, Text, WindowPattern, etc. Implication: - Prefer pattern-based interaction (Invoke/Value/Scroll/ExpandCollapse/Toggle/SelectionItem) over “click center of rectangle”. - When mouse fallback is required, prefer `TryGetClickablePoint` over rect-center whenever possible. Relevant code touchpoints: - Element click pipeline: [src/main/ui-automation/interactions/element-click.js](src/main/ui-automation/interactions/element-click.js) - System action dispatcher: [src/main/system-automation.js](src/main/system-automation.js) ### 4) Event-driven watcher is possible but requires a **persistent managed host** UIA event APIs (`Automation.AddAutomationFocusChangedEventHandler`, `AddStructureChangedEventHandler`, `AddAutomationPropertyChangedEventHandler`, plus `TextPattern.*` events via `AddAutomationEventHandler`) require long-lived registrations. Implication: - The current polling-based PowerShell watcher cannot be “made event-driven” with small tweaks; event subscriptions need to run inside a persistent .NET process. - The repo already has .NET UIA programs; they are the natural place to add an event-stream mode. Relevant code touchpoints: - Polling watcher today: [src/main/ui-watcher.js](src/main/ui-watcher.js) - Existing .NET hosts: [src/native/windows-uia-dotnet/Program.cs](src/native/windows-uia-dotnet/Program.cs), [src/native/windows-uia/Program.cs](src/native/windows-uia/Program.cs) ### 5) Performance guidance matters The PDF calls out that `AutomationElement.GetSupportedPatterns()` can be expensive. Implication: - Avoid calling `GetSupportedPatterns()` in hot paths (poll loops / frequent updates). - When snapshots are needed, consider UIA `CacheRequest`/`GetUpdatedCache(...)` patterns in the managed host. --- ## Implementation plan (phased) ### Phase 0 — Give the AI “human vision” (capture → analyze → overlay regions → grounded actions) **Why (high priority):** This is the shortest path to “AI can see what users see” using existing primitives, and it directly enables safer, more reliable action selection from the overlay. Work items: 1) Standardize “visual context” as a typed artifact - Define a shared schema for a visual frame that always includes: - `dataURL` (or base64), `width`, `height`, `timestamp` - `origin` / offsets (`x`,`y`) when capturing a region - `coordinateSpace` (physical screen pixels) - Ensure the same schema is used for: - Full screen captures (`capture-screen`) - ROI captures (`capture-region`) - Optional window/element captures using the existing UI automation screenshot module: [src/main/ui-automation/screenshot.js](src/main/ui-automation/screenshot.js) 2) Make `{"type":"screenshot"}` a scoped capture request (not just “some screenshot”) - The action executor already supports a `screenshot` action as a control signal. - Extend the action schema to support (without adding new UX): - `scope: "screen" | "region" | "window" | "element"` - `region: { x, y, width, height }` (physical coordinates) - `hwnd` / window criteria (for window capture) - Element criteria (for element capture) - This lets the AI request *exactly* the pixels it needs for reasoning and verification. 3) ROI-first capture for overlay selection + inspect - When the user selects an inspect region (or hovered region), capture a tight ROI around it and store it as visual context. - Use ROI capture as the default for “describe this area” / “what is this control?” prompts. 4) Wire “visual awareness” analysis into inspect regions (OCR + UIA + heuristics) - Run `visualAwareness.analyzeScreen(...)` on the latest visual frame (or ROI) to produce: - OCR text blobs - UIA element candidates - Active window context - Convert these into `InspectRegion` objects (source `ocr` / `accessibility` / `heuristic`) and push them through the existing region merge logic: - [src/main/inspect-service.js](src/main/inspect-service.js) - [src/shared/inspect-types.js](src/shared/inspect-types.js) - Feed the merged regions into the overlay’s existing `update-inspect-regions` path. 5) Add region-grounded action targeting (AI acts like a human pointing) - Extend the action contract so the AI can target by: - `targetRegionId` (stable) or `targetRegionIndex` (as displayed by overlay) - Optional `targetClickPoint` if provided by UIA (`TryGetClickablePoint`) - Resolve those targets in main using inspect-service’s region registry, then execute via existing safe click paths. 6) Make visual context inclusion deterministic (not keyword-heuristic) - Today, `includeVisualContext` is enabled by keyword heuristics and/or existing visual history. - For overlay-driven interactions and region-based actions, force `includeVisualContext: true` with the corresponding ROI frame. 7) Ensure multimodal calls always use a vision-capable model - The AI layer already supports vision-capable models and builds provider-specific image message payloads. - Keep (and make explicit in the plan) the invariant: if a message contains images, route to a vision-capable model automatically (fallback as needed). Acceptance criteria: - After the user captures the screen once, the AI can answer “what’s on screen?” with visual grounding (not just Live UI State). - When the user selects a region, the AI receives an ROI image of that region and can propose actions referencing it. - The AI can execute an action like “click region #N” without guessing coordinates. Primary files: - Capture + storage: [src/main/index.js](src/main/index.js), [src/main/ai-service.js](src/main/ai-service.js) - Analysis: [src/main/visual-awareness.js](src/main/visual-awareness.js) - Region registry: [src/main/inspect-service.js](src/main/inspect-service.js) - Overlay render + hit-test: [src/renderer/overlay/overlay.js](src/renderer/overlay/overlay.js) ### Phase 1 — Coordinate contract + multi-monitor correctness (highest leverage) **Why:** UIA + input injection both assume physical screen coordinates; today overlay coordinates are not explicitly converted and the overlay is sized to the primary display. Work items: 1) Define a single coordinate contract for actions and regions - Add a clear contract document section (in this file or a short follow-up doc) stating: - Region bounds are in physical screen coordinates. - Optional `clickPoint` is also in physical screen coordinates. - Every region/action includes the coordinate space. 2) Convert overlay pointer coordinates to physical screen coordinates before action execution - Implement conversion in the overlay→main IPC boundary. - Ensure “screenX/screenY” is not used for unconverted values. 3) Make overlay cover the **virtual desktop** (union of all displays) - Replace primary-only sizing with a union-of-displays rectangle. - Ensure regions on a non-primary monitor render and are clickable. 4) Make capture cover the **virtual desktop** too - Current capture paths are primary-display sized and positioned (x=0,y=0). - Update capture to support: - Multi-display captures (one per display) with per-display offsets - Or a stitched virtual-desktop capture with correct origin - Ensure ROI cropping uses the same coordinate basis as overlay regions. Acceptance criteria: - Clicking a point selected on the overlay lands on the correct pixel on 100% and scaled (125%/150%) displays. - Regions on monitor 2 can be selected and clicked with no offset. Primary files: - [src/main/index.js](src/main/index.js) - [src/renderer/overlay/overlay.js](src/renderer/overlay/overlay.js) - [src/main/ui-automation/mouse/click.js](src/main/ui-automation/mouse/click.js) ### Phase 2 — “Pick element at point” + stable element identity **Why:** DevTools-style interaction depends on reliable hit-testing and re-targeting without fragile “re-find by Name” logic. Work items: 1) Add a point-based element resolver using `AutomationElement.FromPoint(Point)` - Input: physical screen coordinates. - Output: element payload with bounding rectangle and key identity fields. 2) Add runtimeId to element payloads - Include `AutomationElement.GetRuntimeId()` in element results where feasible. - Use runtimeId as a session-scoped stable identity (better than AutomationId-only). 3) Add clickable point support - Prefer `TryGetClickablePoint(out Point)` and store `clickPoint` when available. Acceptance criteria: - Given a screen point, the system returns an element with bounding rectangle + (when available) clickable point + runtimeId. - The element can be “re-resolved” later in the same session without relying on Name-only matching. Primary files: - [src/main/system-automation.js](src/main/system-automation.js) - [src/main/visual-awareness.js](src/main/visual-awareness.js) - [src/native/windows-uia-dotnet/Program.cs](src/native/windows-uia-dotnet/Program.cs) ### Phase 3 — Pattern-first interaction primitives (DevTools-like “actions”) **Why:** Bounding rectangles are not guaranteed clickable; patterns are the intended automation surface. Work items: 1) Add ValuePattern-based set value - New high-level operation: set value on a target element. - Prefer `ValuePattern.SetValue(string)`. - Fallback: focus + typing only when ValuePattern is not supported. 2) Add ScrollPattern-based scrolling - New operation: scroll a specific element/container. - Prefer `ScrollPattern.Scroll(...)` or `SetScrollPercent(...)`. - Fallback: mouse wheel simulation. 3) Add ExpandCollapsePattern operations - Expand/collapse tree/menu items without coordinate clicking. 4) Add TextPattern read support (inspection) - New inspection feature: read text content via `TextPattern.DocumentRange` where supported. Acceptance criteria: - For a control that supports a pattern, actions succeed without mouse injection. - For a control that does not, the system returns a structured “pattern unsupported” result and falls back only when safe/appropriate. Primary files: - [src/main/system-automation.js](src/main/system-automation.js) - [src/main/ui-automation/interactions/element-click.js](src/main/ui-automation/interactions/element-click.js) ### Phase 4 — Event-driven watcher (optional, but aligns strongly with UIA) **Why:** Polling is coarse and expensive; UIA events can provide fast deltas, but only with a persistent host. Work items: 1) Extend the .NET UIA host to support an “event stream” mode - Register focus changed handler (system-wide) only when inspect mode is enabled. - On focus changes, attach structure/property-changed handlers to the focused window subtree. - Emit JSON deltas over stdout. 2) Update Node watcher to support “event backend” - Spawn the managed host; translate deltas into the existing overlay region update format. - Keep polling as a fallback/recovery mechanism. Acceptance criteria: - With inspect mode enabled, regions update within <250ms after UI changes without full rescans. - The pipeline recovers gracefully when elements disappear (no crashes; falls back to re-snapshot). Primary files: - [src/main/ui-watcher.js](src/main/ui-watcher.js) - [src/main/index.js](src/main/index.js) - [src/native/windows-uia/Program.cs](src/native/windows-uia/Program.cs) --- ## Window operations alignment (follow-up hardening) Window z-order/state primitives exist, but the PDF suggests we should treat UIA window semantics as first-class for validation and state constraints. Work items: - Unify “bring to front” implementation across CLI and agent actions so they behave consistently under foreground-lock constraints. - Optionally consult `WindowPattern` for capability checks (`CanMinimize/CanMaximize`) and state confirmation, while still using Win32 for actual foreground/z-order. Primary files: - [src/main/system-automation.js](src/main/system-automation.js) - [src/main/ui-automation/window/manager.js](src/main/ui-automation/window/manager.js) - [src/cli/commands/window.js](src/cli/commands/window.js) --- ## Proposed deliverables - This plan file (you are reading it). - A small set of targeted PRs, ideally one per phase: - Phase 1: coordinate contract + virtual desktop overlay - Phase 2: point picking + runtimeId + clickable points - Phase 3: pattern-first actions (value/scroll/expand/text) - Phase 4: optional event-host + event backend ## Suggested validation (repo-local) - Extend existing script-based tests under [scripts/](scripts/) where feasible. - Add manual smoke steps: - Multi-monitor: verify overlay regions render on all displays and clicks land correctly. - DPI: verify click offsets at 125%/150% scale. - Pattern actions: verify ValuePattern/ScrollPattern/ExpandCollapse behave without mouse. - Watcher: verify inspect-mode gating of system-wide focus event subscriptions.