Infra, Scaffolding & Actuation

Based on our understanding of CUA as an ensemble of models, their respective architectures, and how they are packaged and distributed across Operator as a product and the Responses API, the following points stand out.

Local Execution vs Remote Execution

CUA today follows a “screenshot-in, action instruction-out” architecture: it produces guidance on what to click or type, but leaves actuation to external tooling (e.g., Playwright). While cloud-hosted execution suits consumer environments, enterprise adoption demands local actuation, especially for desktop apps like SAP or in settings that restrict screenshots. Full on-device inference is impractical due to model size, but hybrid execution, where the model runs remotely and its actions are executed locally, is a realistic and necessary path forward. Over time, distilled task-specific variants of CUA could bring reasoning closer to the edge, enabling secure, performant execution across regulated enterprise workflows.
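The hybrid split described above can be sketched as a simple loop: the screenshot is captured locally, the decision is made remotely, and the action is applied locally. This is a minimal illustration, not CUA's actual interface. The `Action` schema, `LocalActuator`, `run_hybrid`, and the scripted stand-in for the remote model are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Action:
    """One instruction from the model (hypothetical schema)."""
    kind: str          # "click", "type", or "done"
    x: int = 0
    y: int = 0
    text: str = ""


class LocalActuator:
    """Stands in for local tooling; it only records what it was asked to do."""

    def __init__(self) -> None:
        self.log: List[str] = []

    def screenshot(self) -> bytes:
        # A real implementation would capture the local screen here.
        return b"fake-png-bytes"

    def apply(self, action: Action) -> None:
        if action.kind == "click":
            self.log.append(f"click({action.x},{action.y})")
        elif action.kind == "type":
            self.log.append(f"type({action.text!r})")
        else:
            raise ValueError(f"unsupported action: {action.kind}")


def run_hybrid(
    remote_model: Callable[[bytes, str], Action],
    actuator: LocalActuator,
    task: str,
    max_steps: int = 20,
) -> bool:
    """Loop: local screenshot -> remote decision -> local actuation."""
    for _ in range(max_steps):
        action = remote_model(actuator.screenshot(), task)
        if action.kind == "done":
            return True
        actuator.apply(action)
    return False  # step budget exhausted without the model reporting "done"


def make_scripted_model(script: List[Action]) -> Callable[[bytes, str], Action]:
    """Fake remote model that replays a fixed script (illustration only)."""
    steps = iter(script)
    return lambda _screenshot, _task: next(steps, Action("done"))
```

In this arrangement only the screenshot and the returned action cross the network boundary; input events never leave the local machine. Environments that restrict screenshots would still need a different observation channel, which this sketch does not address.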

Browser Actuation Mechanism

CUA currently detects UI components and suggests actions, but actuation relies on emulated mouse and keyboard input. While sufficient for controlled remote environments, this method can introduce fragility across variable browser setups or local enterprise machines. Using JavaScript injection-based actuation libraries (e.g., Playwright, Midscene.js) to interact directly with DOM elements would create faster, more accurate, and resolution-agnostic interactions, crucial for scaling across heterogeneous environments.
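To illustrate why DOM-level targeting is resolution-agnostic while coordinate-based targeting is not, the sketch below plans a click from either a selector or a screenshot pixel. The `Target` type and both function names are hypothetical, and the output is a plain description rather than a call into any real actuation library.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Size = Tuple[int, int]


@dataclass
class Target:
    """Where to act: a DOM selector if known, else screenshot pixels."""
    selector: Optional[str] = None
    pixel: Optional[Tuple[int, int]] = None


def scale_point(pixel: Tuple[int, int], shot_size: Size, viewport_size: Size) -> Tuple[int, int]:
    """Map a point from screenshot space to viewport space."""
    (x, y), (sw, sh), (vw, vh) = pixel, shot_size, viewport_size
    return (round(x * vw / sw), round(y * vh / sh))


def plan_click(target: Target, shot_size: Size, viewport_size: Size) -> str:
    """Describe the click to perform, preferring the selector when present."""
    if target.selector is not None:
        # Selector-based clicks do not depend on resolution at all.
        return f"click selector {target.selector}"
    if target.pixel is None:
        raise ValueError("target has neither selector nor pixel")
    # Pixel-based clicks must be rescaled whenever the screenshot and the
    # live viewport differ in size, which is one source of fragility.
    x, y = scale_point(target.pixel, shot_size, viewport_size)
    return f"click at ({x}, {y})"
```

For example, a pixel target of `(512, 384)` in a 1024x768 screenshot maps to `(1024, 768)` on a 2048x1536 viewport, whereas a selector such as `#search` needs no adjustment.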

Pre-Training on DOMX Paths

CUA currently relies on screenshot-prompt pairs to detect UI elements visually. However, most modern web apps expose a rich DOM structure, and pretraining models directly on DOMX paths, or blending DOM data with visual inputs, could meaningfully improve accuracy and reduce vision model dependence. Approaches like browser-use or the OSU NLP Group's SeeAct exemplify this hybrid path, where structured DOM data improves inference speed and stability while vision can serve as a fallback. That said, one key advantage of a vision-first approach is transferability across platforms, including desktop apps, where DOM data may not be available.
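One way to picture the DOM-first, vision-fallback idea is to reduce a page to a short indexed list of interactive elements and fall back to the screenshot only when that list is empty. The sketch below uses Python's standard `html.parser`. The set of tags, the label heuristics, and the function names are assumptions made for illustration and do not reflect how browser-use or SeeAct are implemented.

```python
from html.parser import HTMLParser
from typing import List

# Assumed set of tags worth exposing to the model.
INTERACTIVE_TAGS = {"a", "button", "input", "select", "textarea"}


class InteractiveElementIndexer(HTMLParser):
    """Collects interactive elements as short, numbered text lines."""

    def __init__(self) -> None:
        super().__init__()
        self.elements: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag not in INTERACTIVE_TAGS:
            return
        attr_map = dict(attrs)
        # Pick the first human-readable hint available (heuristic).
        label = (
            attr_map.get("aria-label")
            or attr_map.get("placeholder")
            or attr_map.get("name")
            or attr_map.get("id")
            or ""
        )
        self.elements.append(f"[{len(self.elements)}] <{tag}> {label}".rstrip())


def summarize_dom(html: str) -> str:
    """Return one line per interactive element, or "" if none were found."""
    indexer = InteractiveElementIndexer()
    indexer.feed(html)
    indexer.close()
    return "\n".join(indexer.elements)


def build_model_input(html: str, screenshot_available: bool) -> dict:
    """Prefer the DOM summary; fall back to vision when the DOM is unhelpful."""
    summary = summarize_dom(html)
    if summary:
        return {"mode": "dom", "elements": summary}
    if screenshot_available:
        return {"mode": "vision"}
    raise ValueError("no DOM elements and no screenshot")
```

A desktop application with no DOM would always take the vision branch here, which matches the transferability point made above.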

Dependence on Existing Browsers

CUA today operates over Chrome, a browser ultimately controlled by Google. Chrome’s evolving security and automation policies pose a risk to agentic browsing reliability. Long-term, there is a strong case for agent-native browsers: browsers built to support secure automation by agents, with native rendering of structured information across websites, credential management, and secure, high-frequency automation that does not violate user trust or platform norms. Browserbase and Hyperbrowser are examples of companies adopting this approach: instead of relying on existing browsers, they go a level deeper and control the browser layer.

Other Observed Issues

Infinite Loops

In a medium-complexity use case (finding hotels in Cabo), Operator repeatedly switched between search results and maps without showing meaningful progress, suggesting it was close to entering an infinite loop. Although it eventually exited, this behavior points to a lack of clear termination conditions when reasoning across visual inputs. Moreover, it remains unclear how Operator synthesized map data to infer proximity-based trade-offs without explicit user prompts. This reflects a deeper training gap: models need stronger reasoning guardrails when juggling multi-source inputs and clearer strategies for consolidating competing options, especially in workflows requiring dynamic prioritization.
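An explicit termination condition of the kind this observation calls for could be as simple as a revisit limit combined with a step budget. The sketch below is one possible guardrail, not a description of how Operator works. The function name, the thresholds, and the idea of reducing each step to a hashable state label (for example a page identifier) are assumptions.

```python
from collections import Counter
from typing import Hashable, Iterable, Optional


def first_stall(
    states: Iterable[Hashable],
    max_visits: int = 3,
    max_steps: int = 50,
) -> Optional[int]:
    """Return the step index at which the run should stop, or None.

    A run is stopped when any single state has been visited more than
    `max_visits` times, or when `max_steps` steps have been taken.
    """
    visits: Counter = Counter()
    for step, state in enumerate(states):
        if step >= max_steps:
            return step  # overall step budget exhausted
        visits[state] += 1
        if visits[state] > max_visits:
            return step  # same state seen too often: likely a loop
    return None


# The alternation seen in the hotel search, reduced to state labels.
trace = ["results", "map", "results", "map", "results", "map", "results"]
stall_step = first_stall(trace)  # fourth visit to "results" occurs at index 6
```

A revisit limit alone cannot tell a genuine loop from a legitimate comparison across two views, so in practice it would likely need to be paired with some measure of progress toward the task.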
