Performance & Reliability with Task Complexity
Operator’s scalability and reliability correlate strongly with workflow complexity. As we tested tasks ranging from simple shopping to complex multi-entity travel bookings, we observed predictable performance in some dimensions and degradation in others, tied both to product orchestration and to CUA’s model-level capabilities.
Browsing Performance
CUA performs well on familiar websites, especially those in OpenAI’s preset use-case list (e.g., Zillow), demonstrating consistent UI navigation, error recovery, and flow control. However, performance drops noticeably on unfamiliar or non-standard third-party sites. In a travel booking example involving the AVIS website, Operator repeatedly failed to select dates from a calendar picker, eventually requiring human intervention. This reflects a broader limitation of vision-only systems: they struggle with diverse, human-optimized UI patterns like dropdowns, toggles, and custom calendars. Unsurprisingly, CUA handled known applications like Zillow more gracefully, suggesting its performance depends heavily on training exposure. Improving generalization will require expanding the model’s pretraining or fine-tuning to a broader set of third-party sites, possibly mined from Operator usage data, and combining vision with DOM-level pattern matching, as sketched below.
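To make that fallback concrete, here is a minimal sketch of hybrid control using Playwright’s sync API: when the vision model cannot operate a custom widget, the agent tries DOM-level patterns instead. The selectors and URL are illustrative assumptions, not AVIS’s actual markup.

```python
# Hypothetical sketch: falling back from vision-based control to DOM-level
# pattern matching when a custom widget (e.g., a calendar picker) defeats
# the vision model. Assumes Playwright's sync API; the selectors and URL
# are illustrative, not AVIS's actual markup.
from playwright.sync_api import sync_playwright

DATE_INPUT_PATTERNS = [
    "input[type='date']",               # native date input
    "input[aria-label*='pick-up' i]",   # ARIA-labelled custom picker
    "[role='textbox'][placeholder*='date' i]",
]

def set_date_via_dom(page, iso_date: str) -> bool:
    """Try common DOM patterns before escalating to a retry or a human."""
    for selector in DATE_INPUT_PATTERNS:
        element = page.query_selector(selector)
        if element:
            element.fill(iso_date)  # bypasses the visual calendar entirely
            return True
    return False

with sync_playwright() as p:
    page = p.chromium.launch().new_page()
    page.goto("https://example.com/booking")  # placeholder URL
    if not set_date_via_dom(page, "2025-07-14"):
        print("DOM fallback failed; escalate to vision retry or human.")
```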
Missing Higher Order Reasoning & Task Decomposition
Today’s autonomy models like CUA excel at localized execution, handling filters, dropdowns, and straightforward form submissions within individual sites. But when faced with more open-ended goals that require context, prioritization, or clarification, they often fall short. For example, in a shopping use case for a hiking jacket under $100, Operator failed to ask follow-up questions about brand preference, usage needs, or budget flexibility, factors a more intelligent agent would infer or clarify rather than wait for the user to specify explicitly. This behavior highlights the gap in higher-order reasoning: agents need the ability to break down broad user intent into modular subtasks, resolve missing inputs, and coordinate decisions across steps. A useful contrast here is Manus, which generates a structured todo.md file from a user goal, listing out discrete subtasks like “search for flights,” “book hotel,” or “rent a car.” Each is then executed independently through browser-based actions, effectively decoupling planning from execution. This scaffolded approach, sketched below, mirrors how a competent human assistant would operate: break down, prioritize, then act. Bridging this gap at scale will require both model-level improvements (reasoning traces, intent abstraction) and product scaffolding that treats planning as a first-class citizen.
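A minimal sketch of this plan-then-act scaffold follows. The planner, clarifier, and executor are stubs standing in for a model call, a user prompt, and a browser-automation layer; none of them reflect Manus’s or Operator’s actual internals.

```python
# Minimal sketch of a Manus-style plan-then-act scaffold. The planner,
# clarifier, and executor below are stubs; none reflect Manus's or
# Operator's actual internals.
from dataclasses import dataclass, field

@dataclass
class Subtask:
    description: str                                   # e.g., "book hotel"
    missing_inputs: list[str] = field(default_factory=list)
    result: str | None = None

def plan(goal: str) -> list[Subtask]:
    # Stub planner: a real system would ask the model to emit a
    # todo.md-style list of subtasks for `goal`. Hard-coded so it runs.
    return [
        Subtask("search for flights", missing_inputs=["departure city"]),
        Subtask("book hotel", missing_inputs=["check-in date"]),
        Subtask("rent a car"),
    ]

def ask_user(questions: list[str]) -> dict[str, str]:
    # Stub clarifier: a real agent would surface these questions to the
    # user *before* acting, instead of silently guessing.
    return {q: "<user answer>" for q in questions}

def browser_execute(description: str) -> str:
    # Stub executor: a real system would drive the browser (e.g., via CUA).
    return f"done: {description}"

def run(goal: str) -> list[Subtask]:
    todos = plan(goal)                      # planning is decoupled...
    for task in todos:
        if task.missing_inputs:             # ...input gaps resolved up front...
            answers = ask_user(task.missing_inputs)
            task.description += f" (constraints: {answers})"
        task.result = browser_execute(task.description)  # ...from execution
    return todos

for task in run("plan a weekend trip to Denver"):
    print(task.result)
```

The design choice worth noticing is the separation of concerns: the plan is a durable artifact the user can inspect and amend, and clarification happens before any irreversible browser action.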
Memory, Hallucination & Information Reliability
Hallucinations in long-range workflows often stem from architectural gaps, specifically the lack of stateful memory, task decomposition, and validation mechanisms. In one test, Operator fabricated hotel options without actually conducting a search, an error driven less by interface control and more by token-level prediction trying to maintain conversational momentum. This reflects a deeper architecture gap: without structured task decomposition, discrete tool use per sub-task, and state retention of outputs, the model cannot reliably validate intermediate steps. Hallucinations like these aren’t UI failures; they’re orchestration failures, driven by the absence of persistent memory and grounded planning. While o3 demonstrates strong retrieval-anchored reasoning in web search, similar scaffolds have yet to extend to Operator-style task execution. Bridging this gap will require integrating persistent context memory, modular tool orchestration, and output validation layers (one such layer is sketched below) to enable trustworthy, long-range autonomy.
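One way to picture such a validation layer: retain every tool output as session state and refuse to surface any claim that cannot be traced back to one. The sketch below is a deliberately naive assumption, not Operator’s internals; the substring check stands in for what would realistically be an entailment model or structured citations.

```python
# Hedged sketch of an orchestration-level guard against fabricated results:
# every claim the agent surfaces must trace back to a recorded tool output.
# The substring grounding check is illustrative, not Operator internals.
from dataclasses import dataclass

@dataclass
class ToolCall:
    tool: str      # e.g., "hotel_search"
    args: dict
    output: str    # raw result retained as ground truth

class SessionState:
    """Persistent per-task memory of every tool invocation."""
    def __init__(self) -> None:
        self.calls: list[ToolCall] = []

    def record(self, tool: str, args: dict, output: str) -> str:
        self.calls.append(ToolCall(tool, args, output))
        return output

    def supports(self, claim: str) -> bool:
        # Naive grounding: the claim must appear in some recorded output.
        return any(claim.lower() in c.output.lower() for c in self.calls)

state = SessionState()
state.record("hotel_search", {"city": "Austin"},
             "Hotel Saint Cecilia, $310/night")

claim = "Hotel Saint Cecilia, $310/night"
fabricated = "The Ritz-Carlton, $150/night"   # never returned by any tool
print(state.supports(claim))       # True  -> safe to surface
print(state.supports(fabricated))  # False -> re-run the search, don't guess
```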
Retrieval Across Structured & Unstructured Systems (CUA)
CUA can extract unstructured information reliably when explicitly directed, scrolling across a multitude of websites to find relevant details, but it struggles to autonomously retrieve necessary supporting information unless prompted. For complex workflows where structured and unstructured information (e.g., booking portals, calendars, local maps) must be synthesized, CUA lacks proactive information-dependency modeling. This again ties back to the absence of higher-order reasoning: understanding broader task goals, identifying implicit data gaps, and actively retrieving cross-system information are critical for seamless agent execution, as the sketch below illustrates.
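As a toy illustration of explicit information-dependency modeling, a workflow can declare which facts it needs and which system supplies each, then detect gaps before acting. The fact and source names here are assumptions made for the sketch.

```python
# Toy illustration of information-dependency modeling: declare which facts
# a workflow needs and which system supplies each, then detect gaps before
# acting. Fact and source names are assumptions for the sketch.
REQUIRED_FACTS = {
    "checkin_date": "calendar",              # structured source
    "hotel_options": "booking_portal",       # structured source
    "neighborhood_reviews": "local_maps",    # unstructured source
}

def missing_dependencies(known: dict[str, str]) -> list[tuple[str, str]]:
    """Return (fact, source) pairs the agent must still retrieve."""
    return [(fact, src) for fact, src in REQUIRED_FACTS.items()
            if fact not in known]

known_facts = {"checkin_date": "2025-08-02"}
for fact, source in missing_dependencies(known_facts):
    print(f"Gap: need '{fact}' -> proactively query {source}")
```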
Human-In-The-Loop Intensity and ROI Trade-Off
As task complexity grows, with more systems, more dependencies, and more sequential or parallel steps, Operator requires heavier human intervention. Users must prompt, correct, or revalidate at key workflow stages, making high-complexity automation less efficient relative to manual completion in the short term. This is a common pattern in early automation systems. However, personalization, memory accumulation, and vertical-specific tuning over time could shift the ROI curve in favor of Operator, reducing friction and enabling scalable task automation even in complex, multi-step workflows.