Operator Testing

M - Return Flight Booking

Prompt: Find me a return flight from SFO-ORD 7-10 May, 2025
Personalization Setting: None
Parameter
Description
UX - Information Exchange
No clarifying questions up front about airline or timing preferences, budget restrictions, or number of stops.
UX - Ease of Use
Relatively simple to use. The Operator experience is built on an embedded browser, so it cannot use the local browser (with its saved logins, cookies, etc.). This suggests that it is a “testing only” capability and that real adoption will come from the CUA model APIs.
Clarification & Higher-Order Thinking
It does not ask any questions to begin with; it immediately comes up with a plan and starts actuation. However, on returning the first set of results, it asked whether I had an airline preference. Had it asked this up front, it would have saved time and made the search more targeted.
Overall, this interaction wasn’t very strong; it didn’t seem to follow a logical chain of thought. This capability feels weaker than the reasoning traces observed when using GPT-4o for a similar task.
Task Decomposition & Modularity
Task decomposition (reasoning and planning) is average; I found some inconsistencies in the thought process. For example, the first search returned a lowest price of $457 for a direct flight, but after I gave some preferences and probed further, it found a cheaper alternative at $197.
When prodded for an explanation, it essentially ignored my question and reiterated the latest search results.
Application Identification
It performs a Google search with my request as the search query and clicks on the first link, which is on Skyscanner. From then on, it stays within that website and does not explore others.
Personalization
Through the interaction, I shared more information about what I was looking for, but the collection of this information was haphazard.
I did note, though, that Operator has integrations with travel websites like Booking.com, Priceline and Expedia, but at no point did it ask whether it could access my saved profile to better understand my preferences. This is an example of a case where feature discovery is difficult.
Risk Management, Intervention & Handover
No manual intervention was necessary, however
Browser Navigation & Integration
Performs well in this area; no browser actuation errors were observed.
Exception Handling & Reliability
No exceptions encountered. Reliability would be low: rather than understanding what the user truly wants, it follows a generic path of searching on Google and clicking on the first search result, which is bound not to be repeatable.
Time to Completion
Overall, a slow process. Since it primarily uses vision (a screenshot of the webpage is included at every step), the request payloads are larger than normal.
Consistency
Credential Management
None encountered. As noted in the personalization section, the model does not ask the user to link any accounts for a better experience.
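
Several rows above come back to the same gap: preferences (airline, timing, budget, stops) were collected only after the first search rather than before it. As a rough illustration of what an up-front clarification step could look like, here is a minimal Python sketch. It is not how Operator works internally; the `FlightPreferences` record, its field names, and the question wording are all hypothetical and chosen only to mirror the preferences listed under "UX - Information Exchange".

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class FlightPreferences:
    """Hypothetical preference record; field names are illustrative only."""
    airline: Optional[str] = None
    departure_window: Optional[str] = None
    max_budget_usd: Optional[int] = None
    max_stops: Optional[int] = None


def missing_preferences(prefs: FlightPreferences) -> list[str]:
    """Return the names of preferences the user has not yet supplied."""
    # Compare against None so that a legitimate value of 0 (e.g. zero stops)
    # still counts as supplied.
    return [f.name for f in fields(prefs) if getattr(prefs, f.name) is None]


def clarifying_questions(prefs: FlightPreferences) -> list[str]:
    """Build one question per missing preference, to ask before any search."""
    # Assumed question wording; a real agent would phrase these itself.
    prompts = {
        "airline": "Do you have a preferred airline?",
        "departure_window": "What departure times work for you?",
        "max_budget_usd": "Do you have a budget limit (USD)?",
        "max_stops": "What is the maximum number of stops you will accept?",
    }
    return [prompts[name] for name in missing_preferences(prefs)]


if __name__ == "__main__":
    # The user has only said they want a direct flight.
    prefs = FlightPreferences(max_stops=0)
    for question in clarifying_questions(prefs):
        print(question)
```

Asking only for the fields that are still empty keeps the exchange short, and the same record could in principle be pre-filled from a linked travel profile if the user agreed to share one.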