Operator Testing

H - Travel Booking from SEA to SFO

One of the most illustrative consumer use cases for autonomous agents like OpenAI Operator is travel booking, a task that seems simple at a glance but reveals deep complexity on closer inspection. Take the example of a user based in Seattle planning a mid-month trip to San Francisco. The task spans multiple sub-steps across different applications, from checking calendar availability to booking flights and hotels and aligning car rentals, each layered with user-specific constraints such as schedule, preferences, loyalty programs, and budget. The information is scattered across web and desktop interfaces, which makes this a strong candidate for end-to-end automation by agents.
| Core Step | Information & Application | Nuance / Constraints |
| --- | --- | --- |
| Identify optimal travel dates | App: Calendar (Web/Desktop), Email, Notes. Info: Days/Availability | Constraints from existing meetings & schedule; distance from user’s home address to airport; personal preferences on travel times |
| Select flight options | App: Travel portals (Google Flights, Expedia, etc.) | Budgetary constraints; loyalty programs & airline preferences; preferred airport & proximity to home |
| Book hotel stay | App: Hotel sites (Booking.com, Hotels.com) | Location near meeting site; check-in/out aligned with flights; room type preferences & constraints |
| Reserve rental car | App: Rental services (Hertz, Avis, etc.) | Pickup/dropoff synced with flights; car type, budget, company policy |
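The decomposition above can be written down as data, which makes it easier to see how many applications a single trip touches. The following is a minimal sketch only: the `Step` class and the `apps_needed` helper are hypothetical names introduced for illustration, not part of Operator or any travel API, and the entries simply restate the steps, applications, and constraints listed above.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One sub-task of the trip (hypothetical structure for illustration)."""
    name: str
    apps: list[str]
    constraints: list[str] = field(default_factory=list)


# The four core steps, copied from the breakdown above.
TRIP_PLAN = [
    Step("Identify optimal travel dates",
         ["Calendar", "Email", "Notes"],
         ["Existing meetings & schedule",
          "Distance from home address to airport",
          "Personal preferences on travel times"]),
    Step("Select flight options",
         ["Travel portals"],
         ["Budget", "Loyalty programs & airline preferences",
          "Preferred airport & proximity to home"]),
    Step("Book hotel stay",
         ["Hotel sites"],
         ["Location near meeting site",
          "Check-in/out aligned with flights",
          "Room type preferences & constraints"]),
    Step("Reserve rental car",
         ["Rental services"],
         ["Pickup/dropoff synced with flights",
          "Car type, budget, company policy"]),
]


def apps_needed(plan: list[Step]) -> list[str]:
    """Return the distinct applications the plan touches, in first-seen order."""
    seen: list[str] = []
    for step in plan:
        for app in step.apps:
            if app not in seen:
                seen.append(app)
    return seen
```

Calling `apps_needed(TRIP_PLAN)` lists every application category an agent would have to navigate for this one request.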
Prompt: I am traveling from Seattle to the Bay Area, planning to travel from May 14th - 18th, 2025. I need to book a flight, a rental car and hotels.
Accommodation: I want to stay close to Peninsula or Palo Alto, but don’t mind staying in San Francisco either, however parking a car can be tough there which is why I prefer it less. I am cost conscious here, and so don’t mind staying a bit far too, to stay under the $200 per day budget.
Flight: I need to book a flight after work (5:00 PM) from Seattle to San Francisco on Wednesday, May 14th, that gets me to San Francisco or nearby airports such as Oakland or San Jose. I typically prefer flights that aren’t too late and so don’t mind leaving a little earlier from office either. For return flight, I typically fly back on Sunday evenings between 3 PM & 9 PM in a way that gets me back before midnight to Seattle, so I have time to rest and get ready for the next week.
Rental Car: I typically prefer Avis rental cars but if you cannot get those, I wouldn’t mind going through Turo either, based on price and availability. I typically try to pick up the car close to the airport or at the airport (thus Avis as a choice) and return it back prior to going to the airport.
Show me options for each one by one and piece them together in the right combinations to tell me the trade off from price, comfort / amenities & proximity to where my business meetings would be (Around Palo Alto & San Francisco). Be as comprehensive as you can.
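To make the "piece them together" request concrete, here is a rough sketch of how the hard constraints in the prompt could be applied to candidate options before combining them. Everything in it is an assumption for illustration: the `Flight`, `Hotel`, and `Itinerary` classes and the function names are hypothetical, times are treated as naive Pacific local times, the 5:00 PM cutoff is treated as fixed even though the prompt allows leaving a little earlier, and $200 is treated as an inclusive nightly cap. No real fares or rates are included.

```python
from dataclasses import dataclass
from datetime import date, datetime, time
from itertools import product

OUTBOUND_DAY = date(2025, 5, 14)   # Wednesday
RETURN_DAY = date(2025, 5, 18)     # Sunday
NIGHTS = 4                         # May 14 to May 18
NIGHTLY_BUDGET = 200.0             # assumed inclusive cap


@dataclass(frozen=True)
class Flight:
    label: str
    depart: datetime
    arrive: datetime
    price: float


@dataclass(frozen=True)
class Hotel:
    label: str
    nightly_rate: float


@dataclass(frozen=True)
class Itinerary:
    outbound: Flight
    inbound: Flight
    hotel: Hotel

    @property
    def total(self) -> float:
        """Flights plus hotel for the whole stay (rental car left out)."""
        return (self.outbound.price + self.inbound.price
                + NIGHTS * self.hotel.nightly_rate)


def outbound_ok(f: Flight, earliest: time = time(17, 0)) -> bool:
    """Departs on May 14 at or after the cutoff (5:00 PM by default)."""
    return f.depart.date() == OUTBOUND_DAY and f.depart.time() >= earliest


def return_ok(f: Flight) -> bool:
    """Departs Sunday between 3 PM and 9 PM and lands before midnight."""
    in_window = time(15, 0) <= f.depart.time() <= time(21, 0)
    return (f.depart.date() == RETURN_DAY and in_window
            and f.arrive.date() == RETURN_DAY)


def build_itineraries(outbound, inbound, hotels):
    """Combine every valid outbound, return, and hotel; cheapest first."""
    combos = [
        Itinerary(o, r, h)
        for o, r, h in product(
            filter(outbound_ok, outbound),
            filter(return_ok, inbound),
            (h for h in hotels if h.nightly_rate <= NIGHTLY_BUDGET),
        )
    ]
    return sorted(combos, key=lambda it: it.total)
```

Sorting by total price is only one of the trade-offs the prompt asks for; comfort and proximity to Palo Alto or San Francisco would need their own fields and are left out of this sketch.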
Parameter
Description
UX - Information Exchange
Used the following global-level customization settings as well as system-level prompts:
Customization Settings: I am generally a budget conscious customer who is generally traveling for networking and business trips, and so ease of travel and time saving is important but it has to be traded off with price. I like being given a few options, trade offs between price, timing and comfort and like to make my own decisions. I live in [NAME OF PLACE] currently, am [AGE] and work in tech. Ask me as many clarifying questions before you go off on a task and execute so you have all the right information but make sure you think through the steps and gather all the right information.
Booking.com Personalization: I typically like looking for hotels under $200 a night, and don't mind staying slightly far away. You can ask me for my credit card information and fill forms / do the booking on my behalf
Priceline & Expedia Personalization: I typically prefer flights under $500 round trip for domestic travel, and don't go above that unless there are no other options. I am fine with flying economy / basic economy and don't carry a lot of luggage
Turo Personalization: I typically prefer cars under $50 per day of base fare and typically like lightweight, fuel efficient sedans or SUVs. Since I am tall, I prefer SUVs but don't mind a sedan if the price trade off is correct
Clarification & Higher-Order Thinking
No specific clarification was needed here and it seemed like Operator was able to pick up the global instructions I had put together, as well as the application specific prompts.
Task Decomposition & Modularity
For this complex multi-step use case, Operator started booking items one by one even though the prompt said I wanted to piece all the options together before making any booking. It tried to book the flight first, then the rental car, then the hotel, whereas as a user I typically look at all of these together.
It did the same with the car booking: it tried to book the rental car right after the flight, although I had already told it to present all the options holistically before booking, and I had to nudge it again with the same prompt.
Application Identification & Navigation
Avis website: For the high-complexity use cases that involved booking on the Avis website, Operator struggled to set dates and did not seem able to read the site's calendar inputs, although it kept retrying. This suggests its vision capabilities may be largely limited to the set of websites OpenAI trained the model on. I had to intervene and set the correct dates myself because it struggled consistently. Similarly, the model needed multiple attempts to set dates on Bing when looking for hotels.
Personalization
I entered my personalization for specific applications in the “Settings” tab, assuming Operator would book on Expedia. It went to Kayak instead, which made my app-specific personalization, where I had entered specifics on budget and timing, almost useless.
Risk Management, Intervention & Handover
Placeholder
Browser Navigation & Integration
Hallucination: After searching for flights, the agent asked for confirmation to go ahead with booking. I nudged it to look at the options holistically, and it switched to rental cars. It then asked again to book the rental car and was again nudged to look holistically. Subsequently, it hallucinated hotel results and presented them without browsing. When asked, it apologized for not having looked up hotel results and asked the user to proceed. We believe this tendency to keep trying to book one service at a time might result from the way the model was fine-tuned to finish tasks sequentially, and the hallucinations might result from it not being trained on scenarios where multiple sub-tasks require consolidation and trade-off analysis. This matters for complex, travel-style use cases.
Similarly, after it had searched for rental cars, hotels, and an onward flight, I asked it to put all the options together. It quoted a different onward flight from the one it had initially suggested, one it had not listed before, and it hallucinated a return flight even though it had not searched for one. Upon nudging, it went back and searched again.
Credential Management
None available currently. Operator asks for credentials when they are needed and hands control to the user to enter them.
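Two of the failure modes observed above, booking before all options were gathered and quoting options that were never browsed, could in principle be caught by simple bookkeeping around the agent. The sketch below is purely illustrative and says nothing about how Operator works internally: `PlanningSession` and its methods are hypothetical names, and options are represented as plain strings.

```python
class PlanningSession:
    """Tracks options actually seen while browsing (hypothetical helper)."""

    REQUIRED = ("flight", "car", "hotel")

    def __init__(self):
        # One list of observed options per required category.
        self.observed = {category: [] for category in self.REQUIRED}
        self.user_confirmed = False

    def record(self, category, option):
        """Store an option the agent actually saw on a page."""
        self.observed[category].append(option)

    def missing_categories(self):
        """Categories for which nothing has been browsed yet."""
        return [c for c in self.REQUIRED if not self.observed[c]]

    def ungrounded(self, summary):
        """Return (category, option) pairs in a summary that were never recorded."""
        return [(c, o) for c, o in summary
                if o not in self.observed.get(c, [])]

    def can_book(self):
        """Allow booking only once every category has options and the user agreed."""
        return not self.missing_categories() and self.user_confirmed
```

With a guard like this, a summary that lists a hotel before any hotel search would show up in `ungrounded(...)`, and `can_book()` would stay false until flights, cars, and hotels have all been browsed and the user has confirmed.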

Agent Struggling on the Avis Website

unnamed.png

Setting Dates on Bing

unnamed.png

Agent Hallucinating on Hotel Results without Explicit Navigation

unnamed.png