I do like the prompt-based approach to finding web elements on the screen. To the best of my understanding, this is a smaller/medium version of the Fuyu model under the hood. Here are a few cases where this may or may not work:
There are multiple elements on the screen with the same HTML names. See the example below, where "Acme - 1,200 Widgets" appears in "My Opportunities" as well as in the search results.
Open Question: Can I describe the nature of the element for it to search, and say things such as "Search result stating Acme - 1,200 Widgets" or "List item within the My Opportunities section"?
Recording / Pointing instead of Searching:
Existing workflow automation approaches leverage an HTML DOM-based path for web elements. They allow a user to configure a step by selecting objects on the screen. This lets a user mimic a process while the agent records actions such as "Click", "Enter Text", and "Scroll", which makes developing the agent's block code a lot easier.
I understand that Adept doesn't use a DOM-based model and is instead powered by object detection. However, if recorded user actions could be reverse-engineered into automatically generated prompts, or if snippets of the screen could be used as multi-modal model inputs, that would enhance the overall user experience.
Adapting to Changes in UI
Here are some ways in which I've seen UIs evolve and break rules-based workflow agents. While Adept may handle these more robustly, these edge cases could still be helpful to test on:
Application UI style stays the same, but the position of objects changes drastically from one corner of the screen to another, while the HTML code remains the same
Application UI style stays the same and positions stay the same, but the HTML positioning changes
Application style changes completely, and possibly the HTML code as well, e.g. Salesforce Lightning to Classic
An element type changes from a search box to a dropdown (and vice versa). This happens with a lot of on-prem applications.
Specific screens get removed and consolidated with other screens.
Sometimes, pop-ups show up in the middle of a flow, and they can be sporadic.
Sometimes, UI elements can be offset, especially when the browser zoom is not set to 100%
Taking HTML Addresses + Image + Text as Input
While I am not entirely sure which parameters Fuyu is ingesting, I am hoping that it can ingest the text inside an image, the look and feel of the image, and the HTML source for identifying objects accurately. In earlier tools, one could pick elements based solely on either the HTML source or the image. Leveraging these multiple signals together would make workflows more robust and create a bigger differentiating advantage. This may already be something that is being done under the hood, so please ignore it if that's the case.
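To illustrate the idea, here is a minimal, purely hypothetical sketch of blending the three signals (HTML path, visible text, visual similarity) into one score, so that a change in any single signal doesn't break element lookup. The `Candidate` type, weights, and scoring rule are all my own assumptions, not anything Adept has described:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    """A hypothetical on-screen element with the three signals discussed above."""
    html_path: str       # e.g. a CSS selector recorded at configure time
    text: str            # visible text extracted from the element
    visual_score: float  # 0..1 similarity to a stored screenshot snippet

def score(c: Candidate, want_path: str, want_text: str,
          weights=(0.4, 0.4, 0.2)) -> float:
    """Blend the three signals instead of relying on any single one."""
    w_html, w_text, w_vis = weights
    html_match = 1.0 if c.html_path == want_path else 0.0
    text_match = 1.0 if want_text.lower() in c.text.lower() else 0.0
    return w_html * html_match + w_text * text_match + w_vis * c.visual_score

def best_match(candidates, want_path, want_text):
    """Pick the candidate with the highest combined score."""
    return max(candidates, key=lambda c: score(c, want_path, want_text))
```

The point of the weighted blend is that even if the recorded HTML path goes stale after a UI change, the text and visual signals can still carry the match.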
Variables Manager
Variable Manager: I'm not entirely sure whether the enterprise version comes with this, but having a variable manager to define variable types, assign default values, and tag variables as task-specific or global is usually quite helpful. I talk about this in detail in
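As a rough sketch of what I mean, here is a toy variable manager with the three features described above (types, defaults, task vs. global scope). The class and method names are my own invention for illustration:

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class Variable:
    name: str
    var_type: type
    default: Any = None
    scope: str = "task"   # "task" or "global", as described above

class VariableManager:
    """Toy registry for workflow variables: typed, defaulted, scoped."""
    def __init__(self):
        self._vars: dict[str, Variable] = {}

    def define(self, name, var_type, default=None, scope="task"):
        if scope not in ("task", "global"):
            raise ValueError("scope must be 'task' or 'global'")
        self._vars[name] = Variable(name, var_type, default, scope)

    def get(self, name):
        return self._vars[name].default

    def globals(self):
        return [v.name for v in self._vars.values() if v.scope == "global"]
```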
Credentials Manager: For handling usernames and passwords, enterprises typically like them to be stored as secure variables kept centrally in an application database or pulled from their existing credentials manager (CyberArk, for example). Furthermore, access to these variables for specific tasks is handled by an RBAC defined by an admin.
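A minimal sketch of the RBAC gate described above, assuming a hypothetical in-memory store (in a real deployment the secrets would live in a vault such as CyberArk, never in process memory):

```python
class CredentialsManager:
    """Toy model: secure variables gated by a per-secret role check."""
    def __init__(self):
        self._secrets = {}   # name -> value
        self._acl = {}       # name -> set of roles allowed to read it

    def store(self, name, value, allowed_roles):
        self._secrets[name] = value
        self._acl[name] = set(allowed_roles)

    def fetch(self, name, role):
        """Only roles the admin granted may read the secret."""
        if role not in self._acl.get(name, set()):
            raise PermissionError(f"role {role!r} may not read {name!r}")
        return self._secrets[name]
```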
Transient UI Elements
There are instances where search results are transient and only show up once one clicks on the search bar. For example, the screenshot below is from Salesforce: if I would like the bot to click on "Acme - 140 Widgets (Sample)", there is no way for me to test it with the existing way of configuring a "Click" command. This is where a "recording" functionality comes in handy.
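One generic way to handle such transient elements is a reveal-then-poll helper: trigger the action that makes the element appear (clicking the search bar), then poll until the target shows up or a timeout expires. This is a hypothetical sketch with callables standing in for real browser actions:

```python
import time

def find_when_revealed(find, reveal, timeout=5.0, poll=0.25):
    """Call `reveal()` to trigger the transient UI (e.g. click the
    search bar), then poll `find()` until it returns the target
    element or the timeout expires. The caller clicks the result."""
    reveal()
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        element = find()
        if element is not None:
            return element
        time.sleep(poll)
    raise TimeoutError("target element never appeared")
```

With a real browser driver, `find` would be an element lookup and `reveal` the click that opens the dropdown; the polling pattern is the same.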
Dynamic Clicking Logic
Sometimes picking an option from a list needs some sort of dynamic search / wildcard-based approach. For example, out of the three Acme opportunities here, I may want to pick "140", but often I may not know exactly what I'm searching for. Here are a few examples of why this could be dynamic:
From the previous system step, I searched for a value and found that "Acme - 140" is the value I need to search for, and this gets passed on
Here is an example of how this is accomplished today step by step:
Find a common HTML element that changes as I cycle through the search options
Variable-ize the dynamic HTML element
Loop through every element and get text from each of the search results
Check whether the result contains 140.
If it does, issue a command that clicks on the specific element.
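The steps above can be sketched as a simple loop. The result dictionaries below are a hypothetical stand-in for the variable-ized HTML elements; in a real workflow, each entry would come from the DOM and the returned element would be handed to a "Click" command:

```python
def pick_result(results, needle):
    """Mirror the steps above: loop through the search results, read
    the text of each, and return the first one containing `needle`."""
    for element in results:
        if needle in element["text"]:
            return element
    return None   # nothing matched; the workflow should handle this

# Hypothetical search results, mirroring the Salesforce example:
results = [
    {"text": "Acme - 1,200 Widgets", "id": "opp-1"},
    {"text": "Acme - 140 Widgets (Sample)", "id": "opp-2"},
    {"text": "Acme - 600 Widgets", "id": "opp-3"},
]
match = pick_result(results, "140")
```

Because `needle` can come from a previous step's output, the same loop covers the dynamic case where the exact text isn't known at configure time.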
Choice of Browser
While I understand that this may be limited to Chrome for now, many organizations use browsers such as Edge, and some even use Internet Explorer to this day. Is the same agent architecture extensible to other browsers?
Existing vs New Browser
Having an option to pick between navigating to a URL in the existing browser window or a new one is quite helpful: in some scenarios, a user might want to switch web pages on the same screen, versus open them on new screens to perform steps that require a lot of back and forth between the two.
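A toy model of that choice, with tabs as a list (the class and method names are mine, purely for illustration of the navigate-in-place vs. open-alongside distinction):

```python
class BrowserSession:
    """Toy model of the 'existing vs. new tab' choice described above."""
    def __init__(self):
        self.tabs = ["about:blank"]
        self.active = 0

    def go_to_url(self, url, new_tab=False):
        if new_tab:
            self.tabs.append(url)          # open alongside the current page
            self.active = len(self.tabs) - 1
        else:
            self.tabs[self.active] = url   # replace the current page

    def switch_to(self, index):
        """Needed for the back-and-forth scenario between two pages."""
        self.active = index
```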
Error Handling
Error Message
This might be in the enterprise version, but I do not see any error handling or error messages when something fails. For example, I added a wrong command and it failed silently, without showing an error message. I talk about error handling in
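The minimum I'd expect is for each command to surface failures with context instead of failing silently. A hypothetical sketch of such a wrapper (names are my own):

```python
class CommandError(Exception):
    """Raised with context instead of letting a step fail silently."""

def run_command(name, fn, *args):
    """Run one agent command; re-raise any failure with the command
    name attached so the user sees what broke and why."""
    try:
        return fn(*args)
    except Exception as exc:
        raise CommandError(f"command {name!r} failed: {exc}") from exc
```

On top of this, a per-step retry count or an "on error, jump to step N" policy would cover the recovery side.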