With MLLMs, Adept can watch a user's screen and create a summary recording of the processes the user is working on.
2
Multi-Modality
Adept Fuyu-powered agents can ingest images, not just text. This helps them understand not only what is written on a user's screen but also how the screen looks, which lets them efficiently detect objects on it.
For example, in the diagram below, the model is able to identify whether the second email is starred purely through image ingestion and a prompt, without explicit coding constructs.
Traditional RPA tools today have only a limited ability to ingest on-screen text and base decisions on unstructured text. That ability is useful mainly in scenarios where one would otherwise resort to keyword-based approaches to build complex logic. However, much of the logic in complex enterprise automation relies on both images and text.
Building logic equivalent to what Fuyu can do with a simple prompt takes multiple lines of code, with complex HTML DOM-based identification mechanisms and coding constructs that an average business user won't be able to build.
Following is a snippet from Automation Anywhere’s package repo:
3
In-House vs 3P LLM API
If Fuyu can run inference on the edge (e.g. on someone's computer), ingesting images and text and running the model locally is a more secure approach.
Most RPA vendors leverage 3P models such as OpenAI's, which act as sub-processors. This adds privacy concerns, since enterprise data is sent to a third-party cloud, which may not be compatible with customers' security postures.
4
Runtime Architecture
The Adept agent runs natively in the browser, which lets it run in any browser instance independent of the OS type.
Caveat: For desktop apps, I’m not sure how the Adept agent architecture would pan out and whether it would have cross-OS compatibility.
Most RPA agents are designed to run on Windows: a local exe agent, built using the JDK, is downloaded from the web app and installed, and it performs local execution, interfacing with web browsers.
Most RPA vendors cannot interface with Linux or Mac systems; however, an increasing number of users access browsers from these systems.
5
Nature of Model
Fuyu has a simpler architecture optimized for image capture and Q&A, which is very specific to the use cases RPA is most used for. Thus, the efficiency and accuracy gains from using a dedicated model, as opposed to a generic one, should be much larger. This may translate to lower inference times and higher accuracy regardless of the deployment type (on-prem versus cloud-based).
Existing RPA tools leverage an OOTB GPT-4 model which, although powerful, is not dedicated to enterprise user screens, and this directly affects efficiency and accuracy.
6
Fine Tuning
If Adept plans to provide fine-tuning capabilities, whereby one can take an OOTB Fuyu model and use techniques such as RLHF and reward modelling to improve an agent's performance over time, that would be a great advantage over traditional RPA vendors.
Existing RPA vendors integrate with a cloud-based GPT-4 model, which may or may not be fine-tuned on customer images.
Additionally, fine-tuning a cloud model for a customer's use cases may involve sending data to a 3P cloud, which may not be acceptable to all customers.
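To make the Multi-Modality row more concrete, here is a minimal sketch of what DOM-based identification of "is the second email starred?" might look like in Python. It is not the Automation Anywhere snippet referenced above, and it is not how any particular RPA product works. The inbox markup, the `email-row` and `star` class names, the `aria-checked` attribute, and the `is_email_starred` helper are all hypothetical, chosen only to illustrate the amount of structure-specific code this approach tends to need.

```python
from html.parser import HTMLParser

# Hypothetical inbox markup; real webmail DOMs differ and change over time.
SAMPLE_INBOX = """
<table>
  <tr class="email-row"><td><span class="star" aria-checked="false"></span></td><td>Invoice</td></tr>
  <tr class="email-row"><td><span class="star" aria-checked="true"></span></td><td>Quarterly report</td></tr>
  <tr class="email-row"><td><span class="star" aria-checked="false"></span></td><td>Lunch</td></tr>
</table>
"""


class StarParser(HTMLParser):
    """Collects the starred state of each email row, in document order."""

    def __init__(self):
        super().__init__()
        self.starred = []      # one bool per email row
        self._in_row = False   # True while inside an email row

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "tr" and "email-row" in classes:
            self._in_row = True
            self.starred.append(False)  # default until a star element is seen
        elif self._in_row and tag == "span" and "star" in classes:
            self.starred[-1] = attrs.get("aria-checked") == "true"

    def handle_endtag(self, tag):
        if tag == "tr":
            self._in_row = False


def is_email_starred(html: str, position: int) -> bool:
    """Return True if the email at 1-based `position` is starred."""
    parser = StarParser()
    parser.feed(html)
    if not 1 <= position <= len(parser.starred):
        raise IndexError(f"no email at position {position}")
    return parser.starred[position - 1]


print(is_email_starred(SAMPLE_INBOX, 2))  # True for the sample markup above
```

Every selector here is tied to one page layout, so a change in class names or nesting breaks the check. That brittleness is the contrast with a single prompt over a screenshot.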