Real-Time Mobile Inferencing

How To Achieve Near-Zero LLM Latency
This effort documents a few approaches I’ve learned to nudge generative AI inference times toward near-zero latency.
First, a definition of near-zero latency.
Assume we agree that zero latency is defined as roughly the Doherty threshold (~400ms). This is approximately the response time at which a human begins to perceive latency in a text manipulation process.
Most experts in the field of mobile latency suggest response times for apps using on-device data should hover at 200ms or less. For use cases that require the integration of connected services, latencies greater than 950ms begin to make users uncomfortable. This is especially the case for voice-enabled apps, but less so for text and image apps.

Near-Zero Latency

The term near-zero latency is subject to a number of variables, but my definition is no more than 1.5 times the Doherty threshold, i.e., approximately 600ms, with 500ms as an ideal target.
We can achieve near-zero latency (~500 to ~600ms) using an approach that remains generally consistent regardless of connectivity performance.
Where data or content external to the device is necessary, integration developers typically assume mobile apps should communicate directly with REST APIs. When milliseconds matter, I avoid this pattern in a few ways.

First Principle

Mobile apps should communicate over sockets, not REST interfaces.
This is a general principle. Ideally, mobile apps should perform everything on-device to avoid latency issues. Sometimes, this is impractical.
Round-trip latencies over a WebSockets infrastructure, measured as the time required to publish a message on one connection and receive it on another connection, typically range from ~5ms to ~200ms, with a global median of ~46ms. This is commonly my experience with the sockets infrastructure I use; PubNub and other real-time networks are also capable of this performance.
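For illustration, here is a minimal sketch of that measurement in the browser, assuming a real-time network that broadcasts a message published on one connection to subscribers on another. The relay URL, channel name, and sample count are placeholders, not the infrastructure used in this project.

```ts
// Minimal round-trip measurement: publish on one connection, receive on another.
// The relay URL is a placeholder; substitute your real-time network's endpoint.
const RELAY_URL = "wss://relay.example.com/channel/latency-test";

const receiver = new WebSocket(RELAY_URL);
const publisher = new WebSocket(RELAY_URL);

function measureRoundTrip(): Promise<number> {
  return new Promise((resolve) => {
    const sentAt = performance.now();
    receiver.onmessage = () => resolve(performance.now() - sentAt);
    publisher.send(JSON.stringify({ ping: sentAt }));
  });
}

// Wait for both sockets to open, then sample the round trip a few times.
Promise.all([
  new Promise((r) => (receiver.onopen = r)),
  new Promise((r) => (publisher.onopen = r)),
]).then(async () => {
  const samples: number[] = [];
  for (let i = 0; i < 10; i++) samples.push(await measureRoundTrip());
  // Approximate median of the sampled round trips.
  console.log("median round trip (ms):", samples.sort((a, b) => a - b)[5]);
});
```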
The global median will obviously vary depending on your location, proximity to data centers, the quality of your connection, and so on. But, on average, the overhead of instructing a high-throughput server to then reach out to Groq and return the inference results is about 50ms.
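To make that flow concrete, the sketch below shows the general shape of such a server-side relay, assuming a Node process using the `ws` and `groq-sdk` packages. The port, model name, and message format are illustrative, not the production setup.

```ts
// Illustrative server-side relay: receive an instruction over a socket,
// forward it to Groq, and publish the inference result back to the caller.
import { WebSocketServer } from "ws";
import Groq from "groq-sdk";

const groq = new Groq({ apiKey: process.env.GROQ_API_KEY });
const wss = new WebSocketServer({ port: 8080 });

wss.on("connection", (socket) => {
  socket.on("message", async (raw) => {
    const { prompt } = JSON.parse(raw.toString());
    const started = Date.now();

    // One round trip to Groq; the model name is an illustrative choice.
    const completion = await groq.chat.completions.create({
      model: "llama3-8b-8192",
      messages: [{ role: "user", content: prompt }],
    });

    socket.send(
      JSON.stringify({
        text: completion.choices[0]?.message?.content ?? "",
        elapsedMs: Date.now() - started,
      })
    );
  });
});
```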
[Screenshot: CleanShot 2024-04-02 at 03.47.18@2x.png]
Now the question becomes: how fast can Groq perform the requested inference on behalf of the sockets-layer instruction?

Practical Test Without Web Sockets

To simulate the worst possible conditions, we'll first test this process without the benefit of the real-time WebSockets layer, over a weak LTE signal. To achieve this, I've configured my Mac to use a Gen 2 iPad's connection with two bars of signal.
Let’s examine the performance of a simple test given a topic such as Kansas City, MO and a query field that prompts Groq for an inference result that acts as an autocomplete feature.
The goal of this test harness is to generate three possible query completions in near-real-time.
[Screenshot: CleanShot 2024-04-02 at 04.14.50@2x.png]
The generative AI prompt is typical and constructed as shown in the snapshot; a rough sketch of a comparable template follows the variable list below.
It includes three variables:
Number of completion options to generate (3)
Topic (Kansas City, MO)
Query (typed by the user)
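The exact wording lives in that snapshot, so the template below is only a hedged approximation built from those three variables; the function name and phrasing are mine, not the production prompt.

```ts
// Approximate prompt template; the production wording in the screenshot differs.
function buildPrompt(optionCount: number, topic: string, query: string): string {
  return [
    `You are an autocomplete engine for queries about ${topic}.`,
    `The user has typed so far: "${query}"`,
    `Return exactly ${optionCount} short, distinct completions of that query,`,
    `one per line, with no numbering or commentary.`,
  ].join("\n");
}

// Example: buildPrompt(3, "Kansas City, MO", "best barbecue near ")
```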
As each word is completed in the query field, an event handler triggering on the space bar dispatches a direct call to Groq using its JavaScript SDK.
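A minimal sketch of that handler follows, assuming the `groq-sdk` browser client; the element id, API key handling, model name, and the `buildPrompt`/`renderMindMap` helpers (sketched above and below) are illustrative placeholders.

```ts
// Space-bar handler that dispatches a direct completion request to Groq.
import Groq from "groq-sdk";

// Exposing a key in the browser is acceptable only for a throwaway test harness;
// the SDK's dangerouslyAllowBrowser flag permits in-browser use for tests like this.
const groq = new Groq({ apiKey: "<test-key>", dangerouslyAllowBrowser: true });

const queryField = document.getElementById("query") as HTMLInputElement;

queryField.addEventListener("keyup", async (event) => {
  if (event.key !== " ") return; // fire only when a word is completed

  const started = performance.now();
  const response = await groq.chat.completions.create({
    model: "llama3-8b-8192", // illustrative model choice
    messages: [
      { role: "user", content: buildPrompt(3, "Kansas City, MO", queryField.value.trim()) },
    ],
  });
  const elapsedMs = performance.now() - started;

  const completions = (response.choices[0]?.message?.content ?? "")
    .split("\n")
    .filter(Boolean)
    .slice(0, 3);

  renderMindMap(completions, elapsedMs); // rendering and averaging sketched below
});
```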
An added complexity requires the query completions to be rendered on a canvas as a mindmap.
The average elapsed time is pushed into a field displayed at the bottom of the web app.
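Both the canvas rendering and the latency field are sketched below; the radial layout stands in for the actual mind-map component, and the element ids are placeholders.

```ts
// Draw the completions around the topic on a canvas and keep a running average latency.
const canvas = document.getElementById("mindmap") as HTMLCanvasElement;
const latencyField = document.getElementById("avg-latency") as HTMLElement;
const samples: number[] = [];

function renderMindMap(completions: string[], elapsedMs: number): void {
  const ctx = canvas.getContext("2d")!;
  ctx.clearRect(0, 0, canvas.width, canvas.height);

  const cx = canvas.width / 2;
  const cy = canvas.height / 2;

  // Center node: the topic.
  ctx.textAlign = "center";
  ctx.fillText("Kansas City, MO", cx, cy);

  // One branch per completion, placed radially around the center.
  completions.forEach((text, i) => {
    const angle = (i / completions.length) * 2 * Math.PI;
    const x = cx + Math.cos(angle) * 150;
    const y = cy + Math.sin(angle) * 150;
    ctx.beginPath();
    ctx.moveTo(cx, cy);
    ctx.lineTo(x, y);
    ctx.stroke();
    ctx.fillText(text, x, y);
  });

  // Running average of elapsed inference time, pushed into the display field.
  samples.push(elapsedMs);
  const avg = samples.reduce((a, b) => a + b, 0) / samples.length;
  latencyField.textContent = `${Math.round(avg)} ms`;
}
```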
[Screenshots: CleanShot 2024-04-02 at 04.03.00@2x.png and CleanShot 2024-04-02 at 03.53.52@2x.png (click to enlarge)]
[Video: CleanShot April 2.mp4]
This video demonstrates several inference outcomes. The observations show that even without the benefit of a sockets layer, the performance achieves near-zero latency (defined earlier as ~500 to ~600ms), with an average combined latency of ~573 milliseconds. Most users will perceive this as a real-time app.