Background and Context

We want to scale the number of pods that run Browserless services based on some metrics. Currently, Browserless pods do not support autoscaling, which means that a fixed number of workflow workers carry out all Browserless tasks. Thus, when there are a lot of tasks, end-to-end latency tends to increase because tasks end up queued. Conversely, when there aren't many tasks, pods end up underutilized.

But first, let's take a step back and break down what's written above.

What is Browserless?

Browserless is a service that Coda uses for two things: pre-rendering documents and exporting PDFs (turning Coda docs into PDFs). The Browserless service runs on a fixed number of pods.

What are the metrics we are interested in?

We need metrics about the Browserless pods in order to scale them. At this point, any sort of metric can be helpful: queue length, CPU usage, latency, etc. Queue length seems the most promising. Luckily, Browserless gives us those metrics.
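To make the queue-length idea concrete, here is a minimal sketch of how a replica count could be derived from per-pod queue lengths. It uses the proportional rule documented for the Kubernetes Horizontal Pod Autoscaler (desired = ceil(current replicas × current metric / target metric)). The function name, the target queue length per pod, and the min/max bounds are all assumptions for illustration, not values chosen by this project.

```python
import math


def desired_replicas(queue_lengths, target_queue_per_pod,
                     min_replicas=1, max_replicas=10):
    """Return a replica count from per-pod queue lengths (hypothetical helper).

    queue_lengths: one queue length per currently running Browserless pod.
    target_queue_per_pod: assumed target average queue length per pod.
    """
    if not queue_lengths:
        raise ValueError("need at least one pod to compute an average")
    if target_queue_per_pod <= 0:
        raise ValueError("target must be positive")

    current_replicas = len(queue_lengths)
    average_queue = sum(queue_lengths) / current_replicas

    # HPA-style proportional rule: scale by the ratio of observed to target.
    desired = math.ceil(current_replicas * average_queue / target_queue_per_pod)

    # Clamp to the allowed range so we never scale to zero or without bound.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # Three pods with 6, 10 and 8 queued sessions, aiming for 4 per pod.
    print(desired_replicas([6, 10, 8], target_queue_per_pod=4))
```

With an average of 8 queued sessions against a target of 4, the rule doubles the pod count; with empty queues it falls back to the minimum.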

These are all the metrics we can get from Browserless:

here

How does Browserless work right now?

This is not very important for this project, but it gives more context about the problem we want to solve.

Browserless, as mentioned above, is used for two things: pre-rendering documents and exporting PDFs. When a user issues either of these tasks, the backend creates a new export workflow that is stored in Postgres. Using a queuing service, the workflow worker keeps checking whether a new workflow has been added. When it finds one, it runs Browserless and keeps running until the workflow is completed and the PDF is created. Once the workflow is complete, the workflow worker updates Postgres. It also uploads the PDF to S3 and stores the PDF's link in Postgres. An IsComplete flag is sent to the client to signal that the workflow is done, and the PDF link from S3 is shown in the browser.
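The flow above can be sketched as a single worker step. This is only an illustration under stated assumptions: the queue, Postgres, and S3 are replaced by in-memory stand-ins, and every name here (`ExportWorkflow`, `render_pdf`, `upload_pdf`, `run_worker_once`, the `exports/` key prefix) is hypothetical rather than taken from the real codebase.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExportWorkflow:
    """Hypothetical stand-in for the export workflow row kept in Postgres."""
    workflow_id: str
    doc_id: str
    is_complete: bool = False
    pdf_url: Optional[str] = None


def render_pdf(doc_id: str) -> bytes:
    """Placeholder for the Browserless session that renders the doc."""
    return f"PDF for {doc_id}".encode()


def upload_pdf(bucket: dict, workflow_id: str, pdf: bytes) -> str:
    """Placeholder for the S3 upload; returns the object key as the link."""
    key = f"exports/{workflow_id}.pdf"
    bucket[key] = pdf
    return key


def run_worker_once(queue: deque, db: dict, bucket: dict) -> Optional[ExportWorkflow]:
    """Process one queued workflow, or return None if the queue is empty."""
    if not queue:
        return None
    workflow = db[queue.popleft()]           # pick up the next workflow
    pdf = render_pdf(workflow.doc_id)        # run Browserless until the PDF exists
    workflow.pdf_url = upload_pdf(bucket, workflow.workflow_id, pdf)
    workflow.is_complete = True              # the flag the client waits for
    return workflow


if __name__ == "__main__":
    db = {"w1": ExportWorkflow("w1", "doc-1")}
    queue, bucket = deque(["w1"]), {}
    print(run_worker_once(queue, db, bucket))
```

The point of the sketch is that each workflow holds a worker and a Browserless session for the full duration of the render, which is why a fixed pool queues up under load.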

Here is a diagram of what's happening.

The diagram shows several components; some of them autoscale while others do not.

Frontend: CPU, HPA, easy to scale

Postgres: does not autoscale; scaling it doesn't make sense

Browserless: no autoscaling; needs to scale

Browserless currently allows 2 active sessions and a queue length of 20 sessions.

Workflow worker: no autoscaler; needs to scale

S3: scaled by AWS

Issues with this model:

No real scaling happens. A fixed number of pods run Browserless, and there are currently no optimizations.

These Browserless pods are all part of a Service. Our application opens a WebSocket to the Service, which (I believe) is randomly proxied to one of the pods. The problem is that one pod could be forwarded a bunch of Browserless sessions and end up queueing them while another Browserless pod sits totally idle, simply because of how the requests were randomly forwarded.
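A small simulation can illustrate the imbalance. This is a sketch under assumptions: it treats the limit of 2 active sessions as a per-pod limit, models the Service as picking a pod uniformly at random (the behavior the paragraph above only suspects), and includes a least-loaded assignment purely as a comparison baseline, not as something this project proposes. All function names are hypothetical.

```python
import random

MAX_ACTIVE_PER_POD = 2  # assumed to apply per pod


def assign_randomly(num_sessions, num_pods, rng):
    """Send each session to a uniformly random pod; return sessions per pod."""
    loads = [0] * num_pods
    for _ in range(num_sessions):
        loads[rng.randrange(num_pods)] += 1
    return loads


def assign_least_loaded(num_sessions, num_pods):
    """Baseline for comparison: always pick the pod with the fewest sessions."""
    loads = [0] * num_pods
    for _ in range(num_sessions):
        loads[loads.index(min(loads))] += 1
    return loads


def queued_sessions(loads):
    """Sessions that exceed the active limit on their pod and must wait."""
    return sum(max(0, load - MAX_ACTIVE_PER_POD) for load in loads)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    random_loads = assign_randomly(8, 4, rng)
    even_loads = assign_least_loaded(8, 4)
    print("random:      ", random_loads, "queued:", queued_sessions(random_loads))
    print("least loaded:", even_loads, "queued:", queued_sessions(even_loads))
```

With 8 sessions and 4 pods there is exactly enough capacity for nothing to queue, which the least-loaded baseline achieves; any uneven random assignment leaves some sessions queued while capacity elsewhere goes unused.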

This project tackles the first problem.
