Envision Resiliency Systems

The Envision backend infrastructure will be supplemented with four systems to increase its resiliency and the traceability of job issues. The goal is that users are never blocked from creating jobs by a fault in our system, and that we can later debug why the fault occurred and improve the system.
These four systems are as follows:
/Redundancy for job status event monitoring services:
Redundancy is added for the services that update Kubernetes job status in the Firestore database.
Two services publish job status change events from Kubernetes to the Pub/Sub topic for GKE job events:
k8-event-streamer service
log sink
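
Because two services publish the same status changes, the subscriber that writes to Firestore may see each change twice. The sketch below shows one way to make such updates idempotent. It is an illustration under assumptions, not the actual implementation: the status names, their ordering, the `apply_status_event` function, and the in-memory `store` dict standing in for Firestore are all hypothetical.

```python
# Hypothetical status ordering; the real set of job statuses is not given in this document.
STATUS_RANK = {"PENDING": 0, "RUNNING": 1, "SUCCEEDED": 2, "FAILED": 2}


def apply_status_event(store: dict, job_id: str, status: str) -> bool:
    """Apply a status event to `store`; return True only if the stored status changed.

    `store` stands in for the Firestore collection (assumption: job_id -> status).
    """
    current = store.get(job_id)
    if current is not None and STATUS_RANK[status] <= STATUS_RANK[current]:
        # Duplicate from the second publisher, or an older event arriving late.
        return False
    store[job_id] = status
    return True
```

With this rule, the second copy of an event is a no-op, and a terminal status is never overwritten by an earlier one that arrives late.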

/Pub/Sub Dead Letter Topic:
It is used when a message in a Pub/Sub topic repeatedly fails to be delivered to its subscriber and exhausts all of its retries.
When that happens, Pub/Sub publishes the message to the dead letter topic.
The subscriber to the dead letter topic handles any cleanup and logging required for that message, such as failing the job, saving its logs in the database, and notifying the development team about the failure.
If Pub/Sub as a whole fails and the dead letter topic cannot handle the failure, the Failure Recovery System handles that scenario.
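
The dead letter subscriber's responsibilities described above can be sketched as a single handler. This is a minimal illustration, not the production code: the message shape (`job_id`, `error`), the `InMemoryJobStore` class standing in for the database, and the `notify` callable are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryJobStore:
    """Stand-in for the database (assumption: one status and one log list per job)."""
    status: dict = field(default_factory=dict)
    logs: dict = field(default_factory=dict)


def handle_dead_letter(message: dict, store: InMemoryJobStore, notify) -> None:
    """Clean up after a message that exhausted its delivery retries.

    `message` is assumed to carry a `job_id` and an optional `error` string.
    `notify` is any callable that alerts the development team.
    """
    job_id = message["job_id"]
    reason = message.get("error", "delivery retries exhausted")
    store.status[job_id] = "FAILED"  # fail the job so the user is unblocked
    store.logs.setdefault(job_id, []).append(f"dead-lettered: {reason}")
    notify(f"Job {job_id} failed: {reason}")
```

Passing `notify` in as a parameter keeps the handler independent of the alerting channel, so it can be exercised with a plain list's `append` in tests.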

/Failure Recovery System:
The Failure Recovery System is a separate, isolated system that periodically scans for jobs stuck in a status that blocks users from continuing to use our app.
It handles all scenarios where our infrastructure or any of its components fails (backend, face detection, Pub/Sub, event streamer, and log sink).
It is a cron job that runs every hour, applies a defined set of rules to find stuck jobs, handles those job failures, and unblocks the user.
The rules that check for failed jobs are as follows:
Is the job stuck on the backend side?
Is the job stuck on face detection?
Is the job stuck on Kubernetes?
  If this is due to high load, notify the user about increased waiting times and let the job complete.
  If the job no longer exists on the cluster because of a Kubernetes failure, move to the next step.
Check the job's outputs in the storage bucket.
  If Kubernetes does not have the job status, there are two possible reasons:
    The job failed on Kubernetes, and Kubernetes removed its entry after the retention time for job data passed.
      In this case Kubernetes returns an unspecified status for the job.
      The storage bucket has no output directory, so the system marks the job as failed and notifies the user.
    The job completed, but the job status update mechanism failed to update the status in the database, and the job entry was removed from Kubernetes after the retention time for job data passed.
      In this case too, Kubernetes returns an unspecified status for the job.
      The storage bucket has an existing output directory, which means the failure was in the Kubernetes job status update services. The system marks the job as completed and notifies the user.
As a last step, the system reports any job failures to a specified Slack channel so that the development team is notified and can debug why the specified component failed.
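
The Kubernetes branch of the rules above reduces to a small decision function. The sketch below is illustrative only: the function name, its parameters, and the returned action strings are hypothetical, and it assumes that a job still present on the cluster is waiting because of high load.

```python
def resolve_kubernetes_stuck_job(job_on_cluster: bool, output_dir_exists: bool) -> str:
    """Decide what to do with a job that looks stuck on Kubernetes.

    job_on_cluster: whether Kubernetes still returns a status for the job.
    output_dir_exists: whether the job's output directory exists in the storage bucket.
    """
    if job_on_cluster:
        # Assumption: the job is only delayed by high load, so let it finish.
        return "notify_increased_wait"
    if output_dir_exists:
        # Outputs exist, so the job finished and only the status update was lost.
        return "mark_completed"
    # No entry on the cluster and no outputs: treat the job as failed.
    return "mark_failed"
```

For example, `resolve_kubernetes_stuck_job(False, True)` covers the case where the job finished but its status never reached the database.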
/Job Logs/Phase Tracing Mechanism:
Before completion, a job passes through many steps in our system.
We built a mechanism that saves a log entry in the database for every step a job passes through.
Through this, we can see at which step a job failed.
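
A minimal sketch of such a phase log follows, assuming one record per step with a success flag. The function names, the record fields, and the plain list standing in for the database table are hypothetical and chosen only for illustration.

```python
from typing import Optional


def record_phase(phase_log: list, job_id: str, phase: str, ok: bool) -> None:
    """Append one entry for a step the job passed through (list stands in for the database)."""
    phase_log.append({"job_id": job_id, "phase": phase, "ok": ok})


def failed_phase(phase_log: list, job_id: str) -> Optional[str]:
    """Return the first phase that failed for `job_id`, or None if no phase failed."""
    for entry in phase_log:
        if entry["job_id"] == job_id and not entry["ok"]:
            return entry["phase"]
    return None
```

Looking up `failed_phase(phase_log, job_id)` then answers the debugging question directly: which step the job was in when it failed.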
