Envision Resiliency Systems

The Envision backend infrastructure will be supplemented with four systems to increase its resiliency and the traceability of job issues. The goal is to ensure that users are not blocked from creating jobs by a fault in our system, and that we can later debug why the fault occurred and improve the system accordingly.

These four systems are as follows:

/Redundancy for job status event monitoring services:

Redundancy is added for the services that update the Kubernetes job status in the Firestore database.

Two services publish job status change events from Kubernetes to the Pub/Sub topic for GKE job events:

k8-event-streamer service

Log sink
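Because two publishers report the same status changes, the consumer that writes to Firestore will likely see each change more than once. The following is a minimal sketch of an idempotent status update, assuming each event carries a job ID and a status. The status names, the `STATUS_RANK` ordering, the `JobStatusEvent` type, and the in-memory `store` dictionary standing in for Firestore are all hypothetical and not taken from the actual system.

```python
from dataclasses import dataclass

# Hypothetical status ordering; the real status names are an assumption.
# Terminal statuses share the highest rank so they are never overwritten.
STATUS_RANK = {"PENDING": 0, "RUNNING": 1, "SUCCEEDED": 2, "FAILED": 2}


@dataclass(frozen=True)
class JobStatusEvent:
    job_id: str
    status: str
    source: str  # which publisher sent the event, e.g. "k8-event-streamer"


def apply_event(store: dict, event: JobStatusEvent) -> bool:
    """Apply an event to the store; return True only if the status advanced."""
    current = store.get(event.job_id)
    if current is not None and STATUS_RANK[event.status] <= STATUS_RANK[current]:
        # Duplicate or stale event from the redundant publisher: ignore it.
        return False
    store[event.job_id] = event.status
    return True
```

With this shape, a duplicate event from the second publisher is a no-op, and a late event cannot move a job backwards from a terminal status.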

/Pub/Sub Dead Letter Topic:

The dead letter topic is used when a message in a Pub/Sub topic repeatedly fails to be delivered to its subscriber and exhausts all of its retries.

When this happens, Pub/Sub publishes the message to the dead letter topic.

The subscriber to the dead letter topic handles any cleanup and logging required for that message, such as failing the job, saving its logs in the database, and notifying the development teams about the failure.

If Pub/Sub as a whole fails and the dead letter topic cannot handle the failure, the Failure Recovery System handles that scenario.
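The dead letter subscriber's responsibilities can be sketched as a single handler. This is only an illustration, assuming the message payload is JSON containing a `job_id` field. The function name `handle_dead_letter` and the injected `fail_job`, `save_log`, and `notify` callbacks are hypothetical stand-ins for the real database and notification code.

```python
import json
from typing import Callable


def handle_dead_letter(
    data: bytes,
    fail_job: Callable[[str], None],
    save_log: Callable[[str, str], None],
    notify: Callable[[str], None],
) -> str:
    """Clean up after a message that exhausted its retries; return the job ID."""
    # Assumption: the payload is UTF-8 JSON with a "job_id" key.
    payload = json.loads(data.decode("utf-8"))
    job_id = payload["job_id"]

    fail_job(job_id)  # mark the job as failed so the user is unblocked
    save_log(job_id, f"dead-lettered message: {payload}")  # keep a trace for debugging
    notify(f"Job {job_id} failed: message exhausted its Pub/Sub retries")
    return job_id
```

Passing the side effects in as callbacks keeps the handler easy to test without a live Pub/Sub subscription or database.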

/Failure Recovery System:

The Failure Recovery System is a separate, isolated system that periodically scans for jobs stuck in a status that blocks users from continuing to use our app.

It handles all scenarios where our infrastructure or any of its components fails (backend, face detection, Pub/Sub, event streamer, and log sink).

It is a cron job that runs every hour, applies a defined set of rules to identify stuck jobs, handles these job failures, and unblocks the user.

The rules that check for failed jobs are as follows:

Is the job stuck on the backend side?

Is the job stuck on face detection?

Is the job stuck on Kubernetes?

If the job is delayed due to high load, notify the user about increased waiting times and let the job complete.

If the job no longer exists on the cluster due to a Kubernetes failure, move to the next step.

Check the outputs in the storage bucket for the job.

If Kubernetes does not have the job status, there are two possible reasons:

The job failed on Kubernetes, and Kubernetes removed its entry after the retention period for job data passed.

In this case, Kubernetes returns an unspecified status for the job.

The storage bucket contains no output directory for the job, so the system marks the job as failed and notifies the user.

The job completed, but the job status update mechanism failed to update the status in the database, and the job entry was removed from Kubernetes after the retention period for job data passed.

In this case too, Kubernetes returns an unspecified status for the job.

The storage bucket contains an output directory for the job, which means the failure was in the Kubernetes job status update services. The system marks the job as completed and notifies the user.

As a last step, the system reports any job failures to a specified Slack channel so that the development team is notified and can debug why the specific component failed.
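The Kubernetes rules above can be sketched as a small decision function. This is a simplified illustration, not the actual cron job: it assumes an unspecified Kubernetes status is represented as `None`, and that any job still known to the cluster is left to finish. The `Resolution` enum and the `resolve_stuck_job` function are hypothetical names.

```python
from enum import Enum
from typing import Optional


class Resolution(Enum):
    LEAVE_RUNNING = "leave_running"    # job still on the cluster; let it complete
    MARK_FAILED = "mark_failed"        # no entry on the cluster and no outputs
    MARK_COMPLETED = "mark_completed"  # no entry on the cluster but outputs exist


def resolve_stuck_job(k8s_status: Optional[str], output_dir_exists: bool) -> Resolution:
    """Decide how to handle a job that appears stuck on Kubernetes."""
    if k8s_status is not None:
        # Kubernetes still tracks the job (for example, it is delayed by high load).
        return Resolution.LEAVE_RUNNING
    # Kubernetes has no status for the job, so the storage bucket is the source of truth.
    if output_dir_exists:
        # Outputs were written: the job completed and only the status update was lost.
        return Resolution.MARK_COMPLETED
    # No outputs: the job failed before its entry was removed from the cluster.
    return Resolution.MARK_FAILED
```

Keeping the decision separate from the Kubernetes and storage lookups means the rules can be unit tested without access to a cluster or bucket.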

/Job Logs/Phase Tracing Mechanism:

Before completion, a job passes through many steps in our system.

We built a mechanism that saves a log entry in the database for every step a job passes through.

Through this, we can see at which step a job failed.
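A minimal sketch of such a phase trace follows. The phase names, the `JobTrace` class, and the in-memory `entries` list (standing in for the database) are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class JobTrace:
    job_id: str
    # In the real system these entries would be saved in the database.
    entries: list = field(default_factory=list)

    def record(self, phase: str, ok: bool, message: str = "") -> None:
        """Append one log entry for a step the job has passed through."""
        self.entries.append({"phase": phase, "ok": ok, "message": message})

    def failed_phase(self) -> Optional[str]:
        """Return the first phase that failed, or None if every phase succeeded."""
        for entry in self.entries:
            if not entry["ok"]:
                return entry["phase"]
        return None
```

For example, recording a successful hypothetical `backend` phase followed by a failed `face_detection` phase makes `failed_phase()` return `"face_detection"`.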
