[Incident 1] Website is down

📊 Metadata

Postmortem owner: @Eric Zhan

Collaborators: @Alan Chowansky, @Lola Tseudonym

Incident:

Incident Commander: @Eric Zhan

Severity: @Huge

Escalation Time: 2021-09-08, 00:00

Triage Time: 2021-09-08, 00:03 (3 mins)

Mitigation Time: 2021-09-08, 00:10 (10 mins)

Resolution Time: 2021-09-08, 00:10 (10 mins)

📋 Summary

The entire website was inaccessible for 10 minutes.

💥 Impact

The production website was down for all users.

📜 Background

We use frontend proxy servers to connect users to our backends.

🔎 Root Cause

The frontend proxy servers were crashlooping due to a bad configuration push.

🎓 Lessons Learned

✅ What went well

The frontend proxy servers restart very quickly.

❎ What went wrong

The configuration push slipped through canary testing.

It took 7 minutes to roll back the bad configuration.
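A pre-merge format check is one way a malformed configuration might be caught before it reaches canary testing. The sketch below is illustrative only: this postmortem does not describe the proxy configuration format, so the JSON format, the `listen_port` and `backends` keys, and the `validate_config` function are all assumptions.

```python
import json

# Hypothetical required keys; the real proxy config schema is not given in this postmortem.
REQUIRED_KEYS = {"listen_port", "backends"}


def validate_config(text: str) -> list[str]:
    """Return a list of problems found in a config; an empty list means it passed."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as err:
        return [f"not valid JSON: {err.msg} (line {err.lineno})"]
    if not isinstance(config, dict):
        return ["top-level value must be an object"]

    # Report missing keys in a stable order.
    problems = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - set(config))]

    if "listen_port" in config:
        port = config["listen_port"]
        is_int = isinstance(port, int) and not isinstance(port, bool)
        if not (is_int and 0 < port < 65536):
            problems.append("listen_port must be an integer between 1 and 65535")

    if "backends" in config:
        backends = config["backends"]
        if not (isinstance(backends, list) and backends):
            problems.append("backends must be a non-empty list")

    return problems
```

A check like this could run on every commit to main and block the merge whenever the returned list is non-empty.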

🍀 Where we got lucky

The oncall SRE was a domain expert in the frontend proxy servers and was able to identify the root cause.

📝 Action Items

Action Item | Type | Priority | Tracking
Roll back the bad configuration | Mitigate | |
Fix tests | Prevent | |
Add checks to make sure configurations committed to main are correctly formatted | Prevent | |
Create a tool to automatically roll back configurations if servers are crashing | Mitigate | |
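The automatic rollback action item could start from something like the following sketch. It only covers the decision logic: counting restarts in a sliding window and picking the previous configuration version. The class and function names, the thresholds, and the idea of a version history list are hypothetical; how restarts are observed and how a rollback is actually executed depend on tooling not described here.

```python
from collections import deque
from typing import Optional


class CrashLoopDetector:
    """Flags a crash loop when too many restarts occur within a sliding time window."""

    def __init__(self, max_restarts: int = 3, window_seconds: float = 60.0):
        # Thresholds are illustrative defaults, not values from this incident.
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self._restarts = deque()

    def record_restart(self, timestamp: float) -> bool:
        """Record one restart; return True if the server now looks crash-looping."""
        self._restarts.append(timestamp)
        # Drop restarts that fell out of the window.
        while self._restarts and timestamp - self._restarts[0] > self.window_seconds:
            self._restarts.popleft()
        return len(self._restarts) >= self.max_restarts


def version_to_roll_back_to(history: list[str], crash_looping: bool) -> Optional[str]:
    """Return the previous config version if a rollback is warranted, else None."""
    if crash_looping and len(history) >= 2:
        return history[-2]
    return None


detector = CrashLoopDetector(max_restarts=3, window_seconds=60)
looping = False
for restart_time in (0.0, 10.0, 20.0):  # three restarts within 20 seconds
    looping = detector.record_restart(restart_time)

print(version_to_roll_back_to(["config-v1", "config-v2"], looping))  # prints "config-v1"
```

Keeping the detection separate from the rollback itself makes it possible to test the thresholds without touching production servers.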

⁠

🕓 Timeline

2021-09-07

11:54 PM: Bad configuration was committed to main

2021-09-08

12:00 AM: Bad configuration was pushed, frontend proxy servers started crashing <INCIDENT BEGINS>

12:00 AM: Automated alerts fired

12:03 AM: Oncall SRE traced the crashes to the bad configuration push

12:10 AM: The bad configuration was rolled back and frontend proxy servers stopped crashing <INCIDENT ENDS>
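The durations quoted in this postmortem can be recomputed from the timeline as a sanity check. The snippet below is a small sketch that uses only the timestamps listed above; the variable and function names are illustrative.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"

# Timestamps taken from the timeline above, written in 24-hour form.
committed = datetime.strptime("2021-09-07 23:54", FMT)
pushed = datetime.strptime("2021-09-08 00:00", FMT)     # incident begins
triaged = datetime.strptime("2021-09-08 00:03", FMT)
mitigated = datetime.strptime("2021-09-08 00:10", FMT)  # incident ends


def minutes_between(start: datetime, end: datetime) -> int:
    """Whole minutes elapsed from start to end."""
    return int((end - start).total_seconds() // 60)


time_to_triage = minutes_between(pushed, triaged)       # 3
time_to_mitigate = minutes_between(pushed, mitigated)   # 10
rollback_duration = minutes_between(triaged, mitigated) # 7
commit_to_push = minutes_between(committed, pushed)     # 6

print(time_to_triage, time_to_mitigate, rollback_duration, commit_to_push)  # prints "3 10 7 6"
```

The results match the 3-minute triage time, the 10-minute outage, and the 7-minute rollback reported in the other sections.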

🛠️ Operation Log

⁠

Pulled logs from crashing frontend proxy servers

Rolled back the config push made at 12:00 AM

⁠