[Incident 1] Website is down

📊 Metadata

Postmortem owner: @Eric Zhan
Collaborators: @Alan Chowansky, @Lola Tseudonym
Incident:
Incident Commander: @Eric Zhan
Severity: @Huge
Escalation Time: 2021-09-08, 00:00
Triage Time: 2021-09-08, 00:03 (+3 mins)
Mitigation Time: 2021-09-08, 00:10 (+10 mins)
Resolution Time: 2021-09-08, 00:10 (+10 mins)

📋 Summary

The entire website was inaccessible for 10 minutes.

💥 Impact

The production website was down for all users.

📜 Background

We use frontend proxy servers to connect users to our backends.

🔎 Root Cause

The frontend proxy servers were crashlooping due to a bad configuration push.

🎓 Lessons Learned

✅ What went well

The frontend proxy servers restart very quickly.

❎ What went wrong

The configuration push slipped through canary testing.
It took 7 minutes to roll back the bad configuration.
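
One way a badly formatted configuration could be caught before it reaches canary testing is a format check that runs before the change is merged. The sketch below is only illustrative: this postmortem does not state the configuration format, so it assumes the configuration is a JSON object, and the function name `validate_config_text` is hypothetical.

```python
import json
from typing import List


def validate_config_text(text: str) -> List[str]:
    """Return a list of problems found in a configuration document.

    Assumption: the configuration is a JSON object. The real format
    used by the frontend proxy servers is not stated in this postmortem.
    An empty list means no problems were found.
    """
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError as exc:
        # Report where parsing failed so the author can fix it quickly.
        return [f"not valid JSON: {exc.msg} (line {exc.lineno}, column {exc.colno})"]
    if not isinstance(parsed, dict):
        return ["top-level value must be an object"]
    return []
```

A pre-merge job could call this on each changed configuration file and block the change when the returned list is not empty.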

🍀 Where we got lucky

The oncall SRE was a domain expert in the frontend proxy servers and was able to identify the root cause.

📝 Action Items

Action Item | Type | Priority | Tracking
Roll back the bad configuration | Mitigate | P0 |
Fix tests | Prevent | P0 |
Add checks to make sure configurations committed to main are correctly formatted | Prevent | P0 |
Create a tool to automatically roll back configurations if servers are crashing | Mitigate | P1 |
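
The automatic rollback action item needs some rule for deciding that servers are crash-looping. The sketch below shows one possible rule, a sliding-window crash counter. The class name `CrashLoopDetector`, the threshold, and the window length are all hypothetical; it only makes the decision and does not perform the rollback itself.

```python
from collections import deque
from typing import Deque


class CrashLoopDetector:
    """Decides when to roll back, based on recent crash timestamps.

    Assumption: max_crashes and window_seconds are placeholder tuning
    values, not numbers taken from this incident.
    """

    def __init__(self, max_crashes: int = 3, window_seconds: float = 60.0) -> None:
        self.max_crashes = max_crashes
        self.window_seconds = window_seconds
        self._crashes: Deque[float] = deque()

    def record_crash(self, timestamp: float) -> bool:
        """Record a crash; return True if a rollback should be triggered."""
        self._crashes.append(timestamp)
        # Drop crashes that have fallen out of the sliding window.
        while self._crashes and timestamp - self._crashes[0] > self.window_seconds:
            self._crashes.popleft()
        return len(self._crashes) >= self.max_crashes
```

A monitoring loop could feed crash timestamps (in seconds) into `record_crash` and start the rollback of the most recent configuration push the first time it returns `True`.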

🕓 Timeline

2021-09-07
11:54 PM: Bad configuration was committed to main
2021-09-08
12:00 AM: Bad configuration was pushed, frontend proxy servers started crashing <INCIDENT BEGINS>
12:00 AM: Automated alerts fired
12:03 AM: Oncall SRE traced the crashes to the bad configuration push
12:10 AM: The bad configuration was rolled back and frontend proxy servers stopped crashing <INCIDENT ENDS>
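
The durations quoted in this postmortem can be checked against the timeline entries. The sketch below does that arithmetic; the dictionary keys and the helper `minutes_between` are hypothetical names introduced for illustration.

```python
from datetime import datetime

FMT = "%Y-%m-%d %I:%M %p"

# Timestamps copied from the timeline above.
timeline = {
    "committed": datetime.strptime("2021-09-07 11:54 PM", FMT),
    "incident_begins": datetime.strptime("2021-09-08 12:00 AM", FMT),
    "triaged": datetime.strptime("2021-09-08 12:03 AM", FMT),
    "incident_ends": datetime.strptime("2021-09-08 12:10 AM", FMT),
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes elapsed between two named timeline events."""
    delta = timeline[end] - timeline[start]
    return int(delta.total_seconds() // 60)
```

For example, `minutes_between("incident_begins", "triaged")` corresponds to the triage offset, and `minutes_between("triaged", "incident_ends")` to the time spent rolling back.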

🛠️ Operation Log

Pulled logs from crashing frontend proxy servers
Rolled back the configuration push made at 12:00 AM