📊 Metadata
📋 Summary
The entire website was inaccessible for 10 minutes.
💥 Impact
The production website was down for all users for approximately 10 minutes.
📜 Background
We use frontend proxy servers to connect users to our backends.
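For context, a frontend proxy in this role is essentially a reverse proxy: it accepts user requests and forwards them to a backend. A minimal sketch in Go follows; the backend address is hypothetical, and a real deployment would load it from the pushed configuration rather than hardcoding it.

```go
// proxy.go — minimal sketch of a frontend proxy forwarding user
// traffic to a backend. Names and addresses are illustrative only.
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

func main() {
	// Hypothetical backend address; in production this would come
	// from the configuration that gets pushed to the fleet.
	backend, err := url.Parse("http://backend.internal:9000")
	if err != nil {
		log.Fatal(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(backend)
	log.Fatal(http.ListenAndServe(":8080", proxy))
}
```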
🔎 Root Cause
The frontend proxy servers were crashlooping due to a bad configuration push.
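To illustrate the crashloop mechanism: if a server validates its configuration at startup and exits on failure, a bad config push makes every restart fail, and the supervisor's restart policy turns that into a crashloop. The sketch below assumes a hypothetical JSON config file and schema; the real proxy's config format is not known.

```go
// startup.go — sketch of startup-time config validation. A bad push
// fails validation, the process exits non-zero, the supervisor
// restarts it, and the cycle repeats: a crashloop.
package main

import (
	"encoding/json"
	"log"
	"os"
)

// Config is a hypothetical schema for the pushed proxy configuration.
type Config struct {
	ListenAddr  string `json:"listen_addr"`
	BackendAddr string `json:"backend_addr"`
}

func main() {
	// Hypothetical config path.
	raw, err := os.ReadFile("/etc/frontend-proxy/config.json")
	if err != nil {
		log.Fatalf("cannot read config: %v", err)
	}
	var cfg Config
	if err := json.Unmarshal(raw, &cfg); err != nil {
		// The bad push fails here; the non-zero exit is what the
		// supervisor sees before restarting the process.
		log.Fatalf("invalid config: %v", err)
	}
	log.Printf("proxy starting on %s -> %s", cfg.ListenAddr, cfg.BackendAddr)
}
```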
🎓 Lessons Learned
✅ What went well
The frontend proxy servers restart very quickly.
❎ What went wrong
The configuration push slipped through canary testing (a sketch of such a gate follows below). It took 7 minutes to roll back the bad configuration.
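A canary gate of roughly the following shape might have caught the push before it reached the fleet. This is a hypothetical sketch, not our actual deployment tooling; every function name, host name, and duration here is an assumption.

```go
// canary.go — hypothetical canary gate: push a config to a small set
// of proxies first, soak, and abort the fleet-wide push if the
// canaries start crashing.
package main

import (
	"errors"
	"fmt"
	"time"
)

// pushConfig and crashCount are stand-ins for real deployment and
// monitoring APIs; both are assumptions.
func pushConfig(hosts []string, config string) {}

func crashCount(hosts []string) int { return 0 }

func canaryPush(canaries, fleet []string, config string) error {
	pushConfig(canaries, config)
	time.Sleep(5 * time.Minute) // soak period; duration is illustrative
	if crashCount(canaries) > 0 {
		return errors.New("canary proxies crashed; aborting fleet-wide push")
	}
	// Only reached if the canaries stayed healthy.
	pushConfig(fleet, config)
	return nil
}

func main() {
	err := canaryPush([]string{"proxy-canary-1"}, []string{"proxy-1", "proxy-2"}, "new-config")
	fmt.Println(err)
}
```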
🍀 Where we got lucky
The oncall SRE was a domain expert in frontend proxy servers and was able to identify the root cause quickly.
📝 Action Items
🕓 Timeline
2021-09-07
11:54 PM: Bad configuration was committed to main
2021-09-08
12:00 AM: Bad configuration was pushed; frontend proxy servers started crashing <INCIDENT BEGINS>
12:00 AM: Automated alerts fired
12:03 AM: Oncall SRE traced the crashes to the bad configuration push
12:10 AM: The bad configuration was rolled back and frontend proxy servers stopped crashing <INCIDENT ENDS>
🛠️ Operation Log
Pulled logs from crashing frontend proxy servers
Rolled back the config push at 12:10 AM