The section is a short survey in which the interviewer tests the candidate's knowledge of a broad set of topics related to computer systems architecture and systems engineering. The questionnaire includes, but is not limited to:
Linux
Networks
Database principles and concepts
Protocols
Cryptography
File systems
Questions may require either a one-word answer or a detailed narrative, which the interviewer converts into a score based on the candidate's depth of immersion in the topic. Note that the questions do not cover specific tools and technologies (Docker, Ansible, Terraform, CI/CD, etc.): we believe that an engineer with solid fundamentals will easily master any modern tool.
Troubleshooting section
1. Problem statement
The interview begins with a description of the interview rules. According to the legend, the candidate and the interviewer work together on an SRE team. The candidate plays the role of Lead, and the interviewer plays the role of Junior. The Lead leaves for a conference and the Junior stays on duty. Then an incident occurs, which they solve together: the Junior calls the Lead (the candidate) at the start of the incident and asks to work through it together.
Before the incident starts:
The interviewer presents the system architecture that our SRE team maintains: they show the architecture diagram and describe the system components and their interrelationships.
Next, the candidate asks questions about the system; unfortunately, candidates often skip the clarifying questions and go straight to the incident. This is similar to task formalization in the system design section.
2. Start of the incident
The incident starts, accompanied by some external effects which the interviewer communicates to the candidate. Then the ball moves to the candidate's side: they lead the interview, while the interviewer only answers questions and runs the commands given by the candidate. Note that both questions and commands should be as precise and specific as possible. Our interviewer is a Junior who cannot answer complex questions or run complex commands, and instead asks for guidance.
3. Formulating hypotheses and conducting experiments
The third step stretches almost to the end of the interview — it is a story about continuous diagnosis of problems by formulating hypotheses and conducting experiments. The goal of this step is to figure out what is wrong and fix the system.
4. Workaround
The fourth step comes when the candidate has accumulated enough information and understanding of the situation to propose a workaround: something that mitigates the issue for the system's end users while the team quietly digs into the problem further.
5. Fixing algorithm
The fifth step is the full-fledged fix, when we clearly understand that the issue is resolved and will not come back. In this step it is also important to describe what the repair algorithm looks like, because sometimes candidates manage to fix the system through chaotic moves, such as randomly toggling components on and off.
6. Root cause
At step 6, the candidate gets to the root cause of the problem. All the pieces of the puzzle are assembled into a complete picture, and they can explain why it all happened, why these symptoms were observed, and why the repair algorithm worked.
7. Long-term improvements
And the final step is about how to improve the system so that next time our Lead can rest easy away from their laptop. For this, it is critical that similar issues do not recur in the future, or at least that we learn about them quickly.
Example interview
The given context:
Several million customers every day.
The system is deployed in two data centers.
The application consists of two monolithic stateless components: a frontend application in React and a backend application in Python Django that provides APIs for the frontend.
Both applications are deployed in K8s as multiple instances.
The data layer consists of Postgres for persistent storage and Redis as a cache.
You can get some more details out of the image, but it's time to move on to the incident.
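The article does not specify how the backend uses Redis in front of Postgres, but a common arrangement for such a data layer is the cache-aside pattern. Below is a minimal Python sketch under that assumption; the table, key, and function names are illustrative, not the actual system's API.

```python
import json


def get_user(user_id, redis_client, db_conn, ttl_seconds=300):
    """Cache-aside read path (a sketch, not the real system's code):
    try Redis first, fall back to Postgres, then backfill the cache."""
    key = f"user:{user_id}"  # hypothetical key scheme

    cached = redis_client.get(key)
    if cached is not None:
        return json.loads(cached)  # cache hit

    # Cache miss: read from Postgres (psycopg-style cursor assumed).
    with db_conn.cursor() as cur:
        cur.execute("SELECT id, name FROM users WHERE id = %s", (user_id,))
        row = cur.fetchone()
    if row is None:
        return None

    user = {"id": row[0], "name": row[1]}
    # Backfill with a TTL so a stale entry eventually expires on its own.
    redis_client.set(key, json.dumps(user), ex=ttl_seconds)
    return user
```

The TTL matters during incidents: without it, a stale or poisoned cache entry can keep serving bad data long after the database is fixed.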
As per the scenario, the SRE Lead (our candidate) goes on vacation and the Junior (our interviewer) remains on duty. During the shift, tech support informs us of an increase in the number of customer requests (by phone or in the support chat) complaining about the speed of the site and pages that periodically fail to load. Our interviewer "calls" the candidate and asks for help with the incident.
Next, I will show how such a dialog could develop (interviewer's answers are italicized)
Do we have a system for log gathering, such as the ELK stack?
*Yes, we have one, but I'm not too sure how to navigate through it yet. What should I be looking at?*
Let's look at a visualization of the balancer logs, shall we?
*Opened the dashboard, what do I look for on it?*
Let's use the RED Method and look at the number of Requests, Errors, and Duration.
*I see that the number of requests has not increased, but the number of errors has grown and the average Duration of requests has increased.*
And what type of errors is prevalent?
*There are different types of errors coming in, but most of them are 504s.*
Hmm, 504 is Gateway Timeout. Let's take a look at the application logs...
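The RED check the candidate asks for can be sketched in a few lines of Python. The JSON log format with `status` and `duration_ms` fields is a hypothetical assumption (real balancer logs will differ); the point is only how Requests, Errors, and Duration are derived from balancer logs.

```python
import json


def red_metrics(log_lines):
    """Compute Requests, Errors, and average Duration from access-log lines.

    Assumes one JSON object per line with hypothetical "status" and
    "duration_ms" fields; counts 5xx responses as errors.
    """
    total = errors = 0
    duration_sum = 0.0
    by_status = {}
    for line in log_lines:
        entry = json.loads(line)
        total += 1
        duration_sum += entry["duration_ms"]
        status = entry["status"]
        by_status[status] = by_status.get(status, 0) + 1
        if status >= 500:
            errors += 1
    avg = duration_sum / total if total else 0.0
    return {"requests": total, "errors": errors,
            "avg_duration_ms": avg, "by_status": by_status}


# Example input mirroring the dialog: errors are up, and most are 504s.
lines = [
    '{"status": 200, "duration_ms": 120}',
    '{"status": 504, "duration_ms": 30000}',
    '{"status": 504, "duration_ms": 30000}',
]
print(red_metrics(lines))
```

In practice this aggregation would live in Kibana or a metrics system rather than a script, but the breakdown by status code is exactly what points the dialog toward the 504s.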
How we evaluate a candidate
We use six criteria for evaluation:
Candidate's horizon — it is important for us to assess the range of tools and approaches used to diagnose and repair the system
Logical and methodical search for a solution — it is important that the candidate goes through the task meaningfully, eliminating incorrect hypotheses and getting closer to fixing the system.
Whether a workaround solution was found to quickly mitigate user issues — this item allows us to assess how quickly the candidate has focused on and eliminated the impact on users
Whether a complete solution to the problem was found, and whether the candidate could formulate an algorithm for it. Here it is critical that the candidate did not stop at the quick fix from the previous point, but actually repaired the system and understood how they did it.
Did they uncover the root cause of the issue, which explains the original symptoms and why the problem-solving algorithm has worked: here it is essential that the candidate understood what the issue was, i.e., took the issue to its logical conclusion
Have they suggested ways to improve the system so that similar issues do not recur in the future, or at least we learn about them immediately. We want the candidate to leave behind a more reliable system than the one in place before the incident, to paraphrase the Boy Scout rule.
This dialog continues throughout the interview and the candidate tries to go through all the steps above, and the interviewer helps them as much as possible.
Let's compare Troubleshooting and the System Design Interview
Troubleshooting
There's already a production system at the beginning
The main point is to firefight the incident
What is critical: how systematic the approach is to examining symptoms, finding a temporary fix, and then fully repairing the system.
Those who are close to the infrastructure and operation of production systems usually do quite well in this section.
System Design is a little different.
At the start, it's all about the requirements.
The essence is to design a system that is resilient to incidents.
What is critical: how systematic the approach is to designing a solution, taking into account functional and non-functional requirements, and how robust the resulting system is.
Those who are close to designing and developing new functionality in complex systems usually pass it better.
What's kind of intriguing is that in an ideal world, an SRE should have both of these skills.
This is needed so that SREs can not just firefight but also work systematically on system reliability. But the real world is tough: SRE candidates typically passed System Design only at the Junior level, so we stopped giving the System Design section to SRE candidates and decided instead to teach it to our team, both developers and SREs. Becoming proficient in both areas takes both theory and practice.
In terms of Troubleshooting:
Theory involves
Learning the practices and approaches outlined in, for example, Google's SRE Book, SRE Workbook, and Building Secure & Reliable Systems
Learning the tools to apply these practices
And practice can be gained
Working in an SRE team on real systems
Handling public postmortems of large and complex systems
Practicing troubleshooting on tasks based on postmortems
System Design is slightly different:
The theory involves
Studying the principles of distributed systems design
Studying the main classes of solution designs, their strengths and limits of applicability
And practice can be gained
Working in the role of an architect on the design of real systems
Parsing the architecture of large complex systems (from Google, Meta,...)