Interview Guide

System design section


Intro

This section is not required for junior and mid-level roles, but you may ask the candidate whether they want to take it.
Why this section is important: it is all about decentralizing the decision-making process and selecting people for the team who are able to take responsibility for parts of the product and think holistically. And yes, those abilities are not always connected to algorithmic thinking or to the ability to turn time into code at the highest rate. If you discover such abilities but the candidate gets an unsatisfactory score in the algorithmic section, you can still recommend them to a different department.
The main stages of the system-design section are depicted below. The interview typically takes 2 hours, and the actual duration depends entirely on how well you are prepared as an interviewer. Remember that managing the time constraints is your responsibility, but share them with the candidate, and add points if the candidate tracks the time themselves. Cut corners, reduce the scope, guide the candidate, and stop them from digging into unnecessary detail.
image.png

1. Problem statement

Common sense

First, the interviewer names the task, gives additional context, and formulates the main functional and non-functional requirements. In the following steps, the candidate takes control and leads the interview (if they can).
The interviewer narrates the task and at the same time writes the requirements on the board, so that they are always in front of the candidate and it is easy to go back and check whether the solution meets the expectations. After the requirements have been explained, the ball goes to the candidate, who can clarify anything that remains unclear about the task. For example, they can ask about the planned load on the system, as this often has a dramatic effect on the conceptual design. It is usually good to capture the answers next to the initial requirements, because we'll need them later on.
It’s important to give enough context here, and not simply ask for “the best practices” or a “design based on your experience”. That shows the candidate either how chaotic you are (do you want to work in chaos yourself?) or how unprepared you are (you don’t value the candidate’s time). In return, you won’t get what you want, which will frustrate you and push you toward a flawed decision. With no guide rails (requirements), candidates tend either to simplify the design (quite fairly) or to focus on the topics they know best, which consumes all the interview time and also pushes you toward a flawed decision.
The notation doesn’t matter: the candidate is allowed to use any notation, and to combine several if needed, but should justify the choice.
I personally suggest that the candidate use a service that integrates with the Teams call and can be set up quickly during the call. Use only its software-sketching tool set. This is my own experience rather than a recommendation.
image.png

Example interview

This article considers a typical “build a YouTube” exercise, which highlights some often-forgotten high-load specifics.
image.png
# Problem statement
Design an application that allows content creators to upload a video and everyone else to view that video.
## Functional requirements
* The system should allow channel creators to upload videos as quickly as possible
* Videos should be published to the feeds of channel subscribers
* Viewers should be able to change the quality of the video when watching it
## Non-functional requirements (architectural characteristics)
* The system must be highly available
* The system must be scalable and fault-tolerant
* We must ensure that the cost of the service infrastructure is as low as possible

2. Formalize the task

Common sense

The second stage is formalization: the candidate asks the interviewer questions to clarify exactly what the system should be able to do, and identifies the key architectural characteristics of the system, such as high availability, data consistency, high throughput, scalability, auditability, and other -ilities.
During formalization, the candidate gathers the missing requirements, and the interviewer gladly answers. These are functional requirements, in the form of scenarios that the system should implement, and non-functional requirements, or architectural characteristics. Along the way, the candidate should determine the relative importance of these requirements: not all functional requirements are equally important, and often the most complex of them needs to be worked out most thoroughly. The same applies to architectural characteristics: sometimes they contradict each other, and we need to know which of them should be prioritized, even to the detriment of others.
In addition, the candidate needs to ask questions that allow them to understand the load on the system: how many users there will be, what the load is (number of requests, RPS, RPM), how much data will have to be stored, and so on. It is valuable not just to ask these questions in a row, but to select them based on the context of the task. Having asked some questions, it is useful to formulate hypotheses about the specific numbers that are essential for the system. We will need this data to work out the system sizing.
Here the interviewer should remember some reference data themselves and share it with the candidate if needed.
The “back-of-the-envelope estimation”:
2⁰ = 1 byte
2¹⁰ ≈ 10³ bytes, 1 Kilobyte, 1 KB
2²⁰ ≈ 10⁶ bytes, 1 Megabyte, 1 MB
2³⁰ ≈ 10⁹ bytes, 1 Gigabyte, 1 GB
2⁴⁰ ≈ 10¹² bytes, 1 Terabyte, 1 TB
2⁵⁰ ≈ 10¹⁵ bytes, 1 Petabyte, 1 PB
2⁶⁰ ≈ 10¹⁸ bytes, 1 Exabyte, 1 EB
Latency numbers everyone should know:
image.png

Availability numbers:
SLA=99%, unavailability 1%, 14.4 min per day, 3.65 days per year
SLA=99.9%, unavailability 0.1%, 1.44 min per day, 0.365 days per year
SLA=99.99%, unavailability 0.01%, 8.64 sec per day, 52.6 min per year
SLA=99.999%, unavailability 0.001%, 0.864 sec per day, 5.26 min per year
SLA=99.9999%, unavailability 0.0001%, 0.0864 sec per day, 31.56 sec per year
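The availability figures above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, assuming a 365-day year (so the yearly values may differ from the table in the last digit); the function name `allowed_downtime` is illustrative and not part of any library.

```python
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_YEAR = 365  # assumption: leap years are ignored


def allowed_downtime(sla_percent: float) -> tuple[float, float]:
    """Return the allowed downtime as (seconds per day, seconds per year)."""
    unavailability = (100.0 - sla_percent) / 100.0
    per_day = unavailability * SECONDS_PER_DAY
    per_year = per_day * DAYS_PER_YEAR
    return per_day, per_year


if __name__ == "__main__":
    for sla in (99.0, 99.9, 99.99, 99.999, 99.9999):
        per_day, per_year = allowed_downtime(sla)
        print(f"SLA={sla}%: {per_day:.4g} sec per day, {per_year / 60:.4g} min per year")
```

Having such a helper at hand lets the interviewer quickly check a candidate's claim for an SLA value that is not in the table.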
Help and guide the candidate if necessary, but let them lead the interview. Join the candidate, use the pronoun “we”, help them relax, but keep the structure.

Example interview

Here is a sample dialog: each question from the candidate is followed by the interviewer's answer on the next line.
Do we need to consider client authentication and authorization?
No, it's pretty standard here and out of scope for our case.
Can we use external services, for example, CDN for content distribution?
Yes, we can, if we justify why we need them and how we will interact with them. Specifically regarding CDN, I can say that creating a full-fledged CDN is a separate System Design task.
Will we dig deep into video transcoding?
No, we won't, as this is already quite a specific domain, and we don't have a goal to check your knowledge of it.
What resolutions should we convert video to?
Let's say 360, 480, 720, 1080.
Do we keep the original video after all conversions or not?
We will delete the original video after all conversions.
Should the video be available to subscribers when all resolutions are ready, or at least one?
Video will be made available when all resolutions are ready.
Are subscriber feeds generated in a smart way, or are they just the latest videos added by the authors we are subscribed to?
Let the feeds in our case be simple, sorted by time from the most recent to the oldest; we will receive the feed in pages of 10 items.
How many users do we expect our service to have?
The DAU (Daily Active Users) of our service will be 10 million.
Are our users geographically distributed?
Yes, they are located in different regions.
How many videos per day will our users watch?
On average, 10 videos per day.
How many videos will be uploaded per day?
On average, each user uploads 0.1 videos per day.
What is the average size of videos that will be uploaded to our service?
The average video size will be 300 MB.
How often will users request their video feed?
On average, 5 times a day.
How long do we store uploaded videos?
Storage time is not limited.
Now that we have collected the load requirements, we can roughly estimate the sizing of the system. This directly affects our approaches to its design. And that means we should do some basic calculations.
Number of video views per day: 10⁷ × 10 = 10⁸
Daily download traffic: 10⁸ × 300 MB = 30 PB
Download bandwidth: 30 PB / 86,400 (seconds in a day) ≈ 347 GB/s, or about 2.8 Tbit/s
Number of videos uploaded per day: 10⁷ × 0.1 = 10⁶
Number of videos uploaded per second: 10⁶ / 86,400 ≈ 12 rps
Number of feed requests per second: 5 × 10⁷ / 86,400 ≈ 580 rps
Daily upload traffic: 10⁶ × 300 MB = 300 TB
We need to store the original videos until the video is prepared in the required resolutions; after that, the original can be deleted. As a result, we need:
a relatively permanent buffer storage for original videos (let's take 1 PB as a reserve);
a constantly expandable storage for converted videos (let's assume that the converted videos take the same space as the originals, so this storage must grow at a rate of 300 TB per day).
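The same back-of-the-envelope arithmetic can be written down as a short script, which makes it easy to re-run when an input changes during the interview. This is a sketch under the assumptions gathered in the dialog, with the video size interpreted as 300 megabytes and decimal units (1 MB = 10⁶ bytes); all variable names are illustrative.

```python
SECONDS_PER_DAY = 86_400
MB, GB, TB, PB = 10**6, 10**9, 10**12, 10**15  # decimal units, in bytes

# Inputs taken from the sample dialog.
DAU = 10_000_000
VIEWS_PER_USER_PER_DAY = 10
UPLOADS_PER_USER_PER_DAY = 0.1
FEED_REQUESTS_PER_USER_PER_DAY = 5
AVG_VIDEO_SIZE = 300 * MB  # assumption: megabytes, not megabits


def estimate() -> dict[str, float]:
    """Compute the daily and per-second load figures for the example task."""
    views_per_day = DAU * VIEWS_PER_USER_PER_DAY
    uploads_per_day = DAU * UPLOADS_PER_USER_PER_DAY
    download_bytes_per_day = views_per_day * AVG_VIDEO_SIZE
    upload_bytes_per_day = uploads_per_day * AVG_VIDEO_SIZE
    return {
        "views_per_day": views_per_day,
        "download_pb_per_day": download_bytes_per_day / PB,
        "download_gb_per_sec": download_bytes_per_day / SECONDS_PER_DAY / GB,
        "uploads_per_day": uploads_per_day,
        "uploads_per_sec": uploads_per_day / SECONDS_PER_DAY,
        "feed_requests_per_sec": DAU * FEED_REQUESTS_PER_USER_PER_DAY / SECONDS_PER_DAY,
        "upload_tb_per_day": upload_bytes_per_day / TB,
    }


if __name__ == "__main__":
    for name, value in estimate().items():
        print(f"{name}: {value:,.1f}")
```

Note that bandwidth is reported in bytes per second here; multiply by 8 if the network capacity is quoted in bits per second.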

3. System boundaries

Common sense

The third stage is to start designing the system and at this stage it is important to define the boundaries of the system, what scenarios we should implement and what public APIs the system provides. At this stage, the system is a black box for us, but we already know what this black box promises to its users. In general, at this stage, we actually want to build a System Context Diagram from the C4 Model.
A System Context diagram provides a starting point, showing how the software system in scope fits into the world around it.
A candidate needs to be able to choose the right way for system components, external systems, and users to interact. In total, there are 4 integration options, described by Gregor Hohpe and Bobby Woolf in the book “Enterprise Integration Patterns”:
Files — integration via file uploads.
Database — integration through a common database.
API — integration via API (the most convenient way for most cases).
Messaging — integration via sending and receiving messages, including events as a subclass of messages.
Knowing these methods, we need to understand their pros and cons and choose the one that is most suitable for our case. If we select integration via API, we need to understand what approaches exist there, for example: REST (OpenAPI), RPC (gRPC, JSON-RPC, …), GraphQL, AsyncAPI, WebSockets…

Example interview

At first glance, the system in our case has the following boundaries:
API for uploading videos by the author of the channel
Notification of the author that the video is ready
Notification of channel subscribers about a new video
API for receiving a user's feed of videos based on channel subscriptions
Receiving a binary stream with a specific video
image.png
In this case, it is important for the candidate to note that the main write path is asynchronous: the video creator (VideoMaker) uploads a file, knowing that video processing will take some time. Next, both the video creator and the viewers subscribed to the creator's channel are notified when the video is ready. Another way to learn about a new video is for a viewer to scroll (request) their feed of available videos. Both the notifications and the feed should contain the URL of the video. A request to that URL is first handled by the pull CDN, which pulls the specific file from the blob storage in the system.
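To make the black-box view concrete, the boundary listed above can be written down as a set of interfaces. This is only a sketch: every name here (`VideoPlatform`, `Notifications`, `UploadTicket`, `FeedItem`, and all method and parameter names) is hypothetical and chosen for illustration, and a real design might equally express the same contract as REST endpoints or message schemas.

```python
from dataclasses import dataclass
from typing import Iterator, Protocol


@dataclass(frozen=True)
class UploadTicket:
    """Returned immediately by the upload API; processing continues asynchronously."""
    task_id: str


@dataclass(frozen=True)
class FeedItem:
    video_id: str
    url: str  # URL served through the pull CDN
    published_at: float  # Unix timestamp, used for newest-first ordering


class VideoPlatform(Protocol):
    """Inbound boundary: what the black box promises to creators and viewers."""

    def upload_video(self, channel_id: str, data: bytes) -> UploadTicket: ...

    def get_feed(self, user_id: str, offset: int = 0, limit: int = 10) -> list[FeedItem]: ...

    def stream_video(self, video_id: str, resolution: int) -> Iterator[bytes]: ...


class Notifications(Protocol):
    """Outbound boundary: the system notifies the author and the subscribers."""

    def video_ready(self, recipient_id: str, video_url: str) -> None: ...
```

The upload call returns a ticket rather than a finished video, which reflects the asynchronous write path; the readiness signal arrives later through the notification boundary.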

4. Define the main workflow

Common sense

The fourth stage is the most interesting, as it is where the system is iteratively designed. In this stage, it is important to work out the data flows: how a request comes into our system and what we do with it (save data to the database, put it in a queue or cache it, send it to /dev/null, and so on). As we work through the data flow, system components emerge that perform the functions required to implement the desired scenario.
The best strategy at this point is to start by working out the happy path of our system scenarios and then go back to the exceptional flows and try to account for them. Special attention should be paid to the read and write paths, because balancing them may have a drastic effect on the overall architecture. It is also important not to forget the non-functional requirements, such as the fact that we can't lose user requests and must complete them within a limited time.
At this stage it is important to work out the system sizing on a conceptual level, i.e., to show that the system can be scaled to the planned loads if necessary. For example, it is worth checking that the most loaded parts are horizontally scalable and that we know how to scale them.
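A conceptual sizing check for a horizontally scalable worker pool can be a one-line estimate based on Little's law: the average number of busy workers equals the task arrival rate multiplied by the average processing time. The sketch below assumes this model; the function name, the 70% target utilization, and the 60-second processing time in the usage example are illustrative assumptions, not figures from the task.

```python
import math


def workers_needed(
    arrival_rate_per_sec: float,
    seconds_per_task: float,
    target_utilization: float = 0.7,  # assumption: keep 30% headroom for bursts
) -> int:
    """Estimate the pool size needed to keep up with a steady task stream."""
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    # Little's law: average number of tasks in service = arrival rate * service time.
    busy_workers = arrival_rate_per_sec * seconds_per_task
    return math.ceil(busy_workers / target_utilization)


if __name__ == "__main__":
    # Hypothetical input: 12 tasks per second, 60 seconds of processing per task.
    print(workers_needed(arrival_rate_per_sec=12, seconds_per_task=60))
```

If the result is far beyond what a single machine can host, that part of the system has to scale horizontally, and the candidate should be able to explain how the tasks are distributed among the workers.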

Example interview

The easiest thing to do here is to take the core scenario and start from there. In this task, it is the video upload, where the creator of the video calls our upload API, passing a binary file. Our API should save the original video in blob storage, then save a task to convert the video to the required resolutions and return the task ID to the client.
Next, we need video conversion workers to process the tasks. They download the original video from blob storage, convert it, put the finished video into a separate blob storage, and then store the meta information about the finished video. After that, a message that the video is ready is published to a queue. This queue is handled by 2 different workers:
The first one is a notification worker, which sends notifications about video readiness to the author and their subscribers via the notification service. The second is a feed worker, which prepares the feeds for subscribers and puts them into the feed database.
Another scenario is retrieving the feed of videos from the channels to which the user is subscribed. This scenario starts with a call to the feed API, which makes a request to the feed database where user feeds are stored.
The final scenario is video viewing, which goes through a CDN that serves the video from its local cache or, if the video is not yet in the cache, fetches it from the blob storage where the converted videos are kept.
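The workflow described above can be traced end to end with a small in-memory model. This is a toy sketch, not a reference implementation: dictionaries stand in for blob storage and databases, deques stand in for queues, the "conversion" simply copies bytes, the task ID doubles as the video ID, and all class and method names are hypothetical.

```python
import itertools
from collections import defaultdict, deque
from dataclasses import dataclass, field

RESOLUTIONS = (360, 480, 720, 1080)


@dataclass
class VideoSystem:
    originals: dict = field(default_factory=dict)   # blob storage for originals
    converted: dict = field(default_factory=dict)   # blob storage for converted videos
    metadata: dict = field(default_factory=dict)    # meta information database
    conversion_queue: deque = field(default_factory=deque)
    ready_queue: deque = field(default_factory=deque)
    subscribers: dict = field(default_factory=lambda: defaultdict(list))
    feeds: dict = field(default_factory=lambda: defaultdict(list))  # feed database
    notifications: list = field(default_factory=list)
    _ids: itertools.count = field(default_factory=itertools.count)

    def upload(self, channel_id: str, data: bytes) -> str:
        """Upload API: store the original, enqueue a conversion task, return its ID."""
        task_id = f"task-{next(self._ids)}"
        self.originals[task_id] = data
        self.conversion_queue.append((task_id, channel_id))
        return task_id

    def run_conversion_worker(self) -> None:
        """Convert queued videos to every resolution, then announce readiness."""
        while self.conversion_queue:
            video_id, channel_id = self.conversion_queue.popleft()
            original = self.originals[video_id]
            for resolution in RESOLUTIONS:
                self.converted[(video_id, resolution)] = original  # placeholder for transcoding
            self.metadata[video_id] = {"channel": channel_id, "resolutions": RESOLUTIONS}
            del self.originals[video_id]  # the original is deleted after all conversions
            self.ready_queue.append((video_id, channel_id))

    def run_ready_workers(self) -> None:
        """Each 'video ready' message is handled by both workers."""
        while self.ready_queue:
            video_id, channel_id = self.ready_queue.popleft()
            # Notification worker: notify the author and the subscribers.
            for recipient in [channel_id, *self.subscribers[channel_id]]:
                self.notifications.append((recipient, video_id))
            # Feed worker: put the new video at the top of each subscriber's feed.
            for user_id in self.subscribers[channel_id]:
                self.feeds[user_id].insert(0, video_id)

    def get_feed(self, user_id: str, offset: int = 0, limit: int = 10) -> list:
        """Feed API: return one page of the precomputed feed, newest first."""
        return self.feeds[user_id][offset:offset + limit]


if __name__ == "__main__":
    system = VideoSystem()
    system.subscribers["channel-1"].append("alice")
    task_id = system.upload("channel-1", b"raw video bytes")
    system.run_conversion_worker()
    system.run_ready_workers()
    print(task_id, system.get_feed("alice"), system.notifications)
```

The model makes the asynchronous write path visible: the upload returns immediately with a task ID, and the video reaches the feed only after both worker stages have run.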
image.png