
Introduction to the Data Science Assessment

The Data Science Assessment is a standardised test that measures core programming and data science knowledge. The Coding Score obtained from the test is a single measure that provides a quick overview of the test taker's skills: it reflects the test taker's coding and analytics skills as well as knowledge of probability theory, statistics, and machine learning.

The skills that the assessment aims to measure:

The Data Science Assessment is designed to measure the skills that are important for almost all data science roles. To achieve this, CodeSignal used the following sources of data:
the most common topics taught in data science programs across major universities in the US;
"An Introduction to Statistical Learning: with Applications in R" [1], used as a reference for adapting and formulating statistics/modelling task descriptions;
the most common topics covered during data science interviews at successful US-based companies;
overlapping topics from different data science certifications (such as the Microsoft Certification for Azure Data Scientist Associate [2], the Data Science Council of America (DASCA) Certification for Associate Big Data Analyst [3], and the Dell Technologies Certification for Associate Data Science and Big Data Analytics [4]).
Based on this research, the common expectations for all professional data scientists are the following:
Table 1

1. Basic coding
This covers the basic building blocks for data processing and modelling. Although data scientists are not software engineers, they should be able to write a simple program with ease to deal with data processing and modelling. Included topics: primitive data types (ints/floats/strings) and basic arithmetic/logical operations; loops and decision constructs; basic collections (arrays, lists, and dictionaries/sets); using libraries highly relevant to data science/analysis (e.g., Python's numpy/pandas/scikit-learn).

2. Query language
This covers the basic building blocks for analysing data. Regardless of whether SQL, a pandas DataFrame, or another tool is used, data scientists should be comfortable querying a dataset to gain insights. Included topics: using SQL or Python/R libraries such as pandas; basics (filtering, sorting, aggregate functions, if/case and string functions); subqueries (inner queries); joins (inner, left, right); window functions and window-specific aggregates.

3. Probability basics
This covers the basic building blocks for setting up a well-defined modelling problem. Without appropriate use of basic probability theory, it is difficult, if not impossible, to communicate clearly with other data scientists and engineers about what modelling problem one is trying to solve. Included topics: random variables, events, and probability distributions.

4. Statistics basics
This covers the basic building blocks for statistical modelling/learning and quantitative analysis. The first thing a data scientist needs to know about any data is its size, distribution, statistics such as mean/standard deviation, and skewness. In addition, given a hypothesis, one should be able to choose and apply one of the many well-established statistical tests in order to accept or reject that hypothesis. Data scientists should know these statistical basics by heart and be able to apply them appropriately. Included topics: mean/median/mode, standard deviation, z-score, p-value, and t-statistic.

5. Conditional probability and Bayes' Theorem
Conditional probability and Bayes' Theorem are crucial in evaluating models and comparing their performance. For instance, selection bias in sampling often leads data scientists to evaluate the accuracy or performance of the models they are testing incorrectly, which subsequently leads to unexpectedly underperforming models when rolled out in production.

6. Linear regression
One of the most powerful yet simple linear models for predicting a continuous, scalar response, and often the first step towards building more complex, accurate prediction models. Its importance cannot be overemphasised.

7. Logistic regression
One of the most widely used statistical models for classification. As with linear regression, it is often the first model data scientists use to analyse important features before moving on to more complex models.

8. Clustering algorithms
Unsupervised learning is essential for data scientists, and k-means clustering is the first thing to know. Included topics: k-means clustering.

9. Regularisation
Regularisation plays an important role in modelling, and linear models are no exception: for one, regularisation prevents models from overfitting. Different regularisation methods lead to different results in terms of feature selection and bias, and understanding their implications is important. Included topics: regularisation in linear regression and logistic regression.

10. Model evaluation
With a plethora of useful, easy-to-use libraries, one can easily train linear models, such as linear regression and logistic regression models, and produce stunning charts that demonstrate how accurate the models are. Yet it is critical to choose the right validation and error metrics depending on factors such as the skewness of the data and having too little or too much data. Without careful validation, one may reach wrong, biased conclusions, and the models will not perform as well in production as expected. Included topics: training/test error; validation methods such as k-fold cross-validation; and various evaluation metrics.

References

[1] G. James et al., An Introduction to Statistical Learning: with Applications in R. Springer, 2013.
[2] Microsoft. Certification for Azure Data Scientist Associate. [Online]. Available:
[3] The Data Science Council of America. Associate Big Data Analyst (ABDA) Certification Program. [Online]. Available:
[4] Dell Technologies. Associate Data Science and Big Data Exam. [Online]. Available:

DSA coding questions explained

Question #1

This task is designed to assess the test taker's familiarity with a programming language of the test taker’s choice. The supported languages are Python2, Python3 and R. The average time for solving this task should be approximately 20 minutes.

Expected knowledge:

Working with primitive data types;
Working with loops and decision constructs;
Working with basic collections (arrays, lists, dictionaries/sets).

Example:

Note: This task checks basic coding knowledge, particularly strings, sorting, for-loops, and decision constructs. There are many ways to solve this problem, and it only requires basic skills.

You are given a list of strings (data) where each string is of the form "$device_id,$usage_in_minutes" (quotes for clarity), such that $device_id contains exactly five lowercase English letters ('a'-'z') and $usage_in_minutes is a positive integer between 1 and 1,440, padded with leading zeroes if necessary to make its length equal to 4. For instance, "abxyz,0010" describes $device_id = "abxyz" and $usage_in_minutes = 10 minutes. Given data, return the $device_id with the largest value of $usage_in_minutes. You may assume that the $device_id values in data are distinct, as are the $usage_in_minutes values.

Sample solution in Python:

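A minimal sketch of one possible solution (the function name find_top_device is an assumption, not part of the task statement):

def find_top_device(data):
    # Each entry looks like "abxyz,0010": a five-letter device id,
    # a comma, and a zero-padded usage in minutes (1 to 1,440).
    best_id, best_usage = None, -1
    for entry in data:
        device_id, usage_str = entry.split(",")
        usage = int(usage_str)  # int() handles the leading zeroes
        if usage > best_usage:
            best_id, best_usage = device_id, usage
    return best_id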

Sample solution in R:


Question #2

This task is designed to assess the test taker’s ability to find minor bugs in a given code extract and is intended to be language-specific. The supported languages are Python2, Python3 and R. The average time for solving this task should be approximately 5 minutes.

Expected knowledge:

Working with primitive data types;
Working with loops and decision constructs;
Working with basic collections (arrays, lists, dictionaries/sets);
Lambda functions and list comprehensions.

Example:

Note: This task tests the understanding of lambda functions and array/list indices in Python. Although this example is language-specific, similar tasks can be created for other languages.

Consider a function that is given an array/list A of distinct integers and an integer k (where 1 ≤ k ≤ len(A)), and returns all possible subarrays of A obtained by removing k contiguous elements from A. That is, you obtain one subarray by removing the first k elements of A, another by removing the k contiguous elements starting at the second position, and so on. For instance, when A = [2, 4, 6, 8, 10] and k = 3, you can remove [2, 4, 6] (which results in [8, 10]), [4, 6, 8] (which results in [2, 10]), or [6, 8, 10] (which results in [2, 4]). Since you are removing k elements from A, you always obtain a subarray of length (len(A) − k), and there are (len(A) − k + 1) such subarrays. In the provided code, the given function contains a buggy line. You are asked to find and fix one line of code so that the function returns the list of subarrays correctly.
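For reference, a correct version of the described function can be sketched in Python as follows (this is not the original extract, and the function name is an assumption):

def remove_k_contiguous(A, k):
    # Dropping A[i:i + k] for each start index i yields
    # len(A) - k + 1 subarrays, each of length len(A) - k.
    return [A[:i] + A[i + k:] for i in range(len(A) - k + 1)]

For A = [2, 4, 6, 8, 10] and k = 3, this returns [[8, 10], [2, 10], [2, 4]].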

Sample solution in Python:

Line 3 should be changed to:

Sample solution in R:

Line 4 should be changed to:

Question #3

This is a recovery task in Query Language (standard SQL). All such tasks can also be solved using pandas in Python, or in R. The average time for solving this task should be approximately 10 minutes.

Expected knowledge:

All subtopics of Query Language, including:
Filtering, Sorting;
Aggregate Functions;
If and Case Functions;
String Functions.

Example:

Note: This task checks the understanding of the RANK() function, but recovery tasks of this kind are not limited to RANK(). There are many kinds of window functions, including, but not limited to, 'value' functions (FIRST_VALUE(), LAST_VALUE(), LAG(), and LEAD()), basic 'aggregate' functions used with OVER() (AVG(), COUNT(), MAX(), and MIN()), and 'ranking' functions (RANK(), ROW_NUMBER(), DENSE_RANK(), and CUME_DIST()).
Consider an input table data that contains four columns (quotes for clarity only):

Table 2
Column name | Type | Description
gender | string | either "F" or "M"
decade | string | one of "1910s", "1920s", "1930s", "1940s", and so on, up until "2010s"
name | string | a non-empty string of a person's given name
frequency | integer (INT64) | number of newborns of a given name, per gender and per decade

For example (Table 3):

Gender | Decade | Name | Frequency
F | 1950s | Alice | 35
M | 1960s | John | 27
F | 2010s | Emily | 42
... | ... | ... | ...
Complete the following query (replace the blank ··· with your answer) so that the query produces a table containing one row per decade, with the most popular female name (for gender "F") and male name (for gender "M") during that decade in the columns name_f and name_m, respectively. You may assume that the data table contains no ties for the most popular name within each gender and decade pair.
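One possible completed query, sketched in standard SQL (the exact query skeleton presented to the test taker is an assumption; the RANK() subquery is the essential part):

SELECT
  decade,
  -- Pivot the rank-1 names into one row per decade.
  MAX(CASE WHEN gender = 'F' AND rnk = 1 THEN name END) AS name_f,
  MAX(CASE WHEN gender = 'M' AND rnk = 1 THEN name END) AS name_m
FROM (
  SELECT
    gender,
    decade,
    name,
    -- Rank names by frequency within each (gender, decade) pair.
    RANK() OVER (PARTITION BY gender, decade ORDER BY frequency DESC) AS rnk
  FROM data
) ranked
GROUP BY decade
ORDER BY decade;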
There are several other ways to obtain the same results, but this task forces the test taker to use the RANK() window function rather than other approaches (e.g., inner joins or self-joins).

Question #4

This is a recovery task, focusing on probability and statistics. The supported languages are Python2, Python3 and R. The average time for solving this task should be approximately 10 minutes.

Expected knowledge:

Statistics Basics (Mean/Median/Mode, Standard Deviation, Z-score, P-value, t-statistic);
Probability (Random variables, Events, Conditional probability, Bayes Theorem);
Programming (Primitive data types (ints/floats/strings) and basic operations, Loops and decision constructs, Basic Collections (arrays, lists, dictionaries/sets)).

Example:

Note: This task focuses on the two-tailed, paired t-test and on data cleaning.
You are analysing the effectiveness of a tutorial used in your statistics class. You asked n students to take a standardised test before starting the tutorial and to take another test after completing the tutorial. A valid score from this test ranges from 300 to 500, inclusive. However, due to some bugs in the grading system, some of the tests were (incorrectly) graded such that scores of those tests were increased by a factor of two (i.e., they range from 600 to 1,000). After correcting this error, you want to perform a t-test with the following hypotheses:
H0 (null): the average difference between before-tutorial and after-tutorial scores is 0;
H1 (alternative): the average difference is non-zero.
Complete the following code so that should_reject() returns true if the null hypothesis is rejected at the significance level specified by alpha, and false otherwise. Assume that the conditions of your t-test are met (after fixing the incorrect scores). Constraints: score_before and score_after will contain the same number of elements (between 10 and 50 elements each, inclusive); each element in score_before and score_after will be an integer between 300 and 1,000, inclusive; alpha will be between 0.001 and 0.10, inclusive.

Sample solution in Python:
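A minimal sketch, assuming SciPy is available (scipy.stats.ttest_rel performs a paired, two-tailed t-test by default):

from scipy import stats

def should_reject(score_before, score_after, alpha):
    # Valid scores range from 300 to 500. The grading bug doubled some
    # scores (600 to 1,000), so halve any score above 500 to recover
    # the true value.
    before = [s / 2 if s > 500 else s for s in score_before]
    after = [s / 2 if s > 500 else s for s in score_after]
    # Paired, two-tailed t-test on the before/after pairs.
    _, p_value = stats.ttest_rel(before, after)
    # Reject the null hypothesis when p < alpha.
    return p_value < alpha

Cleaning the scores before testing matters: leaving the doubled scores in place would distort the paired differences and hence the t-statistic.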

Sample solution in R:

Question #5

This is a recovery task, focusing on probability and statistics. The supported languages are Python2, Python3 and R. The average time for solving this task should be approximately 10 minutes.

Expected knowledge:

Linear regression;
Logistic Regression;
K-means;
Mean squared error;
Cross-validation.

Example:

Note: This task checks the understanding of the linear regression model, mean squared error, and leave-one-out cross-validation. Consider a real-valued independent variable A and a real-valued dependent variable B. You wish to model the relationship between A and B using linear regression and to compute the mean squared error of that model using the leave-one-out validation method. Write a function that returns the mean squared error of the linear regression model, given n observed values of A and B as obs_A and obs_B. Assume that both obs_A and obs_B contain exactly n elements each.

Example function call:

Expected output (returned value): 4.08333333333

You may import and use any of the following libraries: sklearn/pandas/NumPy.

Sample solution in Python:
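A minimal sketch using scikit-learn's LinearRegression (the function name loocv_mse is an assumption):

import numpy as np
from sklearn.linear_model import LinearRegression

def loocv_mse(obs_A, obs_B):
    n = len(obs_A)
    squared_errors = []
    for i in range(n):
        # Hold out observation i and fit on the remaining n - 1 points.
        train_A = [[a] for j, a in enumerate(obs_A) if j != i]
        train_B = [b for j, b in enumerate(obs_B) if j != i]
        model = LinearRegression().fit(train_A, train_B)
        # Squared error of the prediction on the held-out point.
        prediction = model.predict([[obs_A[i]]])[0]
        squared_errors.append((obs_B[i] - prediction) ** 2)
    # The LOOCV MSE is the mean of the n held-out squared errors.
    return float(np.mean(squared_errors))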

Sample solution in R:

DSA multiple-choice questions explained

For the quiz tasks, the nine topics are divided into three groups, with four quiz tasks in each group; in total, there are 12 different quiz tasks. The table below shows the description of the groups and the distribution of tasks between the topics.