Basic coding
This covers the basic building blocks for data processing and modelling. Although data scientists are not software engineers, they should be able to write simple programs for data processing and modelling with ease.
Included topics: Primitive data types (ints/floats/strings) and basic arithmetic/logical operations, loops and decision constructs, basic collections (arrays, lists, and dictionaries/sets), and libraries highly relevant to data science/analysis (e.g., Python's NumPy/pandas/scikit-learn).
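As a rough illustration of the level intended here, the following sketch combines primitive types, a loop, a decision construct, and the basic collections in one small task. The readings and the threshold of 10 are made up for the example.

```python
# Hypothetical readings; None marks a missing value.
readings = [12.5, None, 7.0, 12.5, 30.25]

total = 0.0        # float accumulator
count = 0          # int counter
seen = set()       # set of distinct values
frequency = {}     # dict: value -> number of occurrences

for value in readings:
    if value is None:      # decision construct: skip missing data
        continue
    total += value
    count += 1
    seen.add(value)
    frequency[value] = frequency.get(value, 0) + 1

# Guard against division by zero when every reading is missing.
mean = total / count if count > 0 else float("nan")
label = "high" if mean > 10 else "low"   # arbitrary illustrative threshold
print(f"mean={mean:.2f}, distinct={len(seen)}, label={label}")
```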
Query language
This covers the basic building blocks for analysing data. Regardless of whether SQL, a pandas DataFrame, or another tool is used, data scientists should be comfortable querying a dataset to gain insights.
Included topics: Using SQL or Python/R libraries such as pandas; basics (filtering, sorting, aggregate functions, IF/CASE expressions, and string functions); subqueries (inner queries); joins (inner, left, right); window functions and window-specific aggregates.
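The sketch below shows how a few of these query operations might look in pandas, with the rough SQL equivalent in the comments. The `orders` and `customers` tables and their columns are invented for illustration.

```python
import pandas as pd

# Hypothetical tables.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5],
    "customer_id": [10, 10, 20, 30, 20],
    "amount": [50.0, 20.0, 70.0, 10.0, 30.0],
})
customers = pd.DataFrame({
    "customer_id": [10, 20, 40],
    "region": ["north", "south", "east"],
})

# Filtering and sorting: WHERE amount >= 20 ORDER BY amount DESC
large = orders[orders["amount"] >= 20].sort_values("amount", ascending=False)

# Aggregation: SELECT customer_id, SUM(amount) ... GROUP BY customer_id
totals = orders.groupby("customer_id", as_index=False)["amount"].sum()

# Left join: keeps every order; region is NaN for the unmatched customer 30
joined = orders.merge(customers, on="customer_id", how="left")

# Window-style aggregate:
# SUM(amount) OVER (PARTITION BY customer_id ORDER BY order_id)
orders["running_total"] = (
    orders.sort_values("order_id").groupby("customer_id")["amount"].cumsum()
)
```

The running total is aligned back to `orders` by index, so the original row order is preserved.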
Probability basics
This covers the basic building blocks for setting up a well-defined modelling problem. Without appropriate use of basic probability theory, it is difficult, if not impossible, to communicate clearly to other data scientists and engineers which modelling problem one is trying to solve.
Included topics: Random Variables, Events, and Probability Distributions.
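To make the three terms concrete, here is a small sketch using two fair dice, a textbook example chosen for illustration. The sample space is enumerated exactly, the random variable is the sum of the faces, and an event is a subset of outcomes.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 equally likely outcomes of two fair six-sided dice.
outcomes = list(product(range(1, 7), repeat=2))

# Random variable X maps each outcome to the sum of the two faces.
# Its probability distribution assigns a probability to each value of X.
distribution = {}
for a, b in outcomes:
    x = a + b
    distribution[x] = distribution.get(x, 0) + Fraction(1, len(outcomes))

# The event "X is at least 10" is a subset of the sample space.
p_at_least_10 = sum(p for x, p in distribution.items() if x >= 10)

# Expected value of X under this distribution.
expected_value = sum(x * p for x, p in distribution.items())
```

Using `Fraction` keeps the probabilities exact, so the distribution sums to exactly 1.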
Statistics basics
This covers the basic building blocks for statistical modelling/learning and quantitative analysis. The first things a data scientist needs to know about any dataset are its size, its distribution, and summary statistics such as the mean, standard deviation, and skewness. In addition, given a hypothesis, one should be able to choose and apply an appropriate well-established statistical test to decide whether the data support rejecting it. Data scientists should know the statistical basics by heart and be able to apply them appropriately.
Included topics: Mean/Median/Mode, Standard Deviation, z-score, p-value, and t-statistic.
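The following sketch computes most of these quantities with the Python standard library only. The sample and the hypothesised mean `mu0` are made up. Turning the t-statistic into a p-value needs the CDF of the t-distribution, which the standard library does not provide, so that step is left out here.

```python
import math
import statistics

sample = [4.0, 5.0, 5.0, 6.0, 10.0]   # hypothetical measurements

mean = statistics.mean(sample)
median = statistics.median(sample)
mode = statistics.mode(sample)
stdev = statistics.stdev(sample)       # sample standard deviation (n - 1)

# z-score: how many standard deviations each point lies from the mean.
z_scores = [(x - mean) / stdev for x in sample]

# One-sample t-statistic against a hypothesised population mean.
mu0 = 5.0  # illustrative assumption
t_stat = (mean - mu0) / (stdev / math.sqrt(len(sample)))
```

Note that the mean (6.0) sits above the median (5.0) because the single large value pulls it up, a quick sign of right skew.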
Conditional probability and Bayes' theorem
Conditional probability and Bayes' theorem are crucial in evaluating models and comparing their performance. For instance, selection bias in sampling often leads data scientists to misjudge the accuracy or performance of the models they are testing, which in turn leads to models that unexpectedly underperform once rolled out in production.
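One way to see the effect is to apply Bayes' theorem to a classifier's positive predictions under two different base rates. The numbers below (90% sensitivity, 90% specificity, a balanced evaluation sample versus a 1% positive rate in production) are illustrative assumptions, not measurements.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """P(actual positive | predicted positive) via Bayes' theorem."""
    true_pos = sensitivity * prevalence                    # P(pred+ and actual+)
    false_pos = (1.0 - specificity) * (1.0 - prevalence)   # P(pred+ and actual-)
    return true_pos / (true_pos + false_pos)

# Evaluated on a balanced sample (a form of selection bias if production is skewed).
balanced = positive_predictive_value(0.9, 0.9, prevalence=0.5)

# The same classifier where only 1% of cases are truly positive.
production = positive_predictive_value(0.9, 0.9, prevalence=0.01)

print(f"balanced sample: {balanced:.3f}, production: {production:.3f}")
```

Under these assumptions, the share of positive predictions that are correct drops from 90% to roughly 8%, even though the classifier itself has not changed.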
Linear regression
Linear regression is one of the simplest yet most powerful models for predicting a continuous, scalar response, and it is often the first step towards building more complex, more accurate prediction models. Its importance cannot be overemphasised.
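A minimal sketch of fitting such a model by ordinary least squares with NumPy follows. The data are synthetic, generated from an assumed true line y = 3x + 2 plus noise.

```python
import numpy as np

# Synthetic data from an assumed relationship y = 3x + 2 + noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 3.0 * x + 2.0 + rng.normal(0, 0.5, size=200)

# Design matrix with a column of ones for the intercept.
X = np.column_stack([np.ones_like(x), x])

# Ordinary least squares: minimise ||X @ coef - y||^2.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = coef

# Root-mean-square error of the fitted values on the training data.
y_hat = X @ coef
rmse = np.sqrt(np.mean((y - y_hat) ** 2))
```

The recovered slope and intercept should land close to 3 and 2, and the RMSE close to the noise level of 0.5.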
Logistic regression
Logistic regression is one of the most widely used statistical models for classification. Like linear regression, it is often the first model data scientists use to analyse which features matter before moving on to more complex models.
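The sketch below fits a logistic regression by plain gradient descent in NumPy, as a hand-rolled stand-in for a library implementation. The data are synthetic: by construction only the first feature influences the label, so the fitted weights should reflect that. The learning rate and iteration count are arbitrary choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic data: only feature 0 drives the label; feature 1 is pure noise.
rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 2))
y = (rng.uniform(size=n) < sigmoid(2.0 * X[:, 0])).astype(float)

# Gradient descent on the mean log-loss.
w = np.zeros(2)
b = 0.0
lr = 0.1
for _ in range(2000):
    p = sigmoid(X @ w + b)
    w -= lr * (X.T @ (p - y) / n)   # gradient with respect to the weights
    b -= lr * np.mean(p - y)        # gradient with respect to the bias

# Training accuracy at a 0.5 decision threshold.
accuracy = np.mean((sigmoid(X @ w + b) >= 0.5) == (y == 1))
print("weights:", w, "accuracy:", accuracy)
```

Comparing the magnitudes of the fitted weights (on comparably scaled features) gives a first indication of which features matter.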
Clustering algorithms
Unsupervised learning is essential for data scientists, and k-means clustering is the first algorithm to learn.
Included topics: k-means clustering
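A compact sketch of k-means (Lloyd's algorithm) in NumPy is given below. The farthest-point initialisation is a simplifying assumption made here to keep the example deterministic. Library implementations typically use other schemes such as k-means++. The two blobs are synthetic.

```python
import numpy as np

def kmeans(X, k, n_iter=100):
    # Farthest-point initialisation: start from X[0], then repeatedly add
    # the point farthest from the centroids chosen so far.
    centroids = [X[0]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centroids], axis=0)
        centroids.append(X[np.argmax(d)])
    centroids = np.array(centroids, dtype=float)

    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(distances, axis=1)
        # Update step: move each centroid to the mean of its points
        # (keep the old centroid if a cluster ends up empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Two well-separated synthetic blobs.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=[0, 0], scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=[5, 5], scale=0.5, size=(50, 2))
X = np.vstack([blob_a, blob_b])

centroids, labels = kmeans(X, k=2)
```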
Regularisation
Regularisation plays an important role in modelling, and linear models are no exception: for one, regularisation helps prevent models from overfitting. Different regularisation methods lead to different results in terms of feature selection and bias, and understanding their implications is important.
Included topics: Regularisation in Linear Regression and Logistic Regression.
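As one concrete case, the sketch below implements L2 (ridge) regularisation for linear regression using its closed-form solution and compares it with the unregularised fit. The data, the true weights, and the penalty strength of 50 are assumptions for illustration. L1 (lasso) regularisation, which can drive coefficients exactly to zero, has no closed form and is not shown.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form L2-regularised least squares (no intercept term).

    Solves (X^T X + lam * I) w = X^T y; lam = 0 gives ordinary least squares.
    """
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Synthetic data: two of the five true weights are zero.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
true_w = np.array([3.0, -2.0, 0.0, 0.0, 1.0])
y = X @ true_w + rng.normal(scale=0.5, size=50)

w_ols = ridge_fit(X, y, lam=0.0)     # unregularised
w_ridge = ridge_fit(X, y, lam=50.0)  # heavily regularised

print("OLS:  ", np.round(w_ols, 2))
print("ridge:", np.round(w_ridge, 2))
```

The ridge weights are shrunk towards zero (adding bias) but generally not set exactly to zero, which is why ridge on its own does not perform feature selection.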
Model evaluation
With a plethora of useful, easy-to-use libraries, we can easily train linear models, such as linear regression models and logistic regression models, and come up with stunning charts that demonstrate how accurate the models are. Yet it is critical to choose the right validation methods and error metrics depending on factors such as the skewness of the data and whether data is scarce or abundant. Without careful validation, one may reach wrong, biased conclusions, and the models will not perform as well in production as expected.
Included topics: Training/test error, validation methods such as k-fold cross validation, and various evaluation metrics.
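A minimal sketch of k-fold cross-validation in NumPy follows, using least-squares linear regression and mean squared error as the metric. The helper name `k_fold_mse` and the synthetic data are assumptions for the example. A library routine would normally be used in practice.

```python
import numpy as np

def k_fold_mse(X, y, k=5, seed=0):
    """Return (mean training MSE, mean validation MSE) over k folds."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(y))     # shuffle before splitting
    folds = np.array_split(indices, k)    # k disjoint index sets
    train_errors, val_errors = [], []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Fit on the k - 1 training folds only.
        coef, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
        train_errors.append(np.mean((X[train_idx] @ coef - y[train_idx]) ** 2))
        # Score on the held-out fold, which the fit never saw.
        val_errors.append(np.mean((X[val_idx] @ coef - y[val_idx]) ** 2))
    return float(np.mean(train_errors)), float(np.mean(val_errors))

# Synthetic data: y = 3x + 2 + unit-variance noise.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=100)
y = 3.0 * x + 2.0 + rng.normal(0, 1.0, size=100)
X = np.column_stack([np.ones_like(x), x])

train_mse, val_mse = k_fold_mse(X, y, k=5)
print(f"train MSE: {train_mse:.3f}, validation MSE: {val_mse:.3f}")
```

The validation error is the figure to report. A large gap between it and the training error is a common sign of overfitting.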