Practical Statistics for Data Scientists
1. Exploratory data analysis
Elements of structured data
Estimates of location
Estimates of variability
Exploring the data distribution
Exploring binary and categorical data
Correlation
Exploring two or more variables
2. Data distributions
Random sampling and sample bias
Selection bias
Sampling distribution of a statistic
The bootstrap
Confidence intervals
Normal distribution
Long-tailed distributions
Student's t-distribution
Binomial distribution
Poisson and related distributions
3. Statistical experiments
A/B testing
Hypothesis tests
Resampling
Statistical significance and p-values
t-Tests
Multiple testing
Degrees of freedom
ANOVA
Chi-square test
Multi-armed bandit algorithm
Power and sample size
4. Regression
Simple linear regression
Multiple linear regression
Prediction using regression
Factor variables in regression
Interpreting the regression equation
Testing the assumptions: regression diagnostics
Polynomial and spline regression
5. Classification
Naive Bayes
Discriminant analysis
Logistic regression
Evaluating classification models
Strategies for imbalanced data
6. Statistical ML
K-nearest neighbours
Tree models
Bagging and random forest
Boosting
7. Unsupervised learning
Principal components analysis
K-means clustering
Hierarchical clustering
Model-based clustering
Scaling and categorical variables
5. Classification
Evaluating classification models
Accuracy
The percent/proportion of cases classified correctly
Confusion matrix
A tabular display of the record counts by their predicted and actual classification status
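The R snippets later in these notes index into a 2x2 confusion matrix; a minimal Python sketch with made-up labels (1 = the rare positive class) shows how such a matrix and the accuracy are computed:

```python
# Hypothetical labels: 1 = positive (rare) class, 0 = negative class.
actual    = [1, 0, 1, 1, 0, 0, 0, 1, 0, 0]
predicted = [1, 0, 0, 1, 0, 1, 0, 1, 0, 0]

# 2x2 confusion matrix laid out like the R conf_mat in these notes:
# row 1 = actual 1s, row 2 = actual 0s; col 1 = predicted 1s, col 2 = predicted 0s.
conf_mat = [[0, 0], [0, 0]]
for a, p in zip(actual, predicted):
    conf_mat[1 - a][1 - p] += 1

# Accuracy = (TP + TN) / total
accuracy = (conf_mat[0][0] + conf_mat[1][1]) / len(actual)
print(conf_mat)   # [[3, 1], [1, 5]]
print(accuracy)   # 0.8
```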
The rare class problem
Depending on the relative costs of misclassification, one must trade off false positives against false negatives
Precision, recall, and specificity

| Term | Description | Interpretation | Formula | R code |
| --- | --- | --- | --- | --- |
| Specificity | The percent/proportion of 0s correctly classified | Measures a model's ability to predict a negative outcome | TN / (TN + FP) | conf_mat[2,2]/sum(conf_mat[2,]) |
| Precision | The percent/proportion of predicted 1s that are actually 1s | The accuracy of a predicted positive outcome | TP / (TP + FP) | conf_mat[1,1]/sum(conf_mat[,1]) |
| Sensitivity/Recall | The percent/proportion of 1s correctly classified | Measures the strength of the model to predict a positive outcome | TP / (TP + FN) | conf_mat[1,1]/sum(conf_mat[1,]) |
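The R code above assumes a conf_mat with actual classes in rows (1s first) and predicted classes in columns. A Python equivalent, using 0-based indexing and hypothetical counts, might look like:

```python
# Same layout as the R conf_mat: row 0 = actual 1s, row 1 = actual 0s,
# col 0 = predicted 1s, col 1 = predicted 0s. Counts are made up for illustration.
conf_mat = [[3, 1],   # TP, FN
            [1, 5]]   # FP, TN

recall      = conf_mat[0][0] / sum(conf_mat[0])                    # TP / (TP + FN)
precision   = conf_mat[0][0] / (conf_mat[0][0] + conf_mat[1][0])   # TP / (TP + FP)
specificity = conf_mat[1][1] / sum(conf_mat[1])                    # TN / (TN + FP)
print(recall, precision, specificity)
```

Note the shift from R's 1-based indexing (conf_mat[1,1]) to Python's 0-based (conf_mat[0][0]).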
Receiver Operating Characteristics (ROC) curve
A plot of the trade-off between sensitivity (recall) and specificity as the cutoff for classifying a record as a 1 is varied
Precision-recall curve
A plot of precision against recall as the cutoff is varied; often more informative than the ROC curve when 1s are rare
Area under the curve (AUC)
The area under the ROC curve; 1 indicates a perfect classifier, 0.5 a classifier no better than random
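Given the points of an ROC curve, the AUC can be approximated with the trapezoid rule; a sketch with hypothetical curve points:

```python
# ROC curve plotted as recall (tpr) vs 1 - specificity (fpr); points are hypothetical.
fpr = [0.0, 0.0, 0.25, 0.5, 1.0]   # 1 - specificity at successive cutoffs
tpr = [0.0, 0.5, 0.75, 1.0, 1.0]   # recall at the same cutoffs

# Trapezoid rule: sum the area of each strip between consecutive points.
auc = sum((fpr[i + 1] - fpr[i]) * (tpr[i + 1] + tpr[i]) / 2
          for i in range(len(fpr) - 1))
print(auc)  # 0.875
```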
Lift
A measure of how effective the model is at identifying (comparatively rare) 1s at different probability cutoffs
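One common way to compute lift, sketched here with made-up data: compare the rate of 1s among the top-scored records against the overall base rate.

```python
# Hypothetical outcomes and predicted probabilities of class 1.
actual = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.2, 0.8, 0.3, 0.1, 0.7, 0.4, 0.35, 0.25, 0.15]

overall_rate = sum(actual) / len(actual)   # base rate of 1s: 0.3

# Rank records by predicted probability, highest first.
ranked = [a for _, a in sorted(zip(scores, actual), reverse=True)]
top = ranked[:3]                           # top 30% by predicted probability

# Lift = rate of 1s in the top slice / overall rate of 1s.
lift = (sum(top) / len(top)) / overall_rate
print(lift)  # ~3.33: the model finds 1s about 3.3x better than random in this slice
```

Repeating this at several cutoffs (deciles, say) gives a lift chart.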