 Practical Statistics for Data Scientists
3. Statistical experiments

# Resampling

Repeatedly sample values from observed data to assess the random variability in a statistic
Two main types of resampling procedures:
The bootstrap
Used to assess the reliability of an estimate
Permutation tests
Used to test hypotheses, typically involving two or more groups
Permutation test
Permute: to change the order of a set of values
Entails combining the samples from all groups, shuffling them, randomly (or exhaustively) reallocating the observations to resamples, and calculating the statistic of interest for each resample
Combining the groups is the logical embodiment of the null hypothesis that the groups do not differ
The null hypothesis is tested by randomly drawing groups (without replacement) from the combined set, and seeing how much they differ from one another
Compare the observed difference with the permuted differences
If the observed difference lies outside most of the permutation distribution, the difference is likely not due to chance
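
The procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the book's code: the function names `perm_diff` and `permutation_test`, the default of 5000 resamples, and the choice of the difference in means as the statistic are all assumptions made for the example.

```python
import random

def perm_diff(pooled, n_a, rng):
    """Shuffle the pooled values, split them into two resamples,
    and return mean(resample A) - mean(resample B)."""
    shuffled = pooled[:]
    rng.shuffle(shuffled)  # reallocation without replacement
    a, b = shuffled[:n_a], shuffled[n_a:]
    return sum(a) / len(a) - sum(b) / len(b)

def permutation_test(group_a, group_b, n_resamples=5000, seed=0):
    """Random permutation test on the difference in means.
    Returns the observed difference, the permuted differences,
    and a two-sided p-value."""
    rng = random.Random(seed)
    observed = sum(group_a) / len(group_a) - sum(group_b) / len(group_b)
    # Pooling the groups embodies the null hypothesis of no difference.
    pooled = list(group_a) + list(group_b)
    diffs = [perm_diff(pooled, len(group_a), rng) for _ in range(n_resamples)]
    # Share of permuted differences at least as extreme as the observed one.
    p_value = sum(abs(d) >= abs(observed) for d in diffs) / n_resamples
    return observed, diffs, p_value
```

A small p-value means the observed difference lies outside most of the permutation distribution. Plotting a histogram of `diffs` with the observed difference marked is a common way to inspect this visually.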
Example: web stickiness
See the R Markdown notebook.
Exhaustive and bootstrap permutation tests
Two variants of the permutation test
Exhaustive permutation test
Instead of randomly shuffling and dividing the data, figure out all the possible ways it could be divided
Only practical for relatively small sample sizes
Sometimes called exact tests, due to their statistical property of guaranteeing that the null model will not test as "significant" more than the alpha level of the test
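
As a rough sketch of the exhaustive variant, the code below enumerates every way of splitting the pooled data into two groups of the original sizes. The function name `exhaustive_permutation_test` and the use of the difference in means are assumptions for illustration only. The number of splits grows combinatorially, which is why this is only practical for small samples.

```python
from itertools import combinations

def exhaustive_permutation_test(group_a, group_b):
    """Enumerate all splits of the pooled data into groups of the
    original sizes. Returns the observed difference in means, the
    number of splits, and a two-sided exact p-value."""
    pooled = list(group_a) + list(group_b)
    n, n_a = len(pooled), len(group_a)
    n_b = n - n_a
    total = sum(pooled)
    observed = sum(group_a) / n_a - sum(group_b) / n_b

    diffs = []
    # Each combination of indices is one possible "group A".
    for idx in combinations(range(n), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diffs.append(sum_a / n_a - (total - sum_a) / n_b)

    # Small tolerance guards against floating-point ties.
    extreme = sum(abs(d) >= abs(observed) - 1e-12 for d in diffs)
    return observed, len(diffs), extreme / len(diffs)
```

For two groups of three observations there are only 20 splits, but two groups of 20 already give more than 10^11, at which point random shuffling is the usual fallback.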
Bootstrap permutation test
The draws are made with replacement instead of without replacement
This models both:
The random assignment of treatment to subject
The random selection of subjects from a population
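
A minimal sketch of the bootstrap variant follows, assuming the same difference-in-means statistic. The only change from an ordinary random permutation test is that each resample is drawn from the pooled data with replacement. The function name `bootstrap_permutation_test` and its defaults are hypothetical.

```python
import random

def bootstrap_permutation_test(group_a, group_b, n_resamples=5000, seed=0):
    """Permutation-style test in which resamples are drawn WITH
    replacement from the pooled data. Returns the observed difference
    in means and a two-sided p-value."""
    rng = random.Random(seed)
    pooled = list(group_a) + list(group_b)
    n_a, n_b = len(group_a), len(group_b)
    observed = sum(group_a) / n_a - sum(group_b) / n_b

    extreme = 0
    for _ in range(n_resamples):
        # Drawing with replacement: a value may appear several times
        # in a resample, or not at all.
        a = rng.choices(pooled, k=n_a)
        b = rng.choices(pooled, k=n_b)
        diff = sum(a) / n_a - sum(b) / n_b
        if abs(diff) >= abs(observed):
            extreme += 1
    return observed, extreme / n_resamples
```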
Permutation tests: the bottom line for data science
Permutation tests are useful heuristic procedures for exploring the role of random variation
Compared to formula-based statistics, permutation tests make fewer assumptions about the data
Data can be numeric or binary
Sample sizes can be the same or different
Normal distribution not needed
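
To illustrate this flexibility, here is a sketch of the same shuffling idea applied to binary outcomes (1 = converted, 0 = did not) with unequal group sizes. The function name `proportion_permutation_test` and the data in the usage note are made up for the example.

```python
import random

def proportion_permutation_test(group_a, group_b, n_resamples=2000, seed=0):
    """Permutation test on the difference in proportions for 0/1 data.
    The groups may have different sizes. Returns the observed difference
    and a two-sided p-value."""
    rng = random.Random(seed)
    n_a, n_b = len(group_a), len(group_b)
    observed = sum(group_a) / n_a - sum(group_b) / n_b
    pooled = list(group_a) + list(group_b)

    extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # reallocate outcomes without replacement
        diff = sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b
        if abs(diff) >= abs(observed):
            extreme += 1
    return observed, extreme / n_resamples
```

For instance, with hypothetical data of 30 conversions among 200 visitors in one group and 20 among 300 in the other, calling `proportion_permutation_test([1] * 30 + [0] * 170, [1] * 20 + [0] * 280)` tests whether the gap in conversion rates is larger than shuffling alone would typically produce.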
