Practical Statistics for Data Scientists
3. Statistical experiments
Resampling: repeatedly sample values from observed data to assess random variability in a statistic

Two main types of resampling procedures
The bootstrap
Used to assess the reliability of an estimate
Permutation tests
Used to test hypotheses, typically involving two or more groups
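
As a rough illustration of the first use, assessing the reliability of an estimate, the sketch below resamples a data set with replacement and uses the spread of the recomputed statistic as a standard-error estimate. The function name `bootstrap_standard_error`, its parameters, and the sample values are assumptions made up for this example; they do not come from the book.

```python
import random
import statistics


def bootstrap_standard_error(values, statistic=statistics.mean,
                             n_resamples=1000, seed=0):
    """Estimate the standard error of `statistic` by resampling with replacement.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    estimates = []
    for _ in range(n_resamples):
        # Draw n values WITH replacement from the observed data.
        resample = rng.choices(values, k=n)
        estimates.append(statistic(resample))
    # The spread of the resampled estimates approximates the
    # random variability of the statistic.
    return statistics.stdev(estimates)


# Made-up illustrative data, not taken from the source.
sample = [12, 15, 9, 20, 14, 11, 18, 16]
se = bootstrap_standard_error(sample)
print(f"bootstrap standard error of the mean: {se:.3f}")
```

A larger `n_resamples` gives a more stable estimate at the cost of more computation.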

Permutation test

Permute: to change the order of a set of values
Entails combining the samples from all groups, shuffling them, randomly (or exhaustively) reallocating the observations to resamples, and calculating the statistic of interest for each resample
Combining the groups is the logical embodiment of the null hypothesis that the groups do not differ
The null hypothesis is tested by randomly drawing groups (without replacement) from the combined set, and seeing how much they differ from one another
Compare the observed difference with the permuted differences
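If the observed difference lies outside most of the permutation distribution, the difference is likely not due to chance (that is, it is statistically significant)
If it lies well within the permutation distribution, chance alone could plausibly explain it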
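
The procedure above can be sketched as a random permutation test on the difference in group means. This is a minimal sketch under stated assumptions: the function name `permutation_test`, the two-sided comparison, and the two small groups of values are all hypothetical choices made for illustration.

```python
import random
import statistics


def permutation_test(group_a, group_b, n_permutations=5000, seed=0):
    """Random permutation test on the difference in means (two-sided).

    Hypothetical helper for illustration only.
    Returns (observed difference, estimated p-value).
    """
    rng = random.Random(seed)
    observed = statistics.mean(group_a) - statistics.mean(group_b)

    # Combine all observations: under the null hypothesis the group
    # labels carry no information.
    combined = list(group_a) + list(group_b)
    n_a = len(group_a)

    extreme = 0
    for _ in range(n_permutations):
        # Shuffle, then split WITHOUT replacement into resamples that
        # have the same sizes as the original groups.
        rng.shuffle(combined)
        resample_a = combined[:n_a]
        resample_b = combined[n_a:]
        diff = statistics.mean(resample_a) - statistics.mean(resample_b)
        # Count permuted differences at least as large as the observed one.
        if abs(diff) >= abs(observed):
            extreme += 1

    return observed, extreme / n_permutations


# Made-up illustrative data with unequal group sizes.
group_a = [21, 25, 30, 28, 26, 27]
group_b = [18, 20, 17, 22, 19]
observed_diff, p_value = permutation_test(group_a, group_b)
print(f"observed difference: {observed_diff:.3f}, p-value: {p_value:.4f}")
```

A small p-value means the observed difference sits in the tail of the permutation distribution, so chance alone is an unlikely explanation.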

Example: web stickiness

for the R Markdown notebook.

Exhaustive and bootstrap permutation tests

Two variants of the permutation test
Exhaustive permutation test
Instead of randomly shuffling and dividing the data, figure out all the possible ways it could be divided
Only practical for relatively small sample sizes
Sometimes called exact tests, due to their statistical property of guaranteeing that the null model will not test as "significant" more than the alpha level of the test
Bootstrap permutation test
The draws are made with replacement instead of without replacement
In this way the resampling procedure models both
The random assignment of treatment to subject
The random selection of subjects from a population
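
The two variants can be sketched side by side. The exhaustive version enumerates every way of splitting the combined data into groups of the original sizes; the bootstrap version draws each resample with replacement. Both function names and the sample values are assumptions introduced for this illustration, and the statistic (absolute difference in means) is just one possible choice.

```python
import itertools
import random
import statistics


def exhaustive_permutation_p_value(group_a, group_b):
    """Two-sided p-value from ALL possible splits (hypothetical helper).

    Only practical for small samples: the number of splits grows
    combinatorially with the sample size.
    """
    combined = list(group_a) + list(group_b)
    n_a = len(group_a)
    observed = abs(statistics.mean(group_a) - statistics.mean(group_b))
    indices = range(len(combined))

    extreme = 0
    total = 0
    for a_idx in itertools.combinations(indices, n_a):
        chosen = set(a_idx)
        resample_a = [combined[i] for i in a_idx]
        resample_b = [combined[i] for i in indices if i not in chosen]
        diff = abs(statistics.mean(resample_a) - statistics.mean(resample_b))
        if diff >= observed:
            extreme += 1
        total += 1
    return extreme / total


def bootstrap_permutation_p_value(group_a, group_b, n_resamples=5000, seed=0):
    """Two-sided p-value with draws made WITH replacement (hypothetical helper)."""
    rng = random.Random(seed)
    combined = list(group_a) + list(group_b)
    observed = abs(statistics.mean(group_a) - statistics.mean(group_b))

    extreme = 0
    for _ in range(n_resamples):
        # Each resample is drawn with replacement from the combined data.
        resample_a = rng.choices(combined, k=len(group_a))
        resample_b = rng.choices(combined, k=len(group_b))
        diff = abs(statistics.mean(resample_a) - statistics.mean(resample_b))
        if diff >= observed:
            extreme += 1
    return extreme / n_resamples


# Made-up illustrative data, small enough to enumerate all 35 splits.
small_a = [21, 25, 30, 28]
small_b = [18, 20, 17]
p_exact = exhaustive_permutation_p_value(small_a, small_b)
p_boot = bootstrap_permutation_p_value(small_a, small_b)
print(f"exhaustive p-value: {p_exact:.4f}, bootstrap p-value: {p_boot:.4f}")
```

The exhaustive result is deterministic, while the bootstrap result depends on the random seed and the number of resamples.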

Permutation tests: the bottom line for data science

Permutation tests are useful heuristic procedures for exploring the role of random variation
Compared to formula-based statistics, permutation tests make fewer assumptions about the data
Data can be numeric or binary
Sample sizes can be the same or different
Normal distribution not needed