Distance metrics

Intelligence Refinery

Introduction

Used to measure the distance between two records

Will influence the shape of the clusters [@Clustering Distance Measures]
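To illustrate why the metric can change the clusters, the sketch below assigns one record to the nearest of two cluster centers under two different metrics. The helper names (`euclidean`, `manhattan`, `nearest`) and the sample points are hypothetical and chosen only for illustration.

```python
import math

def euclidean(p, q):
    # Straight-line distance: square root of the summed squared differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    # Sum of the absolute differences along each dimension.
    return sum(abs(a - b) for a, b in zip(p, q))

def nearest(record, centers, metric):
    # Return the label of the center closest to `record` under `metric`.
    return min(centers, key=lambda label: metric(record, centers[label]))

# Hypothetical cluster centers and record.
centers = {"A": (0, 5), "B": (3, 3)}
record = (0, 0)

nearest_euclidean = nearest(record, centers, euclidean)  # B: sqrt(18) < 5
nearest_manhattan = nearest(record, centers, manhattan)  # A: 5 < 6
```

The same record lands in a different cluster depending on the metric, which is one way the choice of metric can reshape the resulting clusters.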

Metrics

Name

Definition

Suited for

Notes

Euclidean

Commonly used default distance metric, performs well in general

Also used in K-means clustering

Also called l2 distance

Manhattan

Calculated by summing the absolute differences between the two records along each dimension

Ex. On a map, the Euclidean distance is the straight-line route between two points, while the Manhattan distance is the route that moves along one axis and then along the other, as a car in the city would when driving along city blocks.

l1 distance is often good for sparse features, or sparse noise: i.e. many of the features are zero, as in text mining using occurrences of rare words

[scikit-learn documentation]

Similar to Euclidean distance

Also called l1 distance

Cosine

A good choice when there are many variables and you worry that some of them may not be significant.

Cosine distance reduces noise by taking the shape of the variables, more than their values, into account.

Invariant to global scalings of the signal [@2.3.6.4 Varying the metric]

It tends to associate observations that have the same maximum and minimum variables, regardless of their effective value.

Hamming

Pearson correlation distance

Spearman correlation distance
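As a minimal sketch of the three metrics described in the table, the code below computes Euclidean, Manhattan, and cosine distance for small example vectors using only the Python standard library. The vectors and the helper names `manhattan` and `cosine_distance` are assumptions made for illustration; cosine distance is taken here as one minus the cosine similarity, which is one common convention.

```python
import math

def manhattan(p, q):
    # l1 distance: sum of absolute differences along each dimension.
    return sum(abs(a - b) for a, b in zip(p, q))

def cosine_distance(p, q):
    # 1 - cosine similarity; undefined if either vector is all zeros.
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / (norm_p * norm_q)

# Hypothetical records.
x = (1.0, 2.0, 3.0)
y = (2.0, 4.0, 6.0)  # same shape as x, every value scaled by 2
z = (3.0, 2.0, 1.0)  # same values as x, opposite shape

euclidean_xy = math.dist(x, y)     # l2 distance (Python 3.8+)
manhattan_xy = manhattan(x, y)     # l1 distance
cosine_xy = cosine_distance(x, y)  # approximately 0: scaling is ignored
cosine_xz = cosine_distance(x, z)  # clearly above 0: the shape differs
```

Here `x` and `y` are far apart under the Euclidean and Manhattan metrics but essentially identical under cosine distance, which reflects its invariance to global scaling.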
