ML04 KNN


KNN is a mathematically simpler model than most, but it comes up frequently as one of the fundamental techniques, so for completeness we briefly discuss it here. As in my other materials, I mainly refer to the book “An Introduction to Statistical Learning with R”.

K-Nearest Neighbors

This is a relatively simple classification method in machine learning. The idea is to mimic the Bayes classifier, which requires knowledge of the conditional distribution of $Y$ given $X$ and is therefore impossible in the real world. So we estimate this conditional distribution and classify accordingly, assigning the new observation to the class with the highest estimated probability.

Every time KNN receives a new instance $x_0$, it finds the $K$ nearest neighbors in the training set and assigns probabilities to the classes based on the class frequencies among these neighbors. Denoting the neighbor set by $\mathcal{N}_0$,

$$ \operatorname{Pr}\left(Y=j \mid X=x_0\right)=\frac{1}{K} \sum_{i \in \mathcal{N}_0} I\left(y_i=j\right) $$
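A minimal sketch of this estimate in Python (the function name, the toy data, and the choice of Euclidean distance are my own illustrative assumptions, not from the book):

```python
import numpy as np

def knn_predict_proba(X_train, y_train, x0, K, n_classes):
    """Estimate Pr(Y = j | X = x0) by the class frequencies
    among the K nearest training points (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x0, axis=1)  # distance to every training point
    neighbors = np.argsort(dists)[:K]             # indices of the K nearest points
    counts = np.bincount(y_train[neighbors], minlength=n_classes)
    return counts / K                             # empirical class frequencies

# Toy usage: two classes in 2-D.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1], [1.2, 0.8]])
y_train = np.array([0, 0, 1, 1, 1])
probs = knn_predict_proba(X_train, y_train, x0=np.array([1.0, 0.9]), K=3, n_classes=2)
print(probs, "-> predicted class:", probs.argmax())
```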

Parameter $K$

There is only one parameter to tune here: $K$. For larger $K$, the model looks at a wider range of points, so the fitted boundary changes little as we move across the feature space (it becomes smoother and tends toward linear): low variance but potentially high bias. If $K=1$, the model is overly flexible: less bias but high variance. In particular, the error rate on the training data is 0, since each training point's single nearest neighbor is itself. A quick numerical check is sketched below.
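To see this numerically, here is a small check on synthetic data (the data-generating process is my own assumption; it uses scikit-learn's `KNeighborsClassifier`):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data: label depends on x1 + x2 plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

for K in (1, 5, 25, 100):
    clf = KNeighborsClassifier(n_neighbors=K).fit(X, y)
    print(f"K={K:3d}  training error = {1 - clf.score(X, y):.3f}")
# K=1 gives training error 0: each point's nearest neighbor is itself.
# Training error grows with K as the fit becomes smoother (higher bias).
```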
