Posted on 10 March 2023, by Kevin Mora
The number of points we consider, k, isn't like the number of parameters in a regression model; it behaves more like a smoothing width or window size.
Let's push it to the limit: k = n. When k = n, classification simply predicts the most prevalent class, no matter which test point we pick. We can create a regression analog with K-Nearest Neighbors regression, wherein we report the average value of the dependent variable across the k neighbors rather than a class label. When k = n, we always return the global mean, and we call this "minimal variability."
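To make this concrete, here is a minimal base-R sketch (the knn_reg helper and the synthetic data are made up for illustration, not taken from the code linked at the end of the post): with k = n, the prediction is identical at every query point and equals the global mean of y.

```r
# Plain KNN regression for a single predictor: average the responses of the k nearest points.
knn_reg <- function(x_train, y_train, x_query, k) {
  sapply(x_query, function(x0) {
    d <- abs(x_train - x0)       # distance from the query to every training point
    nn <- order(d)[seq_len(k)]   # indices of the k nearest neighbors
    mean(y_train[nn])            # prediction: their average response
  })
}

set.seed(1)
x <- runif(50)
y <- sin(2 * pi * x) + rnorm(50, sd = 0.2)

knn_reg(x, y, x_query = c(0.1, 0.5, 0.9), k = length(x))  # the same value at every query point
mean(y)                                                   # ...namely the global mean of y
```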
Contrarily, let's think about k = 1. The truth is that we are not smoothing at all there, and any noise will inevitably creep into our predictions. Consider how this would look if a KNN regression were used: our prediction would be a jagged, fragmented function that changes values quickly, almost as if we had just connected the dots, which ultimately means that it has a lot of variance.
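Again as an illustrative sketch with made-up data: with k = 1, querying at a training point returns that point's own response exactly, so the fit reproduces the noise instead of smoothing it away.

```r
set.seed(1)
x <- sort(runif(20))
y <- sin(2 * pi * x) + rnorm(20, sd = 0.2)

x0 <- x[7]                               # query exactly at a training point
nn <- order(abs(x - x0))[1]              # k = 1: the nearest neighbor is the point itself
c(prediction = y[nn], observed = y[7])   # identical -- the fit just connects the dots
```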
Now let us consider the extremes: if we use k = n, the prediction will be the same regardless of the spot we choose (equal to the mean of the data set). The k is essentially the degree of local averaging. The more averaging there is, the more generalized and global the model is; the less averaging there is, the more local (and consequently less global) the model becomes. When we pick k = 10 versus k = 3, we are not picking 10 neighbors from the same region of space as 3. The ten neighbors will reach farther out than the closest three (assuming the data density is low enough).
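A toy example with hypothetical uniform data shows this: the distance to the 10th nearest neighbor of a query point is never smaller, and typically noticeably larger, than the distance to the 3rd.

```r
set.seed(1)
x <- runif(200)                          # 200 points scattered over [0, 1]
x0 <- 0.5                                # a query location
d <- sort(abs(x - x0))                   # distances to the query, smallest first
c(radius_k3 = d[3], radius_k10 = d[10])  # the 10th neighbor sits farther out than the 3rd
```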
We must also be aware that the k-nearest neighbors algorithm requires us to calculate the distances between our test observation and all of our training observations, sort those distances from smallest to largest, and then take the mean of the "k nearest" observations. Whatever k we select, we are essentially saying: sort by distance, keep the k nearest observations, and take their average to determine a prediction (in regression).
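Written out step by step in base R, again on made-up data, the recipe is just: compute distances, sort, average the k nearest responses.

```r
set.seed(1)
x_train <- runif(15)
y_train <- 2 * x_train + rnorm(15, sd = 0.3)
x0 <- 0.4                     # the test observation
k  <- 5

d  <- abs(x_train - x0)       # 1. distance to every training observation
nn <- order(d)[1:k]           # 2. sort smallest-to-largest, keep the k nearest
mean(y_train[nn])             # 3. prediction: the mean of those k responses
```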
If we select a small k, our prediction will be based on few observations, so each time we perform this sorting and depend on the top k, we will get highly variable estimates. Consider selecting k = 5, computing distances, sorting, and taking the mean of 5 observations, as opposed to choosing k = 10, where the mean of 10 observations is taken. The mean of 10 observations will vary substantially less than the mean of 5 observations. Consequently, few nearest neighbors means predictions with a large variance.
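A quick simulation (illustrative only) makes the variance point concrete: the mean of 10 draws fluctuates with roughly half the variance of the mean of 5 draws.

```r
set.seed(1)
m5  <- replicate(10000, mean(rnorm(5)))    # means of 5 observations
m10 <- replicate(10000, mean(rnorm(10)))   # means of 10 observations
c(var_k5 = var(m5), var_k10 = var(m10))    # about 1/5 vs 1/10: fewer neighbors, noisier average
```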
If you'd like to play with some simple code, written in R, that showcases this idea and other statistical computing phenomena, then I'd strongly suggest you take a look at my code here.