Chapter 11 kNN: k-Nearest Neighbors
kNN is a simple but very useful algorithm: to classify a new observation, it finds the k training observations nearest to it and assigns the class that the majority of those neighbors share.
Because kNN is distance-based, the variables must be normalized or scaled so that no single feature dominates the distance computation. It can predict factor (classification) outcomes as well as continuous (regression) outcomes, and in the caret implementation the features, i.e. the independent variables, may be a mix of discrete and continuous.
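To make the mechanics concrete, here is a minimal from-scratch sketch of the classification step. The helper name knn_predict and the choice of plain Euclidean distance are illustrative assumptions; the chapter itself relies on caret's implementation below.
# Illustrative sketch only: classify one new point by majority vote of its k nearest neighbors
knn_predict = function(train_x, train_y, new_x, k = 5) {
  # Euclidean distance from the new point to every training row
  d = sqrt(rowSums(sweep(as.matrix(train_x), 2, unlist(new_x))^2))
  # classes of the k closest training points
  nn = train_y[order(d)[1:k]]
  # the most frequent neighbor class wins
  names(which.max(table(nn)))
}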
# load libraries and look at the data
library(ISLR)
library(ggplot2)
library(caret)
library(Metrics)
head(Smarket)
## Year Lag1 Lag2 Lag3 Lag4 Lag5 Volume Today Direction
## 1 2001 0.381 -0.192 -2.624 -1.055 5.010 1.1913 0.959 Up
## 2 2001 0.959 0.381 -0.192 -2.624 -1.055 1.2965 1.032 Up
## 3 2001 1.032 0.959 0.381 -0.192 -2.624 1.4112 -0.623 Down
## 4 2001 -0.623 1.032 0.959 0.381 -0.192 1.2760 0.614 Up
## 5 2001 0.614 -0.623 1.032 0.959 0.381 1.2057 0.213 Up
## 6 2001 0.213 0.614 -0.623 1.032 0.959 1.3491 1.392 Up
dim(Smarket)
## [1] 1250 9
s = Smarket
table(s$Direction)
##
## Down Up
## 602 648
prop.table(table(s$Direction))
##
## Down Up
## 0.4816 0.5184
The two classes are nearly balanced (51.84% Up), so a useful classifier has to beat roughly 52% accuracy, the majority-class baseline. Now we split the data and fit the model.
inTrain = createDataPartition(s$Direction, p = 0.75, list = FALSE) # stratified 75/25 split
training = s[inTrain,]
testing = s[-inTrain,]
# center and scale the predictors so no feature dominates the distance metric
fit = train(Direction ~ ., data = training, preProcess = c("center", "scale"), method = "knn")
# tuneLength = 6 evaluates six candidate values of k instead of the default three
fit2 = train(Direction ~ ., data = training, preProcess = c("center", "scale"), method = "knn", tuneLength = 6)
fit
## k-Nearest Neighbors
##
## 938 samples
## 8 predictor
## 2 classes: 'Down', 'Up'
##
## Pre-processing: centered (8), scaled (8)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 938, 938, 938, 938, 938, 938, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.8385501 0.6754274
## 7 0.8457586 0.6896489
## 9 0.8475178 0.6931941
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.
fit2
## k-Nearest Neighbors
##
## 938 samples
## 8 predictor
## 2 classes: 'Down', 'Up'
##
## Pre-processing: centered (8), scaled (8)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 938, 938, 938, 938, 938, 938, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.8387262 0.6765276
## 7 0.8511159 0.7012801
## 9 0.8558962 0.7106913
## 11 0.8626060 0.7241337
## 13 0.8695310 0.7380914
## 15 0.8679743 0.7347562
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 13.
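The chosen k and the full tuning grid are also stored on the fitted object, which is handier than reading them off the printout:
fit2$bestTune # one-row data frame holding the winning k (13 here)
fit2$results # resampled Accuracy and Kappa for every k tried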
Plotting the fits shows the resampled accuracy as a function of k; fit2 traces a wider grid because of tuneLength = 6.
plot(fit)
plot(fit2)
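train resamples with 25 bootstrap replicates by default. If cross-validation is preferred, it can be requested through trainControl; a sketch (the seed value and the name fit_cv are arbitrary choices, not part of the original analysis):
set.seed(42) # fix the resampling indices for reproducibility
ctrl = trainControl(method = "cv", number = 10) # 10-fold cross-validation
fit_cv = train(Direction ~ ., data = training, preProcess = c("center", "scale"), method = "knn", tuneLength = 6, trControl = ctrl)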
Let us check the accuracy of the model on the held-out test set.
prediction = predict(fit, newdata = testing)
confusionMatrix(prediction, testing$Direction) # caret expects the predicted classes first, then the actual values
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 117 11
## Up 33 151
##
## Accuracy : 0.859
## 95% CI : (0.8153, 0.8956)
## No Information Rate : 0.5192
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.716
## Mcnemar's Test P-Value : 0.001546
##
## Sensitivity : 0.7800
## Specificity : 0.9321
## Pos Pred Value : 0.9141
## Neg Pred Value : 0.8207
## Prevalence : 0.4808
## Detection Rate : 0.3750
## Detection Prevalence : 0.4103
## Balanced Accuracy : 0.8560
##
## 'Positive' Class : Down
##
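The headline accuracy is easy to verify directly from the predictions:
mean(prediction == testing$Direction) # fraction of correct test-set predictions; matches the 0.859 above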
Now let us try kNN regression on a continuous outcome, mpg from the mtcars data.
data(mtcars)
inTrain = createDataPartition(mtcars$mpg, p = 0.75, list = FALSE)
training = mtcars[inTrain,]
testing = mtcars[-inTrain,]
# same workflow, but a numeric outcome makes caret run kNN regression
fit = train(mpg ~ ., data = training, preProcess = c("center", "scale"), method = "knn")
fit
## k-Nearest Neighbors
##
## 25 samples
## 10 predictors
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 25, 25, 25, 25, 25, 25, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 3.417859 0.7573906 2.785716
## 7 3.434295 0.7690902 2.803742
## 9 3.608257 0.7618139 2.927551
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 5.
prediction = predict(fit, newdata = testing)
rmse(testing$mpg, prediction) # Metrics::rmse takes the actual values first, then the predictions
## [1] 4.133335
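Metrics::rmse is just the square root of the mean squared error, so the same number can be computed by hand:
sqrt(mean((testing$mpg - prediction)^2)) # identical to rmse(testing$mpg, prediction)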
testing$predictedmpg = prediction # attach the predictions for a side-by-side comparison
testing
## mpg cyl disp hp drat wt qsec vs am gear carb
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## predictedmpg
## Datsun 710 29.08
## Valiant 17.02
## Merc 240D 23.16
## Merc 280 20.88
## Cadillac Fleetwood 15.18
## Toyota Corona 27.56
## Camaro Z28 17.18
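For a quick per-car look at the errors, the difference between predicted and actual mpg can be tabulated (the column name error is my addition, not part of the original output):
testing$error = round(testing$predictedmpg - testing$mpg, 2) # signed prediction error
testing[, c("mpg", "predictedmpg", "error")]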