Chapter 9 Introducing caret
9.1 1. Separating the testing and the training set
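caret is a wrapper package that gives a single, consistent interface to many separate machine learning packages, which makes fitting and comparing models much easier. The first step is to split the data into a training set and a testing set with createDataPartition.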
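library(caret)  # also attaches ggplot2, which provides diamonds
set.seed(123)   # arbitrary seed so the random split is reproducible
# Put 75% of the rows in the training set, sampling within price quantiles
inTrain = createDataPartition(y = diamonds$price, p = 0.75, list = FALSE)
training = diamonds[inTrain,]
testing = diamonds[-inTrain,]
dim(training)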
9.2 2. Fit the model
The general format for fitting any model in caret is the same: modelname = train(y-variable ~ x-variables, data = dataframe, method = "methodname"). The method you use can be any of the more than 200 that caret supports.
fit = train(y ~ x, data = dataframe, method = "methodname")
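As a concrete instance of this template, the sketch below fits a linear regression to the diamonds data. The choice of method = "lm", the predictors carat and cut, and the seed value are assumptions made purely for illustration. The data is converted to a plain data frame first so that indexing with the matrix returned by createDataPartition behaves the same regardless of the installed tibble version.

```r
library(caret)  # also attaches ggplot2, which provides diamonds

set.seed(123)  # arbitrary seed so the split and resampling are reproducible
diamonds_df = as.data.frame(diamonds)
inTrain = createDataPartition(y = diamonds_df$price, p = 0.75, list = FALSE)
training = diamonds_df[inTrain,]

# Linear regression of price on carat and cut (illustrative choice of predictors)
fit = train(price ~ carat + cut, data = training, method = "lm")
print(fit)  # shows the resampled RMSE, Rsquared and MAE
```

Because no trControl is given, train falls back on its default resampling scheme (bootstrap) to estimate performance.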
9.3 3. Do the prediction
Once the model has been fitted, predict(modelname, newdata = newdata) can be used to generate predictions.
predict(fit, newdata = dataframe)
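In practice the new data is usually the testing set held out in the first step. A minimal sketch, assuming the same illustrative linear model of price on carat and cut (the model, predictors and seed are not from the original text):

```r
library(caret)  # also attaches ggplot2, which provides diamonds

set.seed(123)  # arbitrary seed for reproducibility
diamonds_df = as.data.frame(diamonds)
inTrain = createDataPartition(y = diamonds_df$price, p = 0.75, list = FALSE)
training = diamonds_df[inTrain,]
testing = diamonds_df[-inTrain,]

fit = train(price ~ carat + cut, data = training, method = "lm")

# Predict prices for the held-out rows; the result is a numeric vector
# with one entry per row of testing
predictions = predict(fit, newdata = testing)
head(predictions)
```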
9.4 4. Check the model’s accuracy
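For classification models, accuracy can be checked using confusionMatrix(modelname), which cross-tabulates the observed and predicted classes from resampling. For regression models, use the RMSE function (or postResample) on the predicted and observed values. Note that performance is measured by root mean squared error (RMSE) for continuous y-variables, and by Accuracy for categorical or factor variables.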
confusionMatrix(fit)
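The two cases can be sketched side by side. The regression half reuses the illustrative diamonds model; the classification half uses the built-in iris data with method = "knn". Both model choices, the iris data and the seed are assumptions made for this example, not part of the original text.

```r
library(caret)  # also attaches ggplot2, which provides diamonds

set.seed(123)  # arbitrary seed for reproducibility

# Regression: price is continuous, so compare predictions with RMSE
diamonds_df = as.data.frame(diamonds)
inTrain = createDataPartition(y = diamonds_df$price, p = 0.75, list = FALSE)
training = diamonds_df[inTrain,]
testing = diamonds_df[-inTrain,]
fit = train(price ~ carat + cut, data = training, method = "lm")
predictions = predict(fit, newdata = testing)
RMSE(predictions, testing$price)          # root mean squared error
postResample(predictions, testing$price)  # RMSE, Rsquared and MAE together

# Classification: Species is a factor, so use a confusion matrix
inTrainIris = createDataPartition(y = iris$Species, p = 0.75, list = FALSE)
irisTrain = iris[inTrainIris,]
irisTest = iris[-inTrainIris,]
irisFit = train(Species ~ ., data = irisTrain, method = "knn")
cm = confusionMatrix(predict(irisFit, newdata = irisTest), irisTest$Species)
cm$overall["Accuracy"]
```

Here the metrics are computed on the held-out testing rows, which gives a check that is independent of the resampling estimates reported by train.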
That is all there is to it.
Additional Info re Caret
The train function has two additional arguments that might be of use as your use of the algorithms becomes more advanced.
trControl takes the output of trainControl, which sets the arguments for resampling. Often you want to use cross-validation or repeated cross-validation to tune the model and get a more reliable estimate of its performance, and these options help you set it up, e.g. trControl = trainControl(method = "repeatedcv", ...).
preProcess = c("center", "scale") lets you standardize the predictors by subtracting the mean and dividing by the standard deviation.
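Both arguments can be passed in the same call to train. The following is a minimal sketch; the fold and repeat counts, the predictors, method = "lm" and the seed are illustrative assumptions rather than recommended settings.

```r
library(caret)  # also attaches ggplot2, which provides diamonds

set.seed(123)  # arbitrary seed so the folds are reproducible
diamonds_df = as.data.frame(diamonds)

# 5-fold cross-validation repeated 2 times (values chosen only for illustration)
ctrl = trainControl(method = "repeatedcv", number = 5, repeats = 2)

# Center and scale the predictors inside each resample before fitting
fit = train(price ~ carat + depth + table, data = diamonds_df, method = "lm",
            trControl = ctrl,
            preProcess = c("center", "scale"))
fit$results   # RMSE, Rsquared and MAE averaged over the 10 resamples
```

Passing preProcess to train, rather than transforming the data beforehand, means the centering and scaling values are estimated from each resample's training portion only.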
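You can even impute missing data by using k-nearest neighbors. The standalone preProcess function does not return the imputed values; it returns an object that stores what is needed to perform the knn calculations. Once done, you can pass that object to predict to calculate the missing values. Note that the knnImpute method also centers and scales the data, so the result is on the standardized scale.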
preProcValues = preProcess(df_with_missing_values, method = c("knnImpute"))
imputed_values = predict(preProcValues, df_with_missing_values)
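To see the two calls work end to end, the sketch below builds a small data frame with missing values and imputes them. The data (numeric iris columns with ten Sepal.Length values set to NA) and the seed are hypothetical choices made for illustration. Depending on the caret version, knn imputation may also require the RANN package to be installed.

```r
library(caret)

# Hypothetical data: the numeric iris columns with a few values knocked out
df_with_missing_values = iris[, 1:4]
set.seed(123)  # arbitrary seed so the same rows are blanked each time
df_with_missing_values[sample(nrow(iris), 10), "Sepal.Length"] = NA

# preProcess only learns the transformation; predict applies it
preProcValues = preProcess(df_with_missing_values, method = c("knnImpute"))
imputed_values = predict(preProcValues, df_with_missing_values)

sum(is.na(df_with_missing_values))  # 10 missing values before
sum(is.na(imputed_values))          # 0 missing values after imputation
round(colMeans(imputed_values), 2)  # columns are centered and scaled as well
```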