Chapter 9 Introducing caret

9.1 1. Separating the testing and the training set

caret is a wrapper package around many separate machine learning packages. It gives them a single, consistent interface, which makes life much easier.

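library(caret)   # loading caret also attaches ggplot2, which provides diamonds
set.seed(123)    # arbitrary seed so the random split is reproducible
inTrain = createDataPartition(y = diamonds$price, p = 0.75, list = FALSE)
training = diamonds[inTrain,]    # 75% of the rows
testing = diamonds[-inTrain,]    # the remaining 25%
dim(training)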

9.2 2. Fit the model

The general format for fitting any model in caret is the same: modelname = train(y-variable ~ x-variables, data = dataframe, method = "methodname"). The method can be any of the more than 200 that caret supports.

fit = train(y ~ x,  data = dataframe,  method = "methodname")
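
As a concrete instance of the template above, the sketch below fits a linear regression of price on carat and depth. The choice of method = "lm", the two predictors, the seed, and the 2,000-row subsample are assumptions made for illustration and speed; none of them come from the original workflow.

```r
library(caret)  # also attaches ggplot2, which provides the diamonds data

set.seed(42)  # arbitrary seed so the sample and the split are reproducible

# Assumption: a 2,000-row sample, stored as a plain data frame, keeps this quick.
small = as.data.frame(diamonds)[sample(nrow(diamonds), 2000), ]

inTrain = createDataPartition(y = small$price, p = 0.75, list = FALSE)
training = small[inTrain, ]
testing = small[-inTrain, ]

# Same template as above: outcome ~ predictors, a data frame, and a method name.
fit = train(price ~ carat + depth, data = training, method = "lm")

print(fit)               # resampled performance of the model
summary(fit$finalModel)  # the underlying lm fit
```

Swapping "lm" for another supported method name changes the algorithm without changing the rest of the call.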

9.3 3. Do the prediction

Once the model has been fitted, predict(modelname, newdata = newdata) can be used to make predictions on new data.

predict(fit, newdata = dataframe)

9.4 4. Check the model’s accuracy

Model accuracy can be checked using confusionMatrix(modelname) for classification models, or the RMSE function for regression models. Note that performance is measured by root mean squared error (RMSE) for continuous y-variables, and by Accuracy for categorical or factor variables.

confusionMatrix(fit)
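
For a continuous outcome such as diamonds$price, a confusion matrix does not apply, so the check is done by comparing predictions with observed values. The sketch below does this on the held-out rows with caret's postResample and RMSE functions. The model formula, the seed, and the 2,000-row subsample are illustrative assumptions.

```r
library(caret)  # also attaches ggplot2, which provides the diamonds data

set.seed(42)  # arbitrary seed for reproducibility
small = as.data.frame(diamonds)[sample(nrow(diamonds), 2000), ]
inTrain = createDataPartition(y = small$price, p = 0.75, list = FALSE)
training = small[inTrain, ]
testing = small[-inTrain, ]

fit = train(price ~ carat + depth, data = training, method = "lm")

# Predict on the held-out rows, then compare predictions with observed prices.
pred = predict(fit, newdata = testing)
metrics = postResample(pred = pred, obs = testing$price)
metrics  # named vector that includes RMSE and Rsquared

# RMSE() returns the root mean squared error on its own.
RMSE(pred, testing$price)
```

For a factor outcome, confusionMatrix(predicted, observed) applied to test-set predictions plays the same role.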

That is all there is to it.



Additional Info re Caret

The train function has two additional arguments that might be of use as your use of the algorithms becomes more advanced.

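trControl sets the arguments for resampling, and takes an object built with the trainControl function. Often you want to use cross-validation or repeated cross-validation to tune your model and to estimate its performance more reliably, and these options help you set it up, e.g. trControl = trainControl(method = "repeatedcv", ...).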
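
A resampling setup of this kind can be sketched as follows. The fold count, the number of repeats, the model formula, and the 1,000-row subsample are illustrative assumptions, chosen only to keep the example quick.

```r
library(caret)  # also attaches ggplot2, which provides the diamonds data

set.seed(42)  # arbitrary seed so the folds are reproducible
small = as.data.frame(diamonds)[sample(nrow(diamonds), 1000), ]

# 5-fold cross-validation repeated 2 times; both numbers are illustrative choices.
ctrl = trainControl(method = "repeatedcv", number = 5, repeats = 2)

# The control object is handed to train() through its trControl argument.
fit = train(price ~ carat + depth, data = small, method = "lm", trControl = ctrl)

fit$results         # performance averaged over the resamples
nrow(fit$resample)  # one row per held-out fold: 5 folds x 2 repeats
```

Without a trControl argument, train falls back on its default resampling scheme, so passing one mainly matters when you want control over how performance is estimated.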

preProcess = c("center", "scale") lets you standardize the predictors: each one is centered to mean zero and scaled to unit standard deviation.
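
To see what centering and scaling do, the sketch below applies them outside of train, using caret's preProcess function directly. The three numeric columns are an illustrative choice.

```r
library(caret)  # also attaches ggplot2, which provides the diamonds data

# Numeric columns only; the column choice is an illustrative assumption.
num = as.data.frame(diamonds)[, c("carat", "depth", "table")]

# Estimate the means and standard deviations, then apply them with predict().
pp = preProcess(num, method = c("center", "scale"))
scaled = predict(pp, num)

round(colMeans(scaled), 6)      # approximately 0 for every column
round(apply(scaled, 2, sd), 6)  # approximately 1 for every column
```

Inside a model fit, the same vector is passed as train(..., preProcess = c("center", "scale")), and caret applies the transformation before fitting and again when predicting.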

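You can even impute missing data by using k-nearest-neighbors. The preProcess function does not return the imputed values itself; it only learns from the data what is needed to perform the knn calculations. Once that is done, you call predict on the resulting object to calculate the missing values. Note that the "knnImpute" method also centers and scales the data, so the values returned are on the standardized scale.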

preProcValues = preProcess(df_with_missing_values, method = c("knnImpute"))
imputed_values = predict(preProcValues, df_with_missing_values)