Cross-Validation: Concept and Example in R

What is Cross-Validation?

In machine learning, cross-validation is a resampling method used for model evaluation, so that a model is not tested on the same dataset on which it was trained. Testing on the training data is a common mistake, especially when a separate testing dataset is not available, and it usually leads to misleading performance measures (the model will get an almost perfect score since it is being tested on the very data it was trained on). To avoid this kind of mistake, cross-validation is usually preferred.

The concept of cross-validation is actually simple: instead of using the whole dataset for training and then testing on that same data, we randomly divide the data into training and testing subsets.

There are several types of cross-validation methods (LOOCV – leave-one-out cross-validation, the holdout method, K-Fold cross-validation). Here, I am going to discuss the K-Fold cross-validation method.
K-Fold cross-validation basically consists of the steps below (a minimal code sketch follows the list):

  1. Randomly split the data into k subsets, also called folds.
  2. Fit the model on k-1 of the folds (the training data).
  3. Use the remaining fold as a test set to validate the model (usually, the accuracy or test error of the model is measured in this step).
  4. Repeat the procedure k times, holding out a different fold each time.
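
In code, these steps boil down to a loop over the folds. Here is a minimal sketch in R (not the implementation used later in this post), assuming a generic data frame df:

# Minimal sketch of K-Fold cross-validation on a generic data frame df
k = 5
set.seed(1)
# randomly assign each row of df to one of the k folds
fold_id = sample(rep(1:k, length.out = nrow(df)))
for (i in 1:k) {
  trainData = df[fold_id != i, ]   # k-1 folds used for training
  validData = df[fold_id == i, ]   # remaining fold used for validation
  # fit the model on trainData and measure its accuracy on validData here
}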

Below is a simple illustration of the procedure taken from Wikipedia.

(Figure: K-Fold cross-validation illustration from Wikipedia.)

How can it be done with R?

In the exercise below, I am using logistic regression to predict whether a passenger in the famous Titanic dataset survived or not. The purpose is to find an optimal threshold on the predicted probabilities, which decides whether a prediction is classified as 1 (survived) or 0 (did not survive).

Threshold Example: Consider that the model has predicted the following values for two passengers: p1 = 0.7 and p2 = 0.4. If the threshold is 0.5, then p1 > threshold and passenger 1 falls in the survived category, whereas p2 < threshold, so passenger 2 falls in the not survived category.
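
In R, applying such a threshold to predicted probabilities is a one-liner. A toy illustration using the two hypothetical probabilities above:

# toy predicted probabilities for two passengers
probs = c(p1 = 0.7, p2 = 0.4)
threshold = 0.5
# classify as 1 (survived) when the probability exceeds the threshold
as.integer(probs > threshold)  # p1 -> 1 (survived), p2 -> 0 (not survived)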

However, depending on the data, the 'default' threshold of 0.5 will not always give the maximum number of correct classifications. In this context, we can use cross-validation to determine the best threshold for each fold, based on the results of running the model on the validation set.

In my implementation, I followed the below steps:

  1. Split the data randomly into 80% (training and validation) and 20% (testing on unseen data).
  2. Run cross-validation on 80% of the data, which will be used to train and validate the model. 
  3. Get the optimal threshold after running the model on the validation dataset according to the best accuracy at each fold iteration.
  4. Store the best accuracy and the optimal threshold resulting from the fold iterations in a dataframe.
  5. Find the best threshold (the one that has the highest accuracy) and use it as a cutoff when testing the model against the test dataset.

Note: ROC analysis is usually the best method for finding an optimal 'cutoff' probability, but for the sake of simplicity, I am using accuracy in the code below.
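For reference, an ROC-based cutoff could be derived from the same kind of ROCR prediction object that is built in the function below, for example by maximizing tpr - fpr (Youden's index). This is only a sketch of that alternative, not what the code below does:

# Sketch only: ROC-based cutoff selection (assumes 'pred' is a ROCR prediction object)
roc.perf = performance(pred, measure = "tpr", x.measure = "fpr")
tpr = slot(roc.perf, "y.values")[[1]]
fpr = slot(roc.perf, "x.values")[[1]]
cutoffs = slot(roc.perf, "alpha.values")[[1]]
# pick the cutoff that maximizes tpr - fpr (Youden's J statistic)
rocCutoff = cutoffs[which.max(tpr - fpr)]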

The cross_validation method below will:

  1. Create a 'perf' dataframe that will store the results of testing the model on the validation data.
  2. Use the createFolds method (from the caret package) to create nbfolds folds.
  3. On each of the folds:
    • Train the model on the other k-1 folds.
    • Validate the model on the remaining part of the data.
    • Measure the accuracy of the model using ROCR's performance method.
    • Add the optimal threshold and its accuracy to the perf dataframe.
  4. Look in the perf dataframe for optThresh, the threshold that has the highest accuracy.
  5. Use it as a cutoff when testing the model on the test set (20% of the original data).
  6. Use the F1 score to measure the performance of the model on that test set.
library(ROCR)
library(caret)   # provides createFolds

cross_validation = function(nbfolds, split){
  perf = data.frame()
  #create the folds (returnTrain = TRUE returns the training indices of each fold)
  folds = createFolds(split$trainset$survived, nbfolds, list = TRUE, returnTrain = TRUE)

  #loop nbfolds times to find the optimal threshold
  for(i in 1:nbfolds)
  {
    #train the model on k-1 folds of the data
    model = glm(survived~., data = split$trainset[folds[[i]],], family = "binomial")

    #validate on the remaining part of the data
    probs = predict(model, type = "response", newdata = split$trainset[-folds[[i]],])

    #threshold selection based on accuracy:
    #create a ROCR prediction object from the predicted probabilities
    pred = prediction(probs, split$trainset[-folds[[i]],]$survived)

    #measure the accuracy of the predictions at every possible cutoff
    acc.perf = performance(pred, measure = "acc")

    #find the index of the most accurate threshold
    ind = which.max(slot(acc.perf, "y.values")[[1]])
    acc = slot(acc.perf, "y.values")[[1]][ind]
    optimalThreshold = slot(acc.perf, "x.values")[[1]][ind]
    row = data.frame(threshold = optimalThreshold, accuracy = acc)

    #store the best threshold of this fold with its accuracy in the perf dataframe
    perf = rbind(perf, row)
  }

  #get the threshold with the maximum accuracy among the nbfolds folds
  indexOfMaxPerformance = which.max(perf$accuracy)
  optThresh = perf$threshold[indexOfMaxPerformance]

  #predict on the unseen test set (with the model fitted in the last fold) and apply the cutoff
  probs = predict(model, type = "response", newdata = split$testset)
  predictions = data.frame(survived = split$testset$survived, pred = probs)
  confMat = table(predictions$survived, predictions$pred > optThresh)

  #F1 score, treating "not survived" (0, the first row/column of the table) as the positive class:
  #2*TP / (2*TP + FN + FP)
  F1 = (2*confMat[1,1]) / ((2*confMat[1,1]) + confMat[2,1] + confMat[1,2])
  F1
}

Then, if we run this procedure 100 times (with a different random split each time), we can measure the best performance our model reaches when using cross-validation:


# Feature selection method: keep only the columns whose absolute correlation
# with the outcome (assumed to be the first column of the training set) is above 0.1
easyFeatureSelection = function(split) {
  corrs = abs(cor(split$trainset)[1,])
  toKeep = corrs[corrs > 0.1 & !is.na(corrs)]
  split$trainset = subset(split$trainset, select = names(toKeep))
  split$testset = subset(split$testset, select = names(toKeep))
  split
}
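
Note that the splitdf helper used below is not defined in this post. A minimal version, assuming its second argument is a random seed and its third the training fraction, could look like this:

# Minimal sketch of the splitdf helper (assumed behaviour, not the original implementation)
splitdf = function(data, seed, trainFraction){
  set.seed(seed)
  trainIndex = sample(nrow(data), size = floor(trainFraction * nrow(data)))
  list(trainset = data[trainIndex, ], testset = data[-trainIndex, ])
}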

# Perform cross-validation over 100 different random splits
performances = c()
for(i in 1:100){
  #split the data into 80-20
  split = splitdf(df, i, 0.8)
  #perform feature selection at each iteration on the training data
  split = easyFeatureSelection(split)
  #store the F1 score obtained with the optimal threshold from the 10 folds of this iteration
  performances[i] = cross_validation(10, split)
}
#best performance across the 100 runs
max(performances)

 

(Screenshot: results of the 100 runs.)

We can see that we reach a maximum performance of about 0.885 when using the accuracy method for threshold selection, which is not bad at all, considering that the model is being tested on unseen data.

 
