Friday, April 14, 2023

Cross Validation lecture for joint BYU-City Tech undergraduate data science seminar

Today, I delivered a lecture on cross-validation and the bootstrap, two resampling methods used to estimate the test error of machine learning algorithms. I based the lecture on the resampling chapter of "An Introduction to Statistical Learning: with Applications in R" (2013) by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani.

During the lecture, we first discussed the bias-variance trade-off and why the training error is an unreliable estimate of the test error: as model complexity increases, the training error keeps decreasing while the test error eventually rises. We then compared several ways of splitting the data, such as splitting it into two equal parts (the validation set approach) and leave-one-out splitting.
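As a rough illustration of the gap between training and test error, here is a minimal sketch in plain Python. It is not the example from the lecture: the sine-plus-noise data, the sample sizes, and the k-nearest-neighbor regressor (where a smaller number of neighbors means a more flexible model) are all assumptions chosen to keep the code short.

```python
import math
import random

def knn_predict(train, x, k):
    """Average the responses of the k training points closest to x."""
    nearest = sorted(train, key=lambda pt: abs(pt[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mse(train, data, k):
    """Mean squared error on `data` of a k-NN regressor built from `train`."""
    return sum((y - knn_predict(train, x, k)) ** 2 for x, y in data) / len(data)

rng = random.Random(0)

def sample(n):
    """Draw n points from y = sin(x) + noise (a made-up data-generating process)."""
    points = []
    for _ in range(n):
        x = rng.uniform(0, 6)
        points.append((x, math.sin(x) + rng.gauss(0, 0.5)))
    return points

train, test = sample(100), sample(100)

# Fewer neighbors = more flexible model.
for k in (25, 5, 1):
    print(f"k={k:2d}  train MSE={mse(train, train, k):.3f}  "
          f"test MSE={mse(train, test, k):.3f}")
```

With one neighbor, every training point is its own nearest neighbor, so the training error is exactly zero even though the test error is not. This is the sense in which the training error says little about the test error for flexible models.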

Next, we discussed K-fold cross-validation, which splits the data into K subsets (folds) of roughly equal size; each fold serves once as the validation set while the model is trained on the remaining K-1 folds. The cross-validation score is the weighted average of the per-fold mean squared errors (MSE), with each fold weighted by its share of the observations. We demonstrated an experiment with the 10-fold CV estimate of the MSE and found that K=5 or 10 is a good compromise in terms of the bias-variance trade-off.
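The K-fold procedure can be sketched in a few lines of plain Python. This is only an illustrative sketch: the simulated data, the simple least-squares line used as the model, and the helper names `fit_line`, `k_fold_indices`, and `cv_score` are assumptions, not material from the lecture.

```python
import random

def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def k_fold_indices(n, k, rng):
    """Shuffle 0..n-1 and deal the indices into k folds of near-equal size."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cv_score(xs, ys, k, seed=0):
    """Weighted average of per-fold MSEs, each weighted by n_k / n."""
    n = len(xs)
    score = 0.0
    for fold in k_fold_indices(n, k, random.Random(seed)):
        held_out = set(fold)
        train = [i for i in range(n) if i not in held_out]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        fold_mse = sum((ys[i] - (a + b * xs[i])) ** 2 for i in fold) / len(fold)
        score += (len(fold) / n) * fold_mse
    return score

# Simulated data (an assumption): y = 2 + 3x + standard normal noise.
rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(100)]
ys = [2 + 3 * x + rng.gauss(0, 1) for x in xs]

print("5-fold CV MSE:      ", cv_score(xs, ys, k=5))
print("10-fold CV MSE:     ", cv_score(xs, ys, k=10))
print("leave-one-out CV MSE:", cv_score(xs, ys, k=len(xs)))
```

Setting `k` equal to the number of observations gives leave-one-out cross-validation as a special case, which makes it easy to compare the different choices of K on the same data.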

Finally, we discussed an example of two-class classification with 5000 predictors and 50 samples. The wrong way to perform K-fold CV was to apply it only after filter 1 had already been run on the full data set, while the right way was to apply CV to both filter 1 and filter 2, so that every step is repeated inside each fold. We also stressed that the training sets used in different folds overlap and are therefore correlated, and that this correlation, together with the bias-variance trade-off, determines the best choice of K.
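The effect can be reproduced with a small simulation, sketched below in plain Python under several assumptions that are not from the lecture: 1000 pure-noise predictors instead of 5000 (to keep the run time short), a correlation-based screening step that keeps the top 20 predictors, and a nearest-centroid classifier as a stand-in for whatever model follows the screening. Because the labels are independent of the predictors, an honest error estimate should be near 50%.

```python
import random

def correlation(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def top_features(X, y, rows, m):
    """Indices of the m predictors most correlated with y, using only `rows`."""
    labels = [y[i] for i in rows]
    scores = [abs(correlation([X[i][j] for i in rows], labels))
              for j in range(len(X[0]))]
    return sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:m]

def centroid_predict(X, y, train, feats, i):
    """Nearest-centroid prediction for row i (stand-in classifier)."""
    dist = {}
    for c in (0, 1):
        members = [r for r in train if y[r] == c]
        dist[c] = sum((X[i][j] - sum(X[r][j] for r in members) / len(members)) ** 2
                      for j in feats)
    return min(dist, key=dist.get)

def cv_error(X, y, k, m, select_inside_folds, seed=0):
    """K-fold CV misclassification rate, screening inside or outside the folds."""
    n = len(y)
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    # Wrong way: screen once on ALL rows, so held-out labels leak into the model.
    leaked = None if select_inside_folds else top_features(X, y, range(n), m)
    errors = 0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in range(n) if i not in held_out]
        # Right way: repeat the screening using the training rows only.
        feats = top_features(X, y, train, m) if select_inside_folds else leaked
        errors += sum(centroid_predict(X, y, train, feats, i) != y[i] for i in fold)
    return errors / n

rng = random.Random(42)
n, p, m = 50, 1000, 20
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [0] * 25 + [1] * 25
rng.shuffle(y)  # labels carry no information about X

wrong_err = cv_error(X, y, k=5, m=m, select_inside_folds=False)
right_err = cv_error(X, y, k=5, m=m, select_inside_folds=True)
print(f"wrong way: {wrong_err:.2f}   right way: {right_err:.2f}")
```

The only difference between the two calls is whether the screening step sees the held-out observations. Screening on the full data set lets information about the validation labels leak into the model, which is why the wrong way tends to report an error far below the true rate.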

Overall, the lecture was well received, and there were many interesting questions from colleagues and students. Cross-validation is frequently discussed in data science forums, and it is worth emphasizing both the bias-variance trade-off and the effect that the data split has on cross-validation results.

Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani's book, "An Introduction to Statistical Learning: with Applications in R" (2013).



