Summarizing Cross-Validation Performance Values

The text recommends the Wilcoxon signed-rank test to determine whether two models are statistically different (page 164). This is because
the Wilcoxon signed-rank test is nonparametric and thus OK to use on non-normally distributed data. Since we have 10 performance values, we
technically have enough data for bootstrapping.

My question is: is there any intuitive benefit to bootstrapping our performance values? I know this would allow us to check the
distribution of the data and, assuming normality, use parametric tests and MAD for measuring dispersion. Is there anything else we could
gain from this? One of my previous professors was very big on bootstrapping, so I often look for scenarios where it may benefit me.

I know this is a somewhat vague question, but I saw that bootstrapping is something that may be done here and may tell us something about
the distribution of the values, as well as opening up some different measurement statistics. I'm just unsure whether doing this would actually
have any practical significance. If it turns out we have normally distributed data, I suppose we could use a paired t-test instead of the
Wilcoxon signed-rank test, which should be a more powerful test...
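
For reference, here is a minimal sketch of the Wilcoxon signed-rank test applied to 10 paired fold scores. It is not the textbook's code: the function name signed_rank_exact and the AUC values are hypothetical, made up only for illustration. With 10 folds there are just 2^10 = 1024 possible sign patterns, so the sketch computes an exact two-sided p-value by enumeration instead of relying on a normal approximation.

```python
from itertools import product


def signed_rank_exact(a, b):
    """Exact two-sided Wilcoxon signed-rank test for paired scores.

    Returns (w_plus, p_value). Zero differences are dropped and tied
    absolute differences receive average ranks.
    """
    # Round to suppress floating-point noise so equal differences tie exactly.
    diffs = [round(x - y, 10) for x, y in zip(a, b)]
    diffs = [d for d in diffs if d != 0]
    n = len(diffs)

    # Rank the absolute differences, averaging ranks over ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1

    # Test statistic: sum of the ranks belonging to positive differences.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)

    # Under the null hypothesis each rank is positive or negative with
    # probability 1/2, so enumerate every sign pattern.
    center = sum(ranks) / 2
    observed = abs(w_plus - center)
    extreme = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - center) >= observed - 1e-12:
            extreme += 1
    return w_plus, extreme / 2 ** n


# Hypothetical AUCs of two models on the same 10 cross-validation folds.
model_a = [0.81, 0.79, 0.84, 0.80, 0.83, 0.78, 0.82, 0.85, 0.80, 0.81]
model_b = [0.78, 0.77, 0.80, 0.79, 0.78, 0.76, 0.79, 0.79, 0.77, 0.79]

w_plus, p_value = signed_rank_exact(model_a, model_b)
print(w_plus, p_value)
```

The test uses only the signs and ranks of the paired differences, which is why it does not need the fold scores to be normally distributed.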

Answers and follow-up questions

Answer or follow-up question 1

Dear student,

Bootstrapping means creating a new sample, drawn with replacement, of the same size as the original dataset.
If we have 10 AUCs, that would mean sampling 10 times with replacement to obtain a new dataset of size 10.
For obvious reasons this is completely incorrect.
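
To make the resampling step concrete, here is a minimal sketch of a single bootstrap resample of 10 AUCs. The AUC values and the helper name bootstrap_resample are assumptions made for illustration only. The sketch also computes how many distinct original values one resample is expected to contain, n * (1 - (1 - 1/n)^n), which is about 6.5 out of 10.

```python
import random


def bootstrap_resample(values, rng):
    """Draw len(values) items from values, with replacement."""
    return [rng.choice(values) for _ in values]


# Hypothetical AUCs from 10 cross-validation folds (all distinct).
aucs = [0.78, 0.79, 0.80, 0.81, 0.82, 0.83, 0.84, 0.85, 0.86, 0.87]

rng = random.Random(0)  # fixed seed so the draw is reproducible
resample = bootstrap_resample(aucs, rng)

# The resample has the original size, but some values repeat and others
# are left out entirely.
n = len(aucs)
distinct_in_resample = len(set(resample))
expected_distinct = n * (1 - (1 - 1 / n) ** n)

print(sorted(resample))
print(distinct_in_resample, round(expected_distinct, 2))
```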

So I suppose you mean sampling with replacement, but more than 10 times, say 100 times. The risk is that your dataset is still biased when
doing this. Only when you sample with replacement to a dataset of very large size (say thousands) can your dataset be an accurate
reflection of your original sample, which now has a better chance of being normally distributed.
For obvious reasons, this is not a good strategy.
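
The larger resample described above can be sketched as follows. The AUC values and the resample size of 1000 are assumptions chosen for illustration. One thing the sketch makes visible is that however many draws are taken, the resampled dataset can only contain the 10 values already observed.

```python
import random
from collections import Counter

# Hypothetical AUCs from 10 cross-validation folds (all distinct).
aucs = [0.78, 0.79, 0.80, 0.81, 0.82, 0.83, 0.84, 0.85, 0.86, 0.87]

rng = random.Random(0)  # fixed seed so the draw is reproducible

# Sample with replacement to a much larger size than the original 10.
big_resample = rng.choices(aucs, k=1000)

# Count how often each original value was drawn.
counts = Counter(big_resample)
for value in sorted(counts):
    print(value, counts[value])

# No new values appear: every drawn value is one of the original 10.
distinct_values = set(big_resample)
print(len(big_resample), len(distinct_values))
```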

I'm a very big fan of introducing diversity in algorithms to fit models, and bootstrapping is one strategy to accomplish that, but random
sampling should not be introduced in model performance evaluation.

In sum, bootstrapping in model creation can be a good thing, and bootstrapping in model evaluation is a bad thing. You might want to check
with the professor you mention in your question.

Hope this helps,

Michel Ballings
