Do we need to train the model on train+test set again after finding optimal ensemble size?

In the at-home exercises, we have been asked:

"Determine the optimal number of trees in the bagged tree ensemble in terms of AUC and make a prediction on the test set using the optimal number of trees and compute the AUC"

After finding the optimal number of trees (from 1 to 500), i.e. the number that gives the highest AUC, do we need to retrain the model on trainBIG (train set + validation set) and then use that model to predict on the test set?

Dear student,

Never train on train+test.

You can train on train+validation (re-train after parameter tuning) or only on train.

The advantage of training on train only is that there is a closer fit between your model and the optimal parameter: the parameter value was selected for exactly the model you end up using. If you optimized/tuned a parameter for a model that is estimated on the training set, and the algorithm you used has high variance (it changes the model a lot when you re-estimate on slightly changed data), you should only train on the training set. Another advantage is that you do not need to re-train, which lowers the computational cost.

The advantage of training on train+validation is that you now have more data, and that could result in a better model. If the algorithm does not have high variance, it makes sense to do this.

In other words, tuning a parameter is good, but holding out a validation set for it means that your training set is smaller, which is bad. To counter that, you could tune the model (find the optimal parameter) and then retrain on train + validation with that optimal parameter, but that might mean that your parameter is no longer optimal (a parameter is optimal for a given model). In the book we always take that extra step and estimate the model on train + validation, so you can see how it is done, but you should really benchmark both methods and see which one is best.
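As a rough illustration of such a benchmark, here is a minimal Python sketch. It is not the book's or the course's code and it rests on several assumptions: the data are synthetic, the "trees" are one-split stumps with a randomly chosen feature and threshold (a toy stand-in for a real bagged tree ensemble), the grid of ensemble sizes is much smaller than 1 to 500, and all function names (`make_data`, `fit_stump`, `tune_and_compare`) are made up for this example. The number of trees is tuned on the validation set only, and the test set is touched once at the end to compare both options.

```python
import random

def make_data(n, rng):
    """Synthetic binary data (an assumption; not the course data set)."""
    X = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(n)]
    y = [1 if x[0] + x[1] + rng.gauss(0, 1) > 0 else 0 for x in X]
    return X, y

def auc(y_true, scores):
    """AUC = share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def fit_stump(X, y, rng):
    """One-split 'tree' on a bootstrap sample (toy stand-in for a full tree)."""
    n = len(X)
    idx = [rng.randrange(n) for _ in range(n)]    # bootstrap: sample with replacement
    f = rng.randrange(len(X[0]))                  # random feature
    thr = X[rng.choice(idx)][f]                   # threshold taken from the sample
    left = [y[i] for i in idx if X[i][f] <= thr]  # never empty: holds the threshold row
    right = [y[i] for i in idx if X[i][f] > thr]
    p_left = sum(left) / len(left)
    p_right = sum(right) / len(right) if right else p_left
    return f, thr, p_left, p_right

def fit_bagged(X, y, n_trees, rng):
    """Bagging: fit each stump on its own bootstrap sample."""
    return [fit_stump(X, y, rng) for _ in range(n_trees)]

def predict(stumps, X):
    """Average the predicted probabilities of all stumps."""
    return [sum(pl if x[f] <= thr else pr for f, thr, pl, pr in stumps) / len(stumps)
            for x in X]

def tune_and_compare(seed=0, grid=(1, 5, 10, 25, 50)):
    rng = random.Random(seed)
    X_tr, y_tr = make_data(300, rng)
    X_va, y_va = make_data(150, rng)
    X_te, y_te = make_data(150, rng)

    # Tuning: fit on train, score on validation. The test set is not used here.
    val_auc = {k: auc(y_va, predict(fit_bagged(X_tr, y_tr, k, rng), X_va)) for k in grid}
    best_k = max(val_auc, key=val_auc.get)

    # Option A: final model estimated on train only.
    model_a = fit_bagged(X_tr, y_tr, best_k, rng)
    # Option B: final model re-estimated on train + validation with the same best_k.
    model_b = fit_bagged(X_tr + X_va, y_tr + y_va, best_k, rng)

    return {
        "best_k": best_k,
        "auc_train_only": auc(y_te, predict(model_a, X_te)),
        "auc_train_plus_val": auc(y_te, predict(model_b, X_te)),
    }

if __name__ == "__main__":
    print(tune_and_compare(seed=0))
```

Running it with several seeds gives a feel for how much the two test AUCs differ relative to the randomness of the ensemble itself. A single run is not enough to declare one option better.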

Best,

Michel Ballings

Dear Dr Ballings,

Thank you for your response.

I am sorry, that was a typo, I should have written train+validation.

I totally understand that. What you said in your last paragraph is exactly my confusion. For this assignment, we train on the train set, tune on the validation set, and find the optimal number of trees. Then, using this optimal number of trees (which is optimal only for the train set), we train the model on the train+val set and make predictions on the test set. In the case of bagged trees, given that we use bootstrapping (sampling with replacement) to resample from the original data, do you think that this optimal number of trees will still be valid?

Dear student,

Unfortunately, there is really no way to know in advance if the optimal parameter value will still be optimal.

Best,

Michel Ballings
