When do we perform variable selection outside of regression?
We learned in BAS 320 that some variables in a predictive regression model can be removed to decrease potential overfitting and, for large
data sets, training time. I noticed that for all of the binary prediction models we have discussed so far in class, we always use the ~.
operator to regress y against all x predictors. I imagine that if we are given large consumer data sets with, say, dozens of possible
predictors and many records for each, variable selection at least seems like something we should consider as a data pre-processing step. For
some models this seems less important (in decision trees, if there are uninformative variables, you just wouldn't split over them), but for
others it seems very important (in K-nearest neighbors, the "distance" between a training observation and a test observation is a function
of all x variables, each with an equal weight).
My question is, when would we need to remove variables from predictive models, and how would we select the variables to remove?
Answers and follow-up questions
Answer or follow-up question 1
In most algorithms one should use a strategy to reduce overfitting. One approach to reducing overfitting is to decrease the influence of
variables (by either reducing the magnitude of their coefficients or removing them from the model).
There are two main approaches to decreasing the influence of unnecessary variables with the goal of reducing overfitting:
(1) variable selection: re-estimating the model multiple times, each time removing or adding variables, and then picking the best model
(typically called variable selection in hyperplane methods such as regression, and pruning in trees)
(2) coefficient reduction: adding a penalty term to the objective function (typically called penalization, shrinkage, regularization, or
decay depending on the community). The coefficients are shrunk towards zero.
Variable selection is considered inferior to coefficient reduction because it is a lot more costly (compute intensive) and results in a lot
more variance (a variable is either in or out, and this can change the model significantly). One should always go with coefficient reduction
when available. It shrinks the value of a coefficient towards 0, thereby making the model's predictions less sensitive to changes in the
values of that variable.
However, sometimes it is impossible to implement a coefficient reduction approach, and then one is forced to go with the first approach.
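To make the shrinkage idea concrete, here is a minimal sketch (not taken from the course material) for the simplest possible case: one predictor and no intercept. The function names, the toy data, and the exact scaling of the penalty terms are assumptions chosen for illustration; real software usually standardizes the predictors and picks the penalty weight by cross-validation.

```python
import math

def _sums(x, y):
    """Return sum(x*y) and sum(x*x), the only quantities the slopes depend on."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy, sxx

def ols_slope(x, y):
    """Unpenalized least-squares slope for the model y = b * x."""
    sxy, sxx = _sums(x, y)
    return sxy / sxx

def ridge_slope(x, y, lam):
    """Minimizer of 0.5 * sum((y - b*x)^2) + 0.5 * lam * b^2 (L2 penalty)."""
    sxy, sxx = _sums(x, y)
    return sxy / (sxx + lam)

def lasso_slope(x, y, lam):
    """Minimizer of 0.5 * sum((y - b*x)^2) + lam * |b| (L1 penalty).

    The solution is the soft-thresholded least-squares numerator.
    """
    sxy, sxx = _sums(x, y)
    shrunk = max(abs(sxy) - lam, 0.0)
    return math.copysign(shrunk, sxy) / sxx

# Toy data, made up for this illustration.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]

for lam in (0.0, 10.0, 60.0):
    print(lam, ridge_slope(x, y, lam), lasso_slope(x, y, lam))
```

As the penalty weight grows, both slopes move towards 0. In this sketch the L2 penalty never makes the slope exactly 0, while a large enough L1 penalty does, which is why the L1 penalty is also said to perform variable selection.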
Up to now we have seen the following algorithms. I have listed the overfitting reduction strategies based on variable selection or
coefficient reduction in parentheses.
-Naive Bayes (because it is so naive, i.e., variables are assumed to be independent, it cannot capture complexity well, and the danger is
underfitting and not overfitting)
-Logistic regression (stepwise variable selection and an L1-norm penalty, aka LASSO regularization)
-Neural networks (an L2-norm penalty, also called RIDGE regularization or weight decay)
-K-nearest neighbors (no variable selection or coefficient reduction; we tuned K instead to combat overfitting; one could apply variable
selection by computing the distance function multiple times using a stepwise approach or using an optimization algorithm such as a genetic
algorithm as a wrapper)
-Decision trees (variable selection using pruning with the complexity parameter, cp)
-Bagged decision trees (variable selection using pruning with the cp parameter)
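The stepwise wrapper mentioned for K-nearest neighbors in the list above could look roughly like the following sketch. It is an illustration under assumptions, not what we did in class: the helper names (loo_accuracy, forward_select), the use of leave-one-out accuracy as the score, and the toy data are all made up. Variable 0 separates the two classes, while variable 1 is noise on a much larger scale, so it dominates the distance when included.

```python
from collections import Counter

def loo_accuracy(X, y, features, k=1):
    """Leave-one-out accuracy of k-NN using only the listed feature indices."""
    correct = 0
    for i in range(len(X)):
        # Squared Euclidean distance from row i to every other row.
        dists = []
        for j in range(len(X)):
            if j == i:
                continue
            d = sum((X[i][f] - X[j][f]) ** 2 for f in features)
            dists.append((d, j))
        dists.sort()
        votes = [y[j] for _, j in dists[:k]]
        prediction = Counter(votes).most_common(1)[0][0]
        if prediction == y[i]:
            correct += 1
    return correct / len(X)

def forward_select(X, y, k=1):
    """Greedy forward selection: add the variable that most improves the score."""
    remaining = list(range(len(X[0])))
    selected, best_score = [], 0.0
    while remaining:
        scored = [(loo_accuracy(X, y, selected + [f], k), f) for f in remaining]
        # Highest score wins; ties go to the lowest feature index.
        score, f = max(scored, key=lambda t: (t[0], -t[1]))
        if score <= best_score:
            break  # no remaining variable improves the score
        selected.append(f)
        remaining.remove(f)
        best_score = score
    return selected, best_score

# Toy data, made up for this illustration.
X = [[1.0, 10.0], [1.2, 50.0], [0.8, 90.0], [1.1, 30.0],
     [3.0, 12.0], [3.2, 48.0], [2.8, 88.0], [3.1, 32.0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

selected, score = forward_select(X, y)
print(selected, score)
```

Every candidate subset requires recomputing all the distances, which is the compute cost mentioned earlier. In practice one would also standardize the variables first and score the subsets on held-out data rather than on the data used for selection.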
We will also see random forest, which uses random feature selection to combat overfitting. It is so effective that trees are built until the
node size is 1 (i.e., no pruning).
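As a rough preview of what random feature selection means, the sketch below searches for the best split of a single node, but only among a random subset of the variables. This is an assumption-laden illustration, not the actual random forest implementation: the function names, the Gini scoring for 0/1 labels, and the toy data are made up, and the subset size is called mtry here after the usual convention. A full random forest would also draw a bootstrap sample of the rows for each tree and repeat this search at every node.

```python
import random

def gini(labels):
    """Gini impurity for a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 2 * p1 * (1 - p1)

def best_split(X, y, mtry, rng):
    """Find the best (feature, threshold) among `mtry` randomly drawn features."""
    candidates = rng.sample(range(len(X[0])), mtry)
    best = None  # (weighted impurity, feature index, threshold)
    for f in candidates:
        for t in sorted({row[f] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[f] <= t]
            right = [yi for row, yi in zip(X, y) if row[f] > t]
            if not left or not right:
                continue  # a split must send rows to both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t)
    return candidates, best

# Toy data, made up for this illustration: only variable 0 separates the classes.
X = [[1, 5, 0], [2, 3, 1], [3, 6, 0], [4, 2, 1], [5, 5, 0], [6, 3, 1]]
y = [0, 0, 0, 1, 1, 1]

rng = random.Random(0)
candidates, best = best_split(X, y, mtry=2, rng=rng)
print(candidates, best)
```

When the strongest variable is not in the drawn subset, the node has to split on a weaker one. That makes the individual trees differ more from each other, which is what makes averaging them effective.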
As a final note, there is a whole section about overfitting (a situation of high variance) and underfitting (a situation of high bias) in
the book, which is out of the scope of BAS474, but within scope of BZAN542.
Michel Ballings