Kathrine Thompson, PhD, Department of Statistics, University of Kentucky

"Incorrect model selection using R² and AIC in big data analyses"

Although recent attention has focused on improving predictive models, less consideration has been given to the variability introduced into models through incorrect variable selection. Here, the difficulty of choosing a scientifically correct model is explored theoretically, computationally, and practically, and the performance of model selection by maximum R² or minimum AIC is compared with that of more recent methods. The results show that the model with the largest R² or smallest AIC is often not the scientifically correct model, suggesting that these model selection techniques may not be appropriate when data sets contain a large number of explanatory variables. The work begins by deriving the probability of choosing the scientifically correct model with R² or AIC as a function of the regression model parameters. Next, a data analysis example shows that these model selection criteria are outperformed by methods that produce multiple candidate models for researchers' consideration. These results are demonstrated in simulation studies and through the analysis of a National Health and Nutrition Examination Survey data set.
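
To make the central claim concrete, here is a minimal simulation sketch. It is not the speaker's derivation or simulation design. It assumes a simplified setting where exactly one explanatory variable truly drives the response, all other variables are pure noise, and the candidate models are the one-variable regressions. Because every candidate has the same number of parameters, choosing the minimum AIC is equivalent to choosing the maximum R² in this setting. The function name `prob_correct`, the sample size, and the effect size are all hypothetical choices for illustration.

```python
import numpy as np

def prob_correct(n_obs, n_noise, beta, n_reps=200, seed=0):
    """Estimate how often max R^2 picks the true one-variable model.

    Assumed setup (illustrative only): y = beta * x0 + noise, where x0 is
    the single true predictor and n_noise further columns are unrelated to y.
    """
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_reps):
        X = rng.standard_normal((n_obs, n_noise + 1))
        y = beta * X[:, 0] + rng.standard_normal(n_obs)
        # For a simple linear regression with intercept, R^2 equals the
        # squared sample correlation between the predictor and the response.
        Xc = X - X.mean(axis=0)
        yc = y - y.mean()
        corr = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
        r2 = corr ** 2
        # Column 0 is the true predictor; count a hit when it has the largest R^2.
        hits += int(np.argmax(r2) == 0)
    return hits / n_reps

if __name__ == "__main__":
    for n_noise in (5, 50, 500):
        p = prob_correct(n_obs=50, n_noise=n_noise, beta=0.3)
        print(f"{n_noise:4d} noise variables: P(correct model) ~ {p:.2f}")
```

Under these assumptions, the estimated probability of selecting the true variable should fall as the number of noise variables grows, since some noise variable is increasingly likely to correlate with the response by chance.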
