Introduction
This post is partly an expansion of a comment I made about training/test sets in a previous post. It is the first part of a four-part series on model fitting and validation. At work, we often need to fit a model and validate it (using internal validation) in the presence of missing data. This short article ignores the missing data issue (I will look at this later) and instead focuses on fitting and validating a model.
There are two main areas that I am looking at here to try to improve current practice:
- Fitting a model
- Validating the chosen model
This short article collates some of the critiques of the way we currently do model building and validation.
Current practice
Currently, at work we use a stepwise procedure to choose the explanatory variables in the model. This stepwise procedure is sometimes a mix of clinical judgement and some form of automated selection, and sometimes it is fully automated. This is often done on a training set consisting of around \(70\%\) of the full data. The remaining \(30\%\) of the data is the test set used for model validation.1
1 For an example of a fully automated stepwise procedure on a \(70\%\) training set, with validation on the remaining \(30\%\) test set, see Collett, Friend, and Watson (2017).
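As a concrete illustration of this kind of workflow (a minimal sketch with simulated data, not our actual code), the snippet below uses scikit-learn's `train_test_split` and stands in for the stepwise step with `SequentialFeatureSelector`, which selects variables by cross-validated score rather than by p-values.

```python
# Minimal sketch of the 70/30 split-select-validate workflow (illustrative only).
# Assumes simulated binary-outcome data; SequentialFeatureSelector stands in for
# the stepwise step, selecting by cross-validated score rather than p-values.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=1)

# 70% of the data for variable selection and model fitting, 30% held out
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7,
                                                    random_state=1)

model = LogisticRegression(max_iter=1000)
selector = SequentialFeatureSelector(model, n_features_to_select=5,
                                     direction="forward").fit(X_train, y_train)
cols = selector.get_support()

# refit on the selected variables and "validate" on the 30% test set
fit = model.fit(X_train[:, cols], y_train)
auc = roc_auc_score(y_test, fit.predict_proba(X_test[:, cols])[:, 1])
print(f"test-set AUC: {auc:.3f}")
```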
There are lots of issues that result from this process: some from the use of the stepwise procedure, and some from not using all the data to fit the model (data-splitting).
Issues with the stepwise procedure
There are lots of issues with using a stepwise procedure. Some of these relate to the procedure itself, and some to the way it lets the statistician avoid thinking about the problem they are trying to solve. From Harrell et al. (2001, vol. 608, sec. 4.3),
if [the stepwise] procedure had just been proposed as a statistical method, it would most likely be rejected because it violates every principle of statistical estimation and hypothesis testing.
Some of the issues resulting from using a stepwise procedure are:2
2 For a full list, see Harrell et al. (2001).
1. It yields \(R^2\) values that are biased high.
2. The \(F\) and \(\chi^2\) test statistics do not have the claimed distribution.
3. The standard errors of regression coefficient estimates are biased low and confidence intervals for effects and predicted values are falsely narrow.
4. It yields \(P\)-values that are too small (due to severe multiple comparison problems).
5. The regression coefficients are biased high in absolute value. That is, if \(\hat \beta > 0\), \(E(\hat \beta | P < 0.05, \hat \beta > 0) > \beta\).
Copas and Long (1991) explain how (5) occurs,
The choice of the variables to be included depends on estimated regression coefficients rather than their true values, and so \(X_j\) is more likely to be included if its regression coefficient is over-estimated than if its regression coefficient is underestimated.
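A quick simulation makes this mechanism concrete (a rough sketch of my own, not taken from Copas and Long): a predictor with a small true coefficient is only retained when its estimated coefficient happens to be significant and positive, and the average of the retained estimates is noticeably larger than the true value.

```python
# Rough simulation of point (5), illustrative only: conditional on a variable
# being "selected" (P < 0.05 and a positive estimate), the average estimated
# coefficient exceeds the small true coefficient.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
beta, n, sims = 0.2, 50, 10_000
kept = []
for _ in range(sims):
    x = rng.normal(size=n)
    y = beta * x + rng.normal(size=n)
    res = stats.linregress(x, y)
    if res.pvalue < 0.05 and res.slope > 0:  # the variable "enters" the model
        kept.append(res.slope)

print(f"true beta: {beta}")
print(f"mean of retained estimates: {np.mean(kept):.2f}")
```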
To prevent regression coefficients being too large, Tibshirani’s3 lasso procedure can be used (Tibshirani 1996). This shrinks coefficient estimates towards zero by imposing a constraint that the sum of the absolute values of the coefficient estimates must be less than some \(k\) (written out below). Note that the lasso procedure shares many of stepwise’s deficiencies, including a “low probability of selecting the ‘correct’ variables” (see StackExchange for some more information).
3 of An Introduction to Statistical Learning fame.
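For reference, the constrained form of the lasso referred to above can be written as follows (notation here is mine, following Tibshirani 1996):

\[
\hat{\beta}^{\text{lasso}} = \underset{\beta}{\arg\min} \; \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \right)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le k.
\]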
Derksen and Keselman (1992) found that when the stepwise procedure was used, the final model represented noise between \(20\%\) and \(74\%\) of the time, and the final model contained fewer than half of the authentic predictors. To see this in action, Frank Harrell has some simulations as part of his ‘Regression Modelling Strategies’ course notes.
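To get a rough feel for this without the course notes, here is a small simulation of my own (illustrative only, not Harrell's code): forward selection by p-value applied to an outcome that is pure noise, recording how often noise variables end up in the final model.

```python
# Forward selection by p-value on pure noise (illustrative sketch only):
# y is independent of all 20 candidate predictors, yet the "final model"
# frequently contains one or more of them.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n, p, sims = 100, 20, 200
n_selected = []

for _ in range(sims):
    X = rng.normal(size=(n, p))
    y = rng.normal(size=n)                    # outcome is pure noise
    selected = []
    while True:
        best_p, best_j = 1.0, None
        for j in range(p):
            if j in selected:
                continue
            design = sm.add_constant(X[:, selected + [j]])
            pval = sm.OLS(y, design).fit().pvalues[-1]
            if pval < best_p:
                best_p, best_j = pval, j
        if best_p < 0.05:                     # entry criterion
            selected.append(best_j)
        else:
            break
    n_selected.append(len(selected))

n_selected = np.array(n_selected)
print(f"mean number of noise variables selected: {n_selected.mean():.2f}")
print(f"proportion of final models containing noise: {(n_selected > 0).mean():.2f}")
```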
Issues with data-splitting
The issues with data-splitting are discussed in Harrell et al. (2001, vol. 608, sec. 5.3.3). These include:
1. Data-splitting greatly reduces the sample size for both model development and model testing.
2. Repeating the process with a different split can result in different assessments of predictive accuracy (see the sketch after this list).
3. It does not validate the final model, only a model developed on a subset of the data.
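As a quick sketch of point (2), using simulated data and a plain logistic regression (assumptions mine, not from Harrell), repeating the 70/30 split with different random seeds gives a spread of test-set performance estimates from the same data and the same modelling step.

```python
# Illustrative sketch of point (2): different random 70/30 splits of the same
# data give noticeably different test-set estimates of performance.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=1)

aucs = []
for seed in range(20):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.7,
                                              random_state=seed)
    fit = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    aucs.append(roc_auc_score(y_te, fit.predict_proba(X_te)[:, 1]))

print(f"test-set AUC over 20 splits: min {min(aucs):.2f}, max {max(aucs):.2f}")
```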
Number (3) is discussed further here, specifically:
When feature selection is done, data splitting validates just one of a myriad of potential models. In effect it validates an example model. Resampling (repeated cross-validation or the bootstrap) validates the process that was used to develop the model. Resampling is honest in reporting the results because it depicts the uncertainty in feature selection, e.g., the disagreements in which variables are selected from one resample to the next.4
4 That is, for each bootstrap sample, some sort of variable selection is done (e.g. stepwise, or lasso). The proportion of bootstrap samples where each variable is chosen in the model can then be reported to give an idea of the uncertainty in variable selection.
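A rough sketch of the idea in the footnote, assuming the lasso (here an \(L_1\)-penalised logistic regression) as the selection step: repeat the selection on each bootstrap resample and report how often each candidate variable is chosen.

```python
# Bootstrap the variable-selection step (illustrative sketch, lasso assumed):
# the proportion of resamples in which each variable is selected exposes the
# uncertainty in the selection process.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           random_state=1)
rng = np.random.default_rng(1)
n, p = X.shape
n_boot = 200
chosen = np.zeros(p)

for _ in range(n_boot):
    idx = rng.integers(0, n, size=n)                 # resample with replacement
    lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
    lasso.fit(X[idx], y[idx])
    chosen += (lasso.coef_.ravel() != 0)             # which variables were kept?

for j, prop in enumerate(chosen / n_boot):
    print(f"x{j}: selected in {prop:.0%} of bootstrap resamples")
```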
A good primer on the evaluation of clinical prediction models is given by Collins et al. (2024). In this, the authors state,
Randomly splitting obviously creates two smaller datasets, and often the full dataset is not even large enough to begin with. Having a dataset that is too small to develop the model increases the likelihood of overfitting and producing an unreliable model, and having a test set that is too small will not be able to reliably and precisely estimate model performance—this is a clear waste of precious information.
Even more succinctly, Steyerberg (2018) states,
random data splitting should be abolished for validation of prediction models.
Well, that’s pretty clear!
There are not many situations where external validation is possible – the main one in my mind being when another research group has access to different data from you. For information on internal and external validation, see the ‘Biostatistics for Biomedical Research’ course notes.
Fin
Citation
@online{smith2025,
author = {Smith, Paul},
title = {Some Critiques of Data-Splitting and the Stepwise Procedure},
date = {2025-03-05},
url = {https://pws3141.github.io/blog/posts/modelbuilding/stepwise_datasplitting/},
langid = {en}
}