library(data.table)
options(datatable.print.nrows = 20) # max # rows before truncating (default = 100)
library(survival)
library(rms) # for 'validate'
library(mice)
# load pbc dataset
data(pbc)
pbc <- as.data.table(pbc)
pbc[, status := fifelse(status == 2, 1, 0)]
## used in 'step()' below
# creating upper and lower model for stepwise selection
# create upper scope (everything except 'id' and outcome variables)
upper_names <- paste(setdiff(names(pbc), c("time", "status", "id")),
                     collapse = " + ")
scope <- list(
  upper = reformulate(upper_names),
  lower = ~1
)
# complete case
pbc_complete <- na.omit(pbc)
This is Part Three of a four-part series on model fitting and validation.
- Part One considers some theoretical issues with data-splitting and using stepwise selection.
- Part Two looks at simulating the effect of data-splitting and stepwise selection in model building and validation on the lung dataset, including:
  - the variability of model terms chosen and the coefficient estimates from data-splitting and using stepwise selection;
  - the variability in the C-statistic obtained from the test set.
Introduction
In this part, I look at using the bootstrap to validate a model. Much of this work is inspired by the excellent book by Harrell et al. (2001). Using the bootstrap to validate a model means that data-splitting is not required, and instead both model building and validation can be performed on the full dataset. As discussed by Harrell in a post,
Data are too precious to not be used in model development/parameter estimation. Resampling methods allow the data to be used for both development and validation, and they do a good job in estimating the likely future performance of a model.
In Harrell et al. (2001, vol. 608, chap. 5.2), a hierarchy of validation methods is given, from worst to best.1
- Attempting several validations (internal or external) and reporting only the one that “worked”
- Reporting apparent performance on the training dataset (no validation)
- Reporting predictive accuracy on an undersized independent test sample
- Internal validation using data-splitting where at least one of the training and test samples is not huge and the investigator is not aware of the arbitrariness of variable selection done on a single sample
- Strong internal validation using 100 repeats of 10-fold cross-validation or several hundred bootstrap resamples, repeating all analysis steps involving \(Y\) afresh at each re-sample and the arbitrariness of selected “important variables” is reported (if variable selection is used)
- External validation on a large test sample, done by the original research team
- Re-analysis by an independent research team using strong internal validation of the original dataset
- External validation using new test data, done by an independent research team
- External validation using new test data generated using different instruments/technology, done by an independent research team
In this post, I am looking at option (5). Currently at work we often rely on (3) or (4).
1 Specifically, “one suggested hierarchy”, so I’m unsure whether this is Harrell’s suggested one or not…
The method
I will only consider complete-case data for now, and will look at two different scenarios.
- Model fitting where we already “know” the factors in the model.
- Model building and fitting, where we use stepwise to select the factors in the model.
I am using stepwise here for simplicity. The lasso would be better (although it still shares many of stepwise’s flaws). Using expert judgement to choose variables is much better, and means it is much harder to hide behind the computer.
I’m waiting with bated breath for guidance from the STRengthening Analytical Thinking for Observational Studies (STRATOS) initiative on variable selection techniques for multivariate analysis, but I’m not sure it’s coming anytime soon (Sauerbrei et al. 2020).
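As an aside on the lasso mentioned above, a lasso-based selection could look roughly like the sketch below using {glmnet} (not used in this post; it assumes the setup chunk at the top has been run and a recent glmnet, which accepts a Surv() response with family = "cox"):
library(glmnet)
# build a numeric predictor matrix from the same upper scope used for stepwise
x <- model.matrix(reformulate(upper_names), data = pbc_complete)[, -1]
y <- with(pbc_complete, Surv(time, status))
set.seed(10)
cv_fit <- cv.glmnet(x, y, family = "cox") # penalty chosen by cross-validation
coef(cv_fit, s = "lambda.1se") # non-zero rows are the selected variables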
The method for model validation in this article is based on Harrell et al. (2001, vol. 608, chap. 5.3.5), which describes the process of calculating an optimism-corrected accuracy metric using bootstrap resampling as:
From the original \(X\) and \(Y\) in the sample of size \(n\), draw a sample with replacement also of size \(n\). Derive a model in the bootstrap sample and apply it without change to the original sample. The accuracy index from the bootstrap sample minus the index computed on the original sample is an estimate of optimism. This process is repeated for \(100\) or so bootstrap replications to obtain an average optimism, which is subtracted from the final model fit’s apparent accuracy to obtain the overfitting-corrected estimate.
He goes on…
Note that bootstrapping validates the process that was used to fit the original model (as does cross-validation). It provides an estimate of the expected value of the optimism, which when subtracted from the original index, provides an estimate of the expected bias-corrected index. If stepwise variable selection is part of the bootstrap process (as it must be if the final model is developed that way), and not all resamples (samples with replacement or training samples in cross-validation) resulted in the same model (which is almost always the case), this internal validation process actually provides an unbiased estimate of the future performance of the process used to identify markers and scoring systems; it does not validate a single final model. But resampling does tend to provide good estimates of the future performance of the final model that was selected using the same procedure repeated in the resamples.
A good introduction to model building and internal validation is given in Collins et al. (2024). See Box 2 for a simple explanation of the bootstrapping internal validation method.
I have clarified this procedure where appropriate in Section 2.
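In skeleton form, the procedure looks like the sketch below (a sketch only, not the code used later in this post; fit_model() and c_index() are hypothetical stand-ins for whatever fitting and accuracy functions are used):
# optimism-corrected accuracy via the bootstrap (sketch only)
optimism_boot <- function(data, B = 100) {
  fit_full <- fit_model(data)            # model built on the full data
  c_apparent <- c_index(fit_full, data)  # apparent (optimistic) accuracy
  optimism <- vapply(seq_len(B), function(b) {
    boot <- data[sample(nrow(data), replace = TRUE), ]
    fit_b <- fit_model(boot)             # repeat *all* modelling steps here
    c_index(fit_b, boot) - c_index(fit_b, data)
  }, numeric(1))
  c_apparent - mean(optimism)            # optimism-corrected accuracy
}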
The data
The examples will be performed on the complete cases of the pbc data from the {survival} package (Therneau 2024).2
2 In my next post I’ll consider the full dataset and use multiple imputation to handle the missing data.
This is a dataset of patients with primary biliary cirrhosis, and has 276 observations (compared to 418 observations in the full pbc dataset). Throughout this post I will use \(B = 100\) bootstrap iterations.
# number of bootstrap iterations
B <- 100
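As a quick check (not in the original post), the sizes of the full and complete-case datasets can be confirmed directly, assuming the setup chunk above has been run:
# 418 rows in the full dataset, 276 after complete-case filtering
nrow(pbc)
nrow(pbc_complete)
# missing values per column in the full dataset
colSums(is.na(pbc))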
Analysis
| Notation | Definition |
|---|---|
| \(X \in \mathbb{R}^{n \times p}\) | dataset containing missingness (pbc) |
| \(\tilde{X} \in \mathbb{R}^{\tilde{n} \times p}\) | complete-case dataset (pbc_complete) |
| \(X^{(i)}\) (or \(\tilde{X}^{(i)}\)) | \(i\)-th bootstrap dataset (or \(i\)-th bootstrap of the complete-case dataset), \(i = 1,\dots,B\) |
Choosing factors manually
In this section, I consider building a model and validating it, when the factors in the model are chosen a priori.
Model building
\[ \begin{align} \text{Data:}\quad&\tilde{X}\\ &\\ &\downarrow \quad \color{gray}{\text{Factors chosen a-priori}}\\ &\\ \text{Model:}\quad&\mathcal{M} \end{align} \]
The example
I will build a model using the following factors: age, ascites, edema, albumin.
# fit a Cox PH model to pbc
cox1 <- coxph(Surv(time, status) ~ age + ascites + edema + albumin,
              data = pbc_complete)
summary(cox1)
Call:
coxph(formula = Surv(time, status) ~ age + ascites + edema +
albumin, data = pbc_complete)
n= 276, number of events= 111
coef exp(coef) se(coef) z Pr(>|z|)
age 0.029756 1.030203 0.009434 3.154 0.001610 **
ascites 0.795809 2.216234 0.337631 2.357 0.018421 *
edema 1.603254 4.969174 0.323922 4.950 7.44e-07 ***
albumin -1.041877 0.352792 0.274523 -3.795 0.000148 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
age 1.0302 0.9707 1.011 1.0494
ascites 2.2162 0.4512 1.143 4.2954
edema 4.9692 0.2012 2.634 9.3758
albumin 0.3528 2.8345 0.206 0.6042
Concordance= 0.757 (se = 0.027 )
Likelihood ratio test= 95.51 on 4 df, p=<2e-16
Wald test = 116.5 on 4 df, p=<2e-16
Score (logrank) test = 172.7 on 4 df, p=<2e-16
Model validation
First, \(B\) bootstrap datasets are created, and models fitted to each of these:
\[ \begin{align} \text{Data:}\quad\quad\quad&\tilde{X}&&&&\\ \\ &\downarrow\\ \\ \text{Bootstrapped data:}\quad\quad&\tilde{X}^{(1)} \quad &&\tilde{X}^{(2)} \quad &&\ldots \quad &&\tilde{X}^{(B)}\\ \\ &\downarrow && \downarrow &&\ldots &&\downarrow \quad\color{gray}{\text{Same factors as original model building (chosen a-priori)}}\\ \\ \text{Models:}\quad\quad &\mathcal{M}^{(1)} \quad &&\mathcal{M}^{(2)} \quad &&\ldots \quad &&\mathcal{M}^{(B)}\\ \end{align} \]
Then, calculating the optimism-corrected C-statistic involves:
- Calculate the C-statistic obtained from the original model \(\mathcal{M}\) (from Section 2.1.1), fitted to the complete-case dataset \(\tilde{X}\), denoted \(c\).
- For \(b = 1, 2, \ldots, B\):
- Calculate the C-statistic from the \(b^\text{th}\) bootstrap model \(\mathcal{M}^{(b)}\), fitted to the \(b^\text{th}\) bootstrap dataset \(\tilde{X}^{(b)}\), denoted \(c^{(b)}\).
- Calculate the C-statistic from the \(b^\text{th}\) bootstrap model \(\mathcal{M}^{(b)}\), fitted to the original dataset, \(\tilde{X}\), denoted \(c^{(b)}_X\).
- Calculate the optimism for the \(b^\text{th}\) bootstrap model \(o^{(b)} = c^{(b)} - c^{(b)}_X\).
- Calculate the mean optimism, \(\bar{o} = \frac{1}{B} \sum_{b=1}^{B} o^{(b)}\).
- Calculate the optimism-corrected C-statistic, \[ c_\text{opt} = c - \bar{o}. \]
The example
First, I obtain the (optimistic) C-statistic.
c1 <- concordance(cox1)$concordance
Then, we follow Steps (2) - (4) above to obtain the optimism-corrected C-statistic.
# number of bootstrap samples
B <- 100

set.seed(10)
c_boots1 <- rbindlist(
  lapply(seq_len(B), function(b) {
    # 2a. create bootstrap sample
    idx <- sample(nrow(pbc_complete), replace = TRUE)
    pbc_boot <- pbc_complete[idx]

    # 2b. fit model
    cox_boot <- coxph(Surv(time, status) ~ age + ascites + edema + albumin,
                      data = pbc_boot)

    # 2c. calculate concordance on bootstrap sample
    c_stat_boot <- concordance(cox_boot)

    # 2d. predict on original data
    lp_X <- predict(cox_boot, newdata = pbc_complete, type = "lp")
    # calculate concordance on original data
    c_stat_X <- concordance(Surv(time, status) ~ lp_X,
                            data = pbc_complete, reverse = TRUE)

    c_boot <- c_stat_boot$concordance
    c_x <- c_stat_X$concordance

    # save c stats and calculate optimism
    data.table(c_boot = c_boot, c_x = c_x, optimism = c_boot - c_x)
  })
)
print(c_boots1, topn = 5)
c_boot c_x optimism
<num> <num> <num>
1: 0.7414303 0.7518621 -0.010431756
2: 0.7249516 0.7552477 -0.030296102
3: 0.7277843 0.7552477 -0.027463354
4: 0.7299828 0.7534767 -0.023493912
5: 0.7489701 0.7543622 -0.005392114
---
96: 0.7580732 0.7539455 0.004127666
97: 0.7296153 0.7548831 -0.025267743
98: 0.8220811 0.7535809 0.068500184
99: 0.7695147 0.7469660 0.022548752
100: 0.7220383 0.7516537 -0.029615478
Now, we can compute the mean optimism, \(\bar{o}\), and the overfitting-corrected estimate of accuracy.
# 3. mean optimism
optimism_mean1 <- mean(c_boots1$optimism)

# 4. overfitting-corrected estimate of accuracy
c_opt1 <- c1 - optimism_mean1
So, the original C-statistic was 0.7565, compared to the bias-corrected estimate of 0.7547.
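Beyond the mean, it can also be useful to look at the spread of the per-bootstrap optimism values (this summary is not part of the original post; it assumes c_boots1 from above):
# spread of the per-bootstrap optimism estimates
summary(c_boots1$optimism)
sd(c_boots1$optimism)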
This whole procedure can be done in a few lines using the {rms} package.
# validate
set.seed(10)
cox1_cph <- rms::cph(Surv(time, status) ~ age + ascites + edema + albumin,
                     data = pbc_complete, x = TRUE, y = TRUE)
bias_corrected_metrics1 <- rms::validate(cox1_cph, B = 100)
bias_corrected_metrics1
Note that rms::validate() requires the model to be fit using rms::cph() instead of survival::coxph().
index.orig training test optimism index.corrected n
Dxy 0.5131 0.5072 0.5035 0.0037 0.5094 100
R2 0.2981 0.3038 0.2865 0.0173 0.2807 100
Slope 1.0000 1.0000 0.9359 0.0641 0.9359 100
D 0.0859 0.0892 0.0819 0.0073 0.0786 100
U -0.0018 -0.0018 0.0025 -0.0044 0.0025 100
Q 0.0877 0.0910 0.0794 0.0116 0.0761 100
g 0.9916 1.0087 0.9469 0.0619 0.9298 100
For a detailed explanation of the validate output, see John Haman’s website.
c1_rms <- (bias_corrected_metrics1["Dxy", "index.corrected"] + 1) / 2
c1_rms
[1] 0.7546952
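As a quick sanity check (not shown in the original output), the manually computed optimism-corrected C-statistic can be compared with the rms-based one; the two closely agree here.
# compare the manual and rms-based optimism-corrected C-statistics
c(manual = c_opt1, rms = c1_rms)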
Analysis Using Stepwise Selection
The model building stage (Section 2.2.1) is done using stepwise selection, and the model validation stage (Section 2.2.2) is done using bootstrapping; both stages are shown below.
Model building
Here, I perform stepwise selection on the complete-case dataset, starting from the null model.
\[ \begin{align} \text{Data:}\quad&\tilde{X}\\ &\\ &\downarrow \quad \color{gray}{\text{Stepwise selection}}\\ &\\ \text{Model:}\quad&\mathcal{M} \end{align} \]
The example
# fit a Cox PH model to pbc: start with the null model
cox_null <- coxph(Surv(time, status) ~ 1, data = pbc_complete)

# use stepwise selection (minimising AIC)
# creating upper and lower model for stepwise selection
# create upper scope (everything except 'id' and outcome variables)
upper_names <- paste(setdiff(names(pbc), c("time", "status", "id")),
                     collapse = " + ")
scope <- list(
  upper = reformulate(upper_names),
  lower = ~1
)
cox2 <- step(cox_null, scope = scope, trace = 0)
# look at the model
summary(cox2)
Call:
coxph(formula = Surv(time, status) ~ bili + stage + copper +
albumin + protime + age + edema + ast, data = pbc_complete)
n= 276, number of events= 111
coef exp(coef) se(coef) z Pr(>|z|)
bili 0.0851214 1.0888492 0.0193352 4.402 1.07e-05 ***
stage 0.4327939 1.5415584 0.1456307 2.972 0.00296 **
copper 0.0028535 1.0028576 0.0009832 2.902 0.00370 **
albumin -0.7185954 0.4874364 0.2724486 -2.638 0.00835 **
protime 0.2275175 1.2554794 0.1013729 2.244 0.02481 *
age 0.0313836 1.0318812 0.0102036 3.076 0.00210 **
edema 0.8217952 2.2745795 0.3471465 2.367 0.01792 *
ast 0.0043769 1.0043865 0.0018067 2.423 0.01541 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
bili 1.0888 0.9184 1.0484 1.1309
stage 1.5416 0.6487 1.1588 2.0508
copper 1.0029 0.9972 1.0009 1.0048
albumin 0.4874 2.0515 0.2858 0.8314
protime 1.2555 0.7965 1.0292 1.5314
age 1.0319 0.9691 1.0114 1.0527
edema 2.2746 0.4396 1.1519 4.4915
ast 1.0044 0.9956 1.0008 1.0079
Concordance= 0.845 (se = 0.019 )
Likelihood ratio test= 163.8 on 8 df, p=<2e-16
Wald test = 176.1 on 8 df, p=<2e-16
Score (logrank) test = 257.5 on 8 df, p=<2e-16
Model validation
The model validation stage is done using bootstrapping. The bootstrap procedure – starting from the complete-case dataset \(\tilde{X}\) (which is the same as used in the model building stage) – is as follows.
First, \(B\) bootstrap datasets are created, and models fitted to each of these (using the same stepwise procedure as in Section 2.2.1):
\[ \begin{align} \text{Data:}\quad\quad\quad&\tilde{X}&&&&\\ \\ &\downarrow\\ \\ \text{Bootstrapped data:}\quad\quad&\tilde{X}^{(1)} \quad &&\tilde{X}^{(2)} \quad &&\ldots \quad &&\tilde{X}^{(B)}\\ \\ &\downarrow && \downarrow &&\ldots &&\downarrow \quad\color{gray}{\text{Stepwise selection}}\\ \\ \text{Models:}\quad\quad &\mathcal{M}^{(1)} \quad &&\mathcal{M}^{(2)} \quad &&\ldots \quad &&\mathcal{M}^{(B)}\\ \end{align} \]
Then, calculating the optimism-corrected C-statistic involves:
- Calculate the C-statistic obtained from the original model \(\mathcal{M}\) (from Section 2.2.1), fitted to the complete-case dataset \(\tilde{X}\), denoted \(c\).
- For \(b = 1, 2, \ldots, B\):
- Calculate the C-statistic from the \(b^\text{th}\) bootstrap model \(\mathcal{M}^{(b)}\), fitted to the \(b^\text{th}\) bootstrap dataset \(\tilde{X}^{(b)}\), denoted \(c^{(b)}\).
- Calculate the C-statistic from the \(b^\text{th}\) bootstrap model \(\mathcal{M}^{(b)}\), fitted to the original dataset, \(\tilde{X}\), denoted \(c^{(b)}_X\).
- Calculate the optimism for the \(b^\text{th}\) bootstrap model \(o^{(b)} = c^{(b)} - c^{(b)}_X\).
- Calculate the mean optimism, \(\bar{o} = \frac{1}{B} \sum_{b=1}^{B} o^{(b)}\).
- Calculate the optimism-corrected C-statistic, \[ c_\text{opt} = c - \bar{o}. \]
The example
First, I obtain the (optimistic) C-statistic3, as described in Step (1).
3 The C-statistic obtained from the original model \(\mathcal{M}\) (from Section 2.2.1), fitted to the complete-case dataset \(\tilde{X}\).
c <- concordance(cox2)$concordance
Then, follow Step (2) to obtain the C-statistic from the \(b^\text{th}\) bootstrap model \(\mathcal{M}^{(b)}\), fitted to both the bootstrap dataset, \(\tilde{X}^{(b)}\), and the original dataset, \(\tilde{X}\). From these, the \(b^\text{th}\) bootstrap model's optimism can be calculated.
set.seed(10)
# start time of execution
total_time_start <- Sys.time()

c_boots <- rbindlist(
  lapply(seq_len(B), function(b) {
    # start time of inner execution
    start_time <- Sys.time()

    # 2a. create bootstrap sample
    idx <- sample(nrow(pbc_complete), replace = TRUE)
    pbc_boot <- pbc_complete[idx]

    # 2b. fit null model
    cox_boot_null <- coxph(Surv(time, status) ~ 1, data = pbc_boot)
    # use stepwise selection (minimising AIC)
    cox_boot <- step(cox_boot_null, scope = scope, trace = 0)

    # 2c. calculate concordance on bootstrap sample
    c_boot <- concordance(cox_boot)$concordance

    # 2d. predict on original data
    lp_X <- predict(cox_boot, newdata = pbc_complete, type = "lp")
    # calculate concordance on original data
    c_x <- concordance(Surv(time, status) ~ lp_X,
                       data = pbc_complete, reverse = TRUE)$concordance

    # end time of inner execution
    end_time <- Sys.time()
    time_taken <- end_time - start_time

    # save c stats and calculate optimism
    data.table(c_boot = c_boot, c_x = c_x, optimism = c_boot - c_x,
               time = time_taken)
  })
)
# end time of total execution
total_time_end <- Sys.time()
total_time_taken <- total_time_end - total_time_start
print(c_boots, topn = 5)
c_boot c_x optimism time
<num> <num> <num> <difftime>
1: 0.8538053 0.8443148 0.009490538 0.1987031 secs
2: 0.8293180 0.8343143 -0.004996320 0.2695990 secs
3: 0.8455517 0.8346268 0.010924944 0.2656732 secs
4: 0.8362208 0.8348872 0.001333566 0.3547299 secs
5: 0.8589161 0.8320225 0.026893555 0.1707869 secs
---
96: 0.8501148 0.8251992 0.024915579 0.1238530 secs
97: 0.8402561 0.8210323 0.019223753 0.3700459 secs
98: 0.8800687 0.8248346 0.055234084 0.2577620 secs
99: 0.8755016 0.8236366 0.051864931 0.2591660 secs
100: 0.8329462 0.8116048 0.021341406 0.2755342 secs
This process took 23.74 secs in total.4
4 On my MacBook Air M3.
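For reference (not in the original post), the per-iteration timings stored in c_boots can be summarised:
# per-iteration run time, in seconds
summary(as.numeric(c_boots$time, units = "secs"))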
Now, I compute the mean optimism, \(\bar{o}\), and the overfitting-corrected estimate of accuracy.
# (3) mean optimism
optimism_mean <- mean(c_boots$optimism)

# (4) optimism-corrected estimate of accuracy
c_opt <- c - optimism_mean
So, the original C-statistic was \(c =\) 0.8454, compared to the bias-corrected estimate of \(c_\text{opt}=\) 0.8235. The mean of the optimism was \(\bar{o} =\) 0.0218.
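Because the stepwise procedure is repeated in every resample, it is also easy to report how arbitrary the selected “important variables” are, as the hierarchy above suggests. A minimal sketch (not part of the original analysis; it re-runs a small number of bootstrap stepwise fits and reuses scope and pbc_complete from above):
# tabulate how often each variable is selected across bootstrap stepwise fits
set.seed(10)
selected_terms <- unlist(lapply(seq_len(25), function(b) {
  idx <- sample(nrow(pbc_complete), replace = TRUE)
  fit_null <- coxph(Surv(time, status) ~ 1, data = pbc_complete[idx])
  fit <- step(fit_null, scope = scope, trace = 0)
  attr(terms(fit), "term.labels")
}))
sort(table(selected_terms), decreasing = TRUE)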
rms::validate()
The validate() function cannot be used here, as it doesn’t validate the whole process: the stepwise procedure is not repeated. That is, validate() assumes the model is fixed for every bootstrap iteration. This means that the mean optimism will be smaller than the one obtained above, and therefore the ‘optimism-adjusted’ C-statistic will be higher than it should be.
The model found from the stepwise procedure in Section 2.2.1.1 is,
formula(cox2)
Surv(time, status) ~ bili + stage + copper + albumin + protime +
age + edema + ast
<environment: 0x12251e7e0>
So, assuming this model form is constant throughout:
# validate
set.seed(10)
cox2_cph <- rms::cph(formula(cox2),
                     data = pbc_complete, x = TRUE, y = TRUE)
bias_corrected_metrics2 <- rms::validate(cox2_cph, B = 100)
bias_corrected_metrics2

c2_rms <- (bias_corrected_metrics2["Dxy", "index.corrected"] + 1) / 2
c2_rms
index.orig training test optimism index.corrected n
Dxy 0.6907 0.6950 0.6791 0.0159 0.6748 100
R2 0.4561 0.4700 0.4379 0.0321 0.4240 100
Slope 1.0000 1.0000 0.8994 0.1006 0.8994 100
D 0.1479 0.1561 0.1400 0.0161 0.1319 100
U -0.0018 -0.0018 0.0049 -0.0068 0.0049 100
Q 0.1498 0.1579 0.1351 0.0228 0.1270 100
g 1.4660 1.5596 1.3985 0.1611 1.3049 100
[1] 0.8373867
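A direct comparison (not in the original output) makes the point: the fixed-model estimate from validate() is noticeably higher than the full-process bootstrap estimate.
# compare the full-process bootstrap estimate with the fixed-model validate() estimate
c(full_process = c_opt, fixed_model = c2_rms)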
Summary
What I’ve done: for both a predefined model and a stepwise model building procedure, I have calculated the apparent-performance C-statistic and then adjusted it for optimism. For the predefined model, the optimism can be calculated easily using rms::validate(). However, when a model building procedure is used, this must be done manually, as the whole selection process needs to be repeated within each bootstrap iteration.
Next steps: model building and validation in the presence of missing data, using multiple imputation.
Fin
Citation
@online{smith2025,
  author = {Smith, Paul},
title = {Model {Validation} Using the {Bootstrap}},
date = {2025-04-26},
url = {https://pws3141.github.io/blog/posts/modelbuilding/bias_corrected_validation/},
langid = {en}
}