library(data.table)
options(datatable.print.nrows = 20) # max # rows before truncating (default = 100)
library(survival)
library(rms) # for 'validate'
library(mice)
# load pbc dataset
data(pbc)
pbc <- as.data.table(pbc)
pbc[, status := fifelse(status == 2, 1, 0)]
## used in 'step()' below
# creating upper and lower model for stepwise selection
# create upper scope (everything except 'id' and outcome variables)
upper_names <- paste(setdiff(names(pbc), c("time", "status", "id")),
                     collapse = " + ")
scope <- list(
  upper = reformulate(upper_names),
  lower = ~1
)
# complete case
pbc_complete <- na.omit(pbc)
This is Part Three of a four-part series on model fitting and validation.
- Part One considers some theoretical issues with data-splitting and using stepwise selection.
- Part Two looks at simulating the effect of data-splitting and stepwise selection in model building and validation on the lung dataset, including:
  - the variability of model terms chosen and the coefficient estimates from data-splitting and using stepwise selection;
  - the variability in the C-statistic obtained from the test set.
Introduction
In this part, I look at using the bootstrap to validate a model. Much of this work is inspired by the excellent book by Harrell et al. (2001). Using the bootstrap to validate a model means that data-splitting is not required, and instead both model building and validation can be performed on the full dataset. As discussed by Harrell in a post,
Data are too precious to not be used in model development/parameter estimation. Resampling methods allow the data to be used for both development and validation, and they do a good job in estimating the likely future performance of a model.
In Harrell et al. (2001, vol. 608, chap. 5.2), a hierarchy of validation methods is given, from worst to best.1
- Attempting several validations (internal or external) and reporting only the one that “worked”
- Reporting apparent performance on the training dataset (no validation)
- Reporting predictive accuracy on an undersized independent test sample
- Internal validation using data-splitting where at least one of the training and test samples is not huge and the investigator is not aware of the arbitrariness of variable selection done on a single sample
- Strong internal validation using 100 repeats of 10-fold cross-validation or several hundred bootstrap resamples, repeating all analysis steps involving \(Y\) afresh at each re-sample and the arbitrariness of selected “important variables” is reported (if variable selection is used)
- External validation on a large test sample, done by the original research team
- Re-analysis by an independent research team using strong internal validation of the original dataset
- External validation using new test data, done by an independent research team
- External validation using new test data generated using different instruments/technology, done by an independent research team
In this post, I am looking at option (5). Currently at work we often rely on (3) or (4).
1 Specifically, “one suggested hierarchy”, so I’m unsure whether this is Harrell’s suggested one or not…
The method
I will only consider complete-case data for now, and will look at two different scenarios.
- Model fitting where we already “know” the factors in the model.
- Model building and fitting, where we use stepwise to select the factors in the model.
I am using stepwise here for simplicity. The lasso would be better (although it still shares many of stepwise’s flaws). Using expert judgement to choose variables is much better, and means it is much harder to hide behind the computer.
I’m waiting with bated breath for guidance from the STRengthening Analytical Thinking for Observational Studies (STRATOS) initiative on variable selection techniques for multivariate analysis, but I’m not sure it’s coming anytime soon (Sauerbrei et al. 2020).
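As an aside on the lasso mentioned above, a lasso-based selection could look roughly like the sketch below using {glmnet} (not used in this post; it assumes the setup chunk at the top has been run and a recent glmnet, which accepts a Surv() response with family = "cox"):
library(glmnet)
# build a numeric predictor matrix from the same upper scope used for stepwise
x <- model.matrix(reformulate(upper_names), data = pbc_complete)[, -1]
y <- with(pbc_complete, Surv(time, status))
set.seed(10)
cv_fit <- cv.glmnet(x, y, family = "cox") # penalty chosen by cross-validation
coef(cv_fit, s = "lambda.1se") # non-zero rows are the selected variables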
The method for model validation in this article is based on Harrell et al. (2001, vol. 608, chap. 5.3.5), which describes the process of calculating an optimism-corrected accuracy metric using bootstrap resampling as:
From the original \(X\) and \(Y\) in the sample of size \(n\), draw a sample with replacement also of size \(n\). Derive a model in the bootstrap sample and apply it without change to the original sample. The accuracy index from the bootstrap sample minus the index computed on the original sample is an estimate of optimism. This process is repeated for \(100\) or so bootstrap replications to obtain an average optimism, which is subtracted from the final model fit’s apparent accuracy to obtain the overfitting-corrected estimate.
He goes on…
Note that bootstrapping validates the process that was used to fit the original model (as does cross-validation). It provides an estimate of the expected value of the optimism, which when subtracted from the original index, provides an estimate of the expected bias-corrected index. If stepwise variable selection is part of the bootstrap process (as it must be if the final model is developed that way), and not all resamples (samples with replacement or training samples in cross-validation) resulted in the same model (which is almost always the case), this internal validation process actually provides an unbiased estimate of the future performance of the process used to identify markers and scoring systems; it does not validate a single final model. But resampling does tend to provide good estimates of the future performance of the final model that was selected using the same procedure repeated in the resamples.
A good introduction to model building and internal validation is given in Collins et al. (2024). See Box 2 for a simple explanation of the bootstrapping internal validation method.
I have clarified this procedure where appropriate in Section 2.
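In skeleton form, the procedure looks like the sketch below (a sketch only, not the code used later in this post; fit_model() and c_index() are hypothetical stand-ins for whatever fitting and accuracy functions are used):
# optimism-corrected accuracy via the bootstrap (sketch only)
optimism_boot <- function(data, B = 100) {
  fit_full <- fit_model(data)            # model built on the full data
  c_apparent <- c_index(fit_full, data)  # apparent (optimistic) accuracy
  optimism <- vapply(seq_len(B), function(b) {
    boot <- data[sample(nrow(data), replace = TRUE), ]
    fit_b <- fit_model(boot)             # repeat *all* modelling steps here
    c_index(fit_b, boot) - c_index(fit_b, data)
  }, numeric(1))
  c_apparent - mean(optimism)            # optimism-corrected accuracy
}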
The data
The examples will be performed on the complete cases of the pbc data from the {survival} package (Therneau 2024).2
2 In my next post I’ll consider the full dataset and use multiple imputation to handle the missing data.
This is a dataset of patients with primary biliary cirrhosis, and has 276 observations (compared to 418 observations in the full pbc dataset). Throughout this post I will use \(B = 100\) bootstrap iterations.
# number of bootstrap iterations
B <- 100
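As a quick check (not in the original post), the sizes of the full and complete-case datasets can be confirmed directly, assuming the setup chunk above has been run:
# 418 rows in the full dataset, 276 after complete-case filtering
nrow(pbc)
nrow(pbc_complete)
# missing values per column in the full dataset
colSums(is.na(pbc))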
Analysis
| Notation | Definition |
|---|---|
| \(X \in \mathbb{R}^{n \times p}\) | dataset containing missingness (pbc) |
| \(\tilde{X} \in \mathbb{R}^{\tilde{n} \times p}\) | complete-case dataset (pbc_complete) |
| \(X^{(i)}\) (or \(\tilde{X}^{(i)}\)) | \(i\)-th bootstrap dataset (or \(i\)-th bootstrap of the complete-case dataset), \(i = 1,\dots,B\) |
Choosing factors manually
In this section, I consider building a model and validating it, when the factors in the model are chosen a priori.
Model building
\[ \begin{align} \text{Data:}\quad&\tilde{X}\\ &\\ &\downarrow \quad \color{gray}{\text{Factors chosen a-priori}}\\ &\\ \text{Model:}\quad&\mathcal{M} \end{align} \]
The example
I will build a model using the following factors: age, ascites, edema, albumin.
# fit a Cox PH model to pbc
cox1 <- coxph(Surv(time, status) ~ age + ascites + edema + albumin,
              data = pbc_complete)
summary(cox1)
Call:
coxph(formula = Surv(time, status) ~ age + ascites + edema +
albumin, data = pbc_complete)
n= 276, number of events= 111
coef exp(coef) se(coef) z Pr(>|z|)
age 0.029756 1.030203 0.009434 3.154 0.001610 **
ascites 0.795809 2.216234 0.337631 2.357 0.018421 *
edema 1.603254 4.969174 0.323922 4.950 7.44e-07 ***
albumin -1.041877 0.352792 0.274523 -3.795 0.000148 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
age 1.0302 0.9707 1.011 1.0494
ascites 2.2162 0.4512 1.143 4.2954
edema 4.9692 0.2012 2.634 9.3758
albumin 0.3528 2.8345 0.206 0.6042
Concordance= 0.757 (se = 0.027 )
Likelihood ratio test= 95.51 on 4 df, p=<2e-16
Wald test = 116.5 on 4 df, p=<2e-16
Score (logrank) test = 172.7 on 4 df, p=<2e-16
Model validation
First, \(B\) bootstrap datasets are created, and models fitted to each of these:
\[ \begin{align} \text{Data:}\quad\quad\quad&\tilde{X}&&&&\\ \\ &\downarrow\\ \\ \text{Bootstrapped data:}\quad\quad&\tilde{X}^{(1)} \quad &&\tilde{X}^{(2)} \quad &&\ldots \quad &&\tilde{X}^{(B)}\\ \\ &\downarrow && \downarrow &&\ldots &&\downarrow \quad\color{gray}{\text{Same factors as original model building (chosen a-priori)}}\\ \\ \text{Models:}\quad\quad &\mathcal{M}^{(1)} \quad &&\mathcal{M}^{(2)} \quad &&\ldots \quad &&\mathcal{M}^{(B)}\\ \end{align} \]
Then, calculating the optimism-corrected C-statistic involves:
- Calculate the C-statistic obtained from the original model \(\mathcal{M}\) (from Section 2.1.1), fitted to the complete-case dataset \(\tilde{X}\), denoted \(c\).
- For \(b = 1, 2, \ldots, B\):
- Calculate the C-statistic from the \(b^\text{th}\) bootstrap model \(\mathcal{M}^{(b)}\), fitted to the \(b^\text{th}\) bootstrap dataset \(\tilde{X}^{(b)}\), denoted \(c^{(b)}\).
- Calculate the C-statistic from the \(b^\text{th}\) bootstrap model \(\mathcal{M}^{(b)}\), fitted to the original dataset, \(\tilde{X}\), denoted \(c^{(b)}_X\).
- Calculate the optimism for the \(b^\text{th}\) bootstrap model \(o^{(b)} = c^{(b)} - c^{(b)}_X\).
- Calculate the mean optimism, \(\bar{o} = \frac{1}{B} \sum_{b=1}^{B} o^{(b)}\).
- Calculate the optimism-corrected C-statistic, \[ c_\text{opt} = c - \bar{o}. \]
The example
First, I obtain the (optimistic) C-statistic.
c1 <- concordance(cox1)$concordance
Then, we follow Steps (2) - (4) above to obtain the optimism-corrected C-statistic.
# number of bootstrap samples
B <- 100

set.seed(10)
c_boots1 <- rbindlist(
  lapply(seq_len(B), function(b) {
    # 2a. create bootstrap sample
    idx <- sample(nrow(pbc_complete), replace = TRUE)
    pbc_boot <- pbc_complete[idx]

    # 2b. fit model
    cox_boot <- coxph(Surv(time, status) ~ age + ascites + edema + albumin,
                      data = pbc_boot)

    # 2c. calculate concordance on bootstrap sample
    c_stat_boot <- concordance(cox_boot)

    # 2d. predict on original data
    lp_X <- predict(cox_boot, newdata = pbc_complete, type = "lp")
    # calculate concordance on original data
    c_stat_X <- concordance(Surv(time, status) ~ lp_X,
                            data = pbc_complete, reverse = TRUE)

    c_boot <- c_stat_boot$concordance
    c_x <- c_stat_X$concordance

    # save c stats and calculate optimism
    data.table(c_boot = c_boot, c_x = c_x, optimism = c_boot - c_x)
  })
)
print(c_boots1, topn = 5)
c_boot c_x optimism
<num> <num> <num>
1: 0.7414303 0.7518621 -0.010431756
2: 0.7249516 0.7552477 -0.030296102
3: 0.7277843 0.7552477 -0.027463354
4: 0.7299828 0.7534767 -0.023493912
5: 0.7489701 0.7543622 -0.005392114
---
96: 0.7580732 0.7539455 0.004127666
97: 0.7296153 0.7548831 -0.025267743
98: 0.8220811 0.7535809 0.068500184
99: 0.7695147 0.7469660 0.022548752
100: 0.7220383 0.7516537 -0.029615478
Now, we can compute the mean optimism, \(\bar{o}\), and the overfitting-corrected estimate of accuracy.
# 3. mean optimism
optimism_mean1 <- mean(c_boots1$optimism)

# 4. overfitting-corrected estimate of accuracy
c_opt1 <- c1 - optimism_mean1
So, the original C-statistic was 0.7565, compared to the bias-corrected estimate of 0.7547.
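Beyond the mean, it can also be useful to look at the spread of the per-bootstrap optimism values (this summary is not part of the original post; it assumes c_boots1 from above):
# spread of the per-bootstrap optimism estimates
summary(c_boots1$optimism)
sd(c_boots1$optimism)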
This whole procedure can be done in a few lines using the {rms} package.
# validate
set.seed(10)
cox1_cph <- rms::cph(Surv(time, status) ~ age + ascites + edema + albumin,
                     data = pbc_complete, x = TRUE, y = TRUE)
bias_corrected_metrics1 <- rms::validate(cox1_cph, B = 100)
bias_corrected_metrics1
Note that rms::validate() requires the model to be fit using rms::cph() instead of survival::coxph().
index.orig training test optimism index.corrected n
Dxy 0.5131 0.5072 0.5035 0.0037 0.5094 100
R2 0.2981 0.3038 0.2865 0.0173 0.2807 100
Slope 1.0000 1.0000 0.9359 0.0641 0.9359 100
D 0.0859 0.0892 0.0819 0.0073 0.0786 100
U -0.0018 -0.0018 0.0025 -0.0044 0.0025 100
Q 0.0877 0.0910 0.0794 0.0116 0.0761 100
g 0.9916 1.0087 0.9469 0.0619 0.9298 100
For a detailed explanation of the validate output, see John Haman’s website.
c1_rms <- (bias_corrected_metrics1["Dxy", "index.corrected"] + 1) / 2
c1_rms
[1] 0.7546952
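As a quick sanity check (not shown in the original output), the manually computed optimism-corrected C-statistic can be compared with the rms-based one; the two closely agree here.
# compare the manual and rms-based optimism-corrected C-statistics
c(manual = c_opt1, rms = c1_rms)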
Analysis Using Stepwise Selection
The model building stage (Section 2.2.1) is done using stepwise selection, and the model validation stage (Section 2.2.2) is done using bootstrapping; both stages are shown below.
Model building
Here, I perform stepwise selection on the complete-case dataset, starting from the null model.
\[ \begin{align} \text{Data:}\quad&\tilde{X}\\ &\\ &\downarrow \quad \color{gray}{\text{Stepwise selection}}\\ &\\ \text{Model:}\quad&\mathcal{M} \end{align} \]
The example
# fit a Cox PH model to pbc: start with the null model
cox_null <- coxph(Surv(time, status) ~ 1, data = pbc_complete)

# use stepwise selection (minimising AIC)
# creating upper and lower model for stepwise selection
# create upper scope (everything except 'id' and outcome variables)
upper_names <- paste(setdiff(names(pbc), c("time", "status", "id")),
                     collapse = " + ")
scope <- list(
  upper = reformulate(upper_names),
  lower = ~1
)
cox2 <- step(cox_null, scope = scope, trace = 0)
# look at the model
summary(cox2)
Call:
coxph(formula = Surv(time, status) ~ bili + stage + copper +
albumin + protime + age + edema + ast, data = pbc_complete)
n= 276, number of events= 111
coef exp(coef) se(coef) z Pr(>|z|)
bili 0.0851214 1.0888492 0.0193352 4.402 1.07e-05 ***
stage 0.4327939 1.5415584 0.1456307 2.972 0.00296 **
copper 0.0028535 1.0028576 0.0009832 2.902 0.00370 **
albumin -0.7185954 0.4874364 0.2724486 -2.638 0.00835 **
protime 0.2275175 1.2554794 0.1013729 2.244 0.02481 *
age 0.0313836 1.0318812 0.0102036 3.076 0.00210 **
edema 0.8217952 2.2745795 0.3471465 2.367 0.01792 *
ast 0.0043769 1.0043865 0.0018067 2.423 0.01541 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
bili 1.0888 0.9184 1.0484 1.1309
stage 1.5416 0.6487 1.1588 2.0508
copper 1.0029 0.9972 1.0009 1.0048
albumin 0.4874 2.0515 0.2858 0.8314
protime 1.2555 0.7965 1.0292 1.5314
age 1.0319 0.9691 1.0114 1.0527
edema 2.2746 0.4396 1.1519 4.4915
ast 1.0044 0.9956 1.0008 1.0079
Concordance= 0.845 (se = 0.019 )
Likelihood ratio test= 163.8 on 8 df, p=<2e-16
Wald test = 176.1 on 8 df, p=<2e-16
Score (logrank) test = 257.5 on 8 df, p=<2e-16
Model validation
The model validation stage is done using bootstrapping. The bootstrap procedure – starting from the complete-case dataset \(\tilde{X}\) (which is the same as used in the model building stage) – is as follows.
First, \(B\) bootstrap datasets are created, and models fitted to each of these (using the same stepwise procedure as in Section 2.2.1):
\[ \begin{align} \text{Data:}\quad\quad\quad&\tilde{X}&&&&\\ \\ &\downarrow\\ \\ \text{Bootstrapped data:}\quad\quad&\tilde{X}^{(1)} \quad &&\tilde{X}^{(2)} \quad &&\ldots \quad &&\tilde{X}^{(B)}\\ \\ &\downarrow && \downarrow &&\ldots &&\downarrow \quad\color{gray}{\text{Stepwise selection}}\\ \\ \text{Models:}\quad\quad &\mathcal{M}^{(1)} \quad &&\mathcal{M}^{(2)} \quad &&\ldots \quad &&\mathcal{M}^{(B)}\\ \end{align} \]
Then, calculating the optimism-corrected C-statistic involves:
- Calculate the C-statistic obtained from the original model \(\mathcal{M}\) (from Section 2.2.1), fitted to the complete-case dataset \(\tilde{X}\), denoted \(c\).
- For \(b = 1, 2, \ldots, B\):
- Calculate the C-statistic from the \(b^\text{th}\) bootstrap model \(\mathcal{M}^{(b)}\), fitted to the \(b^\text{th}\) bootstrap dataset \(\tilde{X}^{(b)}\), denoted \(c^{(b)}\).
- Calculate the C-statistic from the \(b^\text{th}\) bootstrap model \(\mathcal{M}^{(b)}\), fitted to the original dataset, \(\tilde{X}\), denoted \(c^{(b)}_X\).
- Calculate the optimism for the \(b^\text{th}\) bootstrap model \(o^{(b)} = c^{(b)} - c^{(b)}_X\).
- Calculate the mean optimism, \(\bar{o} = \frac{1}{B} \sum_{b=1}^{B} o^{(b)}\).
- Calculate the optimism-corrected C-statistic, \[ c_\text{opt} = c - \bar{o}. \]
The example
First, I obtain the (optimistic) C-statistic3, as described in Step (1).
3 The C-statistic obtained from the original model \(\mathcal{M}\) (from Section 2.2.1), fitted to the complete-case dataset \(\tilde{X}\).
c <- concordance(cox2)$concordance
Then, follow Step (2) to obtain the C-statistic from the \(b^\text{th}\) bootstrap model \(\mathcal{M}^{(b)}\), fitted to both the bootstrap dataset, \(\tilde{X}^{(b)}\), and the original dataset, \(\tilde{X}\). From these, the \(b^\text{th}\) bootstrap model's optimism can be calculated.
set.seed(10)
# start time of execution
total_time_start <- Sys.time()

c_boots <- rbindlist(
  lapply(seq_len(B), function(b) {
    # start time of inner execution
    start_time <- Sys.time()

    # 2a. create bootstrap sample
    idx <- sample(nrow(pbc_complete), replace = TRUE)
    pbc_boot <- pbc_complete[idx]

    # 2b. fit null model
    cox_boot_null <- coxph(Surv(time, status) ~ 1, data = pbc_boot)
    # use stepwise selection (minimising AIC)
    cox_boot <- step(cox_boot_null, scope = scope, trace = 0)

    # 2c. calculate concordance on bootstrap sample
    c_boot <- concordance(cox_boot)$concordance

    # 2d. predict on original data
    lp_X <- predict(cox_boot, newdata = pbc_complete, type = "lp")
    # calculate concordance on original data
    c_x <- concordance(Surv(time, status) ~ lp_X,
                       data = pbc_complete, reverse = TRUE)$concordance

    # end time of inner execution
    end_time <- Sys.time()
    time_taken <- end_time - start_time

    # save c stats and calculate optimism
    data.table(c_boot = c_boot, c_x = c_x, optimism = c_boot - c_x,
               time = time_taken)
  })
)
# end time of total execution
total_time_end <- Sys.time()
total_time_taken <- total_time_end - total_time_start
print(c_boots, topn = 5)
c_boot c_x optimism time
<num> <num> <num> <difftime>
1: 0.8538053 0.8443148 0.009490538 0.1987031 secs
2: 0.8293180 0.8343143 -0.004996320 0.2695990 secs
3: 0.8455517 0.8346268 0.010924944 0.2656732 secs
4: 0.8362208 0.8348872 0.001333566 0.3547299 secs
5: 0.8589161 0.8320225 0.026893555 0.1707869 secs
---
96: 0.8501148 0.8251992 0.024915579 0.1238530 secs
97: 0.8402561 0.8210323 0.019223753 0.3700459 secs
98: 0.8800687 0.8248346 0.055234084 0.2577620 secs
99: 0.8755016 0.8236366 0.051864931 0.2591660 secs
100: 0.8329462 0.8116048 0.021341406 0.2755342 secs
This process took 23.74 secs in total.4
4 On my MacBook Air M3.
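For reference (not in the original post), the per-iteration timings stored in c_boots can be summarised:
# per-iteration run time, in seconds
summary(as.numeric(c_boots$time, units = "secs"))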
Now, I compute the mean optimism, \(\bar{o}\), and the overfitting-corrected estimate of accuracy.
# (3) mean optimism
optimism_mean <- mean(c_boots$optimism)

# (4) optimism-corrected estimate of accuracy
c_opt <- c - optimism_mean
So, the original C-statistic was \(c =\) 0.8454, compared to the bias-corrected estimate of \(c_\text{opt}=\) 0.8235. The mean of the optimism was \(\bar{o} =\) 0.0218.
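Because the stepwise procedure is repeated in every resample, it is also easy to report how arbitrary the selected “important variables” are, as the hierarchy above suggests. A minimal sketch (not part of the original analysis; it re-runs a small number of bootstrap stepwise fits and reuses scope and pbc_complete from above):
# tabulate how often each variable is selected across bootstrap stepwise fits
set.seed(10)
selected_terms <- unlist(lapply(seq_len(25), function(b) {
  idx <- sample(nrow(pbc_complete), replace = TRUE)
  fit_null <- coxph(Surv(time, status) ~ 1, data = pbc_complete[idx])
  fit <- step(fit_null, scope = scope, trace = 0)
  attr(terms(fit), "term.labels")
}))
sort(table(selected_terms), decreasing = TRUE)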
rms::validate()
The validate() function cannot be used here, as it doesn’t validate the whole process: the stepwise procedure is not repeated. That is, validate() assumes the model is fixed for every bootstrap iteration. This means that the mean optimism will be smaller than the one obtained above, and therefore the ‘optimism-adjusted’ C-statistic will be higher than it should be.
The model found from the stepwise procedure in Section 2.2.1.1 is,
formula(cox2)
Surv(time, status) ~ bili + stage + copper + albumin + protime +
age + edema + ast
<environment: 0x12251e7e0>
So, assuming this model form is constant throughout:
# validate
set.seed(10)
cox2_cph <- rms::cph(formula(cox2),
                     data = pbc_complete, x = TRUE, y = TRUE)
bias_corrected_metrics2 <- rms::validate(cox2_cph, B = 100)
bias_corrected_metrics2

c2_rms <- (bias_corrected_metrics2["Dxy", "index.corrected"] + 1) / 2
c2_rms
index.orig training test optimism index.corrected n
Dxy 0.6907 0.6950 0.6791 0.0159 0.6748 100
R2 0.4561 0.4700 0.4379 0.0321 0.4240 100
Slope 1.0000 1.0000 0.8994 0.1006 0.8994 100
D 0.1479 0.1561 0.1400 0.0161 0.1319 100
U -0.0018 -0.0018 0.0049 -0.0068 0.0049 100
Q 0.1498 0.1579 0.1351 0.0228 0.1270 100
g 1.4660 1.5596 1.3985 0.1611 1.3049 100
[1] 0.8373867
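A direct comparison (not in the original output) makes the point: the fixed-model estimate from validate() is noticeably higher than the full-process bootstrap estimate.
# compare the full-process bootstrap estimate with the fixed-model validate() estimate
c(full_process = c_opt, fixed_model = c2_rms)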
Summary
What I’ve done: for both a predefined model and a stepwise model building procedure, I have calculated the apparent-performance C-statistic and then adjusted it for optimism. For the predefined model, the optimism can be calculated easily using rms::validate(). However, when a model building procedure is used, this must be done manually, as the whole selection process needs to be repeated within each bootstrap iteration.
Next steps: model building and validation in the presence of missing data, using multiple imputation.
Fin
Citation
@online{smith2025,
  author = {Smith, Paul},
title = {Model {Validation} Using the {Bootstrap}},
date = {2025-04-26},
url = {https://pws3141.github.io/blog/posts/modelbuilding/bias_corrected_validation/},
langid = {en}
}