Title: | Batch Effect Adjustments |
---|---|
Description: | Different adjustment methods for batch effects in biomarker data, such as from tissue microarrays. Some methods attempt to retain differences between batches that may be due to between-batch differences in "biological" factors that influence biomarker values. |
Authors: | Konrad Stopsack [aut, cre] |
Maintainer: | Konrad Stopsack <[email protected]> |
License: | GPL-3 |
Version: | 0.1.7 |
Built: | 2025-03-05 04:18:01 UTC |
Source: | https://github.com/stopsack/batchtma |
adjust_batch generates biomarker levels for the variable(s) markers in the dataset data that are corrected (adjusted) for batch effects, i.e. differential measurement error between levels of batch.
adjust_batch(
  data,
  markers,
  batch,
  method = c("simple", "standardize", "ipw", "quantreg", "quantnorm"),
  confounders = NULL,
  suffix = "_adjX",
  ipw_truncate = c(0.025, 0.975),
  quantreg_tau = c(0.25, 0.75),
  quantreg_method = "fn"
)
data |
Data set |
markers |
Variable name(s) to batch-adjust. Select
multiple variables with tidy evaluation, e.g.,
|
batch |
Categorical variable indicating batch. |
method |
Method for batch effect correction:
|
confounders |
Optional: Confounders, i.e. determinants of
biomarker levels that differ between batches. Only used if
|
suffix |
Optional: What string to append to variable names
after batch adjustment. Defaults to
|
ipw_truncate |
Optional and used for |
quantreg_tau |
Optional and used for |
quantreg_method |
Optional and used for |
If no true differences between batches are expected, because samples have been randomized to batches, then a method that returns adjusted values with equal means (method = simple) or with equal rank values (method = quantnorm) for all batches is appropriate.
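The two ideas in the paragraph above can be illustrated in base R. This is a minimal sketch, not the package's internal code: it assumes that "equal means" is achieved by shifting each batch to the overall mean, and that "equal rank values" means standard quantile normalization with equally sized batches and no ties. The column names biomarker_mean_adj and biomarker_qn are hypothetical.

```r
# Illustrative sketch only; batchtma may compute these adjustments differently.
set.seed(1)
df <- data.frame(
  tma = rep(1:2, times = 10),
  biomarker = rep(1:2, times = 10) + runif(n = 20, max = 5)
)

# Mean-based adjustment: shift each batch so that its mean equals the
# overall mean across all observations.
batch_means <- ave(df$biomarker, df$tma, FUN = mean)
df$biomarker_mean_adj <- df$biomarker - batch_means + mean(df$biomarker)

# Quantile normalization: replace each value by the average, across batches,
# of the values that hold the same within-batch rank.
# Assumes equally sized batches and no tied values.
by_batch <- split(df$biomarker, df$tma)
sorted <- sapply(by_batch, sort)   # matrix with one column per batch
reference <- rowMeans(sorted)      # shared reference distribution
df$biomarker_qn <- NA_real_
for (b in names(by_batch)) {
  idx <- which(df$tma == b)
  df$biomarker_qn[idx] <- reference[rank(df$biomarker[idx])]
}

# Batch means after mean-based adjustment, and sorted values per batch
# after quantile normalization:
tapply(df$biomarker_mean_adj, df$tma, mean)
tapply(df$biomarker_qn, df$tma, sort)
```

After the mean-based adjustment, both batches share the same mean; after quantile normalization, both batches share the same set of values.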
If the distribution of determinants of biomarker values (confounders) differs between batches, then a method that retains these "true" differences between batches while adjusting for batch effects may be appropriate: method = standardize and method = ipw address means; method = quantreg addresses lower values and dynamic range separately.
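To make "lower values and dynamic range" concrete, the following base R sketch aligns a lower quantile and the interquantile range of each batch with the pooled values. It is a simplification under stated assumptions: it ignores confounders, uses plain sample quantiles rather than quantile regression, and takes the quantiles 0.25 and 0.75 from the quantreg_tau default shown in the usage. The helper rescale_batch and the column biomarker_q_adj are hypothetical names.

```r
# Illustrative sketch only; method = quantreg additionally models confounders.
set.seed(5)
df <- data.frame(
  tma = rep(1:2, each = 50),
  # Hypothetical batch effect: batch 2 is shifted upward and has a wider range.
  biomarker = c(rnorm(50, mean = 10, sd = 1), rnorm(50, mean = 13, sd = 2))
)
tau <- c(0.25, 0.75)

# Pooled reference quantiles across all batches.
all_q <- quantile(df$biomarker, probs = tau, names = FALSE)

# Per batch: move the lower quantile to the pooled lower quantile and rescale
# the interquantile range to the pooled interquantile range.
rescale_batch <- function(x) {
  q <- quantile(x, probs = tau, names = FALSE)
  (x - q[1]) / (q[2] - q[1]) * (all_q[2] - all_q[1]) + all_q[1]
}
df$biomarker_q_adj <- ave(df$biomarker, df$tma, FUN = rescale_batch)

# Lower and upper quantiles per batch after rescaling:
tapply(df$biomarker_q_adj, df$tma, quantile, probs = tau, names = FALSE)
```

Unlike a mean shift, this rescaling changes both the location and the spread of each batch, which matters when batch effects also affect variance.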
Which method to choose depends on the properties of batch effects (affecting means or also variance?) and the presence and strength of confounding. For the two mean-only confounder-adjusted methods, the choice may depend on whether the confounder–batch association (method = ipw) or the confounder–biomarker association (method = standardize) can be modeled better. Generally, if batch effects are present, any adjustment method tends to perform better than no adjustment in reducing bias and increasing between-study reproducibility. See references.
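Modeling the confounder–biomarker association, as method = standardize does, can be sketched with a linear model and marginal predictions in base R. The data-generating values below are hypothetical, and centering the batch effects on the average of the standardized batch means is an assumption of this sketch rather than a documented detail of the package.

```r
# Illustrative sketch only; not batchtma's internal code.
set.seed(2)
df <- data.frame(
  tma = factor(rep(1:2, times = 10)),
  confounder = rep(0:1, times = 10) + runif(n = 20, max = 10)
)
# Hypothetical truth: confounder effect of 0.5 per unit, plus an artificial
# batch effect of +2 in batch 2.
df$biomarker <- 0.5 * df$confounder + 2 * (df$tma == "2") + rnorm(20, sd = 0.5)

# Outcome model with batch and confounder as predictors.
fit <- lm(biomarker ~ tma + confounder, data = df)

# Marginal (standardized) batch means: predict for all observations as if
# they belonged to batch b, then average over the observed confounders.
std_means <- sapply(levels(df$tma), function(b) {
  newdata <- df
  newdata$tma <- factor(b, levels = levels(df$tma))
  mean(predict(fit, newdata = newdata))
})

# Remove each batch's deviation from the average standardized batch mean.
batch_effect <- std_means - mean(std_means)
df$biomarker_std_adj <- df$biomarker - unname(batch_effect[as.character(df$tma)])
```

Because only the batch-specific shift is removed, differences in biomarker values that are explained by the confounder remain in the adjusted variable.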
All adjustment approaches except method = quantnorm are based on linear models. It is recommended that variables for markers and confounders first be transformed as necessary (e.g., log transformations or splines). Scaling or mean centering is not necessary, and adjusted values are returned on the original scale.
Parameters markers, batch, and confounders support tidy evaluation.
Observations with missing values for the markers and confounders are ignored in the estimation of adjustment parameters, as are empty batches. Batch effect-adjusted values for observations with existing marker values but missing confounders are based on adjustment parameters derived from the other observations in a batch with non-missing confounders.
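The handling of missing values described above can be mimicked in base R. In this sketch, the batch coefficient of a linear model stands in for an adjustment parameter; that choice, the toy data, and the name biomarker_adj are assumptions for illustration only.

```r
# Illustrative sketch only; not batchtma's internal code.
df <- data.frame(
  tma = factor(c(1, 1, 1, 2, 2, 2)),
  biomarker = c(1.0, 2.0, 3.0, 4.0, 5.0, NA),
  confounder = c(0.5, NA, 1.5, 1.0, 2.0, 3.0)
)

# lm() drops rows with a missing marker or confounder during estimation,
# so only the complete rows inform the batch parameter.
fit <- lm(biomarker ~ tma + confounder, data = df)
nobs(fit)

# The batch-level shift is then applied to every row that has a marker value,
# including row 2, whose confounder is missing.
shift <- c("1" = 0, "2" = coef(fit)[["tma2"]])
df$biomarker_adj <- df$biomarker - unname(shift[as.character(df$tma)])
df
```

Row 2 receives an adjusted value although it did not contribute to estimation; row 6 stays missing because there is no marker value to adjust.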
The data dataset with batch effect-adjusted variable(s) added at the end. Model diagnostics, using the attribute .batchtma of this dataset, are available via the diagnose_models function.
Konrad H. Stopsack
Stopsack KH, Tyekucheva S, Wang M, Gerke TA, Vaselkiv JB, Penney KL, Kantoff PW, Finn SP, Fiorentino M, Loda M, Lotan TL, Parmigiani G+, Mucci LA+ (+ equal contribution). Extent, impact, and mitigation of batch effects in tumor biomarker studies using tissue microarrays. eLife 2021;10:e71265. doi: https://doi.org/10.7554/elife.71265 (This R package, all methods descriptions, and further recommendations.)
Rosner B, Cook N, Portman R, Daniels S, Falkner B. Determination of blood pressure percentiles in normal-weight children: some methodological issues. Am J Epidemiol 2008;167(6):653-66. (Basis for method = standardize)
Bolstad BM, Irizarry RA, Åstrand M, Speed TP. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 2003;19:185–193. (method = quantnorm)
https://stopsack.github.io/batchtma/
# Data frame with two batches
# Batch 2 has higher values of biomarker and confounder
df <- data.frame(
  tma = rep(1:2, times = 10),
  biomarker = rep(1:2, times = 10) + runif(max = 5, n = 20),
  confounder = rep(0:1, times = 10) + runif(max = 10, n = 20)
)

# Adjust for batch effects
# Using simple means, ignoring the confounder:
adjust_batch(
  data = df,
  markers = biomarker,
  batch = tma,
  method = simple
)
# Returns data set with new variable "biomarker_adj2"

# Use quantile regression, include the confounder,
# change suffix of returned variable:
adjust_batch(
  data = df,
  markers = biomarker,
  batch = tma,
  method = quantreg,
  confounders = confounder,
  suffix = "_batchadjusted"
)
# Returns data set with new variable "biomarker_batchadjusted"
The goal of the batchtma package is to provide functions for batch effect-adjusting biomarker data. It implements different methods that address batch effects while retaining differences between batches that may be due to “true” underlying differences in factors that drive biomarker values (confounders).
adjust_batch: Adjust for batch effects
diagnose_models: Model diagnostics after batch adjustment
plot_batch: Plot biomarkers by batch
Stopsack KH, Tyekucheva S, Wang M, Gerke TA, Vaselkiv JB, Penney KL, Kantoff PW, Finn SP, Fiorentino M, Loda M, Lotan TL, Parmigiani G+, Mucci LA+ (+ equal contribution). Extent, impact, and mitigation of batch effects in tumor biomarker studies using tissue microarrays. eLife 2021;10:e71265. doi: https://doi.org/10.7554/elife.71265
https://stopsack.github.io/batchtma/
After adjust_batch has performed adjustment for batch effects, diagnose_models provides an overview of parameters and adjustment models. Information is only available about the most recent run of adjust_batch on a dataset.
diagnose_models(data)
data |
Batch-adjusted dataset (in which
|
List:
adjust_method: Method used for batch adjustment (see adjust_batch).
markers: Variables of biomarkers for adjustment
suffix: Suffix appended to variable names
batchvar: Variable indicating batch
confounders: Confounders, i.e. determinants of biomarker levels that differ between batches. Returned only if used by the model.
adjust_parameters: Tibble of parameters used to obtain adjusted biomarker levels. Parameters differ between methods:
  simple, standardize, and ipw: Estimated adjustment parameters are a tibble with one batchmean per marker and .batchvar.
  quantreg: Returns a tibble with numerous values per marker and .batchvar: unadjusted (un_...) and adjusted (ad_...) estimates of the lower (..._lo) and upper quantile (..._hi) and interquantile range (..._iq), plus the lower (all_lo) and upper quantiles (all_hi) across all batches.
  quantnorm: Does not explicitly estimate parameters.
model_fits: List of model fit objects, one per biomarker. Models differ between methods:
  standardize: Linear regression model for the biomarker with .batchvar and confounders as predictors, from which marginal predictions of batch means for each batch are obtained.
  ipw: Logistic (2 batches) or multinomial models for assignment to a specific batch with .batchvar as the response and confounders as the predictors, used to generate stabilized inverse-probability weights that are then used in a linear regression model to estimate marginally standardized batch means.
  quantreg: Quantile regression with the marker as the response variable and .batchvar and confounders as predictors.
  simple and quantnorm do not fit any regression models.
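The ipw entry above describes a two-step procedure that can be sketched in base R for the two-batch case: a logistic model for batch assignment, stabilized inverse-probability weights, and a weighted linear regression. The simulated data and all variable names are hypothetical, and weight truncation is omitted; this is a sketch of the general technique, not the package's implementation.

```r
# Illustrative sketch only; not batchtma's internal code.
set.seed(3)
n <- 200
confounder <- runif(n, max = 10)
# Hypothetical setup: samples with higher confounder values are more likely
# to be in batch 2, and batch 2 carries an artificial batch effect of +2.
tma <- factor(rbinom(n, size = 1, prob = plogis(-1 + 0.2 * confounder)) + 1)
biomarker <- 0.5 * confounder + 2 * (tma == "2") + rnorm(n, sd = 0.5)
df <- data.frame(tma, confounder, biomarker)

# Step 1: model batch assignment given the confounder (logistic, 2 batches).
# With a factor response, glm() models the probability of the second level.
ps_fit <- glm(tma ~ confounder, family = binomial(), data = df)
p_batch2 <- predict(ps_fit, type = "response")
p_own <- ifelse(df$tma == "2", p_batch2, 1 - p_batch2)

# Step 2: stabilized weights, i.e. the marginal probability of the observed
# batch divided by its conditional probability given the confounder.
p_marginal <- ifelse(df$tma == "2", mean(df$tma == "2"), mean(df$tma == "1"))
df$sw <- p_marginal / p_own

# Step 3: weighted linear regression; the batch coefficient estimates the
# difference in batch means after balancing the confounder across batches.
ipw_fit <- lm(biomarker ~ tma, data = df, weights = sw)
coef(ipw_fit)
```

In this setup, the weighted batch coefficient should lie closer to the simulated batch effect than the unweighted difference in means, which also absorbs the confounder imbalance.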
# Data frame with two batches
# Batch 2 has higher values of biomarker and confounder
df <- data.frame(
  tma = rep(1:2, times = 10),
  biomarker = rep(1:2, times = 10) + runif(max = 5, n = 20),
  confounder = rep(0:1, times = 10) + runif(max = 10, n = 20)
)

# Adjust for batch effects
df2 <- adjust_batch(
  data = df,
  markers = biomarker,
  batch = tma,
  method = quantreg,
  confounders = confounder
)

# Show overview of model diagnostics:
diagnose_models(data = df2)

# Obtain first fitted regression model:
fit <- diagnose_models(data = df2)$model_fits[[1]][[1]]

# Obtain residuals for this model:
residuals(fit)
To provide a simple visualization of potential batch effects, plot_batch generates a Tukey box plot overlaid by a jittered dot plot, inspired by the Stata plugin stripplot.
Boxes span from the 1st to the 3rd quartile; thick lines indicate medians; whiskers span up to 1.5 times the interquartile range; and asterisks indicate means.
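For readers who want to see the plot elements spelled out, the following sketch builds a comparable figure with base R graphics. plot_batch itself returns a ggplot2 object; this base graphics analogue is only an illustration of the described layout, with hypothetical example data.

```r
# Illustrative sketch only; plot_batch uses ggplot2, not base graphics.
set.seed(4)
df <- data.frame(
  tma = factor(rep(1:2, times = 10)),
  biomarker = rep(1:2, times = 10) + runif(n = 20, max = 5)
)

pdf(NULL)  # null device, so that the sketch also runs non-interactively

# Tukey box plot: boxes from 1st to 3rd quartile, thick median line,
# whiskers extending up to 1.5 times the interquartile range.
bp <- boxplot(biomarker ~ tma, data = df, range = 1.5, outline = FALSE,
              ylim = range(df$biomarker), xlab = "Batch", ylab = "Biomarker")

# Overlay individual observations as jittered points.
stripchart(biomarker ~ tma, data = df, vertical = TRUE, method = "jitter",
           pch = 16, add = TRUE)

# Mark batch means with asterisks.
batch_means <- tapply(df$biomarker, df$tma, mean)
points(seq_along(batch_means), batch_means, pch = 8, cex = 1.5)

invisible(dev.off())
```

Removing the pdf(NULL) and dev.off() calls draws the figure on the active graphics device instead.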
plot_batch(
  data,
  marker,
  batch,
  color = NULL,
  maxlevels = 15,
  title = NULL,
  ...
)
data |
Dataset. |
marker |
Variable indicating the biomarker. |
batch |
Variable indicating the batch. |
color |
Optional: third variable to use for symbol
color and shape. For example, |
maxlevels |
Optional: Maximum number of
levels for |
title |
Optional: character string that specifies plot title |
... |
Optional: Passed on to |
ggplot2 object, which can be further modified using standard ggplot2 functions. See examples.
Cox NJ (2003). STRIPPLOT: Stata module for strip plots (one-way dot plots). Statistical Software Components S433401, Boston College Department of Economics, revised 11 Oct 2020.
Manimaran S, Selby HM, Okrah K, Ruberman C, Leek JT, Quackenbush J, Haibe-Kains B, Bravo HC, Johnson WE (2016). BatchQC: interactive software for evaluating sample and batch effects in genomic data. Bioinformatics. doi:10.1093/bioinformatics/btw538
More powerful visualizations of batch effects exist in the BatchQC package:
http://bioconductor.org/packages/release/bioc/html/BatchQC.html
# Define example data
df <- data.frame(
  tma = rep(1:2, times = 10),
  biomarker = rep(1:2, times = 10) + runif(max = 5, n = 20),
  confounder = rep(0:1, times = 10) + runif(max = 10, n = 20)
)

# Visualize batch effects:
plot_batch(
  data = df,
  marker = biomarker,
  batch = tma,
  color = confounder
)

# Label y-axis, changing graph like other ggplots:
plot_batch(
  data = df,
  marker = biomarker,
  batch = tma,
  color = confounder
) +
  ggplot2::labs(y = "Biomarker (variable 'noisy')")