| Title: | High-Dimensional Mediation Analysis via Transfer Learning |
|---|---|
| Description: | Provides a framework for high-dimensional mediation analysis using transfer learning. The main function TransHDM() integrates large-scale source data to improve the detection power of potential mediators in small-sample target studies. It addresses data heterogeneity via transfer regularization and debiased estimation while controlling the false discovery rate. The package also includes utilities for data generation (gen_simData_homo(), gen_simData_hetero()), baseline methods such as lasso() and dblasso(), sure independence screening via SIS(), and model diagnostics through source_detection(). The methodology is described in Pan et al. (2025) <doi:10.1093/bib/bbaf460>. |
| Authors: | Huer Gao [aut, cre, cph], Lulu Pan [aut, cph], Yongfu Yu [ctb, cph], Guoyou Qin [ctb, cph] |
| Maintainer: | Huer Gao <[email protected]> |
| License: | GPL (>= 3) |
| Version: | 1.0.1 |
| Built: | 2026-05-23 08:47:40 UTC |
| Source: | https://github.com/gaohuer/transhdm |
Fits a debiased LASSO regression model under a transfer learning framework, supporting feature selection and coefficient estimation by combining target and source data.

    dblasso(target, source = NULL, transfer = FALSE, level = 0.95, lambda = "lambda.1se")

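| Argument | Description |
|---|---|
| target | A list containing two elements, x and y (in the examples, a numeric predictor matrix and a numeric response vector). Required. |
| source | A list (optional, default: NULL) containing two elements, x and y, in the same form as target. Used when transfer = TRUE. |
| transfer | A logical value (default: FALSE) indicating whether to enable transfer learning (combining source data with target data for estimation). |
| level | A numeric value (default: 0.95) specifying the confidence level for the confidence intervals. |
| lambda | A string (default: "lambda.1se") specifying the criterion for selecting the regularization parameter. The usage and examples on this page use "lambda.1se" and "lambda.min". |

A list containing: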
dbcoef.hat: Debiased LASSO coefficient vector (including intercept)
coef.hat: Original LASSO coefficient vector
CI: Data frame with confidence intervals (lb = lower bound, ub = upper bound)
var.est: Variance estimates for debiased coefficients
se.est: Standard errors for debiased coefficients
P.value: Vector of p-values for coefficients
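The returned components relate to one another in the usual way for debiased estimators. The following is a minimal sketch, assuming a normal (Wald-type) approximation; this page does not confirm that the package computes its intervals and p-values exactly this way, and both the numbers and the helper name wald_summary are hypothetical.

```r
# Hypothetical debiased estimates and standard errors (not package output)
dbcoef_hat <- c(0.80, -0.05, 0.30)
se_est     <- c(0.20,  0.10, 0.10)

# Wald-type interval and two-sided p-value under a normal approximation
wald_summary <- function(est, se, level = 0.95) {
  z_crit <- qnorm(1 - (1 - level) / 2)
  data.frame(
    lb      = est - z_crit * se,          # lower bound
    ub      = est + z_crit * se,          # upper bound
    P.value = 2 * pnorm(-abs(est / se))   # two-sided p-value
  )
}

res <- wald_summary(dbcoef_hat, se_est, level = 0.95)
print(res)
```

Raising level widens every interval; the p-values do not depend on level.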

    # Prepare target and source data
    target <- list(x = matrix(rnorm(100 * 20), 100, 20), y = rnorm(100))
    source <- list(x = matrix(rnorm(200 * 20), 200, 20), y = rnorm(200))

    # Non-transfer mode
    result_no_transfer <- dblasso(target = target, transfer = FALSE,
                                  level = 0.95, lambda = 'lambda.min')
    summary(result_no_transfer)

    # Transfer learning mode
    result_transfer <- dblasso(target = target, source = source, transfer = TRUE,
                               level = 0.95, lambda = 'lambda.min')
    summary(result_transfer)

Generates synthetic datasets mimicking high-dimensional mediation structures, optionally incorporating transferable source data under varying covariate correlation and heterogeneity levels. This function supports heterogeneous settings for data generation.

    gen_simData_hetero(n = 100, p_x = 5, rho = 0, p_m = 100, h = 0,
                       source = FALSE, transferable = TRUE, seed = NULL)

| Argument | Description |
|---|---|
| n | Integer. Number of observations (sample size). Default is 100. |
| p_x | Integer. Number of covariates (confounders). Default is 5. |
| rho | Numeric. Correlation coefficient (0–1) controlling correlation between mediators. Default is 0 (no correlation). |
| p_m | Integer. Number of mediators. Default is 100. |
| h | Integer. Degree of heterogeneity (for source data). Default is 0. |
| source | Logical. If TRUE, generates a source (external) dataset. Default is FALSE. |
| transferable | Logical. If TRUE, generates a transferable source dataset sharing mediator-outcome structure with the target. Default is TRUE. |
| seed | Integer. Optional random seed for reproducibility. Default is NULL. |

This function generates data according to a structural equation model (SEM):
Treatment D depends linearly on covariates X.
Mediators M depend on D and X, with residual correlation controlled by rho.
Outcome Y depends on D, M, and X.
If source = TRUE, then a source dataset is simulated with potentially transferable mechanisms.
A list with the following components:
data: A data.frame of dimension n × (2 + p_m + p_x) containing outcome (Y), treatment (D), mediators (M1–Mp_m), and covariates (X1–Xp_x).
coef: A named list of true model coefficients (alpha1, alpha2, beta1, beta2, beta4, etc.).
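The three-equation structure described above can be sketched in base R. This is an illustration only: the coefficient values, the compound-symmetry residual correlation, and the choice of three active mediators are assumptions, not the settings used inside gen_simData_hetero().

```r
set.seed(1)
n <- 100; p_x <- 5; p_m <- 10; rho <- 0.3

# Covariates, and a treatment that depends linearly on them
X <- matrix(rnorm(n * p_x), n, p_x)
D <- as.vector(X %*% rep(0.2, p_x) + rnorm(n))

# Mediator residuals with pairwise correlation rho (assumed compound symmetry)
Sigma <- matrix(rho, p_m, p_m)
diag(Sigma) <- 1
E <- matrix(rnorm(n * p_m), n, p_m) %*% chol(Sigma)

# Hypothetical paths: only the first three mediators are active
alpha <- c(rep(0.5, 3), rep(0, p_m - 3))   # D -> M
beta  <- c(rep(0.4, 3), rep(0, p_m - 3))   # M -> Y

# Mediators depend on D and X; outcome depends on D, M, and X
M <- outer(D, alpha) + matrix(X %*% rep(0.1, p_x), n, p_m) + E
Y <- as.vector(0.5 * D + M %*% beta + X %*% rep(0.1, p_x) + rnorm(n))

# Assemble in the documented column order: Y, D, M1..Mp_m, X1..Xp_x
sim <- data.frame(Y = Y, D = D, M, X)
colnames(sim) <- c("Y", "D", paste0("M", 1:p_m), paste0("X", 1:p_x))
dim(sim)
```

The resulting data frame has n rows and 2 + p_m + p_x columns, matching the layout documented for the data component.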

    # Source dataset
    source_data <- gen_simData_hetero(
      n = 100, p_x = 5, rho = 0, p_m = 100, h = 0,
      source = TRUE, transferable = TRUE, seed = 123
    )

    # Target dataset
    target_data <- gen_simData_hetero(
      n = 100, p_x = 5, rho = 0, p_m = 100, h = 0,
      source = FALSE, transferable = TRUE, seed = 123
    )

Generates synthetic datasets mimicking high-dimensional mediation structures, optionally incorporating transferable source data under varying covariate correlation and heterogeneity levels. This function supports homogeneous settings for data generation.

    gen_simData_homo(n = 100, p_x = 5, rho = 0, p_m = 100, h = 0,
                     source = FALSE, transferable = TRUE, seed = NULL)

| Argument | Description |
|---|---|
| n | Integer. Number of observations (sample size). Default is 100. |
| p_x | Integer. Number of covariates (confounders). Default is 5. |
| rho | Numeric. Correlation coefficient (0–1) controlling correlation between mediators. Default is 0 (no correlation). |
| p_m | Integer. Number of mediators. Default is 100. |
| h | Integer. Degree of heterogeneity (for source data). Default is 0. |
| source | Logical. If TRUE, generates a source (external) dataset. Default is FALSE. |
| transferable | Logical. If TRUE, generates a transferable source dataset sharing mediator-outcome structure with the target. Default is TRUE. |
| seed | Integer. Optional random seed for reproducibility. Default is NULL. |

This function generates data according to a structural equation model (SEM):
Treatment D depends linearly on covariates X.
Mediators M depend on D and X, with residual correlation controlled by rho.
Outcome Y depends on D, M, and X.
If source = TRUE, then a source dataset is simulated with potentially transferable mechanisms.
A list with the following components:
data: A data.frame of dimension n × (2 + p_m + p_x) containing outcome (Y), treatment (D), mediators (M1–Mp_m), and covariates (X1–Xp_x).
coef: A named list of true model coefficients (alpha1, alpha2, beta1, beta2, beta3, beta4, etc.).

    # Source dataset
    source_data <- gen_simData_homo(
      n = 100, p_x = 5, rho = 0, p_m = 100, h = 0,
      source = TRUE, transferable = TRUE, seed = 123
    )

    # Target dataset
    target_data <- gen_simData_homo(
      n = 100, p_x = 5, rho = 0, p_m = 100, h = 0,
      source = FALSE, transferable = TRUE, seed = 123
    )

Fits a LASSO (Least Absolute Shrinkage and Selection Operator) regression model under a transfer learning framework. Supports feature selection and coefficient estimation by combining target data and source data.

    lasso(target, source = NULL, transfer = FALSE, lambda = "lambda.1se")

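| Argument | Description |
|---|---|
| target | A list containing two elements, x and y (in the examples, a numeric predictor matrix and a numeric response vector). Required. |
| source | A list (optional, default: NULL) containing two elements, x and y, in the same form as target. Used when transfer = TRUE. |
| transfer | A logical value (default: FALSE) indicating whether to enable transfer learning mode (combining source data with target data). |
| lambda | A string (default: 'lambda.1se') specifying the criterion for selecting the regularization parameter. The usage and examples on this page use "lambda.1se" and "lambda.min". |

A numeric vector coef containing LASSO coefficient estimates (including intercept).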

    # Prepare target and source data
    target <- list(x = matrix(rnorm(100 * 20), 100, 20), y = rnorm(100))
    source <- list(x = matrix(rnorm(200 * 20), 200, 20), y = rnorm(200))

    # Non-transfer mode
    coef_no_transfer <- lasso(target = target, transfer = FALSE, lambda = 'lambda.min')
    print(coef_no_transfer)
    summary(coef_no_transfer)

    # Transfer learning mode
    coef_transfer <- lasso(target = target, source = source, transfer = TRUE,
                           lambda = 'lambda.min')
    print(coef_transfer)
    summary(coef_transfer)

This function performs dimension reduction for high-dimensional mediation analysis using Sure Independence Screening (SIS). Mediators are ranked based on the product of their marginal associations with the exposure and the outcome, and the top-ranked mediators are retained for downstream analysis.

    SIS(target_data, source_data = NULL, Y, D, M, X, topN = NULL,
        transfer = FALSE, verbose = TRUE, ncore = 1, dblasso_method = FALSE)

| Argument | Description |
|---|---|
| target_data | A data frame containing the target dataset. All variables must be numeric. |
| source_data | A list of data frames containing source datasets (optional, default: NULL). All variables must be numeric and have the same column names as target_data. |
| Y | Character string specifying the outcome variable name. |
| D | Character string specifying the exposure (treatment) variable name. |
| M | Character vector specifying mediator variable names. |
| X | Character vector specifying covariate variable names. |
| topN | An integer (default: NULL) specifying the number of mediators to retain after screening. |
| transfer | A logical value (default: FALSE) indicating whether to apply transfer learning by incorporating the source dataset in the screening procedure. |
| verbose | A logical value (default: TRUE) controlling whether progress messages are printed to the console. |
| ncore | An integer (default: 1) specifying the number of CPU cores for parallel computation. |
| dblasso_method | A logical value (default: FALSE). If TRUE, the debiased lasso (dblasso) is used to estimate marginal effects. If FALSE, standard linear or generalized linear models are used. |

The function supports transfer learning, allowing information from a source dataset to be leveraged to improve screening stability and robustness in the target dataset.
A list with the following components:
target_SIS: A data frame containing the outcome, exposure, selected mediators, and covariates from the target dataset.
source_SIS: A data frame containing the same variables from the source dataset if transfer = TRUE; otherwise NULL.
M_ID_name_SIS: A character vector of selected mediator names.
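The ranking idea behind the screening step can be sketched with ordinary least squares in base R. This is a simplified, target-only illustration under stated assumptions: covariates are omitted, the effect sizes are hypothetical, and nothing here reproduces the package's transfer or dblasso-based variants.

```r
set.seed(2)
n <- 200; p_m <- 30
D <- rnorm(n)
M <- matrix(rnorm(n * p_m), n, p_m,
            dimnames = list(NULL, paste0("M", 1:p_m)))
M[, 1:3] <- M[, 1:3] + 0.8 * D            # first three mediators depend on D
Y <- as.vector(0.5 * D + M[, 1:3] %*% rep(0.6, 3) + rnorm(n))

# Marginal exposure -> mediator slope, one regression per mediator
alpha_hat <- sapply(colnames(M), function(j) coef(lm(M[, j] ~ D))[["D"]])

# Marginal mediator -> outcome slope, adjusting for the exposure
beta_hat <- sapply(colnames(M), function(j) coef(lm(Y ~ M[, j] + D))[[2]])

# Rank by the absolute product and keep the topN mediators
topN  <- 5
score <- abs(alpha_hat * beta_hat)
keep  <- names(sort(score, decreasing = TRUE))[seq_len(topN)]
keep
```

Because a mediator needs both a nonzero exposure path and a nonzero outcome path to score highly, the product favors mediators that sit on a complete indirect pathway.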

    set.seed(123)

    # Target data
    M_target <- matrix(rnorm(200 * 50), nrow = 200)
    colnames(M_target) <- paste0("M", 1:50)
    target_data <- data.frame(Y = rnorm(200), D = rnorm(200), M_target, X1 = rnorm(200))

    # Source data
    M_source <- matrix(rnorm(300 * 50), nrow = 300)
    colnames(M_source) <- paste0("M", 1:50)
    source_data <- data.frame(Y = rnorm(300), D = rnorm(300), M_source, X1 = rnorm(300))

    # Run SIS
    result <- SIS(
      target_data = target_data,
      source_data = source_data,
      Y = "Y", D = "D", M = paste0("M", 1:50), X = "X1",
      transfer = TRUE, topN = 10
    )
    result$M_ID_name_SIS

Determines whether external source datasets can be effectively transferred to the target data by comparing predictive performance using LASSO regression under a transfer learning framework.

    source_detection(target_data, source_data = NULL, Y, D, M, X,
                     kfold = 5, C0 = 0.05, verbose = TRUE)

| Argument | Description |
|---|---|
| target_data | A data frame containing the target dataset. All variables must be numeric. |
| source_data | A list of data frames containing source datasets (optional, default: NULL). All variables must be numeric and have the same column names as target_data. |
| Y | Character string specifying the outcome variable name. |
| D | Character string specifying the exposure (treatment) variable name. |
| M | Character vector specifying mediator variable names. |
| X | Character vector specifying covariate variable names. |
| kfold | Integer (default: 5). Number of folds for cross-validation. |
| C0 | Numeric (default: 0.05). Threshold constant for determining transferability. Larger values make the criterion more lenient. |
| verbose | Logical (default: TRUE). Whether to print progress messages. |

A list containing:
transfer.source.id: Indices of source datasets deemed transferable
source.loss: Mean validation loss for each source dataset
target.valid.loss: Mean validation loss using target-only model
T_index: Difference between source loss and target-only loss for each source
threshold: Calculated transferability threshold
loss.cv: Full k-fold cross-validation loss matrix
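How these components might combine into a decision can be sketched as follows. The loss values are made up, and the threshold formula (C0 times the target-only loss) is an assumption chosen only so that a larger C0 is more lenient; this page does not state the rule the package actually applies.

```r
# Hypothetical k-fold validation losses (rows = folds); not package output
loss_cv <- cbind(
  target  = c(1.02, 0.98, 1.05, 0.97, 1.00),
  source1 = c(1.03, 0.99, 1.04, 0.99, 1.01),
  source2 = c(1.60, 1.45, 1.52, 1.58, 1.49)
)

target_valid_loss <- mean(loss_cv[, "target"])
source_loss       <- colMeans(loss_cv[, -1, drop = FALSE])

# T_index as described above: source loss minus target-only loss
T_index <- source_loss - target_valid_loss

# ASSUMPTION: threshold proportional to the target-only loss
C0        <- 0.05
threshold <- C0 * target_valid_loss

# Sources whose excess loss stays within the threshold are kept
transfer_source_id <- which(T_index <= threshold)
transfer_source_id
```

In this toy input the first source predicts about as well as the target-only model and is kept, while the second has a much larger loss and is excluded.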

    ## Reproducible example
    set.seed(123)

    # Generate synthetic target data
    target_data <- data.frame(
      Y = rnorm(200), D = rnorm(200),
      M1 = rnorm(200), M2 = rnorm(200), X1 = rnorm(200)
    )

    # Generate synthetic source data
    source1 <- data.frame(
      Y = rnorm(300), D = rnorm(300),
      M1 = rnorm(300), M2 = rnorm(300), X1 = rnorm(300)
    )
    source2 <- data.frame(
      Y = rnorm(250), D = rnorm(250),
      M1 = rnorm(250), M2 = rnorm(250), X1 = rnorm(250)
    )

    # Run source detection
    result <- source_detection(
      target_data = target_data,
      source_data = list(source1, source2),
      Y = "Y", D = "D", M = c("M1", "M2"), X = "X1",
      kfold = 5, C0 = 0.05, verbose = FALSE
    )

    # Get summary
    summary(result)

    # Transferable source indices
    result$transfer.source.id

    # Compare validation losses
    data.frame(
      Source = c(paste0("Source", seq_along(result$source.loss)), "Target"),
      Loss = c(result$source.loss, result$target.valid.loss)
    )

Summary of Debiased Lasso Inference

    ## S3 method for class 'dblasso'
    summary(object, ...)

| Argument | Description |
|---|---|
| object | An object of class "dblasso". |
| ... | Further arguments (currently not used). |

An object of class "summary.dblasso".
Summary of Lasso Regression

    ## S3 method for class 'lasso'
    summary(object, ...)

| Argument | Description |
|---|---|
| object | A named numeric vector of lasso coefficients. |
| ... | Further arguments (currently not used). |

An object of class "summary.lasso".
Summary of Source Detection Results

    ## S3 method for class 'source_detection'
    summary(object, ...)

| Argument | Description |
|---|---|
| object | An object of class "source_detection". |
| ... | Further arguments (unused). |

An object of class "summary.source_detection".
Summary of TransHDM Mediation Analysis

    ## S3 method for class 'TransHDM'
    summary(object, top = 10, digits = 4, ...)

| Argument | Description |
|---|---|
| object | An object of class "TransHDM". |
| top | Integer (default: 10), maximum number of mediators to display. |
| digits | Number of digits for rounding estimates (default: 4). |
| ... | Further arguments (unused). |

An object of class "summary.TransHDM".
The TransHDM function performs high-dimensional mediation analysis under a transfer learning framework. It identifies and estimates indirect (mediation) effects of a high-dimensional set of mediators between an exposure and an outcome by integrating a target dataset and one or more source datasets.

    TransHDM(target_data, source_data = NULL, Y, D, M, X, transfer = FALSE,
             verbose = TRUE, ncore = 1, topN = NULL, dblasso_SIS = FALSE)

| Argument | Description |
|---|---|
| target_data | A data frame containing the target dataset. All variables must be numeric. |
| source_data | A list of data frames containing source datasets (optional, default: NULL). All variables must be numeric and have the same column names as target_data. |
| Y | Character string specifying the outcome variable name. |
| D | Character string specifying the exposure (treatment) variable name. |
| M | Character vector specifying mediator variable names. |
| X | Character vector specifying covariate variable names. |
| transfer | A logical value (default: FALSE). |
| verbose | A logical value (default: TRUE). |
| ncore | An integer (default: 1) specifying the number of CPU cores to use for parallel computation when fitting mediator models. |
| topN | An integer (default: NULL). |
| dblasso_SIS | A logical value (default: FALSE). |

A list with the following components:
contributions: A data frame of identified mediators with columns mediator (mediator name), alpha (estimated exposure-mediator effect), alpha_pv (p-value for the exposure-mediator effect), beta (estimated mediator-outcome effect), beta_pv (p-value for the mediator-outcome effect), alpha_beta (estimated indirect (mediation) effect), ab_pv (joint p-value for the indirect effect), and pa (proportion of the total effect mediated).
effects: A data frame summarizing the total indirect effect, direct effect, total effect, and proportion mediated.
IDE_est: A numeric vector of indirect effect estimates for all specified mediators, with non-selected mediators set to zero.
DE_est: The estimated direct effect of the exposure on the outcome.
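The arithmetic linking these quantities can be sketched with made-up numbers. The sketch assumes the standard product-of-coefficients decomposition for linear models (total effect = direct effect + sum of indirect effects); the values are hypothetical, and the package's exact definitions of pa and the summary effects are not confirmed on this page.

```r
# Hypothetical path estimates for three selected mediators (not package output)
contrib <- data.frame(
  mediator = c("M1", "M2", "M3"),
  alpha    = c(0.50, 0.40, -0.20),   # exposure -> mediator
  beta     = c(0.30, 0.25,  0.10)    # mediator -> outcome
)
DE_est <- 0.45                        # hypothetical direct effect

# Product-of-coefficients indirect effect for each mediator
contrib$alpha_beta <- contrib$alpha * contrib$beta

# Totals under the assumed additive decomposition
total_indirect <- sum(contrib$alpha_beta)
total_effect   <- DE_est + total_indirect

# Share of the total effect carried by each mediator, and overall
contrib$pa    <- contrib$alpha_beta / total_effect
prop_mediated <- total_indirect / total_effect

print(contrib)
print(c(indirect = total_indirect, direct = DE_est,
        total = total_effect, proportion = prop_mediated))
```

A mediator whose two paths have opposite signs, such as M3 here, contributes a negative indirect effect and therefore a negative share.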
Pan L, Liu Y, Huang C, Lin R, Yu Y, Qin G. Transfer learning reveals the mediating mechanisms of cross-ethnic lipid metabolic pathways in the association between APOE gene and Alzheimer's disease. Brief Bioinform. 2025;26(5):bbaf460. doi:10.1093/bib/bbaf460

    set.seed(123)

    # Target data
    target_data <- gen_simData_homo(n = 50, p_x = 3, p_m = 20, rho = 0.1)$data

    # Source data
    source_data <- gen_simData_homo(n = 100, p_x = 3, p_m = 20, rho = 0.1,
                                    source = TRUE, transferable = TRUE)$data

    # Run TransHDM
    result <- TransHDM(
      target_data = target_data,
      source_data = source_data,
      Y = "Y", D = "D",
      M = paste0("M", 1:20),
      X = paste0("X", 1:3),
      transfer = TRUE, ncore = 1, topN = 10
    )
    summary(result)