High-Dimensional Mediation Analysis
A Guide to Using the HIMA Package
The HIMA Development Team
2025-01-27
Source:vignettes/hima-vignette.Rmd
hima-vignette.Rmd
Introduction
Mediation analysis is a statistical method used to explore the mechanisms by which an independent variable influences a dependent variable through one or more intermediary variables, known as mediators. This process involves assessing the indirect, direct, and total effects within a defined statistical framework. The primary goal of mediation analysis is to test the mediation hypothesis, which posits that the effect of an independent variable on a dependent variable is partially or fully mediated by intermediate variables. By examining these pathways, researchers can gain a deeper understanding of the causal relationships among variables. Mediation analysis is particularly valuable in refining interventions to enhance their effectiveness in clinical trials and in observational studies, where it helps identify key intervention targets and elucidate the mechanisms underlying complex diseases.
Package Overview
The HIMA
package provides robust tools for estimating
and testing high-dimensional mediation effects, specifically designed
for modern omic
data, including epigenetics,
transcriptomics, and microbiomics. It supports high-dimensional
mediation analysis across continuous, binary, and survival outcomes,
with specialized methods tailored for compositional microbiome data and
quantile mediation analysis.
At the core of the package is the hima
function, a
flexible and powerful wrapper that integrates various high-dimensional
mediation analysis methods for both effect estimation and hypothesis
testing. The hima
function automatically selects the most
suitable analytical approach based on the outcome and mediator data
type, streamlining complex workflows and reducing user burden.
hima
is designed with extensibility in mind, allowing
seamless integration of future features and methods. This ensures a
consistent, user-friendly interface while staying adaptable to evolving
analytical needs.
Data Preparation and Settings
The hima
function provides a flexible and user-friendly
interface for performing high-dimensional mediation analysis. It
supports a variety of analysis methods tailored for continuous, binary,
survival, and compositional data. Below is an overview of the
hima
function and its key parameters:
hima
Function Interface
hima(
formula, # The model formula specifying outcome, exposure, and covariate(s)
data.pheno, # Data frame with outcome, exposure, and covariate(s)
data.M, # Data frame or matrix of high-dimensional mediators
mediator.type, # Type of mediators: "gaussian", "negbin", or "compositional"
penalty = "DBlasso", # Penalty method: "DBlasso", "MCP", "SCAD", or "lasso"
quantile = FALSE, # Use quantile mediation analysis (default: FALSE)
efficient = FALSE,# Use efficient mediation analysis (default: FALSE)
scale = TRUE, # Scale data (default: TRUE)
sigcut = 0.05, # Significance cutoff for mediator selection
contrast = NULL, # Named list of contrasts for factor covariate(s)
subset = NULL, # Optional subset of observations
verbose = FALSE # Display progress messages (default: FALSE)
)
To use the hima
function, ensure your data is prepared
according to the following guidelines:
1. Formula Argument (formula
)
Define the model formula to specify the relationship between the
Outcome
, Exposure
, and
Covariate(s)
. Ensure the following:
General Form: Use the format
Outcome ~ Exposure + Covariate(s)
. Note that theExposure
variable represents the exposure of interest (e.g., “Treatment” in the demo examples) and it has to be listed as the first independent variable in the formula.Covariate(s)
are optional.Survival Data: For survival analysis, use the format
Surv(time, event) ~ Exposure + Covariate(s)
. See data examplesSurvivalData$PhenoData
for more details.
2. Phenotype Data (data.pheno
)
The data.pheno
object should be a
data.frame
or matrix
containing the phenotype
information for the analysis (without missing values). Key requirements
include:
Rows: Represent samples.
Columns: Include variables such as the outcome, treatment, and optional covariate(s).
Formula Consistency: Ensure that all variables specified in the
formula
argument (e.g.,Outcome
,Treatment
, andCovariate(s)
) are present indata.pheno
.
3. Mediator Data (data.M
)
The data.M
object should be a data.frame
or
matrix
containing high-dimensional mediators (without
missing values). Key requirements include:
Rows: Represent samples, aligned with the rows in
data.pheno
.Columns: Represent mediators (e.g., CpGs, genes, or other molecular features).
-
Mediator Type: Specify the type of mediators in the
mediator.type
argument. Supported types include:-
"gaussian"
for continuous mediators (default, e.g., DNA methylation data). -
"negbin"
for count data (e.g., transcriptomic data). -
"compositional"
for microbiome or other compositional data.
-
4. About data scaling
In most real-world data analysis scenarios, scale
is
typically set to TRUE
, ensuring that the exposure (variable
of interest), mediators, and covariate(s) (if included) are standardized
to a mean of zero and a variance of one. No scaling will be applied to
Outcome
. However, if your data is already
pre-standardized—such as in simulation studies or when using our demo
dataset-scale
should be set to FALSE
to
prevent introducing biases or altering the original data structure.
When applying HIMA
to simulated data, if
scale
is set to TRUE
, it is imperative to
preprocess the mediators by scaling them to have a mean of zero and a
variance of one prior to generating the outcome variables.
Applications and Examples
Continuous Outcome Analysis
When analyzing continuous and normally distributed outcomes and mediators, we can use the following code snippet:
data(ContinuousOutcome)
hima_continuous.fit <- hima(
Outcome ~ Treatment + Sex + Age,
data.pheno = ContinuousOutcome$PhenoData,
data.M = ContinuousOutcome$Mediator,
mediator.type = "gaussian",
penalty = "DBlasso",
scale = FALSE # Demo data is already standardized
)
summary(hima_continuous.fit, desc=TRUE)
# `desc = TRUE` option to show the description of the output results
penalty = "DBlasso"
is particularly effective at
identifying mediators with weaker signals compared to
penalty = "MCP"
. However, using DBlasso
requires more computational time.
Efficient HIMA
For continuous and normally distributed mediators and outcomes, an
efficient HIMA method can be activated with the
efficient = TRUE
option (penalty should be MCP
for the best results). This method may also provide greater statistical
power to detect mediators with weaker signals.
hima_efficient.fit <- hima(
Outcome ~ Treatment + Sex + Age,
data.pheno = ContinuousOutcome$PhenoData,
data.M = ContinuousOutcome$Mediator,
mediator.type = "gaussian",
efficient = TRUE,
penalty = "MCP",
scale = FALSE # Demo data is already standardized
)
summary(hima_efficient.fit, desc=TRUE)
# Note that the efficient HIMA is controlling FDR
It is recommended to try different penalty
options and
efficient
option to find the best one for your data.
Survival Outcome Analysis
For survival data, HIMA
incorporates a Cox proportional
hazards approach. Here is an example of survival outcome analysis using
HIMA
:
Microbiome Mediation Analysis
For compositional microbiome data, HIMA
employs
isometric Log-Ratio transformations. Here is an example of microbiome
mediation analysis using HIMA
:
Quantile Mediation Analysis
Perform quantile mediation analysis using the
quantile = TRUE
option and specify tau
for
desired quantile(s):
data(QuantileData)
hima_quantile.fit <- hima(
Outcome ~ Treatment + Sex + Age,
data.pheno = QuantileData$PhenoData,
data.M = QuantileData$Mediator,
mediator.type = "gaussian",
quantile = TRUE,
penalty = "MCP",
tau = c(0.3, 0.5, 0.7),
scale = FALSE # Demo data is already standardized
)
summary(hima_quantile.fit)