A Nextflow- and R-based pipeline to organize the multiple steps of QTL discovery analysis. Built under Nextflow DSL2.
We developed the yQTL Pipeline – with ‘y’ indicating the dependent quantitative variable being modeled – to facilitate and automate large-scale QTL analysis. Prior to genome-wide association test, the pipeline supports the calculation or the direct input of pre-defined genome-wide principal components and genetic relationship matrix when applicable. User-specified covariates may also be supplied. Depending on the presence or absence of familial relatedness among the subjects, genome-wide association tests will be conducted using either a linear mixed-effect model or a linear model, respectively. Alternatively, the user has the option to treat the genotype as a categorical variable and conduct an ANOVA test, or assess the significance of the interaction between the SNP and a covariate. Through the adoption of the workflow management tool Nextflow, the pipeline parallelizes the analysis steps to optimize run-time and ensure reproducibility of the results. A user-friendly R Shiny App is also provided for the visualization of the results, including Manhattan plots of user-selected phenotype traits, and trait-QTL connection networks based on user-specified p-value thresholds. [Preprint]
- 1.Setup
- 2.Run the yQTL-Pipeline
- 3.Input Files and Parameters
- 4.Input File Examples
- 5.Output Descriptions
- 6.Workflow Process Details
- 7.Shiny App
Install the required R packages:
install.packages("dplyr")
install.packages("tidyverse")
install.packages("data.table")
install.packages("qqman")
install.packages("MatrixEQTL")
BiocManager::install("Biobase")
BiocManager::install("SNPRelate")
BiocManager::install("SeqArray")
BiocManager::install("SeqVarTools")
BiocManager::install("GENESIS")
The pipeline is divided into three nf scripts: Prepare.nf
, Analysis.nf
and Report.nf
.
- When genotype data is in GDS format and genetic PCs (and GRM when individual relatedness is present) are available, only
Analysis.nf
andReport.nf
are required. - When genotype data is in VCF format, and/or genetic PCs and GRM are needed,
Prepare.nf
must be run beforeAnalysis.nf
andReport.nf
.
See below in 6.Workflow Process Details for detailed description of the processes included in each of the Nextflow scripts.
Nextflow requires a POSIX-compatible system (Linux, macOS, etc.) and Java 8 (Java 11 or later, up to 18) to be installed. Use the following command to download the executable. Once downloaded, optionally, you may make the Nextflow file accessible by your $PATH variable so you do not need to specify the full path to Nextflow each time.
$ cd /path/to/download/nextflow/
$ curl -s https://get.nextflow.io | bash
Sometimes, Nextflow may already be installed as a module on your machine. Use module avail nextflow
to check. If so, you do not need to download the Nextflow executable.
Create a Config.config
and specify input file paths and parameters. See the following 3.Input Files and Parameters for details.
When running on local:
$ cd /path/to/work/directory
$ module load R/4.1.2
$ $NF/nextflow -c Config.config run $PIPLINE/Prepare.nf
$ $NF/nextflow -c Config.config run $PIPLINE/Analysis.nf
$ $NF/nextflow -c Config.config run $PIPLINE/Report.nf
$NF is the directory where the Nextflow executable is located, and $PIPLINE is the directory where the yQTL Pipeline is installed.
When running on a shared computer cluster:
- Set up the bash job parameters in
configs/sge.config
. - Modify the 1st line of
Config.config
to use sge.config:includeConfig 'configs/sge.config'
. - Submit the code above as a bash job.
When Nextflow is already installed as a module, an executable is not required. For example:
$ cd /path/to/work/directory
$ module load R/4.1.2
$ module load nextflow
$ nextflow -c Config.config run $PIPLINE/Prepare.nf
$ nextflow -c Config.config run $PIPLINE/Analysis.nf
$ nextflow -c Config.config run $PIPLINE/Report.nf
Starting from a VCF file input, calculating PCs, and using MatrixeQTL to obtain QTL results.
- Navigate to the directory < example >.
cd /path/to/the/yQTLpipeline/example
- Launch R and run
Rscript GenerateExample_M.R.
This will write out the filesExampleConfig_M.config
andExample_M.sh
, as well as update the < data > directory to set up the input file paths. - Run
bash Example_M.sh
. - Example results will be generated in the folder < ExampleResults_M >.
Starting from a GDS file, pre-defined PCs, and GRM, using GENESIS to obtain QTL results.
- Navigate to the directory < example >.
cd /path/to/the/yQTLpipeline/example
- Launch R and run
Rscript GenerateExample_G.R
.
This will write out the filesExampleConfig_G.config
andExample_G.sh
, as well as update the < data > directory to set up the input file paths. - Run
bash Example_G.sh
. - Example results will be generated in the folder < ExampleResults_G >.
params.outdir
Directory for result storage.params.pipeline_engine
Use "genesis" or "g" to run GENESIS when individual relatedness exists and a genetic relationship matrix (GRM) is required. Use "matrixeqtl" or "m" to run MatrixeQTL when samples are independent. Processes will start with "G_" or "M_" to indicate the selected engine.params.model_type
Mandatory for MatrixeQTL. Specify as "linear" for additive linear model, "category" for genotype as categories with ANOVA, or "interaction" for SNP-covariate interaction specified inparams.interaction_cvrt
.params.genodat_format
Choose "vcf" or "gds".params.vcf_list
andparams.gds_list
Path to CSV file pointing to VCF or GDS file locations, with two columns (no header). First column: user-defined names for genotype files; Second column: full file paths. For VCF files, also setparams.gds_list = "${params.outdir}/1_data/gds_list.csv"
.params.genotype_or_dosage
Use "genotype" or "GT" for VCF/GDS "GT" entry, or "dosage" or "DS" for "DS" entry.params.phenotype_file
Path to phenotype data file. It should be a TXT, CSV, or RDS data frame with samples as rows and phenotypes as columns. It must include a "sample.id" column to match with genotype data. Optionally, it can include covariates.params.pheno_name
TXT file with one phenotype name per row for analysis, or "all" to analyze all non-"sample.id" columns in the phenotype file.
Please ensure none of these optional parameters are left empty. Use "NA" when they're not applicable.
params.covariates_file
Covariates file, following the same format as the phenotype file, or "${params.phenotype_file}" if covariates are included there.params.covariates
String specifying covariates to include, separated by commas without spaces. PCs must be specified, whether provided or calculated. Example: "age,gender,facility,PC1,PC2".params.covariates_factor
String specifying categorical covariates. Example: "gender,facility".params.PC_rds
Data frame in RDS format with samples as rows and PCs as columns. Requires a "sample.id" column matching sample IDs in genotype data. Alternatively, setparams.PC_rds = "${params.outdir}/1_data/PCs.rds"
if estimated by the pipeline.params.GRM_rds
Genetic relationship matrix (GRM) in RDS format with sample IDs in genotype data as row and column names. Setparams.GRM_rds = "${params.outdir}/1_data/GRM.rds"
if estimated by the pipeline.params.snpset_assoc_txtfile
TXT file with one SNP per row for analysis, matching "annotation/id" in genotype file. Alternatively, input "NA" to include all available SNPs.params.userdef_sampleid_txtfile
TXT file with one sample ID per row for analysis. Alternatively, input "NA" to include all available samples.
params.max_forks_parallel
Numeric value controlling the max number of parallel processes in QTL discovery.params.max_pheno_parallel
Numeric value controlling the maximum number of phenotypes per QTL discovery process in MatrixeQTL. Increasing it reduces computation time but increases memory usage in the M_QTL_analysis process.params.pval_cutoff
Numeric value setting the threshold for saving and reporting QTL results.params.output_result_csv
Boolean, "true" or "false" (lowercase with quotes). QTL results are saved in RDS format by default. Set to "true" to also output results in CSV format.
PCA Parameters
params.start_PC
Boolean, true or false (lowercase, without quotes). Set to true to estimate genetic PCs and GRM using the pipeline.
Whenparams.start_PC = true
, set up the following:params.snpset_PCA_txtfile
Optionally specify a TXT file containing SNP IDs for genetic PC estimation. SNP IDs match the genotype file's annotation entry. Alternatively, input "NA" to prune SNPs using LD.params.pcair_kinthresh
Numeric value setting threshold for pairwise kinship coefficients in kinobj, declaring relatedness. Commonly used value is 11, corresponding to 4th degree relatives. See PC-Air documentation for details. See PC-Air documentation for more details.params.pcair_divthresh
Numeric value setting threshold for pairwise ancestry divergence measures in divobj, declaring ancestral divergence. Commonly used value is 11, corresponding to -2^(-11/2) ~ -0.022. See PC-Air documentation for more details.
Plotting Parameters
These parameters solely control plot generation and do not filter QTL results in RDS or CSV.
params.draw_genopheno_boxplot
Set to "true" to generate genotype-phenotype boxplots for the most significant variant in each genotype file, if it passes the specified p-value threshold.params.boxplot_p_cutoff
P-value threshold for determining significance of the top SNP.params.plot_mac
Numeric value determining the minimum minor allele count (MAC) for inclusion of a SNP in QQ plot, Manhattan plot, and Miami plot, to avoid artefacts from too few observations. Common values for rare SNPs include 3 or 5, while 20 for common SNPs.params.plot_resolution
Numeric value setting the resolution (dpi) for saving the plots. Common values include 72, 100, and 120.params.plot_size
Numeric value controlling the size of the plots in pixels. Common values include 400 and 600.
PreQC We offer a supporting script, PreQC.nf, for quality filtering of input VCF files.
params.hwe_p
Hardy-Weinberg Equilibrium p-value. Variants with a p-value below this threshold will be considered to violate HWE and will be removed. Input 1 to avoid filtering out any variants.params.min_mac
Numeric value. Only variants with a minor allele count equal to or higher than this threshold will be retained.params.max_missing_allowed
Numeric value ranging from 0 to 1. It sets the maximum acceptable missing rate. For instance, inputting 0.1 will remove variants with a missingness rate exceeding 10%.
Note: All TXT and CSV files must end with a complete final line. This means you should see an empty line at the end of the file when opened in a text editor.
CSV file to specify genotype file paths (without header):
chr_1 | /yQTL-Pipeline/data/gds/chr_1.gds |
chr_2 | /yQTL-Pipeline/data/gds/chr_2.gds |
chr_3 | /yQTL-Pipeline/data/gds/chr_3.gds |
Phenotype data frame with covariates:
sample.id | age | sex | gene1 | gene2 |
---|---|---|---|---|
HG00110 | 40 | F | 7.001 | 6.540 |
HG00116 | 50 | M | 7.025 | 6.440 |
HG00120 | 65 | F | 8.035 | 6.420 |
HG00128 | 35 | F | 7.015 | 6.154 |
GRM:
HG00110 | HG00116 | HG00120 | HG00128 | |
---|---|---|---|---|
HG00110 | 1.0332116 | -0.0179534 | 0.0070812 | -0.0114037 |
HG00116 | -0.0179534 | 0.9901158 | 0.1161200 | -0.0369330 |
HG00120 | 0.0070812 | 0.1161200 | 0.9772376 | -0.0595185 |
HG00128 | -0.0114037 | -0.0369330 | -0.0595185 | 0.9500809 |
The results will be saved in the directory specified in params.outdir
, with the following subdirectories.
- < 1_data >. Includes covariates and sample IDs used. If the input genotype data was in VCF format, the converted GDS files will also be saved here.
- < 1_phenotype_data >. Includes the phenotype data after splitting the original input into multiple chunks (when params.pipeline_engine = "matrixeqtl") or single phenotype (when params.pipeline_engine = "genesis").
- < 2_SNP_info >. Contains information about each SNP obtained from the GDS file.
- < 3_individual_results >. Contains QTL association results of each genotype data with each phenotype chunk (when params.pipeline_engine = "matrixeqtl") or each phenotype (when params.pipeline_engine = "genesis"), without SNP information.
- < 4_individual_results_SNPinfo >. Combines < 3_individual_results > with the corresponding SNP information in < 2_SNP_info >.
- < 5_Results_Summary >. Contains the QTL results for each phenotype in RDS format, along with a count table of the number of identified QTLs. Additionally, includes phenotype expression histograms, QQ plots, Manhattan plots, and Miami plots for each phenotype. If QTL results are too large to merge into a single file, you can obtain and merge them in <4_individual_results_SNPinfo>.
All logs for each of the analysis steps will be saved in each of the subfolders.
proceed with caution: Nextflow saves all intermediate results in the work/
directory. Once the pipeline successfully finishes, it's recommended to discard the work/
directory.
proceed with caution: QTL results can be very large. Once the user confirms that all SNP and phenotype associations are saved in <5_Results_Summary>, the results in <2_SNP_info>, <3_individual_results>, and <4_individual_results_SNPinfo> can be discarded since <5_Results_Summary> is the merged version of them. However, it's highly recommended to save a copy of the null models for each phenotype (3_individual_results/nullmod_[phenotype]_[gds].rds
).
The processes included in each of the Nextflow scripts are:
Prepare.nf
vcf_to_gds
Converts VCF files to GDS format.write_gds_list
Writes the resulting GDS file paths into a CSV file for Analysis.nf to use.merge_gds
Merges the GDS files for the following process.PCA
orPCA_GRM
Performs genetic PCA analysis using PC-AiR and obtains GRM using PC-Relate, utilizing the GDS generated frommerge_gds
.
Analysis.nf
get_shared_sampleid
Obtains shared sample IDs from input genotype files, phenotype file, and optional input files such as covariate file, PC RDS file, and user-defined sample ID TXT file. These sample IDs are used in the analysis.split_pheno
Splits phenotype data into chunks or single phenotypes for parallel analysis.prepare_cvrt
Reads and processes covariate file, handling categorical covariates.SNP_info
Retrieves SNP information from the GDS file.
When using GENESIS, this includes variant ID, REF (reference allele), ALT.ca (alternative allele, coded allele), and snpID (annotation in genotype file).
When using MatrixeQTL, additional entries such as chr (chromosome), pos (position), freq.ca (frequency of coded allele), freq.minor (frequency of minor allele), miss (missing rate), n.obs (number of observances), and MAC (minor allele count) are included. These columns are automatically returned by GENESIS.QTL_analysis
Performs association tests of SNPs from each genotype data file with each phenotype.merge_info_QTL
Combines QTL results obtained from QTL_analysis with corresponding SNP information from the SNP_info process.QTL_count_wrap
Counts the number of QTLs returned by the QTL_analysis process.
Report.nf
QTL_results_wrap
For each phenotype, merge the QTL results from QTL_count_wrap in Analysis.nf and create phenotype histogram plot, QQ plot, Manhattan plot, and Miami plot.
We've developed an R Shiny App available in the App folder to facilitate downstream visualization. Simply open app.R using RStudio and click the Run App
button at the top right.
The uploaded QTL result is a data frame that includes the following columns:
- chr
- pos
- snpID
- phenotype
- pvalue / Score.pval
- beta / beta.ca / Est / Est.ca
- mac / MAC
An example is provided as N2-acetyl,N6-methyllysine_Example.rds
in the < example > folder.
Select a phenotype from the dropdown menu, specify the significance p-value threshold, and optionally provide a list of SNPs to highlight in the plots.
First, upload a GDS file and a phenotype file in RDS or CSV format. Specify a variant from the "annotation/id" entry in the GDS file. Click "Retrieve Data". If successful, a new dropdown menu and plotting button will appear.
For example, you can use 'data/gds/chr_1.gds' for the GDS file and 'data/pheno_file.csv' for the phenotype file. Select variant '1:1000156'.
Second, select a phenotype from the new dropdown menu and click "Plot Boxplot".
QTL results consist of multiple phenotypes and variants. Therefore, we offer an option to visualize the associations as a network. For example, upload the Network_Example.rds
provided in the < example > folder to the "Choose QTL result" file input at the top.
Next, select a p-value threshold and MAC (minor allele count) threshold, and plot the network.
Since genetic variants are often in high linkage disequilibrium (LD) with each other, only the top variant in each chromosome will be included.
-end-