The major change from Version 2 to Version 3 is the reorganisation of the repo so that the different workflows are in separate directories.
This means that instead of running `nextflow run h3abionet/h3agwas/assoc.nf`, you should run `nextflow run h3abionet/h3agwas/assoc/main.nf`.
In addition to this README we have a detailed tutorial and videos
- These can be found at http://www.bioinf.wits.ac.za/gwas
- 2021-02-18: added a pipeline to build example data using the GWAS Catalog and 1000 Genomes (build_example_data)
- 2021-02-16: added a report to the vcf-in-plink conversion, with analysis of frequencies and score (formatdata)
- 2021-01-22: created a utils folder holding the Metasoft binary and utilities (server down)
- 2020-12-08: added meta-analysis with plink (assoc)
- 2020-12-01: added plink GxE; added estimation of beta and se (assoc)
- 2020-11-17: added a Nextflow module to convert vcf to bgen format (formatdata)
- 2020-07-27: added qualitative covariates to fastgwa (assoc)
- 2020-07-27: new Nextflow module to transform imputed vcf into bimbam format (formatdata)
- 2020-06-03: new Nextflow module to transform plink files into vcf files, with allele checking for imputation (formatdata)
- 2020-05-18: fixed a bug in the gcta heritability computation (assoc)
- 2020-03-27: added a module to convert positions between different genome versions (formatdata)
- 2020-02-20: support for awsbatch
- 2020-02-20: added fastgwa (gcta software) as association software (assoc)
- 2019-10-01: added a Nextflow script to format GWAS output, adding your own rs IDs, frequencies, N, etc. (useful for post-analysis) (formatdata)
  - file: formatdata/format_gwasfile.nf
- 2019/09/19: added the option for multiple variance components for boltlmm to the estimation of heritability (assoc)
- 2019/09/17: added formatting and analysis with mtag (assoc)
- 2019/09/16: added two new Nextflow files to convert data (formatdata):
  - formatdata/vcf_in_plink.nf: format vcf data for plink
  - formatdata/vcf_in_impute2.nf: extract impute2 data from Sanger vcf
- 2019/09/10: updated the estimation of heritability in assoc to account, for each software, for cases where heritability can't be computed
H3Agwas is a simple human GWAS analysis workflow for data quality control (QC) and basic association testing developed by H3ABioNet. It is an extension of the witsGWAS pipeline for human genome-wide association studies built at the Sydney Brenner Institute for Molecular Bioscience. H3Agwas uses Nextflow as the basis for workflow management and has been dockerised to facilitate portability.
The original version of the H3Agwas was published in June 2017 with minor updates and bug fixes through the rest of the year. Based on experience with large data sets, the pipelines were considerably revised with additional features, reporting and a slightly different workflow.
We have moved all scripts from Python 2 to Python 3, so you will need to have Python 3 installed.
Please ignore the Wiki in this version, which refers to version 1.
Problems with the workflow should be raised as an issue on this GitHub repo (if you think the problem is the workflow).
If you need help with using the workflow, please log a call with the H3A Help Desk.
- Features
- Installing the pipeline
- A quick start example
- The Nextflow configuration file
- Running the workflow in different environments and Advanced options: Docker, PBS, Singularity, Amazon EC2
- Dealing with errors
- Auxiliary Programs
- Acknowledgement, Copyright and general
The goal of this pipeline is to provide a portable and robust pipeline for performing a genome-wide association study.
There are three separate workflows that make up h3agwas:
- call2plink: Conversion of Illumina genotyping reports with TOP/BOTTOM or FORWARD/REVERSE calls into PLINK format, aligning the calls.
- qc: Quality control of the data. This is the focus of the pipeline. It takes PLINK data as input and has the following functions (see the README of qc):
  - Sample QC tasks checking:
    - discordant sex information
    - calculating missingness
    - heterozygosity scores
    - relatedness
  - SNP QC tasks checking:
    - batch reports
    - remove duplicates
    - minor allele frequencies
    - SNP missingness
    - differential missingness
    - Hardy Weinberg Equilibrium deviations
- assoc: Association study. A simple association analysis is done. The purpose of this is to give users an introduction to their data. Real studies, particularly those of the H3A consortium, will have to handle complex co-variates and particular population study. We encourage users of our pipeline to submit their analysis for the use of other scientists (see the README of assoc/).
  - Basic PLINK association tests, producing manhattan and qqplots
  - CMH association test -- association analysis, accounting for clusters
  - permutation testing
  - logistic regression
  - Efficient Mixed Model Association testing with gemma, boltlmm or fastlmm
  - Gene environment association with gemma or plink
  - Other scripts provided for post-analysis:
    - assoc/cojo-assoc.nf: Conditional & joint (COJO) analysis of GWAS summary statistics, without individual-level genotype data, with gcta
    - assoc/esth2-assoc.nf: estimate heritability and co-heritability with gcta, ldsc, gemma and bolt
    - assoc/meta-assoc.nf: meta-analysis with summary statistics
    - assoc/permutation-assoc.nf: permutation test to re-evaluate p-values with gemma
    - assoc/simul-assoc.nf: simulation of bed file
- formatdata: additional scripts to format data, add missing information, etc.
The goal of the H3ABionet GWAS pipeline is to provide a portable and robust pipeline for reproducible genome-wide association studies.
A GWAS requires a complex set of analyses with complex dependencies between the analyses. We want to support GWAS work by supporting:
- reproducibility -- we can rerun the entire analysis from start to finish;
- reusability -- we can run the entire analysis with different parameters in an efficient and consistent way;
- portability -- we can run the analysis on a laptop, on a server, on a cluster, in the cloud. The same workflow can be used for all environments, even if the time taken may change.
We took the following into account:
- The anticipated users are heterogeneous both in terms of their bioinformatics needs and the computing environments they will use.
- Each GWAS is different -- it must be customisable to allow the bioinformaticists to set different parameters.
There are two key technologies that we use, Nextflow and Docker, both of which support our design principles.
Nextflow is a workflow language designed at the Centre for Genomic Regulation, Barcelona. Although it is a general workflow language for science, it comes out of a bioinformatics group and strongly supports bioinformatics.
Our pipeline is built using Nextflow. However, users do not need to know anything about Nextflow. Obviously if you can do some programming you can customise and extend the pipelines, but you do not need to know Nextflow yourself.
Nextflow is very easy to install and is highly portable. It supports partial execution and pipelines that scale. Nextflow supports our workflow requirements very well.
A GWAS requires several software tools to be installed. Using Docker we can simplify the installation. Essentially, Docker wraps up all software dependencies into containers. Instead of installing all the dependencies, you can install Docker and then use our containers. (In fact you don't need to explicitly install our containers; Nextflow and our workflow will do that for you automatically.)
We expect that many of our users will use Docker. However, we recognise that this won't be suitable for everyone because many high performance computing centres do not support Docker for security reasons. It is possible to run our pipeline without Docker, and we give instructions about which software needs to be installed.
Similarly we support Singularity. Although it's a new feature, we've tested it at two different organisations and it's worked flawlessly.
The h3agwas pipeline can be run in different environments; the requirements differ. The different modes are described in detail below.
- Running on Docker/Singularity. This is the easiest way of running h3agwas. We have a set of Docker containers that have all the required executables and libraries.
- Running natively on a local computer -- this requires a number of external executables and libraries to be installed.
- Running with a scheduler -- Nextflow supports a range of schedulers. Our pipeline supports using Docker or running natively.
- Running on Amazon EC2. You need to have Amazon AWS credentials (and a credit card). Our EC2 pipeline uses Docker so this is very easy to run.
- We have also used Docker swarm. If you have a Docker swarm it's easy to do.
We now explore these in detail.
All modes of h3agwas have the following requirements:
- Java 8 or later
- Nextflow. To install Nextflow, run the command below. It creates a nextflow executable in the directory where you ran the command. Move the executable to a directory on the system or user PATH and make sure it is executable. You need to be running Nextflow 0.27 (January 2018) or later.
curl -fsSL get.nextflow.io | bash
If you don't have curl, you can use wget instead.
- Git (this is probably already installed)
If you install Docker or Singularity, you do not need to install all the other dependencies. Docker is available on most major platforms. See the Docker documentation for installation for your platform. Singularity works very well on Linux.
That's it.
This requires a standard Linux installation or macOS. It requires bash to be available as the shell of the user running the pipeline.
The following code needs to be installed and placed in a directory on the user's PATH.
- plink 1.9 [Currently, it will not work on plink 2, though it is on our list of things to fix. It probably will work on plink 1.05, but just use plink 1.9]
- LaTeX. A standard installation of texlive should have all the packages you need. If you are installing a lightweight TeX version, you need the following packages, which are part of texlive: fancyhdr, datetime, geometry, graphicx, subfig, listings, longtable, array, booktabs, float, url.
- python 3.6 or later. pandas, numpy, scipy, matplotlib and openpyxl need to be installed. You can install these by saying:
pip3 install pandas
etc.
If you want to run the assoc pipeline then you should install gemma and fastlmm if you are using those options.
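Before a native run it can help to confirm that the tools listed above are visible. The following is a minimal sketch, not part of the workflow: the executable names (`plink`, `pdflatex`, and so on) are assumptions about how the tools are installed on your system, so adjust the lists to match your setup.

```python
import importlib.util
import shutil

# Names are assumptions based on the requirements listed above; binaries may
# be installed under different names on your system (e.g. plink1.9).
REQUIRED_EXECUTABLES = ["java", "nextflow", "git", "plink", "pdflatex"]
REQUIRED_PACKAGES = ["pandas", "numpy", "scipy", "matplotlib", "openpyxl"]


def missing_executables(names):
    """Return the names that cannot be found on the current PATH."""
    return [name for name in names if shutil.which(name) is None]


def missing_packages(names):
    """Return the Python packages that cannot be located for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    for label, missing in (
        ("executables", missing_executables(REQUIRED_EXECUTABLES)),
        ("Python packages", missing_packages(REQUIRED_PACKAGES)),
    ):
        print(f"Missing {label}: {', '.join(missing) or 'none'}")
```

If you use gemma or fastlmm for the assoc pipeline, add their executable names to the first list.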
There are two approaches: let Nextflow manage this for you, or download using Git. The former is easier; you need to use Git if you want to change the workflow.
To download the workflow you can say
nextflow pull h3abionet/h3agwas
If we update the workflow, the next time you run it, you will get a warning message. You can do another pull to bring it up to date.
If you manage the workflow this way, you will run the scripts, as follows
nextflow run h3abionet/h3agwas/call2plink/main.nf .....
nextflow run h3abionet/h3agwas/qc/main.nf .....
nextflow run h3abionet/h3agwas/assoc/main.nf .....
Change to the directory where you want to install the software and say
git clone https://github.com/h3abionet/h3agwas.git
This will create a directory called h3agwas with all the necessary code. If you manage the workflow this way, you will run the scripts this way:
nextflow run SOME-PATH/call2plink .....
nextflow run SOME-PATH/qc .....
nextflow run SOME-PATH/assoc .....
where SOME-PATH is a relative or absolute path to where the workflow was downloaded.
This section shows a simple run of the qc pipeline that should run out of the box if you have installed the software, Docker or Singularity. More details and general configuration will be shown later.
This section illustrates how to run the pipeline on a small sample data file with default parameters. For real runs, the data to be analysed and the various parameters to be used are specified in the nextflow.config files in the assoc, qc and call2plink folders. The details will be explained in another section.
Our quick start example will fetch the data from an Amazon S3 bucket, but if you prefer you can use a locally installed sample. If you have downloaded the software using Git, you can find the sample data in the downloaded directory. Otherwise you can download the files from http://www.bioinf.wits.ac.za/gwas/sample.zip and unzip them. The sample data to be used is in the input directory (in PLINK format as sampleA.bed, sampleA.bim, sampleA.fam). The default nextflow.config file uses this, and so you can run the workflow through with this example. Note that this is a very small PLINK data set with no X-chromosome information, so no sex checking is done.
This requires that all software dependencies have been installed (see later for Singularity or Docker).
We also assume the sample directory with data is in the current working directory.
nextflow run h3abionet/h3agwas/qc/main.nf --input_dir=s3://h3abionet/sample
If you have downloaded the sample data and the directory with the sample data is a sub-directory of your working directory, you could just say:
nextflow run h3abionet/h3agwas/qc/main.nf --input_dir=s3://h3abionet/sample
Change directory to the directory in which the workflow was downloaded
nextflow run qc
The workflow runs and output goes to the output directory. In the sampleA.pdf file, a record of the analysis can be found.
To run the workflow on another PLINK data set, say mydata.{bed,bim,fam}, say
nextflow run qc --input_pat mydata
(or `nextflow run h3abionet/h3agwas/qc --input_pat mydata`: for simplicity, for the rest of the tutorial we'll only present one way of running the workflow -- you should use the method that is appropriate for you)
If the data is in another directory, and you want the output to go elsewhere:
nextflow run qc --input_pat mydata --input_dir /data/project10/ --output_dir ~/results
There are many other options that can be passed on the command line. Options can also be given in the config file (explained below). We recommend putting options in the configuration file since these can be archived, which makes the workflow more portable.
Just add -profile docker to your run command -- for example,
nextflow run qc -profile docker
or
nextflow run h3abionet/h3agwas/qc/main.nf -profile docker --input_dir=s3://h3abionet/sample
Please note that the first time you run the workflow using Docker, the Docker images will be downloaded. Warning: this download is about 1GB, so it will consume bandwidth and take time depending on your network connection. The images are only downloaded the first time the workflow runs.
More options are shown later.
You may at some point want to run multiple, independent executions of the workflows at the same time (e.g. on different data). This is possible. However, each run should be started in a different working directory. You can refer to the scripts and even the data in the same directory, but the directories from which you run the nextflow run command should be different.
Nextflow uses parameters that are passed to it and the contents of a configuration file to guide its behaviour. By default, the configuration file used is nextflow.config. This includes specifying
- where the inputs come from and outputs go to;
- what the parameters of the various programs/steps are. For example, in QC you can specify what missingness cut-offs you want;
- the mode of operation -- for example, are you running it on a cluster? Using Docker?
To run your workflow, you need to modify the nextflow.config file, and then run nextflow. Remember that to make your workflow truly reproducible you need to save a copy of the config file. For this reason, although you can specify many parameters from the command line, we recommend using the config file since this makes your runs reproducible. It may be useful to use git or a similar tool to archive your config files.
You can use the -c option to specify another configuration file in addition to the nextflow.config file
nextflow run -c data1.config qc
This is highly recommended. We recommend that you keep the nextflow.config file as static as possible, perhaps not even modifying it from the default config. Then for any run or data set, have a much smaller config file that only specifies the changes you want made. The base nextflow.config file will typically contain config options that are best set by the H3Agwas developers (e.g., the names of the docker containers) or default GWAS options that are unlikely to change. In your separate config file, you will specify the run-specific options, such as data sets, directories or particular GWAS parameters you want. Both configuration files should be specified. For example, suppose I create a sub-directory within the directory where the nextflow file is (probably called h3agwas). Within the h3agwas directory I keep my nextflow.config file and the nextflow file itself. From the sub-directory, I run the workflow by saying:
nextflow run -c data1.config ../qc
This will automatically use the nextflow.config file in either the current or parent directory. Note that the config files are processed in order: if an option is set in two config files, the later one takes precedence.
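The precedence rule can be pictured as an ordered merge of option dictionaries. The sketch below is a conceptual illustration only, not how Nextflow is implemented; the option names come from this README, but the default values shown are hypothetical.

```python
def merge_configs(*configs):
    """Merge option dictionaries in order; values from later ones win."""
    merged = {}
    for config in configs:
        for key, value in config.items():
            if isinstance(value, dict) and isinstance(merged.get(key), dict):
                # Merge nested scopes (e.g. params) option by option.
                merged[key] = merge_configs(merged[key], value)
            else:
                merged[key] = value
    return merged


# Hypothetical contents of nextflow.config (base) and data1.config (run).
base = {"params": {"input_pat": "sampleA", "cut_miss": 0.02}}
run = {"params": {"input_pat": "mydata"}}

effective = merge_configs(base, run)
print(effective)
```

Here the run-specific file only overrides input_pat, while cut_miss keeps the value from the base file, which is why a small auxiliary config is enough for each run.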
There is a template of a nextflow.config file called aux.config.template. This is a read only file. Make a copy of it, call it aux.config (or some suitable name). This file contains all the options a user is likely to want to change. It does not specify options like the names of docker containers etc. Of course, you can if you wish modify the nextflow.config file, but we recommend against it. Your auxiliary file should supplement the nextflow.config file.
Then fill in the details in the config that are required for your run. These are explained in more detail below.
When you run the scripts there are a number of different options that you might want to use. These options are specified by using the -flag or --flag notation. The flags with a single hyphen (e.g. -resume) are standard Nextflow options applicable to all Nextflow scripts. The flags with a double hyphen (e.g., --pi_hat) are options that are specific to our scripts. Take care not to mix these up: it's an easy error to make, and it may cause silent errors.
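One way to double-check a command line before running it is to separate the two kinds of flag. This is a small illustrative helper, not part of the workflow or of Nextflow; the value given to --pi_hat in the example is hypothetical, and the helper would misclassify negative numeric values such as -1.

```python
def split_flags(argv):
    """Split arguments into Nextflow options (-x) and workflow options (--x)."""
    nextflow_opts, workflow_opts = [], []
    for arg in argv:
        name = arg.split("=", 1)[0]  # handle the --name=value form
        if name.startswith("--"):
            workflow_opts.append(name)
        elif name.startswith("-"):
            nextflow_opts.append(name)
    return nextflow_opts, workflow_opts


command = ["run", "qc", "-resume", "-profile", "docker",
           "--pi_hat", "0.11", "--input_dir=/data/project10/"]
nextflow_opts, workflow_opts = split_flags(command)
print("Nextflow options:", nextflow_opts)
print("Workflow options:", workflow_opts)
```

If an option you meant for the workflow shows up in the first list (or the other way round), the hyphens have been mixed up.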
Almost all the workflow options that are in the nextflow.config file can also be passed on the command line, and they will then override anything in the config file. For example
nextflow run qc --cut_miss 0.04
sets the maximum allowable per-SNP missingness to 4%. However, this should only be used when debugging and playing around. Rather, keep the options in the auxiliary config file that you save. By putting options on the command line you reduce reproducibility. (Parameters that only change the mode of running -- e.g. whether Docker is used or whether a timeline is produced -- affect only the time taken and auxiliary data rather than the substantive results.)
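To make the meaning of a per-SNP missingness cut-off concrete, here is a toy calculation. It is only a sketch of the idea: the SNP names and genotype data are invented, and the assumption that SNPs at or below the cut-off are kept is an illustration rather than a description of the workflow's internal code.

```python
def snp_missing_rate(genotypes):
    """Fraction of missing calls (None) for one SNP across all samples."""
    return sum(g is None for g in genotypes) / len(genotypes)


def filter_snps(calls, cut_miss):
    """Keep the SNPs whose missing rate does not exceed cut_miss."""
    return {snp: g for snp, g in calls.items()
            if snp_missing_rate(g) <= cut_miss}


# Invented data: 50 samples per SNP, None marks a missing genotype call.
calls = {
    "rs1": [0] * 49 + [None],      # 1 of 50 missing = 2%
    "rs2": [1] * 47 + [None] * 3,  # 3 of 50 missing = 6%
}

kept = filter_snps(calls, cut_miss=0.04)
print(sorted(kept))
```

With a cut-off of 0.04, rs1 (2% missing) passes and rs2 (6% missing) is removed.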
Often a workflow may fail in the middle of execution because there's a problem with data (perhaps a typo in the name of a file), or you may want to run the workflow with slightly different parameters. Nextflow is very good at detecting which parts of the workflow need to be re-executed -- use the -resume option.
If you want to clean up your work directory, say `nextflow clean`.
Nextflow provides several options for visualising and tracing workflow. See the Nextflow documentation for details. Two of the options are:
- A nice graphic of a run of your workflow
nextflow run qc -with-dag quality-d.pdf
- A timeline of your workflow and individual processes (produced as an html file).
nextflow run <pipeline name> -with-timeline time.html
This is useful for seeing how long different parts of your process took. Also useful is the peak virtual memory used, which you may need to know if running on very large data to ensure you have a big enough machine and specify the right parameters.
In the quick start we gave an overview of running our workflows in different environments. Here we go through all the options in a little more detail.
This option requires that all dependencies have been installed. You run the code by saying
nextflow run qc
You can add any extra parameters at the end.
This requires the user to have Docker installed.
Run by saying `nextflow run qc -profile docker`.
Nextflow supports execution on clusters using standard resource managers, including Torque/PBS, SLURM and SGE. Log on to the head node of the cluster, and execute the workflow as shown below. Nextflow submits the jobs to the cluster on your behalf, taking care of any dependencies. If your job is likely to run for a long time because you've got really large data sets, use a tool like screen to allow you to control your session without timing out.
Our workflow has pre-built configuration for SLURM and Torque/PBS. If you use another scheduler that Nextflow supports you'll need to do a little more (see later): see https://www.nextflow.io/docs/latest/executor.html for details.
To run using Torque/PBS, log into the head node. Edit the nextflow.config file, and change the queue variable to be the queue you will run jobs on (if you're not sure of this, ask your friendly sysadmin). Then when you run our workflow, use the -profile pbs option -- typically you would say something like `nextflow run -c my.config qc -profile pbs`. Note that -profile pbs only uses a single "-".
Similarly, if you run SLURM, set the queue variable, and use the -profile slurm option.
To use one of the other schedulers supported by Nextflow, add the following sub-stanza to your nextflow.config file inside of the profile stanza:
myscheduler {
process.executor = 'myscheduler'
process.queue = queue
}
where myscheduler is one of: nqsii, htcondor, sge, lsf. Then use this as the profile.
We have tested our workflow on different Docker Swarms. How to set up Docker Swarm is beyond the scope of this tutorial, but if you have a Docker Swarm, it is easy to run. We assume all the data is visible to all nodes in the swarm. Log into the head node of the Swarm and run your chosen workflow -- for example
nextflow run qc -profile dockerSwarm
Our workflows now run easily with Singularity.
nextflow run qc -profile singularity
or
nextflow run qc -profile pbsSingularity
By default the user's ${HOME}/.singularity will be used as the cache for Singularity images. If you want to use something else, change the singularity.cacheDir parameter in the config file.
If you have a cluster which runs Docker, you can get the best of both worlds by editing the queue variable in the pbsDocker stanza, and then running
nextflow run qc -profile option
where option is one of pbsDocker, pbsSingularity, slurmDocker or slurmSingularity. If you use a different scheduler, read the Nextflow documentation on schedulers, and then use what we have in the nextflow.config file as a template to tweak.
We are unlikely to support udocker unless Nextflow does. See this link for a discussion: https://www.nextflow.io/blog/2016/more-fun-containers-hpc.html
Nextflow supports execution on Amazon EC2. Of course, you can do your own custom thing on Amazon EC2, but there is direct support from Nextflow and we provide an Amazon AMI that allows you to use Amazon very easily. This discussion assumes you are familiar with Amazon and EC2 and shows you how to run the workflow on EC2:
- We assume you have an Amazon AWS account and have some familiarity with EC2. The easiest way to run is by building an Amazon Elastic File System (EFS) which persists between runs. Each time you run, you attach the EFS to the cluster you use. We assume you have
  - your Amazon accessKey and secretKey
  - the ID of your EFS
  - the ID of the subnet you will use for your Amazon EC2.
Edit the nextflow config file to add your keys to the aws stanza, as well as changing the AMI ID, sharedStorageID, the mount and subnet ID. BUT see the security considerations below for a better way of doing things.
aws {
    accessKey = 'AAAAAAAAAAAAAAAAA'
    secretKey = 'raghdkGAHHGH13hg3hGAH18382GAJHAJHG11'
    region = 'eu-west-1'
}
cloud {
    ...
    ...
    ... other options
    imageId = "ami-710b9108" // AMI which has cloud-init installed
    sharedStorageId = "fs-XXXXXXXXX"
    // Set a common mount point for images
    sharedStorageMount = "/mnt/shared"
    subnetId = "subnet-XXXXXXX"
}
Note that the AMI is the H3ABionet AMI ID, which you should use. The other information such as the keys, sharedStorageID and subnetID you have to set to what you have.
The instructions below assume you are using Nextflow. If you launch the machine directly, the user will be ec2-user; if you use the instructions below, you will be told who the user on the Amazon instance is (probably the same userid as on your own machine).
- Create the cloud. For the simple example, you only need to have one machine. If you have many big files, adjust accordingly.
nextflow cloud create h3agwascloud -c 1
The name of the cluster is your choice (h3agwascloud in this example).
- If successful, you will be given the ID of the headnode of the cluster to log in. You should see a message like,
> cluster name: h3agwascloud
> instances count: 1
> Launch configuration:
- bootStorageSize: '20GB'
- driver: 'aws'
- imageId: 'ami-710b9108'
- instanceType: 'm4.xlarge'
- keyFile: /home/user/.ssh/id_rsa.pub
- sharedStorageId: 'fs-e17f461c'
- sharedStorageMount: '/mnt/shared'
- subnetId: 'subnet-b321c8c2'
- userName: 'scott'
- autoscale:
- enabled: true
- maxInstances: 5
- terminateWhenIdle: true
Please confirm you really want to launch the cluster with above configuration [y/n] y
Launching master node -- Waiting for `running` status.. ready.
Login in the master node using the following command:
ssh -i /home/scott/.ssh/id_rsa scott@ec2-54-246-155-85.eu-west-1.compute.amazonaws.com
- ssh into the head node of your Amazon cluster. The EFS is mounted onto /mnt/shared. In our example, we will analyse the files sampleA.{bed,bim,fam} in the /mnt/shared/input directory. The nextflow binary will be found in your home directory. (Note that you can choose to mount the EFS on another mount point by modifying the Nextflow option sharedStorageMount.)
- For real runs, upload any data you need. I suggest you put it in the /mnt/shared directory, and do not put any data output in the home directory.
- Run the workflow -- you can run directly from GitHub. The AMI doesn't have any of the bioinformatics software installed. Specify the docker profile and Nextflow will run using Docker, fetching any necessary images.
First, do
nextflow pull h3abionet/h3agwas
This pull is not strictly necessary the first time you run the job, but it's good practice to check whether there are updates.
- Then run the workflow
nextflow run h3abionet/h3agwas -profile docker --input_dir=/mnt/shared/XXXXX/projects/h3abionet/h3agwas/input/ --work_dir=/mnt/shared
You will need to replace XXXXX with your userid -- the local copy of the repo is found in the /mnt/shared/XXXXX/projects/h3abionet/h3agwas/ directory. But we want the work directory to be elsewhere. Of course, you can also use other parameters (e.g. -resume or --work_dir). For your own run you will want to use your nextflow.config file.
Need to change: By default, running the workflow like this runs the qc script. If you want to run one of the other scripts you would say `nextflow run h3abionet/h3agwas/topbottom.nf` or `nextflow run h3abionet/h3agwas/assoc.nf` etc.
- The output of the default run can be found in /mnt/shared/output. The file sampleA.pdf is a report of the analysis that was done.
- Remember to shut down the Amazon cluster to avoid unduly boosting Amazon's share price.
nextflow cloud shutdown h3agwascloud
- Security considerations: Note that your Amazon credentials should be kept confidential. Practically this means adding the credentials to your nextflow.config file is a bad idea, especially if you put that under git control or if you share your nextflow scripts. A better way of handling this is to put confidential information in a separate file that you don't share. So I have a file called scott.aws which has the following:
aws {
accessKey ='APT3YGD76GNbOP1HSTYU4'
secretKey = 'WHATEVERYOURSECRETKEYISGOESHERE'
region ='eu-west-1'
}
cloud {
sharedStorageId = "fs-XXXXXX"
subnetId = "subnet-XXXXXX"
}
Then when you create your cloud you say this on your local machine
nextflow -c scott.aws -c run10.config cloud create scottcluster -c 5
Note there are two uses of -c. The positions of these arguments are crucial. The first are arguments to nextflow itself and give the configuration files that nextflow is to use. The second is an argument to cloud create which says how many nodes should be created.
The scott.aws file is not shared or put under git control. The nextflow.config and run10.config files can be archived, put under git control and so on because you want to share and archive this information with others.
AWS Batch is a service layered on top of EC2 by Amazon which may make it easier and/or cheaper than using EC2. My personal view is that if you are only running our pipeline on Amazon and you have reasonable Linux experience then the EC2 implementation above is probably easier. However, if you use or plan to use AWS Batch for other services, then AWS Batch is a definite option.
Create an AWS Batch queue and computing environment. Setting up AWS Batch is beyond the scope of this document. You can look at Amazon's documentation or the general documentation from BioNet.
You also need to set up an S3 bucket for working space. Remember to set permissions on this bucket appropriately.
Create a nextflow config file with your personal information (this should not be put under git!). Set process.queue to the name of the queue you created in the previous step and replace the accessKey, secretKey and region parameters with your values.
process.queue = 'queue_name'
aws {
accessKey ='accessKey'
secretKey = 'WHATEVERYOURSECRETKEYISGOESHERE'
region ='eu-west-1'
}
You can call your config file whatever you want, but for the sake of the documentation below I'm assuming you called it aws.config.
Set up your other config files as required. Note that data you wish to process can either be local or in an S3 bucket.
Run the job (in this example the qc workflow). You need to specify the S3 bucket to be used and also the awsbatch profile
nextflow run -c aws.config -c job.config qc -bucket-dir s3://my-bucket/some/path -profile awsbatch
One problem with our current workflow is that error messages can be obscure. Errors can be caused by
- bugs in our code
- your doing something odd
There are two related problems. First, when a Nextflow script fails for some reason, Nextflow prints out in great detail what went wrong. Second, we don't always gracefully catch mistakes that the user makes.
First, don't panic. Take a breath and read through the error message to see if you can find a sensible error message there.
A typical error message looks something like this
Command exit status:
1
Command output:
(empty)
Command error:
Traceback (most recent call last):
File ".command.sh", line 577, in <module>
bfrm, btext = getBatchAnalysis()
File ".command.sh", line 550, in getBatchAnalysis
result = miss_vals(ifrm,bfrm,args.batch_col,args.sexcheck_report)
File ".command.sh", line 188, in miss_vals
g = pd.merge(pfrm,ifrm,left_index=True,right_index=True,how='inner').groupby(pheno_col)
File "/usr/local/python36/lib/python3.6/site-packages/pandas/core/generic.py", line 5162, in groupby
**kwargs)
File "/usr/local/python36/lib/python3.6/site-packages/pandas/core/groupby.py", line 1848, in groupby
return klass(obj, by, **kwds)
File "/usr/local/python36/lib/python3.6/site-packages/pandas/core/groupby.py", line 516, in __init__
mutated=self.mutated)
File "/usr/local/python36/lib/python3.6/site-packages/pandas/core/groupby.py", line 2934, in _get_grouper
raise KeyError(gpr)
Column 'batches' unknown
Work dir:
/project/h3abionet/h3agwas/test/work/cf/335b6d21ad75841e1e806178933d3d
Tip: when you have fixed the problem you can continue the execution appending to the nextflow command line the option `-resume`
-- Check '.nextflow.log' file for details
WARN: Killing pending tasks (1)
Buried in this is an error message that might help (did you say there was a column batches in the manifest?). If you're comfortable, you can change directory to the specified work directory and explore. There you'll find
- Any input files for the process that failed
- Any output files that might have been created
- The script that was executed can be found in .command.sh
- Output and error can be found as .command.out and .command.err
If you spot the error, you can re-run the workflow (from the original directory), appending -resume. Nextflow will re-run your workflow as needed -- any steps that finished successfully will not need to be re-run.
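If you would rather not open each file by hand, the exploration described above can be sketched in a few lines of Python. This is a minimal helper, not part of the pipeline: the function names `summarise_task_dir` and `print_summary` are made up for illustration, and the only things taken from Nextflow are the three file names listed above.

```python
from pathlib import Path

# File names that Nextflow writes into a task's work directory,
# as listed in the text above.
TASK_FILES = (".command.sh", ".command.out", ".command.err")


def summarise_task_dir(work_dir, tail=20):
    """Return the last `tail` lines of each task file found in work_dir.

    The result maps each file name to a list of lines, or to None when
    the file does not exist in that directory.
    """
    summary = {}
    for name in TASK_FILES:
        path = Path(work_dir) / name
        if path.is_file():
            lines = path.read_text(errors="replace").splitlines()
            summary[name] = lines[-tail:]
        else:
            summary[name] = None
    return summary


def print_summary(summary):
    """Print the summary in a `tail`-like layout, one section per file."""
    for name, lines in summary.items():
        print(f"==> {name}")
        if lines is None:
            print("(missing)")
        elif not lines:
            print("(empty)")
        else:
            print("\n".join(lines))
```

For the failure shown above you would call `print_summary(summarise_task_dir("/project/h3abionet/h3agwas/test/work/cf/335b6d21ad75841e1e806178933d3d"))`, using whatever path Nextflow reports under "Work dir:".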
If you are still stuck you can ask for help at two places
- H3ABioNet Help desk: https://www.h3abionet.org/support
- On GitHub (you need a GitHub account)
These are in the aux directory
Can be used to update fam files. You probably won't need it, but others might find it useful. The intended application is the case where there has been a mix-up of sample IDs and you want to correct it. The program takes four parameters: the original sample sheet, a new sample sheet (which only has to include those elements that have changed), the original fam file, and the base of the new fam file name. The program takes the plate and well as the authoritative ID of a sample. For every row in the updated sheet, the program finds the plate and well, looks up the corresponding entry in the original sheet, and then replaces the associated ID in the fam file. For example, if we have
Original sheet
Plate Well Sample
W77888 G01 AAAAAA
New sheet
Plate Well Sample
W77888 G01 BBBBBB
Then the new fam file has the AAAAAA entry replaced with the BBBBBB entry.
Three files are output: a fam file, an error file (the IDs of individuals who are in the sample sheet but not the fam file are output), and a switch file (containing all the changes that were made). Some problems like duplicate entries are detected.
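The lookup described above can be sketched as follows. This is an illustration of the idea only, not the code of the script in the aux directory: the function names are hypothetical, the sheets are assumed to have already been read into (plate, well, sample) tuples, and it is an assumption that the sample ID may appear in either of the first two columns of the fam file.

```python
def plan_switches(original_sheet, new_sheet):
    """Map old sample ID -> new sample ID, matching rows on (plate, well)."""
    old_by_position = {(plate, well): sample
                       for plate, well, sample in original_sheet}
    switches = {}
    for plate, well, new_id in new_sheet:
        old_id = old_by_position.get((plate, well))
        if old_id is not None and old_id != new_id:
            switches[old_id] = new_id
    return switches


def apply_switches(fam_lines, switches):
    """Return (updated fam lines, old IDs that never appeared in the fam)."""
    seen = set()
    updated = []
    for line in fam_lines:
        fields = line.split()
        if len(fields) < 2:
            updated.append(line)  # leave blank or malformed lines untouched
            continue
        # Assumption: the sample ID is used as family ID and/or individual
        # ID, i.e. it sits in the first two columns of the fam file.
        for i in (0, 1):
            if fields[i] in switches:
                seen.add(fields[i])
                fields[i] = switches[fields[i]]
        updated.append(" ".join(fields))
    missing = sorted(set(switches) - seen)
    return updated, missing


# The example from the text: plate W77888, well G01 changes its sample ID.
original = [("W77888", "G01", "AAAAAA")]
new = [("W77888", "G01", "BBBBBB")]
fam = ["AAAAAA AAAAAA 0 0 1 -9"]

switches = plan_switches(original, new)
new_fam, missing = apply_switches(fam, switches)
```

Here `switches` plays the role of the switch file and `missing` the role of the error file; duplicate detection is left out of the sketch.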
Nextflow has great options for showing resource usage. However, you have to remember to set those options when you run, and it's easy to forget to do this. This very useful script by Harry Noyes (harry@liverpool.ac.uk) parses the .nextflow.log file for you
Makes a reference genome in a format that the pipeline can use. The first argument is a directory that contains FASTA files for each chromosome; the second is the strand report; the third is the manifest report; the fourth is the base of the output files.
python3 make_ref.py auxfiles/37/ H3Africa_2017_20021485_A3_StrandReport_FT.txt H3Africa_2017_20021485_A3.csv h3aref201812
The program checks each SNP given in the manifest file by the chromosome and position number and then checks that the probe given in the manifest file actually matches the reference genome at that point. Minor slippage is acceptable because of indels.
The wrn file contains SNPs which are probably OK but have high slippage (these are included in the ref file). The err file contains the SNPs which don't match.
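To make the idea of slippage concrete, here is a small sketch of one way such a check could work. It is not the algorithm used by make_ref.py: the exact-match search, the function names, and the `max_slip` and `warn_from` thresholds are all assumptions chosen for illustration.

```python
def find_slippage(reference, probe, position, max_slip=5):
    """Return the smallest offset at which probe matches reference exactly.

    `position` is the 0-based start expected from the manifest. Offsets are
    tried in the order 0, -1, +1, -2, +2, ... up to max_slip. Returns None
    when the probe does not match anywhere in that window.
    """
    reference = reference.upper()
    probe = probe.upper()
    offsets = [0]
    for k in range(1, max_slip + 1):
        offsets.extend((-k, k))
    for offset in offsets:
        start = position + offset
        if start < 0:
            continue  # window would begin before the chromosome start
        if reference[start:start + len(probe)] == probe:
            return offset
    return None


def classify(offset, warn_from=3):
    """Sort a SNP into 'ok', 'wrn' (high slippage) or 'err' (no match)."""
    if offset is None:
        return "err"
    return "wrn" if abs(offset) >= warn_from else "ok"
```

With this scheme a probe found exactly where the manifest says gets offset 0, a probe found a few bases away is kept but flagged, and a probe that is not found within the window is reported as an error.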
This is used to depict where on the plates particular samples are. This is very useful for looking at problems in the data. If for example you find a bunch of sex mismatches this is most likely due to misplating. This script is a quick way of looking at the problem and seeing whether the errors are close together or spread out. There are two input arguments
- A file with the IDs of the individuals -- assuming that the first token on each line is an individual
- A sample sheet that gives the plating of each sample
There is one output parameter -- the name of a directory where output should go. The directory should exist.
You may need to change this line
batches['ID'] = batches['Institute Sample Label'].apply(lambda x:x[18:])
In our example, we assumed that the ID can be found in the column "Institute Sample Label", starting at position 18 (indexed from 0) in the string. Change as appropriate for your data.
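If your labels are laid out differently, it may help to pull the slicing into a small named function so that a malformed label is reported rather than silently turned into an empty ID. The sketch below is a suggestion, not part of the script; `extract_id` is a hypothetical name and the example label is made up.

```python
def extract_id(label, start=18):
    """Return the part of the sample label that holds the individual ID.

    `start` is the 0-based position at which the ID begins; 18 matches the
    line shown above and should be changed to suit your sample sheet.
    """
    if len(label) <= start:
        raise ValueError(
            f"label {label!r} has no characters from position {start}"
        )
    return label[start:]


# A made-up label: 18 characters of prefix followed by the ID.
example_label = "P" * 18 + "AAAAAA"
example_id = extract_id(example_label)
```

The function can then replace the lambda, as in `batches['ID'] = batches['Institute Sample Label'].apply(extract_id)`.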
If you use this workflow, please cite the following paper
- Baichoo S, Souilmi Y, Panji S, Botha G, Meintjes A, Hazelhurst S, Bendou H, De Beste E, Mpangase P, Souiai O, Alghali M, Yi L, O'Connor B, Crusoe M, Armstrong D, Aron S, Joubert D, Ahmed A, Mbiyavanga M, Van Heusden P, Magosi, L, Zermeno, J, Mainzer L, Fadlelmola F, Jongeneel CV, and Mulder N. (2018) Developing reproducible bioinformatics analysis workflows for heterogeneous computing environments to support African genomics, BMC Bioinformatics 19, 457, 13 pages, doi:10.1186/s12859-018-2446-1.
We acknowledge funding by the National Institutes of Health through the NHGRI (U41HG006941). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
- We thank Sumir Panji and Nicola Mulder for their support and leadership
- We thank Fourie Joubert at the University of Pretoria for hosting our initial hackathon.
Current team: Scott Hazelhurst, Jean-Tristan Brandenburg, Lindsay Clark, Obokula Smile, Michael Ebo Turkson, Michael Thompson.
H3ABioNet Pipelines team leadership: Christopher Fields, Shakuntala Baichoo, Sumir Panji, Gerrit Botha.
Past members and contributors: Lerato E. Magosi, Shaun Aron, Rob Clucas, Eugene de Beste, Aboyomini Mosaku, Don Armstrong and the Wits Bioinformatics team
We thank Harry Noyes from the University of Liverpool and Ayton Meintjes from UCT, who both spent significant effort being testers of the pipeline, and the many users at the Sydney Brenner Institute for Molecular Bioscience for their patience and suggestions.
This software is licensed under the MIT Licence.
We acknowledge the support from the NIH NHGRI H3ABioNet (U24HG006941) and AWI-Gen (U54HG006938)
git clone https://github.com/h3abionet/h3agwas