Pipeline Implementaton

Data preparation

A first step in any pipeline is to prepare the input data. You will find all the data required to run the pipeline in the folder data within the ngs2017-nf repository directory.

There are four data inputs that we will use in this tutorial:

Genome File (data/genome.fa)
- Human chromosome 22 in FASTA file format
Read Files (data/reads/)
- Sample ENCSR000COQ1: 76bp paired-end reads (ENCSR000COQ1_1.fq.gz and ENCSR000COQ1_2.fq.gz).
Variants File (data/known_variants.vcf.gz)
- Known variants, gzipped as a Variant Calling File (VCF) format.
Blacklist File (data/blacklist.bed)
- Genomic locations which are known to produce artifacts and spurious variants in Browser Extensible Data (BED) format.

Input parameters

We can begin writing the pipeline by creating and editing a text file called main.nf from the ngs2017-nf repository directory with your favourite text editor. In this example we are using nano:

cd ngs2017-nf
nano main.nf

Edit this file to specify the input files as script parameters. Using this notation allows you to override them by specifying different values when launching the pipeline execution.

/*
 * Define the default parameters (1)
 */

params.genome     = "data/genome.fa"
params.variants   = "data/known_variants.vcf.gz"
params.blacklist  = "data/blacklist.bed"
params.reads      = "data/reads/ENCSR000COQ1_{1,2}.fastq.gz" (2)
params.results    = "results" (3)
params.gatk       = "/opt/broad/GenomeAnalysisTK.jar" (4)

Tip	You can copy the above text by using the kbd:[Cmd+C] keys, then move in the terminal window, open `nano` and paste the above text by using the kbd:[Cmd+V] keys shortcut.

The /*, * and */ specify comment lines which are ignored by Nextflow.
The reads parameter uses a glob pattern to specify the forward (ENCSR000COQ1_1.fq.gz) and reverse (ENCSR000COQ1_2.fq.gz) reads are pairs of the same sample.
The results parameter is used to specify a directory called results.
The gatk parameter specifies the location of the GATK jar file.

Once you have the default parameters in the main.nf file, you can save and run the main script for the first time.

Tip	With `nano` you can save and close the file with kbd:[Ctrl+O], then kbd:[Enter], followed by kbd:[Ctrl+X].

To run the main script use the following command:

nextflow run main.nf

You should see the script execute, print Nextflow version and pipeline revision and then exit.

N E X T F L O W  ~  version 0.24.1
Launching `main.nf` [nauseous_wright] - revision: 83a0a5415a

Problem #1

Great, now we need to transform the parameters into proper file handlers and channel variables. To do that open the main.nf file and copy the lines below at the end of the file.

Tip	In `nano` you can move to the end of the file using kbd:[Ctrl+W] and then kbd:[Ctrl+V].

This time you must fill the BLANK spaces with the correct function and parameter.

/*
 *  Parse the input parameters
 */

genome_file     = file(params.genome)
variants_file   = BLANK
blacklist_file  = BLANK
reads_ch        = BLANK
GATK            = params.gatk

Tip

The first three should specify file objects containing a single file as shown in this basic example. For the reads channel, use the fromFilePairs channel factory method. The final one, GATK is simply creating a Nextflow variable specifying the path of the GATK application file and does not need any special function.

Once you think you have data organised, you can again run the pipeline. However this time, we can use the the -resume flag.

nextflow run main.nf -resume

Tip	See here for more details about using the `resume` option.

Process 1A

Create a FASTA genome index

Now we have our inputs set up we can move onto the processes. In our first process we will create a genome index using samtools.

You should implement a process having the following structure:

Name: 1A_prepare_genome_samtools
Command: create a genome index for the genome fasta with samtools
Input: the genome fasta file
Output: the samtools genome index file

Problem #2

Copy the code below and paste it at the end of main.nf.

Your aim is to replace BLANK placeholder with the the correct variable name of the genome file that you have defined in previous problem.

/*
 * Process 1A: Create a FASTA genome index with samtools
 */

process '1A_prepare_genome_samtools' { (1)

  input:
      file genome from BLANK (2)

  output:
      file "${genome}.fai" into genome_index_ch (3)

  script:
  """
  samtools faidx ${genome} (4)
  """
}

In plain english, the process could be written as:

A process called 1A_prepare_genome_samtools
takes as input the genome file from BLANK
and creates as output a genome index file which goes into channel genome_index_ch
script: using samtools create the genome index from the genome file

Now when we run the pipeline, we see that the process 1A is submitted:

nextflow run main.nf -resume

N E X T F L O W  ~  version 0.24.1
Launching `main.nf` [adoring_wilson] - revision: 89dbc97b8e
[warm up] executor > local
[17/b0eae4] Submitted process > 1A_prepare_genome_samtools

Process 1B

Create a FASTA genome sequence dictionary with Picard for GATK

Our first process created the genome index for GATK using samtools. For the next process we must do something very similar, this time creating a genome sequence dictionary using Picard.

You should implement a process having the following structure:

Name: 1B_prepare_genome_picard
Command: create a genome dictionary for the genome fasta with Picard tools
Input: the genome fasta file
Output: the genome dictionary file

Problem #3

Fill in the BLANK words for both the input and output sections.

Copy the code below and paste it at the end of main.nf.

Your aim is to insert the correct input name from into the input step (written as BLANK) of the process and run the pipeline.

Tip	You can choose any channel output name that makes sense to you.

/*
 * Process 1B: Create a FASTA genome sequence dictionary with Picard for GATK
 */

process '1B_prepare_genome_picard' {

  input:
      file genome BLANK BLANK

  output:
      file "${genome.baseName}.dict" BLANK BLANK

  script:
  """
  PICARD=`which picard.jar`
  java -jar \$PICARD CreateSequenceDictionary R= $genome O= ${genome.baseName}.dict
  """
}

Note	`.baseName` returns the filename without the file suffix. If `"${genome}"` is `human.fa`, then `"${genome.baseName}.dict"` would be `human.dict`.

Process 1C

Create STAR genome index file

Next we must create a genome index for the STAR mapping software.

You should implement a process having the following structure:

Name: 1C_prepare_star_genome_index
Command: create a STAR genome index for the genome fasta
Input: the genome fasta file
Output: a directory containing the STAR genome index

Problem #4

This is a similar exercise as problem 3, except this time both input and output lines have been left BLANK and must be completed.

/*
 * Process 1C: Create the genome index file for STAR
 */

process '1C_prepare_star_genome_index' {

  input:
      BLANK_LINE

  output:
      BLANK_LINE

  script:
  """
  mkdir genome_dir

  STAR --runMode genomeGenerate \
       --genomeDir genome_dir \
       --genomeFastaFiles ${genome} \
       --runThreadN ${task.cpus}
  """
}

Tip	The output of the STAR genomeGenerate command is specified here as `genome_dir`.

Process 1D

Filtered and recoded set of variants

Next on to something a little more tricky. The next process takes two inputs: the variants file and the blacklist file.

It should output a channel named prepared_vcf_ch which emitting a tuple of two files.

Note	In Nextflow, tuples can be defined in the input or output using the `set` qualifier.

You should implement a process having the following structure:

Name: 1D_prepare_vcf_file
Command: create a filtered and recoded set of variants
Input: the variants file
the blacklisted regions file
Output: a set containing the filtered/recoded VCF file and the tab index (TBI) file.

Problem #5

You must fill in the two BLANK_LINES in the input and the two BLANK output files.

/*
 * Process 1D: Create a file containing the filtered and recoded set of variants
 */

process '1D_prepare_vcf_file' {

  input:
      BLANK_LINE
      BLANK_LINE

  output:
      set BLANK, BLANK into prepared_vcf_ch

  script:
  """
  vcftools --gzvcf $variantsFile -c \//(1)
           --exclude-bed ${blacklisted} \//(2)
           --recode | bgzip -c \
           > ${variantsFile.baseName}.filtered.recode.vcf.gz (3)

  tabix ${variantsFile.baseName}.filtered.recode.vcf.gz (4)
  """
}

The input variable for the variants file
The input variable for the blacklist file
The first of the two output files
Generates the second output file named "${variantsFile.baseName}.filtered.recode.vcf.gz.tbi"

Try run the pipeline from the project directory with:

nextflow run main.nf -resume

Congratulations! Part 1 is now complete.

We have all the data prepared and into channels ready for the more serious steps

Process 2

STAR Mapping

In this process, for each sample, we align the reads to our genome using the STAR index we created previously.

You should implement a process having the following structure:

Name: 2_rnaseq_mapping_star
Command: mapping of the RNA-Seq reads using STAR
Input: the genome fasta file
the STAR genome index
a set containing the replicate id and paired read files
Output: a set containg replicate id, aligned bam file & aligned bam file index

Problem #6

Copy the code below and paste it at the end of main.nf.

You must fill in the three BLANK_LINE lines in the input and the one BLANK_LINE line in the output.

/*
 * Process 2: Align RNA-Seq reads to the genome with STAR
 */

process '2_rnaseq_mapping_star' {

  input:
      BLANK_LINE
      BLANK_LINE
      BLANK_LINE

  output:
      BLANK_LINE

  script:
  """
  # ngs-nf-dev Align reads to genome
  STAR --genomeDir $genomeDir \
       --readFilesIn $reads \
       --runThreadN ${task.cpus} \
       --readFilesCommand zcat \
       --outFilterType BySJout \
       --alignSJoverhangMin 8 \
       --alignSJDBoverhangMin 1 \
       --outFilterMismatchNmax 999

  # 2nd pass (improve alignmets using table of splice junctions and create a new index)
  mkdir genomeDir
  STAR --runMode genomeGenerate \
       --genomeDir genomeDir \
       --genomeFastaFiles $genome \
       --sjdbFileChrStartEnd SJ.out.tab \
       --sjdbOverhang 75 \
       --runThreadN ${task.cpus}

  # Final read alignments
  STAR --genomeDir genomeDir \
       --readFilesIn $reads \
       --runThreadN ${task.cpus} \
       --readFilesCommand zcat \
       --outFilterType BySJout \
       --alignSJoverhangMin 8 \
       --alignSJDBoverhangMin 1 \
       --outFilterMismatchNmax 999 \
       --outSAMtype BAM SortedByCoordinate \
       --outSAMattrRGline ID:$replicateId LB:library PL:illumina PU:machine SM:GM12878

  # Index the BAM file
  samtools index Aligned.sortedByCoord.out.bam
  """
}

Tip	The final command produces an bam index which is the full filename with an additional `.bai` suffix.

The next step is a filtering step using GATK. For each sample, we split all the reads that contain N characters in their CIGAR string.

Process 3

GATK Split on N

The process creates k+1 new reads (where k is the number of N cigar elements) that correspond to the segments of the original read beside/between the splicing events represented by the Ns in the original CIGAR.

You should implement a process having the following structure:

Name: 3_rnaseq_gatk_splitNcigar
Command: split reads on Ns in CIGAR string using GATK
Input: the genome fasta file
the genome index made with samtools
the genome dictionary made with picard
a set containg replicate id, aligned bam file and aligned bam file index from the STAR mapping
Output: a set containing the sample id, the split bam file and the split bam index file

Problem #7

Copy the code below and paste it at the end of main.nf.

You must fill in the four BLANK_LINE lines in the input and the one BLANK_LINE line in the output.

Caution

There is an optional tag line added to the start of this process. The tag line allows you to assign a name to a specific task (single execution of a process). This is particularly useful when there are many samples/replicates which pass through the same process.

process '3_rnaseq_gatk_splitNcigar' {
  tag OPTIONAL_BLANK

  input:
      BLANK_LINE
      BLANK_LINE
      BLANK_LINE
      BLANK_LINE

  output:
      BLANK_LINE

  script:
  """
  # SplitNCigarReads and reassign mapping qualities
  java -jar $GATK -T SplitNCigarReads \
                  -R $genome -I $bam \
                  -o split.bam \
                  -rf ReassignOneMappingQuality \
                  -RMQF 255 -RMQT 60 \
                  -U ALLOW_N_CIGAR_READS \
                  --fix_misencoded_quality_scores
  """
}

Tip	The GATK command above automatically creates a bam index (.bai) of the split.bam output file

Next we perform a Base Quality Score Recalibration step using GATK.

Process 4

GATK Recalibrate

This step uses GATK to detect systematic errors in the base quality scores, select unique alignments and then index the resulting bam file with samtools. You can find details of the specific GATK BaseRecalibrator parameters here.

You should implement a process having the following structure:

Name: 4_rnaseq_gatk_recalibrate
Command: recalibrate reads from each replicate using GATK
Input: the genome fasta file
the genome index made with samtools
the genome dictionary made with picard
a set containg replicate id, aligned bam file and aligned bam file index from process 3
a set containing the filtered/recoded VCF file and the tab index (TBI) file from process 1D
Output: a set containing the sample id, the unique bam file and the unique bam index file

Problem #8

Copy the code below and paste it at the end of main.nf.

You must fill in the five BLANK_LINE lines in the input and the one BLANK_LINE line in the output.

process '4_rnaseq_gatk_recalibrate' {
  tag "$replicateId"

  input:
      BLANK_LINE
      BLANK_LINE
      BLANK_LINE
      BLANK_LINE
      BLANK_LINE

  output:
      BLANK into (final_output_ch, bam_for_ASE_ch) (1)

  script:
    sampleId = replicateId.replaceAll(/[12]$/,'')
    """
    # Indel Realignment and Base Recalibration
    java -jar $GATK -T BaseRecalibrator \
                  --default_platform illumina \
                  -cov ReadGroupCovariate \
                  -cov QualityScoreCovariate \
                  -cov CycleCovariate \
                  -knownSites ${variants_file} \
                  -cov ContextCovariate \
                  -R ${genome} -I ${bam} \
                  --downsampling_type NONE \
                  -nct ${task.cpus} \
                  -o final.rnaseq.grp

     java -jar $GATK -T PrintReads \
                  -R ${genome} -I ${bam} \
                  -BQSR final.rnaseq.grp \
                  -nct ${task.cpus} \
                  -o final.bam

    # Select only unique alignments, no multimaps
    (samtools view -H final.bam; samtools view final.bam| grep -w 'NH:i:1') \
    |samtools view -Sb -  > ${replicateId}.final.uniq.bam (2)

    # Index BAM files
    samtools index ${replicateId}.final.uniq.bam (3)
    """
}

The files resulting from this process will be used in two downstream processes. If a process is executed more than once, and the downstream channel is used by more than one process, we must duplicate the channel. We can do this using the into operator with parenthesis in the output section. See here for more information on using into.
The unique bam file
The index of the unique bam file (bam file name + .bai)

Now we are ready to perform the variant calling with GATK.

Process 5

GATK Variant Calling

This steps call variants with GATK HaplotypeCaller. You can find details of the specific GATK HaplotypeCaller parameters here.

You should implement a process having the following structure:

Name: 5_rnaseq_call_variants
Command: variant calling of each sample using GATK
Input: the genome fasta file
the genome index made with samtools
the genome dictionary made with picard
a set containg replicate id, aligned bam file and aligned bam file index from process 4
Output: a set containing the sample id the resulting variant calling file (vcf)

Problem #9

In this problem we will introduce the use of a channel operator in the input section. The groupTuple operator groups together the tuples emitted by a channel which share a common key.

Caution

Note that in process 4, we used the sampleID (not replicateID) as the first element of the set in the output. Now we combine the replicates by grouping them on the sample ID. It follows from this that process 4 is run one time per replicate and process 5 is run one time per sample.

Fill in the BLANKS as before.

process '5_rnaseq_call_variants' {
  tag BLANK

  input:
      BLANK_LINE
      BLANK_LINE
      BLANK_LINE
      BLANK from BLANK.groupTuple()

  output:
      BLANK_LINE

  script:
  """
  echo "${bam.join('\n')}" > bam.list

  # Variant calling
  java -jar $GATK -T HaplotypeCaller \
                  -R $genome -I bam.list \
                  -dontUseSoftClippedBases \
                  -stand_call_conf 20.0 \
                  -o output.gatk.vcf.gz

  # Variant filtering
  java -jar $GATK -T VariantFiltration \
                  -R $genome -V output.gatk.vcf.gz \
                  -window 35 -cluster 3 \
                  -filterName FS -filter "FS > 30.0" \
                  -filterName QD -filter "QD < 2.0" \
                  -o final.vcf
  """
}

Processes 6A and 6B

ASE & RNA Editing

In the final steps we will create processes for Allele-Specific Expression and RNA Editing Analysis.

We must process the VCF result to prepare variants file for allele specific expression (ASE) analysis. We will implement both processes togther.

You should implement two processes having the following structure:

1st process

Name: 6A_post_process_vcf
Command: post-process the variant calling file (vcf) of each sample
Input: set containing the sample ID and vcf file
a set containing the filtered/recoded VCF file and the tab index (TBI) file from process 1D
Output: a set containing the sample id, the variant calling file (vcf) and a file containing common SNPs

2nd process

Name: 6B_prepare_vcf_for_ase
Command: prepare the VCF for allele specific expression (ASE) and generate a figure in R.
Input: a set containing the sample id, the variant calling file (vcf) and a file containing common SNPs
Output: a set containing the sample ID and known SNPs in the sample for ASE
a figure of the SNPs generated in R as a PDF file

Problem #10

Here we introduce the publishDir directive. This allows us to specifiy a location for the outputs of the process. See here for more details.

You must have the output of process 6A become the input of process 6B.

process '6A_post_process_vcf' {
  tag BLANK
  publishDir "$params.results/$sampleId" (1)

  input:
      BLANK_LINE
      BLANK_LINE

  output:
      BLANK_LINE

  script:
  '''
  grep -v '#' final.vcf | awk '$7~/PASS/' |perl -ne 'chomp($_); ($dp)=$_=~/DP\\=(\\d+)\\;/; if($dp>=8){print $_."\\n"};' > result.DP8.vcf

  vcftools --vcf result.DP8.vcf --gzdiff filtered.recode.vcf.gz  --diff-site --out commonSNPs
  '''
}


process '6B_prepare_vcf_for_ase' {
  tag BLANK
  publishDir BLANK

  input:
      BLANK_LINE
  output:
      BLANK_LINE
      BLANK_LINE

  script:
  '''
  awk 'BEGIN{OFS="\t"} $4~/B/{print $1,$2,$3}' commonSNPs.diff.sites_in_files  > test.bed

  vcftools --vcf final.vcf --bed test.bed --recode --keep-INFO-all --stdout > known_snps.vcf

  grep -v '#'  known_snps.vcf | awk -F '\\t' '{print $10}' \
               |awk -F ':' '{print $2}'|perl -ne 'chomp($_); \
               @v=split(/\\,/,$_); if($v[0]!=0 ||$v[1] !=0)\
               {print  $v[1]/($v[1]+$v[0])."\\n"; }' |awk '$1!=1' \
               >AF.4R

  gghist.R -i AF.4R -o AF.histogram.pdf
  '''
}

The final step is the GATK ASEReadCounter.

Problem #11

We have seen the basics of using processes in Nextflow. Yet one of the standout features of Nextflow is the operations that can be performed on channels outside of processes. See here for details on the specific operators.

Before we perform the GATK ASEReadCounter process, we must group the data for allele-specific expression. To do this we must combine channels.

The bam_for_ASE_ch channel emites tuples having the following structure, holding the final BAM/BAI files:

( sample_id, file_bam, file_bai )

The vcf_for_ASE channel emits tuples having the following structure:

( sample_id, output.vcf )

In the first operation, the BAMs are grouped together by sample id.

Next, this resulting channel is merged with the VCFs (vcf_for_ASE) having the same sample id.

We must take the merged channel and creates a channel named grouped_vcf_bam_bai_ch emitting the following tuples:

( sample_id, file_vcf, List[file_bam], List[file_bai] )

Your aim is to fill in the BLANKS below.

bam_for_ASE_ch
  .BLANK                            (1)
  .phase(vcf_for_ASE)               (2)
  .map{ left, right ->              (3)
    def sampleId = left[0]          (4)
    def bam = left[1]               (5)
    def bai = left[2]               (6)
    def vcf = right [1]             (7)
    tuple(BLANK, vcf, BLANK, BLANK) (8)
  .set { grouped_vcf_bam_bai_ch }   (9)

an operator that groups sets that contain a common first element.
the phase operator synchronizes the values emitted by two other channels. See here for more details
the map operator can apply any function to every item on a channel. In this case we take our tuple from the phase operation, define the seperate elements and create a new tuple.
define repID to be the first element of left.
define bam to be the second element of left.
define bai to be the third element of left.
define vcf to be the first element of right.
create a new tuple made of four elements
rename the resulting as grouped_vcf_bam_bai_ch

Caution

left and right above are arbitary names. From the phase operator documentation, we see that phase returns pairs of items. So here left originates from contents of the bam_for_ASE_ch channel and right originates from the contents of vcf_for_ASE channel.

Process 6C

Allele-Specific Expression analysis with GATK ASEReadCounter

Now we are ready for the final process.

You should implement a process having the following structure:

Name: 6C_ASE_knownSNPs
Command: calculate allele counts at a set of positions with GATK tools
Input: genome fasta file
genome index file from samtools
genome dictionary file
the `grouped_vcf_bam_bai_ch`channel
Output: the allele specific expression file (ASE.tsv)

Problem #12

You should construct the process and run the pipeline in its entirety.

  echo "${bam.join('\n')}" > bam.list

  java -jar $GATK -R ${genome} \
                  -T ASEReadCounter \
                  -o ASE.tsv \
                  -I bam.list \
                  -sites ${vcf}

Congratulations! If you made it this far you now have all the basics to create your own Nextflow workflows.

pipeline_results.adoc

Bonus step

Until now the pipeline has been executed using just a single sample (ENCSR000COQ1).

Now we can re-execute the pipeline specifying a large set of samples by using the command shown below:

nextflow run main.nf -resume --reads 'data/reads/ENCSR000C*_{1,2}.fastq.gz'

It will print an output similar to the one below:

N E X T F L O W  ~  version 0.24.1
Launching `main.nf` [backstabbing_nightingale] - revision: 1187e44c7a
[warm up] executor > local
[c6/75e3f4] Submitted process > 1A_prepare_genome_samtools (genome)
[7b/44e5d6] Submitted process > 1C_prepare_star_genome_index (genome)
[da/e19bcf] Submitted process > 1B_prepare_genome_picard (genome)
[95/1ad13d] Submitted process > 1D_prepare_vcf_file (known_variants.vcf)
[72/702900] Submitted process > 2_rnaseq_mapping_star (ENCSR000COR1)
[9a/5ca042] Submitted process > 2_rnaseq_mapping_star (ENCSR000CPO1)
[77/03ef01] Submitted process > 2_rnaseq_mapping_star (ENCSR000COR2)
[04/262db9] Submitted process > 2_rnaseq_mapping_star (ENCSR000COQ2)
[a4/64c69e] Submitted process > 2_rnaseq_mapping_star (ENCSR000CPO2)
[9e/ad3621] Submitted process > 2_rnaseq_mapping_star (ENCSR000COQ1)
[a5/cda1b0] Submitted process > 3_rnaseq_gatk_splitNcigar (ENCSR000COQ2)
[42/0565d7] Submitted process > 3_rnaseq_gatk_splitNcigar (ENCSR000COQ1)
[0c/68ce48] Submitted process > 3_rnaseq_gatk_splitNcigar (ENCSR000COR1)
[6b/3843e1] Submitted process > 3_rnaseq_gatk_splitNcigar (ENCSR000COR2)
[1c/8c474b] Submitted process > 3_rnaseq_gatk_splitNcigar (ENCSR000CPO1)
[98/f17992] Submitted process > 3_rnaseq_gatk_splitNcigar (ENCSR000CPO2)
[c2/8cdfca] Submitted process > 4_rnaseq_gatk_recalibrate (ENCSR000COQ1)
[d1/1a6935] Submitted process > 4_rnaseq_gatk_recalibrate (ENCSR000COR1)
[9f/b4c61d] Submitted process > 4_rnaseq_gatk_recalibrate (ENCSR000COR2)
[aa/b43a43] Submitted process > 4_rnaseq_gatk_recalibrate (ENCSR000COQ2)
[46/2d96f0] Submitted process > 4_rnaseq_gatk_recalibrate (ENCSR000CPO1)
[85/6b9527] Submitted process > 4_rnaseq_gatk_recalibrate (ENCSR000CPO2)
[79/a7fb48] Submitted process > 5_rnaseq_call_variants (ENCSR000CPO)
[a5/29c017] Submitted process > 5_rnaseq_call_variants (ENCSR000COQ)
[22/1fdea2] Submitted process > 5_rnaseq_call_variants (ENCSR000COR)
[7d/e1adfb] Submitted process > 6A_post_process_vcf (ENCSR000CPO)
[0a/4d43fc] Submitted process > 6A_post_process_vcf (ENCSR000COQ)
[18/8d486b] Submitted process > 6A_post_process_vcf (ENCSR000COR)
[60/427153] Submitted process > 6B_prepare_vcf_for_ase (ENCSR000CPO)
[32/64eff0] Submitted process > 6B_prepare_vcf_for_ase (ENCSR000COQ)
[31/32ad40] Submitted process > 6B_prepare_vcf_for_ase (ENCSR000COR)
[6f/a5e211] Submitted process > 6C_ASE_knownSNPs (ENCSR000COR)
[ff/989dc1] Submitted process > 6C_ASE_knownSNPs (ENCSR000CPO)
[25/92875a] Submitted process > 6C_ASE_knownSNPs (ENCSR000COQ)

You can notice that this time the pipeline spawns the execution of more tasks because three samples have been provided instead of one.

This shows the ability of Nextflow to implicitly handle multiple parallel task executions depending on the specified pipeline input dataset.

Files

implementation.adoc

Latest commit

History

implementation.adoc

File metadata and controls

Pipeline Implementaton

Data preparation

Input parameters

Problem #1

Process 1A

Create a FASTA genome index

Problem #2

Process 1B

Create a FASTA genome sequence dictionary with Picard for GATK

Problem #3

Process 1C

Create STAR genome index file

Problem #4

Process 1D

Filtered and recoded set of variants

Problem #5

Process 2

STAR Mapping

Problem #6

Process 3

GATK Split on N

Problem #7

Process 4

GATK Recalibrate

Problem #8

Process 5

GATK Variant Calling

Problem #9

Processes 6A and 6B

ASE & RNA Editing

Problem #10

Problem #11

Process 6C

Allele-Specific Expression analysis with GATK ASEReadCounter

Problem #12

Bonus step