Input files for Sarek are specified in a TSV file passed to the --sample
parameter. The TSV file is a tab-separated values file with the columns subject gender status sample lane fastq1 fastq2 (for FASTQ input)
or subject gender status sample bam bai (for BAM input).
The content of these columns should be fairly straightforward:
subject designates the subject. It should be the ID of the patient or, if you don't have one, the ID of the normal sample.
gender is the gender of the patient (XX or XY).
status is the status of the sample (0 for normal, 1 for tumor).
sample designates the sample. It should be the ID of the sample (a patient can have more than one tumor sample).
lane is used when the sample is multiplexed on several lanes.
fastq1 is the path to the first file of the FASTQ pair.
fastq2 is the path to the second file of the FASTQ pair.
bam is the path to the BAM file.
bai is the path to the BAM index.
All examples are given for a normal/tumor pair. If no tumors are listed in the TSV file, then the workflow will proceed as if it was a single normal sample instead of a normal/tumor pair.
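The status column is what distinguishes the two cases. A minimal sketch of that check, assuming the rows have already been split on tabs; the function name `has_tumor` and the shortened file paths are hypothetical and not part of Sarek:

```python
def has_tumor(rows):
    """Return True if any row has status 1 (tumor); status is the third column."""
    return any(row[2] == "1" for row in rows)


# Placeholder rows in the BAM layout: subject gender status sample bam bai
pair = [
    ["G15511", "XX", "0", "C09DFN", "pathToFiles/normal.bam", "pathToFiles/normal.bai"],
    ["G15511", "XX", "1", "D0ENMT", "pathToFiles/tumor.bam", "pathToFiles/tumor.bai"],
]
normal_only = pair[:1]

print(has_tumor(pair))         # True
print(has_tumor(normal_only))  # False
```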
In this example the normal sample has 3 read groups and the tumor sample has 2. Absolute paths to the paired FASTQ files are recommended, but relative paths should also work. Note that the delimiter is the tab (\t) character:
G15511 XX 0 C09DFN C09DF_1 pathToFiles/C09DFACXX111207.1_1.fastq.gz pathToFiles/C09DFACXX111207.1_2.fastq.gz
G15511 XX 0 C09DFN C09DF_2 pathToFiles/C09DFACXX111207.2_1.fastq.gz pathToFiles/C09DFACXX111207.2_2.fastq.gz
G15511 XX 0 C09DFN C09DF_3 pathToFiles/C09DFACXX111207.3_1.fastq.gz pathToFiles/C09DFACXX111207.3_2.fastq.gz
G15511 XX 1 D0ENMT D0ENM_1 pathToFiles/D0ENMACXX111207.1_1.fastq.gz pathToFiles/D0ENMACXX111207.1_2.fastq.gz
G15511 XX 1 D0ENMT D0ENM_2 pathToFiles/D0ENMACXX111207.2_1.fastq.gz pathToFiles/D0ENMACXX111207.2_2.fastq.gz
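Because the tab delimiter is easy to lose when copying, it can help to check a TSV before launching the pipeline. The following is a minimal sketch, not part of Sarek: the function names `parse_sample_tsv`, `read_groups_per_sample` and `fastq_row` are hypothetical, and the checks only cover the column counts and the gender and status values described above. It rebuilds the example rows with explicit tabs and counts the rows (read groups) per sample:

```python
import csv
import io
from collections import Counter

FASTQ_COLUMNS = ["subject", "gender", "status", "sample", "lane", "fastq1", "fastq2"]
BAM_COLUMNS = ["subject", "gender", "status", "sample", "bam", "bai"]


def parse_sample_tsv(text):
    """Parse tab-separated rows into dictionaries keyed by column name."""
    rows = []
    for fields in csv.reader(io.StringIO(text), delimiter="\t"):
        if not fields:
            continue  # skip blank lines
        if len(fields) == len(FASTQ_COLUMNS):
            columns = FASTQ_COLUMNS
        elif len(fields) == len(BAM_COLUMNS):
            columns = BAM_COLUMNS
        else:
            raise ValueError(f"expected 6 or 7 tab-separated fields, got {len(fields)}")
        row = dict(zip(columns, fields))
        if row["gender"] not in ("XX", "XY"):
            raise ValueError(f"gender must be XX or XY, got {row['gender']!r}")
        if row["status"] not in ("0", "1"):
            raise ValueError(f"status must be 0 or 1, got {row['status']!r}")
        rows.append(row)
    return rows


def read_groups_per_sample(rows):
    """Count rows (one per lane) for each sample."""
    return Counter(row["sample"] for row in rows)


def fastq_row(status, sample, prefix, lane):
    """Build one row of the example above, joined with real tab characters."""
    stem = f"pathToFiles/{prefix}ACXX111207.{lane}"
    return "\t".join(["G15511", "XX", status, sample, f"{prefix}_{lane}",
                      f"{stem}_1.fastq.gz", f"{stem}_2.fastq.gz"])


text = "\n".join(
    [fastq_row("0", "C09DFN", "C09DF", n) for n in (1, 2, 3)]
    + [fastq_row("1", "D0ENMT", "D0ENM", n) for n in (1, 2)]
)
counts = read_groups_per_sample(parse_sample_tsv(text))
print(dict(counts))  # {'C09DFN': 3, 'D0ENMT': 2}
```

A row pasted with spaces instead of tabs arrives as a single field, so it fails the column-count check rather than being silently misread.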
On the other hand, if you have BAMs (T/N pairs that were not realigned together) and their indexes, you should use a structure like:
G15511 XX 0 C09DFN pathToFiles/G15511.C09DFN.md.real.bam pathToFiles/G15511.C09DFN.md.real.bai
G15511 XX 1 D0ENMT pathToFiles/G15511.D0ENMT.md.real.bam pathToFiles/G15511.D0ENMT.md.real.bai
All the files will be created in the Preprocessing/NonRealigned/ directory, and by default a corresponding TSV file will also be deposited there. Generally, MuTect2 and Strelka calls on the preprocessed files are obtained with:
nextflow run /shared/ucl/depts/cancer/apps/nextflow_pipelines/Sarek/main.nf -profile legion --sample Preprocessing/NonRealigned/mysample.tsv --step realign
nextflow run /shared/ucl/depts/cancer/apps/nextflow_pipelines/Sarek/somaticVC.nf -profile legion --sample Preprocessing/Recalibrated/mysample.tsv --tools Mutect2,Strelka
In the same way, if you have recalibrated BAMs (T/N pairs that were realigned together) and their indexes, you should use a structure like:
G15511 XX 0 C09DFN pathToFiles/G15511.C09DFN.md.real.bam pathToFiles/G15511.C09DFN.md.real.bai
G15511 XX 1 D0ENMT pathToFiles/G15511.D0ENMT.md.real.bam pathToFiles/G15511.D0ENMT.md.real.bai
All the files will be in the Preprocessing/Recalibrated/ directory, and by default a corresponding TSV file will also be deposited there. Generally, MuTect2 and Strelka calls on the recalibrated files are obtained with:
nextflow run /shared/ucl/depts/cancer/apps/nextflow_pipelines/Sarek/somaticVC.nf -profile legion --sample Preprocessing/Recalibrated/mysample.tsv --tools Mutect2,Strelka
The input folder, containing the FASTQ files for one individual (ID), should be organized into one subfolder per sample. All FASTQ files for a sample should be collected in that sample's subfolder.
ID
+--sample1
+------sample1_lib_flowcell-index_lane_R1_1000.fastq.gz
+------sample1_lib_flowcell-index_lane_R2_1000.fastq.gz
+------sample1_lib_flowcell-index_lane_R1_1000.fastq.gz
+------sample1_lib_flowcell-index_lane_R2_1000.fastq.gz
+--sample2
+------sample2_lib_flowcell-index_lane_R1_1000.fastq.gz
+------sample2_lib_flowcell-index_lane_R2_1000.fastq.gz
+--sample3
+------sample3_lib_flowcell-index_lane_R1_1000.fastq.gz
+------sample3_lib_flowcell-index_lane_R2_1000.fastq.gz
+------sample3_lib_flowcell-index_lane_R1_1000.fastq.gz
+------sample3_lib_flowcell-index_lane_R2_1000.fastq.gz
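The layout above can be walked programmatically to pair each R1 file with its R2 mate. This is only a sketch under stated assumptions: the function name `collect_fastq_pairs` is hypothetical, mates are assumed to differ only in the `_R1_` / `_R2_` tag, and the demo file names (library, flow cell, index and lane values) are invented for illustration.

```python
import tempfile
from pathlib import Path


def collect_fastq_pairs(individual_dir):
    """Map each sample subfolder name to a sorted list of (R1, R2) path pairs."""
    pairs = {}
    for sample_dir in sorted(Path(individual_dir).iterdir()):
        if not sample_dir.is_dir():
            continue
        sample_pairs = []
        for r1 in sorted(sample_dir.glob("*_R1_*.fastq.gz")):
            # Swap only the last "_R1_" tag, in case the sample ID contains one too.
            r2 = r1.with_name("_R2_".join(r1.name.rsplit("_R1_", 1)))
            if not r2.exists():
                raise FileNotFoundError(f"missing R2 mate for {r1}")
            sample_pairs.append((r1, r2))
        pairs[sample_dir.name] = sample_pairs
    return pairs


# Demo on a throwaway directory tree with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    sample_dir = Path(tmp) / "ID" / "sample1"
    sample_dir.mkdir(parents=True)
    for lane in ("1", "2"):
        for read in ("R1", "R2"):
            (sample_dir / f"sample1_libA_FC1-ACGT_{lane}_{read}_1000.fastq.gz").touch()
    for sample, found in collect_fastq_pairs(Path(tmp) / "ID").items():
        for r1, r2 in found:
            print(sample, r1.name, r2.name)
```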
FASTQ filename structure:
sample_lib_flowcell-index_lane_R1_1000.fastq.gz
and
sample_lib_flowcell-index_lane_R2_1000.fastq.gz
Where:
sample = sample ID
lib = identifier of the library preparation
flowcell = identifier of the flow cell for the sequencing run
lane = identifier of the lane of the sequencing run
Read group information will be parsed from the FASTQ file names as follows:
RGID = "sample_lib_flowcell_index_lane"
RGPL = "Illumina"
PU = sample
RGLB = lib
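The mapping from file name to read group fields can be illustrated as follows. This is a sketch of one reading of the rules above, not Sarek's own parsing code: the function name `read_group_from_fastq_name` is hypothetical, the fields are assumed to contain no underscores, the flow cell and index are assumed to be separated by a single hyphen, and the trailing `1000` field is ignored because its meaning is not stated here. The example file name is invented.

```python
import os


def read_group_from_fastq_name(path):
    """Derive read group fields from sample_lib_flowcell-index_lane_R1_1000.fastq.gz."""
    name = os.path.basename(path)
    suffix = ".fastq.gz"
    if not name.endswith(suffix):
        raise ValueError(f"not a .fastq.gz file: {name}")
    parts = name[: -len(suffix)].split("_")
    if len(parts) != 6:
        raise ValueError(f"expected 6 underscore-separated fields, got {len(parts)}: {name}")
    sample, lib, flowcell_index, lane, _read, _trailing = parts
    flowcell, hyphen, index = flowcell_index.partition("-")
    if not hyphen:
        raise ValueError(f"expected flowcell-index, got {flowcell_index!r}")
    return {
        "RGID": "_".join([sample, lib, flowcell, index, lane]),
        "RGPL": "Illumina",
        "PU": sample,
        "RGLB": lib,
    }


rg = read_group_from_fastq_name("pathToFiles/tumor1_libA_C09DFACXX-ACGT_3_R1_1000.fastq.gz")
print(rg["RGID"])  # tumor1_libA_C09DFACXX_ACGT_3
print(rg["RGLB"])  # libA
```

Both files of a pair yield the same read group, since the R1/R2 tag is not part of any field.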