Can we standardize the output directory a bit more to include sample ID? And/or add a sample-ID parameter to the input CSV? #103

Alfredo-Enrique · 2023-07-11T06:43:02Z

While running a large batch of samples through the metapipeline, I noticed that call-sSV is the only pipeline whose output folder structure is not fully standardized. Its main output folder is named after the input BAM file (e.g. BWA-MEM2-2.2.1_GATK-4.2.4.1_TCGA-STSA_H-MV-3B-A9HS-01A-11D-A38Z-09) rather than the sample ID we use for the other pipelines.

This makes the output nonstandard for future analysis scripts: we will have to do additional metadata parsing to match each input file name to a sample, instead of using our internal sample IDs directly.

Best,

EXAMPLE
call-sSV-5.0.0/
├── BWA-MEM2-2.2.1_GATK-4.2.4.1_TCGA-STSA_H-MV-3B-A9HS-01A-11D-A38Z-09
├── BWA-MEM2-2.2.1_GATK-4.2.4.1_TCGA-STSA_H-MV-DX-A23Y-01A-11D-A27P-09
├── BWA-MEM2-2.2.1_GATK-4.2.4.1_TCGA-STSA_H-MV-DX-A240-01A-32D-A27P-09
...
call-gSNP-10.0.0-rc.1/
├── TCGASTSA000001-T001-P01-P
├── TCGASTSA000002-T001-P01-P
├── TCGASTSA000003-T001-P01-P
...
call-mtSNV-3.0.0/
├── TCGASTSA000001-T001-P01-P
├── TCGASTSA000002-T001-P01-P
├── TCGASTSA000003-T001-P01-P

I believe #65 is related.

Alfredo-Enrique · 2023-07-12T21:00:21Z

@Faizal-Eeman @yashpatel6 @tyamaguchi-ucla

Faizal-Eeman · 2023-07-12T21:47:48Z

Thanks for bringing this up @Alfredo-Enrique. We had temporarily held back on this part of the standardization; we can have it finalized and added in the very next release.

Faizal-Eeman · 2023-07-17T22:17:58Z

@Alfredo-Enrique @tyamaguchi-ucla @yashpatel6

I just checked call-sSV's test results for release v5.0.0 and I see the sample level dir structure already exists.

/hot/software/pipeline/pipeline-call-sSV/Nextflow/development/5.0.0/mmootor-release-5-0-0/test_5.0.0_rc.1/call-sSV-5.0.0-rc.1/ILHNLNEV000009-T002-L01-F_realigned_recalibrated_reheadered/

Is it perhaps the way it's written from the meta-pipeline?

P.S. My earlier comment, quoted below, was addressing issue #65, which is different from the current issue:

"Thanks for bringing this up @Alfredo-Enrique. We had temporarily held back on this part of the standardization; we can have it finalized and added in the very next release."

Alfredo-Enrique · 2023-07-17T23:09:14Z

Hmm, looking at the code, I'm pretty sure that right now it just takes the basename of the tumor BAM. I don't see anywhere in the pipeline code or the input files where we specify a sample ID.

For example here is the input csv of that run!

normal_bam,tumor_bam
/hot/project/disease/HeadNeckTumor/HNSC-000084-LNMEvolution/pipelines/call-gSNP/2020-12-22/ILHNLNEV000009-T002-L01-F/gSNP/2021-01-22_11.01.06/ILHNLNEV000009/SAMtools-1.10_Picard-2.23.3/recalibrated_reheadered_bam_and_bai/ILHNLNEV000009-N001-B01-F_realigned_recalibrated_reheadered.bam,/hot/project/disease/HeadNeckTumor/HNSC-000084-LNMEvolution/pipelines/call-gSNP/2020-12-22/ILHNLNEV000009-T002-L01-F//gSNP/2021-01-22_11.01.06/ILHNLNEV000009/SAMtools-1.10_Picard-2.23.3/recalibrated_reheadered_bam_and_bai/ILHNLNEV000009-T002-L01-F_realigned_recalibrated_reheadered.bam

Then, if you look at the config, there is nowhere to specify a sample ID:
/hot/software/pipeline/pipeline-call-sSV/Nextflow/development/5.0.0/mmootor-release-5-0-0/test_5.0.0_rc.1/test_5.0.0-rc.1_ILHNLNEV000009-T002-L01-F.config

// Inputs/parameters of the pipeline
params {
    dataset_id = "ILHNLNEV"

    blcds_registered_dataset = false
   
    input_csv = "/hot/user/rhughwhite/ILHNLNEV/call-sSV/test_5.0.0_rc.1/test_5.0.0_ILHNLNEV000009-T002-L01-F_tumor_control_pair.csv"

    reference_fasta = "/hot/ref/reference/GRCh38-BI-20160721/Homo_sapiens_assembly38.fasta"

    exclusion_file = "/hot/ref/tool-specific-input/Delly/hg38/human.hg38.excl.tsv"

    output_dir = "/hot/users/rhughwhite/ILHNLNEV/call-sSV/test_5.0.0_rc.1"

    // select the tool(s) to run
    algorithm = ['delly', 'manta']
    
    save_intermediate_files = false

    verbose = false

    /**
    * Set up the Delly filtering parameters
    * The below default values are recommended to reduce runtimes.
    * See - https://github.com/dellytools/delly 'Delly is running too slowly what can I do?' for more 
    */
    map_qual = 20
    min_clique_size = 5
    mad_cutoff = 15

    /** 
    * The filter condition used by the filter_BCF_BCFtools process. 
    * See http://samtools.github.io/bcftools/bcftools.html#expressions 
    * Note, put single quotes inside double quotes. 
    */
    filter_condition = "FILTER=='PASS'"
}

Alfredo-Enrique · 2023-07-17T23:27:16Z

Found it. Right now we're just taking the file name. This is the code that generates the input channel for the downstream modules; on line 107 you can see that we just take the file name.

pipeline-call-sSV/main.nf

Lines 102 to 111 in 0c8def0

input_paired_bams_ch = Channel
    .fromPath(params.input_csv, checkIfExists:true)
    .splitCsv(header:true)
    .map{
        row -> tuple(
            Paths.get(row.tumor_bam).getFileName().toString().split('.bam')[0],
            row.tumor_bam,
            "${row.tumor_bam}.bai",
            row.normal_bam,
            "${row.normal_bam}.bai"

This is the 5-value tuple being fed as input to the modules (A), and the first value of the tuple is what's being used in our generate_standard_filename code excerpt (B):

A:

pipeline-call-sSV/module/delly.nf

Lines 26 to 27 in 0c8def0

input:
    tuple(val(tumor_id), path(tumor_bam), path(tumor_bai), path(normal_bam), path(normal_bai))

B:

pipeline-call-sSV/module/delly.nf

Lines 39 to 43 in 0c8def0

script:
    output_filename = generate_standard_filename(
        "DELLY-${params.delly_version}",
        params.dataset_id,
        tumor_id,
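To make the effect of line 107 in main.nf concrete, here is a rough Python sketch of what that expression appears to compute. It is an illustration, not the pipeline's code: `tumor_id_from_bam` is a hypothetical helper name and the `/path/to/` directory is made up. One detail worth flagging: in Groovy/Java, `String.split('.bam')` treats its argument as a regular expression, so the `.` matches any character. The sketch uses `re.split` to mirror that behavior.

```python
import re
from pathlib import PurePosixPath


def tumor_id_from_bam(tumor_bam: str) -> str:
    """Mimic the excerpt: take the file name, split on the regex '.bam', keep the first piece.

    Hypothetical helper for illustration only.
    """
    file_name = PurePosixPath(tumor_bam).name
    # Unescaped '.' on purpose: Java/Groovy String.split() takes a regex,
    # so '.bam' means "any character followed by 'bam'".
    return re.split(r".bam", file_name)[0]


# Illustrative path; only the file name is taken from the thread.
example = "/path/to/ILHNLNEV000009-T002-L01-F_realigned_recalibrated_reheadered.bam"
print(tumor_id_from_bam(example))
```

With a file name like this one, the result is the whole BAM file name minus its extension, which is why the output folder carries the `_realigned_recalibrated_reheadered` suffix (or the `BWA-MEM2-...` prefix for metapipeline inputs) instead of a bare sample ID. Because the split pattern is a regex, a file name that happens to contain `bam` earlier on would also be cut short at that point.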

Faizal-Eeman · 2023-07-18T00:11:09Z

Parsing the sample ID from the BAM file name was intentional, and the output directory structure is determined in methods.config:

pipeline-call-sSV/config/methods.config

Lines 23 to 33 in 0c8def0

set_output_dir = {
    def sample
    // assumes that project and samples name are in the pipeline.config
    def reader = new FileReader(params.input_csv)
    reader.splitEachLine(',') { parts -> [sample = parts[1].split('/')[-1].split('.bam')[0]] }
    params.sample = "${sample}"
    params.output_dir_base = "${params.output_dir}/${manifest.name}-${manifest.version}/${params.sample}"
}

Anyway, I see the concern now. The actual sample ID should be used instead of an ID parsed from the BAM file name.
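One possible source for the actual sample ID is the `SM` tag on the `@RG` lines of the BAM header, which the SAM format defines as the sample name. The Python sketch below shows the idea on header text (for example, as printed by `samtools view -H`). It is only an assumption about how this could be done, not the pipeline's implementation: `sample_ids_from_header` is a hypothetical helper and the header lines are made up for illustration.

```python
from typing import List


def sample_ids_from_header(header_text: str) -> List[str]:
    """Collect the unique SM values from @RG lines, in order of first appearance.

    Hypothetical helper for illustration only.
    """
    sample_ids: List[str] = []
    for line in header_text.splitlines():
        if not line.startswith("@RG"):
            continue
        # Header fields are tab-separated TAG:VALUE pairs after the record type.
        for field in line.split("\t")[1:]:
            if field.startswith("SM:") and field[3:] not in sample_ids:
                sample_ids.append(field[3:])
    return sample_ids


# Made-up header; the sample ID format follows the folder names shown earlier.
header = "\n".join([
    "@HD\tVN:1.6\tSO:coordinate",
    "@RG\tID:lane1\tSM:TCGASTSA000001-T001-P01-P\tPL:ILLUMINA",
    "@RG\tID:lane2\tSM:TCGASTSA000001-T001-P01-P\tPL:ILLUMINA",
])
print(sample_ids_from_header(header))
```

A BAM with more than one distinct `SM` value would return several IDs here, so a real implementation would need to decide whether to reject such files or pick one.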

Faizal-Eeman · 2023-07-18T18:44:41Z

@Alfredo-Enrique can you share an example path to meta pipeline output?

Alfredo-Enrique · 2023-07-24T17:43:46Z

Yes, happy to, @Faizal-Eeman!
Config: /hot/user/alfgonzalez/project/project-LAND-SARC/project-SARC-LAND/pipeline/TCGA-STSA/metapipeline-DNA-5.0.0-rc.5/input/test_WXS_TCGA-STSA_meta.input.config
metapipeline_output_folder: /hot/project/disease/SarcomaTumor/SARC-000118-SarcomaLandscape/data/TCGA-STSA/WXS/
The more direct folder within that is here: /hot/project/disease/SarcomaTumor/SARC-000118-SarcomaLandscape/data/TCGA-STSA/WXS/metapipeline-DNA-5.0.0-rc.4/TCGA-STSA/main_workflow/output

Let me know if you have any questions or if I can help in any way!

Alfredo-Enrique mentioned this issue Jul 11, 2023

Standardize input CSV #69

Closed

8 tasks

Alfredo-Enrique changed the title ~~Can we standardize the output directory a bit more to include patient ID? And/or add a patient folder parameter to the input CSV?~~ Can we standardize the output directory a bit more to include sample ID? And/or add a sample-ID parameter to the input CSV? Jul 12, 2023

tyamaguchi-ucla assigned Faizal-Eeman Jul 12, 2023

Faizal-Eeman mentioned this issue Jul 18, 2023

Parse Patient ID or Tumor ID from BAM file #105

Closed

Faizal-Eeman mentioned this issue Jul 25, 2023

Replace input CSV with YAML and parse sample ID from BAM #106

Merged

8 tasks

Faizal-Eeman closed this as completed Aug 1, 2023
