Can we standardize the output directory a bit more to include sample ID? And/or add a sample-ID parameter to the input CSV? #103

Closed
Alfredo-Enrique opened this issue Jul 11, 2023 · 8 comments
Alfredo-Enrique commented Jul 11, 2023

While running a large batch of samples through the metapipeline, I noticed that call-sSV is the only pipeline without a fully standardized output folder structure. Its main output folder is named after the input BAM file (e.g. BWA-MEM2-2.2.1_GATK-4.2.4.1_TCGA-STSA_H-MV-3B-A9HS-01A-11D-A38Z-09) rather than the sample ID we use for the other pipelines.

This makes the output nonstandard for future analysis scripts: we will have to parse additional metadata to match each input file name to a sample, instead of using our internal sample IDs directly.

Best,

EXAMPLE
call-sSV-5.0.0/
├── BWA-MEM2-2.2.1_GATK-4.2.4.1_TCGA-STSA_H-MV-3B-A9HS-01A-11D-A38Z-09
├── BWA-MEM2-2.2.1_GATK-4.2.4.1_TCGA-STSA_H-MV-DX-A23Y-01A-11D-A27P-09
├── BWA-MEM2-2.2.1_GATK-4.2.4.1_TCGA-STSA_H-MV-DX-A240-01A-32D-A27P-09
...
call-gSNP-10.0.0-rc.1/
├── TCGASTSA000001-T001-P01-P
├── TCGASTSA000002-T001-P01-P
├── TCGASTSA000003-T001-P01-P
...
call-mtSNV-3.0.0/
├── TCGASTSA000001-T001-P01-P
├── TCGASTSA000002-T001-P01-P
├── TCGASTSA000003-T001-P01-P

I believe #65 is related.
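
To illustrate the extra bookkeeping this implies for downstream scripts, here is a minimal Python sketch. The lookup table, its placeholder entries, and the function name `resolve_sample_id` are all hypothetical; in practice the mapping would have to come from project metadata, because the BAM-derived folder name does not contain the internal sample ID.

```python
from typing import Dict, Optional

# Hypothetical metadata: BAM-derived folder name -> internal sample ID.
# Placeholder values only; real pairs would come from project metadata.
BAM_NAME_TO_SAMPLE_ID: Dict[str, str] = {
    "ALIGNER-1.0_CALLER-1.0_DATASET_SAMPLE-A": "DATASET000001-T001-P01-P",
    "ALIGNER-1.0_CALLER-1.0_DATASET_SAMPLE-B": "DATASET000002-T001-P01-P",
}


def resolve_sample_id(pipeline: str, folder: str,
                      lookup: Dict[str, str]) -> Optional[str]:
    """Return the sample ID for one per-sample output folder.

    Pipelines that already name folders by sample ID need no lookup;
    call-sSV folders (named after the tumor BAM) need the extra mapping step.
    """
    if pipeline.startswith("call-sSV"):
        return lookup.get(folder)  # None when the metadata has no entry
    return folder
```

With sample-ID folder names everywhere, the `call-sSV` branch and the lookup table would not be needed at all.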

Alfredo-Enrique changed the title from "Can we standardize the output directory a bit more to include patient ID? And/or add a patient folder parameter to the input CSV?" to "Can we standardize the output directory a bit more to include sample ID? And/or add a sample-ID parameter to the input CSV?" on Jul 12, 2023
@Faizal-Eeman
Contributor

Thanks for bringing this up @Alfredo-Enrique. We had temporarily held back on this part of the standardization; we can finalize it and include it in the next release.

@Faizal-Eeman
Contributor

@Alfredo-Enrique @tyamaguchi-ucla @yashpatel6

I just checked call-sSV's test results for release v5.0.0, and I see that the sample-level directory structure already exists:

/hot/software/pipeline/pipeline-call-sSV/Nextflow/development/5.0.0/mmootor-release-5-0-0/test_5.0.0_rc.1/call-sSV-5.0.0-rc.1/ILHNLNEV000009-T002-L01-F_realigned_recalibrated_reheadered/

Is it perhaps the way it's written from the metapipeline?

P.S. My earlier comment, quoted below, addressed issue #65, which is different from the current issue:

Thanks for bringing this up @Alfredo-Enrique. We had temporarily held back on this part of the standardization; we can finalize it and include it in the next release.

@Alfredo-Enrique
Author

Hmm, looking at the code, I'm pretty sure that right now it's just taking the basename of the tumor BAM. I don't see anything in the pipeline code or the input files where we specify a sample ID.

For example, here is the input CSV of that run:

normal_bam,tumor_bam
/hot/project/disease/HeadNeckTumor/HNSC-000084-LNMEvolution/pipelines/call-gSNP/2020-12-22/ILHNLNEV000009-T002-L01-F/gSNP/2021-01-22_11.01.06/ILHNLNEV000009/SAMtools-1.10_Picard-2.23.3/recalibrated_reheadered_bam_and_bai/ILHNLNEV000009-N001-B01-F_realigned_recalibrated_reheadered.bam,/hot/project/disease/HeadNeckTumor/HNSC-000084-LNMEvolution/pipelines/call-gSNP/2020-12-22/ILHNLNEV000009-T002-L01-F//gSNP/2021-01-22_11.01.06/ILHNLNEV000009/SAMtools-1.10_Picard-2.23.3/recalibrated_reheadered_bam_and_bai/ILHNLNEV000009-T002-L01-F_realigned_recalibrated_reheadered.bam

And if you look at the config, there is nowhere to specify a sample ID:
/hot/software/pipeline/pipeline-call-sSV/Nextflow/development/5.0.0/mmootor-release-5-0-0/test_5.0.0_rc.1/test_5.0.0-rc.1_ILHNLNEV000009-T002-L01-F.config

// Inputs/parameters of the pipeline
params {
    dataset_id = "ILHNLNEV"

    blcds_registered_dataset = false
   
    input_csv = "/hot/user/rhughwhite/ILHNLNEV/call-sSV/test_5.0.0_rc.1/test_5.0.0_ILHNLNEV000009-T002-L01-F_tumor_control_pair.csv"

    reference_fasta = "/hot/ref/reference/GRCh38-BI-20160721/Homo_sapiens_assembly38.fasta"

    exclusion_file = "/hot/ref/tool-specific-input/Delly/hg38/human.hg38.excl.tsv"

    output_dir = "/hot/users/rhughwhite/ILHNLNEV/call-sSV/test_5.0.0_rc.1"

    // select the tool(s) to run
    algorithm = ['delly', 'manta']
    
    save_intermediate_files = false

    verbose = false

    /**
    * Set up the Delly filtering parameters
    * The below default values are recommended to reduce runtimes.
    * See - https://github.com/dellytools/delly 'Delly is running too slowly what can I do?' for more 
    */
    map_qual = 20
    min_clique_size = 5
    mad_cutoff = 15

    /** 
    * The filter condition used by the filter_BCF_BCFtools process. 
    * See http://samtools.github.io/bcftools/bcftools.html#expressions 
    * Note, put single quotes inside double quotes. 
    */
    filter_condition = "FILTER=='PASS'"
}

@Alfredo-Enrique
Author

Alfredo-Enrique commented Jul 17, 2023

Found it. Right now we're just taking the file name. This is the code that generates the input channel for the downstream modules; on line 107 you can see that we just take the file name of the tumor BAM.

pipeline-call-sSV/main.nf

Lines 102 to 111 in 0c8def0

input_paired_bams_ch = Channel
    .fromPath(params.input_csv, checkIfExists:true)
    .splitCsv(header:true)
    .map{
        row -> tuple(
            Paths.get(row.tumor_bam).getFileName().toString().split('.bam')[0],
            row.tumor_bam,
            "${row.tumor_bam}.bai",
            row.normal_bam,
            "${row.normal_bam}.bai"

This is the 5-value tuple fed as input to the modules (A). The first value of the tuple is what our generate_standard_filename call uses (B):

A:

input:
    tuple(val(tumor_id), path(tumor_bam), path(tumor_bai), path(normal_bam), path(normal_bai))

B:

script:
    output_filename = generate_standard_filename(
        "DELLY-${params.delly_version}",
        params.dataset_id,
        tumor_id,
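
For reference, the ID derivation described above can be reproduced outside Nextflow. This is a hedged Python sketch, not the pipeline's code: the function names are made up for illustration, the `/data/` paths are shortened placeholders for the real BAM locations, and `re.split` is used because Groovy's `String.split` delegates to Java's regex-based `split`, so `'.bam'` is treated as a pattern there.

```python
import csv
import io
import re
from pathlib import PurePosixPath
from typing import List


def tumor_id_from_bam(tumor_bam: str) -> str:
    """Mirror getFileName().toString().split('.bam')[0] from main.nf."""
    file_name = PurePosixPath(tumor_bam).name
    # Java's String.split takes a regex, so '.' matches any character.
    return re.split(r".bam", file_name)[0]


def tumor_ids_from_csv(csv_text: str) -> List[str]:
    """Derive one ID per row of an input CSV that has a tumor_bam column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [tumor_id_from_bam(row["tumor_bam"]) for row in reader]


# Shortened placeholder paths; only the file names match the run above.
csv_text = (
    "normal_bam,tumor_bam\n"
    "/data/ILHNLNEV000009-N001-B01-F_realigned_recalibrated_reheadered.bam,"
    "/data/ILHNLNEV000009-T002-L01-F_realigned_recalibrated_reheadered.bam\n"
)
print(tumor_ids_from_csv(csv_text))
```

The derived value is the tumor BAM file name without its extension, which is the same string as the sample-level folder in the test output path quoted earlier in this thread.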

@Faizal-Eeman
Contributor

Parsing the sample ID from the BAM file name was intentional, and the output directory structure is determined in methods.config:

set_output_dir = {
    def sample
    // assumes that project and samples name are in the pipeline.config
    def reader = new FileReader(params.input_csv)
    reader.splitEachLine(',') { parts -> [sample = parts[1].split('/')[-1].split('.bam')[0]] }
    params.sample = "${sample}"
    params.output_dir_base = "${params.output_dir}/${manifest.name}-${manifest.version}/${params.sample}"
}

Anyway, I see the concern now. The actual sample ID should be used instead of an ID parsed from the BAM file name.
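
One way to read this is as a fallback rule: use an explicit sample ID when the input CSV provides one, and otherwise keep today's BAM-derived name. The Python sketch below only illustrates that logic. The `sample_id` column name and both function names are assumptions, not part of the pipeline, and the directory layout simply follows the `output_dir_base` pattern in `set_output_dir` above.

```python
import csv
import io
from pathlib import PurePosixPath
from typing import Dict


def pick_sample(row: Dict[str, str]) -> str:
    """Prefer a hypothetical sample_id column; fall back to the BAM name."""
    explicit = (row.get("sample_id") or "").strip()
    if explicit:
        return explicit
    name = PurePosixPath(row["tumor_bam"]).name
    # Strip a literal ".bam" suffix only (no regex involved here).
    return name[:-len(".bam")] if name.endswith(".bam") else name


def output_dir_base(output_dir: str, name: str, version: str, sample: str) -> str:
    """Compose <output_dir>/<name>-<version>/<sample>, as in set_output_dir."""
    return f"{output_dir}/{name}-{version}/{sample}"


# Placeholder input: one row with the hypothetical sample_id column filled in.
example_csv = (
    "sample_id,normal_bam,tumor_bam\n"
    "DATASET000001-T001-P01-P,/data/normal.bam,/data/tumor_realigned.bam\n"
)
for row in csv.DictReader(io.StringIO(example_csv)):
    print(output_dir_base("/out", "call-sSV", "5.0.0", pick_sample(row)))
```

Keeping the BAM-derived name as a fallback would leave existing input CSVs working unchanged, at the cost of two possible naming schemes in the same output tree.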

@Faizal-Eeman
Contributor

@Alfredo-Enrique, can you share an example path to the metapipeline output?

@Alfredo-Enrique
Author

Yes, happy to, @Faizal-Eeman!
Config: /hot/user/alfgonzalez/project/project-LAND-SARC/project-SARC-LAND/pipeline/TCGA-STSA/metapipeline-DNA-5.0.0-rc.5/input/test_WXS_TCGA-STSA_meta.input.config
metapipeline_output_folder: /hot/project/disease/SarcomaTumor/SARC-000118-SarcomaLandscape/data/TCGA-STSA/WXS/
The more direct folder within that is here: /hot/project/disease/SarcomaTumor/SARC-000118-SarcomaLandscape/data/TCGA-STSA/WXS/metapipeline-DNA-5.0.0-rc.4/TCGA-STSA/main_workflow/output

Let me know if you have any questions or if I can help in any way!
