Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace input CSV with YAML and parse sample ID from BAM #106

Merged
merged 20 commits into from
Aug 1, 2023

Conversation

Faizal-Eeman
Copy link
Contributor

@Faizal-Eeman Faizal-Eeman commented Jul 25, 2023

Description

  • Replace CSV with YAML input
  • Parse YAML input and organize input channels
  • Parse sample ID from tumor BAM and use it to name pipeline output directory

Closes

The following related issues,
#103
#104
#105

Testing Results

  • DNA A-mini TWGSAMIN000001-T003-S03-F
    • internal id: TWGSAMIN000001-N003-S03-F, TWGSAMIN000001-T003-S03-F.bam
    • sample: HG002.N-0, S2.T-0 (sample names in BAM file)
    • input YAML: /hot/software/pipeline/pipeline-call-sSV/Nextflow/development/input/yaml/call-sSV-input-TWGSAMIN000001-T003-S03-F.yaml
    • config: /hot/software/pipeline/pipeline-call-sSV/Nextflow/development/unreleased/mmootor-replace-input-csv-with-yaml/TWGSAMIN000001-T003-S03-F.config
    • output: /hot/software/pipeline/pipeline-call-sSV/Nextflow/development/unreleased/mmootor-replace-input-csv-with-yaml/TWGSAMIN000001-T003-S03-F/call-sSV-5.0.0/S2_v1.1.5/

This is an edge case where the sample ID in the BAM is different from the internal sample ID. However in production, the sample ID in the BAM would be the same as the internal sample ID.

  • DNA real sample ILHNLNEV000009-T002-L01-F
    • sample: ILHNLNEV000009-N001-B01-F, ILHNLNEV000009-T002-L01-F
    • input YAML: /hot/software/pipeline/pipeline-call-sSV/Nextflow/development/unreleased/mmootor-replace-input-csv-with-yaml/ILHNLNEV000009-T002-L01-F_F32.yaml
    • config: /hot/software/pipeline/pipeline-call-sSV/Nextflow/development/unreleased/mmootor-replace-input-csv-with-yaml/ILHNLNEV000009-T002-L01-F_F32.config
    • output: /hot/software/pipeline/pipeline-call-sSV/Nextflow/development/unreleased/mmootor-replace-input-csv-with-yaml/ILHNLNEV000009-T002-L01-F/

Checklist

  • I have read the code review guidelines and the code review best practice on GitHub check-list.

  • I have reviewed the Nextflow pipeline standards.

  • The name of the branch is meaningful and well formatted following the standards, using [AD_username (or 5 letters of AD if AD is too long)]-[brief_description_of_branch].

  • I have set up or verified the branch protection rule following the [github standards](https://confluence.mednet.ucla.edu/pages/viewpage.action?spaceKey=BOUTROSLAB&title=GitHub+Standards#GitHubStanda
    rds-Branchprotectionrule) before opening this pull request.

  • I have added my name to the contributors listings in the manifest block in the nextflow.config as part of this pull request, am listed
    already, or do not wish to be listed. (This acknowledgement is optional.)

  • I have added the changes included in this pull request to the CHANGELOG.md under the next release version or unreleased, and updated the date.

  • I have updated the version number in the metadata.yaml and manifest block of the nextflow.config file following semver, or the version number has already been updated. (Leave it unchecked if you are unsure about new version number and discuss it with the infrastructure team in this PR.)

  • I have tested the pipeline on at least one A-mini sample with algorithm = ['delly', 'manta']. The paths to the test config files and output directories are attached above.

Copy link

@Alfredo-Enrique Alfredo-Enrique left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lots of good stuff Faizal!

Made some comments/questions!

config/methods.config Show resolved Hide resolved
main.nf Outdated Show resolved Hide resolved
config/schema.yaml Outdated Show resolved Hide resolved
config/schema.yaml Outdated Show resolved Hide resolved
main.nf Outdated Show resolved Hide resolved
Comment on lines +92 to +99
tumor_id_bam_bai = input_ch_samples_with_index
.filter{ it.sample_type == 'tumor' }
.map{ it -> [it.id, it.path, it.index] }
normal_bam_bai = input_ch_samples_with_index
.filter{ it.sample_type == 'normal' }
.map{ it -> [it.path, it.index] }

input_paired_bams_ch = tumor_id_bam_bai.combine(normal_bam_bai)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mmm this main input_paired_bams_ch has 4 items correct? I thought the first input used by the SV caller modules intakes a tuple with 5 arguments? From Delly:

 input:
        tuple(val(tumor_id), path(tumor_bam), path(tumor_bai), path(normal_bam), path(normal_bai))

Did you run into any issues there?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No issues here. The input_paired_bams_ch channel has 5 items, 3 from tumor_id_bam_bai and 2 from normal_bam_bai which are combined in L99.

@Alfredo-Enrique
Copy link

Alfredo-Enrique commented Jul 25, 2023

Oh! Noticed this only parrtially addresses #103 . The output directory automatically created is named after the tumor "SM" tag I believe as opposed to the sample-id. In the case where these two are the same, this issue is hidden, but that's not the case for many datasets. Here for example, the sample ID is TWGSAMIN000001-N003-S03-F but you can see from the tree that the output dir created is named S2_v1.1.5

For context, the provided config file has the output directory path:

// Inputs/parameters of the pipeline
params {
....
    output_dir = "/hot/software/pipeline/pipeline-call-sSV/Nextflow/development/unreleased/mmootor-replace-input-csv-with-yaml/TWGSAMIN000001-T003-S03-F/"

...
    

But the final output directory structure is detached from the provided sample-id in the yaml:

(base) [alfgonzalez@ip-0A125248 TWGSAMIN000001-T003-S03-F]$ tree -L 5
.
└── call-sSV-5.0.0
    └── S2_v1.1.5
        ├── DELLY-1.1.3
        │   └── output
        │       ├── DELLY-1.1.3_TWGSAMIN000001_S2-v1.1.5.bcf
        │       ├── DELLY-1.1.3_TWGSAMIN000001_S2-v1.1.5.bcf.csi
        │       ├── DELLY-1.1.3_TWGSAMIN000001_S2-v1.1.5.bcf.csi.sha512
        │       └── DELLY-1.1.3_TWGSAMIN000001_S2-v1.1.5.bcf.sha512
        ├── log-call-sSV-5.0.0-20230725T072204Z
        │   └── nextflow-log
        │       ├── report.html
        │       ├── timeline.html
        │       └── trace.txt
        ├── log-call-sSV-5.0.0-20230725T072639Z
        │   ├── nextflow-log
        │   │   ├── report.html
        │   │   ├── timeline.html
        │   │   └── trace.txt
        │   └── process-log
        │       ├── call_sSV_Delly
        │       ├── call_sSV_Manta
        │       ├── filter_BCF_BCFtools
        │       ├── filter_sSV_Delly
        │       ├── generate_sha512_BCFtools
        │       ├── generate_sha512_Manta
        │       ├── PipeVal-4.0.0-rc.2
        │       └── query_SampleName_BCFtools
        ├── Manta-1.6.0
        │   ├── output
        │   │   ├── Manta-1.6.0_TWGSAMIN000001_S2-v1.1.5_candidateSmallIndels.vcf.gz
        │   │   ├── Manta-1.6.0_TWGSAMIN000001_S2-v1.1.5_candidateSmallIndels.vcf.gz.sha512
        │   │   ├── Manta-1.6.0_TWGSAMIN000001_S2-v1.1.5_candidateSmallIndels.vcf.gz.tbi
        │   │   ├── Manta-1.6.0_TWGSAMIN000001_S2-v1.1.5_candidateSmallIndels.vcf.gz.tbi.sha512
        │   │   ├── Manta-1.6.0_TWGSAMIN000001_S2-v1.1.5_candidateSV.vcf.gz
        │   │   ├── Manta-1.6.0_TWGSAMIN000001_S2-v1.1.5_candidateSV.vcf.gz.sha512
        │   │   ├── Manta-1.6.0_TWGSAMIN000001_S2-v1.1.5_candidateSV.vcf.gz.tbi
        │   │   ├── Manta-1.6.0_TWGSAMIN000001_S2-v1.1.5_candidateSV.vcf.gz.tbi.sha512
        │   │   ├── Manta-1.6.0_TWGSAMIN000001_S2-v1.1.5_diploidSV.vcf.gz
        │   │   ├── Manta-1.6.0_TWGSAMIN000001_S2-v1.1.5_diploidSV.vcf.gz.sha512
        │   │   ├── Manta-1.6.0_TWGSAMIN000001_S2-v1.1.5_diploidSV.vcf.gz.tbi
        │   │   ├── Manta-1.6.0_TWGSAMIN000001_S2-v1.1.5_diploidSV.vcf.gz.tbi.sha512
        │   │   ├── Manta-1.6.0_TWGSAMIN000001_S2-v1.1.5_somaticSV.vcf.gz
        │   │   ├── Manta-1.6.0_TWGSAMIN000001_S2-v1.1.5_somaticSV.vcf.gz.sha512
        │   │   ├── Manta-1.6.0_TWGSAMIN000001_S2-v1.1.5_somaticSV.vcf.gz.tbi
        │   │   └── Manta-1.6.0_TWGSAMIN000001_S2-v1.1.5_somaticSV.vcf.gz.tbi.sha512
        │   └── QC
        │       ├── Manta-1.6.0_TWGSAMIN000001_S2-v1.1.5_alignmentStatsSummary.txt
        │       ├── Manta-1.6.0_TWGSAMIN000001_S2-v1.1.5_svCandidateGenerationStats.tsv
        │       ├── Manta-1.6.0_TWGSAMIN000001_S2-v1.1.5_svCandidateGenerationStats.xml
        │       └── Manta-1.6.0_TWGSAMIN000001_S2-v1.1.5_svLocusGraphStats.tsv
        └── validation
            └── run_validate_PipeVal
                └── input_validation.txt

Copy link
Contributor

@yashpatel6 yashpatel6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some comments:

config/custom_schema_types.config Outdated Show resolved Hide resolved
config/methods.config Outdated Show resolved Hide resolved
input/call-sSV-input.yaml Show resolved Hide resolved
main.nf Outdated Show resolved Hide resolved
main.nf Show resolved Hide resolved
main.nf Outdated Show resolved Hide resolved
@Faizal-Eeman
Copy link
Contributor Author

Oh! Noticed this only parrtially addresses #103 . The output directory automatically created is named after the tumor "SM" tag I believe as opposed to the sample-id. In the case where these two are the same, this issue is hidden, but that's not the case for many datasets. Here for example, the sample ID is TWGSAMIN000001-N003-S03-F but you can see from the tree that the output dir created is named S2_v1.1.5

This is intended as the SM tag removes any chance of user caused mislabelling of sample_id.
This test run is an edge case where the Internal ID wasn't used during the alignment process, hence the original S2_v1.1.5 in the BAM SM tag. For production runs, the internal sample ID and the SM tag would match.

@Faizal-Eeman
Copy link
Contributor Author

Testing the updated PR on a real sample ILHNLNEV000009-T002-L01-F from one of the registered projects.

/hot/software/pipeline/pipeline-call-sSV/Nextflow/development/unreleased/mmootor-replace-input-csv-with-yaml/yaml_input_F32_ILHNLNEV000009-T002-L01-F.log

@Faizal-Eeman
Copy link
Contributor Author

Faizal-Eeman commented Jul 26, 2023

While the test run for a real sample is still in progress, we can see the sample directory is set up as expected.

/hot/software/pipeline/pipeline-call-sSV/Nextflow/development/unreleased/mmootor-replace-input-csv-with-yaml/ILHNLNEV000009-T002-L01-F\
/call-sSV-5.0.0/ILHNLNEV000009-T002-L01-F/

any other comments here? @tyamaguchi-ucla @yashpatel6 @Alfredo-Enrique

config/methods.config Outdated Show resolved Hide resolved
@Faizal-Eeman
Copy link
Contributor Author

Faizal-Eeman commented Jul 28, 2023

Test runs completed successfully.

/hot/software/pipeline/pipeline-call-sSV/Nextflow/development/unreleased/mmootor-replace-input-csv-with-yaml/TWGSAMIN000001-T003-S03-F/call-sSV-5.0.0/S2_v1.1.5/

/hot/software/pipeline/pipeline-call-sSV/Nextflow/development/unreleased/mmootor-replace-input-csv-with-yaml/ILHNLNEV000009-T002-L01-F/call-sSV-5.0.0/ILHNLNEV000009-T002-L01-F/

(Updated paths) @tyamaguchi-ucla @yashpatel6

@Faizal-Eeman
Copy link
Contributor Author

@tyamaguchi-ucla @yashpatel6 I think this PR is ready to be merged and we can proceed with the next PR to release call-sSV v6.0.0?

Copy link
Contributor

@yashpatel6 yashpatel6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

For the development inputs, /hot/software/pipeline/pipeline-call-sSV/Nextflow/development/input/yaml/call-sSV-input.yaml - I would suggest naming the file to indicate the sample rather than the generic name currently given; we'll likely have more YAMLs there in the future and it's clearer to name base on sample.

Anything else to add @tyamaguchi-ucla ?

config/methods.config Outdated Show resolved Hide resolved
Copy link
Contributor

@tyamaguchi-ucla tyamaguchi-ucla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

@Faizal-Eeman Faizal-Eeman merged commit 9917c17 into main Aug 1, 2023
1 check passed
@Faizal-Eeman Faizal-Eeman deleted the replace-input-csv-with-yaml branch August 1, 2023 20:34
@Faizal-Eeman Faizal-Eeman mentioned this pull request Aug 1, 2023
8 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants