Improve tumor vs normal sample recognition in a vcf file #47

ruslan-forostianov · 2022-09-20T11:39:05Z

Currently, when standardize_mutation_data.py reads a VCF file with two sample columns, it interprets the first as the tumor sample and the second as the normal sample.

See https://github.com/genome-nexus/annotation-tools/blob/master/standardize_mutation_data.py#L854

We work with VCF files that don't have a fixed order of sample columns.

However, our vcf header contains metadata like the following:

##normal_sample=sample_45345
##tumor_sample=sample_867657

Although this metadata does not seem to be part of the VCF specification (https://samtools.github.io/hts-specs/VCFv4.1.pdf), it does appear to be used in practice.

My proposal is to make the get_vcf_sample_and_normal_ids(filename) function look for normal_sample and tumor_sample in the header first. It would fall back to the existing logic if no such metadata is found.
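To make the proposal concrete, here is a minimal sketch of the header-first lookup with a positional fallback. It is not the code from the repository: the function name find_tumor_normal_ids and its return shape (a tumor ID, normal ID pair) are assumptions made for illustration, and the real get_vcf_sample_and_normal_ids may behave differently. It also assumes an uncompressed, tab-delimited VCF.

```python
def find_tumor_normal_ids(filename):
    """Return (tumor_id, normal_id) for a VCF file.

    Hypothetical helper for illustration only. Prefers the
    ##tumor_sample= / ##normal_sample= header lines and falls back
    to column order (first sample = tumor, second = normal).
    """
    tumor_id = None
    normal_id = None
    sample_columns = []

    with open(filename) as vcf:
        for line in vcf:
            line = line.rstrip("\r\n")
            if line.startswith("##tumor_sample="):
                tumor_id = line.split("=", 1)[1]
            elif line.startswith("##normal_sample="):
                normal_id = line.split("=", 1)[1]
            elif line.startswith("#CHROM"):
                # Sample columns follow the 9 fixed VCF columns
                # (CHROM .. FORMAT); no metadata lines come after this.
                sample_columns = line.split("\t")[9:]
                break

    # Trust the metadata only if both names match real sample columns.
    if tumor_id in sample_columns and normal_id in sample_columns:
        return tumor_id, normal_id

    # Fallback: positional interpretation, as described above.
    tumor = sample_columns[0] if len(sample_columns) > 0 else None
    normal = sample_columns[1] if len(sample_columns) > 1 else None
    return tumor, normal
```

Checking that both header values actually name sample columns before trusting them is a design choice in this sketch, so a stale or mistyped header degrades to the existing positional behaviour instead of returning IDs that are not in the file.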


ruslan-forostianov · 2022-09-20T11:43:11Z

@ao508 @inodb @sheridancbio I would love to hear your thoughts before I start implementing the solution. Would the change be beneficial upstream?

inodb · 2022-09-21T15:56:18Z

@ruslan-forostianov hmm i haven’t seen that before, but seems like it could be a nice enhancement as an option to the cli e.g. --tumor-sample-vcf-name tumor_sample and --normal-sample-vcf-name normal_sample. Something along those lines?

ruslan-forostianov · 2022-09-22T17:20:17Z

@inodb I like the idea 👍

ruslan-forostianov · 2022-09-30T12:35:49Z

I’ve realised that the script takes an option specifying an input directory of VCFs, not a single VCF. Passing tumor/normal sample names as options does not make sense at this level.

standardize_mutation_data.py --input-directory [path/to/input/directory] --output-directory [path/to/output/directory] --center [default name for center] --sequence-source [WGS | WXS] --extensions [comma-separated list of extensions]

After some additional research and discussion with @inodb, I decided to proceed with the original idea of using normal_sample and tumor_sample in the vcf meta header.

These variables are not part of the VCF specification but are widely used by GATK. See https://github.com/broadinstitute/gatk/search?q=%23%23normal_sample%3D

Here is the PR with an implementation:
#48

ruslan-forostianov mentioned this issue Sep 30, 2022

Improve tumor / nomal sample column detection in vcf files #48

Merged
