Improve tumor vs normal sample recognition in a vcf file #47

Open
ruslan-forostianov opened this issue Sep 20, 2022 · 4 comments


@ruslan-forostianov
Contributor

ruslan-forostianov commented Sep 20, 2022

Currently, when standardize_mutation_data.py reads a VCF file with two sample columns, it interprets the first as the tumor sample and the second as the normal sample.

See https://github.com/genome-nexus/annotation-tools/blob/master/standardize_mutation_data.py#L854

We work with VCF files that don't have a fixed order of sample columns.

However, our vcf header contains metadata like the following:

##normal_sample=sample_45345
##tumor_sample=sample_867657

Although this metadata does not seem to be part of the VCF specification (https://samtools.github.io/hts-specs/VCFv4.1.pdf), it does seem to be used in practice.

My proposal is to make the get_vcf_sample_and_normal_ids(filename) function look for normal_sample and tumor_sample in the header first. It would fall back to the existing logic if no such metadata is found.
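To make the proposal concrete, here is a rough sketch of the lookup, not the actual implementation. The function name find_tumor_normal_ids is hypothetical, and I'm assuming the header values match the sample column names on the #CHROM line exactly. The fallback only covers the two-column case described above; anything else returns (None, None) in this sketch.

```python
import io


def find_tumor_normal_ids(vcf_lines):
    """Return (tumor_id, normal_id) for an iterable of VCF lines.

    Hypothetical helper for illustration; it is not part of
    standardize_mutation_data.py.
    """
    tumor_id = None
    normal_id = None
    sample_columns = []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if line.startswith("##tumor_sample="):
            tumor_id = line.split("=", 1)[1].strip()
        elif line.startswith("##normal_sample="):
            normal_id = line.split("=", 1)[1].strip()
        elif line.startswith("#CHROM"):
            # Sample columns follow the nine fixed columns (CHROM..FORMAT).
            sample_columns = line.split("\t")[9:]
            break

    # Prefer the header metadata, but only if both names are real columns.
    if tumor_id in sample_columns and normal_id in sample_columns:
        return tumor_id, normal_id

    # Fallback: positional order, first column tumor, second column normal.
    if len(sample_columns) == 2:
        return sample_columns[0], sample_columns[1]

    # Other layouts are outside the scope of this sketch.
    return None, None


example = io.StringIO(
    "##fileformat=VCFv4.1\n"
    "##normal_sample=sample_45345\n"
    "##tumor_sample=sample_867657\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT"
    "\tsample_45345\tsample_867657\n"
)
print(find_tumor_normal_ids(example))
```

In this example the normal sample is the first column, so the positional logic alone would swap the two samples, while the header lookup assigns them correctly.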

@ruslan-forostianov
Contributor Author

@ao508 @inodb @sheridancbio I would love to hear your thoughts before I start implementing the solution. Would the change be beneficial upstream?

@inodb
Member

inodb commented Sep 21, 2022

@ruslan-forostianov hmm i haven’t seen that before, but seems like it could be a nice enhancement as an option to the cli e.g. --tumor-sample-vcf-name tumor_sample and --normal-sample-vcf-name normal_sample. Something along those lines?

@ruslan-forostianov
Contributor Author

@inodb I like the idea 👍

@ruslan-forostianov
Contributor Author

I’ve realised that the script takes an option specifying an input directory of VCFs, not a single VCF. Passing tumor/normal sample names as options does not make sense at this level.

standardize_mutation_data.py --input-directory [path/to/input/directory] --output-directory [path/to/output/directory] --center [default name for center] --sequence-source [WGS | WXS] --extensions [comma-separated list of extensions]

After some additional research and discussion with @inodb, I decided to proceed with the original idea of using normal_sample and tumor_sample in the vcf meta header.

These variables are not part of the VCF specification but are widely used by GATK. See https://github.com/broadinstitute/gatk/search?q=%23%23normal_sample%3D

Here is the PR with an implementation:
#48
