Spec for short tandem repeat variants #619

Closed
Talya-dor opened this issue Jan 3, 2022 · 9 comments

@Talya-dor

Talya-dor commented Jan 3, 2022

Hello,
I am trying to write several tools as part of a program for interpreting STR variants. Currently, several widely used variant callers (ExpansionHunter by Illumina, lobSTR, etc.) each produce a different VCF format.
It would be great to have a unifying spec that can be used as a baseline/guideline.
Is it possible to add this to the next VCF spec?

The convention I would suggest:
CHROM - chromosome ('chr3')
POS - position of the nucleotide before the repeat begins (123)
ID - any string or int
REF - nucleotide at position POS (A)
ALT - <STRn>, where n is the number of repeats (<STR20>)
QUAL - any string or int
FILTER - any string or int
INFO
FORMAT
SAMPLE
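
A minimal sketch of what a record under this suggested convention could look like, written in Python. The function name `format_str_record` and the `.` placeholders for unset columns are assumptions made for illustration; the `<STRn>` ALT form is the proposal from this comment, not part of any released VCF spec, and the FORMAT and sample columns are left out.

```python
def format_str_record(chrom: str, pos: int, ref_base: str, repeat_count: int,
                      record_id: str = ".", qual: str = ".",
                      filt: str = ".", info: str = ".") -> str:
    """Build one tab-separated data line using the proposed <STRn> ALT.

    Hypothetical helper: it mirrors the column order of a VCF data line
    (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) and nothing more.
    """
    if len(ref_base) != 1:
        raise ValueError("REF is the single base at POS in this convention")
    if repeat_count < 0:
        raise ValueError("repeat count must be non-negative")
    alt = f"<STR{repeat_count}>"  # proposed symbolic allele, e.g. <STR20>
    return "\t".join(
        [chrom, str(pos), record_id, ref_base, alt, qual, filt, info]
    )


print(format_str_record("chr3", 123, "A", 20))
```

With the example values from the list above, this prints a line whose ALT column is `<STR20>`.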

@hdashnow

hdashnow commented Jan 3, 2022

I've also been thinking about this for the purposes of deciding on an output VCF format for STRling (https://github.com/quinlan-lab/STRling).
Adding to this, the spec would need to record the repeat unit(s) and, ideally, provide a consistent way of representing uncertainty or a range in the number of repeat units.

@jmarshall jmarshall added the vcf label Jan 4, 2022
@jmarshall
Member

jmarshall commented Jan 4, 2022

I think there would be little appetite for adding a new different style of value for the ALT field, as that is something that would need to be implemented in all VCF parser implementations. On the other hand, a proposal for common INFO tags to represent the details of STRs would likely be welcomed — e.g., lobSTR's RU INFO tag appears to be a fairly appropriate candidate for blessing in the VCF spec.

See the example in VCFv4.3 §5.3 for the flavour of how tandem duplication variants are represented in REF/ALT currently. See also (parts of) PR #465 for the ongoing work of improving this — you may be interested in contributing to the discussion there (and in associated issues & PRs).

@Talya-dor
Author

Talya-dor commented Jan 9, 2022

Hi @jmarshall, thank you for the reply!
I have started reading the PR and related discussions you suggested. I agree that many of the SV suggestions can work well for STR variants, but I wonder whether the bioinformatics community widely considers STR variants distinct from SV <DUP> variants, in which case the spec should differentiate them in some way.
Also, the way tandem duplication variants are currently represented is problematic in my opinion, mainly because of the long ALT. A symbolic ALT similar to those used for SV variants (i.e. <STR>) would be more appropriate.

@ahwagner

ahwagner commented Jan 10, 2022

This seems like a great topic for GA4GH VRS/VCF alignment. While VRS currently handles this with Repeated Sequence Expressions, there is an open issue about how to represent compound repeated sequences and it would be great to align any VRS and VCF solutions to this challenge if possible.

h/t to @rhdolin for connecting the dots.

@jmarshall
Member

Thanks everyone for the interesting discussion on the call just now — I for one think I have a better understanding of the issues than before! It might be useful if @Talya-dor (and anyone else who'd like to) could add here some examples of the sorts of things they want to represent, to remind us of the examples discussed on the call.

@jmarshall
Member

For the non-expert such as myself, the Kutner document linked from ga4gh/vrs#363 (as mentioned in #619 (comment)) is a very informative primer.

For current VCF, IMHO it would be appropriate to represent STRs in info fields, whether that would be split out as per e.g. lobSTR's collection of fields, or (primarily) by a single string field containing the familiar CTG[30]CAG[50] bracketed repeat count notation (which would need to be parsed by applications, unlike separate info tags that would be parsed primarily by the VCF library).
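
As a rough illustration of the application-side parsing that a single bracketed string field would require, here is a small Python sketch. The names `parse_repeat_notation` and `expand` are hypothetical, and the grammar (one or more `UNIT[count]` tokens over the alphabet A, C, G, T, N) is an assumption based only on the `CTG[30]CAG[50]` example; no INFO tag name is implied.

```python
import re

# One token: a run of bases followed by a bracketed integer count.
_TOKEN = re.compile(r"([ACGTN]+)\[(\d+)\]")


def parse_repeat_notation(text: str) -> list[tuple[str, int]]:
    """Parse e.g. 'CTG[30]CAG[50]' into [('CTG', 30), ('CAG', 50)]."""
    units = []
    pos = 0
    for match in _TOKEN.finditer(text):
        if match.start() != pos:  # unparsed characters between tokens
            raise ValueError(f"unexpected text at offset {pos}: {text!r}")
        units.append((match.group(1), int(match.group(2))))
        pos = match.end()
    if pos != len(text) or not units:
        raise ValueError(f"not a bracketed repeat expression: {text!r}")
    return units


def expand(units: list[tuple[str, int]]) -> str:
    """Spell out the full sequence described by the parsed units."""
    return "".join(unit * count for unit, count in units)


print(parse_repeat_notation("CTG[30]CAG[50]"))  # [('CTG', 30), ('CAG', 50)]
print(len(expand(parse_repeat_notation("CTG[30]CAG[50]"))))  # 240
```

Separate INFO tags would instead arrive already split by the VCF library, which is the trade-off described above.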

For future VCF revisions, one idea discussed is that it would be nice if the STR repeat count notation could be used in REF and ALT fields alongside the existing breakend and other notation (assuming the brackets in such notation and those in breakends could be unambiguously parsed):

…  REF      ALT              …
…  CAG[25]  CAG[34],CAG[38]  …

Allowing it in REF (as well as ALT) allows the reference allele to be shown naturally, though there is some interplay with normalisation to be considered.
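
To make the benefit of having the count in REF concrete, here is a hedged Python sketch that reads REF and ALT written in this notation and reports each ALT allele's change in repeat count relative to the reference. The function names are hypothetical, each allele is assumed to be a single `UNIT[count]` expression, and normalisation questions (for example, rotated repeat units) are deliberately ignored.

```python
import re

# A single allele of the form UNIT[count], e.g. CAG[25].
_ALLELE = re.compile(r"([ACGTN]+)\[(\d+)\]")


def parse_allele(allele: str) -> tuple[str, int]:
    """Split 'CAG[25]' into ('CAG', 25)."""
    match = _ALLELE.fullmatch(allele)
    if match is None:
        raise ValueError(f"not a UNIT[count] allele: {allele!r}")
    return match.group(1), int(match.group(2))


def repeat_count_changes(ref: str, alt: str) -> list[int]:
    """Return, per comma-separated ALT allele, its count minus the REF count."""
    ref_unit, ref_count = parse_allele(ref)
    changes = []
    for allele in alt.split(","):
        unit, count = parse_allele(allele)
        if unit != ref_unit:
            # Assumption of this sketch: REF and ALT share one repeat unit.
            raise ValueError(f"unit {unit!r} differs from REF unit {ref_unit!r}")
        changes.append(count - ref_count)
    return changes


print(repeat_count_changes("CAG[25]", "CAG[34],CAG[38]"))  # [9, 13]
```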

@VJalili

VJalili commented Feb 1, 2022

For completeness, it would also be worth considering complex STR patterns, where one STR site is composed of multiple repeat expansions, for instance (CAG)[*](CAACAG)(CCG)[*]. The ExpansionHunter paper provides more details and examples: https://academic.oup.com/bioinformatics/article/35/22/4754/5499079
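
A small Python sketch of how such a compound pattern might be handled: it parses the structure exactly as written in this comment, where `(UNIT)[*]` is read as a variable-length repeat and a bare `(UNIT)` as a fixed interruption, and then counts each part in a given sequence. This reading of the notation and the function names are assumptions for illustration; this is not how ExpansionHunter itself is implemented.

```python
import re

# One part: a parenthesised unit, optionally followed by [*].
_PART = re.compile(r"\(([ACGTN]+)\)(\[\*\])?")


def parse_structure(structure: str) -> list[tuple[str, bool]]:
    """Parse '(CAG)[*](CAACAG)' into [('CAG', True), ('CAACAG', False)]."""
    parts = []
    pos = 0
    for match in _PART.finditer(structure):
        if match.start() != pos:
            raise ValueError(f"unexpected text at offset {pos}: {structure!r}")
        parts.append((match.group(1), match.group(2) is not None))
        pos = match.end()
    if pos != len(structure) or not parts:
        raise ValueError(f"not a locus structure: {structure!r}")
    return parts


def count_repeats(structure: str, sequence: str):
    """Return the copy number of each part, or None if the sequence does not fit."""
    parts = parse_structure(structure)
    # Units contain only A, C, G, T, N, so they are safe to embed in a regex.
    pattern = "".join(
        f"((?:{unit})*)" if variable else f"({unit})"
        for unit, variable in parts
    )
    match = re.fullmatch(pattern, sequence)
    if match is None:
        return None
    return [len(match.group(i + 1)) // len(unit)
            for i, (unit, _) in enumerate(parts)]


seq = "CAG" * 3 + "CAACAG" + "CCG" * 2
print(count_repeats("(CAG)[*](CAACAG)(CCG)[*]", seq))  # [3, 1, 2]
```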

@ahwagner

ahwagner commented Feb 1, 2022

We have opted for the ComposedSequenceExpression concept in VRS, which is currently in a community PR review stage: ga4gh/vrs#376. I am in favor of aligning Alleles using ComposedSequenceExpression to follow a similar convention to the @jmarshall REF ALT proposal.

@d-cameron
Contributor

Implemented in #676
