spec for short tandem repeat variants #619
Comments
I've also been thinking about this for the purposes of deciding on an output VCF format for STRling (https://github.com/quinlan-lab/STRling).
I think there would be little appetite for adding a new, different style of value for the ALT field, as that is something that would need to be implemented in all VCF parser implementations. On the other hand, a proposal for common INFO tags to represent the details of STRs would likely be welcomed; e.g., lobSTR's RU INFO tag appears to be a fairly appropriate candidate for blessing in the VCF spec. See the example in VCFv4.3 §5.3 for the flavour of how tandem duplication variants are represented in REF/ALT currently. See also (parts of) PR #465 for the ongoing work of improving this; you may be interested in contributing to the discussion there (and in associated issues & PRs).
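As a rough illustration of the INFO-tag approach, here is a minimal sketch that pulls a repeat-unit tag out of a VCF INFO string. The helper name `parse_info` and the sample values (`END=150;RU=CAG`) are made up for illustration; only the general `key=value;key=value` layout of INFO and the existence of lobSTR's RU tag come from the discussion above.

```python
def parse_info(info):
    """Split a VCF INFO column into a dict (hypothetical helper).

    Flag-style entries without '=' are stored as True.
    """
    fields = {}
    if info in ("", "."):  # "." is the VCF missing-value marker
        return fields
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


# Illustrative INFO string; the values are invented for this example.
info = parse_info("END=150;RU=CAG")
repeat_unit = info.get("RU")  # None if the caller did not emit RU
```

A tool consuming output from several callers could look up whichever tag each caller uses and fall back gracefully when it is absent.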
Hi @jmarshall, thank you for the reply!
This seems like a great topic for GA4GH VRS/VCF alignment. While VRS currently handles this with Repeated Sequence Expressions, there is an open issue about how to represent compound repeated sequences, and it would be great to align any VRS and VCF solutions to this challenge if possible. h/t to @rhdolin for connecting the dots.
Thanks everyone for the interesting discussion on the call just now. I for one think I have a better understanding of the issues than before! It might be useful if @Talya-dor (and anyone else who'd like to) could add here some examples of the sorts of things they want to represent, to remind us of the examples discussed on the call.
For the non-expert such as myself, the Kutner document linked from ga4gh/vrs#363 (as mentioned in #619 (comment)) is a very informative primer. For current VCF, IMHO it would be appropriate to represent STRs in info fields, whether that would be split out as per e.g. lobSTR's collection of fields, or (primarily) by a single string field containing the familiar For future VCF revisions, one idea discussed is that it would be nice if the STR repeat count notation could be used in REF and ALT fields alongside the existing breakend and other notation (assuming the brackets in such notation and those in breakends could be unambiguously parsed):
Allowing it in REF (as well as ALT) allows the reference allele to be shown naturally, though there is some interplay with normalisation to be considered.
I guess it would also be interesting, and more comprehensive, to consider complex STR patterns, where one STR site is composed of multiple repeat expansions. For instance:
We have opted for the
Implemented in #676
Hello,
I am trying to write several tools as part of a program for interpretation of STR variants. Currently there are several widely used variant callers, each with a different VCF format (ExpansionHunter by Illumina, lobSTR, etc.).
It would be great to have a unifying spec that can be used as a baseline or guideline.
Is it possible to add this to the next VCF spec?
The convention I would suggest:
chr - chromosome ('chr3')
pos - position of nucleotide before repeat begins (123)
ID - any string or int
ref - nucleotide in POS position (A)
alt - <STRn> where n is the number of repeats (<STR20>)
QUAL - any string or int
FILTER - any string or int
INFO
FORMAT
SAMPLE
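To make the suggested convention concrete, here is a minimal sketch that writes and reads one record in that style. It only illustrates the proposal above and is not part of any VCF spec; the function names are hypothetical, and using "." for the unspecified ID, QUAL, FILTER and INFO columns is my assumption.

```python
import re

# Matches the proposed symbolic allele, e.g. "<STR20>".
STR_ALT = re.compile(r"^<STR(\d+)>$")


def format_str_record(chrom, pos, ref, repeat_count,
                      record_id=".", qual=".", flt=".", info="."):
    """Build the first eight tab-separated columns of a record
    following the proposed convention (hypothetical helper)."""
    alt = f"<STR{repeat_count}>"
    return "\t".join([chrom, str(pos), record_id, ref, alt, qual, flt, info])


def parse_str_alt(alt):
    """Return the repeat count from an ALT such as "<STR20>",
    or None if ALT does not follow the proposed form."""
    match = STR_ALT.match(alt)
    return int(match.group(1)) if match else None


# The example values from the list above: chr3, position 123, REF A, 20 repeats.
line = format_str_record("chr3", 123, "A", 20)
count = parse_str_alt(line.split("\t")[4])
```

Note that this form records only the repeat count; the repeat unit itself would still need to be carried elsewhere, for example in INFO.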