Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

VCFv4.3 - first batch of changes #88

Merged
merged 35 commits into from
Oct 10, 2015
Merged

VCFv4.3 - first batch of changes #88

merged 35 commits into from
Oct 10, 2015

Conversation

pd3
Copy link
Member

@pd3 pd3 commented Jun 2, 2015

Please see also the list of open issues here:
vcf

@droazen
Copy link

droazen commented Jun 2, 2015

Please note that this pull request was opened at my request to provide a centralized location for discussion of the existing and proposed changes in VCF 4.3. It should NOT be merged yet, as there are still a few open issues under consideration.

@droazen
Copy link

droazen commented Jun 2, 2015

Here is a link to a PDF version of the current VCF 4.3 draft (built from this branch), to make it easier to see what's changed (changes are highlighted in red):

https://gist.github.com/droazen/71d394379607674764f8/raw/d9e43db33fa2dbbf4f0db060aa57bf379526dc1a/VCFv4.3.draft.pdf

@pgrosu
Copy link

pgrosu commented Jun 2, 2015

Thanks @droazen - much easier to read :)

@droazen
Copy link

droazen commented Jun 2, 2015

Comments on changes in the existing draft should go directly in this thread. Comments on open issues ( vcf ) not yet incorporated into the draft should go on the page for each respective issue.

@vruano
Copy link

vruano commented Jun 2, 2015

Why GP is from 0 to 1? Does not seem practical when you are dealing with very small probs. and we already have a mixture of log10 and Phred scale annotations.

@vruano
Copy link

vruano commented Jun 2, 2015

In 2.1 VCF Tag naming conventions...

I think is a bit too absolutist to say that The "X" suffix means "Blah". I think that the standard should be flexible and allow users to, for example, have some annotations that finish in "L", that are not likelihoods.

I would change the three points to: "If you mean 'Blah' the annotation name should (or must?) finish with 'X'.

So from "The 'L' suffix means 'Likelihoods'" -> "The name of likelihood containing annotations should finish with the 'L' suffix"

@vruano
Copy link

vruano commented Jun 2, 2015

As with GP, why CNP is in 0 to 1 scale?

@bhandsaker
Copy link

Section 5.4.11 needs to be updated so that the #PEDIGREE lines have IDs as required by the changes to section 1.2.10.

@bhandsaker
Copy link

Section 1.4.1 (subsection 5 - ALT) seems inconsistent with new section 5.5 as to whether the "unspecified" allele is represented as "" or as "<>".

I would also suggest adding to section 1.4.1, subsection 5, that "The '*' allele is reserved to indicate that the allele is missing due to an overlapping deletion or some other variant incompatible with the listed alleles."

The overlapping deletion does not necessarily need to be upstream, although this is the easiest case to visualize. Consider two offset deletions of several Kb each. Both the first and second deletion records may use '' to represent a missing allele, since the situation is symmetrical, yet for the first record the conflicting allele is downstream (i.e. appears later in the file). It is also possible that the conflicting variants are not just deletions. For example, two overlapping inversions might also use the '' pseudo-allele to indicate that an allele is unspecified, missing, or described in a different VCF record.

@pd3
Copy link
Member Author

pd3 commented Jun 3, 2015

@vruano @atks I updated the draft to d6b46df, it now includes the points discussed in #83.

@droazen There are quite a few parallel topics already in this thread. Wouldn't it be easier to navigate if each was open as a separate issues?

@pd3
Copy link
Member Author

pd3 commented Jun 3, 2015

@bhandsaker The * was added already in 4.2 and its purpose seems to be different from <*>. Is * being used at all?

@bhandsaker Good point about the ID and PEDIGREE. It would be good to hear from someone who is actually using it.

@reedacartwright
Copy link

I'm using the PEDIGREE tags and find that IDs are important as they allow me to specify where in a pedigree an mutation occurred:

DNT=AAxAA>AC;DNL=LB-M3;DNQ=40

These INFO tags describe that the predict de novo type is AAxAA>AC, which occurred in one of the branches leading to LB-M3

The only issue with the PEDIGREE tags is that I cannot store additional information. According to the specs, everything other than ID describes an ancestor. In order to support additional pedigree information it would be useful to have another tag, e.g:

##PEDIGREE=<ID=LB-N3,Father=GL-N1,Mother=GL-N2>
##PEDIGREEDATA=<NID=LB-N3,Type=float,Description="Mutation Rates",Father=1e-8,Mother=1e-9>

Where NID is used in PEDIGREEDATA so there can be multiple lines that contain data for the same node in the pedigree.

@pd3
Copy link
Member Author

pd3 commented Sep 1, 2015

Could the existing SAMPLE field be used for this perhaps?

@reedacartwright
Copy link

I'm not sure if SAMPLE is the proper field to use. My pedigree involves a mix of both germline meiotic events as well as somatic mitotic events. Not every node in my pedigree represents a sample. Some represents libraries (since we support technical replicates of the same sample). Others represent inferred ancestral states that occurred during family history or somatic development.

For me conceptually, the SAMPLE field provides information about the biological samples that were sequenced. Whereas, the data I need to store is about the analysis that was done. E.g. we had 2 libraries for this sample and we assumed that the error rate due to sequencing was 1e-8, etc.

I've already extended the "sample" columns in the vcf to output information about my nodes, including inferred ancestral states and library-specific depths. I have no issues reusing the SAMPLE field as well, but would like to propose some changes to the wording of the VCF format to support a broader concept of what a sample is. If that seems reasonable, I will work on a proposal.

@pd3
Copy link
Member Author

pd3 commented Sep 4, 2015

I see what you mean now, you are right that SAMPLE is not right for the job here. However, the specification does not explicitly say that everything other than ID refers to the ancestor, it only lists some examples. How about using the META field also for PEDIGREE and have something like this:

##META=<ID=MR,Type=Float,Number=.,Description="Mutation Rates">
##PEDIGREE=<ID=LB-N3,Father=GL-N1,Mother=GL-N2,MR=1e-8,1e-9>

The PEDIGREEDATA field is a valid solution as well, only the NID attribute would have to become ID.

@reedacartwright
Copy link

I think the ###META field makes the most sense, since it allows me to keep the meta data with the pedigree information. Is its usage defined in the spec?

However, I would be concerned about the ordering of rates in your example in case some conversion decided to swap the order of Father and Mother without changing the MR pattern. Is there a guarantee in the spec that that the attributes can be retrieved in the order written and that order will be maintained if converted between formats?

##PEDIGREEDATA might be a safer option if I can't rely on the order of attributes mattering in a VCF. The reason why I decided to use NID instead of ID because ID's are supposed to be unique withing a field type. For example, the following would not be allowed:

##PEDIGREE=<ID=LB-N3,Father=GL-N1,Mother=GL-N2>
##PEDIGREEDATA=<ID=LB-N3,Type=float,Description="Mutation Rates",Father=1e-8,Mother=1e-9>
##PEDIGREEDATA=<ID=LB-N3,Type=float,Description="Indel Rates",Father=1e-9,Mother=1e-9>

@pd3
Copy link
Member Author

pd3 commented Sep 9, 2015

You are making a good point about the order of attributes, the specification is silent about this. How about splitting the field into FatherMR and MotherMR? I am afraid it was agreed that ID is mandatory, so the original proposal with NID cannot be accepted as is.

The implicit filter PASS was described inconsistently throughout BCFv2.1.
It is encoded as the first entry in the dictionary, not the last.
@reedacartwright
Copy link

Yes, we could use separate ##META lines for mother and father mutation rate.

We could also use ID to uniquely define a ##PEDIGREEDATA line and use PedigreeID to connect the ##PEDIGREEDATA line to a line in the pedigree. It is a little verbose, but perhaps less verbose than defining separate tags for each mother, father, and original value.

##PEDIGREE=<ID=LB-N3,Father=GL-N1,Mother=GL-N2>
##PEDIGREEDATA=<ID=MR-LB-N3,PedigreeID=LB-N3,Type=float,Description="Mutation Rates",Father=1e-8,Mother=1e-9>
##PEDIGREEDATA=<ID=IR-LB-N3,PedigreeID=LB-N3,Type=float,Description="Indel Rates",Father=1e-9,Mother=1e-9>

pd3 added a commit that referenced this pull request Sep 9, 2015
@pd3
Copy link
Member Author

pd3 commented Sep 9, 2015

Yeah, both are possible. The degree of verbosity depends on the number of samples in the study. If there are many samples, the FatherMR and MotherMR are defined only once with the META solution. Also no new tag is required

##META=<ID=FatherMR,Type=Float,Number=1,Description="Paternal Mutation Rate">
##META=<ID=MotherMR,Type=Float,Number=1,Description="Maternal Mutation Rate">
##PEDIGREE=<ID=LB-N1,Father=GL-N1,Mother=GL-N2,FatherMR=1e-8,MotherMR=1e-9>    
##PEDIGREE=<ID=LB-N2,Father=GL-N3,Mother=GL-N4,FatherMR=1e-8,MotherMR=1e-8>
##PEDIGREE=<ID=LB-N3,Father=GL-N5,Mother=GL-N6,FatherMR=1e-9,MotherMR=1e-8>

@reedacartwright
Copy link

I will give it a go using your META solution.

@pd3
Copy link
Member Author

pd3 commented Sep 24, 2015

OK, it seems we are ready to merge the changes. If there are no further objections, I'll do it tomorrow.

@tk2
Copy link

tk2 commented Sep 28, 2015

@droazen as the GATK rep - are you in agreement with merging?

@jmarshall
Copy link
Member

In §1.3 (Data types),

Data types supported by VCF are: Integer (32-bit, signed) […] For the Integer type, the values from −231 to −231 + 7 cannot be stored in the binary version, see 6.3.3.

It's not too good to have things that are representable in one format but not the other. So we should say that -231 to −231 + 7 are disallowed in both VCF and BCF.


In §1.6.1 (Fixed fields, §1.4.1 in previous spec versions),

There are 8 fixed fields per record. All data lines are tab-delimited with no tab character at the end of the line. The last data line should end with a line separator. In all cases, missing values are specified with a dot (‘.’).

The first sentence is about fixed fields. The next two sentences (re tabs and lines) are about the whole line, so should be lifted to §1.6 (Data lines).

The final sentence (re missing value dots (‘.’)) is hopefully intended to apply to all fields. However at present being in this section it only applies to the 8 fixed fields. It should also be lifted to §1.6, or a similar sentence needs to be added to §1.6.2 (Genotype fields).

@jmarshall
Copy link
Member

Re missing value dots, see the conversation that motivated this: samtools/htsjdk#340.

pd3 added a commit that referenced this pull request Oct 10, 2015
VCFv4.3 - first batch of changes
@pd3 pd3 merged commit bd89f29 into master Oct 10, 2015
@yfarjoun yfarjoun deleted the VCFv4.3 branch July 22, 2018 01:29
jmarshall added a commit to jmarshall/hts-specs that referenced this pull request Jan 5, 2022
Introduce the term "unstructured meta-information line", and reword this
section so it describes the two flavours of meta-information line clearly.
Specify that an unstructured value must not start with `<` (so that
structured/unstructured are easily distinguished) and must be non-empty.

Remove `<>` from unstructured `##pedigreeDB` example. PR samtools#88 removed
the `<>` from one instance of `##pedigreeDB=<url>` presumably on the
grounds that they were merely metasyntactic variable notation and not
intended to appear literally, but missed this instance.
jmarshall added a commit to jmarshall/hts-specs that referenced this pull request Jan 6, 2022
PR samtools#88 removed the `<>` from `##pedigreeDB=<url>` in VCFv4.3.tex,
presumably on the grounds that they were merely metasyntactic variable
notation and not intended to appear literally. As some readers still
refer to these older documents, remove the misleading notation here too.
d-cameron pushed a commit that referenced this pull request Aug 22, 2022
…nes (#620)

* Clarify structured vs unstructured meta-information lines

Introduce the term "unstructured meta-information line", and reword this
section so it describes the two flavours of meta-information line clearly.
Specify that an unstructured value must not start with `<` (so that
structured/unstructured are easily distinguished) and must be non-empty.

Remove `<>` from unstructured `##pedigreeDB` example. PR #88 removed
the `<>` from one instance of `##pedigreeDB=<url>` presumably on the
grounds that they were merely metasyntactic variable notation and not
intended to appear literally, but missed this instance.

* Remove `<>` from VCF test file `##pedigreeDB` lines

The specification now consistently reflects that `##pedigreeDB`'s value
should not be delimited by angle brackets (despite being an URL!).

Adjust the failed_meta_pedigreedb_002.vcf files as the claimed cause
of failure is the invalid URL hostname rather than the angle brackets.

* Fix misleading `##pedigreeDB=<url>` notation in older VCF specifications

PR #88 removed the `<>` from `##pedigreeDB=<url>` in VCFv4.3.tex,
presumably on the grounds that they were merely metasyntactic variable
notation and not intended to appear literally. As some readers still
refer to these older documents, remove the misleading notation here too.

* Mention structured lines that are not defined by the VCF specification
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.