Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify description of structured vs unstructured meta-information lines #620

Merged
merged 4 commits into from
Aug 22, 2022

Conversation

jmarshall
Copy link
Member

@jmarshall jmarshall commented Jan 5, 2022

PR #88 introduced the term “structured meta-information line”, but was unclear about the relationship between those and other meta-information lines, and §1.4 remains unclear overall — see e.g. #608.

This PR introduces the term “unstructured meta-information line” (an alternative might be “simple meta-information line”), and rewords this section so it describes the two flavours of meta-information line clearly.

This PR does not attempt to tighten up the syntax around what should be quoted etc as advocated for in #608. OTOH it does newly specify that:

  • unstructured values must be non-empty (as is claimed as a rule by some of the test files)
  • unstructured values must not start with < (so that structured/unstructured are easily distinguished; this is already de facto enforced by the HTSlib implementation)

Hence it also removes <> from §5.4.11's unstructured ##pedigreeDB example. As discussed in #608 (comment), PR #88 removed the <> from §1.4.9's instance of ##pedigreeDB=<url> presumably on the grounds that they were merely metasyntactic variable notation and not intended to appear literally, but missed this instance. Adjust the test files accordingly.

The test/vcf/4.[123]/passed/complexfile_passed_000.vcf test files still contain another unstructured meta-information line with angle bracket delimiters:

##AnalysisTitleQuotes="FINRISK: Whole-exome sequencing of Dietary, life style, and genetic determinants of obesity and metabolic syndrome (DILGOM)"
##AnalysisTitleBrackets=<"FINRISK: Whole-exome sequencing of Dietary, life style, and genetic determinants of obesity and metabolic syndrome (DILGOM)">
##AnalysisTitlePlain=FINRISK: Whole-exome sequencing of Dietary, life style, and genetic determinants of obesity and metabolic syndrome (DILGOM)

The maintainers should also update the ##AnalysisTitleBrackets lines, either just removing them or perhaps rewriting them as ##AnalysisTitleStructured=<ID="FINRISK: … (DILGOM)">.

The maintainers will likely wish to apply similar changes to VCFv4.4.draft.tex.

Introduce the term "unstructured meta-information line", and reword this
section so it describes the two flavours of meta-information line clearly.
Specify that an unstructured value must not start with `<` (so that
structured/unstructured are easily distinguished) and must be non-empty.

Remove `<>` from unstructured `##pedigreeDB` example. PR samtools#88 removed
the `<>` from one instance of `##pedigreeDB=<url>` presumably on the
grounds that they were merely metasyntactic variable notation and not
intended to appear literally, but missed this instance.
The specification now consistently reflects that `##pedigreeDB`'s value
should not be delimited by angle brackets (despite being an URL!).

Adjust the failed_meta_pedigreedb_002.vcf files as the claimed cause
of failure is the invalid URL hostname rather than the angle brackets.
@jmarshall jmarshall added the vcf label Jan 5, 2022
@hts-specs-bot
Copy link

Changed PDFs as of 6f8b89a: VCFv4.3 (diff). This link will expire in 30 days

Copy link
Contributor

@jkbonfield jkbonfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A good improvement to the meta-data section. I assume we need to apply this to VCF4.4.draft.tex too though.

I also noted in my comments that we have a number of other issues related to quoting and "default" vs "extra" fields. I understand what it's trying to say, but it only really works if we actually define the default fields, which we don't! That's a topic for another PR though probably.

For example:
\begin{verbatim}
##INFO=<ID=ID,Number=number,Type=type,Description="description",Source="description",Version="128">
##INFO=<ID=ID,Number=number,Type=type,Description="description",Source="source",Version="128">
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I appreciate this hasn't changed and has come from the earlier version, but given it's an example I think it would be clearer with actual examples. Especially as "number" isn't a valid number, but we're using the verbatim style text rather than italic to indicate it's a placeholder term.

Eg:

##INFO=<ID=varType,Number=1,Type=String,Description="Variant type",Source="example",Version="128">

VCFv4.3.tex Outdated
\verb|##|\emph{key}\verb|=<|\emph{key}\verb|=|\emph{value}\verb|,|\emph{key}\verb|=|\emph{value}\verb|,|\emph{key}\verb|=|\emph{value}\verb|,|\ldots\verb|>|
\end{quote}
All structured lines require an ID which must be unique within their type, i.e., within all the meta-information lines with the same ``\verb|##|\emph{key}\verb|=|'' prefix.
For all of the structured lines (\verb|##INFO|, \verb|##FORMAT|, \verb|##FILTER|, etc.), extra fields can be included after the default fields.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I appreciate the clarification, and later it explains (partially) something that's always been a total mystery to me: why some strings are quoted in VCF and others are not.

However I'm unclear what "default fields" means. Can we assume for the examples shown that everything in the example is classified as "default". For example the definitions of ##INFO always have ID, Number, Type and Description so it's clear.

However this isn't univerally true for all structured lines. For example, the definition of ##contig in section 1.4.7 is:

The structured \texttt{contig} field must include the ID attribute and typically includes also sequence length, MD5 checksum, URL tag to indicate where the sequence can be found, etc.
For example:
\begin{verbatim}
##contig=<ID=ctg1,length=81195210,URL=ftp://somewhere.org/assembly.fa,...>
\end{verbatim}

"Typically includes" isn't exactly a tight term for describing what the default fields are. Why do our contig lines not quote all those values, given they're not explicitly defined and are simply examples of things that may occur. Is the fact that our examples always quote "species" but not "md5" is because we view "species" as a non-default field and "md5" as a default field? Or is it because in our examples the species is Homo sapiens and we decided to quote because of the space?

If I had to interpret this with no knowledge of what happens in practice, I'd say the ID, length and URL are default fields and the "..." indicates it can be followed by extra fields, but that's simply not true.

We later have an example here:

##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>

This ends with taxonomy=x which isn't quoted, and therefore is considered to be one of the default fields, and by dint of being after species we can determine that is also a default field (using the rule than extra fields always follow the default ones).

Put simply, I appreciate your improvements and are minded to merge as-is, but the entire "default" vs "extra" field is clear as mud. Maybe one of the original authors would care to explain this in a better way?

I think the error here though isn't with this PR, but with elsewhere in the document. Specifically the lack of a proper definition for ##contig. (Or PEDIGREE with the "Name_1" .. "Name_N" being used in the example, without quoting.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put simply, I appreciate your improvements and are minded to merge as-is

That is surely a determination to be made by the VCF maintainers…

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put simply, I appreciate your improvements and are minded to merge as-is

That is surely a determination to be made by the VCF maintainers…

Agreed, I wasn't actually going to merge, but rather poorly adding my personal view.

As for the other points I made, I note that ##INFO uses the terms "required" and "recommended" to define "default" and "extra" fields.

However they're not really the same thing. "Default" may mean a field whose key is defined in the VCF spec, but it could be optional. So probably ##INFO needs to be more explicit too (along with every other structured definition).

@jmarshall
Copy link
Member Author

A good improvement to the meta-data section. I assume we need to apply this to VCF4.4.draft.tex too though.

As noted in the introductory PR comment, “The maintainers will likely wish to apply similar changes to VCFv4.4.draft.tex”.

Most of your comments are IMHO issues for followup clarifications. I hope the VCF maintainers will consider this PR as is and look to merge it within a finite timespan, and consider questions around which fields are default fields etc to be issues they may wish to take up for separate future improvement.

PR samtools#88 removed the `<>` from `##pedigreeDB=<url>` in VCFv4.3.tex,
presumably on the grounds that they were merely metasyntactic variable
notation and not intended to appear literally. As some readers still
refer to these older documents, remove the misleading notation here too.
@jkbonfield
Copy link
Contributor

One other thought here, we know we can add our own additional meta-information lines. However as these are not formally defined, I assume all keys are therefore mandatory quoted (even if numbers) as they are all to be considered as "extra" and hence string only rather than the "default" keys with differing types? Does that need to be made more explicit somewhere?

An example would be ##GATKCommandLine=<...>, which does indeed quote everything, although in this case it's all strings anyway. I've also seen ##Method=<...> too, also quoted.

(Bcftools only seems to add unstructured lines.)

@hts-specs-bot
Copy link

Changed PDFs as of a8e8c2a: VCFv4.1 (diff), VCFv4.2 (diff), VCFv4.3 (diff). This link will expire in 30 days

@jmarshall
Copy link
Member Author

Suggestion from the meeting: also mention that there can be structured lines not specified by the specification, and say something about default fields (only ID presumably) for those and that all others need to be quoted.

@hts-specs-bot
Copy link

Changed PDFs as of 2da6d4e: VCFv4.1 (diff), VCFv4.2 (diff), VCFv4.3 (diff). This link will expire in 30 days

@jmarshall
Copy link
Member Author

jmarshall commented Feb 1, 2022

As per the suggestion in #620 (comment) and at last month's meeting, I've added a sentence mentioning the possibility of other structured meta-information lines beyond those defined in the spec and noting that their only default field is ID.

I did not add anything about quoting on these lines, because IMHO the existing text makes an unclear distinction between “extra fields” (which for these lines would be all fields other than ID) and “optional fields“ (which are the ones that require quoting). Going further would require a VCF maintainer to determine what is intended here.

\verb|##|\emph{key}\verb|=<|\emph{key}\verb|=|\emph{value}\verb|,|\emph{key}\verb|=|\emph{value}\verb|,|\emph{key}\verb|=|\emph{value}\verb|,|\ldots\verb|>|
\end{quote}
All structured lines require an ID which must be unique within their type, i.e., within all the meta-information lines with the same ``\verb|##|\emph{key}\verb|=|'' prefix.
For all of the structured lines (\verb|##INFO|, \verb|##FORMAT|, \verb|##FILTER|, etc.) described in this specification, extra fields can be included after the default fields.
Copy link
Contributor

@tcezard tcezard Feb 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
For all of the structured lines (\verb|##INFO|, \verb|##FORMAT|, \verb|##FILTER|, etc.) described in this specification, extra fields can be included after the default fields.
For all of the structured lines (\verb|##INFO|, \verb|##FORMAT|, \verb|##FILTER|, etc.) described in this specification, extra fields can be included and are in some case required, after the default fields.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which ones are you suggesting are required?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To my mind, ID+Number/Type/Description are the “default fields” for ##INFO.

IMHO the problem here is that §1.4 starts talking about details of INFO etc that should be in §1.4.2 etc. Disentangling that (and revisiting the mysterious default/extra/optional fields terminology) I think should be a followup to this PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants