Clarify description of structured vs unstructured meta-information lines #620

Merged
merged 4 commits into from
Aug 22, 2022
34 changes: 24 additions & 10 deletions VCFv4.3.tex
@@ -100,24 +100,37 @@ \subsection{Data types}
For the Integer type, the values from $-2^{31}$ to $-2^{31}+7$ cannot be stored in the binary version and therefore are disallowed in both VCF and BCF, see \ref{BcfTypeEncoding}.

\subsection{Meta-information lines}
File meta-information is included after the \#\# string and must be key=value pairs.
Meta-information lines are optional, but if they are present then they must be completely well-formed.
Note that BCF, the binary counterpart of VCF, requires that all entries are present.
It is recommended to include meta-information lines describing the entries used in the body of the VCF file.
File meta-information lines start with ``\verb|##|'' and must appear first in the VCF file, before the header line (section~\ref{header-line}) and data record lines (section~\ref{data-lines}).
They may be either \emph{unstructured} or \emph{structured}.

An \emph{unstructured} meta-information line consists of a~\emph{key} (denoting the type of meta-information recorded) and a~\emph{value} (which may not be empty and must not start with a `\verb|<|' character), separated by an `\verb|=|' character:
\begin{quote}
\verb|##|\emph{key}\verb|=|\emph{value}
\end{quote}
Several unstructured meta-information lines are defined in this specification, notably \verb|##fileformat|.
Others not defined by this specification, e.g.\ \verb|##fileDate| and \verb|##source|, are commonly found in VCF files.
These typically have meanings that are obvious, or they are immaterial for processing the file, or both.

All structured lines that have their value enclosed within ``$<>$'' require an ID which must be unique within their type.
For all of the structured lines (\#\#INFO, \#\#FORMAT, \#\#FILTER, etc.), extra fields can be included after the default fields.
A \emph{structured} meta-information line is similar, but the value is itself a comma-separated list of key=value pairs, enclosed within `\verb|<|' and `\verb|>|' characters:
\begin{quote}
\verb|##|\emph{key}\verb|=<|\emph{key}\verb|=|\emph{value}\verb|,|\emph{key}\verb|=|\emph{value}\verb|,|\emph{key}\verb|=|\emph{value}\verb|,|\ldots\verb|>|
\end{quote}
All structured lines require an ID which must be unique within their type, i.e., within all the meta-information lines with the same ``\verb|##|\emph{key}\verb|=|'' prefix.
For all of the structured lines (\verb|##INFO|, \verb|##FORMAT|, \verb|##FILTER|, etc.), extra fields can be included after the default fields.
Contributor

I appreciate the clarification, and later it explains (partially) something that's always been a total mystery to me: why some strings are quoted in VCF and others are not.

However I'm unclear what "default fields" means. Can we assume for the examples shown that everything in the example is classified as "default"? For example, the definitions of ##INFO always have ID, Number, Type and Description, so that case is clear.

However this isn't universally true for all structured lines. For example, the definition of ##contig in section 1.4.7 is:

The structured \texttt{contig} field must include the ID attribute and typically includes also sequence length, MD5 checksum, URL tag to indicate where the sequence can be found, etc.
For example:
\begin{verbatim}
##contig=<ID=ctg1,length=81195210,URL=ftp://somewhere.org/assembly.fa,...>
\end{verbatim}

"Typically includes" isn't exactly a tight term for describing what the default fields are. Why do our contig lines not quote all those values, given they're not explicitly defined and are simply examples of things that may occur? Is the fact that our examples always quote "species" but not "md5" because we view "species" as a non-default field and "md5" as a default field? Or is it because in our examples the species is Homo sapiens and we decided to quote because of the space?

If I had to interpret this with no knowledge of what happens in practice, I'd say the ID, length and URL are default fields and the "..." indicates it can be followed by extra fields, but that's simply not true.

We later have an example here:

##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>

This ends with taxonomy=x, which isn't quoted and is therefore considered to be one of the default fields; since it comes after species, we can determine that species is also a default field (using the rule that extra fields always follow the default ones).

Put simply, I appreciate your improvements and am minded to merge as-is, but the entire "default" vs "extra" field distinction is clear as mud. Maybe one of the original authors would care to explain this in a better way?

I think the error here though isn't with this PR, but with elsewhere in the document. Specifically the lack of a proper definition for ##contig. (Or PEDIGREE with the "Name_1" .. "Name_N" being used in the example, without quoting.)
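
The structured/unstructured split and the quoting question raised above can be made concrete with a minimal Python sketch. It assumes only the grammar stated in the revised text (an unstructured value must not start with `<`; a structured value is a comma-separated list of key=value pairs inside `<` and `>`) plus the assumption that a quoted value may contain commas. The names `parse_meta_line`, `_META_RE` and `_PAIR_RE` are hypothetical and do not come from the specification or any VCF library.

```python
import re

# Hypothetical helper names; this is an illustration, not a reference parser.
_META_RE = re.compile(r"^##([^=]+)=(.*)$")
# One key=value pair: the value is either a double-quoted string (which may
# contain commas) or a run of characters up to the next comma.
_PAIR_RE = re.compile(r'([^=,]+)=("(?:[^"\\]|\\.)*"|[^,]*)(?:,|$)')


def parse_meta_line(line):
    """Return (key, value).

    value is a str for an unstructured line and a dict for a structured one.
    Quotes are kept on dict values so callers can see what was quoted.
    """
    m = _META_RE.match(line.rstrip("\n"))
    if m is None:
        raise ValueError("not a meta-information line")
    key, value = m.group(1), m.group(2)
    if not value:
        raise ValueError("value may not be empty")
    if not value.startswith("<"):
        return key, value  # unstructured: ##key=value
    if not value.endswith(">"):
        raise ValueError("structured value must end with '>'")

    body = value[1:-1]
    fields = {}
    pos = 0
    while pos < len(body):
        pair = _PAIR_RE.match(body, pos)
        if pair is None:
            raise ValueError("expected key=value at offset %d" % pos)
        fields[pair.group(1)] = pair.group(2)
        pos = pair.end()
    return key, fields
```

Under these assumptions, `##pedigreeDB=ftp://www.ebi.ac.uk:8080/valid/host/to/file.db` parses as an unstructured line, whereas the older `##pedigreeDB=<ftp://www.ebi.ac.uk:8080/valid/host/to/file.db>` form is rejected because the text inside the angle brackets is not a key=value list.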

Member Author

Put simply, I appreciate your improvements and am minded to merge as-is

That is surely a determination to be made by the VCF maintainers…

Contributor

Put simply, I appreciate your improvements and am minded to merge as-is

That is surely a determination to be made by the VCF maintainers…

Agreed, I wasn't actually going to merge, but was rather poorly adding my personal view.

As for the other points I made, I note that ##INFO uses the terms "required" and "recommended" to define "default" and "extra" fields.

However they're not really the same thing. "Default" may mean a field whose key is defined in the VCF spec, but it could be optional. So probably ##INFO needs to be more explicit too (along with every other structured definition).
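
One way to see the gap between "defined" and "extra" fields is to check the quoting rule mechanically. The Python sketch below assumes that the spec-defined keys for ##INFO are exactly ID, Number, Type and Description (the four named earlier in this thread), and that every other key counts as an extra field whose value must be a quoted string. Both the key set and the function name `unquoted_extra_fields` are assumptions for illustration only.

```python
import re

# Assumption: the keys the spec itself defines for ##INFO lines.
INFO_DEFINED_KEYS = frozenset({"ID", "Number", "Type", "Description"})

# One key=value pair; a double-quoted value may contain commas.
_PAIR_RE = re.compile(r'([^=,]+)=("(?:[^"\\]|\\.)*"|[^,]*)(?:,|$)')


def unquoted_extra_fields(info_line, defined_keys=INFO_DEFINED_KEYS):
    """Return the keys of extra fields whose values are not quoted strings."""
    prefix = "##INFO=<"
    if not (info_line.startswith(prefix) and info_line.endswith(">")):
        raise ValueError("not a structured ##INFO line")
    body = info_line[len(prefix):-1]

    offenders = []
    # findall silently skips text it cannot match, which is acceptable for a
    # sketch but would hide malformed input in a real validator.
    for key, value in _PAIR_RE.findall(body):
        quoted = len(value) >= 2 and value.startswith('"') and value.endswith('"')
        if key not in defined_keys and not quoted:
            offenders.append(key)
    return offenders
```

Passing a different `defined_keys` set shows how the verdict for a line such as ##contig depends entirely on which keys are deemed "default", which is the ambiguity being discussed.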

For example:
\begin{verbatim}
##INFO=<ID=ID,Number=number,Type=type,Description="description",Source="description",Version="128">
##INFO=<ID=ID,Number=number,Type=type,Description="description",Source="source",Version="128">
Contributor

I appreciate this hasn't changed and has come from the earlier version, but given it's an example I think it would be clearer with actual values, especially as "number" isn't a valid number and we're using verbatim-style text rather than italics, so nothing marks it as a placeholder term.

E.g.:

##INFO=<ID=varType,Number=1,Type=String,Description="Variant type",Source="example",Version="128">

\end{verbatim}
In the above example, the extra fields of ``Source'' and ``Version'' are provided.
Optional fields must be stored as strings even for numeric values.
The values of optional fields must be written as quoted strings, even for numeric values.

It is recommended in VCF and required in BCF that the header includes tags describing the reference and contigs backing the data contained in the file.
These tags are based on the SQ field from the SAM spec; all tags are optional (see the VCF example above).

Meta-information lines can be in any order with the exception of `fileformat` which must come first.
Meta-information lines are optional, but if they are present then they must be completely well-formed.
Other than \verb|##fileformat|, they may appear in any order.
Note that BCF, the binary counterpart of VCF, requires that all entries are present.
It is recommended to include meta-information lines describing the entries used in the body of the VCF file.


\subsubsection{File format}
@@ -266,6 +279,7 @@ \subsubsection{Pedigree field format}


\subsection{Header line syntax}
\label{header-line}
The header line names the 8 fixed, mandatory columns. These columns are as follows:
\begin{center}
\#CHROM
@@ -1306,7 +1320,7 @@ \subsubsection{Clonal derivation relationships}
Alternately, if data on the genomes is compiled in a database, a simple pointer can be provided:

\begin{verbatim}
##pedigreeDB=<url>
##pedigreeDB=URL
\end{verbatim}

\begin{samepage}
2 changes: 1 addition & 1 deletion test/vcf/4.1/failed/failed_meta_pedigreedb_002.vcf
@@ -1,5 +1,5 @@
##fileformat=VCFv4.1
##CauseOfFailure=Non-valid URL
##pedigreeDB=<ftp://8080:8080/not-valid/host/to/pedigreeDB>
##pedigreeDB=ftp://8080:8080/not-valid/host/to/pedigreeDB
#CHROM POS ID REF ALT QUAL FILTER INFO
1 123 . TC T . . .
2 changes: 1 addition & 1 deletion test/vcf/4.1/passed/complexfile_passed_000.vcf
@@ -32,7 +32,7 @@
##assembly=ftp://user@host:8080/path/to/file.fastq
##PEDIGREE=<Name_0=Something>
##PEDIGREE=<Name_0=Something,Name_1=Something-else>
##pedigreeDB=<ftp://user@host:8080/path/to/pedigreeDB?arg1=db1>
##pedigreeDB=ftp://user@host:8080/path/to/pedigreeDB?arg1=db1
##contig=<ID=1>
##contig=<ID=contig_url,URL=ftp://user@host:8080/path/to/contig>
##contig=<ID=contig_accession,species="Homo sapiens",accession=GCA_000001405.1>
4 changes: 2 additions & 2 deletions test/vcf/4.1/passed/passed_meta_pedigreedb.vcf
@@ -1,5 +1,5 @@
##fileformat=VCFv4.1
##pedigreeDB=<ftp://www.ebi.ac.uk:8080/valid/host/to/file.db>
##pedigreeDB=<http://123.0.1.2:8080/valid/host/to/file.db>
##pedigreeDB=ftp://www.ebi.ac.uk:8080/valid/host/to/file.db
##pedigreeDB=http://123.0.1.2:8080/valid/host/to/file.db
#CHROM POS ID REF ALT QUAL FILTER INFO
1 123 . TC T . . .
2 changes: 1 addition & 1 deletion test/vcf/4.2/failed/failed_meta_pedigreedb_002.vcf
@@ -1,5 +1,5 @@
##fileformat=VCFv4.2
##CauseOfFailure=Non-valid URL
##pedigreeDB=<ftp://8080:8080/not-valid/host/to/pedigreeDB>
##pedigreeDB=ftp://8080:8080/not-valid/host/to/pedigreeDB
#CHROM POS ID REF ALT QUAL FILTER INFO
1 123 . TC T . . .
2 changes: 1 addition & 1 deletion test/vcf/4.2/passed/complexfile_passed_000.vcf
@@ -32,7 +32,7 @@
##assembly=ftp://user@host:8080/path/to/file.fastq
##PEDIGREE=<Name_0=Something>
##PEDIGREE=<Name_0=Something,Name_1=Something-else>
##pedigreeDB=<ftp://user@host:8080/path/to/pedigreeDB?arg1=db1>
##pedigreeDB=ftp://user@host:8080/path/to/pedigreeDB?arg1=db1
##contig=<ID=1>
##contig=<ID=contig_url,URL=ftp://user@host:8080/path/to/contig>
##contig=<ID=contig_accession,species="Homo sapiens",accession=GCA_000001405.1>
4 changes: 2 additions & 2 deletions test/vcf/4.2/passed/passed_meta_pedigreedb.vcf
@@ -1,5 +1,5 @@
##fileformat=VCFv4.2
##pedigreeDB=<ftp://www.ebi.ac.uk:8080/valid/host/to/file.db>
##pedigreeDB=<http://123.0.1.2:8080/valid/host/to/file.db>
##pedigreeDB=ftp://www.ebi.ac.uk:8080/valid/host/to/file.db
##pedigreeDB=http://123.0.1.2:8080/valid/host/to/file.db
#CHROM POS ID REF ALT QUAL FILTER INFO
1 123 . TC T . . .
2 changes: 1 addition & 1 deletion test/vcf/4.3/failed/failed_meta_pedigreedb_002.vcf
@@ -1,5 +1,5 @@
##fileformat=VCFv4.3
##CauseOfFailure=Non-valid URL
##pedigreeDB=<ftp://8080:8080/not-valid/host/to/pedigreeDB>
##pedigreeDB=ftp://8080:8080/not-valid/host/to/pedigreeDB
#CHROM POS ID REF ALT QUAL FILTER INFO
1 123 . TC T . . .
2 changes: 1 addition & 1 deletion test/vcf/4.3/passed/complexfile_passed_000.vcf
@@ -32,7 +32,7 @@
##assembly=ftp://user@host:8080/path/to/file.fastq
##PEDIGREE=<ID=Pedigree1,Original=Something>
##PEDIGREE=<ID=Pedigree2,Name_0=Something,Name_1=Something-else>
##pedigreeDB=<ftp://user@host:8080/path/to/pedigreeDB?arg1=db1>
##pedigreeDB=ftp://user@host:8080/path/to/pedigreeDB?arg1=db1
##contig=<ID=1>
##contig=<ID=contig_url,URL=ftp://user@host:8080/path/to/contig>
##contig=<ID=contig_accession,species="Homo sapiens",accession=GCA_000001405.1>
4 changes: 2 additions & 2 deletions test/vcf/4.3/passed/passed_meta_pedigreedb.vcf
@@ -1,5 +1,5 @@
##fileformat=VCFv4.3
##pedigreeDB=<ftp://www.ebi.ac.uk:8080/valid/host/to/file.db>
##pedigreeDB=<http://123.0.1.2:8080/valid/host/to/file.db>
##pedigreeDB=ftp://www.ebi.ac.uk:8080/valid/host/to/file.db
##pedigreeDB=http://123.0.1.2:8080/valid/host/to/file.db
#CHROM POS ID REF ALT QUAL FILTER INFO
1 123 . TC T . . .