Created directory for test data with empty VCF file #491

rishidev · 2020-04-30T19:29:27Z

Proposed structure for the test files is test-file/file-type/the-test-files. For this initial version the path has been created with an empty file as an initial VCF example

rishidev · 2020-05-17T18:30:46Z

Relates to #443

jmarshall · 2020-05-18T13:15:09Z

Note that (IIRC) refget, SAM, and VCF are all currently considering adding example / test files alongside their specifications in this repository. So the layout is a discussion amongst all of us.

The first question is whether we want these data files to be copied to the http://samtools.github.io/hts-specs/ website, perhaps with prose lists explaining what the various files are. This could vary on a case by case basis: examples of common constructs could be made easily accessible on the web site but obscure corner case test files might not be.

For most spec readers, these are sample or example files; it's only us spec (and maybe tool) authors who want to emphasise the test data aspect. So I'd prefer a top-level directory of samples or (even better) examples, with per-format directories under there in which the various groups could do whatever they wanted. These subdirectories could contain index.md files listing the example files with links and brief descriptions. Purely corner-case test files could be hidden from the web site in _test subdirectories.

(Also note that an empty file is not a valid VCF file.)

rishidev · 2020-05-20T15:36:43Z

The commit made reflects decisions from the Future of VCF group 2020-05-18, which was to have a directory per file(s) with a readme. Invalid files are acceptable here as they can be used for testing. @jmarshall Thanks for the comments, will socialise and request comments from outside the VCF group.

jmmut · 2020-05-27T13:04:02Z

Just to kickstart discussion, I added a commit to rishi's branch as agreed in the last "future of VCF" call. The files are basically taken from https://github.com/EBIvariation/vcf-validator/tree/master/test/input_files/v4.3 .

Ideas to discuss:

README file per folder? what kind of template do we use?
separation in different folders by VCF version?
file naming conventions?

jmarshall · 2020-06-15T20:28:02Z

The bulk of the files added in this PR since #491 (comment) are clearly test files to exercise tools and corner cases rather than example files that would be of interest to newcomers learning about the format.

It may be worth having separate examples/{sam,cram,vcf}/… and _test/{sam,cram,vcf}/… trees for the two categories. The former, examples/… would appear on the website and need index.md files describing and linking to the example files. (The example files might also be \verbatiminputed or similar into the specification documents, as I think has been suggested for SAM/VCF.) The latter, _test/… (the leading underscore hides the directory from Jekyll) would be more freeform and usable by scripts in other repositories.

(A parallel PR for CRAM uses a uppercase directory name for …/SAM/… and/or …/CRAM/…. This would need to be agreed on; personally I prefer this PR's lowercase for the format directory.)

jkbonfield · 2020-06-16T07:58:50Z

I dumped a huge block of SAM files as starting points, but like this they're mainly htslib / io_lib tests. Hence very much WIP and a place to crib ideas from rather than the finished product.

I've been going through more rigorously creating specific individual tests for each of the SAM fields, so will replace them with that at some point. Those are tailored very much to testing normal content (useful to get started) and corner cases such as read length 254 (should work) and 255 (should fail) plus "*" and similar.

I'll try and get what I've done added to my branch so people can see the route it's going.

jmmut · 2020-06-23T20:31:14Z

I updated this PR according to the comments from @jmarshall:

rename to "_test/" to avoid copying to website.
add "examples/". I only wrote the first example in the VCF spec but we might want to do that for most examples. I only had to replace spaces by tabs when they were column separators.
add README.md in _test/vcf/4.3/ to describe a bit the "passed/" and "failed/" folders. I also added a basic index.md for the examples but we need to define better what we want to include.

Also, questions:

added a "4.3/" version folder. Does this make sense for the other formats?
Do we want to keep a consistent file naming? the ones in VCF are like failed_body_info_001.vcf and the ones in Initial set of validation / test data for SAM #497 are like cigar.fail4.sam and hdr.SQ5.sam. I'm happy to change the VCF ones but I also would understand avoiding putting effort on normalising things that will diverge naturally anyway. Even for tools that deal with several formats, a uniform naming would not help as it would usually run on all files in the folder, e.g. _test/vcf/4.3/failed/*.
For the "examples/index.md", my gut feeling is to either have nothing (just a list of links to the examples), or have a section for each example with basically a copy of the explanation on the spec. If we come up with better explanations for the examples, I think it's worth to include that in the spec too.

Some links: general issue #443, VCF PR #491 (this one), SAM PR #497.

It would be great to have feedback about those questions and anything else people think. @lbergelson, @pd3 , @yfarjoun , @jkbonfield , @daviesrob , @tskir, @d-cameron. (sorry if I'm missing anyone interested).

@andrewyatz We didn't have time in the last refget call to talk about this. Do you think this folder structure for test files makes sense for refget? Being an API it would be useful mostly to show "examples/" but maybe adding some expected responses under "_test/" is also useful for testing both implementations and clients. This applies to htsget too, I guess, @mlin, but I know there's something more ambitious regarding testing going on htsget.

jkbonfield · 2020-06-24T08:08:26Z

I should rename some of the SAM ones. They're rather pointless having pass and fail in the filenames when they also have subdirectories. That stems from when I had them in the same directory. hdr.RG1.sam, hdr.RG2.sam etc isn't too descriptive either. Maybe it should be hdr.RG-ID.sam, hdr.RG-BC.sam, and so on. More work to do here, along with renaming the directory to lowercase.

I also have a category of "warn" though. They're passes as they are legal syntax, but aren't internally consistent. They're often the sort of thing we may see in real data too, so a decode should be able to cope.

For example alignment off the end of the chromosome (bwa can sometimes do this), unmapped data with CIGAR and MAPQ values filled out (bwa again), or incorrect TLEN fields (syntacically correct, just incorrect maths).

Other cases of warnings are things that are technically valid, but don't occur in real world examples and are likely problematic. As they are spec compliant in every way we need to have them, but it's justified perhaps to complain. Eg a zero length sequence with cigar 0M, non IUPAC bases in sequence (legal, but never going to work for BAM).

For CRAM, not yet in a PR, I've been taking a different approach with naming. All my files start with numerics, so they naturally sort. None of the early files use any CRAM features tested by the later files. Eg we start with header, then empty containers, then unmapped data, then mapped data without sequence differences, etc, building up the complexity layer by layer. The aim here is to provide a sort of narrative for people wishing to write a decoder implementation so the validation data is naturally also a to-do list.

As for versioning, one option is to change the toplevel directory. eg cram3.0 and cram4.0. We're never going to have a silly number of versions so putting them side by side isn't a big issue. However possibly you could have vcf/passed, vcf/failed for generic cross-version tests and vcf/4.3/passed, vcf/4.4/passed, etc for version specific additional tests. So there may be some merit to a hierarchy.

jkbonfield · 2020-06-24T08:18:35Z

@jmarshall what's the reasoning for renaming test to _test as opposed to simply adding test to the existing list of exclude lines in _config.yml as we already do for new and scripts?

I can understand using underscores for hts-specs internal stuff, like the Jekyll layouts and include directories, but test is a "first class citizen" that we wish people to be looking at. We don't want to give the appearance that it's some internal thing they're not meant to be exploring.

To be honest, I also don't really see why it's harmful if it gets copied to the web site either. Maybe it'll slow down web site deployment?

jmarshall · 2020-06-24T09:01:43Z

The reasoning was to follow the existing explicit convention (_config.yml, _includes, _layouts) for checked-in things, and be explicit about the website implications difference between examples/ and _test/. The ephemeral directories (new, diff) are a bit looser… and I guess we already diverged from that with img and scripts. Anyway I don't have a strong opinion either way, and renaming to a _config.yml-excluded test/ is probably better — ‘first-class’ as you say.

They shouldn't be on the web site because as corner case tests they are not of interest to readers of the web site. (If a particular file were of interest to web site readers, it should be in examples/ instead.) Omitting them from the web site also means we don't have to write index.md prose describing them or supplying links to them.

jkbonfield · 2020-06-24T16:14:55Z

I've tidied up the SAM PR #497 and I think it's reasonably complete now.

The CRAM one is ongoing, but I created a draft PR #512 so people can comment if they wish.

…ty file

topics: - README file per folder? what kind of template do we use? - separation in different folders by VCF version? - file naming conventions?

nh13 · 2020-08-24T15:47:20Z

@rishidev @jmarshall I am writing a tool to validate VCFs for my own purposes. Is this the main PR for VCF test files?

tskir · 2021-04-01T15:33:02Z

As discussed on the File Formats call 2021-04-01 between @jmarshall, @lbergelson, @tcezard and @tskir, this PR will be merged in its current form in four weeks (2021-04-29), unless there are strong objections.

examples/vcf/index.md

jmarshall · 2021-04-29T14:55:35Z

The test/vcf/4.{1,2,3} directories appear to essentially contain three copies of the same files, basically just with different ##fileformat=VCFv4.X header lines. While this may be useful for the validator test suite that the files came from, is it really beneficial here?

tskir · 2021-04-29T14:59:25Z

@jmarshall I see your point, but I don't know if there's a better way to store this. Each VCF version does need a separate test suite, and even if there are no differences for now, this can change over time. Especially once 4.4 is released

rishidev added the vcf label Apr 30, 2020

rishidev requested a review from lbergelson April 30, 2020 20:48

rishidev added cram refget sam labels May 20, 2020

jmmut mentioned this pull request Jun 23, 2020

Examples of VCFs as regular independent files #443

Open

d-cameron mentioned this pull request Jul 13, 2020

Added SVCLAIM field #517

Merged

rishidev and others added 10 commits July 13, 2020 10:42

Created directory for test data with empty VCF file

8036c3d

added version to the file path with empty VCF examples for both

e1fd673

Moved to a _ in main dir name and to per directory file sets with emp…

9a927a3

…ty file

add failed and passed VCFs (v4.3) to spark discussion

5e92de9

topics: - README file per folder? what kind of template do we use? - separation in different folders by VCF version? - file naming conventions?

renamed test-files to _test

a68565f

add examples/vcf/index.md

9457b84

rename _test to test

a574de3

add test/vcf/4.1 and test/vcf/4.2

dfadc81

exclude test/ from showing in the website (through github pages)

fafcb97

move README to cover all VCF versions

7408974

jmmut force-pushed the test-directory branch from dbbc3f8 to 7408974 Compare July 13, 2020 10:07

tskir approved these changes Apr 1, 2021

View reviewed changes

jmarshall reviewed Apr 29, 2021

View reviewed changes

examples/vcf/index.md Outdated Show resolved Hide resolved

examples/vcf/index.md Outdated Show resolved Hide resolved

Address comments by @jmarshall

cf717d7

tskir merged commit 7aeed5b into samtools:master Apr 29, 2021

daviesrob mentioned this pull request May 18, 2021

Added case of bracketed list of values to bcf_hdr_parse_line. samtools/htslib#1240

Merged

jmarshall mentioned this pull request Feb 7, 2023

vcf: Valid contig ID in failed example #701

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Created directory for test data with empty VCF file #491

Created directory for test data with empty VCF file #491

rishidev commented Apr 30, 2020

rishidev commented May 17, 2020

jmarshall commented May 18, 2020

rishidev commented May 20, 2020

jmmut commented May 27, 2020 •

edited

Loading

jmarshall commented Jun 15, 2020

jkbonfield commented Jun 16, 2020

jmmut commented Jun 23, 2020

jkbonfield commented Jun 24, 2020

jkbonfield commented Jun 24, 2020

jmarshall commented Jun 24, 2020

jkbonfield commented Jun 24, 2020

nh13 commented Aug 24, 2020

tskir commented Apr 1, 2021

jmarshall commented Apr 29, 2021

tskir commented Apr 29, 2021

Created directory for test data with empty VCF file #491

Created directory for test data with empty VCF file #491

Conversation

rishidev commented Apr 30, 2020

rishidev commented May 17, 2020

jmarshall commented May 18, 2020

rishidev commented May 20, 2020

jmmut commented May 27, 2020 • edited Loading

jmarshall commented Jun 15, 2020

jkbonfield commented Jun 16, 2020

jmmut commented Jun 23, 2020

jkbonfield commented Jun 24, 2020

jkbonfield commented Jun 24, 2020

jmarshall commented Jun 24, 2020

jkbonfield commented Jun 24, 2020

nh13 commented Aug 24, 2020

tskir commented Apr 1, 2021

jmarshall commented Apr 29, 2021

tskir commented Apr 29, 2021

jmmut commented May 27, 2020 •

edited

Loading