Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Created directory for test data with empty VCF file #491

Merged
merged 11 commits into from
Apr 29, 2021

Conversation

rishidev
Copy link
Contributor

Proposed structure for the test files is test-file/file-type/the-test-files. For this initial version the path has been created with an empty file as an initial VCF example

@rishidev
Copy link
Contributor Author

Relates to #443

@jmarshall
Copy link
Member

Note that (IIRC) refget, SAM, and VCF are all currently considering adding example / test files alongside their specifications in this repository. So the layout is a discussion amongst all of us.

The first question is whether we want these data files to be copied to the http://samtools.github.io/hts-specs/ website, perhaps with prose lists explaining what the various files are. This could vary on a case by case basis: examples of common constructs could be made easily accessible on the web site but obscure corner case test files might not be.

For most spec readers, these are sample or example files; it's only us spec (and maybe tool) authors who want to emphasise the test data aspect. So I'd prefer a top-level directory of samples or (even better) examples, with per-format directories under there in which the various groups could do whatever they wanted. These subdirectories could contain index.md files listing the example files with links and brief descriptions. Purely corner-case test files could be hidden from the web site in _test subdirectories.

(Also note that an empty file is not a valid VCF file.)

@rishidev
Copy link
Contributor Author

The commit made reflects decisions from the Future of VCF group 2020-05-18, which was to have a directory per file(s) with a readme. Invalid files are acceptable here as they can be used for testing. @jmarshall Thanks for the comments, will socialise and request comments from outside the VCF group.

@jmmut
Copy link
Contributor

jmmut commented May 27, 2020

Just to kickstart discussion, I added a commit to rishi's branch as agreed in the last "future of VCF" call. The files are basically taken from https://github.com/EBIvariation/vcf-validator/tree/master/test/input_files/v4.3 .

Ideas to discuss:

  • README file per folder? what kind of template do we use?
  • separation in different folders by VCF version?
  • file naming conventions?

@jmarshall
Copy link
Member

The bulk of the files added in this PR since #491 (comment) are clearly test files to exercise tools and corner cases rather than example files that would be of interest to newcomers learning about the format.

It may be worth having separate examples/{sam,cram,vcf}/… and _test/{sam,cram,vcf}/… trees for the two categories. The former, examples/… would appear on the website and need index.md files describing and linking to the example files. (The example files might also be \verbatiminputed or similar into the specification documents, as I think has been suggested for SAM/VCF.) The latter, _test/… (the leading underscore hides the directory from Jekyll) would be more freeform and usable by scripts in other repositories.

(A parallel PR for CRAM uses a uppercase directory name for …/SAM/… and/or …/CRAM/…. This would need to be agreed on; personally I prefer this PR's lowercase for the format directory.)

@jkbonfield
Copy link
Contributor

I dumped a huge block of SAM files as starting points, but like this they're mainly htslib / io_lib tests. Hence very much WIP and a place to crib ideas from rather than the finished product.

I've been going through more rigorously creating specific individual tests for each of the SAM fields, so will replace them with that at some point. Those are tailored very much to testing normal content (useful to get started) and corner cases such as read length 254 (should work) and 255 (should fail) plus "*" and similar.

I'll try and get what I've done added to my branch so people can see the route it's going.

@jmmut
Copy link
Contributor

jmmut commented Jun 23, 2020

I updated this PR according to the comments from @jmarshall:

  • rename to "_test/" to avoid copying to website.
  • add "examples/". I only wrote the first example in the VCF spec but we might want to do that for most examples. I only had to replace spaces by tabs when they were column separators.
  • add README.md in _test/vcf/4.3/ to describe a bit the "passed/" and "failed/" folders. I also added a basic index.md for the examples but we need to define better what we want to include.

Also, questions:

  • added a "4.3/" version folder. Does this make sense for the other formats?
  • Do we want to keep a consistent file naming? the ones in VCF are like failed_body_info_001.vcf and the ones in Initial set of validation / test data for SAM #497 are like cigar.fail4.sam and hdr.SQ5.sam. I'm happy to change the VCF ones but I also would understand avoiding putting effort on normalising things that will diverge naturally anyway. Even for tools that deal with several formats, a uniform naming would not help as it would usually run on all files in the folder, e.g. _test/vcf/4.3/failed/*.
  • For the "examples/index.md", my gut feeling is to either have nothing (just a list of links to the examples), or have a section for each example with basically a copy of the explanation on the spec. If we come up with better explanations for the examples, I think it's worth to include that in the spec too.

Some links: general issue #443, VCF PR #491 (this one), SAM PR #497.

It would be great to have feedback about those questions and anything else people think. @lbergelson, @pd3 , @yfarjoun , @jkbonfield , @daviesrob , @tskir, @d-cameron. (sorry if I'm missing anyone interested).

@andrewyatz We didn't have time in the last refget call to talk about this. Do you think this folder structure for test files makes sense for refget? Being an API it would be useful mostly to show "examples/" but maybe adding some expected responses under "_test/" is also useful for testing both implementations and clients. This applies to htsget too, I guess, @mlin, but I know there's something more ambitious regarding testing going on htsget.

@jkbonfield
Copy link
Contributor

I should rename some of the SAM ones. They're rather pointless having pass and fail in the filenames when they also have subdirectories. That stems from when I had them in the same directory. hdr.RG1.sam, hdr.RG2.sam etc isn't too descriptive either. Maybe it should be hdr.RG-ID.sam, hdr.RG-BC.sam, and so on. More work to do here, along with renaming the directory to lowercase.

I also have a category of "warn" though. They're passes as they are legal syntax, but aren't internally consistent. They're often the sort of thing we may see in real data too, so a decode should be able to cope.

For example alignment off the end of the chromosome (bwa can sometimes do this), unmapped data with CIGAR and MAPQ values filled out (bwa again), or incorrect TLEN fields (syntacically correct, just incorrect maths).

Other cases of warnings are things that are technically valid, but don't occur in real world examples and are likely problematic. As they are spec compliant in every way we need to have them, but it's justified perhaps to complain. Eg a zero length sequence with cigar 0M, non IUPAC bases in sequence (legal, but never going to work for BAM).

For CRAM, not yet in a PR, I've been taking a different approach with naming. All my files start with numerics, so they naturally sort. None of the early files use any CRAM features tested by the later files. Eg we start with header, then empty containers, then unmapped data, then mapped data without sequence differences, etc, building up the complexity layer by layer. The aim here is to provide a sort of narrative for people wishing to write a decoder implementation so the validation data is naturally also a to-do list.

As for versioning, one option is to change the toplevel directory. eg cram3.0 and cram4.0. We're never going to have a silly number of versions so putting them side by side isn't a big issue. However possibly you could have vcf/passed, vcf/failed for generic cross-version tests and vcf/4.3/passed, vcf/4.4/passed, etc for version specific additional tests. So there may be some merit to a hierarchy.

@jkbonfield
Copy link
Contributor

@jmarshall what's the reasoning for renaming test to _test as opposed to simply adding test to the existing list of exclude lines in _config.yml as we already do for new and scripts?

I can understand using underscores for hts-specs internal stuff, like the Jekyll layouts and include directories, but test is a "first class citizen" that we wish people to be looking at. We don't want to give the appearance that it's some internal thing they're not meant to be exploring.

To be honest, I also don't really see why it's harmful if it gets copied to the web site either. Maybe it'll slow down web site deployment?

@jmarshall
Copy link
Member

The reasoning was to follow the existing explicit convention (_config.yml, _includes, _layouts) for checked-in things, and be explicit about the website implications difference between examples/ and _test/. The ephemeral directories (new, diff) are a bit looser… and I guess we already diverged from that with img and scripts. Anyway I don't have a strong opinion either way, and renaming to a _config.yml-excluded test/ is probably better — ‘first-class’ as you say.

They shouldn't be on the web site because as corner case tests they are not of interest to readers of the web site. (If a particular file were of interest to web site readers, it should be in examples/ instead.) Omitting them from the web site also means we don't have to write index.md prose describing them or supplying links to them.

@jkbonfield
Copy link
Contributor

I've tidied up the SAM PR #497 and I think it's reasonably complete now.

The CRAM one is ongoing, but I created a draft PR #512 so people can comment if they wish.

@d-cameron d-cameron mentioned this pull request Jul 13, 2020
@nh13
Copy link
Member

nh13 commented Aug 24, 2020

@rishidev @jmarshall I am writing a tool to validate VCFs for my own purposes. Is this the main PR for VCF test files?

@tskir
Copy link
Member

tskir commented Apr 1, 2021

As discussed on the File Formats call 2021-04-01 between @jmarshall, @lbergelson, @tcezard and @tskir, this PR will be merged in its current form in four weeks (2021-04-29), unless there are strong objections.

examples/vcf/index.md Outdated Show resolved Hide resolved
examples/vcf/index.md Outdated Show resolved Hide resolved
@jmarshall
Copy link
Member

The test/vcf/4.{1,2,3} directories appear to essentially contain three copies of the same files, basically just with different ##fileformat=VCFv4.X header lines. While this may be useful for the validator test suite that the files came from, is it really beneficial here?

@tskir
Copy link
Member

tskir commented Apr 29, 2021

@jmarshall I see your point, but I don't know if there's a better way to store this. Each VCF version does need a separate test suite, and even if there are no differences for now, this can change over time. Especially once 4.4 is released

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants