Created directory for test data with empty VCF file #491
Relates to #443 |
Note that (IIRC) refget, SAM, and VCF are all currently considering adding example / test files alongside their specifications in this repository. So the layout is a discussion amongst all of us. The first question is whether we want these data files to be copied to the http://samtools.github.io/hts-specs/ website, perhaps with prose lists explaining what the various files are. This could vary on a case by case basis: examples of common constructs could be made easily accessible on the web site but obscure corner case test files might not be. For most spec readers, these are sample or example files; it's only us spec (and maybe tool) authors who want to emphasise the test data aspect. So I'd prefer a top-level directory of (Also note that an empty file is not a valid VCF file.) |
The commit reflects the decision made by the Future of VCF group on 2020-05-18, which was to have a directory per file(s) with a readme. Invalid files are acceptable here as they can be used for testing. @jmarshall Thanks for the comments; I will socialise this and request comments from outside the VCF group. |
Just to kickstart discussion, I added a commit to rishi's branch as agreed in the last "future of VCF" call. The files are basically taken from https://github.com/EBIvariation/vcf-validator/tree/master/test/input_files/v4.3 . Ideas to discuss:
|
The bulk of the files added in this PR since #491 (comment) are clearly test files to exercise tools and corner cases rather than example files that would be of interest to newcomers learning about the format. It may be worth having separate (A parallel PR for CRAM uses an uppercase directory name for |
I dumped a huge block of SAM files as starting points, but like this they're mainly htslib / io_lib tests. Hence very much WIP and a place to crib ideas from rather than the finished product. I've been going through more rigorously creating specific individual tests for each of the SAM fields, so will replace them with that at some point. Those are tailored very much to testing normal content (useful to get started) and corner cases such as read length 254 (should work) and 255 (should fail) plus "*" and similar. I'll try and get what I've done added to my branch so people can see the route it's going. |
I updated this PR according to the comments from @jmarshall:
Also, questions:
Some links: general issue #443, VCF PR #491 (this one), SAM PR #497. It would be great to have feedback about those questions and anything else people think. @lbergelson, @pd3, @yfarjoun, @jkbonfield, @daviesrob, @tskir, @d-cameron. (Sorry if I'm missing anyone interested.) @andrewyatz We didn't have time in the last refget call to talk about this. Do you think this folder structure for test files makes sense for refget? Being an API, it would be useful mostly to show "examples/", but maybe adding some expected responses under "_test/" is also useful for testing both implementations and clients. This applies to htsget too, I guess, @mlin, but I know there's something more ambitious regarding testing going on in htsget. |
I should rename some of the SAM ones. It's rather pointless having pass and fail in the filenames when they also have subdirectories. That stems from when I had them in the same directory. hdr.RG1.sam, hdr.RG2.sam etc. aren't too descriptive either. Maybe they should be hdr.RG-ID.sam, hdr.RG-BC.sam, and so on. More work to do here, along with renaming the directory to lowercase.
I also have a category of "warn" though. They're passes as they are legal syntax, but aren't internally consistent. They're often the sort of thing we may see in real data too, so a decoder should be able to cope. For example, alignment off the end of the chromosome (bwa can sometimes do this), unmapped data with CIGAR and MAPQ values filled out (bwa again), or incorrect TLEN fields (syntactically correct, just incorrect maths). Other cases of warnings are things that are technically valid, but don't occur in real world examples and are likely problematic. As they are spec compliant in every way we need to have them, but it's perhaps justified to complain. E.g. a zero length sequence with CIGAR 0M, or non-IUPAC bases in sequence (legal, but never going to work for BAM).
For CRAM, not yet in a PR, I've been taking a different approach with naming. All my files start with numerics, so they naturally sort. None of the early files use any CRAM features tested by the later files. E.g. we start with header, then empty containers, then unmapped data, then mapped data without sequence differences, etc., building up the complexity layer by layer. The aim here is to provide a sort of narrative for people wishing to write a decoder implementation, so the validation data is naturally also a to-do list.
As for versioning, one option is to change the top-level directory, e.g. cram3.0 and cram4.0. We're never going to have a silly number of versions so putting them side by side isn't a big issue.
However, you could possibly have vcf/passed and vcf/failed for generic cross-version tests, and vcf/4.3/passed, vcf/4.4/passed, etc. for version-specific additional tests. So there may be some merit to a hierarchy. |
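As a rough illustration of the hierarchy idea above, here is a minimal Python sketch of how a test runner might gather the generic cross-version files plus the version-specific ones. Only the vcf/passed and vcf/4.3/passed style of directory names comes from the comment; the function name `collect_tests`, its parameters, and the `root` directory are hypothetical and not part of this repository.

```python
from pathlib import Path


def collect_tests(root, fmt, version, outcome):
    """Return generic test files followed by version-specific ones.

    Hypothetical helper: looks in <root>/<fmt>/<outcome> and
    <root>/<fmt>/<version>/<outcome>, skipping directories that do not exist.
    """
    base = Path(root) / fmt
    files = []
    for directory in (base / outcome, base / version / outcome):
        if directory.is_dir():
            # Sort within each directory so the order is reproducible.
            files.extend(sorted(p for p in directory.iterdir() if p.is_file()))
    return files
```

For example, `collect_tests("test", "vcf", "4.3", "passed")` would list everything under test/vcf/passed first, then test/vcf/4.3/passed, so a version without its own directory simply falls back to the generic set.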
@jmarshall what's the reasoning for renaming I can understand using underscores for hts-specs internal stuff, like the Jekyll layouts and include directories, but test is a "first class citizen" that we wish people to be looking at. We don't want to give the appearance that it's some internal thing they're not meant to be exploring. To be honest, I also don't really see why it's harmful if it gets copied to the web site either. Maybe it'll slow down web site deployment? |
The reasoning was to follow the existing explicit convention ( They shouldn't be on the web site because as corner case tests they are not of interest to readers of the web site. (If a particular file were of interest to web site readers, it should be in |
topics:
- README file per folder? What kind of template do we use?
- Separation into different folders by VCF version?
- File naming conventions?
@rishidev @jmarshall I am writing a tool to validate VCFs for my own purposes. Is this the main PR for VCF test files? |
As discussed on the File Formats call 2021-04-01 between @jmarshall, @lbergelson, @tcezard and @tskir, this PR will be merged in its current form in four weeks (2021-04-29), unless there are strong objections. |
The test/vcf/4.{1,2,3} directories appear to essentially contain three copies of the same files, basically just with different |
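One way to check how much the version directories really overlap is to compare them file by file. The following is a minimal Python sketch, assuming each version lives in its own subdirectory of a common root; `compare_versions` and its arguments are hypothetical names, not something provided by this repository. It only reports byte-identical files, so files that differ in any way at all show up as different.

```python
import hashlib
from pathlib import Path


def compare_versions(root, versions):
    """Map each relative path found in every version directory to True
    if its contents are byte-identical across all versions, else False."""
    root = Path(root)
    digests = {}
    for version in versions:
        base = root / version
        digests[version] = {
            p.relative_to(base).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
            for p in base.rglob("*")
            if p.is_file()
        }
    # Only consider files that exist under every version directory.
    common = set.intersection(*(set(d) for d in digests.values()))
    return {
        rel: len({digests[v][rel] for v in versions}) == 1
        for rel in sorted(common)
    }
```

Calling `compare_versions("test/vcf", ["4.1", "4.2", "4.3"])` would then show which files are exact copies and which have diverged between versions.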
@jmarshall I see your point, but I don't know if there's a better way to store this. Each VCF version does need a separate test suite, and even if there are no differences for now, this can change over time, especially once 4.4 is released. |
The proposed structure for the test files is test-file/file-type/the-test-files. For this initial version, the path has been created with an empty file as an initial VCF example.