-
Notifications
You must be signed in to change notification settings - Fork 446
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add new annot-tsv program #1619
Conversation
Mmm, I am confused why the tests complain about |
In subsequent emails I think I decided on better terminology of "input" and "annotation". "Source" still feels nebulous and not very descriptive of the function. The usage has:
From this, it sounds like "target" is our file we are extending, with annotations from "source". The man page says:
This is a bit ambiguous as it sort of implies it can work in either direction. I'm still struggling to work out if it's completely symmetric and we basically have file1 file2 that are merged (akin to unix Think of it in this way. Let's say we have an awk script in a very strict layout:
We can feed any data to it and the script will amend specific matching columns based on the contents of other columns. It's clear here that a null script degenerates to "cat". It's clear the input is the input, and the script is the script. We never really do things like pipe a script into awk and specify a filename as input on the command line. Here it's a little different as we have to have at least one matching column between two files (hence why I say it's like a more powerful "join"). That does change things. I don't understand the ordering though. What's the use case for it working on unordered data? What's the benefit here? It just feels like it'd make it far slower than necessary. Or the half way house: is there a benefit to one order (primary) and one unordered (annotation) file? I'm pleased however to see the addition of "-o" for output since I last looked at this, and renaming the old overlap option. |
Unless there is a danger of confusion, let's preserve the existing naming please. It is to avoid conflicting short names, I already had to make sacrifices by abandoning the term "destination" which you disliked. The naming was intended to imply the information flow and for that reason I liked best the original "source" and "destination". Using that naming would also prevent the next question:
It is not symmetrical and is not intended to imply a change in direction. It aims to explain that the argument to @daviesrob, @whitwham could you please help to improve the wording of the man page and the usage text to make it clearer? Despite my best effort I am not succeeding in making this understandable.
I am confused by the question. There are two types of ordering you might be asking, by column and by row. You seem to be asking about column ordering, which is the only type of ordering mentioned throughout the man page and the code. To state the obvious, the benefit of supporting tab-delimited files with columns in arbitrary order is that one can use the program on files with columns in arbitrary order. For example, on files that were exported from spreadsheets, people tend to use them a lot. Regarding ordering by row, even though that is not mentioned anywhere, the program can work with data unordered by row as well. That's thanks to the regidx library. As a side note, comparing the program to |
I meant row ordering, so classic sorted vs unsorted (or non-position sorted) genomic data. We have lots of algorithms that are efficient because they require sorted data. Eg samtools merge. Here you seem to be stating that both files can be unsorted. This obviously means it is using a less efficient strategy, which is fine if that is the user requirement. It also depends on the size of files you're designing it to operate on. Is this for small data of a few thousand rows, or are you planning for it to lift over annotations from files with 10s of millions of records? I'm just asking where that requirement for operating on unsorted data comes from. What inputs are we dealing with that aren't sorted? (Even unix "join" requires sorted inputs) |
Maybe so, but I asked others in the NPG group where they thought the input came from and where the output went, and universally they said source and destination! Let's just agree to disagree on this one. I'm looking for better terms, not to rehash old arguments. I still think input and output are by far the most universally understood terms. As you've agreed it's not symmetric and one file is the primary input, then why not call that file input, and the other something more appropriate to its function? |
With a trivial src and dst (transfer) file:
I can do a nop function of specifying nothing to transfer over:
I notice a few things. Firstly, the header line has changed format. Is that deliberate or just debugging left over? It's undocumented. Also what you label as the dst file in your tests (old destination) is the default output, and not the source as I had expected! If we use I also tried stress testing. Removing one of the columns from one src row (I deleted the "b" at the end of the 2nd line) to get illegal data caused a core dump:
We need to stress test more corner cases so it's robust. |
What specifies the ordering? My source file has columns "tag3" and "m" in that order, but I always get them in the opposite order regardless of what I specify in the
Also, please consider making the usage appear to stdout unless an error has occurred. This is already a cause of a lot of "known failures" in the Samtools test harness, and we shouldn't be adding more here. Specifically an error is an error (eg |
In this mode it functions as |
That's a deliberate feature. If it is bothersome to anyone, we can add an option to not modify the header. I would leave this for future improvements though. |
I really don't understand the problem. The documentation is clear on that, no? The program transfers columns from the file with source annotations (
To turn this around: both files are technically inputs and typically both have annotations. I can be equally confused here. |
Thank you, added a commit to address issues like this ee299e3. More breaking test cases will be much appreciated |
New columns are only appended, never added among existing ones. I suspect the dst.1.txt file in this example contains the column Reordering of existing columns in the target/destination file is not considered in this version, perhaps it could be addressed in a future enhancement request. |
The exit code for |
annot-tsv.c
Outdated
#define NBP_IS_END(x) (((x)&1)==1) | ||
typedef struct | ||
{ | ||
int n,m; // n is a multiple of two: breakpoints are stored in regs, not regions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These should be size_t
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
int
or uint
should be sufficient for any normal use. But ok, no harm in using size_t
.
annot-tsv.c
Outdated
typedef struct | ||
{ | ||
int n,m; // n is a multiple of two: breakpoints are stored in regs, not regions | ||
hts_pos_t *regs; // change to uint64_t for very large genomes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hts_pos_t
is uint64_t
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original annot-regs
used int
, the comment was not updated when transitioning to annot-tsv
..
annot-tsv.c
Outdated
} | ||
static inline void nbp_add(nbp_t *nbp, hts_pos_t beg, hts_pos_t end) | ||
{ | ||
if ( end >= REGIDX_MAX>>1 ) error("Error: the coordinate is too big (%u). Possible todo: switch to uint64_t\n",end); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
REGIDX_MAX
is (1ULL << 35)
so already only fits in 64 bits.
|
||
typedef struct | ||
{ | ||
uint32_t n,m; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These should really be size_t
too...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would mean a file with more than 4,294,967,295 columns
annot-tsv.c
Outdated
static int read_next_line(dat_t *dat) | ||
{ | ||
if ( dat->line.l ) return dat->line.l; | ||
if ( hts_getline(dat->fp, KS_SEP_LINE, &dat->line) > 0 ) return dat->line.l; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need to check for errors reported by hts_getline()
(i.e. return value < -1) to tell them apart from EOF.
annot-tsv.c
Outdated
char *temp_dir, *out_fname; | ||
BGZF *out_fp; | ||
int allow_dups, reciprocal, ignore_headers, max_annots, mode; | ||
float overlap; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why use float
here instead of double
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The precision of double
is overkill here, but happy to change...
annot-tsv.c
Outdated
if (a > b) return 1; | ||
return 0; | ||
} | ||
static int nbp_length(nbp_t *nbp) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should return hts_pos_t
annot-tsv.txt
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer to only have one copy of the manual page checked in - and preferably the nroff one so HTSlib doesn't gain a build dependency on asciidoc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you suggesting that the man page is formatted in nroff directly? @jkbonfield originally suggested text markdown is fine. Maybe we should discuss offline.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with the reserveration to asciidoc though, asciidoctor is what we've been using in bcftools
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I said it's fine in principle, but if we use it it should be universal and not just one thing. Adding dependencies without consideration for other devs and users isn't ideal so we have to be considerate and not just drift into policy decisions like this.
Bcftools aleady uses this, so it's fine there. Htslib right now is nroff -man
so we should stick to that unless we decide to move enmasse to a new regime.
annot-tsv.1
Outdated
The program was written by Petr Danecek and was originally published on github as annot\-regs | ||
.SH "COPYING" | ||
.sp | ||
The MIT/Expat License or GPL License, see the LICENSE document for details\&. Copyright (c) Genome Research Ltd\&. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MIT/Expat only. GPL is not currently mentioned in the LICENSE document, and I don't see why anyone would want to use it if you can go for MIT/Expat instead.
test/test.pl
Outdated
test_plugin_loading($opts); | ||
test_realn($opts); | ||
test_bcf_set_variant_type($opts); | ||
run_test(\&test_bgzip,$opts, 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change to enable the --function
option doesn't quite work, because some of the tests currently rely on artefacts made by earlier tests. Rather than try to fix that as part of this PR, it would be easier to just have the annot-tsv
tests run the old way. Adding the --function
option can then be done as separate work in a new PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know it is not bullet proof for all tests, but it works if you want to rerun and debug the ones you're currently developing. Saves lot of time and does not have to be used, I'd advocate for keeping it
test/test.pl
Outdated
run_test(\&test_realn,$opts); | ||
run_test(\&test_bcf_set_variant_type,$opts); | ||
|
||
run_test(\&test_annot_tsv,$opts,src=>'src.1.txt',dst=>'dst.1.txt',out=>'out.1.1.txt',args=>'-f smpl:overlap --allow-dups'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To keep in line with the rest of this top-level code, it should just call test_annot_tsv($opts);
and then have that run the individual tests.
test/test.pl
Outdated
my $pid = fork(); | ||
if ( !$pid ) | ||
{ | ||
exec('bash', '-o','pipefail','-c', "($cmd) 2>$tmp.e >$tmp.o"); | ||
} | ||
waitpid($pid,0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could just be replaced by
system('bash', '-o','pipefail','-c', "($cmd) 2>$tmp.e >$tmp.o");
Compared with _cmd()
, the $ENV{TEST_PRECMD}
functionality has been lost (although we don't use it that much these days, anyway).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
test/test.pl
Outdated
my (@out,@err); | ||
if ( open(my $fh,'<',"$tmp.o") ) | ||
{ | ||
@out = <$fh>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure why we use an array for this (it's in other code too). It would be more efficient to just read everything into a string:
local $/; # Read whole file
$out = <$fh>;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just out of habit, usually I'd do something with the output per-line. I don't think this is a performance issue in these tests, but okay.
test/test.pl
Outdated
@@ -216,7 +304,16 @@ sub test_cmd | |||
} | |||
else | |||
{ | |||
failed($opts,$test,"The outputs differ:\n\t\t$$opts{path}/$args{out}\n\t\t$$opts{path}/$args{out}.new"); | |||
if ( exists($args{exp}) && !-e "$$opts{path}/$args{out}" ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The second part of this will always be false due to the condition for the surrounding if
statement.
test/test.pl
Outdated
my ($ret,$out) = _cmd("$args{cmd}"); | ||
if ( $ret ) { failed($opts,$test); return; } | ||
my ($ret,$out,$err) = _cmd3("$args{cmd}"); | ||
if ( length($err) ) { $err =~ s/\n/\n\t\t/gs; $err = "\n\n\t\t$err\n"; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can be done more easily with $err =~ s/^/\t/mg;
test/test.pl
Outdated
print $fh $exp; | ||
close($fh); | ||
} | ||
my @diff = `diff $$opts{path}/$args{out} $$opts{path}/$args{out}.new`; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Easier to capture into a string, then use s/^/\t/mg
.
test/test.pl
Outdated
waitpid($pid,0); | ||
|
||
my $status = $? >> 8; | ||
my $signal = $? & 127; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This value is never used.
- checks for memory allocation failures - sanitize numeric types and clean up old comments
Instead of trying to get the function name from a code ref, pass it in directly as a string. It can be looked up in the symbol table to get the corresponding code ref.
Some tests rely on others having been run first. Add a "needed_by" arg to run_test() (essentially an alias for "cmd", but it better documents the relationship) so these dependencies can be resolved. Add "needed_by" args so all tests work when run on their own.
Use local $/; to read entire files into a string. Simplify substitutions to put tabs at the start of lines. Remove unused chunk of code from test_cmd(), and capture diff output in a string instead of an array.
Add HTS_FORMAT, HTS_NORETURN annotations to error() function Make error() flush stdout, stderr so errors are always written after any output. Fix reported error() format string mismatches Fix checks on number of columns
And select the nroff version as the single source of truth.
Rebased, with some squashing of a couple of trivial bug-fix commits, and some adjustments from me... |
See also
https://raw.githubusercontent.com/pd3/utils/master/annot-regs/doc.annot-regs.pdf
https://github.com/pd3/utils/tree/master/annot-regs