Vcf write speed #1663
72cfc4a
to
741bcc0
Compare
Checking it again, I just spotted another slowdown. I obviously hadn't finished my repeated rounds of optimisation and forgot where I was, but this is still an improvement. I have a tiny new commit to add later though.
Latest benchmarks, in cycle counts for the same files
I'll investigate the test failures on Monday. Bizarrely it's some index difference, I assume because the VCF is somehow different? All the VCFs I produced matched perfectly, and the locally run test harness also works fine. Anyway, I can easily check which commit it is by reverting and pushing over this to validate. It would be easier if I could reproduce it locally though, so I'll try other systems to start with. Most perplexing.
60272eb
to
dddffc2
Compare
Note: the cause of the test failure was using |
32e9ed8
to
92a36bb
Compare
I've rebased and squashed it back down again. A few small additional tweaks to the vcf_format code, but the main change was more major improvements to kputd. The final loops were culled as we already know the end point (within 1 at least, due to the slight rounding problem of e.g. 9999.999 to 10000). Benchmarks on sprintf: 252.2
92a36bb
to
a8ab2d8
Compare
Added Also adjusted the layout slightly, so it's more a thing of beauty. :) |
a8ab2d8
to
b78da10
Compare
vcf.c
else e |= kputc(*p, s) < 0;
char *p = (char *)data;

// Can bcf_str_missing only occur at the start of a CHAR array?
Looking at the spec, I think bcf_str_missing (0x07) comes from the type encoding of an empty string (i.e. type 7, length 0). So the bcf_str_missing value will actually occur before the start of the string, and will be handled by the n == 0 case above.
I therefore suspect the original version of this code was wrong, and note that the equivalent in htsjdk doesn't go looking for 0x07 characters in the actual string.
The code layout very much implies this is a value and not a type. E.g. its position in the header file has it adjacent to the missing integers, which are indeed embedded within the value stream.
The use of 7 as a type is already catered for in the definition of BCF_BT_CHAR.
However the BCF spec clearly details missing value codes for lists of integer and floating point, while omitting them from within strings. It does however state: "Suppose you want to encode the missing value ‘.’. This is simply a string of size 0 = 0x07". I am guessing this is where bcf_str_missing originates, which is a subtle difference as it's the whole string and not an element of a list. The BCF section goes on to elaborate on this by stating that vectors of strings cannot exist (they just become comma-separated strings) and the reader has to tokenise and separate them out itself. Hence there is never an element of a string-list, as the concept is alien to BCF.
Hence I agree with your interpretation, but will do something to make this distinction clear in the header, as right now it led me up the garden path somewhat.
vcf.c
// Note bcf_str_missing is already accounted for in n==0 above.
if (n >= 8) {
char *p_end = memchr(p, 0, n);
kputsn(p, p_end ? p_end-p : n, s);
Suggested change:
- kputsn(p, p_end ? p_end-p : n, s);
+ e |= kputsn(p, p_end ? p_end-p : n, s) < 0;
This is a major improvement as we're not continually calling strlen on the keys every time we go from a key ID to a string. We have to do it at the last point (writing) as some tools (e.g. bcftools annotate) modify the header without setting it to dirty and without calling bcf_hdr_sync. Also add a specialised version of bcf_fmt_array for array length of 1, which is a very common scenario, and change most kputc calls to kputc_ as we don't need to keep nul-terminating while constructing the string. Optimised bcf_unpack_info_core1 a bit too.
…nts.
- We handle packed data with INFO/FORMAT only at present.
- The main FORMAT loop checks for GT and also does something special for the first time through the loop. Once both conditions have passed (GT is often the first item anyway) we switch to a simpler loop so subsequent fields are reported quicker. This is 5-6% speedup on gcc 13 and clang 13, but no change on the system gcc7.
- Optimise bcf_fmt_array when given long strings.
Assumption: A "string" is an array of BCF_BT_CHAR. The VCF spec is a bit muddled here as it states strings are arrays of chars, but also has lists of strings as arrays of chars too. It's unclear quite what that means, but in practice it looks like they're all strings regardless as "A,.,B" in a Number=A,Type=String FORMAT field is just stored as "A,.,B" verbatim. I'm not even sure bcf_str_missing can ever occur, as the spec states missing strings have length zero, and apparently lists of strings use "." already. I could find no code that writes this value. Picard doesn't support BCF, so I know of no other trivial way of creating BCFs to test either. A spec mystery.
We were on a reasonable track before, but given this is emulating an sprintf %g format which has 6 significant digits, converting the full "i" integer to ASCII and then only printing the first 6 is redundant. Instead of multiplying d by a fixed amount and adding varying amounts for rounding, we multiply by a varying amount and add a fixed .5 for rounding, in doing so also forcing the integer to now always be 6 digits (bar some very rare rounding issues of i == 1000000). I've tested this with some 23 million numbers from 0.0001 to 999999 and found no differences. The use of __builtin_clz can speed this up. We'd need to check "if (__builtin_clz(1+(int)d) < 31)" and jump past the fractions to the "d < 10" (&& >= 1) case, but in tests it looks to only alter speed by a couple of percent, so it's probably not worth the extra complexity of compiler conditionals.
Also changed the header file to be clearer in the distinction of missing elements of a list vs the type code implying a missing value (single/array/string).
df9b7e1
to
7018ee7
Compare
Squashed and rebased...
Speeds up the VCF writer. The main changes are:
- … bcf_fmt_array on character arrays, treating them only as strings. This is a bit vague in the spec, but I believe this code to be valid.
- … kputsn over kputs).

Benchmarks showing 3 trials of "perf stat" cycle counts, in millions to make it easier to read, for develop (1st) vs this PR (2nd) on a variety of data sets. The third number in parentheses is the number of cycles spent during the read portion of test_view, so for the true write speed-up it could be subtracted from both the Dev and PR values. Host was an old Intel box, using the system gcc (7).
"GIAB" is the GIAB HG002 truth set. "info" and "fmt" are INFO-heavy and FORMAT-heavy (1000 genomes) files from GNOMAD. Bcftools, Freebayes and GATK are calls made by the respective tools on SynDip (CHM1/CHM13), so represent real-world single-sample tool outputs.
My test timings were
./test/test_view /tmp/_$i.bcf -p /tmp/_.vcf
where the BCF file is an uncompressed BCF, so causing the minimum of read/decode time. The read-portion in the above chart came from test_view -B /tmp/_$i.bcf.
Most files are 10-20% faster at writing, with the GIAB data being 40% faster for some reason (probably due to being dominated by very long strings in INFO). All files were md5sumed to validate the output matched.