Vcf write speed #1663

jkbonfield · 2023-08-18T11:13:57Z

Speeds up the VCF writer. The main changes are:

VCF writer can handle packed BCF objects, meaning we don't have to unpack before writing as VCF.
Improve the GT formatting loop so once we've found it (often early on) we don't continue the penalty of looking again. This assumes it only ever turns up once, but the VCF specification forbids duplicate keys.
Optimise bcf_fmt_array on character arrays, treating them only as strings. This is a bit vague in the spec, but I believe this code to be valid.
Cache header key lengths to avoid repeated calls to strlen everywhere (kputsn over kputs).

Benchmarks showing 3 trials of "perf stat" cycle counts, in millions to make it easier to read, for develop (1st) vs this PR (2nd) on a variety of data sets. The third number in parentheses is the number of cycles spent during the read portion of test_view, so for the true write speed up that could be subtracted from both the Dev vs PR values. Host was an old Intel box, using the system gcc (7).

=== bcftools ===                                                                
  24,756   22,725    (3,992)     M.cycles  #    2.983 GHz                       
  25,892   22,445    (3,961)     M.cycles  #    2.983 GHz                       
  25,039   22,741    (4,037)     M.cycles  #    2.983 GHz                       
=== freebayes ===                                                               
  75,838   59,067   (11,068)     M.cycles  #    2.982 GHz                       
  75,342   59,437   (10,359)     M.cycles  #    2.983 GHz                       
  78,406   60,511   (10,309)     M.cycles  #    2.983 GHz                       
=== gatk ===                                                                    
   2,823    2,703      (387)     M.cycles  #    2.983 GHz                       
   2,680    2,248      (387)     M.cycles  #    2.983 GHz                       
   2,662    2,245      (386)     M.cycles  #    2.981 GHz                       
=== info ===                                                                    
  12,555    7,693    (1,724)     M.cycles  #    2.983 GHz                       
  12,562    7,947    (1,724)     M.cycles  #    2.983 GHz                       
  12,365    7,970    (1,734)     M.cycles  #    2.983 GHz                       
=== fmt ===                                                                     
  25,125   22,917    (3,395)     M.cycles  #    2.977 GHz                       
  25,074   22,604    (3,421)     M.cycles  #    2.981 GHz                       
  25,228   22,450    (3,408)     M.cycles  #    2.967 GHz                       
=== GIAB ===                                                                    
  43,056   26,989    (7,488)     M.cycles  #    2.972 GHz                       
  43,959   27,237    (7,409)     M.cycles  #    2.975 GHz                       
  47,072   28,113    (7,475)     M.cycles  #    2.979 GHz

"GIAB" is the GIAB HG002 truth set. "info" and "fmt" are INFO heavy and FORMAT 1000 genomes heavy files from GNOMAD. Bcftools, Freebayes and GATK are calls made by the repsective tools on SynDip (CHM1/CHM13), so represent real-world single-sample tool outputs.

My test timings were ./test/test_view /tmp/_$i.bcf -p /tmp/_.vcf where the BCF file is an uncompressed BCF, so causing the minimum of read/decode time. The read-portion in the above chart came from test_view -B /tmp/_$i.bcf.

Most files are 10-20% faster at writing, with the GIAB data being 40% faster for some reason (probably due to being dominated by very long strings in INFO). All files were md5sumed to validate the output matched.

jkbonfield · 2023-08-18T11:40:53Z

Checking it again I just spotted another slow down. I obviously hadn't finished my repeated rounds of optimisation and forgot where I was at, but this is still an improvement. I have a tiny new commit to add later though.

jkbonfield · 2023-08-18T16:15:54Z

Latest benchmarks, in cycle counts for the same files

=== bcftools ===                                                                
       19743638639      cycles                    #    2.983 GHz                
       19489797455      cycles                    #    2.973 GHz                
       19395872945      cycles                    #    2.971 GHz                
300f6c790bc5bc725005a2a36f8d67fd  -                                             
=== freebayes ===                                                               
       49194527204      cycles                    #    2.983 GHz                
       50985255922      cycles                    #    2.982 GHz                
       51973400318      cycles                    #    2.984 GHz                
3ab21fcda2287e0cb012c99337010be3  -                                             
=== gatk ===                                                                    
        2401848476      cycles                    #    2.931 GHz                
        1927701416      cycles                    #    2.915 GHz                
        1924751331      cycles                    #    2.937 GHz                
968b6446197e4d132ae382acc24cee71  -                                             
=== info ===                                                                    
        7494270098      cycles                    #    2.984 GHz                
        7702558176      cycles                    #    2.983 GHz                
        7703784698      cycles                    #    2.969 GHz                
690a78c76697c04eeb9faf564a8c78d6  -                                             
=== fmt ===                                                                     
       22428582248      cycles                    #    2.982 GHz                
       22596287183      cycles                    #    2.976 GHz                
       22498706702      cycles                    #    2.984 GHz                
898432fcce471fbfa89065b6befcd459  -                                             
=== GIAB ===                                                                    
       24562360470      cycles                    #    2.977 GHz                
       25045186584      cycles                    #    2.983 GHz                
       25870306585      cycles                    #    2.980 GHz                
de4cb618fbe55bad8268a3e510a28384  -

jkbonfield · 2023-08-18T16:26:25Z

I'll invesitgate the test failures on Monday. Bizarrely it's some index difference - I assume because the VCF is somehow different? All my VCFs I produced matched perfectly, and the locally run test harness also works fine.

Anyway, I can check which commit it is easily by reverting and pushing over this to validate. Would be easier if I could reproduce it locally though so I'll try other systems to start with. Most perplexing.

jkbonfield · 2023-08-21T11:32:47Z

Note: the cause of the test failure was using builtin_clz to check the binary size of the double, after conversion to integer. This function is undefined when the value is zero, meaning d < 1. A simple +1 would have sufficed, but benchmarking it again I only see a couple percent difference so decided it's sufficiently minimal gains to not be worth while, so culled that part of code.

jkbonfield · 2023-08-22T09:30:00Z

I've rebased and squashed it back down again. A few small additional tweaks to the vcf_format code, but the main change was more major improvements to kputd. The final loops were culled as we already know the end point (within 1 at least, due to the slight rounding problem of eg 9999.999 to 10000).

Benchmarks on sprintf(buf, "%g", d), kputd in devel and kputd in this PR for values generated by for (d = 0.0001; d <= 999999; d *= 1.0000001) are, in billions of CPU cycles:

sprintf: 252.2
kputd (devel): 37.2
kputd (this): 16.0

jkbonfield · 2023-08-25T16:45:11Z

Added rint in kputd as suggested by Rob. A note to others and self, values like 1.015625 were rounded up instead of down. This was something the original got wrong too.

Also adjusted the layout slightly, so it's more a thing of beauty. :)

daviesrob · 2023-08-31T09:13:58Z

vcf.c

-            else e |= kputc(*p, s) < 0;
+        char *p = (char *)data;
+
+        // Can bcf_str_missing only occur at the start of a CHAR array?


Looking at the spec, I think bcf_str_missing (0x07) comes from the type encoding of an empty string (i.e. type 7, length 0). So the bcf_str_missing value will actually occur before the start of the string, and will be handled by the n == 0 case above.

I therefore suspect the original version of this code was wrong, and note that the equivalent in htsjdk doesn't go looking for 0x07 characters in the actual string.

The code layout very much implies this is a value and not a type. Eg the position of it in the header file has it adjacent to the missing integers, which are indeed embedded within the value stream.

The use of 7 as a type is already catered for in the definition of BCF_BT_CHAR.

However the BCF spec clearly details missing value codes for lists of integer and floating point, while omitting them from within strings. It does however state: "Suppose you want to encode the missing value ‘.’. This is simply a string of size 0 = 0x07". I am guessing this is where bcf_str_missing originates, which is a subtle difference as it's the whole string and not an element of a list. The BCF section goes on to elaborate on this by stating that vectors of strings cannot exist - they just become comma-separated strings - and the reader has to tokenise and separate them out itself. Hence there is never an element of a string-list as the concept is alien to BCF.

Hence I agree with your interpretation, but will do something to make this distinction clear in the header as right now it lead me up the garden path somewhat.

vcf.c

daviesrob · 2023-09-01T11:08:37Z

vcf.c

+        // Note bcf_str_missing is already accounted for in n==0 above.
+        if (n >= 8) {
+            char *p_end = memchr(p, 0, n);
+            kputsn(p, p_end ? p_end-p : n, s);


Suggested change

kputsn(p, p_end ? p_end-p : n, s);

e |= kputsn(p, p_end ? p_end-p : n, s) < 0;

This is a major improvement as we're not continually calling strlen on the keys every time we go from a key ID to a string. We have to do it at the last point (writing) as some tools (e.g. bcftools annotate) modify the header without setting it to dirty and without calling bcf_hdr_sync. Also add a specialised version of bcf_fmt_array for array length of 1, which is a very common scenario, and change most kputc calls to kputc_ as we don't need to keep nul-terminating while constructing the string. Optimised bcf_unpack_info_core1 a bit too.

…nts. - We handle packed data with INFO/FORMAT only at present. - The main FORMAT loop checks for GT and also does something special for the first time through the loop. Once both conditions have passed (GT is often the first item anyway) we switch to a simpler loop so subsequent fields are reported quicker. This is 5-6% speedup on gcc 13 and clang 13, but no change on the system gcc7. - Optimise bcf_fmt_array when given long strings. Assumption: A "string" is an array of BCF_BT_CHAR. The VCF spec is a bit muddled here as it states strings are arrays of chars, but also has lists of strings as arrays of chars too. It's unclear quite what that means, but in practice it looks like they're all strings regardless as "A,.,B" in a Number=A,Type=String FORMAT field is just stored as "A,.,B" verbatim. I'm not even sure bcf_str_missing can ever occur, as the spec states missing strings have length zero, and apparently lists of strings use "." already. I could find no code that writes this value. Picard doesn't support BCF, so I know of no other trivial way of creating BCFs to test either. A spec mystery.

We were on a reasonable track before, but given this is emulating an sprintf %g format which has 6 significant digits, converting the full "i" integer to ASCII and then only printing the first 6 is redundant. Instead of multiplying d by a fixed amount and adding varying amounts for rounding, we multiply by a varying amount and add a fixed .5 for rounding, in doing so also forcing the integer to now always be 6 digits (bar some very rare rounding issues of i == 1000000). I've tested this with some 23 million numbers from 0.0001 to 999999 and found no differences. The use of __builtin_clz can speed this up. We'd need to check "if (__builtin_clz(1+(int)d) < 31)" and jump past the fractions to the "d < 10" (&& >= 1) case, but in tests it looks to only alter speed by a couple of percent, so it's probably not worth the extra complexity of compiler conditionals.

Also changed the header file to be clearer in the distinction of missing elements of a list vs the type code implying a missing value (single/array/string).

daviesrob · 2023-09-20T18:19:40Z

Squashed and rebased...

jkbonfield force-pushed the vcf_write_speed branch from 72cfc4a to 741bcc0 Compare August 18, 2023 11:15

jkbonfield force-pushed the vcf_write_speed branch from 60272eb to dddffc2 Compare August 21, 2023 11:18

jkbonfield force-pushed the vcf_write_speed branch from 32e9ed8 to 92a36bb Compare August 22, 2023 09:09

daviesrob self-assigned this Aug 22, 2023

jkbonfield force-pushed the vcf_write_speed branch from 92a36bb to a8ab2d8 Compare August 25, 2023 13:31

jkbonfield force-pushed the vcf_write_speed branch from a8ab2d8 to b78da10 Compare August 25, 2023 16:45

daviesrob reviewed Aug 31, 2023

View reviewed changes

daviesrob reviewed Sep 1, 2023

View reviewed changes

jkbonfield added 4 commits September 20, 2023 18:18

Minor tweaks to bcf_fmt_array and bcf_str_missing usage.

7018ee7

Also changed the header file to be clearer in the distinction of missing elements of a list vs the type code implying a missing value (single/array/string).

daviesrob force-pushed the vcf_write_speed branch from df9b7e1 to 7018ee7 Compare September 20, 2023 18:19

daviesrob merged commit 7018ee7 into samtools:develop Sep 20, 2023
8 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Vcf write speed #1663

Vcf write speed #1663

jkbonfield commented Aug 18, 2023

jkbonfield commented Aug 18, 2023

jkbonfield commented Aug 18, 2023

jkbonfield commented Aug 18, 2023

jkbonfield commented Aug 21, 2023

jkbonfield commented Aug 22, 2023 •

edited

Loading

jkbonfield commented Aug 25, 2023 •

edited

Loading

daviesrob Aug 31, 2023

jkbonfield Aug 31, 2023 •

edited

Loading

daviesrob Sep 1, 2023

daviesrob commented Sep 20, 2023

	kputsn(p, p_end ? p_end-p : n, s);
	e \|= kputsn(p, p_end ? p_end-p : n, s) < 0;

Vcf write speed #1663

Vcf write speed #1663

Conversation

jkbonfield commented Aug 18, 2023

jkbonfield commented Aug 18, 2023

jkbonfield commented Aug 18, 2023

jkbonfield commented Aug 18, 2023

jkbonfield commented Aug 21, 2023

jkbonfield commented Aug 22, 2023 • edited Loading

jkbonfield commented Aug 25, 2023 • edited Loading

daviesrob Aug 31, 2023

Choose a reason for hiding this comment

jkbonfield Aug 31, 2023 • edited Loading

Choose a reason for hiding this comment

daviesrob Sep 1, 2023

Choose a reason for hiding this comment

daviesrob commented Sep 20, 2023

jkbonfield commented Aug 22, 2023 •

edited

Loading

jkbonfield commented Aug 25, 2023 •

edited

Loading

jkbonfield Aug 31, 2023 •

edited

Loading