Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vcf write speed #1663

Merged
merged 4 commits into from
Sep 20, 2023
Merged

Vcf write speed #1663

merged 4 commits into from
Sep 20, 2023

Conversation

jkbonfield
Copy link
Contributor

Speeds up the VCF writer. The main changes are:

  • VCF writer can handle packed BCF objects, meaning we don't have to unpack before writing as VCF.
  • Improve the GT formatting loop so once we've found it (often early on) we don't continue the penalty of looking again. This assumes it only ever turns up once, but the VCF specification forbids duplicate keys.
  • Optimise bcf_fmt_array on character arrays, treating them only as strings. This is a bit vague in the spec, but I believe this code to be valid.
  • Cache header key lengths to avoid repeated calls to strlen everywhere (kputsn over kputs).

Benchmarks showing 3 trials of "perf stat" cycle counts, in millions to make it easier to read, for develop (1st) vs this PR (2nd) on a variety of data sets. The third number in parentheses is the number of cycles spent during the read portion of test_view, so for the true write speed up that could be subtracted from both the Dev vs PR values. Host was an old Intel box, using the system gcc (7).

=== bcftools ===                                                                
  24,756   22,725    (3,992)     M.cycles  #    2.983 GHz                       
  25,892   22,445    (3,961)     M.cycles  #    2.983 GHz                       
  25,039   22,741    (4,037)     M.cycles  #    2.983 GHz                       
=== freebayes ===                                                               
  75,838   59,067   (11,068)     M.cycles  #    2.982 GHz                       
  75,342   59,437   (10,359)     M.cycles  #    2.983 GHz                       
  78,406   60,511   (10,309)     M.cycles  #    2.983 GHz                       
=== gatk ===                                                                    
   2,823    2,703      (387)     M.cycles  #    2.983 GHz                       
   2,680    2,248      (387)     M.cycles  #    2.983 GHz                       
   2,662    2,245      (386)     M.cycles  #    2.981 GHz                       
=== info ===                                                                    
  12,555    7,693    (1,724)     M.cycles  #    2.983 GHz                       
  12,562    7,947    (1,724)     M.cycles  #    2.983 GHz                       
  12,365    7,970    (1,734)     M.cycles  #    2.983 GHz                       
=== fmt ===                                                                     
  25,125   22,917    (3,395)     M.cycles  #    2.977 GHz                       
  25,074   22,604    (3,421)     M.cycles  #    2.981 GHz                       
  25,228   22,450    (3,408)     M.cycles  #    2.967 GHz                       
=== GIAB ===                                                                    
  43,056   26,989    (7,488)     M.cycles  #    2.972 GHz                       
  43,959   27,237    (7,409)     M.cycles  #    2.975 GHz                       
  47,072   28,113    (7,475)     M.cycles  #    2.979 GHz                       

"GIAB" is the GIAB HG002 truth set. "info" and "fmt" are INFO heavy and FORMAT 1000 genomes heavy files from GNOMAD. Bcftools, Freebayes and GATK are calls made by the repsective tools on SynDip (CHM1/CHM13), so represent real-world single-sample tool outputs.

My test timings were ./test/test_view /tmp/_$i.bcf -p /tmp/_.vcf where the BCF file is an uncompressed BCF, so causing the minimum of read/decode time. The read-portion in the above chart came from test_view -B /tmp/_$i.bcf.

Most files are 10-20% faster at writing, with the GIAB data being 40% faster for some reason (probably due to being dominated by very long strings in INFO). All files were md5sumed to validate the output matched.

@jkbonfield
Copy link
Contributor Author

Checking it again I just spotted another slow down. I obviously hadn't finished my repeated rounds of optimisation and forgot where I was at, but this is still an improvement. I have a tiny new commit to add later though.

@jkbonfield
Copy link
Contributor Author

Latest benchmarks, in cycle counts for the same files

=== bcftools ===                                                                
       19743638639      cycles                    #    2.983 GHz                
       19489797455      cycles                    #    2.973 GHz                
       19395872945      cycles                    #    2.971 GHz                
300f6c790bc5bc725005a2a36f8d67fd  -                                             
=== freebayes ===                                                               
       49194527204      cycles                    #    2.983 GHz                
       50985255922      cycles                    #    2.982 GHz                
       51973400318      cycles                    #    2.984 GHz                
3ab21fcda2287e0cb012c99337010be3  -                                             
=== gatk ===                                                                    
        2401848476      cycles                    #    2.931 GHz                
        1927701416      cycles                    #    2.915 GHz                
        1924751331      cycles                    #    2.937 GHz                
968b6446197e4d132ae382acc24cee71  -                                             
=== info ===                                                                    
        7494270098      cycles                    #    2.984 GHz                
        7702558176      cycles                    #    2.983 GHz                
        7703784698      cycles                    #    2.969 GHz                
690a78c76697c04eeb9faf564a8c78d6  -                                             
=== fmt ===                                                                     
       22428582248      cycles                    #    2.982 GHz                
       22596287183      cycles                    #    2.976 GHz                
       22498706702      cycles                    #    2.984 GHz                
898432fcce471fbfa89065b6befcd459  -                                             
=== GIAB ===                                                                    
       24562360470      cycles                    #    2.977 GHz                
       25045186584      cycles                    #    2.983 GHz                
       25870306585      cycles                    #    2.980 GHz                
de4cb618fbe55bad8268a3e510a28384  -                                             

@jkbonfield
Copy link
Contributor Author

I'll invesitgate the test failures on Monday. Bizarrely it's some index difference - I assume because the VCF is somehow different? All my VCFs I produced matched perfectly, and the locally run test harness also works fine.

Anyway, I can check which commit it is easily by reverting and pushing over this to validate. Would be easier if I could reproduce it locally though so I'll try other systems to start with. Most perplexing.

@jkbonfield
Copy link
Contributor Author

Note: the cause of the test failure was using builtin_clz to check the binary size of the double, after conversion to integer. This function is undefined when the value is zero, meaning d < 1. A simple +1 would have sufficed, but benchmarking it again I only see a couple percent difference so decided it's sufficiently minimal gains to not be worth while, so culled that part of code.

@jkbonfield
Copy link
Contributor Author

jkbonfield commented Aug 22, 2023

I've rebased and squashed it back down again. A few small additional tweaks to the vcf_format code, but the main change was more major improvements to kputd. The final loops were culled as we already know the end point (within 1 at least, due to the slight rounding problem of eg 9999.999 to 10000).

Benchmarks on sprintf(buf, "%g", d), kputd in devel and kputd in this PR for values generated by for (d = 0.0001; d <= 999999; d *= 1.0000001) are, in billions of CPU cycles:

sprintf: 252.2
kputd (devel): 37.2
kputd (this): 16.0

@jkbonfield
Copy link
Contributor Author

jkbonfield commented Aug 25, 2023

Added rint in kputd as suggested by Rob. A note to others and self, values like 1.015625 were rounded up instead of down. This was something the original got wrong too.

Also adjusted the layout slightly, so it's more a thing of beauty. :)

vcf.c Outdated
else e |= kputc(*p, s) < 0;
char *p = (char *)data;

// Can bcf_str_missing only occur at the start of a CHAR array?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the spec, I think bcf_str_missing (0x07) comes from the type encoding of an empty string (i.e. type 7, length 0). So the bcf_str_missing value will actually occur before the start of the string, and will be handled by the n == 0 case above.

I therefore suspect the original version of this code was wrong, and note that the equivalent in htsjdk doesn't go looking for 0x07 characters in the actual string.

Copy link
Contributor Author

@jkbonfield jkbonfield Aug 31, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code layout very much implies this is a value and not a type. Eg the position of it in the header file has it adjacent to the missing integers, which are indeed embedded within the value stream.

The use of 7 as a type is already catered for in the definition of BCF_BT_CHAR.

However the BCF spec clearly details missing value codes for lists of integer and floating point, while omitting them from within strings. It does however state: "Suppose you want to encode the missing value ‘.’. This is simply a string of size 0 = 0x07". I am guessing this is where bcf_str_missing originates, which is a subtle difference as it's the whole string and not an element of a list. The BCF section goes on to elaborate on this by stating that vectors of strings cannot exist - they just become comma-separated strings - and the reader has to tokenise and separate them out itself. Hence there is never an element of a string-list as the concept is alien to BCF.

Hence I agree with your interpretation, but will do something to make this distinction clear in the header as right now it lead me up the garden path somewhat.

vcf.c Show resolved Hide resolved
vcf.c Outdated Show resolved Hide resolved
vcf.c Outdated
// Note bcf_str_missing is already accounted for in n==0 above.
if (n >= 8) {
char *p_end = memchr(p, 0, n);
kputsn(p, p_end ? p_end-p : n, s);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
kputsn(p, p_end ? p_end-p : n, s);
e |= kputsn(p, p_end ? p_end-p : n, s) < 0;

This is a major improvement as we're not continually calling strlen on
the keys every time we go from a key ID to a string.

We have to do it at the last point (writing) as some tools
(e.g. bcftools annotate) modify the header without setting it to dirty
and without calling bcf_hdr_sync.

Also add a specialised version of bcf_fmt_array for array length of 1,
which is a very common scenario, and change most kputc calls to kputc_
as we don't need to keep nul-terminating while constructing the
string.

Optimised bcf_unpack_info_core1 a bit too.
…nts.

- We handle packed data with INFO/FORMAT only at present.

- The main FORMAT loop checks for GT and also does something special for
  the first time through the loop.  Once both conditions have passed (GT
  is often the first item anyway) we switch to a simpler loop so
  subsequent fields are reported quicker.

  This is 5-6% speedup on gcc 13 and clang 13, but no change on the
  system gcc7.

- Optimise bcf_fmt_array when given long strings.

  Assumption: A "string" is an array of BCF_BT_CHAR.  The VCF spec is a bit
  muddled here as it states strings are arrays of chars, but also has
  lists of strings as arrays of chars too.  It's unclear quite what that
  means, but in practice it looks like they're all strings regardless as
  "A,.,B" in a Number=A,Type=String FORMAT field is just stored as
  "A,.,B" verbatim.

  I'm not even sure bcf_str_missing can ever occur, as the spec states
  missing strings have length zero, and apparently lists of strings use
  "." already.  I could find no code that writes this value.  Picard
  doesn't support BCF, so I know of no other trivial way of creating
  BCFs to test either.  A spec mystery.
We were on a reasonable track before, but given this is emulating an
sprintf %g format which has 6 significant digits, converting the full
"i" integer to ASCII and then only printing the first 6 is redundant.

Instead of multiplying d by a fixed amount and adding varying amounts
for rounding, we multiply by a varying amount and add a fixed .5 for
rounding, in doing so also forcing the integer to now always be 6
digits (bar some very rare rounding issues of i == 1000000).

I've tested this with some 23 million numbers from 0.0001 to 999999
and found no differences.

The use of __builtin_clz can speed this up.  We'd need to check
"if (__builtin_clz(1+(int)d) < 31)" and jump past the fractions to the
"d < 10" (&& >= 1) case, but in tests it looks to only alter speed by
a couple of percent, so it's probably not worth the extra complexity
of compiler conditionals.
Also changed the header file to be clearer in the distinction of
missing elements of a list vs the type code implying a missing
value (single/array/string).
@daviesrob
Copy link
Member

Squashed and rebased...

@daviesrob daviesrob merged commit 7018ee7 into samtools:develop Sep 20, 2023
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants