Deduplicate strings in binlogs #6017
Conversation
The latest viewer version (2.1.300) has already been updated to read the binlog format version 10.
Hey @mmitche your binlog goes down in size from 590 MB to 157 MB. @aolszowka I'm probably too late with this, but your binlog goes down from 23 GB to 172 MB! You get the first prize!
Amazing PR. Just some small things I saw while reading through.
This is a HUGE win, I'll make sure to pass it along and keep it in my toolbox :) Thank you!
I'm so excited about this! 🙌🏻
Should we add some test cases around deliberately malformed logs (misordered, unused indexes, that sort of thing)?
A few nits, but looking great. My only big question is tactical: should we guard this behind a change wave? If we fork now and put this in at the very beginning of 16.10 we can probably just make it unconditional, since making the write side conditional would be a bit inconvenient.
I'm personally not in a rush for 16.9, so I'd rather just wait until we fork for 16.9 and then merge into master so we have lots of runway to dogfood for 16.10. That way, if we have to tweak the file format, we'll have more time.
Love it!
I actually discovered that the CLSCompliant warnings have disappeared. Maybe in early iterations the types I added were public and that's why I added the NoWarns? But just removing them now seems to be fine.
Force-pushed from 198f85a to aa6fbab
Very quick skim, submitting small nits. The results look great 😀
Awesome wins, and shame on GZipStream for leaving so much redundant data in the compressed stream.
Not having encountered hash code collisions in test data doesn't mean that they don't exist. I understand that this is "just logging", and if, with an extremely low probability, the log lies, we are kind of fine with that. Still, isn't it fundamentally a bug? It's unusual for a component to have such probabilistic correctness characteristics in an otherwise precise system. Do you happen to know how much of the wins would be lost if the string lookup was implemented "correctly"? Thank you!
@ladipro I totally agree with your point about hash collisions. It's a very valid concern. Without relying on hashes, though, we'd have to hold on to all the strings we saw, because that's the only way to be absolutely sure the strings match. Holding on to strings is absolutely not an option in a 32-bit process (as we have gigabytes of them), even if we chose a single string per hashcode as a "sample". Writing them to a page file like I do with the Reader is an option, but you'd need to constantly seek to read a string out from disk for every new string you encounter. It would absolutely kill perf and defeat the purpose. I'm afraid that if we want the perf wins, we have to allow a small possibility of a collision, where in a random binlog (1) a random string you happened to be looking at (2) is switched to another random string. You only ever look at a tiny fraction of all the strings in a binlog (there are millions), so the chances that you happen to look at the corrupt one are infinitesimally small (given that I haven't seen any collisions in our top 10 largest binlogs). I say it's worth the risk!
I guess we could also store the string length and the first character (in addition to the hash), to further drastically reduce the chance of collision. But honestly I don't think it's even worth it.
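For illustration, the widened key this comment describes would look something like the following. This is a purely hypothetical Python sketch (the real code is C# and keeps only a 64-bit FNV-1a hash); `dedup_key` is an invented name, and Python's built-in `hash` stands in for the real hash function:

```python
def dedup_key(s: str) -> tuple:
    """Hash plus length plus first character: a collision now requires
    all three components to match, not just the hash."""
    first = s[0] if s else ""
    return (hash(s), len(s), first)
```

Two strings of different lengths can never collide under this key, regardless of how the hash behaves.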
Regarding the hash problem and the large memory required to store the strings, wouldn't using msbuild's string interner (maybe using @ladipro's rewritten version of it) fix the problem? The interner also dedupes strings, so the in-memory string size should be the same as the deduped string size from the binlog |
If there was a big concern about a collision, what about a post-process step? You opt in to "do it fast with a hash" but allow a slower pass that captures all the strings and verifies the result; at that point, if there was a collision you could bark, but if there was no collision you reap the gains when sharing the binlog. That does nothing for the perf, though, which is what I assume was driving a lot of this? I suspect the big concern is trying to capture this one-off; realistically, though, by the time we're digging through the binlogs it's probably reproducible on our end, so it's not like we're totally out of luck.
The interner holds on to strings only as long as something else is keeping them alive so it wouldn't guarantee correctness either. It's basically a cache of recently seen and not yet collected strings, not of all strings. |
Agreed that it wouldn't be worth it. It's basically like using a wider hashcode but with poorer collision characteristics.
Let's do it!
Left a few comments, mostly nits.
Reduce maximum strings allocated in memory to 2GB (1 billion chars).
This avoids manually switching from currentRecordWriter to originalBinaryWriter in three different places. It's also easier this way to find the places where the switch happens.
I think I've answered or resolved all comments, but let me know if you have more or disagree with some of my resolutions. Meanwhile I'll do some more final testing on it.
Thank you!
I'm happy to report a nice performance improvement from this PR.
(all single-core build). We see that /bl makes it twice as slow, but a diag text log is almost 4x as slow. I think we can reduce binlog overhead further.
Shall we merge this?
Yep! I'm still trying to merge things in a gradual way, but I'll get to your other PR in the next few days.
When writing out a binary log we now deduplicate strings and dictionaries. This results in a significant performance increase and binlog size reduction. Performance increase is about 2x on average, size reduction is about 4x on average, but up to 30x for large binlogs.
Add two new record kinds: String and NameValueList. A string record is written the first time we encounter a string we need to serialize. The next time we see the string we only write its index in the total list of strings.
Similarly, NameValueList is a list of key and value strings, used for Properties, environment variables and Item metadata. The first time we're writing out a list we write a record, and subsequent times just the index.
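The write-side dedup described above can be sketched roughly as follows. This is a hedged Python illustration, not the actual MSBuild C# implementation: `DedupWriter`, the record tag value, and the wire layout are all hypothetical, and the real code stores a 64-bit hash rather than the string itself:

```python
import io
import struct

STRING_RECORD = 24  # hypothetical record-kind tag; real format details differ

class DedupWriter:
    """Writes each distinct string once as a full record; later
    occurrences are serialized as just an index into the string list."""

    def __init__(self, stream):
        self.stream = stream
        self.indices = {}  # string -> 1-based index in order of first appearance

    def write_string(self, s: str) -> int:
        index = self.indices.get(s)
        if index is None:
            # First sighting: assign the next index and emit a full String record.
            index = len(self.indices) + 1
            self.indices[s] = index
            data = s.encode("utf-8")
            self.stream.write(struct.pack("<BI", STRING_RECORD, len(data)))
            self.stream.write(data)
        else:
            # Seen before: emit only the small index reference.
            self.stream.write(struct.pack("<I", index))
        return index
```

Because records are still emitted in encounter order, a reader that processes the file front to back always sees a String record before any index that refers to it, which is what keeps the format streamable.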
This keeps the binlog format streaming, so if the logging is interrupted midway, we will be able to read everything up to that point.
We do not hold on to strings we encountered. Instead we hash them and only preserve the hash code. We rely on the hash function resulting in no collisions, otherwise it could happen that a random string in a binlog would be substituted instead of another one. The hashtables do not significantly add to MSBuild memory usage (20-30 MB for our largest binlogs).
FNV-1a (64-bit hash size) was a carefully chosen hash function for its simplicity, performance, and lack of collisions on all binlogs tested so far. A 32-bit hash size (such as string.GetHashCode()) was not sufficient and resulted in ~800 collisions for our largest binlog with 2.7 million strings.
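For reference, 64-bit FNV-1a is only a few lines. Here is a Python sketch over bytes (the shipping C# code hashes strings and its exact details may differ); the offset basis and prime are the standard published FNV-1a 64-bit constants:

```python
FNV_OFFSET_BASIS_64 = 0xCBF29CE484222325
FNV_PRIME_64 = 0x00000100000001B3

def fnv1a_64(data: bytes) -> int:
    """FNV-1a: XOR each byte into the hash, then multiply by the FNV prime."""
    h = FNV_OFFSET_BASIS_64
    for b in data:
        h ^= b
        h = (h * FNV_PRIME_64) & 0xFFFFFFFFFFFFFFFF  # wrap to 64 bits
    return h
```

The appeal is that the hot loop is one XOR and one multiply per byte, with no tables or state beyond the running hash.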
This change includes other performance improvements such as inserting a BufferedStream around the stream we're reading or writing. This results in a significant performance improvement.
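Conceptually the buffering change is just a wrapper around the raw stream; in Python terms it would look like this (the actual code uses .NET's BufferedStream, and the buffer size here is an arbitrary illustrative value):

```python
import io

raw = io.BytesIO()
# Wrap the raw stream so many small writes coalesce into fewer large ones.
buffered = io.BufferedWriter(raw, buffer_size=32768)
buffered.write(b"record")
buffered.flush()  # push buffered bytes through to the underlying stream
```

Small per-record writes hit the in-memory buffer instead of the underlying stream, which is where the performance win comes from.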
We introduce a new StringStorage data structure in the binlog reader, for storing the strings on disk instead of reading them all into memory. Strings are loaded into memory on demand. This prevents OOM in 32-bit MSBuild processes when playing back large binlogs. This keeps the memory usage relatively flat when reading.
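A minimal sketch of the on-demand idea: keep only (offset, length) pairs in memory and seek into a spill file when a string is actually needed. This is an illustrative Python version with hypothetical names, not the viewer's actual StringStorage:

```python
import os
import tempfile

class StringStorage:
    """Spills strings to a temp file; memory holds only (offset, length) pairs."""

    def __init__(self):
        self.file = tempfile.TemporaryFile()
        self.slots = []  # index -> (offset, byte_length)

    def add(self, s: str) -> int:
        data = s.encode("utf-8")
        offset = self.file.seek(0, os.SEEK_END)
        self.file.write(data)
        self.slots.append((offset, len(data)))
        return len(self.slots) - 1

    def get(self, index: int) -> str:
        # Load on demand: seek and decode only when the string is requested.
        offset, length = self.slots[index]
        self.file.seek(offset)
        return self.file.read(length).decode("utf-8")
```

In-memory cost is a couple of integers per string rather than the string's characters, which is why a 32-bit reader can get through multi-gigabyte logs.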
Testing
I've done extensive testing for correctness, performance, and hash collisions.
Hash function selection is documented here:
https://github.com/KirillOsenkov/MSBuildStructuredLog/wiki/String-Hashing
In summary, FNV-1a produces no collisions with a 64-bit hash size and has very decent performance (11 seconds to hash all the strings in our largest binlog, 2.7 million strings). Additionally, it's very simple and minimal code.
Playback:
I've tested various scenarios for playback, and the data looks good. Binlog size reduction and playback-of-binlog-into-binlog perf improvement:
Playback of legacy vswin.binlog by legacy: 6:54
Playback of legacy vswin.binlog by new: 2:56
Playback of new vswin.binlog on 32-bit: 3:17
Playback of new vswin.binlog on 64-bit: 2:13
In summary, I expect builds to be faster (up to 4 minutes improvement for really large builds) and binlogs to be smaller: up to 50% for small binlogs, 2-4x for medium size, and 131x (from 23 GB to 172 MB) for our most extreme binlog.