Deduplicate strings in binlogs #6017
Conversation
The latest viewer version (2.1.300) has already been updated to read the binlog format version 10.
Hey @mmitche your binlog goes down in size from 590 MB to 157 MB. @aolszowka I'm probably too late with this, but your binlog goes down from 23 GB to 172 MB! You get the first prize!
Amazing PR. Just some small things I saw while reading through.
This is a HUGE win, I'll make sure to pass it along and keep it in my toolbox :) Thank you!
I'm so excited about this! 🙌🏻
Should we add some test cases around deliberately malformed logs (misordered, unused indexes, that sort of thing)?
A few nits, but looking great. My only big question is tactical: should we guard this behind a change wave? If we fork now and put this in at the very beginning of 16.10 we can probably just make it unconditional, since making the write side conditional would be a bit inconvenient.
I'm personally not in a rush for 16.9, so I'd rather just wait until we fork for 16.9 and then merge into master so we have lots of runway to dogfood for 16.10. That way, if we have to tweak the file format, we'll have more time.
Love it!
I actually discovered that the CLSCompliant warnings have disappeared. Maybe in early iterations the types I added were public and that's why I added the NoWarns? But just removing them now seems to be fine.
Force-pushed from 198f85a to aa6fbab
Very quick skim, submitting small nits. The results look great 😀
Awesome wins, and shame on GZipStream for leaving so much redundant data in the compressed stream.
Not having encountered hash code collisions in test data doesn't mean that they don't exist. I understand that this is "just logging", and if, with an extremely low probability, the log lies, we are kind of fine with that. Still, isn't it fundamentally a bug? It's unusual for a component to have such probabilistic correctness characteristics in an otherwise precise system. Do you happen to know how much of the wins would be lost if the string lookup was implemented "correctly"? Thank you!
@ladipro I totally agree with your point about hash collisions. It's a very valid concern. Without relying on hashes, though, we'd have to hold on to all the strings we saw, because that's the only way to be absolutely sure the strings match. Holding on to strings is absolutely not an option in a 32-bit process (as we have gigabytes of them), even if we chose a single string per hashcode as a "sample". Writing them to a page file like I do with the Reader is an option, but you'd need to constantly seek to read a string out from disk for every new string you encounter. It would absolutely kill perf and defeat the purpose. I'm afraid that if we want the perf wins, we have to allow a small possibility of a collision, where in a random binlog (1) a random string you happened to be looking at (2) is switched to another random string. You only ever look at a tiny fraction of all the strings in a binlog (there are millions), so the chances that you happen to look at the corrupt one are infinitesimally small (given that I haven't seen any collisions in our top 10 largest binlogs). I say it's worth the risk!
I guess we could also store the string length and the first character (in addition to the hash), to further drastically reduce the chance of collision. But honestly I don't think it's even worth it.
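For illustration, the widened key this comment describes would look something like the following. This is a purely hypothetical Python sketch (the real code is C# and keeps only a 64-bit FNV-1a hash); `dedup_key` is an invented name, and Python's built-in `hash` stands in for the real hash function:

```python
def dedup_key(s: str) -> tuple:
    """Hash plus length plus first character: a collision now requires
    all three components to match, not just the hash."""
    first = s[0] if s else ""
    return (hash(s), len(s), first)
```

Two strings of different lengths can never collide under this key, regardless of how the hash behaves.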
Regarding the hash problem and the large memory required to store the strings, wouldn't using msbuild's string interner (maybe using @ladipro's rewritten version of it) fix the problem? The interner also dedupes strings, so the in-memory string size should be the same as the deduped string size from the binlog |
If there was a big concern about a collision, what about a post-process step? You opt in to "do it fast with a hash" but allow a slower pass that captures all the strings and verifies the result; at that point, if there was a collision you could bark, but if there was no collision you reap the gains when sharing the binlog. That does nothing for the perf, though, which is what I assume was driving a lot of this? I suspect the big concern is trying to capture this one-off; realistically, though, by the time we're digging through the binlogs it's probably reproducible on our end, so it's not like we're totally out of luck.
The interner holds on to strings only as long as something else is keeping them alive so it wouldn't guarantee correctness either. It's basically a cache of recently seen and not yet collected strings, not of all strings. |
Agreed that it wouldn't be worth it. It's basically like using a wider hashcode but with poorer collision characteristics.
Let's do it!
Left a few comments, mostly nits.
Reduce maximum strings allocated in memory to 2GB (1 billion chars).
This avoids manually switching from currentRecordWriter to originalBinaryWriter in three different places. It's also easier this way to find the places where the switch happens.
I think I've answered or resolved all comments, but let me know if you have more or disagree with some of my resolutions. Meanwhile I'll do some more final testing on it.
Thank you!
I'm happy to report a nice performance improvement from this PR.
(all single-core build). We see that /bl makes it twice as slow, but a diag text log is almost 4x as slow. I think we can reduce binlog overhead further.
Shall we merge this?
Yep! I'm still trying to merge things in a gradual way, but I'll get to your other PR in the next few days.
When writing out a binary log we now deduplicate strings and dictionaries. This results in a significant performance increase and binlog size reduction. Performance increase is about 2x on average, size reduction is about 4x on average, but up to 30x for large binlogs.
Add two new record kinds: String and NameValueList. A string record is written the first time we encounter a string we need to serialize. The next time we see the string we only write its index in the total list of strings.
Similarly, NameValueList is a list of key and value strings, used for Properties, environment variables and Item metadata. The first time we're writing out a list we write a record, and subsequent times just the index.
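The write-side dedup described above can be sketched roughly as follows. This is a hedged Python illustration, not the actual MSBuild C# implementation: `DedupWriter`, the record tag value, and the wire layout are all hypothetical, and the real code stores a 64-bit hash rather than the string itself:

```python
import io
import struct

STRING_RECORD = 24  # hypothetical record-kind tag; real format details differ

class DedupWriter:
    """Writes each distinct string once as a full record; later
    occurrences are serialized as just an index into the string list."""

    def __init__(self, stream):
        self.stream = stream
        self.indices = {}  # string -> 1-based index in order of first appearance

    def write_string(self, s: str) -> int:
        index = self.indices.get(s)
        if index is None:
            # First sighting: assign the next index and emit a full String record.
            index = len(self.indices) + 1
            self.indices[s] = index
            data = s.encode("utf-8")
            self.stream.write(struct.pack("<BI", STRING_RECORD, len(data)))
            self.stream.write(data)
        else:
            # Seen before: emit only the small index reference.
            self.stream.write(struct.pack("<I", index))
        return index
```

Because records are still emitted in encounter order, a reader that processes the file front to back always sees a String record before any index that refers to it, which is what keeps the format streamable.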
This keeps the binlog format streaming, so if the logging is interrupted midway, we will be able to read everything up to that point.
We do not hold on to strings we encountered. Instead we hash them and only preserve the hash code. We rely on the hash function resulting in no collisions, otherwise it could happen that a random string in a binlog would be substituted instead of another one. The hashtables do not significantly add to MSBuild memory usage (20-30 MB for our largest binlogs).
FNV-1a (64-bit hash size) was a carefully chosen hash function for its simplicity, performance, and lack of collisions on all binlogs tested so far. A 32-bit hash size (such as string.GetHashCode()) was not sufficient and resulted in ~800 collisions for our largest binlog with 2.7 million strings.
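For reference, 64-bit FNV-1a is only a few lines. Here is a Python sketch over bytes (the shipping C# code hashes strings and its exact details may differ); the offset basis and prime are the standard published FNV-1a 64-bit constants:

```python
FNV_OFFSET_BASIS_64 = 0xCBF29CE484222325
FNV_PRIME_64 = 0x00000100000001B3

def fnv1a_64(data: bytes) -> int:
    """FNV-1a: XOR each byte into the hash, then multiply by the FNV prime."""
    h = FNV_OFFSET_BASIS_64
    for b in data:
        h ^= b
        h = (h * FNV_PRIME_64) & 0xFFFFFFFFFFFFFFFF  # wrap to 64 bits
    return h
```

The appeal is that the hot loop is one XOR and one multiply per byte, with no tables or state beyond the running hash.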
This change includes other performance improvements such as inserting a BufferedStream around the stream we're reading or writing. This results in a significant performance improvement.
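Conceptually the buffering change is just a wrapper around the raw stream; in Python terms it would look like this (the actual code uses .NET's BufferedStream, and the buffer size here is an arbitrary illustrative value):

```python
import io

raw = io.BytesIO()
# Wrap the raw stream so many small writes coalesce into fewer large ones.
buffered = io.BufferedWriter(raw, buffer_size=32768)
buffered.write(b"record")
buffered.flush()  # push buffered bytes through to the underlying stream
```

Small per-record writes hit the in-memory buffer instead of the underlying stream, which is where the performance win comes from.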
We introduce a new StringStorage data structure in the binlog reader, for storing the strings on disk instead of reading them all into memory. Strings are loaded into memory on demand. This prevents OOM in 32-bit MSBuild processes when playing back large binlogs. This keeps the memory usage relatively flat when reading.
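A minimal sketch of the on-demand idea: keep only (offset, length) pairs in memory and seek into a spill file when a string is actually needed. This is an illustrative Python version with hypothetical names, not the viewer's actual StringStorage:

```python
import os
import tempfile

class StringStorage:
    """Spills strings to a temp file; memory holds only (offset, length) pairs."""

    def __init__(self):
        self.file = tempfile.TemporaryFile()
        self.slots = []  # index -> (offset, byte_length)

    def add(self, s: str) -> int:
        data = s.encode("utf-8")
        offset = self.file.seek(0, os.SEEK_END)
        self.file.write(data)
        self.slots.append((offset, len(data)))
        return len(self.slots) - 1

    def get(self, index: int) -> str:
        # Load on demand: seek and decode only when the string is requested.
        offset, length = self.slots[index]
        self.file.seek(offset)
        return self.file.read(length).decode("utf-8")
```

In-memory cost is a couple of integers per string rather than the string's characters, which is why a 32-bit reader can get through multi-gigabyte logs.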
Testing
I've done extensive testing for correctness, performance, and hash collisions.
Hash function selection is documented here:
https://github.com/KirillOsenkov/MSBuildStructuredLog/wiki/String-Hashing
In summary, FNV-1a produces no collisions with a 64-bit hash size and has very decent performance (11 seconds to hash all the strings in our largest binlog, 2.7 million strings). Additionally, it's very simple and minimal code.
Playback:
I've tested various scenarios for playback, and the data looks good. Binlog size reduction and playback-of-binlog-into-binlog perf improvement:
Playback of legacy vswin.binlog by legacy: 6:54
Playback of legacy vswin.binlog by new: 2:56
Playback of new vswin.binlog on 32-bit: 3:17
Playback of new vswin.binlog on 64-bit: 2:13
In summary, I expect builds to be faster (up to 4 minutes improvement for really large builds) and binlogs to be smaller: up to 50% for small binlogs, 2-4x for medium size, and 131x (from 23 GB to 172 MB) for our most extreme binlog.