
use different hash algorithm than SHA1 in checksum for better performance #33411

Closed
heejaechang opened this issue Feb 15, 2019 · 60 comments

@heejaechang
Contributor

@sharwell believes SHA1 is too slow for our checksum, so he wants us to use a different algorithm than SHA1 to make hashing faster.

@sharwell
Member

📝 I'll also review the keys for collision resilience and set up an A/B test that uses a very poor hash to ensure collisions occur periodically during our testing so we understand the impact it would have at less frequent occurrences in the wild.
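For illustration, a "very poor hash" for that A/B test could be as small as truncating a real digest to a few bits so collisions show up quickly. The sketch below is a hypothetical test-only helper (assuming .NET 5+'s static SHA256.HashData), not existing Roslyn code.

using System;
using System.Security.Cryptography;

// Hypothetical test-only checksum: keep only the low 8 bits of a real digest so
// that at most 256 distinct values exist and collisions occur almost immediately.
static class CollidingTestChecksum
{
    public static byte Compute(ReadOnlySpan<byte> content)
    {
        Span<byte> digest = stackalloc byte[32];
        SHA256.HashData(content, digest);   // .NET 5+ static hashing API
        return digest[0];
    }
}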

@davidwengier
Contributor

It would be worthwhile to validate assumptions around SHA256 at the same time to settle that argument once and for all at least.

@sharwell
Member

Alternative proposal: unique ID generation. This would have essentially the runtime behavior of lazily calculating the hash of an object by simply calling Guid.NewGuid(), but could be done in a much faster manner with fewer stored bits since we only need the number to be unique within the devenv+OOP processes.

@heejaechang
Contributor Author

The checksum is designed to be content based so that it can be stateless. We already have a unique id (VersionStamp) that is only valid within the same VS session, with an attempt to re-use it between sessions via the file stamp.

Having a system at the lower layer where it is hard to guarantee data integrity between processes or between VS sessions makes any system built on top of it hard to build and maintain. Basically, it requires something to guarantee the validity of the data by holding extra information, which makes it a stateful system.

A content-based checksum lets one pick up data such as diagnostics, bloom filters, etc. by recalculating the checksum of the source content itself; it doesn't matter who generated the data as long as the checksum is the same, which makes it easy to distribute (no central control). Content cached by checksum can also be validated by re-calculating the checksum. That makes the whole system stateless.
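A minimal sketch of what "content based and stateless" means here (hypothetical types, not actual Roslyn code): any process can recompute the key from the content alone, so a cache hit can always be validated without extra bookkeeping.

using System;
using System.Collections.Concurrent;
using System.Security.Cryptography;
using System.Text;

// Hypothetical content-addressed cache: the key is derived purely from content,
// so any process can recompute it and validate cached data without shared state.
sealed class ContentAddressedCache<TValue>
{
    private readonly ConcurrentDictionary<string, TValue> _cache = new();

    private static string ChecksumOf(string content) =>
        Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(content)));

    public void Put(string content, TValue value) =>
        _cache[ChecksumOf(content)] = value;

    public bool TryGet(string content, out TValue value) =>
        _cache.TryGetValue(ChecksumOf(content), out value);
}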

@heejaechang
Contributor Author

heejaechang commented Feb 15, 2019

I already know the impact of a collision. We use the checksum as a key for data; the data can be the content itself or information calculated from the content.

Having a collision means the wrong content (source, metadata, options, or any other data the solution is using) is used to build the solution, or calculated information such as a symbol index or syntax tree index for the wrong content is used by features.

The user-facing impact will be things like wrong diagnostics showing up, FAR showing the wrong result, Add Using not showing up, Go To not finding a match, etc., or a crash if completely wrong data gets picked up from the cache.

The system assumes there are no collisions.

...

By the way, if you want to see this yourself, change the checksum type to produce the same value for everything and see what happens; there's no reason to use a very poor hash to see the impact.

@heejaechang
Contributor Author

we only need the number to be unique within the devenv+OOP processes.

This is not true; that is just one usage of the checksum. For example, once things are synced between devenv and OOP, the checksum is used to verify that the content is correct.

@sharwell
Member

For example, once things are synced between devenv and OOP, the checksum is used to verify that the content is correct.

Right, that's what I meant. Collisions are bad if they happen in either process, but maybe don't matter if they happen in an unrelated process.

@sharwell
Member

The checksum is designed to be content based so that it can be stateless.

Is there any correctness requirement that the checksum be based on content, or is that just an implementation detail? Using a unique ID assignment to avoid the recalculation of checksums from content could save us a substantial amount of processing.

Having a system at the lower layer where it is hard to guarantee data integrity between processes or between VS sessions makes any system built on top of it hard to build and maintain. Basically, it requires something to guarantee the validity of the data by holding extra information, which makes it a stateful system.

Failure to correctly keep track of unique IDs would result in cache misses across processes, but would not result in the use of invalid data (or more specifically, the incorrect reuse of data from one context in a different context).

By the way, if you want to see this yourself, change the checksum type to produce the same value for everything and see what happens; there's no reason to use a very poor hash to see the impact.

Sometimes this degrades performance to the point that the system becomes unusable. Also, some systems that produce incorrect results in a collision scenario self-correct when the checksums no longer collide. These systems rely on the infrequent occurrence of collisions and are not able to handle cases where collisions are constant. It's easy to ensure collisions are rare but difficult to make them non-existent.

@heejaechang
Contributor Author

The whole system is built on top of the promised collision-resistance strength of SHA. What you are saying is like asking what is so wrong with two digital certificates having the same signature.

@gafter
Member

gafter commented Feb 15, 2019

What usage of SHA1 are we talking about? In the compiler?

@davidwengier
Contributor

@gafter There are uses in the compiler and IDE side, though this task is specifically for IDE. The compiler work item is at https://devdiv.visualstudio.com/DevDiv/_workitems/edit/788330

@CyrusNajmabadi
Member

@sharwell believes SHA1 is too slow for our checksum

Do we have any data for that? How much time do we spend on sha1 hashing currently?

@sharwell
Member

Do we have any data for that? How much time do we spend on sha1 hashing currently?

It's shown up several times internally, but hasn't risen to the point that it took priority over other perf work. A specialized testing harness will need to be created that covers the checksum process (this issue) and serialization process (#23331).

@blowdart

Also, please come talk to @GrabYourPitchforks and me about this. Given that we use Marvin32 for strings, it'd be an interesting discussion around why Murmur and not Marvin.

@CyrusNajmabadi
Member

Note: roslyn already has an impl of murmur hash (v2 though) here: https://github.com/dotnet/roslyn/blob/master/src/Workspaces/Core/Portable/Shared/Utilities/BloomFilter.cs

I tested it originally, and found that on all of Roslyn's inputs we never collided with a suitably high number of bits (which I think was less than the 160 used by SHA1). Note that for Roslyn's needs, we're not going for cryptographic strength, nor do we have any sort of attacker/adversary. All we're trying to do is hash fast with an extremely low chance of collisions. We then use that hash for content lookup. An inappropriate collision is undesirable, but isn't something that would be disastrous.

@blowdart

blowdart commented Feb 19, 2019

A few things I care about here:

The system assumes there are no collisions.

That's not true for any hashing algorithm. A good algorithm makes it unlikely, but it's never impossible.

And then there's the phrase "guarantee data integrity".

Both of which become security flags :) Admittedly this could be a misunderstanding based on wording, and given what Roslyn does it probably is, but still, I'd like to get a clear concept of what the hashes are used for.

@CyrusNajmabadi
Member

Both of which become security flags

This is not a security domain. We are just using this for content hashing: a simple way to have our editor refer to a snapshot of a file. We've gone through the threat modeling here in the past. The use of a hashing algorithm here is just to get a content ID; it's not for a cryptographic security purpose. Please don't make it into one.

but still, I'd like to get a clear concept of what the hashes are used for.

It's very simple. We often persist harmless data to disk (like an index of all the names used in a file). We want to know if we should update that index, so we store a hash of the original file that was used to compute the index. If that hash changes, we recompute the index. If it doesn't, we keep the old index around.

The worst case here is that we somehow get a staggeringly unlikely collision. In that case, our index will just be a little incorrect, and all that means is that if you do something like run "find all references" in VS, you get a slightly wrong answer. Note that to even get a collision would mean the user was somehow trying to subvert us, as any sort of normal edit/change has less chance of colliding than the world spontaneously imploding tomorrow.
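A rough sketch of that persisted-index pattern (hypothetical names and .NET 5+ hashing helpers; Roslyn's actual persistence service is more involved):

using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

// Store the content hash alongside the index; on load, recompute the hash and
// rebuild the index only if the content changed.
static class NameIndexCache
{
    public static string[] GetNames(string sourcePath, string indexPath, Func<string, string[]> computeIndex)
    {
        string text = File.ReadAllText(sourcePath);
        string hash = Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(text)));

        if (File.Exists(indexPath))
        {
            string[] stored = File.ReadAllLines(indexPath);
            if (stored.Length > 0 && stored[0] == hash)
                return stored.Skip(1).ToArray();      // hash unchanged: reuse the old index
        }

        string[] names = computeIndex(text);           // hash changed (or no index yet): recompute
        File.WriteAllLines(indexPath, new[] { hash }.Concat(names));
        return names;
    }
}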

@CyrusNajmabadi
Member

That's not true for any hashing algorithm. A good algorithm makes it unlikely, but it's never impossible.

It's fine for our purposes. The chance of collisions is so low under any sort of realistic use case as to be the same as impossible. There is a better chance of RAM bits being flipped by cosmic rays than there is of a collision here, and we don't worry about or try to deal with the former.
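As a rough birthday-bound estimate (my numbers, assuming on the order of 10^9 distinct documents ever hashed): with a b-bit hash and n inputs, the collision probability is roughly n^2 / 2^(b+1). For n = 10^9 and b = 160 that is about 10^18 / 2^161 ≈ 3×10^-31, and even for b = 128 it is about 1.5×10^-21, far below the rate of random memory bit flips.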

@blowdart

blowdart commented Feb 19, 2019

Ah now I remember. SHA1 just puts things on my radar.

Moving away from SHA1 or even SHA256 would remove the need for the endless exemption. Still wondering why Murmur over Marvin; it might be interesting to compare performance there.

(But you know if you say SHA three times in front of a mirror then either Levi or I will appear)

@CyrusNajmabadi
Member

Ah now I remember. SHA1 just puts things on my radar.

:)

Still wondering why Murmur over Marvin; it might be interesting to compare performance there.

I don't remember the history (even though I did the Murmur impl). Murmur was just extremely simple to get right, doesn't allocate, and had lots of information available on the net, as well as public domain code to refer to. I'm looking for info on Marvin hashing, and not much is coming up. So it probably was just a decision to use something that fit our needs and would be supremely easy to use.

If Marvin is better, I would have no problem with Roslyn switching to it. Is there a nice .NET version available somewhere?

@CyrusNajmabadi
Member

CyrusNajmabadi commented Feb 19, 2019

Roslyn's critical requirements for the hash function:

  1. Cannot be MD5 or SHA1. These are banned for all purposes (even non-cryptographic scenarios).
  2. Non-allocating.
  3. Can provide a reasonably uniform distribution of hash results from inputs that are System.Strings (not bytes/arrays).
  4. Needs to handle strings in both a case-sensitive and a case-insensitive manner (again, without allocating).
  5. Fast. We potentially need to hash millions of strings, and we are constantly rehashing as users edit.
  6. Deterministic across sessions, i.e. we need people to be able to close VS, log in/out, reboot, etc. and get the exact same result on the same inputs.

Non-goals of the hash function:

  1. Cryptographically secure.
  2. Handle non-strings.

Murmur2 fit all of the above, and asking the BCL for an equivalent at the time yielded nothing built-in that we could use.
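For illustration only (this is not the Roslyn implementation, which uses Murmur2 in BloomFilter.cs), a hash meeting requirements 2 through 6 can be as small as an FNV-1a style loop over case-normalized chars:

// Illustrative sketch only: non-allocating, deterministic across sessions, and
// case-insensitive without creating a lowercased copy of the string.
static class StringHash
{
    public static int Compute(string text, bool ignoreCase)
    {
        const uint OffsetBasis = 2166136261;
        const uint Prime = 16777619;

        uint hash = OffsetBasis;
        foreach (char c in text)
        {
            char ch = ignoreCase ? char.ToLowerInvariant(c) : c;
            hash = unchecked((hash ^ ch) * Prime);   // FNV-1a step over each UTF-16 code unit
        }

        return unchecked((int)hash);
    }
}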

@blowdart

We use Marvin for string hash codes in Core, I believe.

Not really fussed here, it's not crypto related :)

@CyrusNajmabadi
Member

We use Marvin for string hash codes in Core, I believe.

Do you know if it would satisfy all of the above requirements? For example, could you get a case-insensitive hash using Marvin in a non-allocating fashion?

Not really fussed here, it's not crypto related :)

No worries! I'm just trying to learn more myself. I would be happy to get rid of our hand-written code if there is framework code that can do it for us!

@blowdart

I honestly don't know, because that dips into implementation details, and I'm just a PM :)

@GrabYourPitchforks might.

@heejaechang
Contributor Author

heejaechang commented Feb 20, 2019

For collisions, regardless of theoretical strength, we want no collision ever happening in practice: not just in one repo but across repos, as long as someone is not purposely trying to make things collide. That is why we chose SHA1.

..

I just ran this on my Roslyn src folder:

# Count of items and total size (in MB) under the folder, then the average size per item.
$foo = (Get-ChildItem -path "c:\your\file\path" -recurse | Measure-Object | select -expand Count)
$bar = ((Get-ChildItem -path "c:\your\file\path" -recurse | Measure-Object -property length -sum).sum / 1MB)
$avrg = $bar / $foo
$avrg

and it gives about 16 KB as the average.

...

But I don't think we want a hash that shows different characteristics based on the size of the input.

@sharwell
Member

Currently the implementation assumes there will not be collisions, and does not implement collision detection or handling in the synchronization algorithm. The most likely observed side effect following a collision would be an application crash. However, assuming this was corrected, the characteristics would change substantially. The application would be highly tolerant of short-lived collisions (e.g. a checksum generated while typing that changes the next time a character is entered), but less tolerant of long-lived collisions (e.g. a user completes typing while the document is in a colliding state). In addition, the observed impact of collisions between objects of different types could be restricted to performance degradation proportional to the serialized size of the objects.

The majority of hashes generated by the system are used for synchronization of transient assets between processes on the same system. All of these hashes could be replaced with an object tagging system (rather than a content-based hash) to substantially reduce the startup cost and guarantee collisions are avoided.
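A minimal sketch of the object-tagging idea (hypothetical, not an existing Roslyn type): each object gets a lazily assigned, process-unique ID, so no content ever needs to be hashed. The trade-off discussed above is that such IDs are not stable across sessions or processes.

using System.Runtime.CompilerServices;
using System.Threading;

// Hypothetical tagging scheme: IDs are assigned on first use and live only as
// long as the object and the process do.
static class ObjectTag
{
    private static long s_nextId;
    private static readonly ConditionalWeakTable<object, StrongBox<long>> s_ids = new();

    public static long GetId(object instance) =>
        s_ids.GetValue(instance, _ => new StrongBox<long>(Interlocked.Increment(ref s_nextId))).Value;
}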

@CyrusNajmabadi
Member

Currently the implementation assumes there will not be collisions, and does not implement collision detection or handling in the synchronization algorithm.

Right. All of that would be unnecessary (given that we don't have hostile actors we have to be worried about). The chance of a collision is so low as to be practically meaningless. As mentioned above, it would make more sense for us to worry about bitflips and how to deal with them versus needing to be concerned with hash collisions here.

@GrabYourPitchforks
Member

Right. All of that would be unnecessary (given that we don't have hostile actors we have to be worried about). The chance of a collision is so low as to be practically meaningless.

Are the hashes computed over source documents? If so, are they persisted anywhere? (You mentioned needing something stable across reboots.) The scenario I'm thinking of is that somebody opens a .cs file provided by an adversary, and it persists something to the file system that causes VS to misbehave from then on due to a collision.

@CyrusNajmabadi
Member

The scenario I'm thinking of is that somebody opens a .cs file provided by an adversary, and it persists something to the file system that causes VS to misbehave from then on due to a collision.

This is already covered by the VS threat model. Opening a .cs file from an adversary has never been safe (for many reasons**), so VS even warns in this case to only open things you trust.

--

** For example, things like designers will start up and will literally examine and just execute code in the .cs file. The threat from code execution is staggeringly higher than any concern about producing slightly incorrect IDE assistance because we got a file that collided.

@GrabYourPitchforks
Member

Hi all, spent some time digging around. There's support on the runtime side for exposing non-cryptographic hash algorithms via their own APIs (see https://github.com/dotnet/corefx/issues/25666 for one proposal). Just pulling things out of thin air my gut tells me that if this API gets off the ground we'd probably start with CRC32, Marvin32/64, and xxHash32/64 as a first pass, since the worker code to all of those already exists in Core in some fashion. I'm not optimistic about this making it in to 3.0, but we could ask @terrajobst and others to fast-track it when 3.1 opens up.

@tmat
Member

tmat commented Feb 25, 2019

@GrabYourPitchforks @terrajobst Would it be possible to publish the implementation as a source package? Also, we need the impl to work on .NET 4.7.2, at least once it has a hash alg that we can use as a replacement for SHA1.

@terrajobst
Member

Would it be possible to publish the implementation as a source package?

Possible, yes, but not desirable. We just ran through this exercise for JSON. It's too fragile, even for internal use.

@bcronce

bcronce commented Apr 22, 2019

I was doing some prototyping on a system where I needed a uniqueness guarantee but didn't really care about crypto. I found a pure native implementation of BLAKE2 from their official site that outperformed the .NET SHA1 for me.

BLAKE2 is based on BLAKE, a SHA-3 finalist that only lost because it's more difficult to implement in hardware, but it's really fast in software. It was made by some of the best cryptographers in the business, who have a track record of creating crypto and finding faults in other crypto.

I consider BLAKE2 my go-to hashing algorithm when I want the properties of crypto strength but don't care about attackers.

@jinujoseph jinujoseph modified the milestones: 16.1.P1, 16.2 Apr 24, 2019
@jinujoseph jinujoseph modified the milestones: 16.2, Backlog Jun 9, 2019
@CyrusNajmabadi
Member

@sharwell do we need this issue? We have moved to SHA256 here:

new ObjectPool<IncrementalHash>(() => IncrementalHash.CreateHash(HashAlgorithmName.SHA256), size: 20);

If we have an actual scenario where we see hash perf showing up, we can use that as a driving function. However, this issue seems speculative and not necessary to just have floating around.
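For reference, a pooled IncrementalHash like the one above would be consumed roughly like this (a sketch: Allocate/Free are Roslyn's ObjectPool members, and the actual call sites differ):

// Rent a hasher from the pool, hash the content, and return the hasher.
byte[] ComputeChecksum(ObjectPool<IncrementalHash> pool, byte[] data)
{
    IncrementalHash hash = pool.Allocate();
    try
    {
        hash.AppendData(data);
        return hash.GetHashAndReset();   // resets state so the instance can be reused from the pool
    }
    finally
    {
        pool.Free(hash);
    }
}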

@saucecontrol
Member

SHA256 is still quite slow. If this is a bottleneck, I've published a BLAKE2 implementation that's roughly 5x faster than Windows CNG's SHA256: https://github.com/saucecontrol/Blake2Fast#blake2fast-vs-net-in-box-algorithms-md5-and-sha2

That's making use of AVX2 intrinsics in netcoreapp3.x, but the scalar fallback is still at least twice as fast as SHA256, even on netfx 32-bit.

@CyrusNajmabadi
Member

@saucecontrol does this work for these requirements: #33411 (comment)?

@saucecontrol
Member

It works as-is for everything except case-insensitive string hashing. That can be done by pushing individual case-normalized characters (or perhaps fixed windows of case-normalized substrings) into the hash state. The API is Span-based, so it's non-allocating and flexible.

@saucecontrol
Member

saucecontrol commented Mar 8, 2020

I assume you don't care about endianness if you're wanting to hash System.String values directly? Or would you want something that normalizes both case and byte order?

@CyrusNajmabadi
Member

That can be done by pushing individual case-normalized characters

Could you link me to where it supports pushing characters? Thanks!

I assume you don't care about endianness if you're wanting to hash System.String values directly?

Correct. :)

@saucecontrol
Member

I've updated the API since the last version published to NuGet, but you can try it out with the latest CI build

You can use Blake2b.CreateIncrementalHasher(), which will return the hash state struct. That has an Update() that accepts a value or a Span of values:

https://github.com/saucecontrol/Blake2Fast/blob/master/src/Blake2Fast/Blake2b/Blake2bHashState.cs#L155-L183

So you can call that with aString.AsSpan(), or you could case-normalize a string a chunk at a time into a fixed buffer, or just grab a character at a time to update the hash state. Updating the state simply pushes new bytes into a buffer until a block is full, at which point the actual hash state is updated, so it's very lightweight.
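Putting that together, a case-insensitive hash over a string might look like the sketch below; Update() is the member described above, while the Finish() call that produces the digest is my assumption about the incremental API.

// Push case-normalized chars into the Blake2Fast incremental state, one at a
// time, without allocating a lowercased copy of the string.
byte[] HashOrdinalIgnoreCase(string text)
{
    var hasher = Blake2b.CreateIncrementalHasher();
    foreach (char c in text)
        hasher.Update(char.ToLowerInvariant(c));
    return hasher.Finish();   // assumed finalization member
}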

@GrabYourPitchforks
Member

FYI, on 64-bit platforms SHA512 tends to outperform SHA256, especially for larger inputs (anything over a few dozen bytes), as is the case here. If you're going for raw speed and you need to use something built-in, it could be a stop-gap measure.
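If that stop-gap were taken, the change would be as small as swapping the algorithm name in the pool quoted earlier (sketch):

// Same pooled setup as above, but SHA512 tends to be faster than SHA256 on
// 64-bit for larger inputs.
new ObjectPool<IncrementalHash>(() => IncrementalHash.CreateHash(HashAlgorithmName.SHA512), size: 20);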

@saucecontrol
Member

For that matter, MD5 is faster than both SHA2 variants on both platforms if you don't require cryptographic security.

on 64-bit platforms, SHA512 tends to outperform SHA256

BLAKE2 has similar characteristics. With scalar implementations, the 256-bit BLAKE2s variant runs faster on 32-bit while 512-bit BLAKE2b runs faster on 64-bit. With SIMD implementations, BLAKE2b is always faster.

@CyrusNajmabadi
Member

I'm fine with any system that meets the requirements stated in #33411 (comment). We don't need cryptographic security. We just want a reasonable hasher.

@tmat
Member

tmat commented Mar 9, 2020

@CyrusNajmabadi In addition to those requirements, it can't be MD5 or SHA1.

@CyrusNajmabadi
Member

Interesting, is that an external requirement/mandate @tmat? Nothing about the scenarios where we use these hashes seems like it would preclude those (at least from Roslyn's perspective).

@CyrusNajmabadi
Member

Thanks @tmat. I have added that to our criteria.

@jack-pappas

It may be worth taking a look at BLAKE3 here -- it's much faster than SHA-1, SHA-256, MD5, and BLAKE2.

https://github.com/BLAKE3-team/BLAKE3

@CyrusNajmabadi
Member

Closing. We moved to XxHash128.
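For reference, a minimal sketch of a content checksum with XxHash128 (assuming the System.IO.Hashing package; this is not the actual Roslyn call site):

using System.IO.Hashing;
using System.Text;

// 128-bit non-cryptographic content checksum.
static class ContentChecksum
{
    public static byte[] Compute(string content) =>
        XxHash128.Hash(Encoding.UTF8.GetBytes(content));
}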

@CyrusNajmabadi closed this as not planned on Oct 21, 2024
@dotnet locked and limited conversation to collaborators on Oct 21, 2024