
use different hash algorithm than SHA1 in checksum for better performance #33411

Closed
heejaechang opened this issue Feb 15, 2019 · 60 comments

@heejaechang
Contributor

@sharwell believes SHA1 is too slow for our checksum, so he wants us to use a different algorithm than SHA1 to make hashing faster.

@sharwell
Member

📝 I'll also review the keys for collision resilience and set up an A/B test that uses a very poor hash to ensure collisions occur periodically during our testing so we understand the impact it would have at less frequent occurrences in the wild.
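For illustration, a "very poor hash" for that A/B test could be as small as truncating a real digest to a few bits so collisions show up quickly. The sketch below is a hypothetical test-only helper (assuming .NET 5+'s static SHA256.HashData), not existing Roslyn code.

using System;
using System.Security.Cryptography;

// Hypothetical test-only checksum: keep only the low 8 bits of a real digest so
// that at most 256 distinct values exist and collisions occur almost immediately.
static class CollidingTestChecksum
{
    public static byte Compute(ReadOnlySpan<byte> content)
    {
        Span<byte> digest = stackalloc byte[32];
        SHA256.HashData(content, digest);   // .NET 5+ static hashing API
        return digest[0];
    }
}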

@davidwengier
Contributor

It would be worthwhile to validate assumptions around SHA256 at the same time to settle that argument once and for all at least.

@sharwell
Member

Alternative proposal: unique ID generation. This would have essentially the runtime behavior of lazily calculating the hash of an object by simply calling Guid.NewGuid(), but could be done in a much faster manner with fewer stored bits since we only need the number to be unique within the devenv+OOP processes.

@heejaechang
Contributor Author

The checksum is designed to be content based so that it can be stateless. We already have a unique id (VersionStamp) that is only valid within the same VS session, with an attempt to re-use it between sessions via the file stamp.

Having a system at the lower layer where it is hard to guarantee data integrity between processes or between VS sessions makes any system built on top of it hard to build and maintain. Basically, it requires something to guarantee the validity of the data by holding extra information, which makes it a stateful system.

A content-based checksum lets one pick up data such as diagnostics, bloom filters, etc. by recalculating the checksum of the source content itself; it doesn't matter who generated the data as long as the checksum is the same, which makes it easy to distribute (no central control). Content cached by checksum can also be validated by re-calculating the checksum. That makes the whole system stateless.
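A minimal sketch of what "content based and stateless" means here (hypothetical types, not actual Roslyn code): any process can recompute the key from the content alone, so a cache hit can always be validated without extra bookkeeping.

using System;
using System.Collections.Concurrent;
using System.Security.Cryptography;
using System.Text;

// Hypothetical content-addressed cache: the key is derived purely from content,
// so any process can recompute it and validate cached data without shared state.
sealed class ContentAddressedCache<TValue>
{
    private readonly ConcurrentDictionary<string, TValue> _cache = new();

    private static string ChecksumOf(string content) =>
        Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(content)));

    public void Put(string content, TValue value) =>
        _cache[ChecksumOf(content)] = value;

    public bool TryGet(string content, out TValue value) =>
        _cache.TryGetValue(ChecksumOf(content), out value);
}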

@heejaechang
Contributor Author

heejaechang commented Feb 15, 2019

I already know the impact of a collision. We use the checksum as a key for data; the data can be the content itself or information calculated from the content.

Having a collision means the wrong content (source, metadata, options, or any other data the solution is using) is used to build the solution, or calculated information such as a symbol index or syntax tree index for the wrong content is used by features.

The user-facing impact will be things like wrong diagnostics showing up, FAR showing the wrong result, Add Using not showing up, Go To not finding a match, etc., or a crash if completely wrong data gets picked up from the cache.

The system assumes there are no collisions.

...

By the way, if you want to see this yourself, change the checksum type to produce the same value for everything and see what happens; there's no reason to use a very poor hash to see the impact.

@heejaechang
Contributor Author

we only need the number to be unique within the devenv+OOP processes.

This is not true; that is just one usage of the checksum. For example, once things are synced between devenv and OOP, the checksum is used to verify that the content is correct.

@sharwell
Member

For example, once things are synced between devenv and OOP, the checksum is used to verify that the content is correct.

Right, that's what I meant. Collisions are bad if they happen in either process, but maybe don't matter if they happen in an unrelated process.

@sharwell
Member

The checksum is designed to be content based so that it can be stateless.

Is there any correctness requirement that the checksum be based on content, or is that just an implementation detail? Using a unique ID assignment to avoid the recalculation of checksums from content could save us a substantial amount of processing.

Having a system at the lower layer where it is hard to guarantee data integrity between processes or between VS sessions makes any system built on top of it hard to build and maintain. Basically, it requires something to guarantee the validity of the data by holding extra information, which makes it a stateful system.

Failure to correctly keep track of unique IDs would result in cache misses across processes, but would not result in the use of invalid data (or more specifically, the incorrect reuse of data from one context in a different context).

By the way, if you want to see this yourself, change the checksum type to produce the same value for everything and see what happens; there's no reason to use a very poor hash to see the impact.

Sometimes this degrades performance to the point that the system becomes unusable. Also, some systems that produce incorrect results in a collision scenario self-correct when the checksums no longer collide. These systems rely on the infrequent occurrence of collisions and are not able to handle cases where collisions are constant. It's easy to ensure collisions are rare but difficult to make them non-existent.

@heejaechang
Contributor Author

The whole system is built on top of the promised collision-resistance strength of SHA. What you are saying is like asking what is so wrong with two digital certificates having the same signature.

@gafter
Member

gafter commented Feb 15, 2019

What usage of SHA1 are we talking about? In the compiler?

@davidwengier
Contributor

@gafter There are uses in the compiler and IDE side, though this task is specifically for IDE. The compiler work item is at https://devdiv.visualstudio.com/DevDiv/_workitems/edit/788330

@CyrusNajmabadi
Member

@sharwell believes SHA1 is too slow for our checksum

Do we have any data for that? How much time do we spend on sha1 hashing currently?

@sharwell
Member

Do we have any data for that? How much time do we spend on sha1 hashing currently?

It's shown up several times internally, but hasn't risen to the point that it took priority over other perf work. A specialized testing harness will need to be created that covers the checksum process (this issue) and serialization process (#23331).

@blowdart

Also, please come talk to @GrabYourPitchforks and me about this. Given that we use Marvin32 for strings, it'd be an interesting discussion around why Murmur and not Marvin.

@CyrusNajmabadi
Member

Note: roslyn already has an impl of murmur hash (v2 though) here: https://github.com/dotnet/roslyn/blob/master/src/Workspaces/Core/Portable/Shared/Utilities/BloomFilter.cs

I tested it originally, and found that on all of Roslyn's inputs we never collided with a suitably high number of bits (which I think was less than the 160 used by SHA1). Note that for Roslyn's needs, we're not going for cryptographic strength, nor do we have any sort of attacker/adversary. All we're trying to do is hash fast with an extremely low chance of collisions. We then use that hash for content lookup. An inappropriate collision is undesirable, but isn't something that would be disastrous.

@blowdart

blowdart commented Feb 19, 2019

A few things I care about here:

The system assumes there are no collisions.

That's not true for any hashing algorithm. A good algorithm makes it unlikely, but it's never impossible.

And then there's the phrase "guarantee data integrity".

Both of which become security flags :) Admittedly this could be a misunderstanding based on wording, and given what Roslyn does it probably is, but still, I'd like to get a clear concept of what the hashes are used for.

@CyrusNajmabadi
Member

Both of which become security flags

This is not a security domain. We are just using this for content hashing: a simple way to have our editor refer to a snapshot of a file. We've gone through the threat modeling here in the past. The use of a hashing algorithm here is just to get a content ID; it's not for a cryptographic security purpose. Please don't make it into one.

but still, I'd like to get a clear concept of what the hashes are used for.

It's very simple. We often persist harmless data to disk (like an index of all the names used in a file). We want to know if we should update that index, so we store a hash of the original file that was used to compute the index. If that hash changes, we recompute the index. If it doesn't, we keep the old index around.

The worst case here is that we somehow get a staggeringly unlikely collision. In that case, our index will just be a little incorrect, and all that means is that if you do something like run "find all references" in VS, you get a slightly wrong answer. Note that to even get a collision would mean the user was somehow trying to subvert us, as any sort of normal edit/change has less chance of colliding than the world spontaneously imploding tomorrow.
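A rough sketch of that persisted-index pattern (hypothetical names and .NET 5+ hashing helpers; Roslyn's actual persistence service is more involved):

using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

// Store the content hash alongside the index; on load, recompute the hash and
// rebuild the index only if the content changed.
static class NameIndexCache
{
    public static string[] GetNames(string sourcePath, string indexPath, Func<string, string[]> computeIndex)
    {
        string text = File.ReadAllText(sourcePath);
        string hash = Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(text)));

        if (File.Exists(indexPath))
        {
            string[] stored = File.ReadAllLines(indexPath);
            if (stored.Length > 0 && stored[0] == hash)
                return stored.Skip(1).ToArray();      // hash unchanged: reuse the old index
        }

        string[] names = computeIndex(text);           // hash changed (or no index yet): recompute
        File.WriteAllLines(indexPath, new[] { hash }.Concat(names));
        return names;
    }
}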

@CyrusNajmabadi
Member

That's not true for any hashing algorithm. A good algorithm makes it unlikely, but it's never impossible.

It's fine for our purposes. The chance of collisions is so low under any sort of realistic use case as to be the same as impossible. There is a better chance of RAM bits being flipped by cosmic rays than there is of a collision here, and we don't worry about or try to deal with the former.
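As a rough birthday-bound estimate (my numbers, assuming on the order of 10^9 distinct documents ever hashed): with a b-bit hash and n inputs, the collision probability is roughly n^2 / 2^(b+1). For n = 10^9 and b = 160 that is about 10^18 / 2^161 ≈ 3×10^-31, and even for b = 128 it is about 1.5×10^-21, far below the rate of random memory bit flips.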

@blowdart

blowdart commented Feb 19, 2019

Ah now I remember. SHA1 just puts things on my radar.

Moving away from SHA1 or even SHA256 would remove the need for the endless exemption. Still wondering why Murmur over Marvin; it might be interesting to compare performance there.

(But you know if you say SHA three times in front of a mirror then either Levi or I will appear)

@CyrusNajmabadi
Member

Ah now I remember. SHA1 just puts things on my radar.

:)

Still wondering why Murmur over Marvin; it might be interesting to compare performance there.

I don't remember the history (even though I did the Murmur impl). Murmur was just extremely simple to get right, doesn't allocate, and had lots of information available on the net, as well as public domain code to refer to. I'm looking for info on Marvin hashing, and not much is coming up. So it probably was just a decision to use something that fit our needs and would be supremely easy to use.

If Marvin is better, I would have no problem with Roslyn switching to it. Is there a nice .NET version available somewhere?

@CyrusNajmabadi
Member

CyrusNajmabadi commented Feb 19, 2019

Roslyn's critical requirements for the hash function:

  1. Cannot be MD5 or SHA1. These are banned for all purposes (even non-cryptographic scenarios).
  2. Non-allocating.
  3. Can provide a reasonably uniform distribution of hash results from inputs that are System.Strings (not bytes/arrays).
  4. Needs to handle strings in both a case-sensitive and a case-insensitive manner (again, without allocating).
  5. Fast. We potentially need to hash millions of strings, and we are constantly rehashing as users edit.
  6. Deterministic across sessions, i.e. we need people to be able to close VS, log in/out, reboot, etc. and get the exact same result on the same inputs.

Non-goals of the hash function:

  1. Cryptographically secure.
  2. Handle non-strings.

Murmur2 fit all of the above, and asking the BCL for an equivalent at the time yielded nothing built-in that we could use.
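For illustration only (this is not the Roslyn implementation, which uses Murmur2 in BloomFilter.cs), a hash meeting requirements 2 through 6 can be as small as an FNV-1a style loop over case-normalized chars:

// Illustrative sketch only: non-allocating, deterministic across sessions, and
// case-insensitive without creating a lowercased copy of the string.
static class StringHash
{
    public static int Compute(string text, bool ignoreCase)
    {
        const uint OffsetBasis = 2166136261;
        const uint Prime = 16777619;

        uint hash = OffsetBasis;
        foreach (char c in text)
        {
            char ch = ignoreCase ? char.ToLowerInvariant(c) : c;
            hash = unchecked((hash ^ ch) * Prime);   // FNV-1a step over each UTF-16 code unit
        }

        return unchecked((int)hash);
    }
}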

@blowdart

We use Marvin for string hash codes in Core, I believe.

Not really fussed here, it's not crypto related :)

@CyrusNajmabadi
Member

We use Marvin for string hash codes in Core, I believe.

Do you know if it would satisfy all of the above requirements? For example, could you get a case-insensitive hash using Marvin in a non-allocating fashion?

Not really fussed here, it's not crypto related :)

No worries! I'm just trying to learn more myself. I would be happy to get rid of our hand-written code if there is framework code that can do it for us!

@blowdart

I honestly don't know, because that dips into implementation details, and I'm just a PM :)

@GrabYourPitchforks might.

@heejaechang
Contributor Author

heejaechang commented Feb 20, 2019

For collisions, regardless of theoretical strength, we want no collision ever happening in practice: not just in one repo but across repos, as long as someone is not purposely trying to make things collide. That is why we chose SHA1.

..

I just ran this on my Roslyn src folder:

# Count of items and total size (in MB) under the folder, then the average size per item.
$foo = (Get-ChildItem -path "c:\your\file\path" -recurse | Measure-Object | select -expand Count)
$bar = ((Get-ChildItem -path "c:\your\file\path" -recurse | Measure-Object -property length -sum).sum / 1MB)
$avrg = $bar / $foo
$avrg

and it gives about 16 KB as the average.

...

But I don't think we want a hash that shows different characteristics based on the size of the input.

@sharwell
Member

Currently the implementation assumes there will not be collisions, and does not implement collision detection or handling in the synchronization algorithm. The most likely observed side effect following a collision would be an application crash. However, assuming this was corrected, the characteristics would change substantially. The application would be highly tolerant of short-lived collisions (e.g. a checksum generated while typing that changes the next time a character is entered), but less tolerant of long-lived collisions (e.g. a user completes typing while the document is in a colliding state). In addition, the observed impact of collisions between objects of different types could be restricted to performance degradation proportional to the serialized size of the objects.

The majority of hashes generated by the system are used for synchronization of transient assets between processes on the same system. All of these hashes could be replaced with an object tagging system (rather than a content-based hash) to substantially reduce the startup cost and guarantee collisions are avoided.
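A minimal sketch of the object-tagging idea (hypothetical, not an existing Roslyn type): each object gets a lazily assigned, process-unique ID, so no content ever needs to be hashed. The trade-off discussed above is that such IDs are not stable across sessions or processes.

using System.Runtime.CompilerServices;
using System.Threading;

// Hypothetical tagging scheme: IDs are assigned on first use and live only as
// long as the object and the process do.
static class ObjectTag
{
    private static long s_nextId;
    private static readonly ConditionalWeakTable<object, StrongBox<long>> s_ids = new();

    public static long GetId(object instance) =>
        s_ids.GetValue(instance, _ => new StrongBox<long>(Interlocked.Increment(ref s_nextId))).Value;
}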

@CyrusNajmabadi
Member

Currently the implementation assumes there will not be collisions, and does not implement collision detection or handling in the synchronization algorithm.

Right. All of that would be unnecessary (given that we don't have hostile actors we have to be worried about). The chance of a collision is so low as to be practically meaningless. As mentioned above, it would make more sense for us to worry about bitflips and how to deal with them versus needing to be concerned with hash collisions here.

@GrabYourPitchforks
Member

Right. All of that would be unnecessary (given that we don't have hostile actors we have to be worried about). The chance of a collision is so low as to be practically meaningless.

Are the hashes computed over source documents? If so, are they persisted anywhere? (You mentioned needing something stable across reboots.) The scenario I'm thinking of is that somebody opens a .cs file provided by an adversary, and it persists something to the file system that causes VS to misbehave from then on due to a collision.

@CyrusNajmabadi
Member

The scenario I'm thinking of is that somebody opens a .cs file provided by an adversary, and it persists something to the file system that causes VS to misbehave from then on due to a collision.

This is already covered by the VS threat model. Opening a .cs file from an adversary has never been safe (for many reasons**), so VS even warns in this case to only open things you trust.

--

** For example, things like designers will start up and will literally examine and just execute code in the .cs file. The threat from code execution is staggeringly higher than any concern about producing slightly incorrect IDE assistance because we got a file that collided.

@GrabYourPitchforks
Member

Hi all, spent some time digging around. There's support on the runtime side for exposing non-cryptographic hash algorithms via their own APIs (see https://github.com/dotnet/corefx/issues/25666 for one proposal). Just pulling things out of thin air my gut tells me that if this API gets off the ground we'd probably start with CRC32, Marvin32/64, and xxHash32/64 as a first pass, since the worker code to all of those already exists in Core in some fashion. I'm not optimistic about this making it in to 3.0, but we could ask @terrajobst and others to fast-track it when 3.1 opens up.

@tmat
Member

tmat commented Feb 25, 2019

@GrabYourPitchforks @terrajobst Would it be possible to publish the implementation as a source package? Also, we need the impl to work on .NET 4.7.2, at least once it has a hash alg that we can use as a replacement for SHA1.

@terrajobst
Member

Would it be possible to publish the implementation as a source package?

Possible, yes, but not desirable. We just ran through this exercise for JSON. It's too fragile, even for internal use.

@bcronce

bcronce commented Apr 22, 2019

I was doing some prototyping on a system where I needed a uniqueness guarantee but didn't really care about crypto. I found a pure native implementation of BLAKE2 from their official site that outperformed the .NET SHA1 for me.

BLAKE2 is based on BLAKE, a SHA-3 finalist that only lost because it's more difficult to implement in hardware, but it's really fast in software. It was made by some of the best cryptographers in the business, who have a track record of creating crypto and finding faults in other crypto.

I consider BLAKE2 my go-to hashing algorithm when I want the properties of crypto strength but don't care about attackers.

@jinujoseph jinujoseph modified the milestones: 16.1.P1, 16.2 Apr 24, 2019
@jinujoseph jinujoseph modified the milestones: 16.2, Backlog Jun 9, 2019
@CyrusNajmabadi
Member

@sharwell do we need this issue? We have moved to SHA256 here:

new ObjectPool<IncrementalHash>(() => IncrementalHash.CreateHash(HashAlgorithmName.SHA256), size: 20);

If we have an actual scenario where we see hash perf showing up, we can use that as a driving function. However, this issue seems speculative and not necessary to just have floating around.
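For reference, a pooled IncrementalHash like the one above would be consumed roughly like this (a sketch: Allocate/Free are Roslyn's ObjectPool members, and the actual call sites differ):

// Rent a hasher from the pool, hash the content, and return the hasher.
byte[] ComputeChecksum(ObjectPool<IncrementalHash> pool, byte[] data)
{
    IncrementalHash hash = pool.Allocate();
    try
    {
        hash.AppendData(data);
        return hash.GetHashAndReset();   // resets state so the instance can be reused from the pool
    }
    finally
    {
        pool.Free(hash);
    }
}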

@saucecontrol
Member

SHA256 is still quite slow. If this is a bottleneck, I've published a BLAKE2 implementation that's roughly 5x faster than Windows CNG's SHA256: https://github.com/saucecontrol/Blake2Fast#blake2fast-vs-net-in-box-algorithms-md5-and-sha2

That's making use of AVX2 intrinsics in netcoreapp3.x, but the scalar fallback is still at least twice as fast as SHA256, even on netfx 32-bit.

@CyrusNajmabadi
Member

@saucecontrol does this work for these requirements: #33411 (comment)?

@saucecontrol
Member

It works as-is for everything except case-insensitive string hashing. That can be done by pushing individual case-normalized characters (or perhaps fixed windows of case-normalized substrings) into the hash state. The API is Span-based, so it's non-allocating and flexible.

@saucecontrol
Member

saucecontrol commented Mar 8, 2020

I assume you don't care about endianness if you're wanting to hash System.String values directly? Or would you want something that normalizes both case and byte order?

@CyrusNajmabadi
Member

That can be done by pushing individual case-normalized characters

Could you link me to where it supports pushing characters? Thanks!

I assume you don't care about endianness if you're wanting to hash System.String values directly?

Correct. :)

@saucecontrol
Member

I've updated the API since the last version published to NuGet, but you can try it out with the latest CI build

You can use Blake2b.CreateIncrementalHasher(), which will return the hash state struct. That has an Update() that accepts a value or a Span of values:

https://github.com/saucecontrol/Blake2Fast/blob/master/src/Blake2Fast/Blake2b/Blake2bHashState.cs#L155-L183

So you can call that with aString.AsSpan(), or you could case-normalize a string a chunk at a time into a fixed buffer, or just grab a character at a time to update the hash state. Updating the state simply pushes new bytes into a buffer until a block is full, at which point the actual hash state is updated, so it's very lightweight.
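Putting that together, a case-insensitive hash over a string might look like the sketch below; Update() is the member described above, while the Finish() call that produces the digest is my assumption about the incremental API.

// Push case-normalized chars into the Blake2Fast incremental state, one at a
// time, without allocating a lowercased copy of the string.
byte[] HashOrdinalIgnoreCase(string text)
{
    var hasher = Blake2b.CreateIncrementalHasher();
    foreach (char c in text)
        hasher.Update(char.ToLowerInvariant(c));
    return hasher.Finish();   // assumed finalization member
}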

@GrabYourPitchforks
Member

FYI, on 64-bit platforms SHA512 tends to outperform SHA256, especially for larger inputs (anything over a few dozen bytes), as is the case here. If you're going for raw speed and you need to use something built-in, it could be a stop-gap measure.
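If that stop-gap were taken, the change would be as small as swapping the algorithm name in the pool quoted earlier (sketch):

// Same pooled setup as above, but SHA512 tends to be faster than SHA256 on
// 64-bit for larger inputs.
new ObjectPool<IncrementalHash>(() => IncrementalHash.CreateHash(HashAlgorithmName.SHA512), size: 20);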

@saucecontrol
Member

For that matter, MD5 is faster than both SHA2 variants on both platforms if you don't require cryptographic security.

on 64-bit platforms, SHA512 tends to outperform SHA256

BLAKE2 has similar characteristics. With scalar implementations, the 256-bit BLAKE2s variant runs faster on 32-bit while 512-bit BLAKE2b runs faster on 64-bit. With SIMD implementations, BLAKE2b is always faster.

@CyrusNajmabadi
Member

I'm fine with any system that meets the requirements stated in #33411 (comment). We don't need cryptographic security. We just want a reasonable hasher.

@tmat
Member

tmat commented Mar 9, 2020

@CyrusNajmabadi In addition to those requirements, it can't be MD5 or SHA1.

@CyrusNajmabadi
Member

Interesting, is that an external requirement/mandate @tmat? Nothing about the scenarios where we use these hashes seems like it would preclude those (at least from Roslyn's perspective).

@CyrusNajmabadi
Member

Thanks @tmat. I have added that to our criteria.

@jack-pappas

It may be worth taking a look at BLAKE3 here -- it's much faster than SHA-1, SHA-256, MD5, and BLAKE2.

https://github.com/BLAKE3-team/BLAKE3

@CyrusNajmabadi
Member

Closing. We moved to XxHash128.
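For reference, a minimal sketch of a content checksum with XxHash128 (assuming the System.IO.Hashing package; this is not the actual Roslyn call site):

using System.IO.Hashing;
using System.Text;

// 128-bit non-cryptographic content checksum.
static class ContentChecksum
{
    public static byte[] Compute(string content) =>
        XxHash128.Hash(Encoding.UTF8.GetBytes(content));
}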

@CyrusNajmabadi closed this as not planned on Oct 21, 2024
@dotnet locked and limited conversation to collaborators on Oct 21, 2024