use different hash algorithm than SHA1 in checksum for better performance #33411
We are reviewing MurmurHash 3 internally. The implementation I'm most familiar with is the one I used in ANTLR 4.
📝 I'll also review the keys for collision resilience, and set up an A/B test that uses a very poor hash to ensure collisions occur periodically during our testing, so we understand the impact they would have at less frequent occurrences in the wild.
It would be worthwhile to validate assumptions around SHA256 at the same time, to settle that argument once and for all at least.
Alternative proposal: unique ID generation. This would have essentially the runtime behavior of lazily calculating the hash of an object by simply calling
The checksum is designed to be content-based so it can be stateless. We already have a unique ID (VersionStamp) that is only valid within the same VS session, and we attempt to reuse it between sessions using file stamps. Having such a system at the lower layer, where it is hard to guarantee data integrity between processes or between VS sessions, makes any system built on top of it hard to build and maintain: it basically requires something to guarantee the validity of the data by holding extra state, making the whole thing stateful. A content-based checksum lets anyone pick up data such as diagnostics, bloom filters, etc. simply by recalculating the checksum of the source content itself; it doesn't matter who generated the data as long as the checksum is the same, which makes it easy to distribute (no central control). Content cached by checksum can also be validated by recalculating the checksum again, keeping the whole system stateless.
I already know the impact of a collision. We use the checksum as the key for data; the data can be the content itself or information calculated from the content. A collision means the wrong content (source, metadata, options, or any other data the solution is using) is used to build the solution, or that calculated information such as a symbol index or syntax tree index for the wrong content is used by features. The user-facing impact would be things like wrong diagnostics showing up, Find All References showing the wrong result, Add Using not showing up, Go To not finding a match, and so on, or a crash if completely wrong data gets picked up from the cache. The system assumes there are no collisions. By the way, if you want to see this yourself, change the checksum type to produce the same value for everything and see what happens; there's no reason to need a very poor hash to see the impact.
That is not true; it is just one use case for the checksum. For example, once things are synced between devenv and the OOP process, the checksum is used to verify that the content is correct.
Right, that's what I meant. Collisions are bad if they happen in either process, but maybe don't matter if they happen in an unrelated process.
Is there any correctness requirement that the checksum be based on content, or is that just an implementation detail? Using a unique ID assignment to avoid the recalculation of checksums from content could save us a substantial amount of processing.
Failure to correctly keep track of unique IDs would result in cache misses across processes, but would not result in the use of invalid data (or more specifically, the incorrect reuse of data from one context in a different context).
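For illustration, here is a minimal sketch of what such process-local unique-ID tagging could look like. The type and member names are hypothetical (not anything in Roslyn), and it assumes an ID only needs to stay stable for the lifetime of the tagged object within a single process.

```csharp
// Hypothetical sketch: tag objects with process-local unique IDs instead of recomputing
// a content hash. Losing the table only causes cache misses, never reuse of wrong data.
using System.Runtime.CompilerServices;
using System.Threading;

static class ObjectIdTagger
{
    private static readonly ConditionalWeakTable<object, StrongBox<long>> s_ids = new();
    private static long s_nextId;

    // Returns a stable ID for the lifetime of the instance, assigning one lazily on first use.
    public static long GetOrAssignId(object instance) =>
        s_ids.GetValue(instance, _ => new StrongBox<long>(Interlocked.Increment(ref s_nextId))).Value;
}
```

A scheme like this trades the cross-process, cross-session validity of a content hash for much cheaper ID assignment, which is exactly the trade-off described above.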
Sometimes this degrades performance to the point that the system becomes unusable. Also, some systems that produce incorrect results in a collision scenario self-correct when the checksums no longer collide. These systems rely on the infrequent occurrence of collisions and are not able to handle cases where collisions are constant. It's easy to ensure collisions are rare but difficult to make them non-existent.
The whole system is built on top of the collision-resistance strength of SHA. What you are saying is like asking what is so wrong with two digital certificates having the same signature.
What usage of SHA1 are we talking about? In the compiler?
@gafter There are uses on both the compiler and IDE sides, though this task is specifically for the IDE. The compiler work item is at https://devdiv.visualstudio.com/DevDiv/_workitems/edit/788330
Do we have any data for that? How much time do we spend on SHA1 hashing currently?
It's shown up several times internally, but hasn't risen to the point that it took priority over other perf work. A specialized testing harness will need to be created that covers the checksum process (this issue) and the serialization process (#23331).
Also, please come talk to @GrabYourPitchforks and me about this. Given that we use Marvin32 for strings, it'd be an interesting discussion around why Murmur and not Marvin.
Note: Roslyn already has an implementation of murmur hash (v2, though) here: https://github.com/dotnet/roslyn/blob/master/src/Workspaces/Core/Portable/Shared/Utilities/BloomFilter.cs I tested it originally, and found that on all of Roslyn's inputs we never collided with a suitably high number of bits (which I think was fewer than the 160 used by SHA1). Note that for Roslyn's needs, we're not going for cryptographic strength, nor do we have any sort of attacker/adversary. All we're trying to do is hash fast with an extremely low chance of collisions. We then use that hash for content lookup. An inappropriate collision is undesirable, but isn't something that would be disastrous.
A few things I care about here:
That's not true for any hashing algorithm. A good algorithm makes it unlikely, but it's never impossible. And then there's the phrase "guarantee data integrity". Both of these become security flags :) Admittedly this could be a misunderstanding based on wording, and given what Roslyn does it probably is, but still, I'd like to get a clear picture of what the hashes are used for.
This is not a security domain. We are just using this for content hashing, a simple way to have our editor refer to a snapshot of a file. We've gone through the threat modeling here in the past. The use of a hashing algorithm here is just to get a content ID; it's not for a cryptographic security purpose. Please don't make it into one.
It's very simple. We often persist harmless data to disk (like an index of all the names used in a file). We want to know if we should update that index, so we store a hash of the original file that was used to compute the index. If that hash changes, we recompute the index; if it doesn't, we keep the old index around. The worst case here is that we somehow get a staggeringly unlikely collision. In that case, our index will just be a little incorrect, and all that means is that if you do something like run "find all references" in VS, you get a slightly wrong answer. Note that to even get a collision would mean the user was somehow trying to subvert us, as any sort of normal edit/change has less chance of colliding than the world spontaneously imploding tomorrow.
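As a concrete illustration of that flow (not Roslyn's actual code; the names here are hypothetical, and it assumes .NET 5+ for the static SHA256.HashData helper), a minimal sketch of the persist-and-revalidate pattern might look like this:

```csharp
// Hypothetical sketch of the pattern described above: persist an index to disk together
// with a hash of the source it was computed from; reuse the index only if the hash still matches.
using System;
using System.IO;
using System.Security.Cryptography;

static class NameIndexCache
{
    public static string[] GetNames(string sourcePath, string indexPath, string hashPath)
    {
        byte[] currentHash = SHA256.HashData(File.ReadAllBytes(sourcePath));

        if (File.Exists(indexPath) && File.Exists(hashPath) &&
            currentHash.AsSpan().SequenceEqual(File.ReadAllBytes(hashPath)))
        {
            return File.ReadAllLines(indexPath);        // stored hash matches: reuse the old index
        }

        string[] names = BuildNameIndex(sourcePath);    // hash changed (or nothing cached): recompute
        File.WriteAllLines(indexPath, names);
        File.WriteAllBytes(hashPath, currentHash);
        return names;
    }

    // Stand-in for the real (expensive) index computation.
    private static string[] BuildNameIndex(string sourcePath) =>
        File.ReadAllText(sourcePath)
            .Split(new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
}
```

In this pattern a collision only means a stale index is reused; as described above, the result is a slightly wrong answer from a feature like Find All References, not data corruption.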
It's fine for our purposes. The chance of collisions is so low under any sort of realistic use case as to be the same as impossible. There are better chances of RAM bits being flipped by cosmic rays than there is of a collision here, and we don't worry or try to deal with the former.
Ah, now I remember. SHA1 just puts things on my radar. Moving away from SHA1 or even SHA256 would remove the need for the endless exemption. Still wondering why Murmur over Marvin; it might be interesting to compare performance there. (But you know, if you say SHA three times in front of a mirror, then either Levi or I will appear.)
:)
I don't remember the history (even though I did the Murmur implementation). Murmur was just extremely simple to get right, doesn't allocate, and had lots of available information on the net, as well as public-domain code to refer to. I'm looking for info on Marvin hashing, and not much is coming up. So it probably was just a decision to use something that fit our needs and would be supremely easy to use. If Marvin is better, I would have no problem with Roslyn switching to it. Is there a nice .NET version available somewhere?
Roslyn's critical requirements for the hash function:
Non-goals of the hash function:
Murmur2 fit all of the above, and asking the BCL for an equivalent at the time yielded nothing built-in that we could use.
We use Marvin for string hash codes in Core, I believe. Not really fussed here; it's not crypto related :)
Do you know if it would satisfy all of the above requirements? For example, could you get a case-insensitive hash using Marvin in a non-allocating fashion?
No worries! I'm just trying to learn more myself. I would be happy to get rid of our hand-written code if there is framework code that can do it for us!
I honestly don't know, because that dips into implementation details, and I'm just a PM :) @GrabYourPitchforks might.
Regarding collisions: regardless of theoretical strength, we want no collision ever happening in practice, not just in one repo but across repos, as long as someone is not purposely trying to make things collide. That is why we chose SHA1. ... I just ran this on my Roslyn src folder: $foo = (Get-ChildItem -Path "c:\your\file\path" -Recurse -File | Measure-Object -Property Length -Average).Average and it gives 16KB as the average file size. ... But I don't think we want a hash that shows different characteristics based on the size of the input.
Currently the implementation assumes there will not be collisions, and does not implement collision detection or handling in the synchronization algorithm. The most likely observed side effect following a collision would be an application crash. However, assuming this was corrected, the characteristics would change substantially. The application would be highly tolerant of short-lived collisions (e.g. a checksum generated while typing that changes the next time a character is entered), but less tolerant of long-lived collisions (e.g. a user completes typing while the document is in a colliding state). In addition, the observed impact of collisions between objects of different types could be restricted to performance degradation proportional to the serialized size of the objects. The majority of hashes generated by the system are used for synchronization of transient assets between processes on the same system. All of these hashes could be replaced with an object tagging system (rather than a content-based hash) to substantially reduce the startup cost and guarantee collisions are avoided.
Right. All of that would be unnecessary (given that we don't have hostile actors we have to be worried about). The chance of a collision is so low as to be practically meaningless. As mentioned above, it would make more sense for us to worry about bitflips and how to deal with them versus needing to be concerned with hash collisions here.
Are the hashes computed over source documents? If so, are they persisted anywhere? (You mentioned needing something stable across reboots.) The scenario I'm thinking of is that somebody opens a .cs file provided by an adversary, and it persists something to the file system that causes VS to misbehave from then on due to a collision.
This is already covered by the VS threat model. Opening a .cs file from an adversary has never been safe (for many reasons**). So VS even warns in this case to only open things you trust. -- ** For example, things like designers will start up and will literally examine and just execute code in the .cs file. The threat from code execution is staggeringly higher than any concern about producing slightly incorrect IDE assistance because we got a file that collided.
Hi all, I spent some time digging around. There's support on the runtime side for exposing non-cryptographic hash algorithms via their own APIs (see https://github.com/dotnet/corefx/issues/25666 for one proposal). Just pulling things out of thin air, my gut tells me that if this API gets off the ground we'd probably start with CRC32, Marvin32/64, and xxHash32/64 as a first pass, since the worker code for all of those already exists in Core in some fashion. I'm not optimistic about this making it into 3.0, but we could ask @terrajobst and others to fast-track it when 3.1 opens up.
@GrabYourPitchforks @terrajobst Would it be possible to publish the implementation as a source package? Also, we need the implementation to work on .NET 4.7.2, at least once it has a hash algorithm that we can use as a replacement for SHA1.
Possible yes, but not desirable. We just ran through this exercise for JSON. It's too fragile, even for internal use.
I was doing some prototyping on a system where I needed a uniqueness guarantee but didn't really care about crypto. I found a pure native implementation of BLAKE2 from their official site that outperformed the .NET SHA1 for me. BLAKE2 was a SHA3 finalist and only lost because it's more difficult to implement in hardware, but it's really fast in software. It was made by some of the best cryptographers in the business, who have a track record of creating crypto and finding faults in other crypto. I consider BLAKE2 my go-to hashing algorithm when I want the properties of crypto strength but don't care about attackers.
@sharwell Do we need this issue? We have moved to SHA256 here:
If we have an actual scenario where hash perf shows up, we can use that as a driving function. However, this issue seems speculative and not necessary to just have floating around.
SHA256 is still quite slow. If this is a bottleneck, I've published a BLAKE2 implementation that's roughly 5x faster than Windows CNG's SHA256: https://github.com/saucecontrol/Blake2Fast#blake2fast-vs-net-in-box-algorithms-md5-and-sha2 That's making use of AVX2 intrinsics in netcoreapp3.x, but the scalar fallback is still at least twice as fast as SHA256, even on netfx 32-bit.
@saucecontrol Does this work for these requirements: #33411 (comment)?
It works as-is for everything except case-insensitive string hashing. That can be done by pushing individual case-normalized characters (or perhaps fixed windows of case-normalized substrings) into the hash state. The API is Span-based, so it's non-allocating and flexible.
I assume you don't care about endianness if you're wanting to hash System.String values directly? Or would you want something that normalizes both case and byte order?
Could you link me to where it supports pushing characters? Thanks!
Correct. :)
I've updated the API since the last version published to NuGet, but you can try it out with the latest CI build. You can use the incremental hashing API, so you can call it with case-normalized characters.
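To make the case-insensitive, non-allocating idea concrete, here is a minimal sketch of pushing case-normalized characters into an incremental hash state in fixed-size windows. It is illustrated with System.IO.Hashing.XxHash128 (the algorithm the issue ultimately moved to) rather than Blake2Fast; the method name is hypothetical, and it assumes .NET 7+ (for UInt128) plus the System.IO.Hashing package, with invariant upper-casing and native endianness accepted per the discussion above.

```csharp
// Hypothetical sketch: case-insensitive hashing without allocations, by normalizing case
// into a stack buffer and appending the normalized characters to an incremental hash state.
// Characters are hashed in native byte order, matching the "don't care about endianness" answer above.
using System;
using System.IO.Hashing;
using System.Runtime.InteropServices;

static class CaseInsensitiveHasher
{
    public static UInt128 HashIgnoreCase(ReadOnlySpan<char> text)
    {
        var hasher = new XxHash128();
        Span<char> window = stackalloc char[256];

        while (!text.IsEmpty)
        {
            int count = Math.Min(window.Length, text.Length);
            text.Slice(0, count).ToUpperInvariant(window);                  // normalize a window of chars
            hasher.Append(MemoryMarshal.AsBytes(window.Slice(0, count)));   // push them into the hash state
            text = text.Slice(count);
        }

        return hasher.GetCurrentHashAsUInt128();
    }
}
```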
FYI, on 64-bit platforms SHA512 tends to outperform SHA256, especially for larger inputs (anything over a few dozen bytes), as is the case here. If you're going for raw speed and you need to use something built-in, it could be a stop-gap measure.
For that matter, MD5 is faster than both SHA2 variants on both platforms if you don't require cryptographic security.
BLAKE2 has similar characteristics. With scalar implementations, the 256-bit BLAKE2s variant runs faster on 32-bit while 512-bit BLAKE2b runs faster on 64-bit. With SIMD implementations, BLAKE2b is always faster.
I'm fine with any system that meets the requirements stated in #33411 (comment). We don't need cryptographic security. We just want a reasonable hasher.
@CyrusNajmabadi In addition to those requirements, it can't be MD5 or SHA1.
Interesting. Is that an external requirement/mandate, @tmat? Nothing about the scenarios where we use these hashes seems like it would preclude those (at least from Roslyn's perspective).
Thanks @tmat. I have added that to our criteria.
It may be worth taking a look at BLAKE3 here -- it's much faster than SHA-1, SHA-256, MD5, and BLAKE2.
Closing. We moved to xxhash128.
@sharwell believes SHA1 is too slow for our checksum, so he wants us to use a different algorithm than SHA1 to make hashing faster.