Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

General purpose non-cryptographic hashing API for .NET #24328

Closed
jamesqo opened this issue Dec 3, 2017 · 34 comments · Fixed by #53623
Closed

General purpose non-cryptographic hashing API for .NET #24328

jamesqo opened this issue Dec 3, 2017 · 34 comments · Fixed by #53623
Assignees
Labels
api-approved API was approved in API review, it can be implemented area-System.Numerics Priority:1 Work that is critical for the release, but we could probably ship without
Milestone

Comments

@jamesqo
Copy link
Contributor

jamesqo commented Dec 3, 2017

Context from https://github.com/dotnet/corefx/issues/14354#issuecomment-348003785:

@JonHanna, I'd be interested to hear how your testing goes so we can start thinking about what would be useful in a general-purpose non-cryptographic hashing API.

@morganbr Where would be an appropriate forum to discuss such an API? I expect that such an API would consist of more than just the lowest common denominator, and maybe a good API will also need an improved JIT w.r.t. handling of larger structs. Discussing all that might be better done in a separate issue...

/cc @gimpf, @morganbr, @JonHanna

API Proposal

  • Use a distinct type name for size variants of the same algorithm. e.g. CRC-32 is Crc32 and CRC-64 is Crc64, instead of one CRC class that has a property to adjust the size.
  • Instance methods are based on System.Security.Cryptography.IncrementalHash, and provide for an accumulator model.
  • Static methods provide convenience alternatives for common scenarios where the data to hash is already in a contiguous buffer.
  • An inheritance model is included in this proposal, mainly just so that a single implementation can be provided for things like Append(Stream). It's not generally expected that someone would have an API that takes a parameter of the base class type.
  • The initial proposal includes two major algorithms, with two variants each, but should be viewed as framing the shape of all potential additions.
    • Cyclic Redundancy Check (CRC):
      • CRC-32
      • CRC-64
    • xxHash
      • xxHash32
      • xxHash64

There are a couple of items that are undecided:

  • Namespace
    • Proposed names include:
      • System.Buffers.Hashing
      • System.IO.Hashing
      • System.Hashing
  • DLL and delivery
    • Inbox, new DLL, named the namespace.
    • Inbox, existing DLL (specific one TBD)
    • OOB (new DLL/package, named the namespace)

Low-concept version

namespace System.Buffers.Hashing
{
    public abstract class NonCryptographicHashAlgorithm
    {
        // Proposed alternative names include HashLengthInBytes
        public int Length { get; }

        protected NonCryptographicHashAlgorithm(int length) { }

        public abstract void Append(ReadOnlySpan<byte> source);
        public abstract void Reset();
        protected abstract int GetCurrentHashCore(Span<byte> destination);

        public void Append(byte[] source) { }
        public void Append(Stream stream) { }
        public byte[] GetCurrentHash() => throw null;
        public byte[] GetHashAndReset() => throw null;
    }
    public class Crc32 : NonCryptographicHashAlgorithm
    {
        public Crc32() : base(32) { }
        public override void Append(ReadOnlySpan<byte> source) { }

        public static byte[] Hash(byte[] source) => throw null;
        public static byte[] Hash(ReadOnlySpan<byte> source) => throw null;
        public static bool TryHash(ReadOnlySpan<byte> source, Span<byte> destination, out int bytesWritten) => throw null;
        public static int Hash(ReadOnlySpan<byte> source, Span<byte> destination) => throw null;
    }
   // Repeat Crc32 class for every variant, just with a different name.
}

Full proposal

namespace System.Buffers.Hashing
{
    public abstract class NonCryptographicHashAlgorithm
    {
        // Proposed alternative names include HashLengthInBytes
        public int Length { get; }

        protected NonCryptographicHashAlgorithm(int length) { }

        public abstract void Append(ReadOnlySpan<byte> source);
        public abstract void Reset();
        protected abstract int GetCurrentHashCore(Span<byte> destination);

        public void Append(byte[] source) { }
        public void Append(Stream stream) { }
        public Task AppendAsync(Stream stream, CancellationToken cancellationToken = default) { }
        public byte[] GetCurrentHash() => throw null;
        public bool TryGetCurrentHash(Span<byte> destination, out int bytesWritten) => throw null;
        public int GetCurrentHash(Span<byte> destination) => throw null;
        public byte[] GetHashAndReset() => throw null;
        public bool TryGetHashAndReset(Span<byte> destination, out int bytesWritten) => throw null;
        public int GetHashAndReset(Span<byte> destination) => throw null;

        // Virtual in case there's a more efficient way the algorithm can close itself.
        protected virtual int GetHashAndResetCore(Span<byte> destination)
        {
            int ret = GetCurrentHashCore(destination);
            Reset();
            return ret;
        }
    }
    public class XxHash32 : NonCryptographicHashAlgorithm
    {
        public XxHash32() : base(32) { }
        public override void Append(ReadOnlySpan<byte> source) { }
        public override void Reset() { }
        protected override int GetCurrentHashCore(Span<byte> destination) => throw null;

        public static byte[] Hash(byte[] source) => throw null;
        public static byte[] Hash(ReadOnlySpan<byte> source) => throw null;
        public static bool TryHash(ReadOnlySpan<byte> source, Span<byte> destination, out int bytesWritten) => throw null;
        public static int Hash(ReadOnlySpan<byte> source, Span<byte> destination) => throw null;
    }
    public class XxHash64 : NonCryptographicHashAlgorithm
    {
        public XxHash64() : base(64) { }
        public override void Append(ReadOnlySpan<byte> source) { }
        public override void Reset() { }
        protected override int GetCurrentHashCore(Span<byte> destination) => throw null;

        public static byte[] Hash(byte[] source) => throw null;
        public static byte[] Hash(ReadOnlySpan<byte> source) => throw null;
        public static bool TryHash(ReadOnlySpan<byte> source, Span<byte> destination, out int bytesWritten) => throw null;
        public static int Hash(ReadOnlySpan<byte> source, Span<byte> destination) => throw null;
    }
    public class Crc32 : NonCryptographicHashAlgorithm
    {
        public Crc32() : base(32) { }
        public override void Append(ReadOnlySpan<byte> source) { }
        public override void Reset() { }
        protected override int GetCurrentHashCore(Span<byte> destination) => throw null;

        public static byte[] Hash(byte[] source) => throw null;
        public static byte[] Hash(ReadOnlySpan<byte> source) => throw null;
        public static bool TryHash(ReadOnlySpan<byte> source, Span<byte> destination, out int bytesWritten) => throw null;
        public static int Hash(ReadOnlySpan<byte> source, Span<byte> destination) => throw null;
    }
    public class Crc64 : NonCryptographicHashAlgorithm
    {
        public Crc64() : base(64) { }
        public override void Append(ReadOnlySpan<byte> source) { }
        public override void Reset() { }
        protected override int GetCurrentHashCore(Span<byte> destination) => throw null;

        public static byte[] Hash(byte[] source) => throw null;
        public static byte[] Hash(ReadOnlySpan<byte> source) => throw null;
        public static bool TryHash(ReadOnlySpan<byte> source, Span<byte> destination, out int bytesWritten) => throw null;
        public static int Hash(ReadOnlySpan<byte> source, Span<byte> destination) => throw null;
    }
}

Optional addendum, static members to hash Streams.

These members should validate that there's sufficient write-space before draining the stream, otherwise data has been lost (for non-seekable streams, at least).

namespace System.Buffers.Hashing
{
    partial class Crc64
    {
        public static byte[] Hash(Stream stream) => throw null;
        public static Task<byte[]> HashAsync(Stream stream, CancellationToken cancellationToken = default) => throw null;
        public static int Hash(Stream stream, Span<byte> destination) => throw null;
        public static Task<int> HashAsync(Stream stream, Memory<byte> destination, CancellationToken cancellationToken = default) => throw null;
    }
}
@JonHanna
Copy link
Contributor

JonHanna commented Dec 3, 2017

TBH, I cloned it, tried to publish, got MSB4062, said to myself "I'm probably building with the wrong version of dotnet and/or msbuild, I'll figure it out this weekend" and then forgot all about it :(

In terms of an API, the feedback I've gotten on SpookilySharp has mostly been people who liked it but didn't like the license (I changed it to MIT this weekend, but it had been EUPL for a long time). It's either too good or too bad to get contributors. From my experience with it and from the little feedback I got I would say:

  1. The Rehash method (which produces a hash of an int) was a mistake, even though it was the original purpose of the whole thing. I really needed it for a particular case, but it isn't a general case, and it's probably been misused more than it has been well-used since. There can be cases where rehashing a hash code gives a benefit, but they are rare and difficult to explain well.

  2. There is definitely an audience for fast, high-quality stream hashing. The people who contacted me because they liked the library but couldn't work with the license were mostly using it for hashing entire files. It's also the use I've made of it professionally myself, though it started as something to do over a winter holiday break (well, the thing I needed Rehash for did, but then I ended up doing that instead).

  3. 128-bit, and perhaps larger, hash codes definitely have a demand.

  4. Specified hash codes definitely have a use. Changing from version to version is fine for hash-based collections, but there's a use-case for being able to hash a file and know you'd have the same result with the next version and/or on a different machine. This might be orthogonal to other needs. I can imagine being happy that a hashing library I was using changed to a newly invented algorithm that e.g. made use of intrinsics that don't even exist right now as far as some uses go, and yet also that it would ruin some other uses.

  5. That said, cases that require avoiding hash-DoSsing require some sort of seeding mechanism or other means to prevent predictability of hashes. This would come in even more if an API became the de facto hasher, since crackers could assume it would likely be used.

  6. Async stream hashing likely has a demand, to judge from that for ages I offered async from a pre-release nuget and not from the stable, and pre-release downloads were higher than the stable (luckily it turned out the bug that had kept it in pre-release was actually a mistake in the test). The only advantage of the pre-release was if you were using that async stuff. Also, I'd expect anything that provided stream hashing to provide an async version anyway, just as I'd expect anything that provided stream anything to have an async version.

  7. Wrapping the stream so that I can do whatever I'm doing with the stream (e.g. receiving from web request and saving to disk) and get the hash code at the end is very useful, and avoids some copying or self-coded wrapping that would be necessary otherwise. This is how I ended up using it professionally.

@gimpf
Copy link

gimpf commented Dec 5, 2017

A first few questions, not thought through in detail yet:

What kinds of APIs are the focus here?

  • Common interfaces for, and implementations of different hash algorithms
  • Common interfaces for "hashable" data (to make hash-algorithms pluggable, so that algorithms can be selected depending on the environment)
  • Utilities around those things, like wrappers around Streams, as @JonHanna mentioned, or providers for seeds (to get high-quality random seeds w/o boxing into byte[] and also easier than manually working w/crypto get-random)
  • Changes to existing uses of hashes in .NET: per-dictionary instance seeds, for instance

A few things that I assume:

  • no unnecessary overhead: hash-algo impls can be used allocation free by default whenever possible; that also means that APIs provide access using Span<T> or unsafe (in addition to "regular" types)
  • provide both streaming and fixed-length interfaces
  • hashes will be needed for persistently checksumming things, as well as for only ephemeral uses

Some stuff that I'm not totally sure about yet (back from when my time ran out on the Haschisch playground):

  • How to provide an "easy and good by default" interface for seedable hashes: both the length of the hash-code and the seed depend on the algorithm (most families provide algorithms optimized for specific hash-lengths), and boxing either of them is not an option. It's easy enough to "just use generics", but it can make the client-code unwieldy; also, a generic type that says "value type of size at least n bytes w/o padding" doesn't really exist, and so implementing the interface for only a specific seed-length restricts the use of the hasher. I don't have a solution handy that I'd be satisfied with.
  • What is abound "rewindable" hashes? (for instance, seahash). They are useful for certain niches (like updating the hash of a very large file where only the last few kB where changed), but it's called "niche" for a reason.

Also, what's next?

@morganbr
Copy link
Contributor

morganbr commented Dec 6, 2017

I propose enabling an API for several non-cryptographic hash algorithms as well as being extensible to future/user-provided algorithms. "Non-cryptographic" includes Spooky or Marvin32, but not SHA256 (we shouldn't have any overlap between System.Numerics.Hashing and System.Security.Cryptography to avoid confusion). The first set of algorithms we build can be determined after deciding on an API (using data gathered in #19621). Reasons to use them include:
a. They are much faster than cryptographic hashes
b. Some formats and specifications call for certain hashes, such as CRC32
c. It helps differentiate non-security code

Enabling multiple algorithms is important because they differ by size, speed, entropy, DoS resistance and general usage. Unlike HashCode, this would be for users who want to pick a particular algorithm for a particular purpose.

Some potentially interesting characteristics could include:

  1. General-purpose usability, which might mean some API differences. Allowing abstractions could be useful.
  2. "Pretty good" performance. We had to make some usability sacrifices in HashCode to achieve performance good enough to make dictionary hash codes that we might not here, but we should still try to have it be fast enough that people don't need to implement their own algorithms.
    a. Initially, I've been thinking of having resettable classes for these rather than using structs since copying or using default with a hash struct can cause unintended behavior (bad hashes) that is very difficult to debug.
    b. Hashes with appropriately sized outputs could return those (e.g. Marvin32 could return an int rather than a span/byte[], but Spooky128 couldn't). That could be both a small performance win and a way to make those hashes a bit easier to use.
  3. Working over both spans and streams is likely desirable. If we handle streams, async overloads would be sensible.
  4. Stream wrapping seems useful but may be the hardest API to get right. We've done it with CryptoStream in the past and found it's frequently misused. We can also have an incremental API similar to IncrementalHash where the user can do their own stream management.

Based on all of that, my straw-man API idea, based on IncrementalHash, is below:

namespace System.Numerics.Hashing
{
    public abstract class HashBase
    {
        /// <summary>
        /// Number of bits produced by the hash
        /// </summary>
        public abstract int HashSize { get; }

        /// <summary>
        /// Core hashing function filled in by implementers. Adds data to the internal
        /// state of the hash.
        /// </summary>
        public abstract void AddIncrementalData(ReadOnlySpan<byte> data);

        /// <summary>
        /// Both returns the result of the hash of incrementally added data and resets
        /// the internal state to be ready to hash again.
        /// </summary>
        public abstract bool TryGetHashAndReset(Span<byte> hash);
        public byte[] GetHashAndReset();

        /// <summary>
        /// Used to indicate that the hash state is currently empty and it can start a new hash.
        /// </summary>
        protected abstract bool StateIsEmpty { get; }

        // Various overloads for hashing a single chunk of data and returning the result. 
        // These are intended to cover most usages. These aren't compatible with incremental hashes,
        // and would throw if !StateIsEmpty (and they'll reset when they're done)
        public byte[] ComputeHash(ReadOnlySpan<byte> data);
        public bool TryComputeHash(ReadOnlySpan<byte> data, Span<byte> hash);
        public byte[] ComputeHash(Stream data);
        public bool TryComputeHash(Stream data, Span<byte> hash);
        public async byte[] ComputeHashAsync(Stream data); // No Span overload for this because it can't be used with async
    }

    public abstract class IntSizedHash : HashBase
    {
        int Get32BitHashAndReset();
        int Compute32BitHash(ReadOnlySpan<byte> data);
        int Compute32BitHash(Stream data);
        async Task<int> Compute32BitHashAsync(Stream data);
    }

    public abstract class LongSizedHash : HashBase
    {
        long Get64BitHashAndReset();
        long Compute64BitHash(ReadOnlySpan<byte> data);
        long Compute64BitHash(Stream data);
        async Task<long> Compute64BitHashAsync(Stream data);
    }
}

An example derived class might look like:

namespace System.Numerics.Hashing
{
    public sealed class Marvin32 : IntSizedHash
    {
        private long _seed;
        private bool _isEmpty = true;
        private long _state;

        public Marvin32(long seed) { _seed = seed; }

        public override int HashSize => 32;

        protected override bool StateIsEmpty => _isEmpty;

        public override void AddIncrementalData(ReadOnlySpan<byte> data)
        {
            _isEmpty = false;
            // Incorporate data into _state
        }

        public override bool TryGetHashAndReset(Span<byte> hash)
        {
            // Finalize the hash based on _state and fill it into the span

            // Reset to be reused
            _state = _seed;
            _isEmpty = true;
            return true;
        }
    }
}

CC @bartonjs, @GrabYourPitchforks, @joshfree

@GrabYourPitchforks
Copy link
Member

@morganbr, I like your proposal, though I'd offer some suggestions. First, I don't see the value of the Try* methods. There's never any ambiguity as to how large the destination buffers should be. The caller can always query the HashSize ahead of time to determine how many bytes will be written to the destination. If the caller specifies a destination buffer that's too short, just throw. (This also stems from my belief that a caller that hasn't queried the required size property ahead of time also wouldn't know what to do with a false return value.)

Second, I don't know if the int / long returning methods are terribly useful. The only real use case I see for them is to make something that can be used for GetHashCode, but as you mentioned earlier we already have a type for that specific use case. We'd also have to dictate the endianness guarantees of the int / long based methods.

Finally, should ComputeHash and Reset be different operations? In the crypto hash algorithms there are scenarios where we run a large amount of common data through the sponge function, then we duplicate the internal state of the hash algorithm and use it to produce different hashes based on different trailing data, all in a very efficient manner. I imagine CRC32 also has similar scenarios. This might also necessitate making the class ICloneable and IDisposable, though I haven't thought through the full repercussions yet.

@bartonjs
Copy link
Member

bartonjs commented Dec 6, 2017

In the crypto hash algorithms there are scenarios where we run a large amount of common data through the sponge function, then we duplicate the internal state of the hash algorithm and use it to produce different hashes based on different trailing data, all in a very efficient manner.

Maybe in the native side, but the managed side doesn't allow state cloning. And "getting the hash" requires a HashFInal-type operation, such writing the number of bytes written; so once a hash has been "gotten" it has been tainted. That's why crypto IncrementalHash has GetHashAndReset().

Second, I don't know if the int / long returning methods are terribly useful.

I agree. Byte -> (U)Int{32|64} requires a choice about endianness. "Machine" is wrong, because then the same hash algorithm produces a different value between x86 and arm32. Little is fine, except for when it isn't. So I'd leave the API as bytes; and if some algorithm says it produces an integer it can add the method itself.


Other feedback I have, disjoint from responding to feedback feedback.

AddIncrementalData: The crypto version calls this AppendData, symmetry seems good unless you think there's a reason that symmetry is bad here.

HashSize: Because so much work is done on ROSpan<byte> or byte[] I occasionally find it confusing about if a thing is talking about size in bytes or bits. So HashSizeInBits seems good to me here. And, heck, why not, public int HashSizeInBytes { get; } = checked(HashSizeInBits + 7) / 8;

ComputeHash and friends: Would it make more sense for these to be static? (Extension seems fine) Then the statics need to check clean, but the instances don't need to care if the async Hash(Stream) operation is still running... because as far as they're concerned it doesn't exist. In fact, it seems like making them statics on the specific algorithms is better, because usage data could then drive us to make an optimized one-shot (e.g. for CRC-32 the TryComputeHash static method could be entirely non-allocating (assuming that the static lookup table has already been initialized)). Making the "eh, I can do this for you" helpers as protected static seems doable:

public abstract partial class HashBase
{
    protected static bool ReferenceTryComputeHash<T>(ReadOnlySpan<byte> data, Span<byte> destination) where T : HashBase, new()
    {
        T hasher = new T();
        
        if (destination.Length < hasher.HashSizeInBytes)
        {
            return false;
        }

        hasher.AppendData(data);
        hasher.TryGetHashAndReset(destination);
        // Should that be asserted? (Doesn't help 3rd party assemblies).  A throw? Seems more expensive than justified.
        return true;
    }

    // and/or.
    protected static bool ReferenceTryComputeHash(HashBase hasher, ReadOnlySpan<byte> data, Span<byte> destination)
    {
        if (hasher == null) throw new ArgumentNullException(nameof(hasher));
        if (!hasher.StateIsEmpty) throw new InvalidOperationException(SR.NoTaintedHashesPlease);

        if (destination.Length < hasher.HashSizeInBytes)
        {
            return false;
        }

        hasher.AppendData(data);
        hasher.TryGetHashAndReset(destination);
        // Should that be asserted? (Doesn't help 3rd party assemblies).  A throw? Seems more expensive than justified.
        return true;
    }
}

public partial class Marvin32 : HashBase
{
    public static bool TryComputeHash(ReadOnlySpan<byte> data, Span<byte> destination)
    {
        return ReferenceTryComputeHash<Marvin32>(data, destination);
    }
}

public partial class Crc32 : HashBase
{
    public static bool TryComputeHash(ReadOnlySpan<byte> data, Span<byte> destination)
    {
        if (destination.Length < sizeof(int))
        {
            return false;
        }

        // implementation of CRC-32 over closed data, because our perf/analysis data says this is
        // a popular routine, and that eliminating the Gen0 temporary HashBase-derived object is worthwhile.
        return true;
    }
}

And, finally, the name HashBase probably doesn't fly with our current naming schemes, because we don't like the Base suffix. NonCryptographicHash seems wordy, but maybe that's exactly what we want.

@GrabYourPitchforks GrabYourPitchforks self-assigned this Dec 14, 2017
@ssg
Copy link

ssg commented Sep 9, 2019

This actually originates from the discussions to make GetHashCode implementations use proven hash algorithms, in order to get better spread of the hash values, to reduce collision and improve performance. There is already an API proposal with methods like "Combine", "Add" to hide mixing logic (xor, bit shift, multiplication etc) behind an interface: https://github.com/dotnet/corefx/issues/14354 . So, people wanted the ability to plug different algorithms into that proposed API and this proposal exists to address that need specifically.

(Let me know if I'm missing anything)

"Mixer API" uses a stateful API, therefore the proposed design here by @morganbr also provides a stateful API. I feel like it isn't as useful as it can be, because:

  1. Cryptographic hashing API (HashAlgorithm) is already a stateful API. I see no reason why itself or its subset can't be used over a new API if we want a stateful API.
  2. Statefulness in the API prevent many optimizations, like using registers for the state machine throughout the hashing process. A huge optimization opportunity gets lost there.

An alternative I propose is a single-shot, in-memory hashing API:

public interface IInMemoryHashCode
{
    int GetHashCode(ReadOnlySpan<T> input);
}

and the "mixer API" implementation would be responsible for accumulating the input in a buffer and passing it along to the hash function. This would make implementations easier and faster for the given purpose. Obviously, growing a buffer is also very expensive, maybe even more expensive than what we gain from hashing in one shot, but it can be tuned to work like a StringBuilder so the new buffer allocations remain minimum.

My proposal:

  • Is compatible with GetHashCode directly
  • Allows thread-safe implementations
  • Removes the need to have a Reset method.
  • Avoids optimization losses due to necessity of keeping state in memory.
  • Is compatible with stateful implementations too.

Endianness isn't an issue because BitConverter is already endian-aware, and implementations can use it.

I see no reason to mix in 64-bit or arbitrary length return values, because this is only for GetHashCode(). I see no reason to take this beyond the intended goal, or just use HashAlgorithm.

Let me know what you think, thanks.

@msftgits msftgits transferred this issue from dotnet/corefx Jan 31, 2020
@msftgits msftgits added this to the 5.0 milestone Jan 31, 2020
@maryamariyan maryamariyan added the untriaged New issue has not been triaged by the area owner label Feb 23, 2020
@tannergooding tannergooding modified the milestones: 5.0.0, Future Jun 23, 2020
@tannergooding tannergooding removed the untriaged New issue has not been triaged by the area owner label Jul 6, 2020
@LeaFrock
Copy link
Contributor

LeaFrock commented Oct 7, 2020

Oh, another 4-years-old issue huh……

I'm trying to implement a Bloom Filter recently and I want to use a non-cryptographic hash algorithm called MurmurHash3. I find that it has already become a part of Scala's standard library, but doesn't exist in dotnet BCL. Although our community provides different implementations of those hash algorithms, some of them are confusing and unreliable.
As the hash algorithms are famous and commonly useful, I think the BCL should include them. So I created issue #43131 .

I understand all current efforts are put into the release of .NET 5. I just hope, after that, someone will consider on this issue and take some progress...

@tannergooding
Copy link
Member

@GrabYourPitchforks, should this issue be converted to a discussion?

@GrabYourPitchforks
Copy link
Member

@tannergooding Probably. Just scrolled back through the earlier discussion, and I don't know if we're close to settling on an API shape.

@bartonjs
Copy link
Member

bartonjs commented Mar 5, 2021

Based on things we've learned from the cryptographic algorithms, I have two strawmen. The first is a common pattern, no inheritance. It could probably be upgraded to inheritance-based later without breaking, because the CLR allows for methods to move to base types and they still get bound at runtime.

The second is the inheritance based one, which has the unfortunate problem of needing to name the base class.

They both follow the rules:

  • Instances are used as accumulators, like IncrementalHash.
  • Static methods for one-and-done.
  • Hashes produce bytes, not ints or longs.

(No inheritance, simplified (span-explosion of members not shown)):

namespace TBD
{
    public class XxHash32
    {
        public XxHash32() { }
        public void Append(ReadOnlySpan<byte> source) { }
        public void Reset() { }
        public byte[] GetCurrentHash() => throw null;
        public byte[] GetHashAndReset() => throw null;
        public static byte[] Hash(ReadOnlySpan<byte> source) => throw null;
    }
    public class XxHash64
    {
        public XxHash64() { }
        public void Append(ReadOnlySpan<byte> source) { }
        public void Reset() { }
        public byte[] GetCurrentHash() => throw null;
        public byte[] GetHashAndReset() => throw null;
        public static byte[] Hash(ReadOnlySpan<byte> source) => throw null;
    }
    public class Crc32
    {
        public Crc32() { }
        public void Append(ReadOnlySpan<byte> source) { }
        public void Reset() { }
        public byte[] GetCurrentHash() => throw null;
        public byte[] GetHashAndReset() => throw null;
        public static byte[] Hash(ReadOnlySpan<byte> source) => throw null;
    }
    public class Crc64
    {
        public Crc64() { }
        public void Append(ReadOnlySpan<byte> source) { }
        public void Reset() { }
        public byte[] GetCurrentHash() => throw null;
        public byte[] GetHashAndReset() => throw null;
        public static byte[] Hash(ReadOnlySpan<byte> source) => throw null;
    }
    // I see a pattern here...
}

The second looks an awful lot like the first, but the verbosity of the proposal is increased to show what is/isn't virtual. The int-returning/Span-writing methods use the template method pattern so that destination-too-small gets normalized exceptions, but Append and Reset don't need to since there's no validation to be done. (Unless someone wants HashBase : IDisposable, in which case they should use the TMP to normalize the ObjectDisposedException)

namespace TBD
{
    public abstract class HashBase
    {
        public int Length { get; }
        protected HashBase(int length) { }
        public void Append(byte[] source) { }
        public abstract void Append(ReadOnlySpan<byte> source);
        public abstract void Reset();
        public byte[] GetCurrentHash() => throw null;
        public bool TryGetCurrentHash(Span<byte> destination, out int bytesWritten) => throw null;
        public int GetCurrentHash(Span<byte> destination) => throw null;
        protected abstract int GetCurrentHashCore(Span<byte> destination);
        public byte[] GetHashAndReset() => throw null;
        public bool TryGetHashAndReset(Span<byte> destination, out int bytesWritten) => throw null;
        public int GetHashAndReset(Span<byte> destination) => throw null;

        // Virtual in case there's a more efficient way the algorithm can close itself.
        protected virtual int GetHashAndResetCore(Span<byte> destination)
        {
            int ret = GetCurrentHashCore(destination);
            Reset();
            return ret;
        }
    }
    public class XxHash32 : HashBase
    {
        public XxHash32() : base(32) { }
        public override void Append(ReadOnlySpan<byte> source) { }
        public override void Reset() { }
        protected override int GetCurrentHashCore(Span<byte> destination) => throw null;

        public static byte[] Hash(byte[] source) => throw null;
        public static byte[] Hash(ReadOnlySpan<byte> source) => throw null;
        public static bool TryHash(ReadOnlySpan<byte> source, Span<byte> destination, out int bytesWritten) => throw null;
        public static int Hash(ReadOnlySpan<byte> source, Span<byte> destination) => throw null;
    }
    public class XxHash64 : HashBase
    {
        public XxHash64() : base(64) { }
        public override void Append(ReadOnlySpan<byte> source) { }
        public override void Reset() { }
        protected override int GetCurrentHashCore(Span<byte> destination) => throw null;

        public static byte[] Hash(byte[] source) => throw null;
        public static byte[] Hash(ReadOnlySpan<byte> source) => throw null;
        public static bool TryHash(ReadOnlySpan<byte> source, Span<byte> destination, out int bytesWritten) => throw null;
        public static int Hash(ReadOnlySpan<byte> source, Span<byte> destination) => throw null;
    }
    public class Crc32 : HashBase
    {
        public Crc32() : base(32) { }
        public override void Append(ReadOnlySpan<byte> source) { }
        public override void Reset() { }
        protected override int GetCurrentHashCore(Span<byte> destination) => throw null;

        public static byte[] Hash(byte[] source) => throw null;
        public static byte[] Hash(ReadOnlySpan<byte> source) => throw null;
        public static bool TryHash(ReadOnlySpan<byte> source, Span<byte> destination, out int bytesWritten) => throw null;
        public static int Hash(ReadOnlySpan<byte> source, Span<byte> destination) => throw null;
    }
    public class Crc64 : HashBase
    {
        public Crc64() : base(64) { }
        public override void Append(ReadOnlySpan<byte> source) { }
        public override void Reset() { }
        protected override int GetCurrentHashCore(Span<byte> destination) => throw null;

        public static byte[] Hash(byte[] source) => throw null;
        public static byte[] Hash(ReadOnlySpan<byte> source) => throw null;
        public static bool TryHash(ReadOnlySpan<byte> source, Span<byte> destination, out int bytesWritten) => throw null;
        public static int Hash(ReadOnlySpan<byte> source, Span<byte> destination) => throw null;
    }
}

If we /really/ wanted to we could add some "AsInt32" and "AsInt64" variants on the 32 and 64 bit versions. For either proposal. For the inheritance model it's probably worth adding a HashBase32 and HashBase64 extra layer so that those methods ride for free. e.g.

namespace TBD
{
    public abstract class HashBase64 : HashBase
    {
        protected HashBase64() : base(64) { }
        public long GetCurrentHashAsInt64() => throw null;
        public long GetHashAsInt64AndReset() => throw null;
    }
    public class Crc64 : HashBase64
    {
        // just as it was before, except an easier ctor.
        // it still writes bytes, it can't influence the new methods.
    }
}

These shapes enable adding bigger hashes, like MurmurHash128, without "some are bytes, some are ints, some are longs...". The naming pattern is a problem for algorithms that end in a number, like FNV1, but I hear FNV1a is almost always better, so FNV1a1024 isn't a naming problem (other than "FNV").

If any of the algorithms require knowing the length up front, they could either have special ctors and/or Reset methods, or just buffer from the accumulator. (Or don't do inheritance and only implement the static Hash members)

Personally, I like the simplified implementations and uniformity that the inheritance model brings. But I can't come up with a good name for it.

  • HashBase is bad because someone will eventually take one as a pluggable parameter, and "Base" is "only allowed" when the type is almost guaranteed to be hidden (because there's a better abstraction below or beside it).
  • NonCryptographicHash is wordy, but it should be rare, so maybe it's actually a good enough name.
  • HashAlgorithm is verboten, that is already used for cryptographic hashes
  • Hash is what we want to use as the method name on the statics, it shouldn't also be the type. Also, it feels like a weird IntelliSense experience to see Hash and HashAlgorithm. Too similar.

@bartonjs bartonjs changed the title [Discussion] General purpose non-cryptographic hashing API for .NET General purpose non-cryptographic hashing API for .NET Mar 5, 2021
@blowdart
Copy link
Contributor

blowdart commented Mar 5, 2021

NonCryptographicHashAlgorithm ? Even longer, but keeps aligned to the cryptographic ones, and yea, no-one is going to use the abstract.

@GrabYourPitchforks
Copy link
Member

I agree that if we create a common base class we should address only the common subset of functionality: an algorithm which operates on a byte stream of arbitrary length, a digest length which is known upfront, an instance which has reset capabilities. There should be no need to implement IDisposable because no secrets are present and we don't expect these instances to maintain resources which need releasing. If the caller wants to pool reusable hash instances they're free to do so themselves.

There are some scenarios this doesn't address. This doesn't address a theoretical hash algorithm which operates on WORD or DWORD streams instead of byte streams, or which takes ancillary data alongside the to-be-hashed data. (I'm reminded of AEAD encryption algorithms and how the existing SymmetricAlgorithm base type is a poor fit for them.) It doesn't provide the ability to persist / clone the internal state of a hash so that you can hash multiple megabytes of common data, then tweak just the last few inputs into the system.

If we say that these are specialty scenarios, then it'd make sense to put this functionality (as appropriate) on the derived types rather than trying to slam it on to the base type. And if devs really need something truly custom, nobody's prevented from making a Contoso.HashAlgorithms power toys package. :)

@terrajobst
Copy link
Member

terrajobst commented Mar 5, 2021

Namespace wise, what about these:

  • System.Buffers.Hashing
  • System.IO.Hashing
  • System.Hashing

I'm a fan of (1) because I'm not sure this will be a mainline API. It's for people processing binary data, which is what the buffers namespace is really about. If we believe that's true then I think NonCryptographicHash or NonCryptographicHashAlgorithm are good names for the base class because they clarify the intent.

@blowdart
Copy link
Contributor

blowdart commented Mar 5, 2021

You can hash strings. You can hash files. I don't think people think of those in terms of buffers to be honest, so that doesn't sit well in my mind.

System.IO.Hashing seems better to me, and as much as I'd like System.Hashing I'm not sure it deserves being that high up.

@GrabYourPitchforks
Copy link
Member

You can hash strings. You can hash files.

We can add stringy overloads if needed.

(Thought experiment: What happens if / when we introduce Utf8String? Should hashing a String and the equivalent Utf8String produce the same hash code? Now we're adding policy on top of something that's supposed to be a barebones reference hashing API.)

Not sure if we need proper File-based overloads. I don't see realistic scenarios where people are running MurmurHash over files directly.

@bartonjs
Copy link
Member

bartonjs commented Mar 5, 2021

This doesn't address a theoretical hash algorithm which operates on WORD or DWORD streams instead of byte streams

The one-shots could throw for inappropriately-sized data (and/or have overloads), and the accumulator could buffer

or which takes ancillary data alongside the to-be-hashed data. (I'm reminded of AEAD encryption algorithms and how the existing SymmetricAlgorithm base type is a poor fit for them.)

I'm not sure that really makes sense here. AEAD is (reductio ad absurdum) "I'm hashing some stuff" and "I'm encrypting some stuff", and the ancillary data is "oh, but there's a thing I want to hash but not encrypt". If there was something like "ZipperHash" which was defined as working on interleaved segments of data that could decide it fits the shape by accepting the input as alternating data units... but really that hash would probably be the best AEAD-problem-equivalent of "I don't look like this hierarchy, so I'll use similar names but not share a base type"

It doesn't provide the ability to persist / clone the internal state of a hash so that you can hash multiple megabytes of common data, then tweak just the last few inputs into the system.

Well, I left that in with GetCurrentHash. We didn't add Clone on IncrementalHash, so we only support checkpointing, not variance.

You can hash strings.

The good part of the inheritance model is we can solve that for the accumulators.

public void Append(string text, Encoding encoding = null)

I think we're better off with less-is-more here, though, and leave it off until someone cries for it. This also solves String/Utf8String parity questions.

You can hash files.

Yeah, probably do want to do something for streams.

namespace TBD
{
    partial class NonCryptographicHashAlgorithm
    {
        public void Append(Stream stream) { }
        public Task AppendAsync(Stream stream, CancellationToken cancellationToken = default) { }
    }
}

and if we also want to pattern-support them on the statics

namespace TBD
{
    partial class Crc64
    {
        public static byte[] Hash(Stream stream) => throw null;
        public static Task<byte[]> HashAsync(Stream stream, CancellationToken cancellationToken = default) => throw null;
        public static int Hash(Stream stream, Span<byte> destination) => throw null;
        public static Task<int> HashAsync(Stream stream, Memory<byte> destination, CancellationToken cancellationToken = default) => throw null;

        // Do, or do not.  There is no Async Try.
    }
}

@morganbr
Copy link
Contributor

morganbr commented Mar 6, 2021

It's awesome to see this making progress. I generally think a base class or interface would be useful for modularity/optimization. For example, some algorithms are optimized for x64 and would run slowly in x86, so it would be nice to be able to abstract the choice in a program that isn't too picky. A few other random thoughts:

  • It might be possible to align the Length property to other types -- HashAlgorithm has a HashSizeValue field (representing bits) and IncrementalHash has a HashLengthInBytes property (representing bytes).
  • GetHashCode has to do, uh, something. On System.HashCode, it's marked obsolete and throws NotSupportedException. I'm not sure if that's appropriate here since it prevents adding them to a HashSet, but otoh, it's an attractive nuisance for users.
  • Both System.Buffers and System.IO are a bit of a mismatch --- this does math with your data, regardless of how you got it (which is why I originally suggested System.Numerics.Hashing), but of the two, System.IO contains System.IO.Compression which is vaguely similar, so maybe System.IO.Hashing?
  • What's the purpose of GetCurrentHash when it also contains GetHashAndReset and there's no cloning? In other words, is there a case to get the hash of the first half of something and the hash of the whole thing? If we don't know of a concrete reason, skipping it could avoid confusion on which instance method to call and likely simplify implementation since many hashes have a finalization step that modifies internal state.
  • This should also include a proposal for the set of algorithms to start with. xxhash 32/64 and crc 32/64 are probably a good start since xxhash is already implemented and crc is really common, but it might also be good to choose ~128 and ~256 bit algorithms that can be used as a collision resistant hash (for trusted data). I vaguely recall that the last time I looked at this, spooky looked promising and there's apparently an xxhash 128 that I don't remember looking at. I'd recommend against FNV -- it's the problematic algorithm that got scattered all over the code base that motivated System.HashCode.

@blowdart
Copy link
Contributor

blowdart commented Mar 6, 2021

some algorithms are optimized for x64 and would run slowly in x86

And some even give different results, which isn't a great situation, for example Murmur in its 128bit form.

Optimization is not great idea in that sort of case, I'd rather people ended up being quite specific in their choices.

@bartonjs
Copy link
Member

bartonjs commented Mar 8, 2021

What's the purpose of GetCurrentHash when it also contains GetHashAndReset and there's no cloning?

GetCurrentHash is a partial clone. We added it on IncrementalHash to support file transfer-type protocols where they'll do things like transmit a hash every 100MB or so, which is the cumulative hash up to that point. While I'd just design the protocol to be the hash of every different chunk, they exist, and people asked for them. I figure if we design it in from the beginning here it's easier than trying to add it later.

Arguably a full Clone in the abstraction is more flexible, but we haven't seen requests for "these two files have the same 200MB, then differ at the end, can I seed it with their commonality and then just give the two variants"?

It might be possible to align the Length property to other types

I tried ignoring the crypto types that fed into this when naming this, and looked at it from the array and span perspective: it will produce an amount of data, that data is measured by {arr}.Length or {span}.Length. So instead of HashLengthInBytes, or whatever, how about "Length"?

But I expect this'll get a nontrivial amount of discussion in review.

GetHashCode has to do, uh, something.

That's a good point. Browsable-never, throw, obsolete, and sealed sounds right to me.

@ssg
Copy link

ssg commented Mar 8, 2021

So instead of HashLengthInBytes, or whatever, how about "Length"?

I like that. Could it be confused with “input buffer length”?

@bartonjs bartonjs removed this from the Future milestone Mar 16, 2021
@stephentoub
Copy link
Member

stephentoub commented Mar 27, 2021

It'd be better for such a type to derive from the abstract base, not Crc32, no? They produce fundamentally different results.

@mburbea
Copy link

mburbea commented Mar 27, 2021

It depends on your approach as you can use the same algorithm, with a different constant, so inheritance could be a good solution. The logic is identical otherwise. (here is a library that does so.

Obviously, the most optimal version of crc32c would use the sse4 CRC32 intrinsic.

@stephentoub
Copy link
Member

stephentoub commented Mar 27, 2021

so inheritance could be a good solution

From an implementation point of view, but not for someone consuming an API named one thing that then does something different. Also, the API as defined doesn't support such a change: it would require additional surface area, at which point we could choose to unseal if actually designing for such scenarios. And as you point out, that's not the ideal implementation, anyway, so....

@bartonjs
Copy link
Member

We talked about CRC32c in the meeting. It would be a different class if we added it.

We can certainly seal the types to avoid such confusion.

@GrabYourPitchforks
Copy link
Member

Let's seal the derived types unless there's a really good reason not to.

This specific issue aside, I think this should be our default stance for all new types going forward unless we have explicitly brought to API review a proposal for an extensibility model.

@stephentoub
Copy link
Member

I think this should be our default stance for all new types going forward unless we have explicitly brought to API review a proposal for an extensibility model.

I 100% agree. I think this should be the default guidance for all of the .NET ecosystem, but I lost that battle ;-)

@bartonjs bartonjs added the help wanted [up-for-grabs] Good issue for external contributors label Apr 8, 2021
@jeffhandley jeffhandley added the Priority:1 Work that is critical for the release, but we could probably ship without label Apr 15, 2021
@jeffhandley jeffhandley self-assigned this Apr 15, 2021
@bartonjs bartonjs removed the help wanted [up-for-grabs] Good issue for external contributors label May 19, 2021
@bartonjs
Copy link
Member

bartonjs commented Jun 1, 2021

API Amendment:

Because there was both the property HashLengthInBytes and the template method pattern had GetCurrentHashCore/GetHashAndResetCore returning the number of bytes written, I felt compelled to verify that the Core methods returned the right answer. So I propose keeping the template pattern, but making the Core version be void-returning.

When implementing XxHash32 and XxHash64 I was reminded that they support 32- and 64-bit (respectively) seeds. Rather than limiting the System.IO.Hashing version to the zero seed, I propose overloading the ctor and adding optional parameters for the seed on the static versions.

namespace System.IO.Hashing
{
    public abstract class NonCryptographicHashAlgorithm
    {
-       protected abstract int GetCurrentHashCore(Span<byte> destination);
+       protected abstract void GetCurrentHashCore(Span<byte> destination);
-       protected virtual int GetHashAndResetCore(Span<byte> destination);
+       protected virtual void GetHashAndResetCore(Span<byte> destination);
    }
    public sealed class XxHash32 : NonCryptographicHashAlgorithm
    {
        public XxHash32();
+       public XxHash32(int seed);
-       public static byte[] Hash(byte[] source);
+       public static byte[] Hash(byte[] source, int seed=0);
-       public static byte[] Hash(ReadOnlySpan<byte> source);
+       public static byte[] Hash(ReadOnlySpan<byte> source, int seed=0);
-       public static bool TryHash(ReadOnlySpan<byte> source, Span<byte> destination, out int bytesWritten);
+       public static bool TryHash(ReadOnlySpan<byte> source, Span<byte> destination, out int bytesWritten, int seed=0);
-       public static int Hash(ReadOnlySpan<byte> source, Span<byte> destination);
+       public static int Hash(ReadOnlySpan<byte> source, Span<byte> destination, int seed=0);
    }
    public sealed class XxHash64 : NonCryptographicHashAlgorithm
    {
        public XxHash64();
+       public XxHash64(long seed);
-       public static byte[] Hash(byte[] source);
+       public static byte[] Hash(byte[] source, long seed = 0);
-       public static byte[] Hash(ReadOnlySpan<byte> source);
+       public static byte[] Hash(ReadOnlySpan<byte> source, long seed = 0);
-       public static bool TryHash(ReadOnlySpan<byte> source, Span<byte> destination, out int bytesWritten);
+       public static bool TryHash(ReadOnlySpan<byte> source, Span<byte> destination, out int bytesWritten, long seed = 0);
-       public static int Hash(ReadOnlySpan<byte> source, Span<byte> destination);
+       public static int Hash(ReadOnlySpan<byte> source, Span<byte> destination, long seed = 0);
    }
}

These changes would result in this as the full, updated proposal:

namespace System.IO.Hashing
{
    public abstract class NonCryptographicHashAlgorithm
    {
        public int HashLengthInBytes { get; }

        protected NonCryptographicHashAlgorithm(int hashLengthInBytes);

        public abstract void Append(ReadOnlySpan<byte> source);
        public abstract void Reset();
        protected abstract void GetCurrentHashCore(Span<byte> destination);

        public void Append(byte[] source);
        public void Append(Stream stream);
        public Task AppendAsync(Stream stream, CancellationToken cancellationToken = default);
        public byte[] GetCurrentHash();
        public bool TryGetCurrentHash(Span<byte> destination, out int bytesWritten);
        public int GetCurrentHash(Span<byte> destination);
        public byte[] GetHashAndReset();
        public bool TryGetHashAndReset(Span<byte> destination, out int bytesWritten);
        public int GetHashAndReset(Span<byte> destination);

        protected virtual void GetHashAndResetCore(Span<byte> destination);

        [EditorBrowsable(EditorBrowsableState.Never)]
        [Obsolete("Use GetCurrentHash() to retrieve the computed hash code.", true)]
        public int GetHashCode();
    }
    public sealed class XxHash32 : NonCryptographicHashAlgorithm
    {
        public XxHash32();
        public XxHash32(int seed);
        public static byte[] Hash(byte[] source, int seed = 0);
        public static byte[] Hash(ReadOnlySpan<byte> source, int seed = 0);
        public static bool TryHash(ReadOnlySpan<byte> source, Span<byte> destination, out int bytesWritten, int seed = 0);
        public static int Hash(ReadOnlySpan<byte> source, Span<byte> destination, int seed = 0);
    }
    public sealed class XxHash64 : NonCryptographicHashAlgorithm
    {
        public XxHash64();
        public XxHash64(long seed);
        public static byte[] Hash(byte[] source, long seed = 0);
        public static byte[] Hash(ReadOnlySpan<byte> source, long seed = 0);
        public static bool TryHash(ReadOnlySpan<byte> source, Span<byte> destination, out int bytesWritten, long seed = 0);
        public static int Hash(ReadOnlySpan<byte> source, Span<byte> destination, long seed = 0);
    }
    public sealed class Crc32 : NonCryptographicHashAlgorithm
    {
        public Crc32();
        public static byte[] Hash(byte[] source);
        public static byte[] Hash(ReadOnlySpan<byte> source);
        public static bool TryHash(ReadOnlySpan<byte> source, Span<byte> destination, out int bytesWritten);
        public static int Hash(ReadOnlySpan<byte> source, Span<byte> destination);
    }
    public sealed class Crc64 : NonCryptographicHashAlgorithm
    {
        public Crc64();
        public static byte[] Hash(byte[] source);
        public static byte[] Hash(ReadOnlySpan<byte> source);
        public static bool TryHash(ReadOnlySpan<byte> source, Span<byte> destination, out int bytesWritten);
        public static int Hash(ReadOnlySpan<byte> source, Span<byte> destination);
    }
}

@bartonjs bartonjs added api-ready-for-review API is ready for review, it is NOT ready for implementation blocking Marks issues that we want to fast track in order to unblock other important work and removed api-approved API was approved in API review, it can be implemented labels Jun 1, 2021
@bartonjs
Copy link
Member

bartonjs commented Jun 1, 2021

Video

  • Since the ctors for XxHash* are being overloaded for usability, also overload the byte[] => byte[] statics for usability. The other versions are complicated enough that the default is OK.
namespace System.IO.Hashing
{
    public abstract class NonCryptographicHashAlgorithm
    {
        public int HashLengthInBytes { get; }

        protected NonCryptographicHashAlgorithm(int hashLengthInBytes);

        public abstract void Append(ReadOnlySpan<byte> source);
        public abstract void Reset();
        protected abstract void GetCurrentHashCore(Span<byte> destination);

        public void Append(byte[] source);
        public void Append(Stream stream);
        public Task AppendAsync(Stream stream, CancellationToken cancellationToken = default);
        public byte[] GetCurrentHash();
        public bool TryGetCurrentHash(Span<byte> destination, out int bytesWritten);
        public int GetCurrentHash(Span<byte> destination);
        public byte[] GetHashAndReset();
        public bool TryGetHashAndReset(Span<byte> destination, out int bytesWritten);
        public int GetHashAndReset(Span<byte> destination);

        protected virtual void GetHashAndResetCore(Span<byte> destination);

        [EditorBrowsable(EditorBrowsableState.Never)]
        [Obsolete("Use GetCurrentHash() to retrieve the computed hash code.", true)]
        public int GetHashCode();
    }
    public sealed class XxHash32 : NonCryptographicHashAlgorithm
    {
        public XxHash32();
        public XxHash32(int seed);
        public static byte[] Hash(byte[] source);
        public static byte[] Hash(byte[] source, int seed = 0);
        public static byte[] Hash(ReadOnlySpan<byte> source, int seed = 0);
        public static bool TryHash(ReadOnlySpan<byte> source, Span<byte> destination, out int bytesWritten, int seed = 0);
        public static int Hash(ReadOnlySpan<byte> source, Span<byte> destination, int seed = 0);
    }
    public sealed class XxHash64 : NonCryptographicHashAlgorithm
    {
        public XxHash64();
        public XxHash64(long seed);
        public static byte[] Hash(byte[] source);
        public static byte[] Hash(byte[] source, long seed);
        public static byte[] Hash(ReadOnlySpan<byte> source, long seed = 0);
        public static bool TryHash(ReadOnlySpan<byte> source, Span<byte> destination, out int bytesWritten, long seed = 0);
        public static int Hash(ReadOnlySpan<byte> source, Span<byte> destination, long seed = 0);
    }
    public sealed class Crc32 : NonCryptographicHashAlgorithm
    {
        public Crc32();
        public static byte[] Hash(byte[] source);
        public static byte[] Hash(ReadOnlySpan<byte> source);
        public static bool TryHash(ReadOnlySpan<byte> source, Span<byte> destination, out int bytesWritten);
        public static int Hash(ReadOnlySpan<byte> source, Span<byte> destination);
    }
    public sealed class Crc64 : NonCryptographicHashAlgorithm
    {
        public Crc64();
        public static byte[] Hash(byte[] source);
        public static byte[] Hash(ReadOnlySpan<byte> source);
        public static bool TryHash(ReadOnlySpan<byte> source, Span<byte> destination, out int bytesWritten);
        public static int Hash(ReadOnlySpan<byte> source, Span<byte> destination);
    }
}

@bartonjs bartonjs added api-approved API was approved in API review, it can be implemented and removed api-ready-for-review API is ready for review, it is NOT ready for implementation blocking Marks issues that we want to fast track in order to unblock other important work labels Jun 1, 2021
@danmoseley
Copy link
Member

As discussed above, there are several possible choices of CRC polynomial, and the choice matters eg., if you are adhering to a specification. Presumably we will choose one for our Crc32/Crc64 classes here (based on what's fastest and/or most useful) but how will we offer other polynomials now or in the future? Also would it be helpful for the API names to indicate which polynomial it uses (this may relate to the other question)

@bartonjs
Copy link
Member

bartonjs commented Jun 1, 2021

"CRC-32", as far as I can tell, is the version used by Ethernet. (width=32 poly=0x04c11db7 init=0xffffffff refin=false refout=false xorout=0xffffffff check=0xfc891918). For "CRC-64", the consensus I found is the one also known as CRC-64-ECMA (width=64 poly=0x42f0e1eba9ea3693 init=0x0000000000000000 refin=false refout=false xorout=0x0000000000000000 check=0x6c40df5f0b497347). Those are what I've implemented.

    /// <summary>
    ///   Provides an implementation of the CRC-32 algorithm, as used in
    ///   ITU-T V.42 and IEEE 802.3.
    /// </summary>
    /// <remarks>
    ///   <para>
    ///     This implementation emits the answer in the Little Endian byte order so that
    ///     the CRC residue relationship (CRC(message concat CRC(message))) is a fixed value) holds.
    ///     For CRC-32 this stable output is the byte sequence <c>{ 0x1C, 0xDF, 0x44, 0x21 }</c>,
    ///     the Little Endian representation of <c>0x2144DF1C</c>.
    ///   </para>
    ///   <para>
    ///     There are multiple, incompatible, definitions of a 32-bit cyclic redundancy
    ///     check (CRC) algorithm. When interoperating with another system, ensure that you
    ///     are using the same definition. The definition used by this implementation is not
    ///     compatible with the cyclic redundancy check described in ITU-T I.363.5.
    ///   </para>
    /// </remarks>
    public sealed partial class Crc32 : NonCryptographicHashAlgorithm { ... }

    /// <summary>
    ///   Provides an implementation of the CRC-64 algorithm as described in ECMA-182, Annex B.
    /// </summary>
    /// <remarks>
    ///   <para>
    ///     This implementation emits the answer in the Big Endian byte order so that
    ///     the CRC residue relationship (CRC(message concat CRC(message))) is a fixed value) holds.
    ///     For CRC-64 this stable output is the byte sequence
    ///     <c>{ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 }</c>.
    ///   </para>
    ///   <para>
    ///     There are multiple, incompatible, definitions of a 64-bit cyclic redundancy
    ///     check (CRC) algorithm. When interoperating with another system, ensure that you
    ///     are using the same definition. The definition used by this implementation is not
    ///     compatible with the cyclic redundancy check described in ISO 3309.
    ///   </para>
    /// </remarks>
    public sealed partial class Crc64 : NonCryptographicHashAlgorithm { ... }

@danmoseley
Copy link
Member

Ah, you're right, I had overlooked that the other polynomials have different suffixes.

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Jun 2, 2021
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Jun 9, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Jul 9, 2021
@bartonjs bartonjs removed their assignment Jul 26, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
api-approved API was approved in API review, it can be implemented area-System.Numerics Priority:1 Work that is critical for the release, but we could probably ship without
Projects
None yet
Development

Successfully merging a pull request may close this issue.