
Conversation

@gchatelet
Contributor

A proposal to add MurmurHash3 to Phobos.
Maybe it should go into std.experimental first. Please let me know what you think.

Member

Could you replace Block with uint? Block is private.

@gchatelet
Contributor Author

Thx for the comments @9il.
I still need to spend some time on the documentation.
I'll add some tests to check if there's an alignment issue when pushing consecutive blocks in the Piecewise struct.

@andralex
Member

LGTM from 30K feet. Any experts on board to vet this?

@9il
Member

9il commented Jan 12, 2016

LGTM from 30K feet. Any experts on board to vet this?

The MurmurHash implementation looks good. The SMurmurHash3_* structs would be very useful for fast user-defined hash functions (especially for containers or ndslice, where strides between elements can be greater than 1).
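
For illustration only, using such a struct for a user-defined hash could look roughly like this (the module name and the putBlocks/finalize/getBytes calls follow names used elsewhere in this PR; treat them as assumptions rather than a final API):

import murmurhash3 : SMurmurHash3_x64_128; // module from this PR (assumed name)

// Hedged sketch: hash pre-blocked data, where ulong[2] is one 16-byte block.
auto hash128(const(ulong[2])[] blocks)
{
    SMurmurHash3_x64_128 hasher;
    hasher.putBlocks(blocks); // feed all blocks in one call
    hasher.finalize();        // apply the final mixing step
    return hasher.getBytes(); // the 128-bit digest
}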

Member

Probably all private functions should be marked with pragma(inline, true):
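
For instance, a sketch only (the actual private helpers in this PR may be named differently):

// Hypothetical private helper marked for inlining, as suggested above.
pragma(inline, true)
private uint rotl32(uint x, uint r) @safe pure nothrow @nogc
{
    return (x << r) | (x >> (32 - r));
}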

@gchatelet
Contributor Author

@9il I added the test about alignment you were worried about.
I still need to add some documentation.

@jpf91
Contributor

jpf91 commented Jan 13, 2016

I added the test about alignment you were worried about.

Your test/code works on X86, but it won't work on architectures which do not properly support unaligned loads. This includes older ARM devices, so I'd say this is a blocker. Instead of this code:

hasher.putBlocks(cast(const(Block)[]) data[0 .. consecutiveBlocks]);

you'll have to copy n bytes into a block/ubyte union:

union BlockBuffer
{
    Block block;
    ubyte[Block.sizeof] rawData;
}
BlockBuffer buf;

foreach (n; ...)
{
    buf.rawData[] = data[n .. n + Block.sizeof];
    hasher.putBlock(buf.block);
}

If benchmarking shows that the copying makes the algorithm much slower you could do a runtime check whether data.ptr is properly aligned and then skip the copying.
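
A minimal sketch of that runtime check (reusing Block, data and hasher from the snippets above; copyThroughBuffer stands in for the copying loop and is hypothetical):

if ((cast(size_t) data.ptr % Block.alignof) == 0)
{
    // Already aligned: reinterpret the bytes as blocks, no copy needed.
    hasher.putBlocks(cast(const(Block)[]) data[0 .. consecutiveBlocks]);
}
else
{
    // Unaligned: fall back to copying each block through the union buffer.
    copyThroughBuffer(data);
}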

@gchatelet
Contributor Author

Thx @jpf91, this is very helpful. I'll test different versions but I expect the performance to drop dramatically. According to the original C++ implementation, "that's about 1.68 bytes per cycle, or about 9.5 cycles per 16-byte chunk", which allows it to reach 5 GB/s for MurmurHash3_x64_128.
The current implementation is not as fast, but I achieved 4.3 GB/s (different computer, so not necessarily comparable). I'll report my findings here, but copying is not free, so performance will suffer.

On an architecture note, the implementation is currently suboptimal on big-endian machines.
See the comment on endianness on the Wikipedia page. I need to fix this as well.

@9il
Member

9il commented Jan 14, 2016

I'll test different versions but I expect the performance to drop dramatically.

You can use version(X86) and others to determine whether the algorithm can use unaligned loads.
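
For example, something along these lines (the version list and the fallback helper are illustrative only):

version (X86)         enum hasUnalignedLoads = true;
else version (X86_64) enum hasUnalignedLoads = true;
else                  enum hasUnalignedLoads = false;

// ... later, inside put:
static if (hasUnalignedLoads)
    hasher.putBlocks(cast(const(Block)[]) data[0 .. consecutiveBlocks]);
else
    copyThroughBuffer(data); // copy into an aligned buffer first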

@jpf91
Contributor

jpf91 commented Jan 14, 2016

You can use version(X86) and others to determine whether the algorithm can use unaligned loads.

That certainly should be done. It would still be nice to have a fast-path for aligned data on architectures without unaligned loads as well. Here are some possible solutions:

  1. Your put method isn't marked as safe or trusted. So you could probably argue that your function isn't memory safe and document that the programmer is responsible for passing only aligned data.
  2. You can check data.ptr and select the codepath for aligned or unaligned data at runtime. This will probably still have some performance impact (depending on the user code / data size passed to put). But you could combine this approach with version statements.
  3. You could also add a put(const(uint)[]) overload (a minimal sketch follows this list). All data passed to this overload would be guaranteed to be aligned. But:
    • I'm not sure how well this actually works with range composition.
    • It doesn't work that well for the object-oriented interface. You could add it as an additional method but then it still couldn't be used with generic code using the Digest interface. Adding a new method to the interface could break user code. As it's safe to cast uint[] to ubyte[] the overload for aligned data could always pass the data to the overload for unaligned data. So we'd have to make Digest an abstract class instead of an interface and provide a default implementation.
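
Here's a minimal sketch of option 3, assuming Block aliases uint for the hasher in question (the putUnaligned helper is hypothetical):

// Unaligned path: plain bytes, may need to copy through a buffer.
void put(scope const(ubyte)[] data...) pure nothrow @nogc
{
    putUnaligned(data);
}

// Aligned path: a uint slice is guaranteed to be 4-byte aligned,
// so it can be reinterpreted as blocks directly.
void put(scope const(uint)[] data...) pure nothrow @nogc
{
    putBlocks(cast(const(Block)[]) data);
}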

Pinging some people who have worked on std.digest for more opinions: @schuetzm @CyberShadow @klickverbot

@schuetzm
Contributor

  1. Your put method isn't marked as safe or trusted. So you could probably argue that your function isn't memory safe and document that the programmer is responsible for passing only aligned data.

I don't think @safe makes any alignment guarantees. But an in contract should be used here, if this solution is chosen.
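
Such an in contract might look like this (a sketch only; Block and the putAligned helper are placeholders):

void put(scope const(ubyte)[] data...) pure nothrow @nogc
in
{
    // Make the alignment requirement explicit and checkable in debug builds.
    assert((cast(size_t) data.ptr % Block.alignof) == 0,
        "put requires data aligned on Block boundaries");
}
body
{
    putAligned(data); // hypothetical fast path that reinterprets the bytes as Blocks
}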

  2. You can check data.ptr and select the codepath for aligned or unaligned data at runtime. This will probably still have some performance impact (depending on the user code / data size passed to put). But you could combine this approach with version statements.

AFAIK even on x86, unaligned loads come with a performance penalty, although it's likely to be smaller for consecutive unaligned loads because of caching. Maybe provide a default implementation that can handle unaligned data, and an optimized one that requires aligned input?

@jpf91
Contributor

jpf91 commented Jan 14, 2016

I don't think @safe makes any alignment guarantees. But an in contract should be used here, if this solution is chosen.

Sorry, I should have added a more detailed explanation: @safe doesn't make any guarantees about alignment, but it does guarantee memory safety. Unaligned loads can cause data and memory corruption on some architectures; therefore a function which might produce unaligned loads depending on input parameters could never be safe (or trusted). As this put function is not marked safe, it could be argued that the API user is responsible for proper alignment (similar to the user being responsible for passing only zero-terminated strings to C str functions). I don't think it's a good solution, but I think it's something that could be argued for. But you're right, an in contract is really necessary when using this approach.

AFAIK even on x86, unaligned loads come with a performance penalty, although it's likely to be smaller for consecutive unaligned loads because of caching.

That's correct, but articles on that always compare unaligned loads vs. aligned loads (of pre-aligned data). In this case we can't require pre-aligned data, and the more interesting question is unaligned loads vs. first copying data to an aligned buffer + aligned loads. I haven't found any info on the performance of that. (In fact, a clever compiler might see that you're just copying bytes to an aligned buffer and performing a load. So if the compiler concludes an unaligned load is faster it might actually rewrite the code.)

And as X86 uses the same instructions for unaligned and aligned loads, IIRC there shouldn't be any performance loss on X86 for aligned data when using the same codepath as for unaligned data. But this is just guessing. It's probably something that really needs benchmarks ;-)

Maybe provide a default implementation that can handle unaligned data, and an optimized one that requires aligned input?

That's basically my solution 3. I think it's the best solution, but I don't know how well it works with range composition (e.g. is file.byChunk(1024) an aligned buffer? It probably needs to go through a map!(a => cast(uint[]) a) stage, etc.).
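
For illustration, that composition would be something like the following (file name and chunk size are arbitrary, and the cast only succeeds if each chunk's length is a multiple of 4 and its buffer happens to be uint-aligned):

import std.algorithm.iteration : map;
import std.stdio : File;

auto chunks = File("data.bin").byChunk(1024)
    .map!(a => cast(const(uint)[]) a); // reinterpret each ubyte[] chunk as uint[]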

@gchatelet
Contributor Author

Some numbers FYI based on code from this repo.

It's comparing the original C++ implementation with the optimized D version and finally the digest version with the copy for alignment. I tested with the 3 compilers (help me tweak compilation flags if necessary).

Note: The benchmark uses a 256KiB buffer so it fits in the cache and we don't end up measuring the memory bus.

Looking at the fastest implementation MurmurHash3_x64_128:

  • For the optimized version: LDC runs at 93% of C++, GDC at 65%, DMD at 30%.
  • (edit) For the aligned version: LDC runs at 13% of C++, GDC at 13%, DMD at 8%.

  • DMD: DMD64 D Compiler v2.069.2
 % make clean && make
rm -Rf build *.o
mkdir -p build
g++ -I/smasher/src -O3 -Wall -Werror -c smasher/src/MurmurHash3.cpp -o build/CMurmurHash3.o
dmd -O -inline -release benchmark.d murmurhash3.d build/CMurmurHash3.o -ofbuild/benchmark
./build/benchmark
Please wait while benchmarking MurmurHash3, running 4096*hash(256KiB) = 1GiB
C++ MurmurHash3_x64_128        - 19598 GiB/s
D SMurmurHash3_x64_128         - 5491 GiB/s
D digest MurmurHash3_x64_128   - 1561 GiB/s

C++ MurmurHash3_x86_128        - 12190 GiB/s
D SMurmurHash3_x86_128         - 3703 GiB/s
D digest MurmurHash3_x86_128   - 1352 GiB/s

C++ MurmurHash3_x86_32         - 9246 GiB/s
D SMurmurHash3_x86_32          - 3385 GiB/s
D digest MurmurHash3_x86_32    - 575 GiB/s

  • GDC: gcc version 5.2.0 (crosstool-NG crosstool-ng-1.20.0-232-gc746732 - 20150830-2.066.1-dadb5a3784)
 % make clean && make
rm -Rf build *.o
mkdir -p build
g++ -I/smasher/src -O3 -Wall -Werror -c smasher/src/MurmurHash3.cpp -o build/CMurmurHash3.o
~/Downloads/gdc_x86_64-pc-linux-gnu/bin/gdc -O3 -frelease -Wall -Werror benchmark.d murmurhash3.d build/CMurmurHash3.o -obuild/benchmark
./build/benchmark
Please wait while benchmarking MurmurHash3, running 4096*hash(256KiB) = 1GiB
C++ MurmurHash3_x64_128        - 19787 GiB/s
D SMurmurHash3_x64_128         - 12840 GiB/s
D digest MurmurHash3_x64_128   - 2586 GiB/s

C++ MurmurHash3_x86_128        - 12047 GiB/s
D SMurmurHash3_x86_128         - 8000 GiB/s
D digest MurmurHash3_x86_128   - 2279 GiB/s

C++ MurmurHash3_x86_32         - 9288 GiB/s
D SMurmurHash3_x86_32          - 7394 GiB/s
D digest MurmurHash3_x86_32    - 705 GiB/s

  • LDC: LDC - the LLVM D compiler (0.16.1): based on DMD v2.067.1 and LLVM 3.7.0
 % make clean && make
rm -Rf build *.o
mkdir -p build
g++ -I/smasher/src -O3 -Wall -Werror -c smasher/src/MurmurHash3.cpp -o build/CMurmurHash3.o
~/Downloads/ldc2-0.16.1-linux-x86_64/bin/ldc2 -O -release -inline benchmark.d murmurhash3.d build/CMurmurHash3.o -ofbuild/benchmark
./build/benchmark
Please wait while benchmarking MurmurHash3, running 4096*hash(256KiB) = 1GiB
C++ MurmurHash3_x64_128        - 18963 GiB/s
D SMurmurHash3_x64_128         - 17732 GiB/s
D digest MurmurHash3_x64_128   - 2609 GiB/s

C++ MurmurHash3_x86_128        - 12190 GiB/s
D SMurmurHash3_x86_128         - 10114 GiB/s
D digest MurmurHash3_x86_128   - 2346 GiB/s

C++ MurmurHash3_x86_32         - 9288 GiB/s
D SMurmurHash3_x86_32          - 8847 GiB/s
D digest MurmurHash3_x86_32    - 751 GiB/s

@dnadlinger
Contributor

For the aligned version: LDC runs at 8% of C++, GDC at 13%, DMD at 13%.

Based on your numbers, this should read:

  • For the aligned version: LDC runs at 14% of C++, GDC at 13%, DMD at 8%.

@gchatelet
Contributor Author

From your numbers, this should be 14%, right?

Good catch @klickverbot, I mixed up LDC and DMD :)
For the aligned version: LDC runs at 13% of C++, GDC at 13%, DMD at 8%.

I'll fix the original post.

@9il
Member

9il commented Jan 14, 2016

@gchatelet Can manual inlining of update and shuffle improve performance? There is no reason why GDC should be slower than g++, but inlining may be bad.

@jpf91
Contributor

jpf91 commented Jan 14, 2016

Use -fno-bounds-check and gdc is infinitely fast ;-) It seems that without bounds checks all side effects in the benchmark are gone and gdc removes the hasher.putBlocks and similar calls (that's what you get for marking everything pure ;-) actually quite a nice result). So add something like this:

//dummy.d
extern(C) void foo(ubyte[])
{
}

//benchmark.d
extern(C) void foo(ubyte[]);

auto useHasher(H)() {
    H hasher;
    hasher.putBlocks(cast(const(H.Block)[])buffer);
    hasher.finalize();
    foo(hasher.getBytes()[]); //<=========================
    return hasher.getBytes();
}

# Makefile
$(BUILD_DIR)/benchmark: benchmark.d murmurhash3.d $(BUILD_DIR)/CMurmurHash3.o
    gdc dummy.d -c -o dummy.o
    gdc -O3 -frelease -Wall -Werror -fno-bounds-check dummy.o $^ -o$@

On my system results seem to be quite random for some reason, but -fno-bounds-check seems to improve performance for SMurmurHash3_*. So you might want to get rid of these bounds checks.

I'm disappointed the compiler can't optimize this though (search for _d_arraybounds):

// gdc murmurhash3.d -fdump-tree-original -c
// cat murmurhash3.d.003t.original
;; Function putBlocks (_D11murmurhash320SMurmurHash3_x64_1289putBlocksMFNaNbNiNfMAxG2mXv)
;; enabled by -tree-original

{
  if (this != 0)
    {
      <<< Unknown tree: void_cst >>>
    }
  else
    {
      _d_assert_msg ({.length=9, .ptr="null this"}, {.length=13, .ptr="murmurhash3.d"}, 59);
    }
  {
    ulong block[2];
    ulong __key1526;
    struct const(ulong[2])[] __aggr1525;

    __aggr1525 = {.length=blocks.length, .ptr=blocks.ptr};
    __key1526 = 0;
    while (1)
      {
        if (!(__key1526 < __aggr1525.length)) break;
        block = *(__aggr1525.ptr + (__key1526 < __aggr1525.length ? __key1526 * 16 : _d_arraybounds ({.length=13, .ptr="murmurhash3.d"}, 61)));
        update ((ulong &) &this->h1, *(const ulong *) &block, this->h2, 9782798678568883157, 5545529020109919103, 31, 27, 1390208809);
        update ((ulong &) &this->h2, *((const ulong *) &block + 8), this->h1, 5545529020109919103, 9782798678568883157, 33, 31, 944331445);
        __key1526 = __key1526 + 1;
      }
  }
  this->size = this->size + blocks.length * 16;
}

@jpf91
Contributor

jpf91 commented Jan 14, 2016

For the digest version it's definitely Piecewise.put which needs to be optimized. Changing put to only do hasher.putBlocks(cast(const(Block)[])data); for testing:

C++ MurmurHash3_x64_128        - 20898 GiB/s
D SMurmurHash3_x64_128         - 21904 GiB/s
D digest MurmurHash3_x64_128   - 21223 GiB/s

C++ MurmurHash3_x86_128        - 14949 GiB/s
D SMurmurHash3_x86_128         - 13128 GiB/s
D digest MurmurHash3_x86_128   - 15004 GiB/s

C++ MurmurHash3_x86_32         - 9225 GiB/s
D SMurmurHash3_x86_32          - 9288 GiB/s
D digest MurmurHash3_x86_32    - 9330 GiB/s

(As already said test results on my machine seem to be kinda weird...)

@schuetzm
Contributor

Unaligned loads can cause data and memory corruption on some architectures

I wasn't aware of that, I thought it would always trap.

Maybe provide a default implementation that can handle unaligned data, and an optimized one that requires aligned input?

That's basically my solution 3. I think it's the best solution, but I don't know how well it works with range composition (e.g. is file.byChunk(1024) an aligned buffer? It probably needs to go through a map!(a => cast(uint[]) a) stage, etc.).

I was thinking of MurmurHash3 vs MurmurHash3Aligned, both with the usual void put(scope const(ubyte)[]) methods.

@gchatelet
Contributor Author

Thx @jpf91, I updated the benchmark.

GDC even beats the C++ implementation (more inlining possibilities, I guess).

rm -Rf build *.o
mkdir -p build
g++ -I/smasher/src -O3 -Wall -Werror -c smasher/src/MurmurHash3.cpp -o build/CMurmurHash3.o
~/Downloads/gdc_x86_64-pc-linux-gnu/bin/gdc -O3 -frelease -fno-bounds-check -Wall -Werror benchmark.d murmurhash3.d build/CMurmurHash3.o -obuild/gdc_benchmark
~/Downloads/ldc2-0.16.1-linux-x86_64/bin/ldc2 -O5 -release -inline benchmark.d murmurhash3.d build/CMurmurHash3.o -ofbuild/ldc_benchmark
dmd -O -inline -release benchmark.d murmurhash3.d build/CMurmurHash3.o -ofbuild/dmd_benchmark
build/gdc_benchmark
Please wait while benchmarking MurmurHash3, running 4096*hash(256KiB) = 1GiB
x64_128 C++                    -  97% - 19321 GiB/s
x64_128 D                      - 100% - 19787 GiB/s
x64_128 D digest               -  99% - 19598 GiB/s
x86_128 C++                    -  61% - 12154 GiB/s
x86_128 D                      -  61% - 12154 GiB/s
x86_128 D digest               -  61% - 12083 GiB/s
x86_32  C++                    -  46% - 9288 GiB/s
x86_32  D                      -  45% - 9062 GiB/s
x86_32  D digest               -  45% - 9082 GiB/s
build/ldc_benchmark
Please wait while benchmarking MurmurHash3, running 4096*hash(256KiB) = 1GiB
x64_128 C++                    - 100% - 19321 GiB/s
x64_128 D                      -  91% - 17655 GiB/s
x64_128 D digest               -  90% - 17579 GiB/s
x86_128 C++                    -  63% - 12190 GiB/s
x86_128 D                      -  52% - 10089 GiB/s
x86_128 D digest               -  51% - 9990 GiB/s
x86_32  C++                    -  48% - 9288 GiB/s
x86_32  D                      -  45% - 8828 GiB/s
x86_32  D digest               -  45% - 8847 GiB/s
build/dmd_benchmark
Please wait while benchmarking MurmurHash3, running 4096*hash(256KiB) = 1GiB
x64_128 C++                    - 100% - 19412 GiB/s
x64_128 D                      -  28% - 5461 GiB/s
x64_128 D digest               -  28% - 5498 GiB/s
x86_128 C++                    -  61% - 11977 GiB/s
x86_128 D                      -  19% - 3693 GiB/s
x86_128 D digest               -  18% - 3667 GiB/s
x86_32  C++                    -  47% - 9163 GiB/s
x86_32  D                      -  16% - 3269 GiB/s
x86_32  D digest               -  17% - 3352 GiB/s

@jpf91
Contributor

jpf91 commented Jan 15, 2016

@schuetzm

Unaligned loads can cause data and memory corruption on some architectures

I wasn't aware of that, I thought it would always trap.

I had to debug such a problem for GDC once ;-) See for example: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0068b/Bcfihdhj.html

Address alignment for word transfers
In most circumstances, you must ensure that addresses for 32-bit transfers are 32-bit word-aligned.
[...]
If your system does not have a system coprocessor (cp15), or alignment checking is disabled:

  • For STR, the specified address is rounded down to a multiple of four. [effectively overwriting data at a lower address => data corruption]

The same behavior always happens on old ARMv5 chips. For loads it's more complicated. A load will not corrupt memory, but will load invalid data. If this data is then somehow used for memory address calculation you can still end up with corrupted memory.
If you want to read more about this, see http://www.heyrick.co.uk/armwiki/Unaligned_data_access and http://stackoverflow.com/a/16549366/471401 .

I was thinking of MurmurHash3 vs MurmurHash3Aligned, both with the usual void put(scope const(ubyte)[]) methods.

That might be a good idea. Some testing shows that on X86_64 alignment is less of a problem than data length though, and ubyte[] allows any data length. I think MurmurHash3Aligned with put(Block[]) is more useful. But then there's no big difference to SMurmurHash3_*.

@gchatelet

  • LDC and DMD would also probably perform better when using the -noboundscheck option. However, for Phobos the code will have to be compiled with bounds checks, so you'll have to change the code to work around the bounds-checking code:

    void putBlocks(scope const(Block)[] blocks...) pure nothrow @nogc @trusted
    {
       const(Block)* start = blocks.ptr;
       const(Block)* end = blocks.ptr + blocks.length; 
    
       for(auto ptr = start; ptr < end; ptr++)
       {
           update(h1, (*ptr)[0], h2, c1, c2, 31, 27, 0x52dce729);
           update(h2, (*ptr)[1], h1, c2, c1, 33, 31, 0x38495ab5);
       }
       size += blocks.length * Block.sizeof;
    }
  • Aligned vs. unaligned data seems to perform more or less the same for me on x86_64 when using putBlocks

  • With GDC there's no slowdown on X86_64 when first reading into a buffer. This code is safe for architectures without unaligned loads. For DMD there's a 3% slowdown.

    struct Piecewise(Hasher)
    {
       union BufferUnion
       {
           Block block;
           ubyte[Block.sizeof] blockData;
       }
       BufferUnion buffer;
    
       void put(scope const(ubyte)[] data...) pure nothrow @trusted
       {
            // Buffer should never be full while entering this function.
           assert(bufferSize < Block.sizeof);
           const(ubyte)* start = data.ptr;
    
           size_t preprocessed = 0;
           /*
            * Do main work: process chunks of Block.sizeof bytes
            */
           size_t numBlocks = (data.length - preprocessed) / Block.sizeof;
           const(ubyte)* end = start + numBlocks * Block.sizeof;
    
           for(; start < end; start += Block.sizeof)
           {
               buffer.blockData = start[0 .. Block.sizeof];
               hasher.putBlock(buffer.block);
           }
           hasher.size += numBlocks * Block.sizeof;
       }
    }
    
    struct SMurmurHash3_x64_128
    {
       void putBlock(scope const Block block) pure nothrow @nogc
       {
           update(h1, block[0], h2, c1, c2, 31, 27, 0x52dce729);
           update(h2, block[1], h1, c2, c1, 33, 31, 0x38495ab5);
       }
    }
  • There is a 10-20% slowdown when supporting data buffers with lengths which are not multiples of the block size. Here's my code: Note: This code is untested and needs proof-reading as well as many tests, especially for corner cases. (data.length < Block.sizeof, data.length == Block.sizeof, ...). The if (data.length + bufferSize < Block.sizeof) statement seems to be the main slowdown source.

    struct Piecewise(Hasher)
    {
       union BufferUnion
       {
           Block block;
           ubyte[Block.sizeof] blockData;
       }
       BufferUnion buffer;
    
       void put(scope const(ubyte)[] data...) pure nothrow @trusted
       {
            // Buffer should never be full while entering this function.
           assert(bufferSize < Block.sizeof);
           const(ubyte)* start = data.ptr;
    
    
           /*
            * Check if we have some leftover data in the buffer.
            * Then fill the first block buffer.
            */
           if (data.length + bufferSize < Block.sizeof)
           {
                buffer.blockData[bufferSize .. bufferSize + data.length] = start[0 .. data.length];
               bufferSize += data.length;
               return;
           }
           auto preprocessed = Block.sizeof - bufferSize;
           buffer.blockData[bufferSize .. $] = start[0 .. preprocessed];
           hasher.putBlock(buffer.block);
           start += preprocessed;
    
           /*
            * Do main work: process chunks of Block.sizeof bytes
            */
           size_t numBlocks = (data.length - preprocessed) / Block.sizeof;
           const(ubyte)* end = start + numBlocks * Block.sizeof;
    
           for(; start < end; start += Block.sizeof)
           {
               buffer.blockData = start[0 .. Block.sizeof];
               hasher.putBlock(buffer.block);
           }
           // +1 for preprocessed Block
           hasher.size += (numBlocks + 1) * Block.sizeof;
    
           /**
            * Now add remaining data to buffer
            */
           bufferSize = data.length - preprocessed - numBlocks * Block.sizeof;
           buffer.blockData[0 .. bufferSize] = end[0 .. bufferSize];
       }
    }
    
    struct SMurmurHash3_x64_128
    {
       void putBlock(scope const Block block) pure nothrow @nogc
       {
           update(h1, block[0], h2, c1, c2, 31, 27, 0x52dce729);
           update(h2, block[1], h1, c2, c1, 33, 31, 0x38495ab5);
       }
    }
  • At least for X86_64 we can ignore the effects of unaligned data and the overhead caused by copying. So the copying version which will be safe for all architectures has no negative effects on X86_64. A much bigger problem is processing data buffers with lengths which are not multiples of the Blocksize. So maybe a MurmurHash3Aligned digest which only accepts put(ulong[Block.sizeof/8][]) is a good idea for those use cases where we have proper data sizes. (I'm not actually sure what type we should use for that. ubyte[n] would be nice, but ubyte[n].alignof == 1 so that's not what we really want)

  • Here are ARM results for ARMv5 (no unaligned load support) with linux kernel configured to fix up these accesses (very slow):

    C++ MurmurHash3_x64_128, aligned data -  97% - 529 GiB/s
    C++ MurmurHash3_x64_128, unaligned data -   6% - 33 GiB/s
    D SMurmurHash3_x64_128, aligned data - 100% - 544 GiB/s
    D SMurmurHash3_x64_128, unaligned data -   5% - 33 GiB/s
    D digest MurmurHash3_x64_128, aligned data -  57% - 311 GiB/s
    D digest MurmurHash3_x64_128, unaligned data -  53% - 293 GiB/s
    

    Please note that the C++ MurmurHash3_x64_128 and SMurmurHash3_x64_128 generate invalid results for unaligned data with the standard linux kernel configuration. The digest version works fine. So an overload for a fast version for aligned data is even more important for ARM. I couldn't find a specific reason for the 40% slowdown for the digest version. Even simply forwarding the data by calling putBlocks is 10% slower. GCC probably can't optimize ARM code as well as X86_64 code.

My conclusion would be to use the unaligned-data-safe but slightly slower approach I've posted above for the default MurmurHash3_x86_32 std.digest digests. Then either introduce MurmurHash3_x86_32Aligned with a put function which only takes a Block[] parameter. Or even better: maybe make the SMurmurHash3_* structs compatible with the digest API (alias put => putBlocks, add start and finish functions).
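
A rough sketch of that last idea (finalize/getBytes as used earlier in this thread; the return type and the reset logic are assumptions):

struct SMurmurHash3_x64_128
{
    // ... existing fields, putBlock and putBlocks as above ...

    alias put = putBlocks; // digest-style entry point for already-blocked data

    void start() pure nothrow @nogc
    {
        this = typeof(this).init; // reset to the initial (seeded) state
    }

    auto finish() pure nothrow @nogc
    {
        finalize();
        return getBytes(); // the raw digest bytes
    }
}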

@gchatelet
Contributor Author

Here are the numbers with the latest version. I used @jpf91's suggestions and added more tests. dmd is still super slow, but ldc and gdc are now as fast as the C++ version even with bounds checking. Also, ldc is now benchmarked against C++ code compiled with clang++ to be fair. Last thing: I fixed the benchmark's output to display the correct throughput (it should have been MiB instead of GiB).

I'm now working on improving the documentation, stay tuned!
Feedback is still welcome of course :)

 % make clean && make -j12
rm -Rf build *.o
mkdir -p build
clang++ -I/smasher/src -O3 -Wall -Werror -c smasher/src/MurmurHash3.cpp -o build/CMurmurHash3_clang.o
g++ -I/smasher/src -O3 -Wall -Werror -c smasher/src/MurmurHash3.cpp -o build/CMurmurHash3_g++.o
dmd -O -release -inline benchmark.d murmurhash3.d build/CMurmurHash3_g++.o -ofbuild/dmd_benchmark
~/Downloads/gdc_x86_64-pc-linux-gnu/bin/gdc -O3 -frelease -Wall -Werror -march=native benchmark.d murmurhash3.d build/CMurmurHash3_g++.o -obuild/gdc_benchmark
~/Downloads/ldc2-0.16.1-linux-x86_64/bin/ldc2 -O3 -release -inline benchmark.d murmurhash3.d build/CMurmurHash3_clang.o -ofbuild/ldc_benchmark
build/gdc_benchmark
Please wait while benchmarking MurmurHash3, running 4096*hash(256KiB) = 1GiB
x64_128 C++          - 100% - 19.51 GiB/s
x64_128 D            - 105% - 20.49 GiB/s
x64_128 D digest     -  89% - 17.53 GiB/s

x86_128 C++          - 100% - 12.02 GiB/s
x86_128 D            -  93% - 11.31 GiB/s
x86_128 D digest     -  89% - 10.75 GiB/s

x86_32 C++           - 100% - 9.26 GiB/s
x86_32 D             -  97% - 9.04 GiB/s
x86_32 D digest      - 102% - 9.46 GiB/s

build/ldc_benchmark
Please wait while benchmarking MurmurHash3, running 4096*hash(256KiB) = 1GiB
x64_128 C++          - 100% - 15.57 GiB/s
x64_128 D            - 113% - 17.64 GiB/s
x64_128 D digest     -  92% - 14.32 GiB/s

x86_128 C++          - 100% - 9.14 GiB/s
x86_128 D            - 110% - 10.11 GiB/s
x86_128 D digest     -  97% - 8.89 GiB/s

x86_32 C++           - 100% - 8.79 GiB/s
x86_32 D             -  99% - 8.74 GiB/s
x86_32 D digest      - 104% - 9.20 GiB/s

build/dmd_benchmark
Please wait while benchmarking MurmurHash3, running 4096*hash(256KiB) = 1GiB
x64_128 C++          - 100% - 19.74 GiB/s
x64_128 D            -  23% - 4.59 GiB/s
x64_128 D digest     -  23% - 4.66 GiB/s

x86_128 C++          - 100% - 12.15 GiB/s
x86_128 D            -  25% - 3.04 GiB/s
x86_128 D digest     -  25% - 3.06 GiB/s

x86_32 C++           - 100% - 9.28 GiB/s
x86_32 D             -  36% - 3.37 GiB/s
x86_32 D digest      -  36% - 3.36 GiB/s

@jpf91
Contributor

jpf91 commented Jan 18, 2016

@gchatelet That looks great! Thanks for incorporating the suggestions and for contributing this module.

Some small comments:

  • Maybe add a comment like this to putBlocks and Piecewise.put:

    /*
    * Use pointer manipulation instead of foreach to avoid bounds checking (see @@@BUG@@@ 15581).
    * This function heavily affects the performance of [Piecewise/hash calculation] so make sure to
    * benchmark all changes!
    */
  • Maybe rename the module to murmurhash instead of murmurhash3. All other modules do not include the algorithm number. (If we add murmurhash2 at some point we could still put these into extra files and use package.d for std.digest.murmurhash).

  • You need to have another look at the Windows makefiles and add murmurhash.d to some more rules. Just search for crc.d and add the same rules.

I did a quick review and apart from these small issues and the missing documentation this module definitely gets a 👍 from me. The code looks quite nice. The checkResult unittest helper could be useful for other std.digest modules. Maybe also use the dmd -cov coverage check and remember to squash your commits when this pull request is ready for merging.

// Do main work: process chunks of Block.sizeof bytes.
const numBlocks = data.length / Block.sizeof;
const remainderStart = numBlocks * Block.sizeof;
foreach (const Block block; cast(const(Block[]))(data[0 .. remainderStart]))
Member

ref?
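
Presumably suggesting the block be taken by reference to avoid copying each block per iteration, i.e. something like:

foreach (ref const Block block; cast(const(Block[]))(data[0 .. remainderStart]))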

@andralex
Member

I added a few nits, and have a high-level question. So we have separate structures defined for "optimized for 32-bit systems" and "optimized for 64-bit systems". Why aren't those hidden and selected automatically depending on the platform? Would someone be interested in running the opt64 thing on a 32-bit system or vice versa?

uint h1;

public:
alias Block = uint; /// The element type for 32-bit implementation.
Member

I think the name Block is really confusing. I've wondered what that is all through the code review. Would Unit fare better?

@gchatelet
Contributor Author

gchatelet commented May 25, 2016

@andralex I reused the terminology from the original implementation. I can go for Unit if you prefer.
How about Element?

@andralex
Member

Naming-wise, I think we should go with one only:

struct MurmurHash(uint size /* 32, 64, or 128 */, uint opt = size_t.sizeof == 8 ? 64 : 32);

@9il
Member

9il commented May 24, 2016

Would someone be interested in running the opt64 thing on a 32-bit system or vice versa?

Yes, this is common for distributed databases.

@andralex
Member

@9il thanks, cool. Then it seems the template with the defaulted second argument is appropriate.

@gchatelet
Contributor Author

gchatelet commented May 27, 2016

I posted a new version based on @andralex's suggestions. It's more compact and clearer.
One drawback is that we can't distinguish between MurmurHash2 32 bit (unimplemented) and MurmurHash3 32 bit.
Maybe we just don't care...
Let's just hope there won't be a MurmurHash4 : )

Destroy!
(Ah the documentation must be updated...I'll do it later)

@schuetzm
Contributor

The name should still be MurmurHash3, because as you said MurmurHash2 is a thing, too. The module can still be called murmurhash without a version number.

@aappleby

I have ideas for a MurmurHash4, but I don't know if I'll ever publish it. :)


@gchatelet
Contributor Author

Changed MurmurHash to MurmurHash3, fixed the documentation, rebased and squashed the history.

@aappleby good to know :) I'm curious now!

import std.conv;

static assert(false,
"MurmurHash3(" ~ size.to!string ~ "," ~ opt.to!string ~ ") is not implemented");
Member

No need to import std.conv: size.stringof and opt.stringof should work.
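
i.e. something along these lines (note that .stringof may render the values with a type suffix, e.g. "32u"):

static assert(false,
    "MurmurHash3(" ~ size.stringof ~ ", " ~ opt.stringof ~ ") is not implemented");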

@gchatelet force-pushed the murmurhash3 branch 2 times, most recently from ad17a86 to e6dfddc on May 28, 2016

testUnalignedHash!(MurmurHash3!32)();
testUnalignedHash!(MurmurHash3!(128, 32))();
testUnalignedHash!(MurmurHash3!(128, 64))();
Contributor

do you know about static foreach in D?

import std.meta : AliasSeq;
alias seq = AliasSeq;
foreach (H; seq!(32, seq!(128, 32), seq!(128, 64)))
{
    // ...
}

Member

only 3 lines without static foreach

@wilzbach
Contributor

wilzbach commented Jun 6, 2016

@gchatelet please ignore my nitpicks - if we want we can still improve docs afterwards.
So can we go with murmurhash or does it need to go through the std.experimental hop?

@gchatelet
Contributor Author

@wilzbach about experimental: since the interface is already standardized, I don't think it makes sense.
I'd just go for it and improve. This PR was initiated 6 months ago; let's merge and iterate.
I don't see any showstopper any more.

@9il
Member

9il commented Jun 6, 2016

Auto-merge toggled on

@9il merged commit dc0f811 into dlang:master on Jun 6, 2016
@9il
Member

9il commented Jul 23, 2016

PR for Slice to use MurmurHash3 #4639
