Adding murmurhash3 digest to phobos #3916
Conversation
std/digest/murmurhash3.d
Could you replace `Block` with `uint`? `Block` is private.
Thanks for the comments @9il.
LGTM from 30K feet. Any experts on board to vet this?
MurmurHash implementation looks good.
std/digest/murmurhash3.d
Probably all private functions should be marked with `pragma(inline, true)`.
@9il I added the test about alignment you were worried about.
Your test/code works on X86 but it won't work on architectures which do not properly support unaligned loads. This includes older ARM devices, so I'd say this is a blocker. Instead of this code:

```d
hasher.putBlocks(cast(const(Block)[]) data[0 .. consecutiveBlocks]);
```

you'll have to copy n bytes into a block/ubyte union:

```d
union
{
    Block block;
    ubyte[Block.sizeof] rawData;
} buf;

foreach (...)
{
    buf.rawData[] = data[n .. n + Block.sizeof];
    hasher.putBlock(...);
}
```

If benchmarking shows that the copying makes the algorithm much slower, you could do a runtime check whether `data.ptr` is properly aligned and then skip the copying.
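The copy-through-a-union workaround above can also be sketched in C++ (the language of the reference MurmurHash3 implementation). `Block` and `readBlock` are illustrative names, not part of the PR; copying through `memcpy` into an aligned local is the well-defined way to express an unaligned load, and optimizing compilers turn it into a single load on targets where that is legal:

```cpp
#include <cstdint>
#include <cstring>

using Block = std::uint64_t; // stands in for the hasher's block type

// Read one Block from a possibly unaligned byte pointer.
// memcpy is well-defined on every architecture, unlike dereferencing
// a cast pointer, which may trap or return rotated data on older ARM.
Block readBlock(const unsigned char* p)
{
    Block block;
    std::memcpy(&block, p, sizeof(Block));
    return block;
}
```

On x86 this typically compiles to the same `mov` as a direct load; on ARMv5-class cores it becomes byte loads, which is the slowdown the benchmarks below try to quantify.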
Thanks @jpf91, this is very helpful. I'll test different versions, but I expect the performance to drop dramatically. According to the original C++ implementation: "that's about 1.68 bytes per cycle, or about 9.5 cycles per 16-byte chunk" — this allows MurmurHash3_x64_128 to reach about 5 GB/s. On an architecture note, the implementation is currently suboptimal for big-endian.
You can use
That certainly should be done. It would still be nice to have a fast-path for aligned data on architectures without unaligned loads as well. Here are some possible solutions:
Pinging some people who have worked on
I don't think
AFAIK even on x86, unaligned loads come with a performance penalty, although it's likely to be smaller for consecutive unaligned loads because of caching. Maybe provide a default implementation that can handle unaligned data, and an optimized one that requires aligned input?
Sorry, I should have added a more detailed explanation: @safe doesn't make any guarantees about alignment, but it guarantees memory safety. Unaligned loads can cause data and memory corruption on some architectures, therefore a function which might produce unaligned loads depending on its input parameters could never be @safe (or @trusted). As this
That's correct, but articles on that always compare

And as X86 uses the same instructions for unaligned vs. aligned loads, IIRC there shouldn't be any performance loss on X86 for aligned data when using the same code path as for unaligned data. But this is just guessing. It's probably something that really needs benchmarks ;-)
That's basically my solution 3. I think it's the best solution, but I don't know how well it works with range composition (e.g. is
Some numbers FYI based on code from this repo. It's comparing the original C++ implementation with the optimized D version, and finally the digest version with the copy for alignment. I tested with the three compilers (help me tweak compilation flags if necessary). Note: the benchmark uses a 256 KiB buffer so it fits in the cache and we don't end up measuring the memory bus. Looking at the fastest implementation, MurmurHash3_x64_128:
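The measurement setup described here can be sketched as follows (a hypothetical harness: `hashBuffer` is a placeholder, not MurmurHash; only the 256 KiB in-cache buffer size comes from the comment above):

```cpp
#include <chrono>
#include <cstdint>
#include <vector>

// Placeholder for the hash under test (MurmurHash3_x64_128 in the PR).
std::uint64_t hashBuffer(const std::uint8_t* data, std::size_t len)
{
    std::uint64_t h = 0;
    for (std::size_t i = 0; i < len; ++i)
        h = h * 31 + data[i];
    return h;
}

// Throughput in bytes/second over a 256 KiB buffer. The buffer fits in
// cache, so the loop measures the hash itself, not the memory bus.
double measureThroughput(std::size_t iterations)
{
    std::vector<std::uint8_t> buffer(256 * 1024, 0xAB);
    std::uint64_t sink = 0;
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iterations; ++i)
        sink = sink ^ hashBuffer(buffer.data(), buffer.size());
    const auto stop = std::chrono::steady_clock::now();
    volatile std::uint64_t keep = sink; // keep the measured work observable
    (void)keep;
    const std::chrono::duration<double> elapsed = stop - start;
    return iterations * double(buffer.size()) / elapsed.count();
}
```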
Based on your numbers, this should read:
Good catch @klickverbot, I mixed up LDC and DMD :) I'll fix the original post.
@gchatelet Can manual inlining of
Use

```d
// dummy.d
extern(C) void foo(ubyte[])
{
}
```

```d
// benchmark.d
extern(C) void foo(ubyte[]);

auto useHasher(H)()
{
    H hasher;
    hasher.putBlocks(cast(const(H.Block)[]) buffer);
    hasher.finalize();
    foo(hasher.getBytes()[]); // <=========================
    return hasher.getBytes();
}
```

```make
$(BUILD_DIR)/benchmark: benchmark.d murmurhash3.d $(BUILD_DIR)/CMurmurHash3.o
	gdc dummy.d -c -o dummy.o
	gdc -O3 -frelease -Wall -Werror -fno-bounds-check dummy.o $^ -o$@
```

On my system results seem to be quite random for some reason, but I'm disappointed the compiler can't optimize this though (search for

```
// gdc murmurhash3.d -fdump-tree-original -c
// cat murmurhash3.d.003t.original

;; Function putBlocks (_D11murmurhash320SMurmurHash3_x64_1289putBlocksMFNaNbNiNfMAxG2mXv)
;; enabled by -tree-original
{
  if (this != 0)
    {
      <<< Unknown tree: void_cst >>>
    }
  else
    {
      _d_assert_msg ({.length=9, .ptr="null this"}, {.length=13, .ptr="murmurhash3.d"}, 59);
    }
  {
    ulong block[2];
    ulong __key1526;
    struct const(ulong[2])[] __aggr1525;

    __aggr1525 = {.length=blocks.length, .ptr=blocks.ptr};
    __key1526 = 0;
    while (1)
      {
        if (!(__key1526 < __aggr1525.length)) break;
        block = *(__aggr1525.ptr + (__key1526 < __aggr1525.length ? __key1526 * 16 : _d_arraybounds ({.length=13, .ptr="murmurhash3.d"}, 61)));
        update ((ulong &) &this->h1, *(const ulong *) &block, this->h2, 9782798678568883157, 5545529020109919103, 31, 27, 1390208809);
        update ((ulong &) &this->h2, *((const ulong *) &block + 8), this->h1, 5545529020109919103, 9782798678568883157, 33, 31, 944331445);
        __key1526 = __key1526 + 1;
      }
  }
  this->size = this->size + blocks.length * 16;
}
```
For the digest version it's definitely (As already said, test results on my machine seem to be kinda weird...)
I wasn't aware of that, I thought it would always trap.
I was thinking of
Thanks @jpf91, I updated the benchmark. GDC even beats the C++ implementation (more inlining opportunities, I guess).
I had to debug such a problem for GDC once ;-) See for example: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0068b/Bcfihdhj.html
The same behavior always happens on old ARMv5 chips. For loads it's more complicated. A load will not corrupt memory, but will load invalid data. If this data is then somehow used for memory address calculation you can still end up with corrupted memory.
That might be a good idea. Some testing shows that on
My conclusion would be to use the unaligned-data-safe but slightly slower approach I've posted above for the default
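That conclusion can be sketched like this (illustrative C++, not the PR's API: `putBlock`, `putBlocks`, and the xor "state" are stand-ins): default to the copy-based path that is safe for unaligned data, and branch to direct loads only when the pointer is suitably aligned.

```cpp
#include <cstdint>
#include <cstring>

using Block = std::uint64_t;

inline bool isAligned(const void* p, std::size_t alignment)
{
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Stand-in for the hasher's per-block update.
inline void putBlock(std::uint64_t& state, Block b) { state ^= b; }

// Feed len bytes (assumed to be a multiple of sizeof(Block)).
void putBlocks(std::uint64_t& state, const std::uint8_t* data, std::size_t len)
{
    if (isAligned(data, alignof(Block)))
    {
        // Fast path: direct loads are safe on every architecture here.
        const Block* blocks = reinterpret_cast<const Block*>(data);
        for (std::size_t i = 0; i < len / sizeof(Block); ++i)
            putBlock(state, blocks[i]);
    }
    else
    {
        // Default path: copy into an aligned local first (safe everywhere).
        for (std::size_t i = 0; i + sizeof(Block) <= len; i += sizeof(Block))
        {
            Block b;
            std::memcpy(&b, data + i, sizeof(Block));
            putBlock(state, b);
        }
    }
}
```

Both paths must produce identical results for the same bytes; only their cost differs per architecture.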
Here are the numbers with the latest version. I used @jpf91's suggestions and added more tests. I'm now working on improving the documentation, stay tuned!
@gchatelet That looks great! Thanks for incorporating the suggestions and for contributing this module. Some small comments:
I did a quick review and apart from these small issues and the missing documentation this module definitely gets a 👍 from me. The code looks quite nice. The
std/digest/murmurhash.d
```d
// Do main work: process chunks of Block.sizeof bytes.
const numBlocks = data.length / Block.sizeof;
const remainderStart = numBlocks * Block.sizeof;
foreach (const Block block; cast(const(Block[]))(data[0 .. remainderStart]))
```

`ref`?
I added a few nits, and have a high-level question. So we have separate structures defined for "optimized for 32-bit systems" and "optimized for 64-bit systems". Why aren't those hidden and selected automatically depending on the platform? Would someone be interested in running the opt64 thing on a 32-bit system, or vice versa?
std/digest/murmurhash.d
```d
uint h1;
// ...
public:
alias Block = uint; /// The element type for 32-bit implementation.
```
I think the name Block is really confusing. I've wondered what that is all through the code review. Would Unit fare better?
@andralex I reused the terminology from the original implementation. I can go for Unit if you prefer.
How about Element?
Naming-wise, I think we should go with one only:

```d
struct MurmurHash(uint size /* 32, 64, or 128 */, uint opt = size_t.sizeof == 8 ? 64 : 32);
```
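For readers more at home in C++, the same defaulting idea looks like this (a sketch of the mechanism only; the D declaration above is the actual proposal, and `hashBits`/`optimizedFor` are illustrative members):

```cpp
#include <cstddef>

// size: hash width in bits (32, 64, or 128); opt: the word size the
// implementation is tuned for, defaulting to the host's pointer size.
template <unsigned size, unsigned opt = sizeof(std::size_t) == 8 ? 64 : 32>
struct MurmurHash
{
    static constexpr unsigned hashBits = size;
    static constexpr unsigned optimizedFor = opt;
};
```

`MurmurHash<128>` then picks the variant matching the host, while `MurmurHash<128, 32>` still lets you force the 32-bit variant, e.g. to exchange hashes with 32-bit peers.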
Yes, this is common for distributed databases.
@9il thanks, cool. Then it seems the template with the defaulted second argument is appropriate.
I posted a new version based on @andralex's suggestions. It's more compact and clearer. Destroy!
The name should still be
I have ideas for a MurmurHash4, but I don't know if I'll ever publish it. :)
Changed. @aappleby good to know :) I'm curious now!
std/digest/murmurhash.d
```d
import std.conv;
// ...
static assert(false,
    "MurmurHash3(" ~ size.to!string ~ "," ~ opt.to!string ~ ") is not implemented");
```

No need to import std.conv: `size.stringof` and `opt.stringof` should work.
```d
testUnalignedHash!(MurmurHash3!32)();
testUnalignedHash!(MurmurHash3!(128, 32))();
testUnalignedHash!(MurmurHash3!(128, 64))();
```
do you know about static foreach in D?

```d
import std.meta : AliasSeq;

alias seq = AliasSeq;
foreach (H; seq!(32, seq!(128, 32), seq!(128, 64)))
{
    // ...
}
```
only 3 lines without static foreach
@gchatelet please ignore my nitpicks — actually: if we want, we can still improve the docs afterwards.
@wilzbach about experimental: since the interface is already standardized, I don't think it makes sense.
Auto-merge toggled on |
PR for Slice to use MurmurHash3 #4639 |
A proposal to add MurmurHash3 to Phobos.
Maybe it should go in experimental first. Please let me know what you think.