Add experimental -hash-threshold option to hash very long symbol names. #1445

JohanEngelen · 2016-04-18T15:40:58Z

This adds MD5 hashing of symbol names that are longer than the threshold set by -hashthres.

Unfortunately, std.traits depends on the mangled name: it parses the mangled names of symbols as strings to obtain symbol traits. This means that mangling cannot be changed dramatically (e.g. by hashing) at a high level, so the hashing has to be done at a lower level.

Hashed symbols look like this:
_D3one3two5three3L3433_46a82aac733d8a4b3588d7fa8937aad66Result3fooZ
ddemangle gives:
one.two.three.L34._46a82aac733d8a4b3588d7fa8937aad6.Result.foo
Meaning: this symbol is defined in module one.two.three on line 34. The identifier is foo and is contained in the struct or class Result.
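The layout of the hashed name above can be illustrated with a small Python sketch. It is not the PR's implementation: the function names are hypothetical, and it assumes the MD5 digest is taken over the full original mangled name and that the module path, line number, parent and identifier are simply passed in. The splitter only handles the simplified form shown above (length-prefixed identifiers between `_D` and a trailing `Z`).

```python
import hashlib

def mangle_ident(ident: str) -> str:
    """D-style length-prefixed identifier, e.g. 'foo' -> '3foo'."""
    return f"{len(ident)}{ident}"

def hashed_symbol(full_mangled, module_parts, line, parent, ident, threshold):
    """Return full_mangled unchanged if it is short enough, else a hashed name.

    Assumption: the hash input is the complete original mangled name.
    """
    if len(full_mangled) <= threshold:
        return full_mangled
    digest = hashlib.md5(full_mangled.encode("utf-8")).hexdigest()  # 32 hex chars
    parts = list(module_parts) + [f"L{line}", "_" + digest, parent, ident]
    return "_D" + "".join(mangle_ident(p) for p in parts) + "Z"

def split_idents(symbol: str):
    """Split the length-prefixed identifiers between '_D' and the final 'Z'."""
    if not (symbol.startswith("_D") and symbol.endswith("Z")):
        raise ValueError("not in the simplified _D...Z form")
    body, i, out = symbol[2:-1], 0, []
    while i < len(body):
        j = i
        while j < len(body) and body[j].isdigit():
            j += 1                      # read the decimal length prefix
        n = int(body[i:j])
        out.append(body[j:j + n])       # take exactly n characters
        i = j + n
    return out

if __name__ == "__main__":
    example = "_D3one3two5three3L3433_46a82aac733d8a4b3588d7fa8937aad66Result3fooZ"
    print(".".join(split_idents(example)))
```

Joining the split identifiers with dots reproduces the ddemangle-style reading of the example: module path, line marker, hash, parent, identifier.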

Symbols that may be hashed:

functions
struct/class initializer
vtable
typeinfo (needed surgery inside FE code)

The feature is experimental, and has been tested on Weka.io's codebase. Compilation with -hashthres=1000 results in a binary that is less than half the size of the original (201MB vs. 461MB). I did not observe a significant difference in total build times. A hash threshold of 8000 gives a 229MB binary and 800 gives 195MB: lowering the threshold further yields little additional gain.
Linking Weka's code fails with a threshold of 500: phobos contains a few large symbols (one larger than 8kb!) and this PR currently does not disable hashing of symbols that are inside phobos, hence "experimental". Future work could try to figure out whether a symbol is inside phobos or not.
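To put the reported sizes side by side, here is a short Python sketch that computes each binary's size relative to the unhashed 461MB build. The numbers are the ones quoted above; the variable and function names are only illustrative.

```python
ORIGINAL_MB = 461  # binary size without hashing, as reported above

# -hashthres value -> resulting binary size in MB, as reported above
sizes_mb = {8000: 229, 1000: 201, 800: 195}

def relative_size(size_mb, original_mb=ORIGINAL_MB):
    """Fraction of the original binary size that remains."""
    return size_mb / original_mb

if __name__ == "__main__":
    for threshold, size in sorted(sizes_mb.items(), reverse=True):
        print(f"-hashthres={threshold}: {size}MB, "
              f"{relative_size(size):.1%} of the original")
```

Going from a threshold of 8000 down to 1000 saves a further 28MB, while going from 1000 to 800 saves only 6MB.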

JohanEngelen · 2016-04-20T08:41:59Z

> phobos contains a few large symbols (one larger than 8kb!)

Correction: it is ~5322 bytes long.

wilzbach · 2016-04-21T18:00:52Z

> -hashthres

I was just confused by this. Maybe you wouldn't mind the extra 4 chars and could make it -hashthreshold, so that its name is clearer?

JohanEngelen · 2016-04-21T18:06:03Z

Hm, reading through our other options, I guess -hash-threshold would be best.

JohanEngelen · 2016-04-22T12:06:01Z

ping @klickverbot @redstar

dnadlinger · 2016-04-22T18:01:47Z

Why didn't you implement it in D? That way, there would be a chance of submitting it to upstream… ;)

dnadlinger · 2016-04-22T18:05:47Z

ddmd/mtype.d

+            name = namebuf.ptr;
+            sprintf(name, "_D%lluTypeInfo_%.*s6__initZ", cast(ulong)9 + hashedname.length, hashedname.length, hashedname.ptr);
+        }
+        else


This seems rather unfortunate. Perhaps we should ditch IN_LLVM for such cases, or replace it by an enum set from the version (so you can do if (IN_LLVM && …)). It doesn't seem like we would ever want to try using LDC's front end sources to build DMD…

JohanEngelen · 2016-04-22

OK. We already have the enum IN_LLVM, so I'll use that. I also think the copying is ugly/stupid.
I will indent the DDMD source, so that we are notified of (perhaps relevant) changes by merge errors.

dnadlinger · 2016-04-22

Personally, I'd probably keep the indentation as it is, so that the diff is kept tidy and the LDC-specific part is made painfully obvious when browsing the source. But I guess one could always use diff -w for the former…
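For readers puzzling over the sprintf format string in the mtype.d hunk quoted above: the literal 9 is the length of "TypeInfo_", so the length prefix covers "TypeInfo_" plus the hashed name. A minimal Python sketch of the same string construction follows. The function name is hypothetical, and what hashedname actually contains is not shown in the quoted hunk.

```python
def typeinfo_init_symbol(hashedname: str) -> str:
    """Mirror of the format "_D%lluTypeInfo_%.*s6__initZ" from the diff."""
    prefix = "TypeInfo_"          # 9 characters, hence the literal 9 in the diff
    ident = prefix + hashedname   # identifier whose length is emitted as the prefix
    return f"_D{len(ident)}{ident}6__initZ"

if __name__ == "__main__":
    # "abc" is a placeholder, not a real hashed name.
    print(typeinfo_init_symbol("abc"))
```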

JohanEngelen · 2016-04-22T18:06:18Z

Many (aggravating) reasons for it.

dnadlinger · 2016-04-22T18:07:29Z

If the switch is marked as experimental and it is made clear that it might disappear/behave differently in the future (release notes, …), it should probably be fine to add it as-is. I would very much hope that in the long term this remains a quick band-aid fix for Weka, though, until we get a proper upstream solution.

JohanEngelen · 2016-05-24T09:48:35Z

Merging when green on testers.

Ideas for future work:

use a faster hasher
make a hashing OutBuffer that immediately hashes incoming characters once the length passes a certain threshold, and pass that OutBuffer to the Mangler (see dmangle.d). This would avoid allocating a lot of memory for large symbol names.
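The second idea could look roughly like the following Python sketch. It is only an illustration of the technique: HashingBuffer is a hypothetical stand-in, not the front end's OutBuffer, and it assumes MD5 via hashlib's incremental update interface. Text is buffered while the name is short; once the threshold is exceeded, the buffered text is fed to the hasher and dropped, and later writes go straight into the hash.

```python
import hashlib

class HashingBuffer:
    """Hypothetical buffer that switches to streaming MD5 past a threshold."""

    def __init__(self, threshold: int):
        self.threshold = threshold
        self.length = 0      # total characters written so far
        self._chunks = []    # raw text, kept only while at or below the threshold
        self._md5 = None     # created once the threshold is exceeded

    def write(self, text: str) -> None:
        self.length += len(text)
        if self._md5 is not None:
            # Already hashing: no further storage is needed.
            self._md5.update(text.encode("utf-8"))
            return
        self._chunks.append(text)
        if self.length > self.threshold:
            # Switch modes: hash what was buffered, then release it.
            self._md5 = hashlib.md5("".join(self._chunks).encode("utf-8"))
            self._chunks = []

    def result(self) -> str:
        """Raw text for short names, hex digest for long ones."""
        if self._md5 is None:
            return "".join(self._chunks)
        return self._md5.hexdigest()

if __name__ == "__main__":
    buf = HashingBuffer(threshold=10)
    for piece in ["_D3one", "3two", "5three", "Z"]:
        buf.write(piece)
    print(buf.result())
```

With this shape, memory use is bounded by roughly the threshold plus the size of one write, instead of the full symbol length.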

dnadlinger · 2016-05-24T13:11:13Z

Would it make sense to use DMD's approach of only hashing the names before emitting them to object files? This way, the Phobos code relying on .mangleof wouldn't be affected.

JohanEngelen · 2016-05-24T15:20:20Z

> Would it make sense to use DMD's approach of only hashing the names before emitting them to object files? This way, the Phobos code relying on .mangleof wouldn't be affected.

Hm?
That is what this PR has been doing from day 1. ;)

dnadlinger · 2016-05-24T15:23:52Z

Ah, sorry, I had misremembered that (and the location of the diff in mtype.d).

JohanEngelen force-pushed the hashing branch from 0c57b31 to f630934 on April 21, 2016 18:08

dnadlinger reviewed Apr 22, 2016

JohanEngelen mentioned this pull request May 20, 2016

compress symbol names in object file dlang/dmd#5793

Closed

JohanEngelen force-pushed the hashing branch from f630934 to 09c9a48 on May 24, 2016 09:32

Add experimental -hash-threshold option to hash very long symbol names.

776e32d

JohanEngelen force-pushed the hashing branch from 09c9a48 to 776e32d on May 24, 2016 09:41

JohanEngelen changed the title ~~Add experimental -hashthres option to hash very long symbol names.~~ Add experimental -hash-threshold option to hash very long symbol names. May 24, 2016

dnadlinger merged commit b5a7f1b into ldc-developers:master May 24, 2016

JohanEngelen deleted the hashing branch May 24, 2016 15:24
