Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add experimental -hash-threshold option to hash very long symbol names. #1445

Merged
merged 1 commit into from
May 24, 2016

Conversation

JohanEngelen
Copy link
Member

This adds MD5 hashing of symbol names that are larger than threshold set by -hashthres.

What is very unfortunate is that std.traits depends on the mangled name, doing string parsing of the mangled name of symbols to obtain symbol traits. This means that mangling cannot be changed (dramatically, like hashing) at a high level, and the hashing has to be done on a lower level.

Hashed symbols look like this:
_D3one3two5three3L3433_46a82aac733d8a4b3588d7fa8937aad66Result3fooZ
ddemangle gives:
one.two.three.L34._46a82aac733d8a4b3588d7fa8937aad6.Result.foo
Meaning: this symbol is defined in module one.two.three on line 34. The identifier is foo and is contained in the struct or class Result.

Symbols that may be hashed:

  • functions
  • struct/class initializer
  • vtable
  • typeinfo (needed surgery inside FE code)

The feature is experimental, and has been tested on Weka.io's codebase. Compilation with -hashthres=1000 results in a binary that is half the size of the original (201MB vs. 461MB). I did not observe a significant difference in total build times. Hash threshold of 8000 gives 229MB, 800 gives 195MB binary size: there is not much gain after a certain hash threshold.
Linking Weka's code fails with a threshold of 500: phobos contains a few large symbols (one larger than 8kb!) and this PR currently does not disable hashing of symbols that are inside phobos, hence "experimental". Future work could try to figure out whether a symbol is inside phobos or not.

@JohanEngelen
Copy link
Member Author

phobos contains a few large symbols (one larger than 8kb!)

Correction: it is ~5322 bytes large

@wilzbach
Copy link
Contributor

-hashthres

I was just confused by this, maybe you don't mind the extra 4 chars and make it -hashthreshold, s.t. it's name is clearer?

@JohanEngelen
Copy link
Member Author

Hm, reading through our other options, I guess -hash-threshold would be best.

@JohanEngelen
Copy link
Member Author

ping @klickverbot @redstar

@dnadlinger
Copy link
Member

Why didn't you implement it in D? That way, there would be a chance of submitting it to upstream… ;)

name = namebuf.ptr;
sprintf(name, "_D%lluTypeInfo_%.*s6__initZ", cast(ulong)9 + hashedname.length, hashedname.length, hashedname.ptr);
}
else
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems rather unfortunate. Perhaps we should ditch IN_LLVM for such cases, or replace it by an enum set from the version (so you can do if (IN_LLVM && …). It doesn't seem like we would ever want to try using LDC's front end sources to build DMD…

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. We already have the enum IN_LLVM, so I'll use that. I also think the copying is ugly/stupid.
I will indent the DDMD source, so that we are notified of (perhaps relevant) changes by merge errors.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally, I'd probably keep the indentation as it is, so that the diff is kept tidy and the LDC-specific part is made painfully obvious when browsing the source. But I guess one could always use diff -w for the former…

@JohanEngelen
Copy link
Member Author

Many (aggravating) reasons for it.

@dnadlinger
Copy link
Member

dnadlinger commented Apr 22, 2016

If the switch is marked as experimental and it is made clear that it might disappear/behave differently in the future (release notes, …), it should probably be fine to add it as-is. I would very much hope that in the long term this remains a quick band-aid fix for Weka, though, until we get a proper upstream solution.

@JohanEngelen JohanEngelen changed the title Add experimental -hashthres option to hash very long symbol names. Add experimental -hash-threshold option to hash very long symbol names. May 24, 2016
@JohanEngelen
Copy link
Member Author

Merging when green on testers.

Ideas for future work:

  • use faster hasher
  • make a hashing OutBuffer that immediately hashes incoming characters when the length passes a certain threshold, and pass that outbuffer to the Mangler (see dmangle.d). This would prevent the need for allocating a lot of memory storage for large symbol names.

@dnadlinger
Copy link
Member

Would it make sense to use DMD's approach of only hashing the names before emitting them to object files? This way, the Phobos code relying on .mangleof wouldn't be affected.

@JohanEngelen
Copy link
Member Author

Would it make sense to use DMD's approach of only hashing the names before emitting them to object files? This way, the Phobos code relying on .mangleof wouldn't be affected.

Hm?
That is what this PR has been doing from day 1. ;)

@dnadlinger dnadlinger merged commit b5a7f1b into ldc-developers:master May 24, 2016
@dnadlinger
Copy link
Member

Ah, sorry, I had misremembered that (and the location of the diff in mtype.d).

@JohanEngelen JohanEngelen deleted the hashing branch May 24, 2016 15:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants