Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Experiment with a more straight-forward template emission scheme #3422

Closed
wants to merge 19 commits into from

Conversation

kinke
Copy link
Member

@kinke kinke commented May 4, 2020

No description provided.

@kinke
Copy link
Member Author

kinke commented May 4, 2020

I've quickly experimented a bit wrt. this, building dub (as presumably Phobos-heavy code) in the repo root dir on Win64 with

ldmd2 -c -of=app.obj -Isource -version=DubUseCurl -version=DubApplication -O -w @build-files.txt

Results:

  • without patch: ~98 secs, .obj size 9.5M
  • with needsCodegen() patch: ~98 secs, .obj size 9.6M
  • with patch and linkonce_odr linkage (instead of weakonce_odr, i.e., now discardable; edit: for dub alone, not druntime/Phobos): ~76 secs, .obj size 5.2M

All 3 versions can be linked with the corresponding druntime/Phobos; sizes are almost identical (~3.3M; stripping is enabled by default), and the time required by the MS linker doesn't really matter here (~320ms for linkonce_odr vs. ~570ms for the other 2).

@JohanEngelen
Copy link
Member

I'll try and test this on weka's codebase this week. (lots of templates and separate compilation)

@kinke
Copy link
Member Author

kinke commented May 6, 2020

I guess I only got 'lucky' with dub when compiling to a single .obj; compiling the defaultlibs with this (=> separate objects per module) shows that it's absolutely not working, so at least the 2nd diff wrt. rootImports() should be utter crap. - I had no idea this would potentially have such an effect; in hindsight of course it makes sense, allowing the optimizer to discard instead of optimize unused stuff (the non-optimized .obj size was ~11.5M for all 3 versions, just differing by some KB).

Quoting Iain from Slack:

Yeah, I'm all too aware of the potential speed-ups. Mike's STM project took around 1m30s to compile, and size was running into the MBs, I then added -fno-weak (makes all templates non-public) and it instead took ~3-5s to compile, and size 64k - basically, the middle/back-end are spending huge amounts of energy trying to optimize code that are ultimately unused, but D's emission strategy is doesn't allow them to remove anything.

@kinke
Copy link
Member Author

kinke commented May 8, 2020

I've experimented a bit to get a better understanding of how this currently works.

Let's use this simple structure:

// --- multi_a.d:
import multi_b;
import multi_c;

void templ_a()() {}

void foo_a()
{
    templ_b();
    templ_c();
}

// --- multi_b.d:
import multi_c;
import multi_a;

void templ_b()() { templ_c(); }

void foo_b()
{
    templ_a();
    templ_c();
}

// --- multi_c.d:
void templ_c()() {}

multi_a: instantiates templ_b and templ_c directly; templ_b instantiates templ_c a 2nd time
multi_b: instantiates templ_a and templ_c directly

Observations when compiling the 3 modules to separate object files (-c) in a single cmdline:

  • TemplateInstances are sort-of uniqued across modules; only the first TemplateInstance is fully analyzed and ready for codegen (populated members), siblings (equivalent instantiations) are apparently lightweight and stored in a linked list (tnext; head is the first/primary TemplateInstance).
  • Only the primary TemplateInstance might make it into Module.members of a root module and might so be codegen'd. The preferred module is the one containing the template declaration.

For the example above, this means that with master, templ_a is emitted once into multi_a, templ_b once into multi_b, and templ_c once into multi_c.

What I'd like to experiment with would be a C++-like emission strategy, emitting all directly and indirectly instantiated templates once per instantiating module (end goal: object file). I.e.:

  • multi_a containing templ_b and (only one instantiation of) templ_c
  • multi_b containing templ_a and templ_c
  • multi_c being empty

Then we should be able to use linkonce_odr and probably fix a bunch of template culling issues along the way.
Edit: And cross-module-inlining shouldn't be required for templates anymore.

I've pushed a little sketch, which adds more TemplateInstances as module members (corresponding to the goal above for the little example). It doesn't work yet of course; a simple Dsymbol.arraySyntaxCopy() of the primary instantiation's members doesn't work. For that, we'd need a way to codegen the primary TemplateInstance's members multiple times.

@kinke
Copy link
Member Author

kinke commented May 8, 2020

Pinging @ibuclaw.

@JohanEngelen
Copy link
Member

fyi: for Weka's large templated codebase, the depth of template instantiations can be very deep, >1000. Determining whether the template needs to be instantiated by needsCodegen actually takes a significant amount of time, and also leads to stack exhaustion upon too deep recursion. The current hack is to simply return false (no codegen) when a recursion depth of 1000 is reached. Simplifying the template culling logic may help a lot.
But I also think a better strategy is needed to exclude function that only have been used in CTFE. For example by a simple onlyCTFE boolean for each function. onlyCTFE is set in CTFE instantiation contexts. When an existing function is called from a non-CTFE function, we do a check if onlyCTFE==true, and if true we set it to false and recurse into the function and do the same for all functions called within that function.

@kinke
Copy link
Member Author

kinke commented May 9, 2020

With the sketch, needsCodegen() in its current form would be absolutely superfluous (except maybe for your proposed CTFE-only check). All TIs (template instantiations), incl. non-primary ones are simply added to the instantiating module (which, for nested TIs, seems to be the instantiating module of the root TI). If it's a root module (I should probably skip non-roots), it will be codegen'd. The parent TIs shouldn't matter, and the linked list of siblings is currently (in the sketch) only used to prevent multiple siblings in a module's members.

@kinke
Copy link
Member Author

kinke commented May 9, 2020

For that, we'd need a way to codegen the primary TemplateInstance's members multiple times.

@dnadlinger et al.: Any ideas how to tackle this? [This would also be interesting for proper cross-module inlining when generating multiple object files at once.] I think IrDsymbol *Dsymbol::ir is one big obstacle, as it assumes at most one IR equivalent per symbol [nope, they are properly reset for each new IR module].

@kinke kinke force-pushed the needsCodegen branch 3 times, most recently from a948851 to 19e1b4d Compare May 9, 2020 23:36
@kinke
Copy link
Member Author

kinke commented May 10, 2020

Multiple codegen (of the primary TI) seems to be solved now; druntime and Phobos can be compiled, both all-at-once and each module separately. For the little example, templ_c is emitted into both multi_a and multi_b objects (now fully working as intended).

But as the countless undefined symbols show, a whole bunch of TIs still seems missing. I've tried without -linkonce-templates, but it only seems to reduce the number of undefined symbols and doesn't get rid of them.

@kinke kinke changed the title Test Experiment with a more straight-forward template emission scheme May 10, 2020
@kinke
Copy link
Member Author

kinke commented May 10, 2020

Okay, this utterly fails:

bool foo(void[] a, void[] b)
{
    return a == b;
}

Putting this into 2 identical files and compiling them both at once to separate object files, only one object gets the full tree of required TIs:

  • bool __equals!(void, void)(void[] lhs, void[] rhs)
    • ubyte[] __equals!(void, void).__equals.trustedCast!(ubyte[], void)(void[] r)
    • bool __equals!(ubyte, ubyte)(ubyte[] lhs, ubyte[] rhs)
      • ref ubyte __equals!(ubyte, ubyte).__equals.at!ubyte(ubyte[] r, size_t i)

The other object only gets (a secondary TI of) the top __equals!(void, void) (and only codegens that), so nested TIs seem not (always?) to be in the parent TI's members. Perhaps we'd need a list of TI children for a TI, not just the tinst parent, and recursively add the whole tree (of the primary TI) when appending a secondary TI to a module's members...

@kinke
Copy link
Member Author

kinke commented May 10, 2020

Pushed another little experiment, trying to add the nested TIs to the parent TI members instead of the module members. Works for the trivial example above wrt. __equals!(void, void), but it's still by no means enough for druntime/Phobos.

@kinke
Copy link
Member Author

kinke commented May 14, 2020

With a Phobos hello-world on Win64, 25 undefined symbols remain (before linker stripping), so it looks like this is slowly getting somewhere. - A TI may now be present multiple times in the members tree of a single module (nested in different parent TIs and/or also added as top-level TI to the module directly) which is obviously suboptimal. We do skip multiple function definitions in DtoDefineFunction(), but still. But this can be tackled later.

kinke added 8 commits June 6, 2020 02:50
…g module

I.e., include children TIs in speculative or non-root modules too,
because a non-primary sibling of the parent TI, in a root module,
requires all children.
The TIs making it here are either top-level TIs in some root module, or
TIs nested therein (possibly in a non-root primary instance though; these
have been skipped before).
@kinke
Copy link
Member Author

kinke commented Jun 6, 2020

I've taken a different route now, inspired by cross-module inlining, and codegen the template functions now whenever they are IR-declared instead of tampering with Module and TemplateInstance members. That's both much simpler and much more promising according to the remaining undefined symbol linker errors (an empty D main now links fine on Windows, a std.stdio hello world just needs a std.variant struct init symbol...). Static data in templates isn't treated the same way yet (i.e., not emitted into each referencing CU), so I try to use weak_odr as a hack for them for now.

The total size of all druntime/Phobos release object files on Win64 goes down from ~10.5 MB (master) to 8.5 MB with linkonce_odr here, although the number of IR function definitions can be very much higher.

@ibuclaw
Copy link
Contributor

ibuclaw commented Jun 6, 2020

Pinging @ibuclaw.

I guess that a new enum is required to denote each template emission strategy - judging from the changes here, they'd be called something like allOwned (dmd default), allInst (-allinst), and allSpeculative (this pull)?

@kinke
Copy link
Member Author

kinke commented Jun 6, 2020

allSpeculative for this would definitely be a misnomer; it shouldn't emit any extra speculative instantiations, but simply emit each referenced function and data inside template instantiations into each referencing (from the codegen layer, so no speculative stuff) compilation unit.

A problem is that the emission strategy should be consistent for all static libraries and objects of a linked project, as e.g. druntime/Phobos compiled with this cannot be linked against user code with the current emission strategy.

@ibuclaw
Copy link
Contributor

ibuclaw commented Jun 6, 2020

allSpeculative for this would definitely be a misnomer; it shouldn't emit any extra speculative instantiations, but simply emit each referenced function and data inside template instantiations into each referencing (from the codegen layer, so no speculative stuff) compilation unit.

OK.

A problem is that the emission strategy should be consistent for all static libraries and objects of a linked project, as e.g. druntime/Phobos compiled with this cannot be linked against user code with the current emission strategy.

But if this is done right, there won't be a need to give the user a choice over which emission strategy to use.

@dnadlinger
Copy link
Member

dnadlinger commented Jun 6, 2020

The total size of all druntime/Phobos release object files on Win64 goes down from ~10.5 MB (master) to 8.5 MB with linkonce_odr here, although the number of IR function definitions can be very much higher.

FWIW, I'd suggest that what actually matters are compile/link times of typical (moderately template-heavy) user applications instead of object file size – Phobos object file size might be a proxy for the typical number of duplicate templates, but I'm not sure how good of one it makes.

The current strategy (at least last time I looked it) is based on the assumption that code generation is expensive – especially for optimised builds –, and duplicate codegen is thus best avoided. As Johan pointed out earlier, the needsCodegen() "tree splitting" checks themselves can be quite expensive, certainly so in the current implementation (and it might be worth making it a bit less precise to make the checks cheaper), but nevertheless this implements a clear principle.

I haven't had a chance to look at the latest changes here yet, but it seems like the basic idea is to emit enough duplicates to be able to use linkonce_odr instead – right?

Is this basically DMD -allinst, but with more culling on the IR level? A bit more explicit documentation would be nice. It might also be interesting to compare this to the current solution but with (Thin)LTO and appropriate internal-ising of symbols before codegen.

@kinke
Copy link
Member Author

kinke commented Jun 7, 2020

Phobos hello-world now links on Windows, even the druntime testrunners (debug + release). For the Phobos debug testrunner, 199 undefined symbols (the vast majority of which confined to std.range.interfaces... - for release, 188 undefined symbols).

@dnadlinger
Copy link
Member

Why is weak_odr pinning for special symbols necessary?

@kinke
Copy link
Member Author

kinke commented Jun 7, 2020

Because I don't emit them into each referencing CU (yet?). If a root module instantiates some class template, it's going to be codegen'd in that some root module and define ClassInfo + interface vtbls there. Other CUs still just declare it, so linkonce_odr doesn't work for now.

@dnadlinger
Copy link
Member

Why treat class templates differently to others, though? (It might of course be that the usage structure between class and function templates differs enough in typical D that one strategy is better for classes and the other for free functions.)

@kinke
Copy link
Member Author

kinke commented Jun 7, 2020

My primary focus is getting this green first, to be able to conduct some performance tests. Then the details can be ironed out; the weak_odr template data hack has IMO probably a higher impact than the ClassInfos.

kinke added 5 commits July 11, 2020 13:25
Analogous to ClassInfos, incl. normal linkage (external for non-
templates, weak_odr for templates).

This enables to get rid of frontend logic wrt. whether to add
TypeInfoStructDeclarations to a module's members tree - previously,
it was defined as linkonce_odr in the owning module and each referencing
module (unless speculative) - and related extra semantic and codegen for
the special member functions.
I've gone a bit further and moved the entire TypeInfo emission for LDC
to the codegen layer; no TypeInfoDeclarations are added to the module
members anymore. Whenever we need a TypeInfo symbol during codegen, it
is declared or defined, and we don't need to rely on brittle frontend
logic with speculative-ness complications.

This might slightly increase compilation speed due to less emitted
TypeInfos and functions (possibly less work for the linker too).

A slight drawback is that the job of stripping unused struct TypeInfos
is fully delegated to the linker, as the TypeInfo is guaranteed to end
up in the owning object file due to no linkonce_odr.

Another theoretical drawback is that the optimizer can definitely not
inline xtoHash/xopEquals/xopCmp/xtoString/xdtor[ti]/xpostblit function
pointer indirections in non-owning CUs without LTO (neither the pointers
nor the special member functions are defined anymore).
These (public) members are probably hardly used directly though, and
instead used by the virtual TypeInfo_Struct methods equals/compare/
getHash/destroy/postblit, which are exclusively defined in druntime's
object.o (incl. the TypeInfo_Struct vtable) and aren't cross-module-
inlined anyway (without LTO).

Re-emitting the struct TypeInfos (and optionally the special member
functions too) into each referencing CU could be handled in our codegen
layer, which should be much simpler and more robust than the upstream
scheme.
Conflicts:
	dmd/dtemplate.d
	gen/declarations.cpp
	gen/typinf.cpp
	ir/iraggr.cpp
	ir/iraggr.h
	ir/irclass.cpp
@kinke
Copy link
Member Author

kinke commented Jul 18, 2020

For the Phobos debug testrunner, 199 undefined symbols

Now down to 19 (apparently all data, no functions anymore):

phobos2-ldc-unittest-debug.lib(iteration.obj) : error LNK2001: unresolved external symbol _D6object__T10RTInfoImplVAmA2i80i346ZQzyG2m
phobos2-ldc-unittest-debug.lib(setops.obj) : error LNK2001: unresolved external symbol _D6object__T10RTInfoImplVAmA2i80i8ZQxyG2m
phobos2-ldc-unittest-debug.lib(setops.obj) : error LNK2001: unresolved external symbol _D6object__T10RTInfoImplVAmA2i184i2048ZQBbyG2m
phobos2-ldc-unittest-debug.lib(setops.obj) : error LNK2001: unresolved external symbol _D6object__T10RTInfoImplVAmA2i176i1024ZQBbyG2m
phobos2-ldc-unittest-debug.lib(setops.obj) : error LNK2001: unresolved external symbol _D6object__T10RTInfoImplVAmA2i232i2048ZQBbyG2m
phobos2-ldc-unittest-debug.lib(setops.obj) : error LNK2001: unresolved external symbol _D6object__T10RTInfoImplVAmA3i640i2306406234045054976i0ZQBsyG3m
phobos2-ldc-unittest-debug.lib(setops.obj) : error LNK2001: unresolved external symbol _D6object__T10RTInfoImplVAmA3i632i1153203117022527488i0ZQBsyG3m
phobos2-ldc-unittest-debug.lib(setops.obj) : error LNK2001: unresolved external symbol _D6object__T10RTInfoImplVAmA3i696i2306406234045054976i0ZQBsyG3m
phobos2-ldc-unittest-debug.lib(curl.obj) : error LNK2019: unresolved external symbol _D3std4conv__T7enumRepTyAaTEQBa3net4curl4HTTP6MethodVQBai0ZQBtyQBn referenced in function _D3std4conv__T6toImplTAyaTEQz3net4curl4HTTP6MethodZQBlFNaNfQBhZQBp
phobos2-ldc-unittest-debug.lib(curl.obj) : error LNK2019: unresolved external symbol _D3std4conv__T7enumRepTyAaTEQBa3net4curl4HTTP6MethodVQBai1ZQBtyQBn referenced in function _D3std4conv__T6toImplTAyaTEQz3net4curl4HTTP6MethodZQBlFNaNfQBhZQBp
phobos2-ldc-unittest-debug.lib(curl.obj) : error LNK2019: unresolved external symbol _D3std4conv__T7enumRepTyAaTEQBa3net4curl4HTTP6MethodVQBai2ZQBtyQBn referenced in function _D3std4conv__T6toImplTAyaTEQz3net4curl4HTTP6MethodZQBlFNaNfQBhZQBp
phobos2-ldc-unittest-debug.lib(curl.obj) : error LNK2019: unresolved external symbol _D3std4conv__T7enumRepTyAaTEQBa3net4curl4HTTP6MethodVQBai3ZQBtyQBn referenced in function _D3std4conv__T6toImplTAyaTEQz3net4curl4HTTP6MethodZQBlFNaNfQBhZQBp
phobos2-ldc-unittest-debug.lib(curl.obj) : error LNK2019: unresolved external symbol _D3std4conv__T7enumRepTyAaTEQBa3net4curl4HTTP6MethodVQBai4ZQBtyQBn referenced in function _D3std4conv__T6toImplTAyaTEQz3net4curl4HTTP6MethodZQBlFNaNfQBhZQBp
phobos2-ldc-unittest-debug.lib(curl.obj) : error LNK2019: unresolved external symbol _D3std4conv__T7enumRepTyAaTEQBa3net4curl4HTTP6MethodVQBai5ZQBtyQBn referenced in function _D3std4conv__T6toImplTAyaTEQz3net4curl4HTTP6MethodZQBlFNaNfQBhZQBp
phobos2-ldc-unittest-debug.lib(curl.obj) : error LNK2019: unresolved external symbol _D3std4conv__T7enumRepTyAaTEQBa3net4curl4HTTP6MethodVQBai6ZQBtyQBn referenced in function _D3std4conv__T6toImplTAyaTEQz3net4curl4HTTP6MethodZQBlFNaNfQBhZQBp
phobos2-ldc-unittest-debug.lib(curl.obj) : error LNK2019: unresolved external symbol _D3std4conv__T7enumRepTyAaTEQBa3net4curl4HTTP6MethodVQBai7ZQBtyQBn referenced in function _D3std4conv__T6toImplTAyaTEQz3net4curl4HTTP6MethodZQBlFNaNfQBhZQBp
phobos2-ldc-unittest-debug.lib(curl.obj) : error LNK2019: unresolved external symbol _D3std4conv__T7enumRepTyAaTEQBa3net4curl4HTTP6MethodVQBai8ZQBtyQBn referenced in function _D3std4conv__T6toImplTAyaTEQz3net4curl4HTTP6MethodZQBlFNaNfQBhZQBp
phobos2-ldc-unittest-debug.lib(curl.obj) : error LNK2019: unresolved external symbol _D3std4conv__T7enumRepTyAaTEQBa3net4curl4HTTP6MethodVQBai9ZQBtyQBn referenced in function _D3std4conv__T6toImplTAyaTEQz3net4curl4HTTP6MethodZQBlFNaNfQBhZQBp
phobos2-ldc-unittest-debug.lib(typecons.obj) : error LNK2019: unresolved external symbol _D3std8typecons__T14sharedToStringS_DQBjQBi__T5TupleTiTOCQCdQCc19__unittest_L1617_C7FNfZ1AZQBt16__expand_field_1OQCfZQDvyAa referenced in function _D3std8typecons__T5TupleTiTOCQBbQBa19__unittest_L1617_C7FNfZ1AZQBt__T8toStringTDFNaNbNfMAxaZvTaZQBbMxFNaNfMQBcMKxSQEi6format__T10FormatSpecTaZQpZv

@kinke
Copy link
Member Author

kinke commented Jul 18, 2020

Should be finally mostly working - the few remaining failures look benign (=> would require test adaptations due to changed codegen order and different IR). Some promising Azure CI timings for Mac (the most stable Azure platform wrt. runtimes), with enabled assertions (for LDC & LLVM):

Step Master This Runtime delta
Build bootstrap LDC 4m 46s 4m 34s
Build LDC, LDC unittests & druntime/Phobos test runners 18m 46s 13m 15s -29%
Run lit-tests 51s 38s -25%
Run dmd-testsuite 5m 21s 3m 47s -29%

@kinke
Copy link
Member Author

kinke commented Jul 18, 2020

On my Win64 box, I'm seeing a big improvement for building the (optimized release) Phobos test runner (ninja -j4 phobos2-test-runner), with enabled assertions again: 8m 13s => 5m 10s (-37%)

@ibuclaw
Copy link
Contributor

ibuclaw commented Jul 19, 2020

On my Win64 box, I'm seeing a big improvement for building the (optimized release) Phobos test runner (ninja -j4 phobos2-test-runner), with enabled assertions again: 8m 13s => 5m 10s (-37%)

That's definitely in line with what I'd expect. I've been testing switching the default template linkage from vague to weak, and I've seen compilation times of a few phobos modules go from 1m 30s to just over 3m with -O2 -funittest (triggering timeout failures).

@kinke
Copy link
Member Author

kinke commented Jul 20, 2020

-unittest affects template emission though, so that comparison, coupled with inlining, should be the best case scenario for this scheme. I've just compared normal Phobos build times and static .lib file sizes on Win64, all object files emitted in a single compiler cmdline:

Ninja target Master This
phobos2-ldc-debug 19.6 s, 17.7 MB 31.9 s (+63%), 31.1 MB (+76%)
phobos2-ldc 63.1 s, 8.5 MB 73.6 s (+17%), 6 MB (-29%)

As expected, the significantly increased number of IR definitions per IR module leads to a heavy compiler slow-down if the optimizer isn't run. So unsurprisingly no good for debug builds.

For release builds, the impact on compile times probably varies a lot from project to project and how it's built (number of object files). As a bonus, template functions are automatically available for cross-module inlining, reducing the need for LTO.

@kinke
Copy link
Member Author

kinke commented Nov 27, 2020

Finally superseded by #3600.

@kinke kinke closed this Nov 27, 2020
@kinke kinke deleted the needsCodegen branch November 27, 2020 01:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants