Improve static map precalculated basic cases.#7884
Improve static map precalculated basic cases.#7884schveiguy wants to merge 1 commit intodlang:masterfrom
Conversation
|
Thanks for your pull request, @schveiguy! Bugzilla referencesYour PR doesn't reference any Bugzilla issue. If your PR contains non-trivial changes, please reference a Bugzilla issue or create a manual changelog. Testing this PR locallyIf you don't have a local development environment setup, you can use Digger to test this PR: dub run digger -- build "master + phobos#7884" |
|
ping @John-Colvin since Adam told me you were the one who found the 150 threshold to be superior. Maybe this gives some possibility of improvement for your test case? My test case was pretty artificial. |
0569206 to
8d0a614
Compare
and less memory, we can store 50x as many linear expansions for staticMap. Provide a mechanism to set the threshold for avoiding recursion.
8d0a614 to
9c943d7
Compare
|
I don't think an explicit |
I don't exactly agree. Even with a "good default" there may be cases where the default isn't as good as you can get. Honestly, when I started doing this, I expected to just set the default to 5000 and have good performance. But the testing I did shows that there are cases where it consumes way more memory (because of the caching system in DMD). It seems a waste to be able to store 5000 elements in the same space as 150 elements, and not provide a mechanism to use it. So I put it in there as a mechanism to allow people to play with it, maybe to find a case where it makes sense. Perhaps it needs to be behind a version? This is all static parameters, so we can do that reasonably easy. |
|
I think the solution is to have a dumb (some sensible default) staticMap which is an instantiation of a encapsulated parameterized implementation, then expose both properly but separately - the latter could probably have a long enough name to suggest it's not meant to be thrown around without caution?. |
|
staticMap is pretty simple. The actual implementation is only 4 lines because of the requirement to recurse. Maybe I can turn the staticMapBasicCases into a public string, and then people can build what they want with that? |
Why not? You can choose to not use it and go with the default.
I think that we can merge this as is and add this if folks require it. |
|
Does anyone have any objections to merging this? |
|
@CyberShadow any way in which we can restart the documentation tester when it timeouts (besides re-pushing or closing and reopening the PR)? |
My objections to adding a marginally useful or non-useful template parameter are:
The reasons I am not yet convinced this template parameter is useful are:
|
|
@n8sh Those are all valid points. @schveiguy Can you provide any benchmarks or places in phobos that would benefit from this addition? |
|
ping @schveiguy |
|
Sorry, this got lost in my inbox. My code to generate the test code for my use cases was: import std.stdio;
import std.conv;
import std.algorithm;
import std.range;
void main(string[] args)
{
writeln("import std.meta;");
writeln("void main() {}");
writeln("alias pred(T) = const(T);");
foreach(i; 0 .. args[1].to!int)
{
writefln("struct S%s {int v;}", i);
writefln("void foo%s(staticMap!(pred, %-(%s,%))) {}", i, iota(i+1).map!(v => text("S", v)));
}
}This should generate a given number of static map cases which should trigger a lot of caching (when recursion is used). In my tests, the version in this PR used slightly less compile-time than the current phobos code, and less memory. As I said, upping the threshold resulted in slower compile times, probably due to the lack of caching (but I did not prove this out). In real-world cases, I don't know how much this would affect compile times. I would expect less memory usage, but I'm not 100% sure. |
True, but that is a significant problem to solve with these kinds of meta templates. Compile times matter to people. In fact, the current code was used to significantly speed up compile times, and so was considered a useful change, even though it didn't change the result.
My original intention was that "always" would be the answer. But testing shows that it's not always the case. My original intention was not to even have the parameter, as I assumed less CTFE = faster/better. As it stands, I don't want to make things slower, but I also don't want to limit the possibility of people finding places where code can be sped up. An entirely reasonable compromise would be to limit the cases to 150 without a parameter, and just avoid storing so much string data in the compiler.
staticMap with >150 elements is probably not common anyway.
This is true, I don't know of cases where it would speed up compilation. As I said, the original expectation I have was that it would always be faster/better. There are a lot of factors interacting together. My hope was that providing the tuning parameter at least gives a possibility of playing with optimization when it might be important. If it's determined that the parameter shouldn't be provided, perhaps just removing the parameterized version is needed. Let me know what I should do. |
|
cc @atilaneves |
|
I read the whole thing again and still don't know if this speeds up the general case or not. |
|
@atilaneves I think that the point @schveiguy was trying to make was that we don't know if it speeds up the general case, but we offer a possibility to the user to play with the expand factor and choose the best configuration for his scenario. |
|
What I do know is that we use enough compile-time memory to store 5000 elements that can be sliced, to store the same repeated string + 1 more element for 150 elements. So the new code uses no more memory than the original, and the CTFE cost is comparable. What isn't clear is why this doesn't speed up larger static maps. Probably it has to do with the test cases, and caching. But at 150 elements, the new code is comparable, while giving the possibility of trying larger static map mixins. It is difficult to test this because my firm belief is that a straight-up mixin is more performant on an individual case than recursing down to 150 elements, but that when you give the cache a lot of instances to work with (because you have populated it with the 1 to n-1 instances), then more instantiations can be faster than processing a large mixin. However, this is really hard to test because you can't repeat the same thing over and over, it will just hit the cache. Hm... this gives me an idea, maybe instead of testing things that would be cached, I should test things that wouldn't be (like instead of 500 different structs, do a list of 500 of the same structs, but do that over and over for 100 different structs). I'll get back to you. If that doesn't perform better, then I think this probably, while intuitively making sense as being faster, doesn't fit with the current compiler architecture. |
|
Running that test, it's still worse (and noticeably so). I'll close this, as it seems not to be an improvement. I still feel like the compiler internals are having some sort of pathological case with this, but I don't know enough about the compiler to understand why. The tradeoffs don't make sense when you look at the results. Someone mentioned perhaps |
|
Looking at dmd source, I think it is running a CTFE engine for each parameter to mixin to evaluate the expression and append the stringified result to an OutBuffer. So I think, ironically, concatenation is actually faster (only one invocation of the CTFE engine), but I don't know 100%. It's still a mystery to me why this is less performant than the mixin of a single string (but which still needs CTFE to evaluate the index). I have a feeling a better CTFE engine would make the differences between these two solutions negligible. Even the recursive version that stops at 150 elements -- it doesn't take too many steps to get down to 150, even from a really large number (4 levels would need a static map of 2400 elements). So perhaps the 150 is a reasonable compromise. Maybe I can adjust this to just generate 150 elements, and see what happens. |
|
cc @UplinkCoder |
Instead of storing individual mixins for all the cases from 0 to n, store a single string with all the Args from 0 to n, and then just slice off the ones needed.
This allows storing more elements with less space.
I tested with just upping the threshold to 5008 by default instead of 150, and I got some weird results. On code that did subsequent staticMap on 1 to 500 items, the compiler used 50% more memory and took slightly longer to use strictly the mixin instead of recursion. So I left the threshold of 150 by default, and people can use the other overload if they want to experiment with using less/no recursion. I also did not forward from one to the other to avoid adding another defaulted compile-time parameter to existing uses.