-
Notifications
You must be signed in to change notification settings - Fork 62
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Decrease time for world.gop.reduce #547
Conversation
src/madness/world/worldgop.h
Outdated
auto buf0 = std::unique_ptr<T[]>(new T[nelem_per_maxmsg]); | ||
auto buf1 = std::unique_ptr<T[]>(new T[nelem_per_maxmsg]); | ||
std::unique_ptr<T[]> buf0 = (child0 != -1) | ||
? std::unique_ptr<T[]>(new T[std::min(nelem_per_maxmsg,nelem)]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does std::make_unique
not work here? The less new
, the better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll test timing with make_unique
, when I ran tests with it by just switching the original code it resulted in longer test times in MPQC.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's annoying. For future reference, if there is a slowdown, there's a good chance this is the cause. Not that we can do anything about that as long as we maintain C++17 support.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haven't seen a difference in the timings between the two in the MPQC test cases. I'll push the std::make_unique
implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would avoid std::make_unique
due to the issues discussed in Jonathon's link ... no reason to pay for value initialization of primitive types.
In fact, since this function only makes sense when T
has trivial default ctor and trivial copy ctor (otherwise just copying bytes from MPI will not produce valid T
objects) then we should switch to aligned malloc ... This will avoid us having to pay for initialization in the case of T=std::complex.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should now be accomplished with std::malloc
. I tried std::aligned_alloc
but the alignment constraint using alignof(T)
resulted in a nullptr
return when child process was present.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried
std::aligned_alloc
but the alignment constraint usingalignof(T)
resulted in anullptr
return when child process was present.
possibly due to asking for size that is not a multiple of alignment? See: https://en.cppreference.com/w/cpp/memory/c/aligned_alloc#Notes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Double-checked, that doesn't seem to be the issue. From the code below and running h2-sci-x2c-1root-np2.json
std::cout << "Alignment: " << alignof(T) << " \n"
<< "Test: " << (sizeof(T) * std::min(nelem_per_maxmsg, nelem)) % alignof(T) << std::endl;
the output shows that the size is always a multiple of alignment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the problem with the current solution is that alignment of T
is not guaranteed ... I pushed changes to reduce and concat0 that hopefully address it, but not tested, please test, report
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using the most recent changes, the test set of sci-x2c in MPQC is decreased from 27.2s to 13.9s compared to the unaligned malloc.
92aea42
to
30ce5ea
Compare
1bf0405
to
025f80c
Compare
c4112e5
to
430e153
Compare
Decreases time for world.gop.reduce by shortening the buffer length to the minimum number of elements and only forming the buffers if the child process exists. Significantly, speeds up sum reduction for
std::complex
type.@evaleev