Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Decrease time for world.gop.reduce #547

Merged
merged 6 commits into from
Sep 16, 2024

Conversation

EricaCMitchell
Copy link
Collaborator

@EricaCMitchell EricaCMitchell commented Sep 12, 2024

Decreases time for world.gop.reduce by shortening the buffer length to the minimum number of elements and only forming the buffers if the child process exists. Significantly, speeds up sum reduction for std::complex type.
@evaleev

auto buf0 = std::unique_ptr<T[]>(new T[nelem_per_maxmsg]);
auto buf1 = std::unique_ptr<T[]>(new T[nelem_per_maxmsg]);
std::unique_ptr<T[]> buf0 = (child0 != -1)
? std::unique_ptr<T[]>(new T[std::min(nelem_per_maxmsg,nelem)])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does std::make_unique not work here? The less new, the better.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll test timing with make_unique, when I ran tests with it by just switching the original code it resulted in longer test times in MPQC.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's annoying. For future reference, if there is a slowdown, there's a good chance this is the cause. Not that we can do anything about that as long as we maintain C++17 support.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't seen a difference in the timings between the two in the MPQC test cases. I'll push the std::make_unique implementation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would avoid std::make_unique due to the issues discussed in Jonathon's link ... no reason to pay for value initialization of primitive types.

In fact, since this function only makes sense when T has trivial default ctor and trivial copy ctor (otherwise just copying bytes from MPI will not produce valid T objects) then we should switch to aligned malloc ... This will avoid us having to pay for initialization in the case of T=std::complex.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should now be accomplished with std::malloc. I tried std::aligned_alloc but the alignment constraint using alignof(T) resulted in a nullptr return when child process was present.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried std::aligned_alloc but the alignment constraint using alignof(T) resulted in a nullptr return when child process was present.

possibly due to asking for size that is not a multiple of alignment? See: https://en.cppreference.com/w/cpp/memory/c/aligned_alloc#Notes

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Double-checked, that doesn't seem to be the issue. From the code below and running h2-sci-x2c-1root-np2.json

std::cout << "Alignment: " << alignof(T) << " \n"
                << "Test: " << (sizeof(T) * std::min(nelem_per_maxmsg, nelem)) % alignof(T) << std::endl;

the output shows that the size is always a multiple of alignment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the problem with the current solution is that alignment of T is not guaranteed ... I pushed changes to reduce and concat0 that hopefully address it, but not tested, please test, report

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using the most recent changes, the test set of sci-x2c in MPQC is decreased from 27.2s to 13.9s compared to the unaligned malloc.

@evaleev evaleev merged commit 0d87255 into m-a-d-n-e-s-s:master Sep 16, 2024
8 of 12 checks passed
evaleev added a commit to ValeevGroup/tiledarray that referenced this pull request Sep 17, 2024
evaleev added a commit to ValeevGroup/tiledarray that referenced this pull request Sep 17, 2024
evaleev added a commit to ValeevGroup/tiledarray that referenced this pull request Sep 17, 2024
evaleev added a commit to ValeevGroup/tiledarray that referenced this pull request Sep 17, 2024
bimalgaudel pushed a commit to ValeevGroup/tiledarray that referenced this pull request Sep 25, 2024
@EricaCMitchell EricaCMitchell deleted the pr-world_reduce branch October 22, 2024 21:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants