Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ROCm & rocFFT #583

Merged
merged 12 commits into from
Sep 27, 2021
Merged

ROCm & rocFFT #583

merged 12 commits into from
Sep 27, 2021

Conversation

ax3l
Copy link
Member

@ax3l ax3l commented Aug 10, 2021

Implement rocFFT from equivalent WarpX routines.
Add HIP CI.

Note: do to an upstream ROCm bug in 4.2.0, you need to build either with ROCm 4.1.0 or ROCm 4.3.0 (and corresponding rocFFT releases), please.

ROCm is still a moving target in AMReX, so don't hesitate to reach out on the AMReX/WarpX channels.
For Cray/HPE software environments with MPI (e.g., OLCF Spock), also take note of this improvement for CMake 3.22+: https://gitlab.kitware.com/cmake/cmake/-/merge_requests/6264 and the work-around posted therein to build in the meantime.

  • Small enough (< few 100s of lines), otherwise it should probably be split into smaller PRs
  • Tested (describe the tests in the PR description)
  • Runs on GPU (basic: the code compiles and run well with the new module)
  • Contains an automated test (checksum and/or comparison with theory)
  • Documented: all elements (classes and their members, functions, namespaces, etc.) are documented
  • Constified (All that can be const is const)
  • Code is clean (no unwanted comments, )
  • Style and code conventions are respected at the bottom of https://github.com/Hi-PACE/hipace
  • Proper label and GitHub project, if applicable

@ax3l ax3l added component: fields About 3D fields and slices, field solvers etc. GPU Related to GPU acceleration CI Continuous integration, checksum and analysis tests, GitHub Actions, etc. labels Aug 10, 2021
@ax3l ax3l force-pushed the topic-rocFFT branch 5 times, most recently from 96c4e19 to 4217cf3 Compare August 10, 2021 20:47
@ax3l ax3l requested a review from MaxThevenet August 10, 2021 20:49
@ax3l ax3l mentioned this pull request Aug 11, 2021
9 tasks
@ax3l
Copy link
Member Author

ax3l commented Aug 11, 2021

@MaxThevenet @wetzel-desy I updated the WrapRocDST implementation with the content from #584

Please feel free to review. I saw there are a few inline comments left in Execute that might need double-checking.
Also compare this to the WarpCuDST if needed, where we seem to handle forward/backward transformation slightly differently.

@ax3l ax3l changed the title [WIP] rocFFT rocFFT Aug 11, 2021
@ax3l ax3l changed the title rocFFT ROCm & rocFFT Aug 11, 2021
ax3l and others added 4 commits August 11, 2021 10:27
Implement rocFFT from equivalent WarpX routines.
C++ inside `extern C` does not work:
https://github.com/ROCmSoftwarePlatform/rocFFT/blob/rocm-4.3.0/library/include/rocfft.h#L36-L42

Fixed in `develop` of rocFFT, but did not land in 4.3.0
Co-authored-by: Maxence Thevenet <maxence.thevenet@desy.de>
@ax3l
Copy link
Member Author

ax3l commented Aug 13, 2021

@SeverinDiederichs via Slack:
For testing, using hipace.do_small_dst = 0 is a good first step, which only uses the ShrinkC2R and ExpandR2C functions.
That way we do a 2D FFT and don't need the transpose, while the default does 1D DSTs (which need transposes).

Comment on lines 326 to 329
result = rocfft_execute(dst_plan.m_plan,
(void**)&(dst_plan.m_expanded_position_array),
(void**)&(dst_plan.m_expanded_fourier_array),
execinfo);
Copy link
Member

@MaxThevenet MaxThevenet Aug 13, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code compiles well indeed but when trying with this input file I get a segfault in Execute. The segfault disappears when commenting out these lines. Could also be coming from rocfft_plan_create. Note that this has no I/O, I think I get another segfault when I/O are on.

Copy link
Member Author

@ax3l ax3l Aug 13, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for testing!
Looking at the in-code comments by @wetzel-desy before these lines, I guess we need to double-check the API contract for rocfft here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wetzel-desy do you have any idea of what could go wrong here, and how to fix it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I honestly took inspiration from the WarpX implementation, which calls rocfft_execute. So there might be a difference between the FFT and DST with regard to the underlying data types of the two arrays that are supplied as input. But as I am not involved with the overall structure, I wouldn't know how to fix this without testing... Sorry.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this work?

void* in[2] = {dst_plan.m_expanded_position_array->dataPtr(), nullptr};
void* out[2] = {dst_plan.m_expanded_fourier_array->dataPtr(), nullptr};

result = rocfft_execute(dst_plan.m_plan,
                        in,
                        out,
                        execinfo);

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent, that did it indeed!! With the correct profile and submission script, I unfortunately didn't manage to get openPMD working yet (get a runtime error) but adding some in-situ diags with

	auto& ptile = m_multi_beam.getBeam(0);
        const auto& aos = ptile.GetArrayOfStructs();
        const auto& pos_structs = aos.begin();
        amrex::Gpu::DeviceVector<amrex::Real> dV(1);
        amrex::Vector<amrex::Real> hV(1);
        auto pdV = dV.begin();
        amrex::ParallelFor(
            ptile.numParticles(),
            [=] AMREX_GPU_DEVICE (long ip) {
		amrex::Real x = pos_structs[ip].pos(0);
                amrex::Gpu::Atomic::Add(pdV, x*x);
            });
        amrex::Gpu::copy(amrex::Gpu::deviceToHost, dV.begin(), dV.end(), hV.begin());
        amrex::Print()<<hV[0]<<'\n';

line 400 in Evolve I could check that the evolution was the same on Spock as on Summit. Thank you all!
Screenshot 2021-09-11 at 18 20 13.

@MaxThevenet
Copy link
Member

MaxThevenet commented Sep 12, 2021

This PR is now ready for review IMO. Compilation and execution work well, and the results (single GPU, 20 time step, comparing the evolution of the beam width) agree with a simulation on an NVIDIA GPU. The main piece missing is openPMD I/O: I currently get a segfault when running this input file with I/O, with the following Backtrace:

Hipace::Evolve() at ??:?
OpenPMDWriter::InitDiagnostics(int, int, int, int) at ??:?
openPMD::Series::Series(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, openPMD::Access, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) at ??:?
openPMD::internal::SeriesInternal::SeriesInternal(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, openPMD::Access, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) at ??:?
openPMD::SeriesInterface::init(std::shared_ptr<openPMD::AbstractIOHandler>, std::unique_ptr<openPMD::SeriesInterface::ParsedInput, std::default_delete<openPMD::SeriesInterface::ParsedInput> >) at ??:?
openPMD::SeriesInterface::initDefaults(openPMD::IterationEncoding) at ??:?
bool openPMD::AttributableInterface::setAttribute<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) at ??:?
decltype(auto) mpark::detail::visitation::alt::visit_alt<mpark::detail::dtor, mpark::detail::destructor<mpark::detail::traits<char, unsigned char, short, int, long, long long, unsigned short, unsigned int, unsigned long, unsigned long long, float, double, long double, std::complex<float>, std::complex<double>, std::complex<long double>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::vector<char, std::allocator<char> >, std::vector<short, std::allocator<short> >, std::vector<int, std::allocator<int> >, std::vector<long, std::allocator<long> >, std::vector<long long, std::allocator<long long> >, std::vector<unsigned char, std::allocator<unsigned char> >, std::vector<unsigned short, std::allocator<unsigned short> >, std::vector<unsigned int, std::allocator<unsigned int> >, std::vector<unsigned long, std::allocator<unsigned long> >, std::vector<unsigned long long, std::allocator<unsigned long long> >, std::vector<float, std::allocator<float> >, std::vector<double, std::allocator<double> >, std::vector<long double, std::allocator<long double> >, std::vector<std::complex<float>, std::allocator<std::complex<float> > >, std::vector<std::complex<double>, std::allocator<std::complex<double> > >, std::vector<std::complex<long double>, std::allocator<std::complex<long double> > >, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::array<double, 7ul>, bool>, (mpark::detail::Trait)1>&>(mpark::detail::dtor&&, mpark::detail::destructor<mpark::detail::traits<char, unsigned char, short, int, long, long long, unsigned short, unsigned int, unsigned long, unsigned long long, float, double, long double, std::complex<float>, std::complex<double>, std::complex<long double>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::vector<char, std::allocator<char> >, std::vector<short, std::allocator<short> >, std::vector<int, std::allocator<int> >, std::vector<long, std::allocator<long> >, std::vector<long long, std::allocator<long long> >, std::vector<unsigned char, std::allocator<unsigned char> >, std::vector<unsigned short, std::allocator<unsigned short> >, std::vector<unsigned int, std::allocator<unsigned int> >, std::vector<unsigned long, std::allocator<unsigned long> >, std::vector<unsigned long long, std::allocator<unsigned long long> >, std::vector<float, std::allocator<float> >, std::vector<double, std::allocator<double> >, std::vector<long double, std::allocator<long double> >, std::vector<std::complex<float>, std::allocator<std::complex<float> > >, std::vector<std::complex<double>, std::allocator<std::complex<double> > >, std::vector<std::complex<long double>, std::allocator<std::complex<long double> > >, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::array<double, 7ul>, bool>, (mpark::detail::Trait)1>&) at ??:?

However there is no rush, I/O can be fixed in a subsequent PR.

Copy link
Member

@MaxThevenet MaxThevenet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks all good to me, I think this can be merged. Thank you!

@MaxThevenet MaxThevenet merged commit f933188 into Hi-PACE:development Sep 27, 2021
@ax3l ax3l deleted the topic-rocFFT branch November 30, 2021 06:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CI Continuous integration, checksum and analysis tests, GitHub Actions, etc. component: fields About 3D fields and slices, field solvers etc. GPU Related to GPU acceleration
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants