
Add support for host operations to executor #1232

Merged: 3 commits merged into develop from host_operations on Dec 18, 2022

Conversation

@upsj (Member) commented Dec 8, 2022

This adds the GKO_REGISTER_HOST_OPERATION macro which can be used to annotate a costly core-side computation for benchmarking and (later on, once #1036 is merged) profiling. Additionally, it adds the OperationLogger to the host executor in GPU executions, which will make host-side operations visible in detailed benchmark results.
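
As a rough usage sketch (the macro is the one this PR adds, but the registered function, its signature, and the result type below are invented for illustration; the real call site is the elimination forest code shown in the review discussion below):

```cpp
// Sketch only: compute_parents_impl and its signature are hypothetical.
#include <memory>

#include <ginkgo/ginkgo.hpp>

// A costly, inherently sequential core-side computation.
void compute_parents_impl(const gko::matrix::Csr<double, int>* mtx,
                          std::unique_ptr<int[]>& parents)
{
    // sequential host-side graph algorithm
}

// Generates a make_compute_parents(...) helper that wraps the function
// call in an Operation, so loggers and profilers see it by name.
GKO_REGISTER_HOST_OPERATION(compute_parents, compute_parents_impl);

void example(const gko::matrix::Csr<double, int>* mtx)
{
    auto exec = mtx->get_executor();
    std::unique_ptr<int[]> parents;
    // Runs on the host, but is logged against exec, the executor the
    // matrix lives on.
    exec->run(make_compute_parents(mtx, parents));
}
```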

TODO:

  • Add tests

Closes #1211
Closes #1113

@upsj added the 1:ST:ready-for-review label on Dec 8, 2022
@upsj upsj requested a review from a team December 8, 2022 18:44
@upsj upsj self-assigned this Dec 8, 2022
@ginkgo-bot added the labels mod:core, mod:reference, reg:benchmarking, reg:testing, type:factorization, type:preconditioner, and type:solver on Dec 8, 2022
@pratikvn (Member) left a comment:

I like the approach of having the function as a kernel instead of in the core, but does there need to be a separate register operation? Why not use exec->get_master()->run(...)?

 {
     using matrix_type = matrix::Csr<ValueType, IndexType>;
     const auto exec = mtx->get_executor();
     const auto host_exec = exec->get_master();
-    const auto forest = compute_elim_forest(mtx);
+    std::unique_ptr<elimination_forest<IndexType>> forest;
+    exec->run(make_compute_elim_forest(mtx, forest));
Member:

Is there some drawback in doing exec->get_master()->run(...) instead of having a separate call? Then you don't need the GKO_REGISTER_HOST_OPERATION at all, right?
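
For contrast, a minimal sketch of that alternative at the call site (assuming compute_elim_forest were registered as an ordinary backend kernel via GKO_REGISTER_OPERATION, reusing the names from the diff above):

```cpp
// Hypothetical alternative: dispatch the registered kernel to the master
// (host) executor explicitly instead of using a host operation.
template <typename ValueType, typename IndexType>
void forest_via_master(const gko::matrix::Csr<ValueType, IndexType>* mtx)
{
    auto exec = mtx->get_executor();
    std::unique_ptr<elimination_forest<IndexType>> forest;
    // Executes and is logged on get_master(), not on exec - loggers
    // attached only to exec never see the operation.
    exec->get_master()->run(make_compute_elim_forest(mtx, forest));
}
```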

@upsj (author) replied Dec 8, 2022:

It's not being logged with the executor the object lives on, which is also the reason we need the workaround in the benchmarks. With the suggested approach, we'd need to pass an additional Executor parameter, provide overloads for both OmpExecutor and ReferenceExecutor, and put them inside the kernels namespace, which IMO isn't a good place for core functionality.

Member:

In my opinion, anything computationally expensive shouldn't be happening on the host; it should be on the kernel side. That is also in line with the Ginkgo philosophy of kernels doing the hard work and the host orchestrating the calls. It is perfectly fine if these kernels are serial (with reference or with single-threaded OpenMP), but I don't think we should be doing anything intensive on the host.

@upsj (author):

The operations we are looking at here (symbolic factorizations, elimination tree, later on MC64, AMD, METIS, ...) are either inherently sequential or very hard to parallelize (see e.g. RCM), let alone implement on GPUs. So either we duplicate the same implementation in OpenMP and Reference and call the kernels for host data only, or we implement them in a single place, in this case core. I think the latter is the more sensible choice here.

Member:

I agree that the operations are sequential and probably not suited for GPUs. My point is that they perform potentially expensive computations (growing with problem size) and hence should not be on the core side. Regarding the duplication, we can just have a common kernel file (like we do for HIP and CUDA) and share the serial implementation between OpenMP (forced onto one thread) and reference.

I think performing computation on the host side can have implications for future design. For example, if we want to add some stream/asynchronous tasking for these operations. Performing these operations on the kernel side would allow us to keep the asynchronous tasking/streaming kernel interface and keep the core side clean.

IMO avoiding this also keeps our logging and profiling interface clean, without fragmenting it into host operations and kernel operations.

@upsj (author) replied Dec 9, 2022:

I think we need to unpack the meaning of core and kernel a bit more here: core is everything in libginkgo.so, kernel is everything in libginkgo_<backend>.so. You are talking about a conceptual separation that IMO doesn't apply here - we already have some expensive operations in core, mainly factorization-related. The only thing this PR does is make them visible to profilers regardless of the executor the objects live on. The logging interface is not impacted by this change, since to the loggers, host and kernel operations look the same.

If we were to make them kernel operations, we would have to do a lot more work: kernels can't create Ginkgo objects, so we would need to operate on arrays and combine them into Ginkgo objects on the core side again. As a minor point, it also doubles the compilation time and binary size of the resulting functions.
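
To illustrate the plumbing a kernel-based variant would need (a sketch; every name below is invented, assuming the kernel were registered with GKO_REGISTER_OPERATION so that a make_elim_forest_parents helper exists):

```cpp
// Kernel side (one implementation per backend): raw arrays only, since
// kernels cannot construct Ginkgo objects like elimination_forest.
namespace kernels {
template <typename IndexType>
void elim_forest_parents(std::shared_ptr<const gko::Executor> exec,
                         const IndexType* row_ptrs,
                         const IndexType* col_idxs,
                         gko::size_type num_rows, IndexType* parents);
}

// Core side: allocate outputs, dispatch the kernel, then rebuild the
// Ginkgo object from the raw array afterwards.
template <typename ValueType, typename IndexType>
std::unique_ptr<elimination_forest<IndexType>> compute_forest(
    const gko::matrix::Csr<ValueType, IndexType>* mtx)
{
    auto exec = mtx->get_executor();
    gko::array<IndexType> parents{exec, mtx->get_size()[0]};
    exec->run(make_elim_forest_parents(mtx->get_const_row_ptrs(),
                                       mtx->get_const_col_idxs(),
                                       mtx->get_size()[0],
                                       parents.get_data()));
    // hypothetical helper that wraps the raw array into the result object
    return assemble_forest(exec, std::move(parents));
}
```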

Member:

Yes, I would advocate for making them kernel operations, so this might be something we discuss before merging this PR.

Allocations should not happen inside an operation, and I think that is all the more reason to have allocations on the host side and do the operations on the kernel side. We log and profile the allocations separately anyway, so having allocations inside the host operation muddles the profiling and logging.

@upsj (author):

We allocate a lot inside kernels, e.g. SpGEMM/SpGEAM, sparselib bindings, all kernels that use temporary memory, ... Removing these allocations or moving them into core would complicate the control flow significantly.

@codecov bot commented Dec 9, 2022:

Codecov Report

Base: 91.55% // Head: 91.51% // Decreases project coverage by 0.03% ⚠️

Coverage data is based on head (9373771) compared to base (5b2f7a0).
Patch coverage: 77.89% of modified lines in pull request are covered.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #1232      +/-   ##
===========================================
- Coverage    91.55%   91.51%   -0.04%     
===========================================
  Files          556      556              
  Lines        47418    47455      +37     
===========================================
+ Hits         43413    43429      +16     
- Misses        4005     4026      +21     
Impacted Files Coverage Δ
include/ginkgo/core/base/executor.hpp 71.28% <ø> (ø)
core/test/factorization/elimination_forest.cpp 85.29% <50.00%> (-14.71%) ⬇️
reference/test/factorization/lu_kernels.cpp 97.70% <50.00%> (-2.30%) ⬇️
reference/test/factorization/cholesky_kernels.cpp 90.99% <68.75%> (-9.01%) ⬇️
test/factorization/cholesky_kernels.cpp 92.85% <71.42%> (-7.15%) ⬇️
core/factorization/elimination_forest.cpp 100.00% <100.00%> (ø)
core/factorization/lu.cpp 100.00% <100.00%> (ø)
core/factorization/symbolic.cpp 100.00% <100.00%> (ø)
test/base/executor.cpp 73.91% <100.00%> (+7.24%) ⬆️
reference/base/index_set_kernels.cpp 94.20% <0.00%> (+0.08%) ⬆️


@MarcelKoch MarcelKoch self-requested a review December 9, 2022 09:12
@MarcelKoch (Member) left a comment:

It seems a bit annoying that this approach doesn't allow for return values from the host kernels. Perhaps that could also be managed by adding a run_host to Executor (or just a run overload) that takes a HostOperation instead. I'm thinking of something like: https://godbolt.org/z/TTd71zKnq. With the appropriate macros, that should be as simple as what we have right now. In C++17 this could be simpler, see https://stackoverflow.com/a/53673648
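
Roughly what the linked godbolt does, as a standalone mock (none of this is Ginkgo's actual API; all names are invented):

```cpp
#include <utility>

// A host operation that wraps a callable, plus a non-virtual, templated
// run overload that forwards the callable's return value to the caller.
template <typename Closure>
class HostOperation {
public:
    explicit HostOperation(Closure op) : op_{std::move(op)} {}

    auto run() const { return op_(); }

private:
    Closure op_;
};

class Executor {
public:
    // Templated overload instead of a virtual function: the return type
    // is deduced per call, which a virtual Operation::run could not do.
    template <typename Closure>
    auto run(const HostOperation<Closure>& op) const
    {
        // a real implementation would log begin/end around the call here
        return op.run();
    }
};

int main()
{
    Executor exec;
    // C++17 class template argument deduction keeps the call site short.
    auto op = HostOperation{[] { return 42; }};
    return exec.run(op) == 42 ? 0 : 1;
}
```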

@upsj (author) commented Dec 9, 2022:

Yes, I agree this is a bit unergonomic. Do you see a way to make that work, considering that template functions can't be virtual?

@MarcelKoch (Member):

Which virtual function are you referring to, Executor::run or HostOperation::run? I don't think either of them needs to be virtual.

@upsj (author) commented Dec 9, 2022:

Having an overloaded function that is partially virtual and partially templated seems a bit dangerous, so I think we should choose only one of the two. Note that there is no such thing as a HostOperation ATM, and any potential later addition to what constitutes an Operation (e.g. work estimates, relocation and asynchronous execution for runtime systems) would need to be mirrored in HostOperation. If we added return types, I would ideally like to support them for both host and device operations, and have them look as similar as possible to each other.

@thoasm thoasm self-requested a review December 9, 2022 11:38
@MarcelKoch (Member):

We could use Executor::run_host, which might also be helpful for adding some verbosity. I think we could derive HostOperation from Operation; then changes that apply to Operations could be easily integrated.
But of course that would not help with return values for device operations. Perhaps with some CRTP magic it could be possible, but that is a bigger head-scratcher.

*
* @ingroup Executor
*/
#define GKO_REGISTER_HOST_OPERATION(_name, _kernel) \
Member:

With this, will you still need to add the logger to host_executor?

@upsj (author):

No, but if the host operation calls kernels on the host side, they would not be visible in the benchmark runtime breakdown.
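
Concretely, that is why the PR attaches the logger to the device executor and its master alike; a minimal sketch, where my_logger stands in for the benchmark-internal OperationLogger (which is not public API):

```cpp
#include <memory>

#include <ginkgo/ginkgo.hpp>

void attach_to_both(std::shared_ptr<gko::Executor> exec,
                    std::shared_ptr<gko::log::Logger> my_logger)
{
    // Log operations run directly on the device executor ...
    exec->add_logger(my_logger);
    // ... and also host operations (plus any kernels they launch on the
    // host), so both appear in the runtime breakdown.
    exec->get_master()->add_logger(my_logger);
}
```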

@pratikvn (Member) commented Dec 13, 2022:

I will not hold up this PR anymore: it is a fairly small change, and even though it adds the interface for host operations, removing it again would only require updating code in develop. Additionally, as @upsj needs it for benchmarking purposes, from my side we can merge this once approved.

I have created #1240 to document what I propose.

@MarcelKoch MarcelKoch self-requested a review December 16, 2022 13:52
@MarcelKoch (Member) left a comment:

LGTM. The out-parameters are annoying, but that is what we have to use for now. I think we could try to tackle kernels with return values later on. Although, now that I think of it, I don't think returning non-trivial types from kernels is possible: in most cases, a kernel can't create a Ginkgo object, because the constructors are implemented in core. But for host operations, that should still work.

core/factorization/symbolic.cpp (review comment, resolved)
@yhmtsai (Member) commented Dec 16, 2022:

@upsj Before I approve it: "Add tests" is still open in the TODO. Have you already finished it?

@upsj (author) commented Dec 16, 2022:

@yhmtsai now I have, thanks for the reminder!

@yhmtsai (Member) left a comment:

LGTM. Only minor format changes.

core/factorization/symbolic.hpp (two review comments, outdated and resolved)
Co-authored-by: Yuhsiang M. Tsai <yhmtsai@gmail.com>
@upsj added the 1:ST:ready-to-merge label and removed the 1:ST:ready-for-review label on Dec 16, 2022
@ginkgo-bot:

Note: This PR changes the Ginkgo ABI:

Functions changes summary: 26 Removed, 0 Changed, 108 Added functions
Variables changes summary: 0 Removed, 0 Changed, 0 Added variable

For details check the full ABI diff under Artifacts here

@sonarcloud bot commented Dec 17, 2022:

Kudos, SonarCloud Quality Gate passed!

Bugs: 0
Vulnerabilities: 0
Security Hotspots: 0
Code Smells: 10

Coverage: 64.6%
Duplication: 0.0%

@upsj upsj merged commit 60b0514 into develop Dec 18, 2022
@upsj upsj deleted the host_operations branch December 18, 2022 03:39
tcojean added a commit that referenced this pull request Jun 16, 2023
Release 1.6.0 of Ginkgo.

The Ginkgo team is proud to announce the new Ginkgo minor release 1.6.0. This release brings new features such as:
- Several building blocks for GPU-resident sparse direct solvers like symbolic
  and numerical LU and Cholesky factorization, ...,
- A distributed Schwarz preconditioner,
- New FGMRES and GCR solvers,
- Distributed benchmarks for the SpMV operation, solvers, ...
- Support for non-default streams in the CUDA and HIP backends,
- Mixed precision support for the CSR SpMV,
- A new profiling logger which integrates with NVTX, ROCTX, TAU and VTune to
  provide internal Ginkgo knowledge to most HPC profilers!

and much more.

If you face an issue, please first check our [known issues page](https://github.com/ginkgo-project/ginkgo/wiki/Known-Issues) and the [open issues list](https://github.com/ginkgo-project/ginkgo/issues) and if you do not find a solution, feel free to [open a new issue](https://github.com/ginkgo-project/ginkgo/issues/new/choose) or ask a question using the [github discussions](https://github.com/ginkgo-project/ginkgo/discussions).

Supported systems and requirements:
+ For all platforms, CMake 3.13+
+ C++14 compliant compiler
+ Linux and macOS
  + GCC: 5.5+
  + clang: 3.9+
  + Intel compiler: 2018+
  + Apple Clang: 14.0 is tested. Earlier versions might also work.
  + NVHPC: 22.7+
  + Cray Compiler: 14.0.1+
  + CUDA module: CUDA 9.2+ or NVHPC 22.7+
  + HIP module: ROCm 4.5+
  + DPC++ module: Intel OneAPI 2021.3+ with oneMKL and oneDPL. Set the CXX compiler to `dpcpp`.
+ Windows
  + MinGW: GCC 5.5+
  + Microsoft Visual Studio: VS 2019+
  + CUDA module: CUDA 9.2+, Microsoft Visual Studio
  + OpenMP module: MinGW.

### Version Support Changes
+ ROCm 4.0+ -> 4.5+ after [#1303](#1303)
+ Removed Cygwin pipeline and support [#1283](#1283)

### Interface Changes
+ Due to internal changes, `ConcreteExecutor::run` will now always throw if the corresponding module for the `ConcreteExecutor` is not built [#1234](#1234)
+ The constructor of `experimental::distributed::Vector` was changed to only accept local vectors as `std::unique_ptr` [#1284](#1284)
+ The default parameters for the `solver::MultiGrid` were improved. In particular, the smoother defaults to one iteration of `Ir` with `Jacobi` preconditioner, and the coarse grid solver uses the new direct solver with LU factorization. [#1291](#1291) [#1327](#1327)
+ The `iteration_complete` event gained a more expressive overload with additional parameters, the old overloads were deprecated. [#1288](#1288) [#1327](#1327)

### Deprecations
+ Deprecated less expressive `iteration_complete` event. Users are advised to now implement the function `void iteration_complete(const LinOp* solver, const LinOp* b, const LinOp* x, const size_type& it, const LinOp* r, const LinOp* tau, const LinOp* implicit_tau_sq, const array<stopping_status>* status, bool stopped)` [#1288](#1288)

### Added Features
+ A distributed Schwarz preconditioner. [#1248](#1248)
+ A GCR solver [#1239](#1239)
+ Flexible Gmres solver [#1244](#1244)
+ Enable Gmres solver for distributed matrices and vectors [#1201](#1201)
+ An example that uses Kokkos to assemble the system matrix [#1216](#1216)
+ A symbolic LU factorization allowing the `gko::experimental::factorization::Lu` and `gko::experimental::solver::Direct` classes to be used for matrices with non-symmetric sparsity pattern [#1210](#1210)
+ A numerical Cholesky factorization [#1215](#1215)
+ Symbolic factorizations in host-side operations are now wrapped in a host-side `Operation` to make their execution visible to loggers. This means that profiling loggers and benchmarks are no longer missing a separate entry for their runtime [#1232](#1232)
+ Symbolic factorization benchmark [#1302](#1302)
+ The `ProfilerHook` logger allows annotating the Ginkgo execution (apply, operations, ...) for profiling frameworks like NVTX, ROCTX and TAU. [#1055](#1055)
+ `ProfilerHook::created_(nested_)summary` allows the generation of a lightweight runtime profile over all Ginkgo functions written to a user-defined stream [#1270](#1270) for both host and device timing functionality [#1313](#1313)
+ It is now possible to enable host buffers for MPI communications at runtime even if the compile option `GINKGO_FORCE_GPU_AWARE_MPI` is set. [#1228](#1228)
+ A stencil matrices generator (5-pt, 7-pt, 9-pt, and 27-pt) for benchmarks [#1204](#1204)
+ Distributed benchmarks (multi-vector blas, SpMV, solver) [#1204](#1204)
+ Benchmarks for CSR sorting and lookup [#1219](#1219)
+ A timer for MPI benchmarks that reports the longest time [#1217](#1217)
+ A `timer_method=min|max|average|median` flag for benchmark timing summary [#1294](#1294)
+ Support for non-default streams in CUDA and HIP executors [#1236](#1236)
+ METIS integration for nested dissection reordering [#1296](#1296)
+ SuiteSparse AMD integration for fillin-reducing reordering [#1328](#1328)
+ Csr mixed-precision SpMV support [#1319](#1319)
+ A `with_loggers` function for all `Factory` parameters [#1337](#1337)

### Improvements
+ Improve naming of kernel operations for loggers [#1277](#1277)
+ Annotate solver iterations in `ProfilerHook` [#1290](#1290)
+ Allow using the profiler hooks and inline input strings in benchmarks [#1342](#1342)
+ Allow passing smart pointers in place of raw pointers to most matrix functions. This means that things like `vec->compute_norm2(x.get())` or `vec->compute_norm2(lend(x))` can be simplified to `vec->compute_norm2(x)` [#1279](#1279) [#1261](#1261)
+ Catch overflows in prefix sum operations, which makes Ginkgo's operations much less likely to crash. This also improves the performance of the prefix sum kernel [#1303](#1303)
+ Make the installed GinkgoConfig.cmake file relocatable and follow more best practices [#1325](#1325)

### Fixes
+ Fix OpenMPI version check [#1200](#1200)
+ Fix the MPI C++ type binding through the C binding [#1306](#1306)
+ Fix runtime failures for one-sided MPI wrapper functions observed on some OpenMPI versions [#1249](#1249)
+ Disable thread pinning with GPU executors due to poor performance [#1230](#1230)
+ Fix hwloc version detection [#1266](#1266)
+ Fix PAPI detection in non-implicit include directories [#1268](#1268)
+ Fix PAPI support for newer PAPI versions: [#1321](#1321)
+ Fix pkg-config file generation for library paths outside prefix [#1271](#1271)
+ Fix various build failures with ROCm 5.4, CUDA 12, and OneAPI 6 [#1214](#1214), [#1235](#1235), [#1251](#1251)
+ Fix incorrect read for skew-symmetric MatrixMarket files with explicit diagonal entries [#1272](#1272)
+ Fix handling of missing diagonal entries in symbolic factorizations [#1263](#1263)
+ Fix segmentation fault in benchmark matrix construction [#1299](#1299)
+ Fix the stencil matrix creation for benchmarking [#1305](#1305)
+ Fix the additional residual check in IR [#1307](#1307)
+ Fix the cuSPARSE CSR SpMM issue on single strided vector when cuda >= 11.6 [#1322](#1322) [#1331](#1331)
+ Fix Isai generation for large sparsity powers [#1327](#1327)
+ Fix Ginkgo compilation and test with NVHPC >= 22.7 [#1331](#1331)
+ Fix Ginkgo compilation of 32 bit binaries with MSVC [#1349](#1349)
Successfully merging this pull request may close these issues.

  • Add HostOperation to enable profiling sequential core functionality
  • Host operations get lost in benchmarks