
Half factorization #1712

Open · wants to merge 9 commits into base: half_solver
Conversation

@yhmtsai (Member) commented Oct 25, 2024

This PR adds half-precision support to the factorizations.

HIP currently does not support atomic operations on 16-bit types.

TODO:

  • add the fix for the triangular solve with half

@yhmtsai yhmtsai added the 1:ST:WIP This PR is a work in progress. Not ready for review. label Oct 25, 2024
@yhmtsai yhmtsai self-assigned this Oct 25, 2024
@ginkgo-bot ginkgo-bot added reg:testing This is related to testing. type:solver This is related to the solvers type:factorization This is related to the Factorizations reg:helper-scripts This issue/PR is related to the helper scripts mainly concerned with development of Ginkgo. mod:all This touches all Ginkgo modules. labels Oct 25, 2024
@yhmtsai yhmtsai mentioned this pull request Oct 30, 2024
12 tasks
@yhmtsai yhmtsai added this to the Ginkgo 1.9.0 milestone Oct 30, 2024
@yhmtsai yhmtsai added 1:ST:ready-for-review This PR is ready for review and removed 1:ST:WIP This PR is a work in progress. Not ready for review. labels Nov 5, 2024
@yhmtsai yhmtsai force-pushed the half_solver branch 2 times, most recently from 50ae4c1 to bba40e0 on November 7, 2024 14:40
@MarcelKoch MarcelKoch self-requested a review November 11, 2024 11:25
@MarcelKoch (Member) left a comment

Generally LGTM. I have a question regarding atomics and HIP. The latest ROCm documentation shows support for fp16 atomic operations: https://rocm.docs.amd.com/en/latest/reference/precision-support.html#atomic-operations-support, but TBH I can't figure out which operations exactly they mean by that. Did you try anything in that regard?

PairTypenameNameGenerator);


TYPED_TEST(ParIlut, KernelThresholdSelectIsEquivalentToRef)
{
using value_type = typename TestFixture::value_type;
Member

Many of the tests here are missing SKIP_HALF if compiling for HIP.

Member Author

We do not support compute_l_u_factors in HIP, but the others still work with half precision in HIP.

Member Author

I see what you mean now.
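
Something like the following guard is presumably what is meant here; a hedged sketch reusing the quoted test, where the SKIP_HALF macro name is taken from the review comment above (the exact spelling in the codebase may differ, and GKO_COMPILING_HIP as the HIP guard is an assumption):

TYPED_TEST(ParIlut, KernelThresholdSelectIsEquivalentToRef)
{
    using value_type = typename TestFixture::value_type;
#ifdef GKO_COMPILING_HIP
    // HIP lacks 16-bit atomics, so skip half-precision instantiations of
    // tests that exercise them
    SKIP_HALF(value_type);
#endif
    // ... remainder of the test body ...
}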

@@ -212,13 +212,15 @@ struct CudaSolveStruct : gko::solver::SolveStruct {

size_type work_size{};

// TODO: In nullptr is considered nullptr_t not casted to const
// it does not work in cuda110/100 images
Member

nit:

Suggested change:
- // it does not work in cuda110/100 images
+ // Explicitly cast `nullptr` to `const ValueType*` to prevent compiler issues with cuda 10/11

Member Author

I think it is more on the host compiler side, because it goes through our binding first with a specific type.
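
A minimal, self-contained illustration of the failure mode (binding and c_api_call are hypothetical names, not Ginkgo's actual code):

#include <utility>

extern "C" void c_api_call(const double* values, int n);

template <typename... Args>
void binding(Args&&... args)
{
    // a bare nullptr argument deduces Args = std::nullptr_t; the host
    // compilers in the cuda110/100 images then fail to convert it to
    // const double* when forwarding to the C API
    c_api_call(std::forward<Args>(args)...);
}

void run(int n)
{
    // workaround: spell the pointer type out at the call site
    binding(static_cast<const double*>(nullptr), n);
}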

Resolved review threads: cuda/solver/common_trs_kernels.cuh (5 outdated threads), hip/components/memory.hip.hpp (outdated), reference/factorization/par_ilut_kernels.cpp (outdated), test/factorization/lu_kernels.cpp
@@ -212,12 +212,16 @@ struct CudaSolveStruct : gko::solver::SolveStruct {

size_type work_size{};

// nullptr is considered nullptr_t not casted to the function signature
// automatically Explicitly cast `nullptr` to `const ValueType*` to
Member

nit:

Suggested change:
- // automatically Explicitly cast `nullptr` to `const ValueType*` to
+ // automatically explicitly cast `nullptr` to `const ValueType*` to

Comment on lines +401 to +402
template <bool is_upper, typename SharedValueType, typename ValueType,
typename IndexType>
Member

Could SharedValueType be deduced inside, instead of making it an additional template parameter? You should be able to pull the code from the kernel launch into here and add a type alias. Otherwise it is easier to accidentally call the kernel with inconsistent types.

Member Author

Good idea.
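
A sketch of that refactoring (the function name and the promotion rule are illustrative assumptions, not the PR's actual code):

#include <type_traits>

template <bool is_upper, typename ValueType, typename IndexType>
void launch_sptrsv(const ValueType* vals, const IndexType* row_ptrs)
{
    // deduced once here via a type alias instead of being a separate
    // template parameter, so callers cannot pass an inconsistent type;
    // the rule (promote sub-4-byte value types to float) is illustrative
    using shared_value_type =
        std::conditional_t<(sizeof(ValueType) < 4), float, ValueType>;
    // ... allocate shared memory as shared_value_type and launch ...
}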

// optimization wrongly on a custom class when IndexType is long. We set
// the index explicitly with volatile to solve it. NVHPC24.1 fixed this
// issue. https://godbolt.org/z/srYhGndKn
volatile auto index = (i + 1) * sampleselect_oversampling;
Member

I'm not sure we should go this far to accommodate broken compilers. We have workarounds for compilation issues, but not really for this degree of broken-ness.

@upsj (Member) commented Nov 21, 2024

For HIP 16-bit atomics, as long as you only use load and store, you could implement them as

  • a 32-bit load_* plus a memcpy, and
  • a 32-bit load_* plus an atomic CAS, similar to how we implemented atomicAdd in the past. For safety, we need to execute this twice, assuming that every memory location only ever gets written once (which is true for all algorithms that use atomics): your write can fail either because the upper half changed or because the lower half changed, and one of those halves belongs to you and cannot change without your knowledge. A sketch follows this list.
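
A hedged sketch of the CAS variant, written as a retry loop (illustrative names; under the write-once assumption above it runs at most two iterations, and it assumes the 16-bit value sits inside a 4-byte-aligned 32-bit word, which is what the alignment discussion below is about):

#include <cstdint>

__device__ void store_atomic_16bit(std::uint16_t* addr, std::uint16_t value)
{
    // pointer to the 4-byte-aligned 32-bit word containing *addr
    auto word_ptr = reinterpret_cast<unsigned int*>(
        reinterpret_cast<std::uintptr_t>(addr) & ~std::uintptr_t{3});
    // bit offset of our 16-bit half within that word
    const unsigned shift =
        (reinterpret_cast<std::uintptr_t>(addr) & 2) ? 16u : 0u;
    const unsigned mask = 0xFFFFu << shift;

    unsigned old_word = *word_ptr;
    unsigned assumed;
    do {
        assumed = old_word;
        // keep the neighboring 16 bits, replace our half
        const unsigned new_word =
            (assumed & ~mask) | (unsigned{value} << shift);
        old_word = atomicCAS(word_ptr, assumed, new_word);
        // retry only if the neighboring half changed underneath us; our
        // half is written at most once, so this terminates quickly
    } while (old_word != assumed);
}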

@yhmtsai (Member, Author) commented Nov 22, 2024

Using 32-bit memory operations for 16-bit data will cause illegal memory accesses at the head or tail of an allocation if we do not handle it at an upper level.

@upsj (Member) commented Nov 22, 2024

Theoretically that would be an easy fix: make sure all allocations are at least 32 bits and rounded up to multiples of 4 bytes. But I believe most allocators already silently fulfill that assumption, and GPUs are unlikely to have 16-bit allocation boundaries for alignment purposes.

@yhmtsai (Member, Author) commented Nov 22, 2024

I do not like relying on such a loose guarantee unless we have a way to ensure it, or at least check it.
However, I would suggest we leave this out of this PR and the release, so that we have enough time to ensure it works correctly on HIP.

@upsj (Member) commented Nov 22, 2024

I can give you a somewhat technical justification for this: cudaMalloc returns correctly aligned memory for thrust::complex<double>, despite not knowing anything about the type. That means the allocator does not use any space between those 16-byte-aligned allocations. Whether this is special-cased for allocations divisible by 16 I am not sure (I would assume not, since people also allocate memory pools themselves), but again, we have an easy fix, which I would honestly consider useful in any case: round up the sizes raw_alloc uses to at least be divisible by 4.
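
A one-liner sketch of that rounding (the helper name is ours; raw_alloc is the executor allocation hook mentioned above):

#include <cstddef>

// round an allocation size up to the next multiple of 4 bytes, so that
// 32-bit emulation of 16-bit loads/stores never touches memory past the
// end of an allocation
std::size_t round_up_to_word(std::size_t num_bytes)
{
    constexpr std::size_t granularity = 4;
    return (num_bytes + granularity - 1) / granularity * granularity;
}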

@yhmtsai (Member, Author) commented Nov 22, 2024

I know the idea; sometimes it is necessary for optimizing half precision to pack values (so we effectively get native 32-bit operations by enforcing a packed-structure requirement).
I will still say it is not easy to be confident it will be correct within this short period.
For example, a user allocates some memory with a 16-bit type but passes an odd element count to array_view.
Should we accept that or throw an error? Of course, these memory operations will not change any values outside the actual array, but it is still an illegal memory access.
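
To make the scenario concrete, a hedged sketch against Ginkgo's public API (gko::array and gko::make_array_view are real Ginkgo API, gko::half is the type introduced by this PR series; the odd-sized setup is ours):

#include <ginkgo/ginkgo.hpp>

int main()
{
    auto exec = gko::ReferenceExecutor::create();
    // 7 half values = 14 bytes: a 32-bit access covering the last element
    // would touch 2 bytes past the allocation unless sizes are padded
    gko::array<gko::half> data(exec, 7);
    auto view = gko::make_array_view(exec, 7, data.get_data());
    (void)view;  // silence unused-variable warnings in this illustration
}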
