
Half factorization #1712

Open · wants to merge 9 commits into base: half_solver
Conversation

@yhmtsai (Member) commented Oct 25, 2024

This PR adds half-precision support to the factorizations.

HIP currently does not support atomic operations on 16-bit types.

TODO:

  • add the fix for the triangular solve with half

@yhmtsai yhmtsai added the 1:ST:WIP This PR is a work in progress. Not ready for review. label Oct 25, 2024
@yhmtsai yhmtsai self-assigned this Oct 25, 2024
@ginkgo-bot ginkgo-bot added reg:testing This is related to testing. type:solver This is related to the solvers type:factorization This is related to the Factorizations reg:helper-scripts This issue/PR is related to the helper scripts mainly concerned with development of Ginkgo. mod:all This touches all Ginkgo modules. labels Oct 25, 2024
@yhmtsai yhmtsai mentioned this pull request Oct 30, 2024
12 tasks
@yhmtsai yhmtsai added this to the Ginkgo 1.9.0 milestone Oct 30, 2024
@yhmtsai yhmtsai added 1:ST:ready-for-review This PR is ready for review and removed 1:ST:WIP This PR is a work in progress. Not ready for review. labels Nov 5, 2024
@yhmtsai yhmtsai force-pushed the half_solver branch 2 times, most recently from 50ae4c1 to bba40e0 on November 7, 2024 14:40
@MarcelKoch MarcelKoch self-requested a review November 11, 2024 11:25
@MarcelKoch (Member) left a comment

Generally LGTM. I have a question regarding atomics and HIP. The latest ROCm documentation shows support for fp16 atomic operations: https://rocm.docs.amd.com/en/latest/reference/precision-support.html#atomic-operations-support, but TBH I can't figure out which operations exactly they mean by that. Did you try anything in that regard?

PairTypenameNameGenerator);


TYPED_TEST(ParIlut, KernelThresholdSelectIsEquivalentToRef)
{
using value_type = typename TestFixture::value_type;
Member

Many of the tests here are missing SKIP_HALF if compiling for HIP.

Member Author

We do not support compute_l_u_factors in HIP, but the others still work with half precision in HIP.

Member Author

I see what you mean now.
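
Something like the following guard is presumably what is meant here; a hedged sketch reusing the quoted test, where the SKIP_HALF macro name is taken from the review comment above (the exact spelling in the codebase may differ, and GKO_COMPILING_HIP as the HIP guard is an assumption):

TYPED_TEST(ParIlut, KernelThresholdSelectIsEquivalentToRef)
{
    using value_type = typename TestFixture::value_type;
#ifdef GKO_COMPILING_HIP
    // HIP lacks 16-bit atomics, so skip half-precision instantiations of
    // tests that exercise them
    SKIP_HALF(value_type);
#endif
    // ... remainder of the test body ...
}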

@@ -212,13 +212,15 @@ struct CudaSolveStruct : gko::solver::SolveStruct {

size_type work_size{};

// TODO: In nullptr is considered nullptr_t not casted to const
// it does not work in cuda110/100 images
Member

nit:

Suggested change:
- // it does not work in cuda110/100 images
+ // Explicitly cast `nullptr` to `const ValueType*` to prevent compiler issues with cuda 10/11

Member Author

I think it is more on the host compiler side, because it goes through our binding first with a specific type.
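
A minimal, self-contained illustration of the failure mode (binding and c_api_call are hypothetical names, not Ginkgo's actual code):

#include <utility>

extern "C" void c_api_call(const double* values, int n);

template <typename... Args>
void binding(Args&&... args)
{
    // a bare nullptr argument deduces Args = std::nullptr_t; the host
    // compilers in the cuda110/100 images then fail to convert it to
    // const double* when forwarding to the C API
    c_api_call(std::forward<Args>(args)...);
}

void run(int n)
{
    // workaround: spell the pointer type out at the call site
    binding(static_cast<const double*>(nullptr), n);
}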

Resolved review threads: cuda/solver/common_trs_kernels.cuh (5 outdated threads), hip/components/memory.hip.hpp (outdated), reference/factorization/par_ilut_kernels.cpp (outdated), test/factorization/lu_kernels.cpp
@@ -212,12 +212,16 @@ struct CudaSolveStruct : gko::solver::SolveStruct {

size_type work_size{};

// nullptr is considered nullptr_t not casted to the function signature
// automatically Explicitly cast `nullptr` to `const ValueType*` to
Member

nit:

Suggested change:
- // automatically Explicitly cast `nullptr` to `const ValueType*` to
+ // automatically explicitly cast `nullptr` to `const ValueType*` to

Comment on lines +401 to +402
template <bool is_upper, typename SharedValueType, typename ValueType,
typename IndexType>
Member

Could SharedValueType be deduced inside, instead of making it an additional template parameter? You should be able to pull the code from the kernel launch into here and add a type alias. Otherwise it is easier to accidentally call the kernel with inconsistent types.

Member Author

Good idea.
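
A sketch of that refactoring (the function name and the promotion rule are illustrative assumptions, not the PR's actual code):

#include <type_traits>

template <bool is_upper, typename ValueType, typename IndexType>
void launch_sptrsv(const ValueType* vals, const IndexType* row_ptrs)
{
    // deduced once here via a type alias instead of being a separate
    // template parameter, so callers cannot pass an inconsistent type;
    // the rule (promote sub-4-byte value types to float) is illustrative
    using shared_value_type =
        std::conditional_t<(sizeof(ValueType) < 4), float, ValueType>;
    // ... allocate shared memory as shared_value_type and launch ...
}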

// optimization wrongly on a custom class when IndexType is long. We set
// the index explicitly with volatile to solve it. NVHPC24.1 fixed this
// issue. https://godbolt.org/z/srYhGndKn
volatile auto index = (i + 1) * sampleselect_oversampling;
Member

I'm not sure we should go this far to accommodate broken compilers. We have workarounds for compilation issues, but not really for this degree of broken-ness.

@upsj (Member) commented Nov 21, 2024

For HIP 16-bit atomics, as long as you only use load and store, you could implement them as

  • a 32-bit load_* plus a memcpy, and
  • a 32-bit load_* plus an atomic CAS, similar to how we implemented atomicAdd in the past. For safety, we need to execute this twice, assuming that every memory location only ever gets written once (which is true for all algorithms that use atomics): your write can fail either because the upper half changed or because the lower half changed, and one of those halves belongs to you and cannot change without your knowledge. A sketch follows this list.
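
A hedged sketch of the CAS variant, written as a retry loop (illustrative names; under the write-once assumption above it runs at most two iterations, and it assumes the 16-bit value sits inside a 4-byte-aligned 32-bit word, which is what the alignment discussion below is about):

#include <cstdint>

__device__ void store_atomic_16bit(std::uint16_t* addr, std::uint16_t value)
{
    // pointer to the 4-byte-aligned 32-bit word containing *addr
    auto word_ptr = reinterpret_cast<unsigned int*>(
        reinterpret_cast<std::uintptr_t>(addr) & ~std::uintptr_t{3});
    // bit offset of our 16-bit half within that word
    const unsigned shift =
        (reinterpret_cast<std::uintptr_t>(addr) & 2) ? 16u : 0u;
    const unsigned mask = 0xFFFFu << shift;

    unsigned old_word = *word_ptr;
    unsigned assumed;
    do {
        assumed = old_word;
        // keep the neighboring 16 bits, replace our half
        const unsigned new_word =
            (assumed & ~mask) | (unsigned{value} << shift);
        old_word = atomicCAS(word_ptr, assumed, new_word);
        // retry only if the neighboring half changed underneath us; our
        // half is written at most once, so this terminates quickly
    } while (old_word != assumed);
}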

@yhmtsai (Member, Author) commented Nov 22, 2024

Using 32-bit memory operations for 16-bit data will cause illegal memory accesses at the head or tail of an allocation if we do not handle it at an upper level.

@upsj (Member) commented Nov 22, 2024

Theoretically that would be an easy fix: make sure all allocations are at least 32 bits and rounded up to multiples of 4 bytes. But I believe most allocators already silently fulfill that assumption, and GPUs are unlikely to have 16-bit allocation boundaries for alignment purposes.

@yhmtsai (Member, Author) commented Nov 22, 2024

I do not like relying on such a loose guarantee unless we have a way to ensure it, or at least check it.
However, I would suggest we leave this out of this PR and the release, so that we have enough time to ensure it works correctly on HIP.

@upsj (Member) commented Nov 22, 2024

I can give you a somewhat technical justification for this: cudaMalloc returns correctly aligned memory for thrust::complex<double>, despite not knowing anything about the type. That means the allocator does not use any space between those 16-byte-aligned allocations. Whether this is special-cased for allocations divisible by 16 I am not sure (I would assume not, since people also allocate memory pools themselves), but again, we have an easy fix, which I would honestly consider useful in any case: round up the sizes raw_alloc uses to at least be divisible by 4.
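
A one-liner sketch of that rounding (the helper name is ours; raw_alloc is the executor allocation hook mentioned above):

#include <cstddef>

// round an allocation size up to the next multiple of 4 bytes, so that
// 32-bit emulation of 16-bit loads/stores never touches memory past the
// end of an allocation
std::size_t round_up_to_word(std::size_t num_bytes)
{
    constexpr std::size_t granularity = 4;
    return (num_bytes + granularity - 1) / granularity * granularity;
}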

@yhmtsai (Member, Author) commented Nov 22, 2024

I know the idea; sometimes it is necessary for optimizing half precision to pack values (so we effectively get native 32-bit operations by enforcing a packed-structure requirement).
I will still say it is not easy to be confident it will be correct within this short period.
For example, a user allocates some memory with a 16-bit type but passes an odd element count to array_view.
Should we accept that or throw an error? Of course, these memory operations will not change any values outside the actual array, but it is still an illegal memory access.
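
To make the scenario concrete, a hedged sketch against Ginkgo's public API (gko::array and gko::make_array_view are real Ginkgo API, gko::half is the type introduced by this PR series; the odd-sized setup is ours):

#include <ginkgo/ginkgo.hpp>

int main()
{
    auto exec = gko::ReferenceExecutor::create();
    // 7 half values = 14 bytes: a 32-bit access covering the last element
    // would touch 2 bytes past the allocation unless sizes are padded
    gko::array<gko::half> data(exec, 7);
    auto view = gko::make_array_view(exec, 7, data.get_data());
    (void)view;  // silence unused-variable warnings in this illustration
}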
