You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
When a UniformMemoryPool is constructed, the entire arena (all chunks) are (optionally) initialized with a value "initialized_value". But this value is not saved in the class, and if a chunk is used and then freed, it is never re-initialized. Effectively, there is no guarantee and no O(1) way to check if a chunk returned by allocate_chunk() is actually initialized or not, which seems to defeat the purpose of an initialized memory pool and seems very likely to cause nondeterministic bugs due to sometimes-initialized memory.
For an example of where this is used, the SPGEMM non-compressed symbolic case (StructureC_NC GPUTag functor, line 609 of src/sparse/impl/KokkosSparse_spgemm_impl_symbolic.hpp) is aware of this problem - it re-initializes hash_begins (the only part of the chunk that needs to be initialized) before freeing it:
if (is_global_alloced){
num_elements += used_hash_sizes[1];
//now thread leaves the memory as it finds. so there is no need to initialize the hash begins
nnz_lno_t dirty_hashes = globally_used_hash_count[0];
Kokkos::parallel_for(
Kokkos::ThreadVectorRange(teamMember, dirty_hashes),
[&] (nnz_lno_t i) {
nnz_lno_t dirty_hash = globally_used_hash_indices[i];
hm2.hash_begins[dirty_hash] = -1;
});
Kokkos::single(Kokkos::PerThread(teamMember),[&] () {
m_space.release_chunk(globally_used_hash_indices);
});
But this whole memory pool is eagerly initialized to -1 when it is constructed, even though only part of each chunk needs to be initialized.
It would be slow to lazily initialize chunks when allocated, or eagerly when freed because allocate_chunk is normally called inside a Kokkos::single(). So the initialization couldn't be done in parallel because not all threads/vector lanes are running it. allocate_chunk/free_chunk would need to be aware of the team_member and who is sharing the chunk (a team or a thread).
The best way to solve this (which would also slightly improve SPGEMM performance) would be to never initialize the chunks, and let the caller do that however it needs. If we go this route, we should just use the Kokkos memory pool instead.
The text was updated successfully, but these errors were encountered:
First the pattern we want:
Initialize memory pool
Someone gets a chunk and uses it
They release it with values reset to original
This is a common pattern in several algorithms. If someone is releasing memory with arbitrary values, then it could lead it problems. I agree users don't know what it is initialized to as it is not in the class. So far we have been the only user, so we know what it is.
Second, Kokkos memory pool was device level. At least it was.
If you think unitialized is the best we can consider that.
When a UniformMemoryPool is constructed, the entire arena (all chunks) are (optionally) initialized with a value "initialized_value". But this value is not saved in the class, and if a chunk is used and then freed, it is never re-initialized. Effectively, there is no guarantee and no O(1) way to check if a chunk returned by allocate_chunk() is actually initialized or not, which seems to defeat the purpose of an initialized memory pool and seems very likely to cause nondeterministic bugs due to sometimes-initialized memory.
For an example of where this is used, the SPGEMM non-compressed symbolic case (StructureC_NC GPUTag functor, line 609 of src/sparse/impl/KokkosSparse_spgemm_impl_symbolic.hpp) is aware of this problem - it re-initializes hash_begins (the only part of the chunk that needs to be initialized) before freeing it:
But this whole memory pool is eagerly initialized to -1 when it is constructed, even though only part of each chunk needs to be initialized.
It would be slow to lazily initialize chunks when allocated, or eagerly when freed because allocate_chunk is normally called inside a Kokkos::single(). So the initialization couldn't be done in parallel because not all threads/vector lanes are running it. allocate_chunk/free_chunk would need to be aware of the team_member and who is sharing the chunk (a team or a thread).
The best way to solve this (which would also slightly improve SPGEMM performance) would be to never initialize the chunks, and let the caller do that however it needs. If we go this route, we should just use the Kokkos memory pool instead.
The text was updated successfully, but these errors were encountered: