Draft: Desul ordered atomic policies + litmus tests #1616

Draft · wants to merge 14 commits into develop

Conversation

publixsubfan

Summary

  • When the Desul atomic backend is enabled, adds atomic policies of the form RAJA::atomic_{mem_policy}_{scope} (see the usage sketch after this list), where:
    • mem_policy is one of relaxed, acquire, release, acq_rel, or seq_cst
    • scope is either empty (device scope), system for a system-wide atomic, or block for a block-wide atomic
  • Adds 2-thread litmus tests that check whether acquire-release and sequentially consistent (SC) atomics restore sequentially consistent behavior on platforms with relaxed memory models
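
A hedged usage sketch: the policy spellings below follow the RAJA::atomic_{mem_policy}_{scope} pattern from the summary, and the template-parameter calling convention matches RAJA's existing atomic free functions; the exact names in this PR may differ.

#include "RAJA/RAJA.hpp"

// Illustrative only: publish a payload with a release store, then do
// increments at system and block scope. Policy names are assumptions.
RAJA_HOST_DEVICE void usage_sketch(int* data, int* flag, int* block_count)
{
  RAJA::atomicStore<RAJA::atomic_relaxed>(data, 42);   // payload
  RAJA::atomicStore<RAJA::atomic_release>(flag, 1);    // device-scope publish

  // System-wide SC increment, visible to the host and peer devices:
  RAJA::atomicAdd<RAJA::atomic_seq_cst_system>(data, 1);

  // Block-wide relaxed increment; no ordering guarantees across blocks:
  RAJA::atomicAdd<RAJA::atomic_relaxed_block>(block_count, 1);
}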

Motivation

On architectures that adopt relaxed memory models (ARM, PowerPC, most GPU architectures), the order in which one thread's memory modifications become visible to another thread may differ from the "program order" of those operations. This can produce unexpected results if, for example, an atomic variable is used as a mutex: writes performed in the critical section may not yet be visible to another thread because the memory subsystem reordered them.
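
To make the failure concrete, here is a minimal sketch in standard C++ (not RAJA; names are illustrative). With relaxed ordering the consumer can observe the flag before the payload; pairing a release store with an acquire load forbids that:

#include <atomic>

std::atomic<int> flag{0};
int payload = 0;

void producer()
{
  payload = 42;                                  // plain store
  flag.store(1, std::memory_order_relaxed);      // may become visible first!
  // flag.store(1, std::memory_order_release);   // would forbid the reordering
}

void consumer()
{
  if (flag.load(std::memory_order_relaxed) == 1)  // pair with acquire to fix
  {
    int r = payload;  // on ARM/POWER/GPUs this may still read 0
    (void)r;
  }
}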

x86 implements a much stronger, though not fully sequentially consistent, memory model (x86-TSO). The only reordering observable between threads is Store->Load reordering, where a store that is earlier in program order can be observed after a later load. The "store buffer" litmus test demonstrates this behavior: without fencing, it can appear as if the store instruction in each thread happened after its corresponding load instruction.
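
The store-buffer shape, sketched in standard C++ for illustration; with both bodies running concurrently, the weak outcome r0 == 0 && r1 == 0 is observable even on x86-TSO unless all four operations are sequentially consistent:

#include <atomic>

std::atomic<int> x{0}, y{0};
int r0 = -1, r1 = -1;

void thread0()
{
  x.store(1, std::memory_order_relaxed);
  r0 = y.load(std::memory_order_relaxed);  // can complete before the store drains
}

void thread1()
{
  y.store(1, std::memory_order_relaxed);
  r1 = x.load(std::memory_order_relaxed);
}
// Using std::memory_order_seq_cst for all four operations rules out
// the r0 == 0 && r1 == 0 outcome.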

Desul supports specifying a memory-order policy; choosing a sufficiently strong ordering pair across the two threads restores a consistent view of their memory operations.
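
A minimal sketch of what a policy-to-desul mapping could look like, assuming desul's public order/scope tags; the trait shape mirrors the detail::DesulAtomicPolicy mentioned in the review below, but the code here is an assumption, not the PR's implementation:

#include <desul/atomics.hpp>

// Hypothetical policy type: carries a desul memory order and memory scope.
struct atomic_acq_rel {
  using memory_order = desul::MemoryOrderAcqRel;
  using memory_scope = desul::MemoryScopeDevice;  // the empty-scope default
};

template <typename Policy, typename T>
T ordered_fetch_add(T* acc, T value)
{
  return desul::atomic_fetch_add(acc, value,
                                 typename Policy::memory_order{},
                                 typename Policy::memory_scope{});
}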

Litmus testing

The added GPU litmus tests are based on the WebGPU memory-testing work at https://gpuharbor.ucsc.edu/webgpu-mem-testing/ and on the paper "Foundations of Empirical Memory Consistency Testing", Kirkham et al. (OOPSLA 2020).

Litmus testing lets us probe for relaxed memory behavior on GPU platforms. We implement a family of 2-thread tests in which each thread writes data to, or reads data from, memory shared with a thread in a different block.
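
A hedged sketch of the message-passing variant of these tests, using the policy spellings from the summary (assumed); the two functions run on threads in different blocks, and a checker counts how often each (r0, r1) outcome appears:

// Thread 0 (block A): write the payload, then publish the flag.
RAJA_HOST_DEVICE void mp_writer(int* data, int* flag)
{
  RAJA::atomicStore<RAJA::atomic_relaxed>(data, 1);
  RAJA::atomicStore<RAJA::atomic_release>(flag, 1);  // relaxed here admits weak results
}

// Thread 1 (block B): read the flag, then the payload.
RAJA_HOST_DEVICE void mp_reader(int* data, int* flag, int* r0, int* r1)
{
  *r0 = RAJA::atomicLoad<RAJA::atomic_acquire>(flag);
  *r1 = RAJA::atomicLoad<RAJA::atomic_relaxed>(data);
}
// Weak outcome: *r0 == 1 && *r1 == 0. Expected to appear with relaxed
// ordering on the flag, and to vanish with the release/acquire pair shown.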

More references

"A Tutorial Introduction to the ARM and POWER Relaxed Memory Models"

Fiddling around with some parameters for the litmus test driver:
- It seems that having only a subset of the running blocks participate
  in the Message Passing litmus test increases the rate at which weak
  memory behaviors are observed.
- Pre-stressing memory doesn't seem to help on NVIDIA V100s.

Store buffering is an observable behavior where a store may be reordered
after a later load. This test exercises MemoryOrderSeqCst.
- Use a forall device kernel to check results
- Interleave order of operations between testing threads
- Only warn on a lack of observed relaxed behaviors

Correctly use the stress-testing formulation from the paper "Foundations
of Empirical Memory Consistency Testing" (OOPSLA 2020). Instead of having
all stressing blocks scatter their accesses across the "stressing" array,
select a small-ish subset of 64-word lines and stripe them across the
stressing blocks (sketched after these notes). This increases contention
pressure on the GPU memory system.

Synchronize testing blocks and stressing blocks together on each
iteration.
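
A sketch of the striping described in the stress-formulation note above; the constants and names here are illustrative, not the PR's:

// Each stressing block repeatedly hits one 64-word line chosen from a small
// set of lines striped across blocks, rather than scattering accesses.
constexpr int kWordsPerLine = 64;
constexpr int kNumStressLines = 16;  // the "small-ish subset" (assumed size)

RAJA_HOST_DEVICE void stress_iteration(int* stress_array, int block_id,
                                       int thread_id)
{
  int line = block_id % kNumStressLines;  // stripe lines over stressing blocks
  int idx = line * kWordsPerLine + (thread_id % kWordsPerLine);
  stress_array[idx] += 1;  // repeated plain writes concentrate contention
}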

trws commented Mar 20, 2024

One comment based on yours, @publixsubfan: all previously existing RAJA atomics were relaxed. It's the stronger orderings, and really the scopes, that are most interesting, because they mean we can pass data without having to do all loads and stores with atomics. Block-scope atomics, with device-scope fences only when necessary (sketched below), are also likely to help us greatly on El Cap. The atomics themselves won't be faster, but there will be substantially less expensive cache invalidation.

The code looks good to me with a cursory look over it. I'm not sure what we want to do with respect to these interfaces longer term, but this looks good to me as a place to explore.
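
A sketch of the block-scope-plus-fence pattern in desul terms, assuming desul's MemoryScopeCore corresponds to block/workgroup scope; illustrative, not from this PR:

#include <desul/atomics.hpp>

// Accumulate with cheap block-scope atomics; publish once with a single
// device-scope release fence instead of paying device scope on every update.
void accumulate_then_publish(int* block_sum, int contribution,
                             bool last_in_block)
{
  desul::atomic_fetch_add(block_sum, contribution,
                          desul::MemoryOrderRelaxed(),
                          desul::MemoryScopeCore());        // block-wide

  if (last_in_block) {
    desul::atomic_thread_fence(desul::MemoryOrderRelease(),
                               desul::MemoryScopeDevice()); // device-wide publish
  }
}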

@MrBurmark

Can you add something about how you get back non-atomic memory operations using this interface?

RAJA_HOST_DEVICE RAJA_INLINE T atomicAdd(AtomicPolicy, T volatile *acc, T value)
{
  using desul_order =
      typename detail::DesulAtomicPolicy<AtomicPolicy>::memory_order;

Does this mean that AtomicPolicy has to be detail_atomic_t<...> instead of DesulAtomicPolicy<...>?


trws commented Mar 22, 2024

> Can you add something about how you get back non-atomic memory operations using this interface?

When designing desul, we realized that the sequential option behaved more like a scope than a memory order. The scope is MemoryScopeCaller in the desul interface here. The main reason for this is that we only need one implementation for essentially all backends to support it, and it makes no sense to give a different scope when there's no coherence. Of course, there's no point in giving a memory order either, but if the scope is the caller they at least all make sense.
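
In code, getting the non-atomic behavior back looks roughly like this (a sketch against desul's public interface; the wrapper name is made up):

#include <desul/atomics.hpp>

// MemoryScopeCaller asserts that no other caller races on this location, so
// desul can lower the operation to a plain read-modify-write on any backend.
void non_atomic_add(double* acc, double v)
{
  desul::atomic_fetch_add(acc, v,
                          desul::MemoryOrderRelaxed(),
                          desul::MemoryScopeCaller());
}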
