This repository was archived by the owner on Mar 21, 2024. It is now read-only.

Refactor <atomic> and move implementation to libcxx #179

Merged — 34 commits merged into main on Aug 4, 2021

Conversation

wmaxey
Member

@wmaxey wmaxey commented Jul 7, 2021

Overview

This is a refactor of <cuda/std/atomic> and the underlying <atomic> headers. It moves the implementation of CUDA-specific atomics into the inner implementation layer. It also allows us to implement the underlying mechanics for atomic_ref in the near future, and allows NVC++ to access atomic PTX intrinsics when support is finally enabled.

Motivation

There are several goals here:

  • Eases the review burden on the nvcxx feature branch.
  • An opportunity to fix some alignment bugs with atomic<struct> types.
  • Allows us to prepare and test a backend that supports atomic_ref, a feature we will support very soon.

Requirements

  • Begin using <nv/target> in the atomic backend and within relevant tests.
  • Move atomic intrinsics out of <cuda/std/...> and into <libcxx/support/atomic>.
  • Refactor the atomic_impl internals to support more generic host and device layering.

Design

  • Move implementation specific atomic backends into individual headers.
  • When the CUDA backend is enabled #include a relevant host atomic backend into a host:: namespace.
  • For atomic_ref implement a corresponding atomic implementation class that is compatible with __cxx_atomic_unwrap.
  • Remove C11 support or continue to support it in host only code.

Test Plan

  • All currently existing tests should continue to function as expected.
  • Add a test for ensuring the correctness of differently aligned types.

Performance Tests

  • Compare generated code against previously generated code. The results should show no change.

Documentation

  • No documentation changes needed as of yet.

Complex Internal Systems

  • The internal namespace layering that will be used to break apart host and device atomics.
    • Controlled by macro switches for a given compiler.
    • Splits host and device atomics into separate namespaces so that the atomic implementation layer may remain unaware of which compiler/mode has been targeted.

Collaborator

@griwes griwes left a comment


Partial review, only tests so far.

@wmaxey wmaxey force-pushed the feature/atomic_refactor branch from 4797649 to 51b2eb8 Compare July 8, 2021 22:52
@wmaxey wmaxey added this to the 1.6.0 milestone Jul 9, 2021
@wmaxey
Member Author

wmaxey commented Jul 14, 2021

Partial review, only tests so far.

I've removed test changes in the interest of time.

Unfortunately there is a bug blocking progress related to <nv/target>: https://nvbugs/3341536

@wmaxey wmaxey force-pushed the feature/atomic_refactor branch 2 times, most recently from 2dab8e2 to 4c3ab20 Compare July 15, 2021 00:19
using std::detail::thread_scope_block;
using std::detail::thread_scope_thread;

namespace detail {
Collaborator


It's embarrassing to admit it, but I never realized we had a namespace called detail. We should rename it to __detail throughout.

Member Author


Fixed in a rename commit.

@@ -293,7 +293,7 @@ inline void __strided_memcpy(char * __destination, char const * __source, std::s
}
}

#if __CUDA_ARCH__ >= 800
#if __CUDA_MINIMUM_ARCH__ >= 800
Collaborator


The code down below still has __CUDA_ARCH__ in the codepath selection logic.

Member Author


Fixed remaining CUDA_ARCH dispatches.

Collaborator


I'm not seeing any further changes in this file. Line 389, for one, isn't using nvtarget.

Member Author

@wmaxey wmaxey Aug 4, 2021


Hmm... That's because barrier fixes are part of another patch set. I won't be pulling those ones into this refactor. Instead I can reset this file to main to avoid complicating things further.

barrier will need the same injection-into-libcxx treatment.

// END TODO

// Wrap host atomic implementations into a sub-namespace
namespace host {
Collaborator


__host

Member Author


Will uglify!

#include "atomic_cuda_derived.h"

template <typename _Tp>
struct __skip_amt { enum {value = 1}; };
Collaborator


"amt"? "amount"? I understand what this is for because I know the __atomic intrinsics, but I'd like this to have a better name. "__difference_scale" or something? (I don't think that's a good name either, but it's more evocative of what it is used for.)

Member Author


This is essentially the same hack that was in another layer. I'll try to use that one instead and get rid of the duplication.

Member Author


Fixed with a rename to __atomic_ptr_inc. I think that conveys the idea that it's an increment for pointers.

}
__host__ __device__
inline int __cuda_memcmp(void const * __lhs, void const * __rhs, size_t __count) {
#ifdef __CUDA_ARCH__
Collaborator


Do we want to give this if target treatment? If not, why?

Member Author


This should be fixed now.

@@ -435,4 +463,213 @@ _Type __host__ __atomic_fetch_min(_Type volatile *__ptr, _Delta __val, int __mem
return __expected;
}

_LIBCUDACXX_END_NAMESPACE_CUDA
template <typename _Tp, int _Sco>
Collaborator


The code from here onwards repeats for GCC and MSVC, right? Can we move it to some sort of a "derived" header that is just included in both of these headers to avoid repeating all of it?

Maybe if we had a from-scratch backend for MSVC this wouldn't be an issue, but seems we aren't really in such a place, right?

Member Author


It would be 'easiest' to just strip the content from atomic_gcc.h and instead call that some kind of cxx_atomic_base.h

atomic_gcc.h need only include cxx_atomic_base.h since gcc/clang comes with the intrinsics.

atomic_msvc.h creates the back-end intrinsics and then includes cxx_atomic_base.h.

Thoughts on that method? I think that would accomplish reducing code duplication and wouldn't be too confusing. Though it still means everything relies on the GCC atomic symbols being defined.

@wmaxey wmaxey force-pushed the feature/atomic_refactor branch from 4c3ab20 to 303418a Compare July 22, 2021 21:18
@wmaxey wmaxey changed the title WIP: Refactor <atomic> and move implementation to libcxx Refactor <atomic> and move implementation to libcxx Jul 26, 2021
@wmaxey wmaxey force-pushed the feature/atomic_refactor branch 2 times, most recently from 1d1b951 to efa59b9 Compare July 29, 2021 17:46
@wmaxey wmaxey force-pushed the feature/atomic_refactor branch from 665e6f8 to 0d5fb0e Compare July 29, 2021 19:49
@wmaxey wmaxey added the testing: internal ci passed Passed internal NVIDIA CI (DVS). label Aug 2, 2021
@wmaxey wmaxey requested a review from griwes August 3, 2021 22:14
@@ -293,7 +293,7 @@ inline void __strided_memcpy(char * __destination, char const * __source, std::s
}
}

#if __CUDA_ARCH__ >= 800
#if __CUDA_MINIMUM_ARCH__ >= 800
Collaborator


I'm not seeing any further changes in this file. Line 389, for one, isn't using nvtarget.

#ifndef _LIBCUDACXX_ATOMIC_BASE_H
#define _LIBCUDACXX_ATOMIC_BASE_H

#include <type_traits>
Collaborator


If a file including this file doesn't include type_traits, type_traits is included inside a namespace, right? This needs to stop being here.

Member Author


👍 Thanks for the catch.

#endif

#ifndef __CUDACC_RTC__
#include <string.h>
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as for type_traits above here.

Collaborator

@griwes griwes left a comment


LGTM, good job!

@wmaxey
Member Author

wmaxey commented Aug 4, 2021

@wmaxey wmaxey merged commit ca45a79 into main Aug 4, 2021
@wmaxey wmaxey deleted the feature/atomic_refactor branch August 4, 2021 02:25