
[alpaka] Refactor prefixScan implementation #220

Open — wants to merge 1 commit into base: master

Conversation

antoniopetre (Contributor)

The prefixScan algorithm is implemented in Alpaka using two kernels, while a single kernel is used in native CUDA.

I refactored the prefixScan implementation to use a single kernel, similar to the native CUDA implementation.
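For context, here is a sequential sketch of the multi-block prefix scan that both versions compute (an illustration, not the PR's actual code; the function name `multiBlockPrefixScan` is hypothetical). A two-kernel implementation launches phase 1 and phases 2+3 as separate kernels, relying on the implicit synchronization between launches; a single-kernel implementation fuses all three phases and replaces the launch boundary with a device-wide barrier.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Sequential model of a multi-block prefix scan (illustration only):
//   phase 1: inclusive scan inside each block
//   phase 2: exclusive scan over the per-block sums
//   phase 3: each block adds its offset to its elements
std::vector<uint32_t> multiBlockPrefixScan(std::vector<uint32_t> const& in,
                                           std::size_t blockSize) {
  std::size_t const n = in.size();
  std::size_t const numBlocks = (n + blockSize - 1) / blockSize;
  std::vector<uint32_t> out(n);
  std::vector<uint32_t> blockSums(numBlocks, 0);

  // Phase 1: inclusive scan within each block (one GPU block each).
  for (std::size_t b = 0; b < numBlocks; ++b) {
    uint32_t acc = 0;
    for (std::size_t i = b * blockSize; i < std::min(n, (b + 1) * blockSize); ++i) {
      acc += in[i];
      out[i] = acc;
    }
    blockSums[b] = acc;
  }

  // A fused single-kernel variant needs a grid-wide barrier here, since
  // phase 2 reads the sums produced by every block.

  // Phase 2: exclusive scan of the per-block sums.
  uint32_t offset = 0;
  for (std::size_t b = 0; b < numBlocks; ++b) {
    uint32_t const s = blockSums[b];
    blockSums[b] = offset;
    offset += s;
  }

  // Phase 3: each block adds its offset.
  for (std::size_t b = 0; b < numBlocks; ++b)
    for (std::size_t i = b * blockSize; i < std::min(n, (b + 1) * blockSize); ++i)
      out[i] += blockSums[b];

  return out;
}
```

The correctness of the fused version hinges entirely on that grid-wide barrier; without it, a block can start phase 2 before all blocks have published their sums, which is a typical source of illegal-memory-access or wrong-result bugs in single-kernel scans.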

@makortel makortel added the alpaka label Sep 9, 2021
@fwyzard force-pushed the refactor_prefixScan branch 2 times, most recently from 20130b8 to fb7bd6f on October 12, 2021 09:49
fwyzard (Contributor) commented Oct 12, 2021

Fixed conflicts and applied code formatting.

@fwyzard force-pushed the refactor_prefixScan branch 3 times, most recently from be2894f to d427564 on October 14, 2021 08:33
fwyzard (Contributor) commented Oct 14, 2021

Rebased and fixed conflicts.

fwyzard (Contributor) commented Oct 15, 2021

Rebased and fixed conflicts.

makortel (Collaborator) left a review comment


In general looks ok.

src/alpaka/AlpakaCore/prefixScan.h — review thread (outdated, resolved)
makortel (Collaborator) commented Oct 15, 2021

On Cori (with CUDA 11.2) I got the following failure when running ./alpaka --cuda

Processing 1000 events, of which 1 concurrently, with 1 threads.
terminate called after throwing an instance of 'std::runtime_error'
  what():  .../pixeltrack-standalone/external/alpaka/include/alpaka/mem/buf/BufUniformCudaHipRt.hpp(101) 'cudaFree(reinterpret_cast<void*>(memPtr))' returned error  : 'cudaErrorIllegalAddress': 'an illegal memory access was encountered'!

I'm really puzzled what BufUniformCudaHipRt is doing here (ok, maybe it is simply the buffer type shared between the CUDA and HIP back-ends). The master version runs fine.

makortel (Collaborator)

Here is a stack trace of the exception

#0  __cxxabiv1::__cxa_throw (obj=obj@entry=0xb42e930, tinfo=0x2aaaac5909d0 <typeinfo for std::runtime_error>, dest=0x2aaaac2c3b90 <std::runtime_error::~runtime_error()>) at ../../.././libstdc++-v3/libsupc++/eh_throw.cc:80
#1  0x00002aaaab8f1ad8 in alpaka::uniform_cuda_hip::detail::rtCheck (line=<optimized out>, file=<optimized out>, desc=<optimized out>, error=<optimized out>) at .../pixeltrack-standalone/external/alpaka/include/alpaka/core/UniformCudaHip.hpp:67
#2  alpaka::uniform_cuda_hip::detail::rtCheckIgnore<>(cudaError const&, char const*, char const*, int const&) (error=<optimized out>, cmd=<optimized out>, file=<optimized out>, line=<optimized out>) at .../pixeltrack-standalone/external/alpaka/include/alpaka/core/UniformCudaHip.hpp:88
#3  0x00002aaab6cf3324 in alpaka::traits::CurrentThreadWaitFor<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRtBase, void>::currentThreadWaitFor (queue=...) at /global/common/cori_cle7/software/sles15_cgpu/gcc/8.3.0/include/c++/8.3.0/bits/shared_ptr_base.h:1018
#4  alpaka::wait<alpaka::QueueUniformCudaHipRtNonBlocking> (awaited=...) at .../pixeltrack-standalone/external/alpaka/include/alpaka/wait/Traits.hpp:38
#5  alpaka_cuda_async::gpuVertexFinder::Producer::makeAsync (this=this@entry=0xbd5db8, tksoa=tksoa@entry=0x2aaae6000000, ptMin=<optimized out>, queue=...) at .../pixeltrack-standalone/src/alpaka/plugin-PixelVertexFinding/alpaka/gpuVertexFinder.cc:179
#6  0x00002aaab6cf952a in alpaka_cuda_async::PixelVertexProducerAlpaka::produce (this=0xbd5da8, iEvent=..., iSetup=...) at .../pixeltrack-standalone/src/alpaka/plugin-PixelVertexFinding/alpaka/PixelVertexProducerAlpaka.cc:53
#7  0x00002aaab6cfa1d4 in edm::EDProducer::doProduce (eventSetup=..., event=..., this=<optimized out>) at .../pixeltrack-standalone/src/alpaka/Framework/EDProducer.h:19
#8  edm::WorkerT<alpaka_cuda_async::PixelVertexProducerAlpaka>::doWorkAsync(edm::Event&, edm::EventSetup const&, edm::WaitingTask*)::{lambda(std::__exception_ptr::exception_ptr const*)#1}::operator()(std::__exception_ptr::exception_ptr const*) (iPtr=<optimized out>, this=<optimized out>)
    at .../pixeltrack-standalone/src/alpaka/Framework/Worker.h:69
#9  edm::FunctorWaitingTask<edm::WorkerT<alpaka_cuda_async::PixelVertexProducerAlpaka>::doWorkAsync(edm::Event&, edm::EventSetup const&, edm::WaitingTask*)::{lambda(std::__exception_ptr::exception_ptr const*)#1}>::execute() (this=0x2aaab7f3fd40) at .../pixeltrack-standalone/src/alpaka/Framework/WaitingTask.h:78
#10 0x00002aaaabd6d07d in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::process_bypass_loop (this=this@entry=0x2aaab7f93e00, context_guard=..., t=t@entry=0x2aaab7f3fd40, isolation=isolation@entry=0) at ../../include/tbb/task.h:992
#11 0x00002aaaabd6d375 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all (this=0x2aaab7f93e00, parent=..., child=<optimized out>) at ../../include/tbb/task.h:992
#12 0x000000000041bf5b in tbb::task::wait_for_all (this=0x2aaab7f97d40) at .../pixeltrack-standalone/external/tbb/include/tbb/task.h:992
#13 edm::EventProcessor::runToCompletion (this=this@entry=0x7fffffff5960) at .../pixeltrack-standalone/src/alpaka/bin/EventProcessor.cc:37
#14 0x00000000004112ce in main (argc=<optimized out>, argv=<optimized out>) at .../pixeltrack-standalone/src/alpaka/bin/main.cc:176

fwyzard (Contributor) commented Oct 20, 2021

Fixed conflicts, rebased, etc.

fwyzard (Contributor) commented Oct 20, 2021

While the validation passes, I now see a small but systematic loss in performance.

Before:

$ CUDA_VISIBLE_DEVICES=0 numactl -N 0 ./alpaka --cuda --numberOfThreads 8 --numberOfStreams 16 --validation --maxEvents 10000; echo; for N in 1 2 3 4; do CUDA_VISIBLE_DEVICES=0 numactl -N 0 ./alpaka --cuda --numberOfThreads 8 --numberOfStreams 16 --maxEvents 10000; done
Processing 10000 events, of which 16 concurrently, with 8 threads.
CountValidator: all 10000 events passed validation
 Average relative track difference 0.000880287 (all within tolerance)
 Average absolute vertex difference 0.0007 (all within tolerance)
Processed 10000 events in 4.353466e+01 seconds, throughput 229.702 events/s.

Processing 10000 events, of which 16 concurrently, with 8 threads.
Processed 10000 events in 4.096583e+01 seconds, throughput 244.106 events/s.
Processed 10000 events in 4.049791e+01 seconds, throughput 246.926 events/s.
Processed 10000 events in 4.007989e+01 seconds, throughput 249.502 events/s.
Processed 10000 events in 4.102423e+01 seconds, throughput 243.758 events/s.

After:

$ CUDA_VISIBLE_DEVICES=0 numactl -N 0 ./alpaka --cuda --numberOfThreads 8 --numberOfStreams 16 --validation --maxEvents 10000; echo; for N in 1 2 3 4; do CUDA_VISIBLE_DEVICES=0 numactl -N 0 ./alpaka --cuda --numberOfThreads 8 --numberOfStreams 16 --maxEvents 10000; done
Processing 10000 events, of which 16 concurrently, with 8 threads.
CountValidator: all 10000 events passed validation
 Average relative track difference 0.00088813 (all within tolerance)
 Average absolute vertex difference 0.0004 (all within tolerance)
Processed 10000 events in 4.477171e+01 seconds, throughput 223.355 events/s.

Processing 10000 events, of which 16 concurrently, with 8 threads.
Processed 10000 events in 4.160250e+01 seconds, throughput 240.37 events/s.
Processed 10000 events in 4.151133e+01 seconds, throughput 240.898 events/s.
Processed 10000 events in 4.216559e+01 seconds, throughput 237.16 events/s.
Processed 10000 events in 4.186934e+01 seconds, throughput 238.838 events/s.

So 2-3% slower.

Labels: alpaka
Projects: none yet
4 participants