Shared Memory Curve #846

Robadob · 2022-05-06T17:57:53Z

Closes #560
Closes #571

ptheywood · 2022-05-10T10:36:14Z

As of 2a8f983, this is looking on track to resolve the register use in CUDA 11.3+.

CUDA 11.6, SM_86, boids_bruteforce inputdata is using 46 reg/thread, compared to 145 reg/thread for current master.

ptxas info    : Used 4 registers, 352 bytes cmem[0]
ptxas info    : Compiling entry function '_ZN8flamegpu22agent_function_wrapperI14inputdata_implNS_17MessageBruteForceENS_11MessageNoneEEEvjPKNS_6detail5curve5Curve10CurveTableEPKcjjjjPjjPKvSE_P17curandStateXORWOWSC_SC_SC_' for 'sm_86'
ptxas info    : Function properties for _ZN8flamegpu22agent_function_wrapperI14inputdata_implNS_17MessageBruteForceENS_11MessageNoneEEEvjPKNS_6detail5curve5Curve10CurveTableEPKcjjjjPjjPKvSE_P17curandStateXORWOWSC_SC_SC_
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 46 registers, 456 bytes cmem[0]

I've set the circles benchmark going on a 3080 + titan v in CUDA 11.2 so we can see the performance loss in that benchmark from using global curve only as an intermediate step (will be a while before they complete).

ptheywood · 2022-05-11T11:56:23Z

I've set the circles benchmark going on a 3080 + titan v in CUDA 11.2 so we can see the performance loss in that benchmark from using global curve only as an intermediate step (will be a while before they complete).

This has resulted in an invalid configuration argument error during the comm-radius portion of the benchmark, on a 3080 using CUDA 11.2. Run 235/240 within that benchmark.

2022-05-10T21:03:36Z: 235/240: circles_spatial3D 64000 40.000000 64000.000000 1.000000 40.000000 0
2022-05-10T21:03:36Z: 235/240: circles_spatial3D 64000 40.000000 64000.000000 1.000000 40.000000 0
terminate called after throwing an instance of 'flamegpu::exception::CUDAError'
  what():  /home/ptheywood/code/flamegpu/FLAMEGPU2-circles-benchmark/build-11-2-globalcurve/_deps/flamegpu2-src/include/flamegpu/gpu/detail/CUDAErrorChecking.cuh(27): CUDA Error: /home/ptheywood/code/flamegpu/FLAMEGPU2-circles-benchmark/build-11-2-globalcurve/_deps/flamegpu2-src/include/flamegpu/runtime/HostAgentAPI.cuh(899): invalid configuration argument
Aborted (core dumped)

I've pushed to the globalcurve-error branch of the FLAMEGPU2-circles-benchmark repo which uses the globalcurve commit in this PR with just the offending case. This reproduces the issue reliably on the 3080 when using CUDA 11.2 with SEATBELTS=OFF.
Running on a Titan V, this produces an illegal memory acess error instead.

CUDA_VISIBLE_DEVICES=0 FLAMEGPU_INC_DIR=/home/ptheywood/code/flamegpu/FLAMEGPU2-circles-benchmark/build-11-2-globalcurve/_deps/flamegpu2-src/include/ ./bin/Release/circles-benchmarking 
Running experiment comm-radius - 1 configs, 1 simulators, 3 repetitions
2022-05-11T15:42:11Z: 1/3: circles_spatial3D 64000 40.000000 64000.000000 1.000000 40.000000 0
terminate called after throwing an instance of 'flamegpu::exception::CUDAError'
  what():  /home/ptheywood/code/flamegpu/FLAMEGPU2-circles-benchmark/build-11-2-globalcurve/_deps/flamegpu2-src/include/flamegpu/gpu/detail/CUDAErrorChecking.cuh(27): CUDA Error: /home/ptheywood/code/flamegpu/FLAMEGPU2-circles-benchmark/build-11-2-globalcurve/_deps/flamegpu2-src/include/flamegpu/runtime/HostAgentAPI.cuh(899): invalid configuration argument

Chucking this through cuda-memcheck on the titan v reports that invalid __shared__ write of size 16 to 0x0000000 in the agent function wrapper for circiles_spatial3D output message impl.

Narrowed this down to sm_size only being set to a non zero value when SEATBELTS=ON in CUDASimualtion.cu. I've not checked if there are other instances of this.

Patching this resolves the titan v memory issue, but just moves the problem to the invalid confiugraiton error in HostAgentAPI.cuh

Robadob · 2022-05-15T21:37:24Z

Have updated DeviceEnvironment::getProperty(), such that the array element version requires the array length as a template argument. This is only used for validation internally, so it literally does nothing without seatbelts, however it provides much better consistency with the rest of the API (agent/message reads) where it is necessary.

I feel to enforce this consistency further, I will need to apply the same change to HostAPI/CUDASimulation environment methods.

Robadob · 2022-05-16T14:19:56Z

As of fb6c322, have built ran test suite in all 3 main configs on Windows (Debug + Release with/without seatbelts). USE_GLM enabled too, in each case all tests pass.

Robadob · 2022-05-19T11:39:14Z

Final 2 commits (denoted ?) are optional queries.

PRs commits are structured such that it shouldn't be squashed.

ptheywood · 2022-05-19T11:48:16Z

CUDA 11.4 A100 circles benchmark runs going for the last 3 commits (plus an alt version of the smem commit). Should finish overnight.

Will show:

if shared is an improvment, (it should be for brute at least).
if const restrict has a meaningful improvement (though as long as its not a degredation marking const restrict is valid anyway)
Whether smaller curve improves performance (partially occupancy related to smem use which circles might not be impacted by anyway, partially smaller curve table might be better anyway.

ptheywood · 2022-05-25T11:52:12Z

A100 Data for the vairous subcommits, with alpha.2 v100 data for reference.
Some series are fully occluded by others. Previously discussed on slack.

Shos it fixes the 11.3+ issue, and offers a good performance advantage.

V100 data showing the < 11.3 performance relative to the alpha.2 base case.

Global was a degredation for brute, but the smem 512 version is an improvement

ptheywood

I believe include/flamegpu/util/StringUint32Pair.h can now be deleted, as you've removed all use of it (I think).

Otherwise looks good but I've left a few comments on lines that stood out. Most of them don't need anything doing, but it might be nice to address if not too much faff, or for the @todo's promoting to issues might be worthwhile.

I went through relatively quick though so could have missed some bits.

Once done + a little history cleaning it'll be good to go IMO

include/flamegpu/gpu/CUDAAgent.h

include/flamegpu/runtime/AgentFunction.cuh

include/flamegpu/runtime/AgentFunctionCondition.cuh

include/flamegpu/runtime/detail/curve/Curve.cuh

include/flamegpu/runtime/detail/curve/DeviceCurve.cuh

include/flamegpu/runtime/detail/curve/HostCurve.cuh

src/flamegpu/gpu/CUDASimulation.cu

Robadob · 2022-05-25T20:01:53Z

@ptheywood Have applied your changes. You should in particular look at 7de1c16 as this was a substantial change due to your notes.

I have then ran tests (Windows/Release) and all still pass.

I will still need to rebase and clean up the commits if you're happy with all this.

ptheywood

Tests all pass under linux (SEATBELTS=OFF) and profiling still works so the device resetting is still solid.

The other changes all look good. I'll let you merge in case you want to tidy history (the individual commits did make it very easy to review the changes though, so that's appreciated)

Robadob · 2022-05-26T10:16:54Z

Tests all pass under linux (SEATBELTS=OFF) and profiling still works so the device resetting is still solid.

The other changes all look good. I'll let you merge in case you want to tidy history (the individual commits did make it very easy to review the changes though, so that's appreciated)

The only real caveat with device resetting is that i've removed tracking of which devices CUDASim is init on.

So if someone has 2 independent CUDASim on different device, only the 2nd to end will be reset.

Likewise, if they have a CUDASim alive, when their CUDAEnsemble exits, if it's on same device as ensemble was using, it will be reset

Tests pass (Windows, Release, Seatbelts=ON, 1038 Pass, 5 Disabled) Note, Environment cache is still in constant memory.

Tests pass (Windows, Release, Seatbelts=ON, 1038 pass, 5 Disabled)

This involves some big changes to Curve and Environment Manager. Curve is now split into 3 classes Curve, DeviceCurve, HostCurve, refactored to remove features redundant to new use-case. EnvironmentManager has also been refactored to remove features redundant to new use-case. 1028 Tests pass, Debug, Windows, Seatbelts=ON 548 Python Tests pass, Release, Windows, Seatbelts=Off, 10 skip.

and update how cudaDeviceReset() is automatically triggered. Purge should no longer be required as device-wide singletons were removed in the previous commit.

…nal template parameter N. Passing 0 (by default) does no length checking, passing any other value is tested against the length for parity with device API agent variable methods.

…gument. Closes #831

Robadob self-assigned this May 6, 2022

Robadob force-pushed the shared_mem_curve branch from 662e896 to f62d1f4 Compare May 6, 2022 18:22

Robadob force-pushed the shared_mem_curve branch from 680bc3e to df29aa5 Compare May 11, 2022 06:28

Robadob force-pushed the shared_mem_curve branch 2 times, most recently from 3d2e30f to 69a6a39 Compare May 12, 2022 22:22

ptheywood mentioned this pull request May 13, 2022

Use std::any #857

Closed

Robadob force-pushed the shared_mem_curve branch 2 times, most recently from 85d5e33 to 17a302e Compare May 15, 2022 21:43

Robadob force-pushed the shared_mem_curve branch 3 times, most recently from d7469d7 to c790a59 Compare May 19, 2022 11:13

Robadob marked this pull request as ready for review May 19, 2022 11:38

Robadob requested review from mondus and ptheywood May 19, 2022 11:39

Robadob added the optimisation label May 19, 2022

ptheywood requested changes May 25, 2022

View reviewed changes

Robadob requested a review from ptheywood May 25, 2022 20:00

Robadob force-pushed the shared_mem_curve branch from b77be3d to 4d2ea07 Compare May 25, 2022 21:05

ptheywood approved these changes May 26, 2022

View reviewed changes

Robadob added 2 commits May 26, 2022 11:22

Move curve from constant mem to device ptr.

7426664

Tests pass (Windows, Release, Seatbelts=ON, 1038 Pass, 5 Disabled) Note, Environment cache is still in constant memory.

Move env buffer from constant mem to device ptr.

bac097e

Tests pass (Windows, Release, Seatbelts=ON, 1038 pass, 5 Disabled)

Robadob added 5 commits May 26, 2022 11:22

Remove all purge behaviours.

e3da586

and update how cudaDeviceReset() is automatically triggered. Purge should no longer be required as device-wide singletons were removed in the previous commit.

All Environment set/get array property element methods now have optio…

3ababda

…nal template parameter N. Passing 0 (by default) does no length checking, passing any other value is tested against the length for parity with device API agent variable methods.

Move curve into shared memory.

bfbccb9

Py Breaking Change: Unify array methods, removing redundant length ar…

e4919f8

…gument. Closes #831

Robadob force-pushed the shared_mem_curve branch from 4d2ea07 to e4919f8 Compare May 26, 2022 10:22

ptheywood merged commit 620e904 into master May 26, 2022

ptheywood deleted the shared_mem_curve branch May 26, 2022 13:07

ptheywood mentioned this pull request May 26, 2022

Mixed RTC + Non RTC models may not be correct #842

Closed

Robadob mentioned this pull request May 27, 2022

Update docs for Py API breaking changes. FLAMEGPU/FLAMEGPU2-docs#86

Merged

ptheywood mentioned this pull request Oct 19, 2022

Updating Constant Cache once per layer. #353

Closed

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Shared Memory Curve #846

Shared Memory Curve #846

Robadob commented May 6, 2022 •

edited

Loading

ptheywood commented May 10, 2022 •

edited

Loading

ptheywood commented May 11, 2022 •

edited

Loading

Robadob commented May 15, 2022

Robadob commented May 16, 2022

Robadob commented May 19, 2022 •

edited

Loading

ptheywood commented May 19, 2022

ptheywood commented May 25, 2022

ptheywood left a comment

Robadob commented May 25, 2022 •

edited

Loading

ptheywood left a comment

Robadob commented May 26, 2022

Shared Memory Curve #846

Shared Memory Curve #846

Conversation

Robadob commented May 6, 2022 • edited Loading

ptheywood commented May 10, 2022 • edited Loading

ptheywood commented May 11, 2022 • edited Loading

Robadob commented May 15, 2022

Robadob commented May 16, 2022

Robadob commented May 19, 2022 • edited Loading

ptheywood commented May 19, 2022

ptheywood commented May 25, 2022

ptheywood left a comment

Choose a reason for hiding this comment

Robadob commented May 25, 2022 • edited Loading

ptheywood left a comment

Choose a reason for hiding this comment

Robadob commented May 26, 2022

Robadob commented May 6, 2022 •

edited

Loading

ptheywood commented May 10, 2022 •

edited

Loading

ptheywood commented May 11, 2022 •

edited

Loading

Robadob commented May 19, 2022 •

edited

Loading

Robadob commented May 25, 2022 •

edited

Loading