Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Shared Memory Curve #846

Merged
merged 7 commits into from
May 26, 2022
Merged

Shared Memory Curve #846

merged 7 commits into from
May 26, 2022

Conversation

Robadob
Copy link
Member

@Robadob Robadob commented May 6, 2022

  • Curve to device ptr
  • Environment to device ptr
  • Curve to per-agent function
  • Curve to shared memory
  • Check/update pyflamegpu tests

Closes #560
Closes #571

@Robadob Robadob self-assigned this May 6, 2022
@Robadob Robadob force-pushed the shared_mem_curve branch from 662e896 to f62d1f4 Compare May 6, 2022 18:22
@ptheywood
Copy link
Member

ptheywood commented May 10, 2022

As of 2a8f983, this is looking on track to resolve the register use in CUDA 11.3+.

CUDA 11.6, SM_86, boids_bruteforce inputdata is using 46 reg/thread, compared to 145 reg/thread for current master.

ptxas info    : Used 4 registers, 352 bytes cmem[0]
ptxas info    : Compiling entry function '_ZN8flamegpu22agent_function_wrapperI14inputdata_implNS_17MessageBruteForceENS_11MessageNoneEEEvjPKNS_6detail5curve5Curve10CurveTableEPKcjjjjPjjPKvSE_P17curandStateXORWOWSC_SC_SC_' for 'sm_86'
ptxas info    : Function properties for _ZN8flamegpu22agent_function_wrapperI14inputdata_implNS_17MessageBruteForceENS_11MessageNoneEEEvjPKNS_6detail5curve5Curve10CurveTableEPKcjjjjPjjPKvSE_P17curandStateXORWOWSC_SC_SC_
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 46 registers, 456 bytes cmem[0]

I've set the circles benchmark going on a 3080 + titan v in CUDA 11.2 so we can see the performance loss in that benchmark from using global curve only as an intermediate step (will be a while before they complete).

@Robadob Robadob force-pushed the shared_mem_curve branch from 680bc3e to df29aa5 Compare May 11, 2022 06:28
@ptheywood
Copy link
Member

ptheywood commented May 11, 2022

I've set the circles benchmark going on a 3080 + titan v in CUDA 11.2 so we can see the performance loss in that benchmark from using global curve only as an intermediate step (will be a while before they complete).

This has resulted in an invalid configuration argument error during the comm-radius portion of the benchmark, on a 3080 using CUDA 11.2. Run 235/240 within that benchmark.

2022-05-10T21:03:36Z: 235/240: circles_spatial3D 64000 40.000000 64000.000000 1.000000 40.000000 0
2022-05-10T21:03:36Z: 235/240: circles_spatial3D 64000 40.000000 64000.000000 1.000000 40.000000 0
terminate called after throwing an instance of 'flamegpu::exception::CUDAError'
  what():  /home/ptheywood/code/flamegpu/FLAMEGPU2-circles-benchmark/build-11-2-globalcurve/_deps/flamegpu2-src/include/flamegpu/gpu/detail/CUDAErrorChecking.cuh(27): CUDA Error: /home/ptheywood/code/flamegpu/FLAMEGPU2-circles-benchmark/build-11-2-globalcurve/_deps/flamegpu2-src/include/flamegpu/runtime/HostAgentAPI.cuh(899): invalid configuration argument
Aborted (core dumped)

I've pushed to the globalcurve-error branch of the FLAMEGPU2-circles-benchmark repo which uses the globalcurve commit in this PR with just the offending case. This reproduces the issue reliably on the 3080 when using CUDA 11.2 with SEATBELTS=OFF.
Running on a Titan V, this produces an illegal memory acess error instead.

CUDA_VISIBLE_DEVICES=0 FLAMEGPU_INC_DIR=/home/ptheywood/code/flamegpu/FLAMEGPU2-circles-benchmark/build-11-2-globalcurve/_deps/flamegpu2-src/include/ ./bin/Release/circles-benchmarking 
Running experiment comm-radius - 1 configs, 1 simulators, 3 repetitions
2022-05-11T15:42:11Z: 1/3: circles_spatial3D 64000 40.000000 64000.000000 1.000000 40.000000 0
terminate called after throwing an instance of 'flamegpu::exception::CUDAError'
  what():  /home/ptheywood/code/flamegpu/FLAMEGPU2-circles-benchmark/build-11-2-globalcurve/_deps/flamegpu2-src/include/flamegpu/gpu/detail/CUDAErrorChecking.cuh(27): CUDA Error: /home/ptheywood/code/flamegpu/FLAMEGPU2-circles-benchmark/build-11-2-globalcurve/_deps/flamegpu2-src/include/flamegpu/runtime/HostAgentAPI.cuh(899): invalid configuration argument

Chucking this through cuda-memcheck on the titan v reports that invalid __shared__ write of size 16 to 0x0000000 in the agent function wrapper for circiles_spatial3D output message impl.

Narrowed this down to sm_size only being set to a non zero value when SEATBELTS=ON in CUDASimualtion.cu. I've not checked if there are other instances of this.

Patching this resolves the titan v memory issue, but just moves the problem to the invalid confiugraiton error in HostAgentAPI.cuh

@Robadob Robadob force-pushed the shared_mem_curve branch 2 times, most recently from 3d2e30f to 69a6a39 Compare May 12, 2022 22:22
@ptheywood ptheywood mentioned this pull request May 13, 2022
@Robadob
Copy link
Member Author

Robadob commented May 15, 2022

Have updated DeviceEnvironment::getProperty(), such that the array element version requires the array length as a template argument. This is only used for validation internally, so it literally does nothing without seatbelts, however it provides much better consistency with the rest of the API (agent/message reads) where it is necessary.

I feel to enforce this consistency further, I will need to apply the same change to HostAPI/CUDASimulation environment methods.

@Robadob Robadob force-pushed the shared_mem_curve branch 2 times, most recently from 85d5e33 to 17a302e Compare May 15, 2022 21:43
@Robadob
Copy link
Member Author

Robadob commented May 16, 2022

As of fb6c322, have built ran test suite in all 3 main configs on Windows (Debug + Release with/without seatbelts). USE_GLM enabled too, in each case all tests pass.

@Robadob Robadob force-pushed the shared_mem_curve branch 3 times, most recently from d7469d7 to c790a59 Compare May 19, 2022 11:13
@Robadob Robadob marked this pull request as ready for review May 19, 2022 11:38
@Robadob
Copy link
Member Author

Robadob commented May 19, 2022

Final 2 commits (denoted ?) are optional queries.

PRs commits are structured such that it shouldn't be squashed.

@ptheywood
Copy link
Member

CUDA 11.4 A100 circles benchmark runs going for the last 3 commits (plus an alt version of the smem commit). Should finish overnight.

Will show:

  • if shared is an improvment, (it should be for brute at least).
  • if const restrict has a meaningful improvement (though as long as its not a degredation marking const restrict is valid anyway)
  • Whether smaller curve improves performance (partially occupancy related to smem use which circles might not be impacted by anyway, partially smaller curve table might be better anyway.

@ptheywood
Copy link
Member

A100 Data for the vairous subcommits, with alpha.2 v100 data for reference.
Some series are fully occluded by others. Previously discussed on slack.

Shos it fixes the 11.3+ issue, and offers a good performance advantage.

fixed-density--agent_count--step-s--model--all
fixed-density--agent_count--step-s--model--zoomed

V100 data showing the < 11.3 performance relative to the alpha.2 base case.

Global was a degredation for brute, but the smem 512 version is an improvement

fixed-density--agent_count--step-s--model--all
fixed-density--agent_count--step-s--model--zoomed

Copy link
Member

@ptheywood ptheywood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe include/flamegpu/util/StringUint32Pair.h can now be deleted, as you've removed all use of it (I think).

Otherwise looks good but I've left a few comments on lines that stood out. Most of them don't need anything doing, but it might be nice to address if not too much faff, or for the @todo's promoting to issues might be worthwhile.

I went through relatively quick though so could have missed some bits.

Once done + a little history cleaning it'll be good to go IMO

@Robadob Robadob requested a review from ptheywood May 25, 2022 20:00
@Robadob
Copy link
Member Author

Robadob commented May 25, 2022

@ptheywood Have applied your changes. You should in particular look at 7de1c16 as this was a substantial change due to your notes.

I have then ran tests (Windows/Release) and all still pass.

I will still need to rebase and clean up the commits if you're happy with all this.

@Robadob Robadob force-pushed the shared_mem_curve branch from b77be3d to 4d2ea07 Compare May 25, 2022 21:05
Copy link
Member

@ptheywood ptheywood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests all pass under linux (SEATBELTS=OFF) and profiling still works so the device resetting is still solid.

The other changes all look good. I'll let you merge in case you want to tidy history (the individual commits did make it very easy to review the changes though, so that's appreciated)

@Robadob
Copy link
Member Author

Robadob commented May 26, 2022

Tests all pass under linux (SEATBELTS=OFF) and profiling still works so the device resetting is still solid.

The other changes all look good. I'll let you merge in case you want to tidy history (the individual commits did make it very easy to review the changes though, so that's appreciated)

The only real caveat with device resetting is that i've removed tracking of which devices CUDASim is init on.

So if someone has 2 independent CUDASim on different device, only the 2nd to end will be reset.

Likewise, if they have a CUDASim alive, when their CUDAEnsemble exits, if it's on same device as ensemble was using, it will be reset

Robadob added 2 commits May 26, 2022 11:22
Tests pass (Windows, Release, Seatbelts=ON, 1038 Pass, 5 Disabled)

Note, Environment cache is still in constant memory.
Tests pass (Windows, Release, Seatbelts=ON, 1038 pass, 5 Disabled)
Robadob added 5 commits May 26, 2022 11:22
This involves some big changes to Curve and Environment Manager.
Curve is now split into 3 classes Curve, DeviceCurve, HostCurve, refactored to remove features redundant to new use-case.
EnvironmentManager has also been refactored to remove features redundant to new use-case.
1028 Tests pass, Debug, Windows, Seatbelts=ON
548 Python Tests pass, Release, Windows, Seatbelts=Off, 10 skip.
and update how cudaDeviceReset() is automatically triggered.
Purge should no longer be required as device-wide singletons were removed in the previous commit.
…nal template parameter N.

Passing 0 (by default) does no length checking, passing any other value is tested against the length for parity with device API agent variable methods.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Use __shared__ memory for cuRVE CUDA 11.3+ Register Usage
2 participants