
Device Packing Optimization #1382

Merged
merged 4 commits into from
Apr 21, 2021

Conversation

wrtobin
Collaborator

@wrtobin wrtobin commented Apr 6, 2021

Improve CUDA buffer packing performance by using assignment-based packing on properly aligned memory.

This reduces our packing time by approximately one order of magnitude in observed cases. Further reduction should be possible through kernel register reduction and loop fusion (not in this PR).
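A minimal sketch of the idea described above, not the GEOSX implementation itself: when the raw byte buffer is known to be properly aligned for `T`, it can be reinterpreted as `T *` and packed with plain typed assignments instead of per-value byte copies. The function names here are hypothetical; only `buffer_unit_type` and the `reinterpret_cast` pattern appear in the PR.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

using buffer_unit_type = std::uint8_t;

// Byte-wise packing: copies each value with memcpy, safe for any alignment,
// but expensive inside a device kernel.
template< typename T >
void packBytewise( buffer_unit_type * buffer, T const * values, std::size_t n )
{
  for( std::size_t i = 0; i < n; ++i )
  {
    std::memcpy( buffer + i * sizeof( T ), &values[ i ], sizeof( T ) );
  }
}

// Assignment-based packing: valid only when `buffer` is properly aligned
// for T. A typed store per value is what the PR switches to for CUDA.
template< typename T >
void packByAssignment( buffer_unit_type * buffer, T const * values, std::size_t n )
{
  assert( reinterpret_cast< std::uintptr_t >( buffer ) % alignof( T ) == 0 );
  T * devBuffer = reinterpret_cast< T * >( buffer );
  for( std::size_t i = 0; i < n; ++i )
  {
    devBuffer[ i ] = values[ i ];  // one typed assignment per value
  }
}
```

Both routines produce byte-identical buffers; the assignment form simply lets the compiler emit full-width loads and stores instead of byte traffic.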

@wrtobin
Collaborator Author

wrtobin commented Apr 6, 2021

I've already run the integratedTests on both lassen and quartz, so as long as this passes the CI checks it should be good to go.

@wrtobin wrtobin changed the title Feature/wrtobin/packing opt CUDA Packing Optimization Apr 6, 2021
@wrtobin wrtobin changed the title CUDA Packing Optimization Device Packing Optimization Apr 6, 2021
if( DO_PACKING )
{
T * devBuffer = reinterpret_cast< T * >( buffer );
parallelDeviceStream stream;
Contributor

We should clean up this stream and events business later. We should not be creating a new stream for every field that we pack and we should probably do stream based synchronization instead of event based.
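A minimal C++ sketch of the cleanup the reviewer suggests, using a hypothetical `Stream` stand-in rather than the real `parallelDeviceStream` type: hoist the stream out of the per-field loop so one stream is created and synchronized once, instead of constructing a new stream for every packed field.

```cpp
#include <vector>

// Hypothetical stand-in for a device stream; counts constructions so the
// two patterns can be compared.
struct Stream
{
  static int created;
  Stream() { ++created; }
  void enqueuePack( int /*field*/ ) { /* packing kernel would launch here */ }
  void synchronize() { /* stream-based sync instead of per-event sync */ }
};
int Stream::created = 0;

// Current pattern criticized above: one stream per packed field.
void packPerFieldStream( std::vector< int > const & fields )
{
  for( int f : fields )
  {
    Stream s;          // constructed every iteration
    s.enqueuePack( f );
    s.synchronize();
  }
}

// Suggested pattern: a single stream reused for every field.
void packSharedStream( std::vector< int > const & fields )
{
  Stream s;            // constructed once
  for( int f : fields )
  {
    s.enqueuePack( f );
  }
  s.synchronize();     // one stream-wide synchronization at the end
}
```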

buffer_unit_type const * threadBuffer = devBuffer + i * unitSize;
LvArray::forValuesInSlice( var[ indices[ i ] ], [&threadBuffer] GEOSX_DEVICE ( T & value )
T const * threadBuffer = &devBuffer[ ii * sliceSize ];
LvArray::forValuesInSlice( var[ indices[ ii ] ], [&threadBuffer] GEOSX_DEVICE ( T & value )
Contributor

I was thinking about it later, and you can only get rid of forValuesInSlice if the slice is contiguous. If we don't care about the order it's packed in, as long as it's unpacked in a consistent manner, then we can probably do much better, but it would take some thought as to how best to pack things, and it would likely involve a single thread packing parts of multiple objects in some cases.
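A small sketch of the contiguity point made above, with a hypothetical 1-D `Slice` type (not LvArray's `ArraySlice`): per-value iteration in the style of `forValuesInSlice` works for any stride, but when the slice is contiguous the same pack reduces to a single bulk copy.

```cpp
#include <cstring>

// Hypothetical 1-D slice: contiguous iff stride == 1.
struct Slice
{
  double const * data;
  std::size_t size;
  std::size_t stride;
  bool isContiguous() const { return stride == 1; }
};

// Per-value packing handles any stride (what forValuesInSlice provides).
void packPerValue( double * dst, Slice const & s )
{
  for( std::size_t i = 0; i < s.size; ++i )
  {
    dst[ i ] = s.data[ i * s.stride ];
  }
}

// A contiguous slice can instead be packed with one bulk copy.
void packSlice( double * dst, Slice const & s )
{
  if( s.isContiguous() )
  {
    std::memcpy( dst, s.data, s.size * sizeof( double ) );
  }
  else
  {
    packPerValue( dst, s );  // strided slices still need element loops
  }
}
```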

Member

@rrsettgast rrsettgast left a comment

Looks good. I don't know if this is "outdated" but I still always pre-increment rather than post-increment.

@corbett5 can you merge this if the suggested changes are still worthwhile?

Co-authored-by: Randolph Settgast <settgast1@llnl.gov>
@wrtobin wrtobin added the "ci: run CUDA builds" label (allows triggering costly CUDA jobs) and removed the "flag: ready for review" label Apr 7, 2021
@rrsettgast rrsettgast merged commit 88b0759 into develop Apr 21, 2021
@rrsettgast rrsettgast deleted the feature/wrtobin/packing-opt branch April 21, 2021 04:07
Labels
ci: run CUDA builds (allows triggering costly CUDA jobs), type: optimization
3 participants