-
Notifications
You must be signed in to change notification settings - Fork 91
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Device Packing Optimization #1382
Conversation
…lignment for assignment packing
…earing in time history collection
I've already run the integratedTests on both lassen and quartz, so as long as this passes the CI checks it should be good to go. |
if( DO_PACKING ) | ||
{ | ||
T * devBuffer = reinterpret_cast< T * >( buffer ); | ||
parallelDeviceStream stream; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should clean up this stream and events business later. We should not be creating a new stream for every field that we pack and we should probably do stream based synchronization instead of event based.
buffer_unit_type const * threadBuffer = devBuffer + i * unitSize; | ||
LvArray::forValuesInSlice( var[ indices[ i ] ], [&threadBuffer] GEOSX_DEVICE ( T & value ) | ||
T const * threadBuffer = &devBuffer[ ii * sliceSize ]; | ||
LvArray::forValuesInSlice( var[ indices[ ii ] ], [&threadBuffer] GEOSX_DEVICE ( T & value ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking about it later and you can only get rid of forValuesInSlice
if the slice is contiguous. If we don't care the order that it's packed in, as long as it's unpacked in a consistent manner then we can probably do much better but it would take some thinking as to how best to pack things and it would likely involve a single thread packing parts of multiple objects in some cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. I don't know if this is "outdated" but I still always pre-increment rather than post-increment.
@corbett5 can you merge this if the suggested changes are still worthwhile?
Co-authored-by: Randolph Settgast <settgast1@llnl.gov>
Improve CUDA buffer packing performance by using assignment-based packing on properly aligned memory.
This reduces our packing time approximately one order of magnitude in observed cases. Further reduction should be possible through kernel register reduction and loop fusion (not in this PR).