This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

memcpy_async should cache only in L2 when possible #220

Merged
merged 2 commits into from
Nov 5, 2021

Conversation

gonzalobg
Collaborator

@gonzalobg gonzalobg commented Oct 28, 2021

This PR adds source sizes to the memcpy_async specializations and, for the 16-byte alignment case, changes the cache operator from caching at all levels to caching only at the global (L2) level.

After an app carves some shared memory out of the L1 and then copies data into that shared memory, chances are that it will only read that data from shared memory afterwards. A second copy of the same data in the L1 would then go unused while still occupying cache lines, so caching the data only in the L2, and not in the L1, seems like the better default.
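For context, here is a minimal sketch of the user-facing pattern this affects: a block stages a tile into shared memory with cuda::memcpy_async and then reads it only from shared memory. The kernel name, tile size, and host harness (`scale_tiles`, `kTile`, `kBlocks`, `run_demo`) are made up for illustration. The `cuda::aligned_size_t<16>` argument asserts 16-byte alignment, which should let the copy take the 16-byte path described above on hardware with asynchronous copy support; elsewhere it falls back to an ordinary copy.

```cuda
#include <cooperative_groups.h>
#include <cuda/barrier>
#include <cstdio>
#include <vector>

namespace cg = cooperative_groups;

constexpr int kTile   = 256;  // floats per block, one per thread (illustrative)
constexpr int kBlocks = 4;    // illustrative grid size

// Hypothetical kernel: stage one tile into shared memory, then use it once.
__global__ void scale_tiles(const float* in, float* out, float factor) {
    __shared__ __align__(16) float tile[kTile];
    __shared__ cuda::barrier<cuda::thread_scope_block> bar;

    auto block = cg::this_thread_block();
    if (block.thread_rank() == 0) {
        init(&bar, block.size());  // one expected arrival per thread
    }
    block.sync();

    // Both pointers are 16-byte aligned and the size is a multiple of 16,
    // so the alignment promise made by aligned_size_t<16> holds.
    const float* src = in + blockIdx.x * kTile;
    cuda::memcpy_async(block, tile, src,
                       cuda::aligned_size_t<16>(sizeof(float) * kTile), bar);
    bar.arrive_and_wait();  // the tile is visible to the whole block after this

    // From here on the data is read only from shared memory.
    out[blockIdx.x * kTile + threadIdx.x] = tile[threadIdx.x] * factor;
}

// Runs the kernel and returns the number of wrong elements (-1 on CUDA error).
int run_demo() {
    const int n = kTile * kBlocks;
    const size_t bytes = n * sizeof(float);
    std::vector<float> host(n);
    for (int i = 0; i < n; ++i) host[i] = static_cast<float>(i);

    float* d_in = nullptr;
    float* d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return -1; }

    cudaMemcpy(d_in, host.data(), bytes, cudaMemcpyHostToDevice);
    scale_tiles<<<kBlocks, kTile>>>(d_in, d_out, 2.0f);
    cudaError_t err = cudaMemcpy(host.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        if (host[i] != 2.0f * static_cast<float>(i)) ++mismatches;
    }
    return mismatches;
}

int main() {
    std::printf("mismatches: %d\n", run_demo());
    return 0;
}
```

Nothing in the user code changes with this PR; the difference is only in which cache levels retain the source data while the copy is in flight.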

Closes #135.

This commit changes memcpy_async for 16-byte alignment from using
the ca (cache at all levels) hint to the cg (cache at global level)
hint, and also specifies the size of the source.
@gonzalobg gonzalobg changed the title from "memcpy_async should use cache only in the L2 when possible" to "memcpy_async should cache only in L2 when possible" Nov 3, 2021
@wmaxey wmaxey requested a review from griwes November 3, 2021 20:31
@wmaxey wmaxey added the "testing: internal ci in progress" (Currently testing on internal NVIDIA CI (DVS)) and "testing: internal ci passed" (Passed internal NVIDIA CI (DVS)) labels and removed the "testing: internal ci in progress" label Nov 3, 2021
Collaborator

@griwes griwes left a comment

Looks good to me, but do secure an approval from @ogiroux too ;>

@wmaxey wmaxey merged commit 4f42427 into NVIDIA:main Nov 5, 2021
Labels
testing: internal ci passed Passed internal NVIDIA CI (DVS).
Development

Successfully merging this pull request may close these issues.

cuda::pipeline does not use L1 bypass
4 participants