Support for loading values through the texture cache (__ldg) #158

maleadt · 2018-02-05T16:15:52Z

This branch packs a couple of improvements, as well as initial support for __ldg for loading values through the texture cache. See the CUDA docs for more info, but in short: the texture cache is faster than the global cache (which caches global values automatically), but it is a non-coherent cache which implies that the array should be read-only for the entire duration of the kernel.

This PR only features the compiler support, no real front-end yet. Ideas/suggestions? Keyword argument to getindex? Constness typevar for CuArray/CuDeviceArray? Even though we'd need proper alias analysis (ref JuliaLang/julia#25890), this should already be usable in eg. non-mutating broadcast.

cc @vchuravy @MikeInnes

MikeInnes · 2018-02-05T18:34:52Z

Perhaps something like @cuda kernel(y, readonly(x))? And readonly can create a simple wrapper and/or set a type-level flag that overrides getindex.

vchuravy · 2018-02-05T18:54:39Z

Yeah I think I am going to try out the readonly(x) -> Const(DeviceArray) way and see how it feels.

cdsousa · 2018-03-13T16:22:54Z

Hi, let me ask this: will this change allow to do something like the function tex2d?
(I'm really missing texture access with (free) interpolation in CUDAnative.jl kernels.)

maleadt · 2018-03-13T16:32:02Z

Sorry, no. This just exposes the texture cache, proper texture memory isn't supported yet.

cdsousa · 2018-03-13T17:15:14Z

But will it enable one to read device array memory, and then do the interpolation by hand, with performance comparable to texture fetches using tex2d?

…

On Tue, Mar 13, 2018 at 4:32 PM Tim Besard ***@***.***> wrote: Sorry, no. This just exposes the texture *cache*, proper texture memory isn't supported yet. — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#158 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAUWzzZWpq-jN0CtZ4HZl7zeEd93iUl3ks5td_SDgaJpZM4R5sii> .

maleadt · 2018-03-16T09:11:08Z

Sorry, missed your reply.

interpolation by hand, with performance comparable to texture fetches using tex2d

Hard to answer, I don't know how this interpolation is performed in hardware (I can't find much on the web). From a quick glance it looks like tex2d is compiled to a compiler intrinsic called __tex2Dgather, which is turned into the PTX instruction tex.2d, which generates the SASS instruction TEX. Assuming this uses a hardware sampler, you won't get all of the performance benefit from only loading from the texture memory cache as implemented in this PR. But if your workload is memory bound, this PR gets you most of the way.

For my own reference (if I ever have time to implement this feature):

__global__ void tex2D_kernel(float* output, cudaTextureObject_t texObj) 
{
    output[0] = tex2D<float>(texObj, 1, 2);
}

.visible .entry _Z12tex2D_kernelPfy(
        .param .u64 _Z12tex2D_kernelPfy_param_0,
        .param .u64 _Z12tex2D_kernelPfy_param_1
)
{
        .reg .f32       %f<7>;
        .reg .b64       %rd<4>;


        ld.param.u64    %rd1, [_Z12tex2D_kernelPfy_param_0];
        ld.param.u64    %rd2, [_Z12tex2D_kernelPfy_param_1];
        cvta.to.global.u64      %rd3, %rd1;
        mov.f32         %f1, 0f40000000;
        mov.f32         %f2, 0f3F800000;
        tex.2d.v4.f32.f32       {%f3, %f4, %f5, %f6}, [%rd2, {%f2, %f1}];
        st.global.f32   [%rd3], %f3;
        ret;
}

        /*0008*/                   MOV R1, c[0x0][0x44];               /* 0x2800400110005de4 */
        /*0010*/                   MOV32I R4, 0x3f800000;              /* 0x18fe000000011de2 */
        /*0018*/                   MOV32I R5, 0x40000000;              /* 0x1900000000015de2 */
        /*0020*/                   TEX.NODEP.P R4, R4, 0x52, 2D, 0x1;  /* 0x80104052fc411f06 */
        /*0028*/                   MOV R2, c[0x0][0x140];              /* 0x2800400500009de4 */
        /*0030*/                   MOV R3, c[0x0][0x144];              /* 0x280040051000dde4 */
        /*0038*/                   TEXDEPBAR 0x0;                      /* 0xf000000000001de6 */
                                                                       /* 0x200000000002f047 */
        /*0048*/                   ST.E [R2], R4;                      /* 0x9400000000211c85 */
        /*0050*/                   EXIT;                               /* 0x8000000000001de7 */
        /*0058*/                   BRA 0x58;                           /* 0x4003ffffe0001de7 */
        /*0060*/                   NOP;                                /* 0x4000000000001de4 */
        /*0068*/                   NOP;                                /* 0x4000000000001de4 */
        /*0070*/                   NOP;                                /* 0x4000000000001de4 */
        /*0078*/                   NOP;                                /* 0x4000000000001de4 */

I'm not convinced by the design, this probably needs to be part of a buffer hierarchy. But let's keep it as a part of the history, for future reference.

cdsousa · 2018-03-16T10:35:12Z

Thanks for the thoughtful answer. I myself am trying to play with the ldg branch, though I'm still dealing with other unrelated issues. I'll post any results that I get.

maleadt · 2018-03-16T10:37:17Z

Great. Since I'm not too happy about the way this hooks into CuDeviceArray (I think I want to have a buffer type, with a more generic device array type on top of that), I'll reduce the scope of this PR and merge that. So you'll only be able to use this through unsafe_cached_load, but that should be fine for performance sensitive code. Or maybe I can add a ldg function for now.

maleadt added performance codegen CUDA support labels Feb 5, 2018

maleadt mentioned this pull request Feb 14, 2018

WIP: const wrapper for __ldg jrevels/MixedModeBroadcastAD.jl#18

Closed

maleadt added 16 commits March 16, 2018 11:29

Don't use a typevar for generated functions.

25176d1

Lift generic intrinsic generalization functionality from shuffle.

12d286f

Initial implementation of cached load.

e0b203e

Simplifications.

176aa8d

Fix simplifications.

b193b16

Disable recursive ldg

5e967c1

Add tests.

21b37e7

Fix Val usage.

e72ee07

Extend ldg for larger sizes.

bd198c2

Optimization: greedy selection of largest supported pointer.

faf356e

Uniform insert/extract value functionality.

e3a0582

Update TODO.

7d1ca4b

Add support for loading structs with ldg.

9784544

Add cached type param to CuDeviceArray.

d0ce798

Add CuDeviceArray wrapper for cached access.

706a2ae

Revert the changes to CuDeviceArray.

63e3df3

I'm not convinced by the design, this probably needs to be part of a buffer hierarchy. But let's keep it as a part of the history, for future reference.

maleadt force-pushed the tb/ldg branch from e759441 to 63e3df3 Compare March 16, 2018 10:30

Add and test ldg.

b07c11b

maleadt force-pushed the tb/ldg branch from 9f07fe1 to b07c11b Compare March 16, 2018 10:51

maleadt merged commit d1b00d6 into master Mar 16, 2018

maleadt deleted the tb/ldg branch March 16, 2018 11:53

maleadt mentioned this pull request May 28, 2018

WIP: Generic Matrix Multiplication JuliaGPU/GPUArrays.jl#115

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support for loading values through the texture cache (__ldg) #158

Support for loading values through the texture cache (__ldg) #158

maleadt commented Feb 5, 2018

MikeInnes commented Feb 5, 2018

vchuravy commented Feb 5, 2018

cdsousa commented Mar 13, 2018

maleadt commented Mar 13, 2018

cdsousa commented Mar 13, 2018 via email •

edited

Loading

maleadt commented Mar 16, 2018

cdsousa commented Mar 16, 2018

maleadt commented Mar 16, 2018

Support for loading values through the texture cache (__ldg) #158

Support for loading values through the texture cache (__ldg) #158

Conversation

maleadt commented Feb 5, 2018

MikeInnes commented Feb 5, 2018

vchuravy commented Feb 5, 2018

cdsousa commented Mar 13, 2018

maleadt commented Mar 13, 2018

cdsousa commented Mar 13, 2018 via email • edited Loading

maleadt commented Mar 16, 2018

cdsousa commented Mar 16, 2018

maleadt commented Mar 16, 2018

cdsousa commented Mar 13, 2018 via email •

edited

Loading