Question about the `structured-to-memref` pass #126

mfrancepillois · 2024-04-05T09:38:25Z

mfrancepillois
Apr 5, 2024

Hello,

I have a few questions about the choices made to lower tt.load and its impact on performance.
Indeed, as the documentation shows and mentions here :

tt.load (together with all of its related address calculation instructions such as tt.addptr and tt.splat) are lowered to a combination of memref.reinterpret_cast, memref.alloc, and memref.copy

If we take a closer look to the conversion pipeline, the triton-to-structured pass converts first tt.splat + tt.make_range + tt.addptr to tts.make_tptr, and then tt.load to tts.load as explained in this discussion.
This discussion also describes the "Triton-shared LoadOp (tts.load)" as follows:

tts.load represents loading from a structured pointer, with optional sub-dimensions and constant fill/other value. It always takes a pointer produced directly by tts.make_tptr. It can take optional arguments to indicate the sub-dimension memory loads. It can also take another optional scalar value to indicate what data to use to fill the rest of the tensor.

This suggests that this operation is useful for creating a sub-view from the pointer created by tts.make_tptr.
Is this correct?

If this is the case, I'm not sure why we need a memref.alloc + a memref.copy to convert this operation.
Indeed, if the purpose behind the memref.alloc + memref.copy is "only" to create a sub-view, the memref.subview operation should be sufficient since the structured-to-memref pass already converts tts.make_tptr into a memref.reinterpret_cast, which creates an unranked/ranked memref.
Maybe I missed some specific triton scenarios that require potentially expensive memref.alloc + memref.copy operations, but could you please clarify this implementation choice for me?

Thank you.
Best regards.

Answered by nhat-nguyen

Apr 19, 2024

@mfrancepillois

This suggests that this operation [tts.load] is useful for creating a sub-view from the pointer created by tts.make_tptr.
Is this correct?

If this is the case, I'm not sure why we need a memref.alloc + a memref.copy to convert this operation.

This question touches on a few design key points that I will try to address.

Our triton-to-linalg pass was originally created to help with compiling triton on accelerators. Because accelerators potentially have multiple levels of memory, tensors are first created from the main memory and distributed to many triton program instances, each operating on a slice of the main tensor at another memory level. As a result, each triton prog…

View full answer

nhat-nguyen · 2024-04-18T16:14:24Z

nhat-nguyen
Apr 18, 2024
Maintainer

@mfrancepillois Hey thank you for your question, and sorry for the late reply. I don't get notifications from the discussion so I missed this. Will get back to you with some thoughts as soon as possible.

0 replies

nhat-nguyen · 2024-04-19T18:43:51Z

nhat-nguyen
Apr 19, 2024
Maintainer

@mfrancepillois

This suggests that this operation [tts.load] is useful for creating a sub-view from the pointer created by tts.make_tptr.
Is this correct?

If this is the case, I'm not sure why we need a memref.alloc + a memref.copy to convert this operation.

This question touches on a few design key points that I will try to address.

Our triton-to-linalg pass was originally created to help with compiling triton on accelerators. Because accelerators potentially have multiple levels of memory, tensors are first created from the main memory and distributed to many triton program instances, each operating on a slice of the main tensor at another memory level. As a result, each triton program has to copy a slice of the data to its local memory. The lowering to memref as you can see now is therefore a design choice that we made for accelerators.

As for CPU, the this doesn't apply and unfortunately generates sub-optimal code in certain cases. However, by always generating a pair of alloc and copy, it helps us maintains the value semantics for cases like:

import torch
import triton
import triton.language as tl

@triton.jit
def test_mem(
 A,
 B,
):
    # [1, 1]
    t1 = tl.load(A + tl.arange(0, 2))

    # [1, 1]
    t2 = tl.load(B + tl.arange(0, 2))

    # should be [2, 2]
    t3 = t1 + t2

    # writing back to A! t1 should still have [1, 1] as its value
    # if t1 is simply a subview, this would change t1's value to [2, 2]
    tl.store(A + tl.arange(0, 2), t3)

    # should be [3, 3]
    # but if we simply let t1 be a subview of A, t1 now aliases
    # with t3 which is already [2, 2].
    t4 = t1 + t3
    tl.store(B + tl.arange(0, 2), t4)

def test():
    n_cols = 2
    x = torch.full([n_cols], 1, device="cuda", dtype=torch.float32)
    y = torch.full([n_cols], 1, device="cuda", dtype=torch.float32)
    grid = lambda meta: (n_cols,)
    test_mem[grid](x, y)
    print('x: ', x)
    print('y: ', y)
    src = triton.compiler.ASTSource(
        test_mem,
        signature="*fp32,*fp32"
    )
    ret = triton.compile(
        src,
    )
    print(ret.asm["ttir"])

test()

Running this we would expect to see x = [2, 2] and y = [3, 3]. But if we simply lower t1 to a subview, the first store back to A will overwrite the value of t1, making t4 have incorrect results. This is a rather complex and strange example however, in other cases, I think a subview is enough. We will probably need an additional pass to optimize such cases.

2 replies

sommerlukas Apr 23, 2024

Thanks for your explanation @nhat-nguyen!

I'm not sure I follow your example: t1 in your example is not the result of tts.make_tptr, which would be lowered to a subview, but it is the result of tts.load, which would still be lowered to a memref.load.

So your example would still work correctly, even if tts.make_tptr was lowered to a subview, as t1 would be loaded and still retain its correct value after storing t3 back to A.

nhat-nguyen Apr 23, 2024
Maintainer

Ah I think there's a misunderstanding here.

I interpreted the question as: if tts.make_tptr is already lowered to memref.reinterpret_cast, why do we need to lower tts.load to memref.alloc + memref.copy when we can lower it to a subview of the memref.reinterpret_cast directly? The example above illustrated that we can't sub-view results from tts.make_tptr for tts.load due to potential aliasing.

From your reply, however, specifically this point

t1 in your example is not the result of tts.make_tptr, which would be lowered to a subview, but it is the result of tts.load, which would still be lowered to a memref.load.

This matches exactly what we are doing. For every tts.load, there is a pair of memref.alloc + memref.copy. This prevents aliasing. Note that memref.load loads a single element and not a whole tensor: https://mlir.llvm.org/docs/Dialects/MemRef/#memrefload-memrefloadop, but memref.alloc + memref.copy achieves the semantic that I think you're referring to.

Please let me know if this makes sense at all and sorry for the misunderstanding! Communication is hard! 😄

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Question about the `structured-to-memref` pass #126

{{title}}

Replies: 2 comments 2 replies

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Select a reply

Question about the structured-to-memref pass #126

mfrancepillois Apr 5, 2024

Replies: 2 comments · 2 replies

nhat-nguyen Apr 18, 2024 Maintainer

nhat-nguyen Apr 19, 2024 Maintainer

sommerlukas Apr 23, 2024

nhat-nguyen Apr 23, 2024 Maintainer

Question about the `structured-to-memref` pass #126

mfrancepillois
Apr 5, 2024

Replies: 2 comments 2 replies

nhat-nguyen
Apr 18, 2024
Maintainer

nhat-nguyen
Apr 19, 2024
Maintainer

nhat-nguyen Apr 23, 2024
Maintainer