Wrong values in broadcasted computation #1074

Closed
Edenhofer opened this issue Jan 19, 2023 · 2 comments

Triton produces wrong values in a broadcasted subtraction:

#!/usr/bin/env python3

import torch

import triton
import triton.language as tl


@triton.jit
def toy_kernel(
    dist_ptr,
    coo_bounds_ptr,
    output_ptr,
    block_size: tl.constexpr,
    n_elements: tl.constexpr,
):
    pid = tl.program_id(axis=0)
    block_start = pid * block_size

    block_off = block_start + tl.arange(0, block_size)
    block_mask = block_off < n_elements
    dist = tl.load(dist_ptr + block_off, mask=block_mask)
    coo_bounds = tl.load(coo_bounds_ptr + block_off, mask=block_mask)

    # Broadcast: (1, n_elements) - (block_size, 1) -> (block_size, n_elements)
    coo_bounds = tl.reshape(coo_bounds, (1, n_elements))
    dist = tl.reshape(dist, (block_size, 1))
    nr = coo_bounds
    print(nr)
    nr -= dist
    print(nr)

    # Reduce over the broadcasted axis; one output value per row
    output = tl.sum(nr, axis=1)
    block_off = block_start + tl.arange(0, block_size)
    tl.store(output_ptr + block_off, output, mask=block_mask)


def toy(dist, coo_bounds):
    block_size = n_elements = 4
    grid = ((dist.shape[0] - 1) // block_size + 1, )
    output = torch.empty_like(dist)
    toy_kernel[grid](
        dist,
        coo_bounds,
        output,
        block_size=block_size,
        n_elements=n_elements,
        num_warps=2,
    )
    return output


dist = 1. + torch.arange(4, dtype=torch.float32, device='cuda')
coo_bounds = 2. + torch.arange(4, dtype=torch.float32, device='cuda')
print(toy(dist, coo_bounds))

# Reference implementation in torch
o = (coo_bounds.reshape(1, -1) - dist.reshape(-1, 1)).sum(axis=1)
print(o)

produces results that disagree with the torch reference:

tensor([1., 0., 0., 0.], device='cuda:0')  # triton
tensor([10.,  6.,  2., -2.], device='cuda:0')  # reference implementation
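
Each element of the reference output is sum_j coo_bounds[j] - n_elements * dist[i]; with coo_bounds = [2, 3, 4, 5] (sum 14) and dist = [1, 2, 3, 4], that gives 14 - 4 * [1, 2, 3, 4] = [10, 6, 2, -2], which matches the torch result, while the Triton kernel instead returns [1, 0, 0, 0].
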
Jokeren self-assigned this Jan 19, 2023
Jokeren (Contributor) commented Jan 19, 2023:

The problem should already be fixed on Triton master. Please verify.
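
One quick way to verify is to compare the kernel output against the torch reference directly. This is a hypothetical check, reusing the toy, dist, and coo_bounds definitions from the reproduction script above:

import torch

# Hypothetical verification snippet; assumes the reproduction script above has already run.
expected = (coo_bounds.reshape(1, -1) - dist.reshape(-1, 1)).sum(axis=1)
torch.testing.assert_close(toy(dist, coo_bounds), expected)
print("broadcasted subtraction matches the torch reference")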

Edenhofer (Author) commented:

Yup, it works when building from master. Sorry for the noise.
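
As an aside, the same broadcast can also be written with None-indexing instead of tl.reshape. The sketch below is a hypothetical, equivalent formulation of the kernel (the kernel name is made up); it is not claimed to behave any differently on the affected release:

import triton
import triton.language as tl


@triton.jit
def toy_kernel_noneidx(
    dist_ptr,
    coo_bounds_ptr,
    output_ptr,
    block_size: tl.constexpr,
    n_elements: tl.constexpr,
):
    pid = tl.program_id(axis=0)
    block_start = pid * block_size
    block_off = block_start + tl.arange(0, block_size)
    block_mask = block_off < n_elements
    dist = tl.load(dist_ptr + block_off, mask=block_mask)
    coo_bounds = tl.load(coo_bounds_ptr + block_off, mask=block_mask)

    # nr[i, j] = coo_bounds[j] - dist[i], broadcast via None-indexing
    nr = coo_bounds[None, :] - dist[:, None]
    output = tl.sum(nr, axis=1)
    tl.store(output_ptr + block_off, output, mask=block_mask)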

ZzEeKkAa pushed a commit to ZzEeKkAa/triton that referenced this issue Aug 5, 2024
…tel SPIR-V Extension (triton-lang#1074)

Related to issue triton-lang#1001.

This pass already lowers `arith::TruncFOp` and `arith::ExtFOp`, so the original suggestion of lowering to arith operators didn't make sense; instead, I have replaced most of the bit operations with calls to an Intel SPIR-V extension that translates to a MOV instruction in vISA. I couldn't remove the round-to-zero mode of `convertFp32ToBf16`, since the extension only supports round to nearest even. The code that calls `convertFp32ToBf16` uses round to nearest even by default, so that's fine.
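
For context on the two rounding modes mentioned: converting fp32 to bf16 keeps only the top 16 bits of the fp32 bit pattern, and the modes differ in how the discarded low 16 bits are rounded. A minimal sketch in plain Python (hypothetical helper names, ignoring NaN/Inf handling), not the actual pass code:

import struct

def fp32_to_bf16_trunc(x: float) -> int:
    # Round toward zero: simply drop the low 16 bits of the fp32 pattern.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return bits >> 16

def fp32_to_bf16_rne(x: float) -> int:
    # Round to nearest even: add a bias based on the bit that will become the LSB.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    lsb = (bits >> 16) & 1
    return (bits + 0x7FFF + lsb) >> 16

# Example where the two modes differ:
# 1.01171875 sits exactly between two bf16 values; truncation keeps 0x3F81
# (1.0078125) while round-to-nearest-even picks 0x3F82 (1.015625).
print(hex(fp32_to_bf16_trunc(1.01171875)), hex(fp32_to_bf16_rne(1.01171875)))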