Numerics issue with vectorized conv2d #820

Closed
newling opened this issue Oct 3, 2024 · 3 comments · Fixed by #867

Comments

@newling
Contributor

newling commented Oct 3, 2024

Enabling vectorization (see #789) for convolution results in numerical failure. The values are only slightly off, but they are definitely incorrect; this is not a floating-point rounding issue. Experiments with the input values suggest that the input image data (i.e. not the kernel data) is being read incorrectly, with an incorrect offset inside the scf.for loop (not yet confirmed).

Some things I've tried:

- Setting optimization flags in LLVM to 0 (the default is 2) has no effect.
- Inverting the scf.for loop order has no effect.
- Using different versions of peano has no effect.

This task is to find the source of the error and, if it turns out to be in peano, create a reproducer for that team.

Attached are the ll and opt.ll files (vectorized and unvectorized): ll_files.zip (they're quite small; vectorized_input.opt.ll is only 94 lines).

The MLIR IR looks basically identical except for the inner loop.

// With vectorization:

scf.for %arg1 = %c0 to %c3 step %c1 {
  scf.for %arg2 = %c0 to %c3 step %c1 {
    scf.for %arg3 = %c0 to %c4 step %c1 {
      %0 = vector.transfer_read %reinterpret_cast_21[%c0, %arg1, %arg3, %arg2, %c0], %cst {in_bounds = [true, true]} : memref<1x3x4x6x8xbf16>, vector<4x8xbf16>
      %1 = vector.transfer_read %reinterpret_cast_22[%arg1, %arg2, %arg3, %c0, %c0, %c0], %cst {in_bounds = [true, true]} : memref<3x3x4x1x8x4xbf16>, vector<8x4xbf16>
      %2 = vector.transfer_read %reinterpret_cast[%c0, %c0, %c0, %c0, %c0], %cst_20 {in_bounds = [true, true], permutation_map = affine_map<(d0, d1, d2, d3, d4) -> (d2, d4)>} : memref<1x1x4x1x4xf32>, vector<4x4xf32>
      %3 = arith.extf %0 : vector<4x8xbf16> to vector<4x8xf32>
      %4 = arith.extf %1 : vector<8x4xbf16> to vector<8x4xf32>
      %5 = vector.contract {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>, affine_map<(d0, d1, d2) -> (d2, d1)>, affine_map<(d0, d1, d2) -> (d0, d1)>], iterator_types = ["parallel", "parallel", "reduction"], kind = #vector.kind<add>} %3, %4, %2 : vector<4x8xf32>, vector<8x4xf32> into vector<4x4xf32>
      vector.transfer_write %5, %reinterpret_cast[%c0, %c0, %c0, %c0, %c0] {in_bounds = [true, true], permutation_map = affine_map<(d0, d1, d2, d3, d4) -> (d2, d4)>} : vector<4x4xf32>, memref<1x1x4x1x4xf32>
    }
  }
}

// Without vectorization:

scf.for %arg1 = %c0 to %c3 step %c1 {
  scf.for %arg2 = %c0 to %c3 step %c1 {
    scf.for %arg3 = %c0 to %c4 step %c1 {
      %subview = memref.subview %reinterpret_cast_20[%arg1, %arg2, %arg3, 0, 0, 0] [1, 1, 1, 1, 8, 4] [1, 1, 1, 1, 1, 1] : memref<3x3x4x1x8x4xbf16> to memref<8x4xbf16, strided<[4, 1], offset: ?>>
      %subview_25 = memref.subview %reinterpret_cast[0, 0, 0, 0, 0] [1, 1, 4, 1, 4] [1, 1, 1, 1, 1] : memref<1x1x4x1x4xf32> to memref<4x4xf32, strided<[4, 1]>>
      %subview_26 = memref.subview %reinterpret_cast_21[0, %arg1, %arg3, %arg2, 0] [1, 1, 1, 4, 8] [1, 1, 1, 1, 1] : memref<1x3x4x6x8xbf16> to memref<4x8xbf16, strided<[8, 1], offset: ?>>
      scf.for %arg4 = %c0 to %c4 step %c1 {
        scf.for %arg5 = %c0 to %c4 step %c1 {
          scf.for %arg6 = %c0 to %c8 step %c1 {
            %0 = memref.load %subview_26[%arg4, %arg6] : memref<4x8xbf16, strided<[8, 1], offset: ?>>
            %1 = memref.load %subview[%arg6, %arg5] : memref<8x4xbf16, strided<[4, 1], offset: ?>>
            %2 = memref.load %subview_25[%arg4, %arg5] : memref<4x4xf32, strided<[4, 1]>>
            %3 = arith.extf %0 : bf16 to f32
            %4 = arith.extf %1 : bf16 to f32
            %5 = arith.mulf %3, %4 : f32
            %6 = arith.addf %2, %5 : f32
            memref.store %6, %subview_25[%arg4, %arg5] : memref<4x4xf32, strided<[4, 1]>>
          }
        }
      }
    }
  }
}

In the vectorized case, the vector.contract gets lowered to an aievec.matmul, which in turn gets lowered to

 %38 = "xllvm.intr.aie2.bf.mac16.conf"(%33, %34, %37, %36) : (vector<32xbf16>, vector<32xbf16>, vector<8xi64>, i32) -> vector<8xi64>
@newling
Contributor Author

newling commented Oct 3, 2024

I've narrowed it down to the alignment of the loads: if the alignments are set/forced sufficiently low, then the numerics with vectorization are fine. I am a bit perplexed why the alignment needs to be as low as it does; consider the two IRs in the zip file:
opts.zip

They are basically identical, except that some loads have alignment 4 in one file and alignment 2 in the other. The case with alignment 2 gives numerically correct results; the one with alignment 4 does not. What I find confusing is that alignment 4 should surely be enough here: none of the strides in any of the loads is less than 8.
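
For concreteness, this is the kind of difference involved, shown as a minimal sketch in MLIR's llvm dialect (the attached files are textual LLVM IR; the function name, pointer name, and vector shape here are hypothetical, not copied from opts.zip):

// Illustrative only: the two IRs differ solely in the alignment annotation on loads like this.
llvm.func @vector_load(%p: !llvm.ptr) -> vector<32xbf16> {
  // alignment = 4 corresponds to the file with wrong numerics; alignment = 2 to the correct one.
  %v = llvm.load %p {alignment = 4 : i64} : !llvm.ptr -> vector<32xbf16>
  llvm.return %v : vector<32xbf16>
}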

@newling
Contributor Author

newling commented Oct 4, 2024

Running the above two IRs through Compiler Explorer:

http://xsjsda132:10240/z/6qc4cc

align 2: with_align_2.txt
align 4: with_align_4.txt

newling added a commit that referenced this issue Oct 9, 2024
This PR switches all numerical convolution tests to use the objectFifo pipeline. With respect to the new tiling strategy:

1) A single **column** is currently used. Targeting multiple columns results in `error: 'aie.memtile_dma' op could not find and assign a valid BD id`. This will be investigated as follow-up work: #821

2) There is no longer interleaving of compute and L2->L1 data movement, which means #619 becomes low priority / obsolete.

3) L3->L2 and L2->L3 still use padding, but L2->L1 and L1->L2 use packing.

4) Channel-first convolution is completely unsupported; we expect high-level transforms to convert to channel-last before reaching our backend.

5) Vectorization is not currently enabled, due to issues with alignment. See follow-up task #820. This is functionally OK for now, as peano can scalarize code for all data types.
jtuyls pushed a commit that referenced this issue Nov 5, 2024
This PR adds the necessary pass to align `transfer_read` ops. It is based on mlir-aie's aievec; see:

https://github.com/Xilinx/mlir-aie/blob/d3da586305ebc22e5ecdf1d3e682b44853436e91/lib/Dialect/AIEVec/Transforms/VectorToVectorConversions.cpp#L123

Some changes were needed for our use case, however. The main one is that the lowering in this PR skips the `vector.extract_strided_slice` operation, because our offset is not constant. That is, the offsets in
https://mlir.llvm.org/docs/Dialects/Vector/#vectorextract_strided_slice-vectorextractstridedsliceop
cannot be static integers for us, because they are determined from loop induction variables. The pass implemented here goes straight to aievec extract and shift operations, where MLIR Values are used for offsets.
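
As a minimal illustration of the restriction (the function name and vector shapes below are made up, not taken from the pass): the offsets of `vector.extract_strided_slice` are a static attribute, so a value such as an scf.for induction variable cannot be supplied.

// Sketch: offsets/sizes/strides are compile-time attributes here, which is why this
// op cannot be used when the offset comes from a loop induction variable.
func.func @static_offsets_only(%src: vector<16xbf16>) -> vector<8xbf16> {
  %slice = vector.extract_strided_slice %src
      {offsets = [2], sizes = [8], strides = [1]}
      : vector<16xbf16> to vector<8xbf16>
  return %slice : vector<8xbf16>
}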

Also included in this PR: an aievec.shift folder. I can make this a
separate PR if preferred.

This PR enables vectorization for convolution and resolves
#820
@makslevental
Collaborator

congrats @newling for tracking this all the way down
