Numerics issue with vectorized conv2d #820

Closed
newling opened this issue Oct 3, 2024 · 3 comments · Fixed by #867

Comments

@newling
Contributor

newling commented Oct 3, 2024

Enabling vectorization (see #789) for convolution results in numerical failure. The values are only slightly off, but they are definitely incorrect; this is not a floating-point rounding issue. Experiments with the input values suggest that the input image data (i.e. not the kernel data) is being read incorrectly, with an incorrect offset inside the scf.for loop (not yet confirmed).

Some things I've tried:

- Setting optimization flags in LLVM to 0 (the default is 2) has no effect.
- Inverting the scf.for loop order has no effect.
- Using different versions of peano has no effect.

This task is to find the source of the error and, if it turns out to be in peano, create a reproducer for that team.

Attached are the ll and opt.ll files (vectorized and unvectorized): ll_files.zip (they're quite small; vectorized_input.opt.ll is only 94 lines).

The MLIR IR looks basically identical except for the inner loop.

// With vectorization:

scf.for %arg1 = %c0 to %c3 step %c1 {
  scf.for %arg2 = %c0 to %c3 step %c1 {
    scf.for %arg3 = %c0 to %c4 step %c1 {
      %0 = vector.transfer_read %reinterpret_cast_21[%c0, %arg1, %arg3, %arg2, %c0], %cst {in_bounds = [true, true]} : memref<1x3x4x6x8xbf16>, vector<4x8xbf16>
      %1 = vector.transfer_read %reinterpret_cast_22[%arg1, %arg2, %arg3, %c0, %c0, %c0], %cst {in_bounds = [true, true]} : memref<3x3x4x1x8x4xbf16>, vector<8x4xbf16>
      %2 = vector.transfer_read %reinterpret_cast[%c0, %c0, %c0, %c0, %c0], %cst_20 {in_bounds = [true, true], permutation_map = affine_map<(d0, d1, d2, d3, d4) -> (d2, d4)>} : memref<1x1x4x1x4xf32>, vector<4x4xf32>
      %3 = arith.extf %0 : vector<4x8xbf16> to vector<4x8xf32>
      %4 = arith.extf %1 : vector<8x4xbf16> to vector<8x4xf32>
      %5 = vector.contract {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>, affine_map<(d0, d1, d2) -> (d2, d1)>, affine_map<(d0, d1, d2) -> (d0, d1)>], iterator_types = ["parallel", "parallel", "reduction"], kind = #vector.kind<add>} %3, %4, %2 : vector<4x8xf32>, vector<8x4xf32> into vector<4x4xf32>
      vector.transfer_write %5, %reinterpret_cast[%c0, %c0, %c0, %c0, %c0] {in_bounds = [true, true], permutation_map = affine_map<(d0, d1, d2, d3, d4) -> (d2, d4)>} : vector<4x4xf32>, memref<1x1x4x1x4xf32>
    }
  }
}

// Without vectorization:

scf.for %arg1 = %c0 to %c3 step %c1 {
  scf.for %arg2 = %c0 to %c3 step %c1 {
    scf.for %arg3 = %c0 to %c4 step %c1 {
      %subview = memref.subview %reinterpret_cast_20[%arg1, %arg2, %arg3, 0, 0, 0] [1, 1, 1, 1, 8, 4] [1, 1, 1, 1, 1, 1] : memref<3x3x4x1x8x4xbf16> to memref<8x4xbf16, strided<[4, 1], offset: ?>>
      %subview_25 = memref.subview %reinterpret_cast[0, 0, 0, 0, 0] [1, 1, 4, 1, 4] [1, 1, 1, 1, 1] : memref<1x1x4x1x4xf32> to memref<4x4xf32, strided<[4, 1]>>
      %subview_26 = memref.subview %reinterpret_cast_21[0, %arg1, %arg3, %arg2, 0] [1, 1, 1, 4, 8] [1, 1, 1, 1, 1] : memref<1x3x4x6x8xbf16> to memref<4x8xbf16, strided<[8, 1], offset: ?>>
      scf.for %arg4 = %c0 to %c4 step %c1 {
        scf.for %arg5 = %c0 to %c4 step %c1 {
          scf.for %arg6 = %c0 to %c8 step %c1 {
            %0 = memref.load %subview_26[%arg4, %arg6] : memref<4x8xbf16, strided<[8, 1], offset: ?>>
            %1 = memref.load %subview[%arg6, %arg5] : memref<8x4xbf16, strided<[4, 1], offset: ?>>
            %2 = memref.load %subview_25[%arg4, %arg5] : memref<4x4xf32, strided<[4, 1]>>
            %3 = arith.extf %0 : bf16 to f32
            %4 = arith.extf %1 : bf16 to f32
            %5 = arith.mulf %3, %4 : f32
            %6 = arith.addf %2, %5 : f32
            memref.store %6, %subview_25[%arg4, %arg5] : memref<4x4xf32, strided<[4, 1]>>
          }
        }
      }
    }
  }
}

In the vectorized case, the vector.contract gets lowered to an aievec.matmul, which in turn gets lowered to

 %38 = "xllvm.intr.aie2.bf.mac16.conf"(%33, %34, %37, %36) : (vector<32xbf16>, vector<32xbf16>, vector<8xi64>, i32) -> vector<8xi64>
@newling
Contributor Author

newling commented Oct 3, 2024

I've narrowed it down to the alignment of the loads: if the alignments are set/forced sufficiently low, then the numerics with vectorization are fine. I am a bit perplexed why the alignment needs to be as low as it does; consider the two IRs in the zip file:
opts.zip

They are basically identical, except that some loads have alignment 4 in one file and alignment 2 in the other. The case with alignment 2 gives numerically correct results; the one with alignment 4 does not. What I find confusing is that alignment 4 should surely be enough here: none of the strides in any of the loads is less than 8.
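
For concreteness, this is the kind of difference involved, shown as a minimal sketch in MLIR's llvm dialect (the attached files are textual LLVM IR; the function name, pointer name, and vector shape here are hypothetical, not copied from opts.zip):

// Illustrative only: the two IRs differ solely in the alignment annotation on loads like this.
llvm.func @vector_load(%p: !llvm.ptr) -> vector<32xbf16> {
  // alignment = 4 corresponds to the file with wrong numerics; alignment = 2 to the correct one.
  %v = llvm.load %p {alignment = 4 : i64} : !llvm.ptr -> vector<32xbf16>
  llvm.return %v : vector<32xbf16>
}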

@newling
Contributor Author

newling commented Oct 4, 2024

Running the above two IRs through Compiler Explorer:

http://xsjsda132:10240/z/6qc4cc

align 2: with_align_2.txt
align 4: with_align_4.txt

newling added a commit that referenced this issue Oct 9, 2024
This PR switches all numerical convolution tests to use the objectFifo pipeline. With respect to the new tiling strategy:

1) A single **column** is currently used. Targeting multiple columns results in `error: 'aie.memtile_dma' op could not find and assign a valid BD id`. This will be investigated as follow-up work: #821

2) There is no longer interleaving of compute and L2->L1 data movement, which means #619 becomes low priority / obsolete.

3) L3->L2 and L2->L3 still use padding, but L2->L1 and L1->L2 use packing.

4) Channel-first convolution is completely unsupported; we expect high-level transforms to convert to channel-last before reaching our backend.

5) Vectorization is not currently enabled, due to issues with alignment. See follow-up task #820. This is functionally OK for now, as peano can scalarize code for all data types.
jtuyls pushed a commit that referenced this issue Nov 5, 2024
This PR adds the necessary pass to align `transfer_read` ops. It is based on mlir-aie's aievec; see:

https://github.com/Xilinx/mlir-aie/blob/d3da586305ebc22e5ecdf1d3e682b44853436e91/lib/Dialect/AIEVec/Transforms/VectorToVectorConversions.cpp#L123

Some changes were needed for our use case, however. The main one is that the lowering in this PR skips the `vector.extract_strided_slice` operation, because our offset is not constant. That is, the offsets in
https://mlir.llvm.org/docs/Dialects/Vector/#vectorextract_strided_slice-vectorextractstridedsliceop
cannot be static integers for us, because they are determined from loop induction variables. The pass implemented here goes straight to aievec extract and shift operations, where MLIR Values are used for offsets.
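
As a minimal illustration of the restriction (the function name and vector shapes below are made up, not taken from the pass): the offsets of `vector.extract_strided_slice` are a static attribute, so a value such as an scf.for induction variable cannot be supplied.

// Sketch: offsets/sizes/strides are compile-time attributes here, which is why this
// op cannot be used when the offset comes from a loop induction variable.
func.func @static_offsets_only(%src: vector<16xbf16>) -> vector<8xbf16> {
  %slice = vector.extract_strided_slice %src
      {offsets = [2], sizes = [8], strides = [1]}
      : vector<16xbf16> to vector<8xbf16>
  return %slice : vector<8xbf16>
}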

Also included in this PR: an aievec.shift folder. I can make this a
separate PR if preferred.

This PR enables vectorization for convolution and resolves
#820
@makslevental
Collaborator

congrats @newling for tracking this all the way down
