
fn decode_coefs is slow #1180

Open
fbossen opened this issue Jun 6, 2024 · 3 comments · May be fixed by #1231
@fbossen
Collaborator

fbossen commented Jun 6, 2024

fn decode_coefs has been observed to be about 40% slower than its C equivalent. The main contributor to the speed difference appears to be the numerous bounds checks that Rust performs. For example, the bounds checks in get_lo_ctx when accessing levels may be superfluous, as the maximum offset into levels is 2 * stride and we know that an additional 2 * stride values have been cleared in levels beforehand: levels[..(stride * (4 * sw as isize + 2)) as usize].fill(0); (note the + 2).
It may be sufficient to check that y < 4 * sh (note that stride = 4 * sh) and x < 4 * sw, and to forgo all other bounds checks inside the while i > 0 loop.
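The idea above can be sketched as follows. This is a hypothetical simplification, not the actual rav1d code: one explicit slice covering the worst-case offset lets LLVM prove that every later fixed-offset access is in bounds, so no per-access checks remain.

```rust
// Hypothetical sketch (names like `levels` and `stride` follow the issue
// text, not the real `get_lo_ctx` signature).
fn sum_neighbors(levels: &[u8], stride: usize, x: usize, y: usize) -> u32 {
    // One explicit check covering the worst-case offset (2 * stride)...
    let window = &levels[y * stride + x..][..2 * stride + 1];
    // ...after which these indexings need no further bounds checks.
    window[0] as u32 + window[stride] as u32 + window[2 * stride] as u32
}
```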

@kkysen
Collaborator

kkysen commented Jun 6, 2024

Okay let me look into bounds checks. I like eliminating them.

@kkysen kkysen self-assigned this Jun 16, 2024
kkysen added a commit that referenced this issue Jun 24, 2024
…-bounds (#1232)

* Part of #1180.

It also removes some checks from `fn decode_b` and its callees, which is
also a hot spot.
kkysen added a commit that referenced this issue Jun 24, 2024
…n values (#1233)

* Part of #1180.

This adds docs to the `mod msac` `fn`s on the range of their args and
return values, and for return values, it ensures this by masking them.
This eliminates a bunch of bounds checks in callers, like `fn
decode_coefs`.

I also included a couple of other small commits in here; didn't really
want to put them all in their own PRs.
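The masking idea can be illustrated with a toy version (the real mod msac internals are more involved): clamping the decoded value with a mask makes its range visible to the optimizer, so an indexing caller needs no bounds check.

```rust
// Hypothetical sketch: `decode_symbol_bounded` and `lookup` are illustrative
// names, not the actual msac API.
fn decode_symbol_bounded(raw: u32) -> usize {
    // Assume the symbol is documented to fit in 4 values; the mask makes
    // that knowledge visible to LLVM at essentially no cost.
    (raw & 0x3) as usize
}

fn lookup(table: &[u8; 4], raw: u32) -> u8 {
    // The masked value is 0..=3, so this index is provably in bounds.
    table[decode_symbol_bounded(raw)]
}
```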
kkysen added a commit that referenced this issue Jun 24, 2024
…more places (#1234)

* Part of #1180.

This removes a bunch of casts and also moves checks to where the
`Rav1dFilterMode`s are first calculated, and where it is deducible that
no checks are needed.
kkysen added a commit that referenced this issue Jun 24, 2024
* Part of #1180.

Optimize and remove all of the bounds checks by checking `dir.len()`
directly instead of `tx`.
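The pattern can be sketched like this (the real code indexes per-direction context arrays; the names here are illustrative): a single up-front check on `dir.len()` carries the bound that every later `dir[tx + k]` access needs, instead of LLVM re-deriving it from `tx` at each access.

```rust
// Hypothetical sketch of checking the slice length once.
fn sum_ctx(dir: &[u8], tx: usize) -> u32 {
    assert!(dir.len() >= tx + 3); // one up-front check
    // These three accesses compile with no further bounds checks.
    dir[tx] as u32 + dir[tx + 1] as u32 + dir[tx + 2] as u32
}
```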
kkysen added a commit that referenced this issue Jun 24, 2024
…generic range (#1236)

* Part of #1180.

Use it, for now, to optimize `struct SegmentId` (already similarly
optimized, but not generically) and `fn get_skip_ctx`'s return value.

Note that we need this for `fn get_skip_ctx` because it is not inlined,
and thus the bounds checks using `sctx` aren't removed in `fn
decode_coefs`. It is not inlined because `fn get_skip_ctx` is not
`BitDepth`-generic, but `fn decode_coefs` is. I think `dav1d` still had
`fn get_skip_ctx` inlined because it is manually duplicated with the
templating system. But if LLVM thinks it is better not to inline this
even with an `#[inline]` or even an `#[inline(always)]`, and we're still
able to propagate information through the return type, I'm inclined to
leave it un-inlined.
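The "propagate information through the return type" idea can be sketched with a newtype whose constructor masks the value, so the range invariant crosses the non-inlined call boundary via the type rather than via inlining. This is a hypothetical illustration, not the actual `SegmentId` or `get_skip_ctx` code.

```rust
// Hypothetical bounded newtype; invariant: value is always 0..16.
#[derive(Clone, Copy)]
struct SkipCtx(u8);

impl SkipCtx {
    fn new(raw: u8) -> Self {
        SkipCtx(raw & 0xf) // masking enforces the documented range
    }
    fn get(self) -> usize {
        (self.0 & 0xf) as usize // re-masking is free and re-proves the bound
    }
}

fn use_ctx(cdf: &[u16; 16], ctx: SkipCtx) -> u16 {
    cdf[ctx.get()] // provably in bounds even if `new` was not inlined
}
```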
kkysen added a commit that referenced this issue Jun 24, 2024
* Part of #1180.

This makes `TxfmSize` and `RectTxfmSize` into a real `enum`, which
eliminates many bounds checks in hot `fn`s like `fn decode_coefs`. It
also lets us further simplify the `itx` macros and eliminate the
distinction between the `R`ect ones.

Note that `TxfmSize` previously defined the square sizes; those are now named
with an `S` prefix, while the `R`-prefixed `RectTxfmSize` names stay `R`-prefixed.

There are a few new places where we have to create `TxfmSize`s using
`from_repr` and thus do a bounds check. These should be pretty cheap,
but they should be optimizable out, as they ultimately come from
`static`/`const` data whose bounds are known but LLVM is not able to
deduce them. This is why I added 7f0cdc2, starting to try to fix this,
though it's a little complicated, so I didn't want to finish it in this
PR.
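The enum approach can be sketched as below (variant names and the hand-rolled `from_repr` are illustrative; rav1d's actual `TxfmSize` has more variants): once the size is a real `enum`, `as usize` is known to be within the variant count, so lookup tables indexed by it need no bounds checks, and `from_repr` is the one place a range check is still paid.

```rust
// Hypothetical miniature of the `TxfmSize` enum.
#[derive(Clone, Copy)]
enum TxSize {
    S4x4 = 0,
    S8x8 = 1,
    S16x16 = 2,
}

impl TxSize {
    // Hand-rolled `from_repr`: the single remaining range check.
    fn from_repr(v: usize) -> Option<Self> {
        match v {
            0 => Some(TxSize::S4x4),
            1 => Some(TxSize::S8x8),
            2 => Some(TxSize::S16x16),
            _ => None,
        }
    }
}

fn width(tx: TxSize) -> u32 {
    const W: [u32; 3] = [4, 8, 16];
    W[tx as usize] // enum repr is 0..=2, so no bounds check is emitted
}
```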
kkysen added a commit that referenced this issue Jun 27, 2024
* Part of #1180.

This optimizes `fn get_lo_ctx`.

9d00ae6 optimizes it enough that it is now inlined (it was already
`#[inline]`), which enables other optimizations.

a5c6209 makes `stride` a `u8`, which it is in the caller `fn
decode_coefs`, and which tells LLVM that `_ * stride` can't overflow,
and thus `2 * stride` being in bounds means that `0 * stride` is in
bounds, for example.

242065b then pre-indexes `levels` so that there is only one bounds check
per branch. There is still actually another since LLVM does not know
that `stride != 0`. This can be optimized out with more effort in `fn
decode_coefs`, as `stride` is ultimately derived from `t_dim.h`, which
comes from `static`/`const` data.
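The `u8` trick and the pre-indexing can be sketched together (again a hypothetical simplification of `get_lo_ctx`): widening a `u8` stride at the use site tells LLVM that `k * stride` cannot overflow `usize`, so proving `2 * stride` in bounds also proves the smaller multiples in bounds, and one up-front slice leaves a single check per branch.

```rust
// Hypothetical sketch of the `u8` stride + pre-indexed window pattern.
fn lo_ctx_sum(levels: &[u8], stride: u8) -> u32 {
    let stride = usize::from(stride); // widening: multiplies cannot overflow
    // Pre-index once: a single bounds check covers the whole window.
    let w = &levels[..2 * stride + 1];
    w[0] as u32 + w[stride] as u32 + w[2 * stride] as u32
}
```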
kkysen added a commit that referenced this issue Jun 27, 2024
* Part of #1180.

Move the `tok != 0` check before the `*= 0x17ff41` so that `cmov` is
used, not a mispredicted branch.
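The shape of that rewrite can be sketched as follows (a hypothetical illustration; the surrounding `decode_coefs` context is omitted): computing the product unconditionally and then selecting on `tok != 0` compiles to a `cmov` rather than a data-dependent branch the predictor can miss.

```rust
// Hypothetical sketch; `0x17ff41` is the constant from the commit message.
fn scale_tok(tok: u32) -> u32 {
    // Evaluated unconditionally; the select below lowers to a cmov.
    let scaled = tok.wrapping_mul(0x17ff41);
    if tok != 0 { scaled } else { 0 }
}
```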
kkysen added a commit that referenced this issue Jul 1, 2024
* Part of #1180.

I forgot to open this PR earlier. This optimizes some `CfSelect` uses
and makes its offset, as well as the offsets in
`Rav1dTileState_frame_thread` `u32`s instead of `usize`s, saving space,
prohibiting overflows on multiplications, and reducing stack usage a
bit.
kkysen added a commit that referenced this issue Jul 1, 2024
…ng it, since the derive may panic (#1247)

* Part of #1180.

It is more optimal to `.unwrap_or_default()`, which is a `cmov`, rather
than branching to a panic message, which bloats the code a lot.

This reduces stack usage in `fn decode_coefs` by 64 bytes, from 296 to
232.
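The difference can be sketched with a toy conversion (standing in for the derived `TryFrom` the PR talks about): `.unwrap()` emits a branch to panic machinery and drags the panic message into the function, while `.unwrap_or_default()` is a plain select with no panic path.

```rust
// Hypothetical stand-in for the fallible conversion in the PR.
fn to_small(v: u32) -> u8 {
    // No panic path: out-of-range values fall back to the default (0).
    u8::try_from(v).unwrap_or_default()
}
```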
kkysen added a commit that referenced this issue Jul 1, 2024
…icing once in `fn decode_coefs` (#1249)

* Part of #1180.

For the `CfSelect::Task` case, no slicing is done; the array is just
used as is. For the `CfSelect::Frame` case, the `DisjointMut` is sliced
to get a `CF_LEN` array. Then `fn get` and `fn set` can directly index
into this array with no bounds checks and no `match`.
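The single-slice pattern can be sketched like this (`CF_LEN` and the selector are simplified; the real code goes through `DisjointMut`): resolve the `match` and the slicing once, then hand out a fixed-size array reference that later accesses can index without bounds checks or re-matching.

```rust
// Hypothetical miniature of the `CfSelect` resolution.
const CF_LEN: usize = 32 * 32;

enum CfSelect<'a> {
    Task(&'a mut [i32; CF_LEN]),
    Frame(&'a mut [i32]), // frame-wide buffer, sliced below
}

fn select_cf<'a>(sel: CfSelect<'a>, offset: usize) -> &'a mut [i32; CF_LEN] {
    match sel {
        // Task case: the array is used as is, no slicing at all.
        CfSelect::Task(arr) => arr,
        // Frame case: one slice + one array conversion up front;
        // every later index into the result is check-free.
        CfSelect::Frame(buf) => (&mut buf[offset..offset + CF_LEN])
            .try_into()
            .unwrap(),
    }
}
```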
@kkysen
Collaborator

kkysen commented Jul 7, 2024

@fbossen, how slow is fn decode_coefs still?

@fbossen
Collaborator Author

fbossen commented Jul 8, 2024

@fbossen, how slow is fn decode_coefs still?

I am measuring about 5% slower than the C code on average, which is a lot better than it used to be.
