
fn decode_coefs is slow #1180

Open
fbossen opened this issue Jun 6, 2024 · 3 comments · May be fixed by #1231
@fbossen
Collaborator

fbossen commented Jun 6, 2024

fn decode_coefs has been observed to be about 40% slower than its C equivalent. The main contributor to the speed difference appears to be the numerous bounds checks that Rust performs. For example, the bounds checks in get_lo_ctx when accessing levels may be superfluous, as the maximum offset into levels is 2 * stride and we know that an additional 2 * stride values have been cleared in levels beforehand: levels[..(stride * (4 * sw as isize + 2)) as usize].fill(0); (note the + 2).
It may be sufficient to check that y < 4 * sh (note that stride = 4 * sh) and x < 4 * sw, and to forgo all other bounds checks inside the while i > 0 loop.
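The idea above can be sketched as follows. This is a hypothetical simplification, not the actual rav1d code: one explicit slice covering the worst-case offset lets LLVM prove that every later fixed-offset access is in bounds, so no per-access checks remain.

```rust
// Hypothetical sketch (names like `levels` and `stride` follow the issue
// text, not the real `get_lo_ctx` signature).
fn sum_neighbors(levels: &[u8], stride: usize, x: usize, y: usize) -> u32 {
    // One explicit check covering the worst-case offset (2 * stride)...
    let window = &levels[y * stride + x..][..2 * stride + 1];
    // ...after which these indexings need no further bounds checks.
    window[0] as u32 + window[stride] as u32 + window[2 * stride] as u32
}
```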

@kkysen
Collaborator

kkysen commented Jun 6, 2024

Okay let me look into bounds checks. I like eliminating them.

@kkysen kkysen self-assigned this Jun 16, 2024
kkysen added a commit that referenced this issue Jun 24, 2024
…-bounds (#1232)

* Part of #1180.

It also removes some checks from `fn decode_b` and its callees, which is
also a hot spot.
kkysen added a commit that referenced this issue Jun 24, 2024
…n values (#1233)

* Part of #1180.

This adds docs to the `mod msac` `fn`s on the range of their args and
return values, and for return values, it ensures this by masking them.
This eliminates a bunch of bounds checks in callers, like `fn
decode_coefs`.

I also included a couple of other small commits in here; didn't really
want to put them all in their own PRs.
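The masking idea can be illustrated with a toy version (the real mod msac internals are more involved): clamping the decoded value with a mask makes its range visible to the optimizer, so an indexing caller needs no bounds check.

```rust
// Hypothetical sketch: `decode_symbol_bounded` and `lookup` are illustrative
// names, not the actual msac API.
fn decode_symbol_bounded(raw: u32) -> usize {
    // Assume the symbol is documented to fit in 4 values; the mask makes
    // that knowledge visible to LLVM at essentially no cost.
    (raw & 0x3) as usize
}

fn lookup(table: &[u8; 4], raw: u32) -> u8 {
    // The masked value is 0..=3, so this index is provably in bounds.
    table[decode_symbol_bounded(raw)]
}
```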
kkysen added a commit that referenced this issue Jun 24, 2024
…more places (#1234)

* Part of #1180.

This removes a bunch of casts and also moves checks to where the
`Rav1dFilterMode`s are first calculated, and where it is deducible that
no checks are needed.
kkysen added a commit that referenced this issue Jun 24, 2024
* Part of #1180.

Optimize and remove all of the bounds checks by checking `dir.len()`
directly instead of `tx`.
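The pattern can be sketched like this (the real code indexes per-direction context arrays; the names here are illustrative): a single up-front check on `dir.len()` carries the bound that every later `dir[tx + k]` access needs, instead of LLVM re-deriving it from `tx` at each access.

```rust
// Hypothetical sketch of checking the slice length once.
fn sum_ctx(dir: &[u8], tx: usize) -> u32 {
    assert!(dir.len() >= tx + 3); // one up-front check
    // These three accesses compile with no further bounds checks.
    dir[tx] as u32 + dir[tx + 1] as u32 + dir[tx + 2] as u32
}
```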
kkysen added a commit that referenced this issue Jun 24, 2024
…generic range (#1236)

* Part of #1180.

Use it, for now, to optimize `struct SegmentId` (already similarly
optimized, but not generically) and `fn get_skip_ctx`'s return value.

Note that we need this for `fn get_skip_ctx` because it is not inlined,
and thus the bounds checks using `sctx` aren't removed in `fn
decode_coefs`. It is not inlined because `fn get_skip_ctx` is not
`BitDepth`-generic, but `fn decode_coefs` is. I think `dav1d` still had
`fn get_skip_ctx` inlined because it is manually duplicated with the
templating system. But if LLVM thinks it is better not to inline this
even with an `#[inline]` or even an `#[inline(always)]`, and we're still
able to propagate information through the return type, I'm inclined to
leave it un-inlined.
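The "propagate information through the return type" idea can be sketched with a newtype whose constructor masks the value, so the range invariant crosses the non-inlined call boundary via the type rather than via inlining. This is a hypothetical illustration, not the actual `SegmentId` or `get_skip_ctx` code.

```rust
// Hypothetical bounded newtype; invariant: value is always 0..16.
#[derive(Clone, Copy)]
struct SkipCtx(u8);

impl SkipCtx {
    fn new(raw: u8) -> Self {
        SkipCtx(raw & 0xf) // masking enforces the documented range
    }
    fn get(self) -> usize {
        (self.0 & 0xf) as usize // re-masking is free and re-proves the bound
    }
}

fn use_ctx(cdf: &[u16; 16], ctx: SkipCtx) -> u16 {
    cdf[ctx.get()] // provably in bounds even if `new` was not inlined
}
```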
kkysen added a commit that referenced this issue Jun 24, 2024
* Part of #1180.

This makes `TxfmSize` and `RectTxfmSize` into a real `enum`, which
eliminates many bounds checks in hot `fn`s like `fn decode_coefs`. It
also lets us further simplify the `itx` macros and eliminate the
distinction between the `R`ect ones.

Note that `TxfmSize` previously defined the square sizes; those are now named
with an `S` prefix, while the `R`-prefixed `RectTxfmSize` names stay `R`-prefixed.

There are a few new places where we have to create `TxfmSize`s using
`from_repr` and thus do a bounds check. These should be pretty cheap,
but they should be optimizable out, as they ultimately come from
`static`/`const` data whose bounds are known but LLVM is not able to
deduce them. This is why I added 7f0cdc2, starting to try to fix this,
though it's a little complicated, so I didn't want to finish it in this
PR.
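The enum approach can be sketched as below (variant names and the hand-rolled `from_repr` are illustrative; rav1d's actual `TxfmSize` has more variants): once the size is a real `enum`, `as usize` is known to be within the variant count, so lookup tables indexed by it need no bounds checks, and `from_repr` is the one place a range check is still paid.

```rust
// Hypothetical miniature of the `TxfmSize` enum.
#[derive(Clone, Copy)]
enum TxSize {
    S4x4 = 0,
    S8x8 = 1,
    S16x16 = 2,
}

impl TxSize {
    // Hand-rolled `from_repr`: the single remaining range check.
    fn from_repr(v: usize) -> Option<Self> {
        match v {
            0 => Some(TxSize::S4x4),
            1 => Some(TxSize::S8x8),
            2 => Some(TxSize::S16x16),
            _ => None,
        }
    }
}

fn width(tx: TxSize) -> u32 {
    const W: [u32; 3] = [4, 8, 16];
    W[tx as usize] // enum repr is 0..=2, so no bounds check is emitted
}
```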
kkysen added a commit that referenced this issue Jun 27, 2024
* Part of #1180.

This optimizes `fn get_lo_ctx`.

9d00ae6 optimizes it enough that it is now inlined (it was already
`#[inline]`), which enables other optimizations.

a5c6209 makes `stride` a `u8`, which it is in the caller `fn
decode_coefs`, and which tells LLVM that `_ * stride` can't overflow,
and thus `2 * stride` being in bounds means that `0 * stride` is in
bounds, for example.

242065b then pre-indexes `levels` so that there is only one bounds check
per branch. There is still actually another since LLVM does not know
that `stride != 0`. This can be optimized out with more effort in `fn
decode_coefs`, as `stride` is ultimately derived from `t_dim.h`, which
comes from `static`/`const` data.
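The `u8` trick and the pre-indexing can be sketched together (again a hypothetical simplification of `get_lo_ctx`): widening a `u8` stride at the use site tells LLVM that `k * stride` cannot overflow `usize`, so proving `2 * stride` in bounds also proves the smaller multiples in bounds, and one up-front slice leaves a single check per branch.

```rust
// Hypothetical sketch of the `u8` stride + pre-indexed window pattern.
fn lo_ctx_sum(levels: &[u8], stride: u8) -> u32 {
    let stride = usize::from(stride); // widening: multiplies cannot overflow
    // Pre-index once: a single bounds check covers the whole window.
    let w = &levels[..2 * stride + 1];
    w[0] as u32 + w[stride] as u32 + w[2 * stride] as u32
}
```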
kkysen added a commit that referenced this issue Jun 27, 2024
* Part of #1180.

Move the `tok != 0` check before the `*= 0x17ff41` so that `cmov` is
used, not a mispredicted branch.
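The shape of that rewrite can be sketched as follows (a hypothetical illustration; the surrounding `decode_coefs` context is omitted): computing the product unconditionally and then selecting on `tok != 0` compiles to a `cmov` rather than a data-dependent branch the predictor can miss.

```rust
// Hypothetical sketch; `0x17ff41` is the constant from the commit message.
fn scale_tok(tok: u32) -> u32 {
    // Evaluated unconditionally; the select below lowers to a cmov.
    let scaled = tok.wrapping_mul(0x17ff41);
    if tok != 0 { scaled } else { 0 }
}
```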
kkysen added a commit that referenced this issue Jul 1, 2024
* Part of #1180.

I forgot to open this PR earlier. This optimizes some `CfSelect` uses
and makes its offset, as well as the offsets in
`Rav1dTileState_frame_thread` `u32`s instead of `usize`s, saving space,
prohibiting overflows on multiplications, and reducing stack usage a
bit.
kkysen added a commit that referenced this issue Jul 1, 2024
…ng it, since the derive may panic (#1247)

* Part of #1180.

It is more optimal to `.unwrap_or_default()`, which is a `cmov`, rather
than branching to a panic message, which bloats the code a lot.

This reduces stack usage in `fn decode_coefs` by 64 bytes, from 296 to
232.
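The difference can be sketched with a toy conversion (standing in for the derived `TryFrom` the PR talks about): `.unwrap()` emits a branch to panic machinery and drags the panic message into the function, while `.unwrap_or_default()` is a plain select with no panic path.

```rust
// Hypothetical stand-in for the fallible conversion in the PR.
fn to_small(v: u32) -> u8 {
    // No panic path: out-of-range values fall back to the default (0).
    u8::try_from(v).unwrap_or_default()
}
```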
kkysen added a commit that referenced this issue Jul 1, 2024
…icing once in `fn decode_coefs` (#1249)

* Part of #1180.

For the `CfSelect::Task` case, no slicing is done; the array is just
used as is. For the `CfSelect::Frame` case, the `DisjointMut` is sliced
to get a `CF_LEN` array. Then `fn get` and `fn set` can directly index
into this array with no bounds checks and no `match`.
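The single-slice pattern can be sketched like this (`CF_LEN` and the selector are simplified; the real code goes through `DisjointMut`): resolve the `match` and the slicing once, then hand out a fixed-size array reference that later accesses can index without bounds checks or re-matching.

```rust
// Hypothetical miniature of the `CfSelect` resolution.
const CF_LEN: usize = 32 * 32;

enum CfSelect<'a> {
    Task(&'a mut [i32; CF_LEN]),
    Frame(&'a mut [i32]), // frame-wide buffer, sliced below
}

fn select_cf<'a>(sel: CfSelect<'a>, offset: usize) -> &'a mut [i32; CF_LEN] {
    match sel {
        // Task case: the array is used as is, no slicing at all.
        CfSelect::Task(arr) => arr,
        // Frame case: one slice + one array conversion up front;
        // every later index into the result is check-free.
        CfSelect::Frame(buf) => (&mut buf[offset..offset + CF_LEN])
            .try_into()
            .unwrap(),
    }
}
```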
@kkysen
Collaborator

kkysen commented Jul 7, 2024

@fbossen, how slow is fn decode_coefs still?

@fbossen
Collaborator Author

fbossen commented Jul 8, 2024

@fbossen, how slow is fn decode_coefs still?

I am measuring about 5% slower than the C code on average, which is a lot better than it used to be.
