fn decode_coefs is slow #1180

`fn decode_coefs` has been observed to be about 40% slower than its C equivalent. The main contributor to the speed difference appears to be the numerous bounds checks that Rust is performing. For example, the bounds checks in `get_lo_ctx` when accessing `levels` may be superfluous, as the maximum offset into `levels` is `2 * stride` and we know that an additional `2 * stride` values have been cleared in `levels` beforehand: `levels[..(stride * (4 * sw as isize + 2)) as usize].fill(0);` (note the `+ 2`). It may be sufficient to check for `y < 4 * sh` (note that `stride = 4 * sh`) and `x < 4 * sw` and forgo all other bounds checks inside the `while i > 0` loop.
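A minimal sketch of the hoisting the issue proposes, assuming the invariants stated above (`stride = 4 * sh`, a buffer of `stride * (4 * sw + 2)` entries with the trailing `2 * stride` cleared). All names and the three-tap sum are illustrative, not rav1d's actual code, and whether LLVM actually elides the interior checks given the asserts depends on the optimizer:

```rust
// Sketch: two hoisted range checks cover every access in the body.
fn lo_ctx_hoisted(levels: &[u8], stride: usize, sw: usize, sh: usize, x: usize, y: usize) -> u32 {
    assert_eq!(stride, 4 * sh);
    assert_eq!(levels.len(), stride * (4 * sw + 2));
    // The two checks proposed above. Given them, the largest offset is
    // (4*sw - 1)*stride + (stride - 1) + 2*stride = stride*(4*sw + 2) - 1,
    // i.e. the last element, so every access below is in bounds.
    assert!(x < 4 * sw && y < 4 * sh);
    let base = x * stride + y;
    levels[base] as u32 + levels[base + stride] as u32 + levels[base + 2 * stride] as u32
}
```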
Okay, let me look into bounds checks. I like eliminating them.
kkysen added a commit that referenced this issue on Jun 24, 2024:
…n values (#1233) * Part of #1180. This adds docs to the `mod msac` `fn`s on the range of their args and return values, and for return values, it ensures this by masking them. This eliminates a bunch of bounds checks in callers, like `fn decode_coefs`. I also included a couple of other small commits in here; didn't really want to put them all in their own PRs.
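The masking trick described here is, in essence, the following pattern. This is a simplified sketch; the function names and the 0..16 range are illustrative, not the real `mod msac` API:

```rust
// Sketch: document a return value's range and enforce it with a mask, so the
// documented range becomes a fact LLVM can use at every call site.
fn read_symbol(raw: u32) -> usize {
    // Documented range: 0..16; the mask guarantees it even if `raw` is wider.
    (raw & 0xf) as usize
}

fn lookup(table: &[u8; 16], raw: u32) -> u8 {
    // No bounds check emitted once `read_symbol` is inlined: index < 16.
    table[read_symbol(raw)]
}
```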
kkysen added a commit that referenced this issue on Jun 24, 2024:
* Part of #1180. Optimize and remove all of the bounds checks by checking `dir.len()` directly instead of `tx`.
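The general shape of this optimization, as a hedged sketch (the `dir`/`tx` relationship is simplified here; the real code derives its loop bound from `tx`):

```rust
// Sketch: checking the slice's own length once, instead of a value LLVM
// cannot relate to that length, lets it drop the per-element checks.
fn sum_prefix(dir: &[u8], n: usize) -> u32 {
    assert!(n <= dir.len()); // single check against dir.len() itself
    let mut total = 0;
    for i in 0..n {
        total += dir[i] as u32; // elided: i < n <= dir.len()
    }
    total
}
```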
kkysen added a commit that referenced this issue on Jun 24, 2024:
…generic range (#1236) * Part of #1180. Use it, for now, to optimize `struct SegmentId` (already similarly optimized, but not generically) and `fn get_skip_ctx`'s return value. Note that we need this for `fn get_skip_ctx` because it is not inlined, and thus the bounds checks using `sctx` aren't removed in `fn decode_coefs`. It is not inlined because `fn get_skip_ctx` is not `BitDepth`-generic, but `fn decode_coefs` is. I think `dav1d` still had `fn get_skip_ctx` inlined because it is manually duplicated with the templating system. But if LLVM thinks it is better not to inline this even with an `#[inline]` or even an `#[inline(always)]`, and we're still able to propagate information through the return type, I'm inclined to leave it un-inlined.
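A generic range-restricted type of the kind described might look roughly like this. This is a sketch under the assumption that the invariant is established at construction; rav1d's actual type and names may differ:

```rust
// Sketch: a wrapper whose value is provably < MAX, so the invariant survives
// a non-inlined call boundary via the type rather than via inlining.
#[derive(Clone, Copy)]
struct InRange<const MAX: u8>(u8);

impl<const MAX: u8> InRange<MAX> {
    fn new(v: u8) -> Self {
        assert!(v < MAX); // establish the invariant once
        Self(v)
    }

    fn get(self) -> usize {
        // Restate the invariant so LLVM sees it at every use site.
        if self.0 >= MAX {
            unreachable!()
        }
        self.0 as usize
    }
}

// Illustrative caller: the bound 13 is made up for this example.
fn skip_ctx_lookup(table: &[u8; 13], sctx: InRange<13>) -> u8 {
    table[sctx.get()] // no bounds check: get() < 13
}
```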
kkysen added a commit that referenced this issue on Jun 24, 2024:
* Part of #1180. This makes `TxfmSize` and `RectTxfmSize` into a real `enum`, which eliminates many bounds checks in hot `fn`s like `fn decode_coefs`. It also lets us further simplify the `itx` macros and eliminate the distinction between the `R`ect ones. Note that `TxfmSize` defined the square sizes; those are now named with an `S` prefix, while the `R`-prefixed `RectTxfmSize` ones stay `R`-prefixed. There are a few new places where we have to create `TxfmSize`s using `from_repr` and thus do a bounds check. These should be pretty cheap, but they are optimizable out, as they ultimately come from `static`/`const` data whose bounds are known but which LLVM is not able to deduce. This is why I added 7f0cdc2, starting to try to fix this, though it's a little complicated, so I didn't want to finish it in this PR.
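The core of the `enum` change is that a discriminant with a statically known range indexes tables without runtime checks; a simplified sketch (the variants and table below are illustrative, not the full set of transform sizes):

```rust
// Sketch: with a real enum, `tx as usize` is statically known to be < 3,
// so the table index needs no bounds check. A plain integer would need one.
#[derive(Clone, Copy)]
#[repr(u8)]
enum SquareTx {
    S4x4 = 0,
    S8x8 = 1,
    S16x16 = 2,
}

const SIDE_LEN: [u8; 3] = [4, 8, 16];

fn side_len(tx: SquareTx) -> u8 {
    SIDE_LEN[tx as usize] // check-free: discriminant range is 0..=2
}
```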
kkysen added a commit that referenced this issue on Jun 27, 2024:
* Part of #1180. This optimizes `fn get_lo_ctx`. 9d00ae6 optimizes it enough that it is now inlined (it was already `#[inline]`), which enables other optimizations. a5c6209 makes `stride` a `u8`, which it is in the caller `fn decode_coefs`, and which tells LLVM that `_ * stride` can't overflow, and thus that `2 * stride` being in bounds means `0 * stride` is in bounds, for example. 242065b then pre-indexes `levels` so that there is only one bounds check per branch. There is actually still one more, since LLVM does not know that `stride != 0`. This could be optimized out with more effort in `fn decode_coefs`, as `stride` is ultimately derived from `t_dim.h`, which comes from `static`/`const` data.
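The pre-indexing step looks roughly like this (a sketch, not the real `fn get_lo_ctx`; the three-tap sum is illustrative):

```rust
// Sketch: slice `levels` once up to the largest offset used, so each branch
// pays one bounds check instead of one per access. `stride: u8` widened to
// `usize` also tells LLVM that `2 * stride` cannot overflow.
fn lo_ctx(levels: &[u8], stride: u8) -> u32 {
    let stride = usize::from(stride);
    let window = &levels[..=2 * stride]; // the single bounds check
    // In-bounds by construction: 0, stride, and 2 * stride all fit in window.
    window[0] as u32 + window[stride] as u32 + window[2 * stride] as u32
}
```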
kkysen added a commit that referenced this issue on Jun 27, 2024:
* Part of #1180. Move the `tok != 0` check before the `*= 0x17ff41` so that `cmov` is used, not a mispredicted branch.
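The shape of this transformation, sketched with the constant from the commit message but simplified surrounding logic (the sentinel value is made up for illustration; the cmov lowering is typical, not guaranteed):

```rust
// Branchy version: the multiply is guarded by a data-dependent branch that
// mispredicts often on this kind of data.
fn scale_branchy(tok: u32) -> u32 {
    if tok != 0 {
        tok.wrapping_mul(0x17ff41) >> 9
    } else {
        u32::MAX // illustrative "no coefficient" sentinel
    }
}

// cmov-friendly version: the comparison is decided before the multiply and
// both sides of the select are available, so the compiler can emit a
// conditional move instead of a branch.
fn scale_branchless(tok: u32) -> u32 {
    let nonzero = tok != 0;
    let scaled = tok.wrapping_mul(0x17ff41) >> 9;
    if nonzero { scaled } else { u32::MAX }
}
```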
kkysen added a commit that referenced this issue on Jul 1, 2024:
* Part of #1180. I forgot to open this PR earlier. This optimizes some `CfSelect` uses and makes its offset, as well as the offsets in `Rav1dTileState_frame_thread`, `u32`s instead of `usize`s, saving space, prohibiting overflows on multiplications, and reducing stack usage a bit.
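The overflow point is worth spelling out; a sketch (the field and parameter names are made up for illustration):

```rust
// Sketch: on a 64-bit target, widening two u32s to usize before multiplying
// provably cannot overflow, so no overflow checks (or panics) are needed,
// and the struct field is half the size of a usize.
struct FrameThreadOffsets {
    cf_offset: u32, // was usize
}

fn element_offset(offs: &FrameThreadOffsets, stride: u32) -> usize {
    // Max value: (2^32 - 1)^2 < 2^64, so this cannot wrap on 64-bit.
    offs.cf_offset as usize * stride as usize
}
```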
kkysen added a commit that referenced this issue on Jul 1, 2024:
…icing once in `fn decode_coefs` (#1249) * Part of #1180. For the `CfSelect::Task` case, no slicing is done; the array is just used as is. For the `CfSelect::Frame` case, the `DisjointMut` is sliced to get a `CF_LEN`-long array. Then `fn get` and `fn set` can directly index into this array with no bounds checks and no `match`.
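A simplified sketch of the slice-once pattern (the types below stand in for rav1d's `CfSelect` and `DisjointMut`, and the `CF_LEN` value is illustrative):

```rust
// Sketch: resolve the match and the slicing a single time, yielding a
// fixed-size array; every later get/set indexes it with no per-access match.
const CF_LEN: usize = 32 * 32; // illustrative length

enum CfSelect<'a> {
    Task(&'a mut [i32; CF_LEN]),
    Frame(&'a mut [i32]), // stands in for a DisjointMut guard
}

fn resolve<'a>(sel: CfSelect<'a>) -> &'a mut [i32; CF_LEN] {
    match sel {
        CfSelect::Task(arr) => arr,
        // Slice once; this try_into is the only remaining length check.
        CfSelect::Frame(buf) => (&mut buf[..CF_LEN]).try_into().unwrap(),
    }
}

fn set(cf: &mut [i32; CF_LEN], i: usize, v: i32) {
    // Checked against the constant CF_LEN only, with no `match`; a
    // range-limited index type would remove even this check.
    cf[i] = v;
}
```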
@fbossen, how slow is it now?
I am measuring about 5% slower than the C code on average, which is a lot better than it used to be.