
Large number of page faults occurring in fn rav1d_submit_frame #1358

Open · ivanloz opened this issue Sep 13, 2024 · 10 comments

ivanloz commented Sep 13, 2024

(Copied from #1294 (comment) -- @kkysen for visibility, thanks!)

I've been looking into what remaining sources of overhead might be and wanted to chime in with some of my findings.

One major discrepancy I noticed when running benchmarks between dav1d and rav1d was an order-of-magnitude difference in the number of madvise calls and page faults. I also saw a larger number of context switches in rav1d than in dav1d (likely related to the page faults?). This may explain at least some of the performance difference seen.

Digging into this, I found that the majority (~82%) of the page faults in rav1d come from rav1d::src::decode::rav1d_submit_frame. Specifically, the stack trace points to:

<alloc::boxed::Box<[rav1d::src::refmvs::RefMvsTemporalBlock]> as core::iter::traits::collect::FromIterator<rav1d::src::refmvs::RefMvsTemporalBlock>>::from_iter::<core::iter::adapters::map::Map<core::ops::range::Range<usize>, rav1d::src::decode::rav1d_submit_frame::{closure#1}>>

I believe that corresponds to this closure:

rav1d/src/decode.rs

Lines 5223 to 5227 in 7d72409

```rust
f.mvs = Some(
    (0..f.sb128h as usize * 16 * (f.b4_stride >> 1) as usize)
        .map(|_| Default::default())
        .collect(),
);
```

This is the equivalent operation in dav1d:

rav1d/src/decode.c

Lines 3623 to 3624 in 7d72409

```c
f->mvs_ref = dav1d_ref_create_using_pool(c->refmvs_pool,
                 sizeof(*f->mvs) * f->sb128h * 16 * (f->b4_stride >> 1));
```

Here dav1d_submit_frame allocates using dav1d_ref_create_using_pool, which calls into dav1d_mem_pool_pop and allocates from pooled memory (initialized in dav1d_mem_pool_init). This likely reduces the number of allocator calls.

The switch from using pooled memory in rav1d looks to have been introduced as part of 6420e5a, PR #984.
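For reference, the pooling idea looks roughly like this (a minimal Rust sketch with hypothetical names, not dav1d's actual API): buffers go back onto a free list on release and get reused for the next frame instead of round-tripping through the allocator.

```rust
use std::sync::Mutex;

// Hypothetical free-list pool, loosely modeled on dav1d_mem_pool_*.
struct MemPool {
    free: Mutex<Vec<Box<[u8]>>>,
}

impl MemPool {
    // Reuse a previously released buffer of the right size if one is
    // available; otherwise fall back to a fresh zeroed allocation.
    fn pop(&self, size: usize) -> Box<[u8]> {
        match self.free.lock().unwrap().pop() {
            Some(buf) if buf.len() == size => buf, // pool hit: no new pages touched
            _ => vec![0u8; size].into_boxed_slice(), // pool miss
        }
    }

    // Return a buffer to the pool instead of freeing it. Note that a
    // reused buffer comes back with its old contents, not zeroed.
    fn push(&self, buf: Box<[u8]>) {
        self.free.lock().unwrap().push(buf);
    }
}
```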

@kkysen kkysen changed the title Large number of page faults occurring in rav1d_submit_frame Large number of page faults occurring in fn rav1d_submit_frame Sep 17, 2024
CrazyboyQCD commented Sep 19, 2024

We could use other allocators like jemalloc and mimalloc for comparison.
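For example, swapping in jemalloc for a quick A/B comparison should only take a couple of lines (a sketch assuming the tikv-jemallocator crate):

```rust
// In the crate root of the rav1d-cli binary: replace the global
// allocator with jemalloc for comparison against the system allocator.
use tikv_jemallocator::Jemalloc;

#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;
```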

kkysen (Collaborator) commented Sep 19, 2024

It's possible that if we just changed this to use an explicitly zeroed allocation, it would be optimized enough. If the performance of that is good, that's much easier than pooling it again and managing the lifetime for that.

kkysen (Collaborator) commented Sep 19, 2024

The optimization Rust does for zeroed allocations only applies to types that implement IsZero, which is implemented only for the primitive types and arrays of them, so it won't work out of the box for custom types. We could, however, allocate zeroed u8s (or maybe a larger-alignment primitive for more-aligned types), align the buffer, and then use FromZeroes to transmute it. There is also the question of how we can initialize an Arc<[T]> this way, but I think we can figure that out.
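To illustrate the difference (a contrived sketch, not rav1d code):

```rust
fn main() {
    // A zero literal of a primitive type takes the IsZero specialization,
    // which lowers to alloc_zeroed (calloc), so the kernel can back the
    // allocation with untouched copy-on-write zero pages.
    let fast: Vec<u64> = vec![0u64; 1 << 20];

    // A custom Default type via map/collect (the current f.mvs pattern)
    // writes every element during collection, faulting in every page.
    #[derive(Clone, Copy, Default)]
    struct Block([u8; 16]);
    let slow: Box<[Block]> = (0..1 << 20).map(|_| Block::default()).collect();

    assert_eq!(fast.len(), slow.len());
}
```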

@ivanloz, if you want to see if doing this for f.mvs fixed the performance issue you measured on your machine, that'd be great. We can just try a quick and dirty solution to see if it's worth pursuing first before trying to figure out a good API for it and doing it for all of our Vecs, AlignedVecs, Box<[_]>s, and Arc<[_]>s that are zero-initialized.

kkysen (Collaborator) commented Sep 19, 2024

zerocopy (if we enable the "alloc" feature) actually already has FromZeroes::new_{box_slice,vec}_zeroed, which perform this optimization more directly. It doesn't have an Arc equivalent, as that's more complex, but it should be enough for testing, I think.
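A quick sketch of what that could look like for f.mvs (assuming zerocopy 0.7 with the "alloc" and "derive" features; the struct fields here are placeholders, not the real RefMvsTemporalBlock definition):

```rust
use zerocopy::FromZeroes;

// Placeholder layout; the real RefMvsTemporalBlock lives in src/refmvs.rs.
#[derive(Clone, Copy, FromZeroes)]
#[repr(C)]
struct RefMvsTemporalBlock {
    mv: [i16; 2],
    r#ref: i8,
}

// One zeroed allocation instead of the per-element map/collect.
fn alloc_mvs(sb128h: usize, b4_stride: usize) -> Box<[RefMvsTemporalBlock]> {
    RefMvsTemporalBlock::new_box_slice_zeroed(sb128h * 16 * (b4_stride >> 1))
}
```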

ivanloz (Author) commented Sep 19, 2024

While I've got experience tracking down performance issues, I'm a bit of a novice when it comes to actually writing Rust. This sounds like it should be a relatively straightforward change to put together, but I haven't had much luck implementing it.

If you (or someone else) can provide the patch I'd be happy to test it out and report the results.

kkysen (Collaborator) commented Sep 19, 2024

> While I've got experience tracking down performance issues, I'm a bit of a novice when it comes to actually writing Rust. This sounds like it should be a relatively straightforward change to put together, but I haven't had much luck implementing it.
>
> If you (or someone else) can provide the patch I'd be happy to test it out and report the results.

Sure, I'll work on it when I get some more time. Since there's not much of a rush, we'll probably also want to wait for Arc::new_zeroed_slice (rust-lang/rust#129396) to stabilize. The similar new_uninit/new_uninit_slice functions were just recently stabilized in rust-lang/rust#63291, so hopefully there is still momentum to stabilize the zeroed version soon. Otherwise, we'll have to use Arc<Box<[T]>>s instead of Arc<[T]>s (in release mode) to get this zeroed-initialization optimization, as there's no other way to (safely/stably) create a zeroed Arc.
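That workaround would look something like this (hypothetical helper, again assuming zerocopy):

```rust
use std::sync::Arc;
use zerocopy::FromZeroes;

// Stable stand-in for Arc::new_zeroed_slice: one extra indirection,
// but the zeroed-allocation fast path is preserved.
fn shared_zeroed<T: FromZeroes>(len: usize) -> Arc<Box<[T]>> {
    Arc::new(T::new_box_slice_zeroed(len))
}
```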

kkysen (Collaborator) commented Oct 28, 2024

Hey @ivanloz, I'm trying to look into this further now and fix it. Are you measuring the page-faults, madvises, and context-switches using perf? What are you running exactly so I can try to reproduce it and iterate on it? And how are you determining where the page faults are occurring? Thanks!

ivanloz (Author) commented Oct 30, 2024

Thanks @kkysen -- I'm testing on an Android device so I'm using simpleperf, but AIUI it has the same usage/functionality as perf.

I collected the raw number of context switches and page faults with:

```
simpleperf stat ./rav1d-cli -l 2000 --threads 1 -i $INPUT -o /dev/null
```

To see where they were occurring, I used record to count page-fault events while recording the call graph:

```
simpleperf record -o $OUT_DATA --duration 2000 -f 4000 --call-graph fp -e page-faults ./rav1d-cli -l 2000 --threads 1 -i $INPUT -o /dev/null
```

Android provides inferno to inspect the data, and I believe perf can produce a similar flamegraph via perf script report flamegraph.

As for the madvise calls, I believe I saw those while comparing perfetto traces, though there may also be a way for perf to count syscalls.
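If the kernel exposes the syscall tracepoints, something like this should count them alongside the other events (untested on this workload):

```
perf stat -e page-faults,context-switches,syscalls:sys_enter_madvise ./rav1d-cli -l 2000 --threads 1 -i $INPUT -o /dev/null
```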

kkysen (Collaborator) commented Oct 31, 2024

@ivanloz, thanks! Also, which files were you testing on? We've been testing mostly on Chimera 10-bit 1080. I'm also curious whether different kinds of videos make a difference here, especially ones that do things like change frame sizes.

ivanloz (Author) commented Oct 31, 2024

I'm testing with the Chimera set as well, primarily with the 10-bit 1080 6191kbps file but I believe I've seen this with any of the files I've tested with from that set.

I went ahead and tested this on my Linux x86-64 host as well just to make sure this wasn't some artifact from running in Android or ARM64, and I'm seeing the same behavior.

It's less pronounced (maybe because of the greater resources), but page faults grow with the number of frames processed in rav1d, while dav1d holds largely stable. Again, this is with the Chimera 10-bit 1080 6191kbps sample on Linux x86-64.

| frames | rav1d page faults | dav1d page faults |
| ------ | ----------------- | ----------------- |
| 2000   | 2,349             | 2,066             |
| 8929   | 11,094            | 2,080             |

I tested with #1368 and saw no difference unfortunately.
