Large number of page faults occurring in `fn rav1d_submit_frame` #1358
Comments
We could use other allocators like [...]
It's possible that if we just changed this to use an explicitly zeroed allocation, it would be optimized enough. If the performance of that is good, that's much easier than pooling it again and managing the lifetime for that.
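For concreteness, here is a minimal sketch of what an explicitly zeroed allocation could look like, assuming an all-zero bit pattern is a valid value of the element type (the helper name is made up for illustration; this is not rav1d's actual code):

```rust
use std::alloc::{alloc_zeroed, handle_alloc_error, Layout};

/// Allocate a boxed slice directly from zeroed memory, skipping the
/// per-element writes that a `collect()` over a range performs.
/// Hypothetical helper, not an existing rav1d API.
///
/// Safety: the caller must guarantee that all-zero bytes are a valid
/// `T` (true for plain-old-data structs).
unsafe fn zeroed_boxed_slice<T>(n: usize) -> Box<[T]> {
    let layout = Layout::array::<T>(n).expect("size overflow");
    // Keep the sketch simple: zero-sized layouts must not be passed
    // to the allocator, so they are not handled here.
    assert!(layout.size() > 0);
    let ptr = unsafe { alloc_zeroed(layout) }.cast::<T>();
    if ptr.is_null() {
        handle_alloc_error(layout);
    }
    unsafe { Box::from_raw(std::ptr::slice_from_raw_parts_mut(ptr, n)) }
}
```

The memory then comes back from the allocator already zeroed -- typically as untouched copy-on-write zero pages -- instead of every page being dirtied by an element-by-element fill.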
The optimization Rust does for zeroed allocations is only done for types that implement [...]. @ivanloz, if you want to see if doing this for [...]
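For context on that optimization (this is my understanding of the standard library's behavior, not something spelled out in the thread): `vec![value; n]` only routes through `alloc_zeroed` when the element type is covered by the standard library's private zero-detection specialization (primitives and a few other types); arbitrary user structs fall back to a per-element fill:

```rust
fn main() {
    const N: usize = 1 << 20;

    // Primitive zero: the standard library can lower this to a single
    // alloc_zeroed call, so the kernel hands back zero pages that are
    // not touched until first use.
    let bytes = vec![0u8; N];

    // A user-defined struct, even an all-zero one, is not covered by
    // that specialization: this allocates and then writes every slot,
    // faulting in each backing page up front.
    #[derive(Clone, Copy, Default)]
    struct Block {
        mv: [i16; 2], // made-up fields, for illustration only
        rf: i8,
    }
    let blocks = vec![Block::default(); N];

    assert_eq!(bytes.len(), N);
    assert_eq!(blocks.len(), N);
}
```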
While I've got experience tracking down performance issues, I'm a bit of a novice when it comes to actually writing Rust. This sounds like it should be a relatively straightforward change to put together, but I haven't had much luck implementing it. If you (or someone else) can provide the patch, I'd be happy to test it out and report the results.
Sure, I'll work on it when I get some more time. If there's not much of a rush, we'll probably also want to wait for [...]
Hey @ivanloz, I'm trying to look into this further now and fix it. Are you measuring the page-faults, [...]
Thanks @kkysen -- I'm testing on an Android device, so I'm using [...]. I collected the raw number of context-switches and page-faults with [...]. To see where they were occurring, I used [...]. Android provides inferno to inspect the data, and I believe [...]. As for the [...]
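The exact measurement commands were lost from the quote above, so as a stand-in for anyone reproducing this: fault counts can also be read from inside the process via getrusage(2). A sketch using the libc crate on Linux/Android (a different technique from the profiling tools discussed in this thread):

```rust
// Read the process's own page-fault counters via getrusage(2).
// Requires the `libc` crate; works on Linux and Android.
fn fault_counts() -> (libc::c_long, libc::c_long) {
    let mut ru: libc::rusage = unsafe { std::mem::zeroed() };
    let rc = unsafe { libc::getrusage(libc::RUSAGE_SELF, &mut ru) };
    assert_eq!(rc, 0, "getrusage failed");
    (ru.ru_minflt, ru.ru_majflt) // (minor, major) faults
}

fn main() {
    let (min0, maj0) = fault_counts();
    // Stand-in workload; in a real measurement, decode frames here.
    let buf = vec![1u8; 64 << 20];
    std::hint::black_box(&buf);
    let (min1, maj1) = fault_counts();
    println!("minor: {}, major: {}", min1 - min0, maj1 - maj0);
}
```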
@ivanloz, thanks! Also, which files were you testing on? We've been testing mostly on Chimera 10-bit 1080. I'm also curious whether different kinds of videos make a difference here, especially if they do things like change frame sizes and stuff like that.
I'm testing with the Chimera set as well, primarily the 10-bit 1080 6191kbps file, but I believe I've seen this with all of the files I've tested from that set.

I went ahead and tested this on my Linux x86-64 host as well, just to make sure it wasn't some artifact of running on Android or ARM64, and I'm seeing the same behavior. It's less pronounced (maybe because of the greater resources), but page-faults grow with frames processed in rav1d, versus dav1d largely holding stable. Again, this is with the Chimera 10-bit 1080 6191kbps sample on Linux x86-64.
I tested with #1368 and saw no difference, unfortunately.
(Copied from #1294 (comment) -- @kkysen for visibility, thanks!)
I've been looking into what remaining sources of overhead might be and wanted to chime in with some of my findings.
One major discrepancy I noticed when running benchmarks between dav1d and rav1d was an order-of-magnitude difference in the number of madvise calls and page-faults. I also saw a larger number of context-switches in rav1d than in dav1d (likely related to the page-faults?). This may explain at least some of the performance difference seen.

Digging into this, I found the majority (~82%) of the page-faults in rav1d are coming from rav1d::src::decode::rav1d_submit_frame. Specifically, the stack trace points to:

```
<alloc::boxed::Box<[rav1d::src::refmvs::RefMvsTemporalBlock]> as core::iter::traits::collect::FromIterator<rav1d::src::refmvs::RefMvsTemporalBlock>>::from_iter::<core::iter::adapters::map::Map<core::ops::range::Range<usize>, rav1d::src::decode::rav1d_submit_frame::{closure#1}>>
```
I believe that corresponds to this closure: rav1d/src/decode.rs, lines 5223 to 5227 at 7d72409.
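For readers without the permalink handy, the shape of the allocation can be inferred from the from_iter symbol in the stack trace: a Range mapped by a closure and collected into a boxed slice. A reconstruction of that shape (not the literal code at those lines, and with stand-in fields):

```rust
// Inferred from the `from_iter` symbol above; stand-in fields, not the
// actual definition in src/refmvs.rs.
#[derive(Clone, Copy, Default)]
struct RefMvsTemporalBlock {
    mv: [i16; 2],
    rf: i8,
}

fn alloc_ref_mvs(n_blocks: usize) -> Box<[RefMvsTemporalBlock]> {
    // Each element is written individually, so every backing page is
    // touched -- and soft page-faulted in -- at allocation time, every
    // time a frame is submitted.
    (0..n_blocks)
        .map(|_| RefMvsTemporalBlock::default())
        .collect()
}
```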
This is the equivalent operation in dav1d: rav1d/src/decode.c, lines 3623 to 3624 at 7d72409.
Here dav1d_submit_frame allocates using dav1d_ref_create_using_pool. This then calls into dav1d_mem_pool_pop, which allocates from pooled memory (initialized in dav1d_mem_pool_init). This likely reduces the number of allocator calls.

The switch away from pooled memory in rav1d looks to have been introduced as part of 6420e5a, PR #984.
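For anyone weighing a fix: dav1d's pool amounts to a free list of previously returned buffers that dav1d_mem_pool_pop hands back instead of calling the allocator, so their pages are already faulted in from earlier frames. A rough Rust sketch of the same idea (names and structure are mine, not dav1d's API or a proposed rav1d design):

```rust
use std::sync::Mutex;

/// Minimal buffer pool in the spirit of dav1d_mem_pool_{init,pop,push}.
/// Illustrative only; not an actual rav1d or dav1d API.
struct BufPool<T> {
    free: Mutex<Vec<Box<[T]>>>,
}

impl<T: Default + Clone> BufPool<T> {
    fn new() -> Self {
        BufPool { free: Mutex::new(Vec::new()) }
    }

    /// Reuse a returned buffer of the right size if one is available;
    /// its pages are already faulted in from earlier use. Note that a
    /// reused buffer keeps its old contents, so callers must reset it.
    fn pop(&self, len: usize) -> Box<[T]> {
        let mut free = self.free.lock().unwrap();
        if let Some(pos) = free.iter().position(|b| b.len() == len) {
            return free.swap_remove(pos);
        }
        drop(free);
        // Cold path: a fresh allocation, which is where first-touch
        // page faults come from.
        vec![T::default(); len].into_boxed_slice()
    }

    /// Return a buffer to the pool when the frame is done with it.
    fn push(&self, buf: Box<[T]>) {
        self.free.lock().unwrap().push(buf);
    }
}
```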