Optimizing pl_mpeg #51
Comments
Very cool! I can't promise to merge any of your changes (if that's even desired), but I'll certainly have a look. For performance optimizations, have a look at the n64 port of pl_mpeg in libdragon: https://github.com/DragonMinded/libdragon/tree/preview/src/video – from what I understand they added a bunch of general optimizations for bit reading and VLC lookups (and of course some very n64 specific YUV conversion functions etc). |
I've actually gone beyond most of the optimizations in that repo. I still have a few more to go... |
One of the strategies I need to implement for MCUs is knowing which MBs have changed each frame and only drawing those. A significant amount of time on MCUs is spent converting pixel colorspace and pushing those pixels to the display. Any MBs that can be avoided per frame helps a lot. |
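The changed-macroblock idea above can be sketched roughly as follows. This is only an illustration under stated assumptions: `mb_dirty_map_t`, `mb_dirty_mark`, `mb_dirty_flush` and the `mb_draw_fn` callback are hypothetical names, not part of pl_mpeg, and the 4096-macroblock cap is arbitrary.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MB_SIZE 16    /* MPEG-1 macroblocks cover 16x16 luma pixels */
#define MB_MAX  4096  /* arbitrary cap chosen for this sketch */

/* Hypothetical helper type: one bit per macroblock, set when it changed. */
typedef struct {
    int mb_width;               /* macroblocks per row */
    int mb_height;              /* macroblock rows */
    uint8_t dirty[MB_MAX / 8];
} mb_dirty_map_t;

/* Hypothetical display hook: convert and push one 16x16 block at (px, py). */
typedef void (*mb_draw_fn)(int px, int py, void *user);

static void mb_dirty_init(mb_dirty_map_t *m, int width, int height) {
    m->mb_width  = (width  + MB_SIZE - 1) / MB_SIZE;
    m->mb_height = (height + MB_SIZE - 1) / MB_SIZE;
    assert(m->mb_width * m->mb_height <= MB_MAX);
    memset(m->dirty, 0, sizeof m->dirty);
}

/* The decoder would call this for every macroblock that is not skipped. */
static void mb_dirty_mark(mb_dirty_map_t *m, int mb_x, int mb_y) {
    int idx = mb_y * m->mb_width + mb_x;
    m->dirty[idx >> 3] |= (uint8_t)(1u << (idx & 7));
}

static int mb_dirty_test(const mb_dirty_map_t *m, int mb_x, int mb_y) {
    int idx = mb_y * m->mb_width + mb_x;
    return (m->dirty[idx >> 3] >> (idx & 7)) & 1;
}

/* Draw only the changed macroblocks, then clear the map for the next frame.
   Returns how many macroblocks were drawn. */
static int mb_dirty_flush(mb_dirty_map_t *m, mb_draw_fn draw, void *user) {
    int drawn = 0;
    for (int y = 0; y < m->mb_height; y++) {
        for (int x = 0; x < m->mb_width; x++) {
            if (mb_dirty_test(m, x, y)) {
                if (draw) draw(x * MB_SIZE, y * MB_SIZE, user);
                drawn++;
            }
        }
    }
    memset(m->dirty, 0, sizeof m->dirty);
    return drawn;
}
```

The per-frame cost then scales with the number of changed blocks rather than with the frame size, at the price of an uneven load when most of the picture moves.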
@bitbank2 Your fork has no issues page, so I'm messaging you here. I tried your optimized version and it's really impressive! But currently it's unusable for me because it randomly hangs at 100% CPU and stops decoding entirely. It happens with every mpg file I tested when seeking through it; for some files I don't even have to seek.
It happens with the example video in the readme too. |
Thanks for letting me know - I'll figure it out. |
@siteswapv4 I'm not able to reproduce your hang on the mac. I'm playing the bjork example video and seeking at the beginning to get the duration. I'm not familiar with other player programs. Can you give me the specific function to call with the specific parameter that makes it hang on the bjork video? I didn't add a seek function yet to my test player. |
This hangs forever at loop 4 for me playing bjork example
|
Maybe you could also open your GitHub repo for issues so people can report bugs more easily later. Hope you can reproduce it this time. |
Also I'm testing this on linux but same behavior on psvita so I don't think it's relevant |
I was able to reproduce the problem on my MacOS project. It has to do with the changes I made to the buffer/file access. It gets messed up when seeking. I think I can resolve it. How much faster does it run in your tests? On the Mac it's about 230% faster than the original (so far). |
I haven't been able to benchmark it yet because of the crash issue but on desktop my whole game went from 25% to 20% single core usage just with that. I'd have to run tests on the Vita where it's most relevant. |
@bitbank2 Just tested your version on psvita (without seeking) and I can play videos with a bitrate 2 to 3 times higher than before (I disable audio and only play video), so yup the improvement is crazy lol |
I fixed the problem, but discovered something in the design that I really don't like. Because of the way it seems to seek audio and video separately, it gums up the buffer such that it reallocates the file buffer to 2x size mid-decode. I will push my fix for you to try, but I would like to redesign the logic to not ever use memrealloc(). |
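One way to avoid growing the buffer mid-decode is a fixed-capacity buffer that compacts already-consumed bytes and otherwise accepts only what fits. The sketch below is an assumption about what such a redesign could look like, not the fork's actual code; `fixed_buf_t`, `fixed_buf_write` and `FIXED_CAPACITY` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FIXED_CAPACITY 4096  /* hypothetical size, chosen up front */

typedef struct {
    uint8_t bytes[FIXED_CAPACITY];
    size_t read_pos;  /* bytes before this offset were already consumed */
    size_t length;    /* number of valid bytes in the buffer */
} fixed_buf_t;

/* Slide the unread bytes to the front to reclaim the consumed space. */
static void fixed_buf_compact(fixed_buf_t *b) {
    size_t unread = b->length - b->read_pos;
    memmove(b->bytes, b->bytes + b->read_pos, unread);
    b->read_pos = 0;
    b->length = unread;
}

/* Append up to n bytes without ever reallocating. Returns the number of
   bytes accepted; a short count tells the caller to retry after decoding. */
static size_t fixed_buf_write(fixed_buf_t *b, const uint8_t *src, size_t n) {
    assert(b->read_pos <= b->length && b->length <= FIXED_CAPACITY);
    if (n > FIXED_CAPACITY - b->length) {
        fixed_buf_compact(b);
    }
    size_t room = FIXED_CAPACITY - b->length;
    if (n > room) {
        n = room;
    }
    memcpy(b->bytes + b->length, src, n);
    b->length += n;
    return n;
}
```

The trade-off is that the writer has to handle short writes, which is usually acceptable on MCUs where a surprise 2x allocation is not.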
@bitbank2 Completely fixed, both the seek issue and the random crashes I was getting, tested with the same files that were failing before. Thanks a lot ! |
Do you think you can push the optimizations further or are you about done ? |
I'm at the point where the next set of optimizations would require breaking the code and then fixing the breakage. Most of the hot spots are already in good shape. One option I've tested is a macroblock callback function instead of a frame callback. This works on some videos, but not all, because the switch between showing a B frame or an F frame messes things up. It's a dramatic speedup, since redrawing the entire frame every time is wasteful when most pixels don't change. The problem is that dramatic movement causes a worst-case scenario, so optimizing for small frame changes can lead to uneven playback load. Here's the main profile so far. The code taking the most time is doing the necessary macroblock decoding/copying/moving. |
I see ! Well the improvement is already incredible right now. I'll take a peek at your fork from time to time but right now there's nothing more I can ask for the PSVita, it decodes up to native resolution (960x540) 3000K video bitrate now, thanks |
There are plenty of places in the current code where some 128-bit SIMD would help significantly, but that's a little beyond the scope of what I'm doing. I wanted to maintain a pure C project that can compile on any target CPU (32-bits or better). |
If SIMD is to be considered, the suggestions would be SSE2 for x86(_64) and NEON for ARM. Virtually every CPU launched in the past 20 years supports them, and you can always hide them behind preprocessor directives. (I don't know much about implementing SIMD, just wanted to point that out) |
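A minimal sketch of hiding SIMD behind preprocessor directives, with a plain C fallback. The function name `avg16_u8` is hypothetical and not taken from pl_mpeg; the operation (a rounded average of 16 pixel bytes, as used in half-pel interpolation) is just a convenient example because SSE2 (`_mm_avg_epu8`) and NEON (`vrhaddq_u8`) both compute `(a + b + 1) >> 1` per byte.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#elif defined(__ARM_NEON)
#include <arm_neon.h>    /* NEON intrinsics */
#endif

/* dst[i] = (a[i] + b[i] + 1) >> 1 for 16 bytes; pointers may be unaligned. */
static void avg16_u8(uint8_t *dst, const uint8_t *a, const uint8_t *b) {
    assert(dst && a && b);
#if defined(__SSE2__)
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)dst, _mm_avg_epu8(va, vb));
#elif defined(__ARM_NEON)
    vst1q_u8(dst, vrhaddq_u8(vld1q_u8(a), vld1q_u8(b)));
#else
    /* Generic C path for targets without either instruction set. */
    for (int i = 0; i < 16; i++) {
        dst[i] = (uint8_t)((a[i] + b[i] + 1) >> 1);
    }
#endif
}
```

Callers see one function regardless of target, so the project still compiles as pure C wherever neither macro is defined.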
Yup same, if SIMD makes the thing faster I'm all in, but not at the cost of compatibility with most CPUs |
There's also "SIMD everywhere" that exists, though I've never actually used SIMD myself, just checked the github page it seems promising ? |
I can add SIMD for x86 and Arm NEON easily and in a way that doesn't break it for generic C targets. I had assumed that pl_mpeg was only used on "strange/odd" targets because x86/Arm would have optimized playback in some other form already. If you're saying that your use of pl_mpeg is on an x86 or Arm CPU with SIMD capability, then sure, I'm in! |
PSVita has an ARM Cortex-A9, which supports NEON SIMD |
I'm not converting anything. I use SDL2 (now 3, but same behavior) and directly update a YUV texture. The SDL2-only example in the pl_mpeg GitHub repo is from me if you wanna take a look. |
wow great ! |
I did a test by compiling Before: https://0x0.st/s/SyCGgHGrTtYPHlMHMVSlFg/88wG.mp4 After: https://0x0.st/s/uJT7rguTFuNecxx2FDyvmA/88wD.mp4 Outside pl_mpeg_extract_frames or on less demanding videos (1080p or less, 24-30fps) the decoded result is a bit less broken. This can be observed since the first commit that implemented the fast VLC. I get that the optimization process isn't finished yet, but I thought pointing this out could help in some way. Despite this, the performance gains are quite impressive. Even Theora with its SSE2 can't stand against. |
Thanks for the info. My initial changes reduced the frequency of checking the buffer state. The original code was checking for data available after each VLC decode. I need to implement a highwater mark check at the start of each macroblock decode to fix this. |
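The high-water mark idea can be sketched like this: check once per macroblock that a worst-case amount of data is buffered, then decode that macroblock with unchecked reads. Everything here is hypothetical (`bitbuf_t`, `mb_read_mode`, and especially `MB_HIGHWATER_BYTES`, which is a placeholder bound rather than a value from the MPEG-1 specification or from pl_mpeg).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit buffer; pl_mpeg's real buffer type differs. */
typedef struct {
    const uint8_t *bytes;
    size_t length;     /* number of valid bytes */
    size_t bit_index;  /* position of the next bit to read */
} bitbuf_t;

/* Placeholder upper bound on the bytes one macroblock may consume. */
#define MB_HIGHWATER_BYTES 1024

typedef enum {
    MB_READ_FAST,     /* enough data: decode with unchecked reads */
    MB_READ_CHECKED,  /* stream tail: fall back to per-read checks */
    MB_NO_DATA        /* refill the buffer (or stop) before decoding */
} mb_read_mode_t;

static size_t bitbuf_bytes_left(const bitbuf_t *b) {
    size_t used = (b->bit_index + 7) >> 3;  /* round up to whole bytes */
    return used >= b->length ? 0 : b->length - used;
}

/* Called once at the start of each macroblock instead of after every VLC. */
static mb_read_mode_t mb_read_mode(const bitbuf_t *b, int stream_ended) {
    assert(b != NULL);
    size_t left = bitbuf_bytes_left(b);
    if (left >= MB_HIGHWATER_BYTES) {
        return MB_READ_FAST;
    }
    if (stream_ended && left > 0) {
        return MB_READ_CHECKED;
    }
    return MB_NO_DATA;
}
```

The slow checked path only runs for the last few macroblocks of a stream, so the per-VLC check disappears from the hot loop without reading past the end of the buffer.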
@siteswapv4 I just pushed a change which adds some NEON SIMD, and the overall video decode speedup is about 21% compared to the previous build. I will now work on fixing the stream issues caused by removing some of the "has enough" checks. |
It doesn't compile on PSVita right now because
|
Here's the log |
enabling the hinted flag fixed it |
Can confirm there's a noticeable improvement on PSVita ! Very nice ! |
@siteswapv4 Some compilers are picky about casting SIMD registers; LLVM isn't. I can fix that with a NEON macro. Glad you're able to see the improvement. Now to try rewriting it in ESP32-S3 SIMD. Did you say that someone needed an SSE2 version? |
Well I'm not super familiar with SIMD but if I understand correctly and it means it'll be faster on x86/64 CPUs it's a full win |
Each CPU has its own variation of these instructions. Most allow manipulating 128-bit values as if they're 4 x 32-bits or 8 x 16-bits or 16 x 8-bits. I can support the unique instructions needed to speed up the code on x86 targets, but it would be helpful to have at least 1 person specifically needing that code. I can always do it later, but the ESP32-S3 is more important to me at the moment. |
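The "same 128 bits, different lane widths" point can be illustrated in portable C with a union. This is a scalar model for explanation only (real SIMD does each addition in one instruction); `vec128_t` and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* One 128-bit value viewed as 4 x 32-bit, 8 x 16-bit or 16 x 8-bit lanes. */
typedef union {
    uint32_t u32[4];
    uint16_t u16[8];
    uint8_t  u8[16];
} vec128_t;

/* Fill all 16 byte lanes with the same value. */
static vec128_t splat_u8(uint8_t v) {
    vec128_t r;
    assert(sizeof(vec128_t) == 16);
    for (int i = 0; i < 16; i++) r.u8[i] = v;
    return r;
}

/* Lane-wise adds: carries never cross a lane boundary, so the lane width
   chosen changes the result even though the input bits are identical. */
static vec128_t add_u8(vec128_t a, vec128_t b) {
    vec128_t r;
    for (int i = 0; i < 16; i++) r.u8[i] = (uint8_t)(a.u8[i] + b.u8[i]);
    return r;
}

static vec128_t add_u16(vec128_t a, vec128_t b) {
    vec128_t r;
    for (int i = 0; i < 8; i++) r.u16[i] = (uint16_t)(a.u16[i] + b.u16[i]);
    return r;
}

static vec128_t add_u32(vec128_t a, vec128_t b) {
    vec128_t r;
    for (int i = 0; i < 4; i++) r.u32[i] = a.u32[i] + b.u32[i];
    return r;
}
```

Adding a register of `0xFF` bytes to a register of `0x01` bytes wraps every lane to zero as 16 x 8-bit, but gives `0x0100` per lane as 8 x 16-bit and `0x01010100` per lane as 4 x 32-bit.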
I won't have immediate use of this myself. Though I'm guessing most people that use pl_mpeg will use it on recent desktops/laptops, as I said before pl_mpeg really is about its easiness to include in all possible projects. |
I've been using pl_mpeg in my godot-mpg module to add support for MPEG-1 videos in Godot. It was meant to replace Theora, a video codec that Godot has supported since before it was open-sourced and that to this day is full of bugs (which might get solved soon). pl_mpeg itself turned out to be slower than Theora, which has SSE2. But after just the VLC optimizations it surpassed Theora, to the point that even unoptimized editor builds can play without stuttering. I pointed to SSE2 and NEON not only because those are the only SIMD extensions accepted there, but also because virtually every amd64 CPU supports SSE2 and every armv8 CPU supports NEON. (also shouldn't it use |
Good points. I fixed the macro name (not pushed yet). After I get ESP32-S3 SIMD working, I'll return to do SSE2. The functions are all really simple, someone else can also do a PR with the SSE2 code. |
I completely missed the boat on this one. As someone who really wants to have videos working at good speeds on psvita, I really must have a good look at this one. Also, not sure if it's even useful at this point, but I had posted a PR that yielded a small (7%) optimization - check #33. SIMD and video decoding is way beyond my knowledge, so that was the best I could do. |
This was one of the first changes I made. I can see all of the hotspots with Xcode Instruments. |
Hi Dominic,
I recently got approved for a grant from NLnet and one of the projects I proposed was to optimize the pl_mpeg player to make it suitable for use on constrained devices (e.g. Microcontrollers). I've forked it and will be sharing my optimizations as I make them. Besides speed optimizations, I will make some #ifdef changes for the different memory challenges on MCUs (e.g. PSRAM).