libass support #439
Comments
Hi, may I ask why you need libass? I don't think you can use libass easily without modifying the ffmpeg builds, because all of the add-on libraries need to be linked into ffmpeg, otherwise ffmpeg wouldn't know it can use them. You technically can use libass in FFmpegInteropX, but that would require a code change that's not trivial.
That's because ffmpeg is a C library and does not provide .NET interfaces, which is what .winmd files are for.
The av-libs built by Shift-Media-Project include all this.
It would be rather easy, but there's a little caveat to that: you'll lose hardware acceleration, because the subtitle "burn-in" can only happen in a sw filter. Even "hw decoding to cpu mem" is pointless to do, because experience has shown that software decoding is less resource intensive than "hw-decode, hwdownload, hwupload" in most cases. The hwupload alone is already a killer for high-res videos like 4k. That little caveat is a KO for the idea, unfortunately.
I might have misunderstood, but I was under the impression that OP wanted libass from SMP with our own ffmpeg builds, which is ofc not possible. Filter burn-in isn't the only possible way to work with libass. We can expose ssa/srt as image cues and render the images ourselves to feed into the sub stream. Then the media sink will handle rendering/burn-in/whatever.
Hi @rmtjokar, long time no see!

I think libass has recently added a meson build system. It should be pretty easy now to directly integrate it, without having to resort to a SMP fork. The SMP build system has some disadvantages for us: not all of its libs have UWP targets, and if they do, they usually don't have ARM targets. And their project files are horribly messy, which makes it hard to maintain a fork with added WinRT+ARM configs. So using SMP is always a last resort for me.

As others have noted, the question is: what are you trying to accomplish with adding libass? You won't be able to use it without bigger changes in our lib. Subtitles are not normal streams, and our effects system does not work with them. We would need explicit code to post-process subs with libass (through ffmpeg filters).

A big problem is that libass does not support GPU rendering, and copying frames from GPU to CPU memory for rendering is very expensive (and afterwards they need to be copied back!). Which means that we cannot really use libass to directly render the subs into video frames.

And the subtitle rendering system in Windows is rather poorly implemented. The bitmap subtitle rendering is intended for static bitmap (text) frames, not animated live frames. I am pretty sure that it would be horribly jaggy if we'd try to feed animated subs into it. We had to use quite elaborate workarounds only to get clean, flicker-free static subs.

The only use I could currently see is for transcoding static (non-animated) ssa/ass subs into bitmap subtitles, using the libass rendering engine, which is surely better than the Windows rendering system. The downside is that we don't know which target size to render to, which might lead to more or less noticeable scaling artifacts. And I am not sure how flexible the libass filter in ffmpeg is - we would somehow have to disable animations.

Oh, and keep in mind that bitmap subtitle rendering is still broken in WinUI. Or has this bug "already" been resolved? I guess not, but I did not check for a long time. Has anyone tried with a recent version @brabebhin @softworkz?
Hi @lukasf I haven't heard anything on the MPE front, but I would guess the bug is still there as all the winui 3 effort seems to have gone into supporting AOT and designers. I have since developed my own MPE based on directx and frame server mode, as well as custom sub rendering with win2D. But this shouldn't be a show stopper for us. We can use UWP as a benchmark, since both UWP and winUI use the same MF interfaces. As an ugly workaround for size rendering we could simply ask the user to provide a size for us to render against and have the user update it on resize. |
Sure we could pass in a size, but at least when using the Windows rendering, it is not even clear at which exact position and size a sub is rendered. Of course this is not a problem when using custom subtitle rendering (which is probably the better approach anyways). |
For sub animations, custom rendering would be the only way to do it. Windows rendering is just 50% arcane. The regions containing a cue direct where subtitles are rendered. The region itself has coordinates that determine where it will be rendered on screen (these may be absolute positions or percentages). IIRC, image cues are always rendered in their own region and have pretty much absolute positioning and size. For text cues it gets more arcane, because Windows groups them by regions, and then inside the region you have some sort of flow directions. For whatever ungodly reason they also use XAML composition for rendering, which I would guess is why we observe flickering.

The MF interfaces provide A LOT of customization and allow applications fine-grained control over subtitles, but it seems MPE chooses to only implement a few of the combinations. Which is why it seems arcane, as it only implements whatever suits them. It's almost as if the MF team had nothing to do with MPE.

At this point I think the conversation moves towards whether we want to also become a rendering library and not just demux+decoding. In the end I think creating our own MPE isn't that hard. If we do want to create our own MPE, we could completely forget about MF's way of doing subtitles and just render them directly inside FFmpegInteropX. We'd still have to support Windows rendering too.
My Subtitle Filtering patchset includes a text2graphicsub filter, which allows converting text subtitles (including ASS) to graphical subtitles like dvd, dvb or x-subs, so all you need to do is add some filters, and at the end you get graphical subs like from any other file. It also has an option to strip animations. Yet my general view on this is: most of those who have ASS subtitles expect animations to work. For non-animated subtitles we are not 100% accurate, but pretty close to libass. Starting any work to integrate libass without animation support is rather pointless, as it won't make anybody happy. So either go for making libass fully work, including animations, or just leave it. IMO, putting effort into this is only justified when it gets it working in full effect.
Hi, sorry, I had to go out of the city for the past week.

@brabebhin Using libass is mainly for supporting ASS effects and animations. Its renderer is quite fast and smooth, and other player apps like PotPlayer, KMPlayer, and even MXPlayer on Android use this library. Maybe we can do the same.

@softworkz Thanks for the information. As I saw in PotPlayer, there are two ways of showing subtitles: "Vector Text Renderer" and "Image Text Renderer," both of which have better quality than displaying text in a UWP app (using SubtitleCue or even a Win2D Canvas with the same font/style). It's strange that they have better text rendering quality. I was thinking maybe I can use libass.dll directly in my app (just giving the whole ASS text to libass, for external subtitles only), so I created a wrapper around the libass.dll x64 version with P/Invoke in a WPF app (because I couldn't with UWP), but I got stuck.

Another option I've been using for the past eight years is Win2D and its CanvasControl. I created a new renderer using ChatGPT, and it seems you can manipulate the fading effect via the color's alpha channel without touching anything else. (Take a look at this:)

Rec.0002.mp4
So I found a project which uses libass: and I could've used it in UWP with an Image control, and this is the result: Pot Player > UWP 2 with sound > It's actually quite good. It's not as smooth as PotPlayer, but it's still very good.
I think this ultimately comes down to whether we want to venture into the land of rendering subtitles. So far we've been strictly a demux+decoding library. MPE is quite limited when it comes to subtitles, and not much can be done about it. I could devote some time to this once a decision is made.
@softworkz Your changeset is exactly what would be needed for rendering static subtitle frames with libass. But I think we all agree that static subs are not the intention here. If we'd add libass, it is for the animated subtitles.

I am not such a big fan of writing our own renderers. While it is very flexible, it makes it more difficult to use our lib. Currently, our output can just be put in a MPE and it all works. If some of our features require custom renderers, we would break with that concept and require devs to migrate their apps from MPE to our custom rendering. Also, I think it is difficult to synchronize the subtitle renderer with the video renderer. Decoding is decoupled from rendering, and we do not get much information about the actual playback position. Text sub rendering is easier, since we get events when a new sub is to be shown.

It would be great if we could find a way to use libass and somehow integrate it in our decoding chain - without killing performance. I am trying to brainstorm in that direction.

Idea 1: ffmpeg recently added vulkan as a new cross-platform hw decoder type, and included a bunch of filters which can run directly on vulkan frames. There is a filter called "hwmap", which can be used to map frames from one hw decoder type to a different hw decoder type (or a different gpu device of the same type). It has an option to "derive" its output hw context from the device which was used in the input frames. It seems that if the underlying device on input and output hw context is the same and compatible, then hwmap can directly map the frame, without copying. Setting the mode to "direct" can enforce this. If this would indeed work, then we could achieve hw accelerated gpu rendering: we could render the frames from libass onto a transparent hw texture and overlay this on the video using hwmap(vulkan) -> overlay_vulkan -> hwmap(d3d11). Of course, I don't even know if the hwmap stuff really works like that, and how easy it is to get a ffmpeg build with vulkan support. A downside of the approach is that we would be locked to the video output resolution (a sketch of such a filter chain is below).

Idea 2: We could create a second MediaSource, which just contains the rendered subtitle as a video stream with alpha channel. Users of the lib would have to add a second MPE layered above the first one, and link them using a MediaTimelineController. I never tried MediaTimelineController, so I don't know how well it actually works. But at least theoretically, it should take care of the syncing. Frankly, this also requires quite some modification on the app side.

Not sure if any of this makes sense, just trying to explore some alternative approaches...
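For illustration only, Idea 1 expressed as an ffmpeg filter chain could look roughly like the sketch below. The hwmap, hwupload and overlay_vulkan filters do exist in ffmpeg, but whether a d3d11va-to-vulkan mapping is actually supported is exactly the open question here (see the following comments), so treat this as a thought experiment, not a working graph:

```
# hypothetical filtergraph: [vid] = d3d11va-decoded frames, [subs] = libass output as RGBA frames
[vid]hwmap=derive_device=vulkan:mode=direct[v];
[subs]format=rgba,hwupload[s];
[v][s]overlay_vulkan[o];
[o]hwmap=derive_device=d3d11va[out]
```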
Here's the full tree of options from my point of view:

```mermaid
mindmap
  root((Subtitle<br>Overlay))
    Burn into video
      **B1**<br>hw decode<br>hw download<br>sw burn-in<br>hw upload
      **B2**<br>sw decode<br>sw burn-in<br>hw upload
      **B3**<br>sw render blank frame<br>hw upload<br>++++++++++<br>hw decode<br>hw overlay
      **B4**<br>sw render half-size frame<br>hw upload<br>hw upscale<br>++++++++++<br>hw decode<br>hw overlay
      **B5**<br>sw render partial sprites<br>hw upload<br>hw upscale<br>++++++++++<br>hw decode<br>hw overlay
    Presentation Layering
      **L1**<br>render full frames<br>Copy to D3D surface<br>overlay manually
      **L2**<br>render partial sprites<br>Copy to D3D surfaces<br>overlay manually
```
**Burn into video options**

**B1** — This is the worst of all options, because you need to copy every single uncompressed frame from GPU to CPU memory and then again from CPU memory to GPU memory.

**B2** — The advantage of hw decoding is often over-estimated. Other than in the case of encoding, video decoding can easily be done by CPUs as well. The big advantage of using hw decoding is that the large amounts of data never need to be copied between system and gpu memory, because gpu memory is always the eventual target. HW decoding followed by immediate downloading to cpu memory almost never makes sense. But for this case of doing sw burn-in of subtitles, B2 is significantly better than B1, because the memory transfer hits much harder than the sw (instead of hw) decoding.

**B3** — This is another scenario which requires my subtitles patchset and which will go into our server transcoding process shortly (it's there already, just not unlocked for the public). While it still involves uploading the overlay frames from cpu to gpu memory, like in case of B2, there's still a massive advantage: you don't need to upload at the same rate as the video fps.

Unfortunately, in the context of FFmpegInteropX it's not straightforward to go that way, because you cannot use this with the D3D11VA decoders. Instead, you would need to use the vendor-specific hw contexts and filters (like overlay_qsv or overlay_cuda). AFAIK, Vulkan is not stable enough on Windows yet; at least it doesn't work reliably with MPV player, even though it supports it. The one other (stable) option is to use OpenCL. There's also an overlay_opencl filter in ffmpeg, I'm just not sure whether you can hwmap from a d3d11va context to OpenCL. I know that d3d11va-to-opencl works for AMD, and I know that it works from qsv-to-opencl and from cuda-to-opencl. But I'm not sure whether d3d11va-to-opencl works with Intel and Nvidia gpus.

**B4** — This is a low-profile variant of B3. By using only half-size frames for the subtitle overlay, you save 75% of the memory bandwidth. You wouldn't do that for 720p or lower-res videos, but for 4k it's a good way to optimize for performance.

**B5** — That would be the "holy grail": instead of full frames, you would use one or more smaller surfaces to cover only the regions with subtitle content. It's difficult to implement though, because you are typically working with a pool of D3D surfaces, and that becomes difficult to manage when the sizes change dynamically. This would require modifications to the overlay filters in ffmpeg.

**Presentation Layering options**

**Basic** — The idea of not touching the video frames at all has a lot of appeal, as the nature of subtitles is significantly different from the actual video. This means that while the case of full-screen + full-fps needs to be accounted for, the implementation doesn't need to permanently create full-size overlays at the same rate as the video.

**ffmpeg side** — I don't think it's required to have a totally separate ffmpeg instance for subtitles. This would cause a lot of problems and a lot of work. ffmpeg can have more than a single video output, so for this case, one output would be the video frames as D3D surfaces and the other the rendered subtitles as software frames (L1) or a collection of multiple areas per frame (L2). How these eventually get on screen is an open question at this point. Maybe it's feasible to expose a secondary MediaSource from the main session? Another thought I had at some point is whether the 3d stereo capability of Windows.Media could be "misused" for displaying a secondary layer on top of the main video...

**"Rendering"** — It's not clear to me what has been referred to as a "custom renderer". The renderer is always libass: you give it a range of memory representing the video frame, and then it renders the subtitles into that memory - pixel by pixel. The "only" thing that's left to do is to bring this rendered image onto the screen.

**Canvas2d** — I've never used it and don't know about its abilities. Maybe it's worth trying out whether the images generated by libass can be copied onto the canvas, but it's not clear to me how to do the switch from one image to the next one at the right moment in time. That's rather the domain of a swapchain.

**SwapChain** — Having a second swapchain on top of the video swapchain seems to be the most natural approach. Whether that is possible would need to be found out, because if not, then the only way would be to create a XAML island hwnd window on a separate thread, or use any other non-WinUI3 technique, to render the subs in a win32 window on top of the video so that the DWM (desktop window manager) does the composition (typically using a gpu overlay).

**L1, L2** — Same as with B5, a perfect implementation would work with individual areas rather than full-size frames, but it also adds a lot of complication.
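As a point of reference for the "sw burn-in" step in B1/B2: mainline ffmpeg already exposes libass through its subtitles/ass filters, so outside of FFmpegInteropX the burn-in stage corresponds to something like the command below. File names and encoder settings are placeholders; this only illustrates what the sw filter stage does:

```
# sw decode + libass burn-in (B2-style); the subtitles/ass filters are libass-backed
ffmpeg -i input.mkv -vf "subtitles=input.mkv" -c:v libx264 -crf 18 -c:a copy burned_in.mkv
```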
Here's another idea. If we keep this feature only for the DirectX decoders, then we can use compute shaders to burn the image into the HW AVFrame before we send it to the MF pipeline. This will include an additional step which entails a CPU->GPU memory copy of the subtitle, and another GPU->GPU operation. Might kill a frame or two. Compute shaders should be available on all Windows devices that we target, since their feature level is a hard requirement for Windows support. This is similar to @softworkz's L1 (nice graph btw), except it is done on our side. @lukasf your first idea is technically possible. We can share memory between Vulkan and DirectX, there's something called VK_NV_external_memory in Vulkan which allows this kind of thing.
Then you'll have to deal with all the HW formats that are being used for video frames on d3d surfaces when overlaying the subs.
It's similar to B3 because it would be applied into the video frame. L1/L2 means that video and subs remain separate and are blended only during presentation (like when you have one semi-transparent window on top of another).
It's not a bitmap. Use the three-dot menu and choose edit to see it 😄
ffmpeg has hw mapping to Vulkan currently only for VAAPI and CUDA. |
I'm not sure whether shaders are even needed for a trivial overlay. |
The elephant mama in the room here is a GPU->CPU memory copy, which is what's going to kill performance no matter how you spin it. I guess we wouldn't need compute shaders, but the compute shader has the advantage that you sort of know when it will run. For a custom MPE, we can stick to the same interface of the official MPE and only add new stuff, this will allow a drop-in replacement and ease adoption. We can use frame server mode to detect video position and render subtitles accordingly. |
Yes, it's this PLUS re-uploading again (B1). In case of B2, it's just one direction, so half of B1.
Like I said above:
and this includes 4k videos. SW decoding alone is not an elephant (of any age ;-). You can easily verify this by yourself. Just call ffmpeg like this:
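(The exact command isn't preserved in this thread; a decode-only run along these lines - no -hwaccel, so pure software decoding, output discarded - shows the Speed value being referred to. The file name is just a placeholder:)

```
ffmpeg -benchmark -i some_4k_video.mkv -f null -
```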
Then you need to watch the "Speed" value in the output and also your CPU usage (because it's often not going to 100%). So for example, when you see 50% CPU usage and Speed of 6.0x, this means that your CPU is 12 times faster than needed for decoding the video in realtime (presentation at 1.0x). |
You might want to take a look at these two videos, demonstrating dozens of ways for doing subtitle burn-in: https://github.com/softworkz/SubtitleFilteringDemos/tree/master/TestRun1
Sure, a desktop CPU or a plugged in laptop CPU will deal with 4K just fine in software mode. However, as soon as you factor in mobile devices that are not always plugged in and older CPUs, things get complicated. Ideally we should support both software and hardware anyways (we can skip over the system decoders as these are black boxes and only a MF filter will help us there). |
Yes, but that's because of the memory transfer. I'm sure it will run the pure decoding (the ffmpeg command above) of 4k video comfortably above 1.0x speed, even on batteries.
A laptop on batteries will hardly ever be used to drive a 4k display. Even full HD can be considered a bit too much for a typical laptop screen. But still, FFmpegInteropX is moving around 4k frames when the source video is 4k, which is pretty bad, obviously. But there exists another trick for those cases, which isn't even specific to subtitle overlay but can generally improve performance in case of 4k playback when the output display isn't 4k anyway:

Above I said that there are no filtering capabilities available for the D3D11VA hw context. That applies to ffmpeg, but it's not the full truth. In fact, there exists an API for video hw processing for D3D11VA, and it's supported by all major GPU vendors (they probably wouldn't get certified for Windows without it). It's just that ffmpeg doesn't have an implementation for it. The two most important filtering capabilities you get from this are hw deinterlacing and hw scaling, but let's forget about deinterlacing for now and focus on scaling. Having such a filter would allow optimizing performance significantly in all cases where the source video resolution is larger than the presentation (or the max presentation) resolution. The detailed conditions need to be decided upon by every developer individually, but examples would be like:

Like I mentioned above for B2: as soon as you are doing something with the data in hardware before downloading, the cost balance changes, which means: B1 with hw downscaling before hw download becomes better than B2. And even outside of the subtitles subject, this would massively improve performance and reduce energy consumption for 4k playback on non-4k screens.
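For context, the D3D11 hw processing API referred to here is (to my understanding) the ID3D11VideoProcessor interface. Below is a minimal sketch of a fixed-function downscale blit; the function name and texture parameters are placeholders, and error handling and format negotiation are omitted. It is only meant to show the shape of the API, not a drop-in implementation for FFmpegInteropX:

```cpp
#include <d3d11.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Sketch: downscale a decoded frame (e.g. 3840x2160 NV12) to the presentation
// size (e.g. 1920x1080) using the GPU's fixed-function video processor.
// 'device'/'context' are assumed to be the same D3D11 device/context used for
// decoding; srcTex/dstTex are existing textures with suitable bind flags.
HRESULT ScaleWithVideoProcessor(ID3D11Device* device, ID3D11DeviceContext* context,
                                ID3D11Texture2D* srcTex, UINT srcW, UINT srcH,
                                ID3D11Texture2D* dstTex, UINT dstW, UINT dstH)
{
    ComPtr<ID3D11VideoDevice> videoDevice;
    ComPtr<ID3D11VideoContext> videoContext;
    device->QueryInterface(IID_PPV_ARGS(&videoDevice));
    context->QueryInterface(IID_PPV_ARGS(&videoContext));

    D3D11_VIDEO_PROCESSOR_CONTENT_DESC desc = {};
    desc.InputFrameFormat = D3D11_VIDEO_FRAME_FORMAT_PROGRESSIVE;
    desc.InputWidth  = srcW;  desc.InputHeight  = srcH;
    desc.OutputWidth = dstW;  desc.OutputHeight = dstH;
    desc.InputFrameRate  = { 30, 1 };   // nominal, not critical for a plain blit
    desc.OutputFrameRate = { 30, 1 };
    desc.Usage = D3D11_VIDEO_USAGE_PLAYBACK_NORMAL;

    ComPtr<ID3D11VideoProcessorEnumerator> enumerator;
    ComPtr<ID3D11VideoProcessor> processor;
    videoDevice->CreateVideoProcessorEnumerator(&desc, &enumerator);
    videoDevice->CreateVideoProcessor(enumerator.Get(), 0, &processor);

    D3D11_VIDEO_PROCESSOR_INPUT_VIEW_DESC inDesc = {};
    inDesc.ViewDimension = D3D11_VPIV_DIMENSION_TEXTURE2D;
    ComPtr<ID3D11VideoProcessorInputView> inView;
    videoDevice->CreateVideoProcessorInputView(srcTex, enumerator.Get(), &inDesc, &inView);

    D3D11_VIDEO_PROCESSOR_OUTPUT_VIEW_DESC outDesc = {};
    outDesc.ViewDimension = D3D11_VPOV_DIMENSION_TEXTURE2D;
    ComPtr<ID3D11VideoProcessorOutputView> outView;
    videoDevice->CreateVideoProcessorOutputView(dstTex, enumerator.Get(), &outDesc, &outView);

    // The scaling itself happens in this blit, on the GPU's video engine.
    D3D11_VIDEO_PROCESSOR_STREAM stream = {};
    stream.Enable = TRUE;
    stream.pInputSurface = inView.Get();
    return videoContext->VideoProcessorBlt(processor.Get(), outView.Get(), 0, 1, &stream);
}
```

The blit runs on the video engine, so the decoded 4k surface never has to leave GPU memory; only the already-downscaled surface continues through the pipeline.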
And I completely agree. However, any real world scenario of decoding involves memory transfer at some point. And the CPU does bear some responsibility for it. Caching, memory controllers etc will all play a part in it.
You can always find those ridiculously specced "business" laptops that rock a 4k display with an iGPU that can barely handle Windows animations smoothly at that resolution xD

We could technically implement that scaling optimization at our level. We know we can dynamically change the resolution of the video stream descriptors and MF will obey. However, wouldn't this downscaling already happen anyway? We are basically zero-memory-copy throughout all the decoding loops; MF will do the downscaling as it has to. I am not sure if we would actually win anything from this? It would just be us doing the downscaling instead of MF. I am speaking about the general implementation of this, not specifically for sub animations (that part is pretty clear).
Media Foundation? How does that come into play? AFAIU, FFmpegInteropX is decoding via ffmpeg using D3D11VA hw decoders, and the output from ffmpeg is D3D surfaces. Each time the media player element fires its event, we give them one of the D3D surfaces. Is that not right?
Yes, that's a really good question. Using the hw scaling right after decoding has two advantages:

1. It reduces GPU memory consumption

There's always a pool of hw frames involved in decoding. The decoder needs to have a certain number of full-size frames to resolve references (forward and backward). These frames are a fixed requirement. The decoder doesn't produce exactly one frame right at the moment when it needs to be displayed, so there's another number of frames which are needed for queuing up between the decoder (between possible filters) and the final output of ffmpeg before they are actually provided for display. And this second number of hw frames is where GPU memory is reduced when scaling down each frame immediately after it gets out of the decoder. Scaling down 4k to FHD reduces the amount of memory by 75%.

2. Fixed-function block scaling: you can't get it any cheaper

Zero-copy sounds great, because copying is expensive, but what's even more expensive is scaling. When you supply the D3D surfaces to the media player element for display, and those are 4k while the display is just 1920, these surfaces need to be downscaled to the exact size of the element's panel. And who performs that scaling? The GPU. We probably cannot prevent the GPU scaling from happening at all (or maybe there's a property in the mp element?).
Without access to the MS source code it is impossible to know, but I believe the inner workings are something similar to this: FFmpegInteropX --> MediaPlayer --> MediaPlayerElement. A MediaElement would basically be a MediaPlayerElement with an abstracted MediaPlayer attached to it. MediaPlaybackItem will match MediaTopology. I am pretty sure MediaPlayer will do the scaling you are referring to.
@brabebhin - I believe there are a number of inaccuracies in your post. Let's just wait for @lukasf to clear things up. 😄 |
There is a mistake, which I have since corrected ^^ |
It is absolutely clear that MediaPlayer is based on MF. MF is the way how media is done in Windows, it is the replacement of DirectShow. All the error messages you get from MediaPlayer have MF error codes, you can register IMFByteStreamHandlers and they will be automatically pulled in by the MF engine. You can even obtain some of the MF services from the MediaSource, which is how we get the D3D device. I also assume that internally the MediaPlayer is a wrapper around IMFMediaEngine, which has a very similar API surface and was introduced in a similar time frame, as a replacement for the older MFPlay APIs.

Sure the GPU can do scaling in no time. But of course, the same super fast scaling is used when rendering the HW frames on the same HW device. You won't gain any performance benefit by forcing a downscale after decode. In fact you will lose a (tiny) bit, because that means there will be two scale operations, one after decode, and a second scale to the actual target size (unless you know it exactly upfront). If you don't know the exact size, the double scaling will not only cost performance, but it will also introduce scaling artifacts which you don't have if you only scale once, directly to the final target size (that would be the bigger concern for me here). VRAM is really not an issue, it is only a bunch of frames that are decoded upfront, so even 4K video is easily handled on iGPUs without any issues.

I totally disagree that HW decoding is overrated. Sure my high power dev machine can easily do it. But a vast majority of devices out there are old and rather poorly powered and will never be able to decode a high bitrate 4K HEVC on the CPU. A lot of devices are sold even with Celeron CPUs. HW decoding is the only way to bring smooth high res video to those devices. And even if a device can SW decode, it will use at least 10x more CPU power compared to the dedicated HW decoder engines. They are so much more efficient. That means, a laptop that has enough battery to easily play 2h of video on the HW decoder will probably be out of battery after half an hour of SW decoding. And it will make a lot more noise. I would never use a player which cannot do HW decoding on my laptop, because of noise and battery lifetime concerns.
There's no doubt about that. But @brabebhin wrote that MF would downscale the video which can be understood in two ways:
Of course not. Incorrect.
Incorrect. You do.
These are impossible to compare, and that factor is pure fantasy. It appears that you have mistakenly assumed that I'd have been spilling out some opinions and assumptions above. You can pick any of the details I stated above and I'll take you into that subject as deeply as necessary until you acknowledge that I'm right about it. My intention was to share some of the knowledge I have gained over time, especially on things that are not like you would normally think they are. Don't know how I seemingly created the impression of doing some gossip talk.
This is not what I mean, but the claim that there's no graph is likely incorrect.
This is also a claim that you cannot really make unless you have access to MS's source control, in which case I will likely bombard you with more questions haha. MF does have something to do with the presentation layer. For non-frame-server implementations, MediaPlayer likely uses something like this: Just because MediaPlayer isn't by itself a UI element, it doesn't mean it doesn't have anything to do with the presentation layer. Taking in some parameters to render to, as opposed to encapsulating them, is simply a separation of concerns thing.
Not sure what you mean by that..
No. The rule is: new frame - new game! These "bitmaps" are not independent from one another. Neither does any of the bitmaps correspond to a specific element (like a letter). Please look at my image above with the grey background and the black/white regions. I've taken great effort in creating it to make this more understandable. The only thing we can get from libass is an indication whether there has been a change from the previous frame. If not, we can re-use the overlay image that we have generated for the previous frame.
One way to view that multi-bitmap output from libass is to view it as an "awkward" - yet compact - representation of an overlay image. If you look at the "do some math" figures I've given above, you need to acknowledge that - even though the number of 300 bitmaps looks weird - it takes only 15 MB of data to represent an overlay image which would normally have 120 MB. Another way to look at it would need to remind us a bit about how UI drawing was done in earlier days. For example in case of Windows GDI drawing, you had to create a "Pen" of a distinct color and then you could draw pixels with that Pen. Once you are done, you release that Pen and then you create a Pen of a different color and do the painting for that color, and so on. Creating full bitmaps for overlay would have never been feasible in earlier days, and even today, you would rather want to avoid creating a 120 MB overlay image for each frame in a 4k video. |
At this point I think we're just mumbling in the dark. The first step that needs to be done here is to integrate libass in our build system. I currently don't have much time to look at this, so maybe either @lukasf or @softworkz can take this on. Otherwise I'll have to postpone it to around Christmas time (assuming anarchy doesn't take over here). Once we have libass in, we can better understand what's going on, because we can actually see what it outputs.

Dusting off my compute shader skills: we can use structured buffers with the exact struct layout from the libass image struct. Yes, we need to pass a contiguous array and not a linked list, but that shouldn't be a problem. Then we can use compute shaders and dispatch 1 kernel for each bitmap in the array to do the job. We can set a kernel size of 32, which should maximize hardware usage: Context->Dispatch(num_bitmaps, 1, 1). The texture2D will be created and will stay in GPU memory (so we save the 4K frame data transfer from CPU to GPU). A rough sketch of the host side is below.
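Purely as an illustration of the idea described above (which is revisited further down in the thread), the host side of such a dispatch could look roughly like this. The AssImageRecord layout, the variable names and the shader itself are hypothetical; only the D3D11 calls are real API:

```cpp
#include <d3d11.h>
#include <wrl/client.h>
#include <cstdint>
#include <vector>
using Microsoft::WRL::ComPtr;

// Hypothetical flattened copy of the libass image list (one record per bitmap).
struct AssImageRecord
{
    uint32_t dstX, dstY;      // placement inside the video frame
    uint32_t width, height;   // bitmap dimensions
    uint32_t stride;          // bytes per row in the alpha buffer
    uint32_t color;           // RGBA color as provided by libass
    uint32_t alphaOffset;     // offset into a packed alpha byte buffer
    uint32_t reserved;
};

// Upload the records as a structured buffer and dispatch one thread group per bitmap.
void DispatchSubtitleBlend(ID3D11Device* device, ID3D11DeviceContext* context,
                           ID3D11ComputeShader* blendCS, ID3D11UnorderedAccessView* frameUAV,
                           const std::vector<AssImageRecord>& records)
{
    D3D11_BUFFER_DESC desc = {};
    desc.ByteWidth = UINT(records.size() * sizeof(AssImageRecord));
    desc.Usage = D3D11_USAGE_DEFAULT;
    desc.BindFlags = D3D11_BIND_SHADER_RESOURCE;
    desc.MiscFlags = D3D11_RESOURCE_MISC_BUFFER_STRUCTURED;
    desc.StructureByteStride = sizeof(AssImageRecord);
    D3D11_SUBRESOURCE_DATA init = { records.data(), 0, 0 };

    ComPtr<ID3D11Buffer> buffer;
    device->CreateBuffer(&desc, &init, &buffer);

    D3D11_SHADER_RESOURCE_VIEW_DESC srvDesc = {};
    srvDesc.Format = DXGI_FORMAT_UNKNOWN;
    srvDesc.ViewDimension = D3D11_SRV_DIMENSION_BUFFER;
    srvDesc.Buffer.FirstElement = 0;
    srvDesc.Buffer.NumElements = UINT(records.size());
    ComPtr<ID3D11ShaderResourceView> srv;
    device->CreateShaderResourceView(buffer.Get(), &srvDesc, &srv);

    context->CSSetShader(blendCS, nullptr, 0);
    context->CSSetShaderResources(0, 1, srv.GetAddressOf());
    context->CSSetUnorderedAccessViews(0, 1, &frameUAV, nullptr);
    context->Dispatch(UINT(records.size()), 1, 1);   // one group per bitmap, as suggested above
}
```

Note that, as pointed out a few comments later, overlapping bitmaps have to be blended in order, so a one-group-per-bitmap dispatch is not the final answer; this only shows the mechanics.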
Can you elaborate on that? If I remember correctly, that sample calculation was about a text area covering 15% of a 4K image. A 4K image in ARGB32 takes 24MB of memory. 15% of that area is 3.7MB. And that's full true color image quality including alpha, compared to the strange 15MB output we get from libass, which sure has worse quality (though probably not very noticeable). How is the latter more efficient? And it's not only the amount of memory, but also the tedious creation of all those bitmap layers in the lib, and then having the client reading out all these bitmaps and applying to a different, temporary bitmap over and over before rendering. It would be vastly more efficient to just output an ARGB32 for every changed area. Less memory and a simple copy operation vs the tedious copying of about 4x as much memory. Or am I missing something? |
I'm not a libass developer, but I do work with related software (Aegisub) so I can clear up a couple of the misconceptions here. Like shown further above, libass returns a linked list of
So this is not true (and if it was, it would indeed be extremely inefficient). Rather, this event would generate three bitmaps: One for the fill, one for the border, and one for the shadow.
These subtitles use karaoke text to achieve per-syllable karaoke styling. Because of this, the text cannot be rendered to a single combined bitmap, and is instead split into individual syllables, with four bitmaps (fill, border, shadow, and karaoke fill) being output per syllable. However, these bitmaps are also only as large as the characters they contain, so this does not actually generate any significant overhead (compared to having four combined bitmaps for the entire line). As the statistics show, there are a total of 40.000 pixels, which corresponds to (e.g.) about four 20 x 500 images. If you're looking for efficient ways to blend the images returned by libass, you could look at mpv (specifically https://github.com/mpv-player/mpv/blob/a283f66ede58e0182ac8cd4c930238144427fa74/sub/sd_ass.c). mpv packs all the alpha bitmaps into a single image, which is then passed to the GPU together with information about which rectangles should be blended where in which color. |
Hi @arch1t3cht, thanks a lot for the detailed clarifications! This is actually how I imagined it to work and it just makes so much more sense. I will also check out the mpv references. It sounds like the most efficient way to do the upload and blending in one operation. Thanks for the hints! |
I don't think we can use parallelization on that layer, since there is a strict order in which the bitmaps must be applied. But anyways, like @arch1t3cht just mentioned, we have a lot less images to render. We should look at how mpv does this, maybe we can follow a similar approach. I guess a pixel shader could be better suited here. I can try to take a go on getting a libass build running, I hope I can find some spare time for that in the next weeks. |
Thanks @arch1t3cht ! @lukasf - The 120 MB I had in memory were for 4K 10bit HDR frames. 4k 8bit RGBA is not 24 but 3840 * 2160 * 4 = 32 MB |
HDR typically has 10bit instead of 8bit per channel. Even in case of rare 12bit HDR, you get nowhere near that number, even for the full frame (and we were talking about the 15% overlay area).
I guess the confusion all comes from this ass_image header definition, which clearly states that the buffer is a 1bpp alpha buffer. 1bpp means 1bpp. But it seems that in reality this is a 8bpp alpha buffer, and this is the good news and saves us from a lot of pain and headaches! ^^ I had looked into the libass alloc functions, where it already seemed to be 8bpp to me. That's why I came up with it. |
Yes, what I said was based on the original info we had from that header. Since that is no longer true, then the approach doesn't make sense anymore. |
And I had taken a quick look at whether ffmpeg uses the alpha from the RGBA color value (but no further). So there are two alpha values - the one from the color and the ones from the bitmap - which I don't quite understand then how they relate to each other. |
They should just both be applied (i.e. multiplied). The alpha in the bitmap comes from rasterization, blurring, and clipping, while the alpha part of the color is the one set by alpha tags or fades. Separating the two makes it easier for libass to cache the results of rasterization and blurring. |
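To make the "both alphas are applied" point concrete, a plain CPU reference blend could look like the sketch below. It mirrors how ffmpeg's libass-backed filters combine the two values; the function name and the BGRA frame layout are my own assumptions, and it is deliberately unoptimized:

```cpp
#include <cstdint>
#include <ass/ass.h>

// Blend one frame's worth of libass output onto an 8-bit BGRA buffer.
// 'img' is the linked list returned by ass_render_frame();
// 'frame' has 'frameStride' bytes per row.
static void BlendAssImages(const ASS_Image* img, uint8_t* frame, int frameStride)
{
    for (; img != nullptr; img = img->next)
    {
        // img->color is packed RGBA, where the last byte is *transparency*
        // (0 = opaque), so the effective color alpha is 255 minus that byte.
        const uint8_t r = (img->color >> 24) & 0xFF;
        const uint8_t g = (img->color >> 16) & 0xFF;
        const uint8_t b = (img->color >> 8)  & 0xFF;
        const uint8_t colorAlpha = 255 - (img->color & 0xFF);

        for (int y = 0; y < img->h; ++y)
        {
            const uint8_t* src = img->bitmap + y * img->stride;
            uint8_t* dst = frame + (img->dst_y + y) * frameStride + img->dst_x * 4;

            for (int x = 0; x < img->w; ++x, ++src, dst += 4)
            {
                // Per-pixel coverage from the 8-bit alpha bitmap, multiplied with
                // the alpha carried in the color value (fades, alpha tags).
                const uint32_t a = uint32_t(*src) * colorAlpha / 255;
                if (a == 0) continue;

                dst[0] = uint8_t((b * a + dst[0] * (255 - a)) / 255); // B
                dst[1] = uint8_t((g * a + dst[1] * (255 - a)) / 255); // G
                dst[2] = uint8_t((r * a + dst[2] * (255 - a)) / 255); // R
                dst[3] = uint8_t(a + dst[3] * (255 - a) / 255);       // A ("over")
            }
        }
    }
}
```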
I meant uncompressed and unpacked frames like used when processing video (even though it's 60MB only, probably the 120 came from the calculation of data transfer that is needed for dealing with such frames with hwdownload/hwupload).
Sure, but the comparison I made was against the case when libass would output full overlay frames, and the linked list can have content for the full frame area, not for a fixed subset of 15%.
Yesterday I added some code to measure the timing for blending the ass output onto an image:
What it shows is that the number of pixels affected plays a much larger role than the number of ass images. |
@arch1t3cht - If you are working on Aegisub, you must be familiar with VSFilter as well (which Aegisub appears to be using by default). Since FFmpegInteropX is not cross-platform, what would be your advice regarding it, do you think it could be a better choice?
VSFilter is the reference implementation for the ASS format (though the situation is somewhat complicated since there exist various different versions of it), but it is not really actively developed any more. Libass on the other hand is actively maintained and in general faster than VSFilter. In particular, it outsources the blending to the user, who can then decide to blend on the GPU if they want - to my knowledge this is not possible with VSFilter. So my recommendation would be to use libass, but it's possible that other libass developers have other thoughts on this (and note that I don't know much about FFmpegInteropX).
Blending in the GPU is probably the right approach for us as well. |
Like I said above, I don't think it makes sense to send all individual bitmaps to the GPU, there needs to be some preprocessing on the CPU side. Look at these figures from overlaying some more complex animated ass subtitles onto a 4k video:
The count of those bitmaps alone is so huge (and proves that I didn't exaggerate) that it makes gpu processing inefficient, as this cannot be parallelized reasonably. It's like when you have a fleet of riding mowers (shader units): you can cut the grass of a football field very quickly (letting 10 of them drive in parallel), but it doesn't help for your family garden at home. It also doesn't help when you need to treat a hundred family gardens but are forced to do one after another. Most of those individual bitmaps are too small for any parallel processing to help.
I compared CPU usage when playing complex ass subs on a 4k screen with Aegisub (the video preview maximized) and there doesn't seem to be much difference between libass and VSFilter..?
Do you know what kind of output VSFilter provides? Normally it's a DirectShow filter but I couldn't detect an active filtergraph, it seems Aegisub is using it in a different way?
If it would provide RGBA region images, as fast as libass creates its output, we would save the step of re-assembling the ass images, that's what I'm wondering about. |
Aegisub is actually not too efficient at subtitle rendering since it blends libass subtitles on the CPU without any big optimizations (i.e. no SIMD, etc). Benchmarking subtitle rendering is hard, especially when comparing different renderers, but one option is to play the subtitles on MPC-HC (clsid's fork) with the internal subtitle renderer (a VSFilter variant) and libass, and compare the number of dropped frames. It's also worth noting that this very much depends on what "complex" means here. As has also been noticed above, the main performance bottleneck in rendering is the total bitmap size. Movement and transforms are often seen as "complex" but aren't necessarily any more performance-heavy than static subtitles. To give a concrete example of where libass is faster than VSFilter: Its blurring implementation is much more efficient. VSFilter's blurring gets much slower with increasing blur radius, while libass's blur implementation is effectively constant time with respect to the blur radius. Moreover, libass has much more extensive caching, so effects like clip-based color gradients may be more efficient.
Aegisub uses VSFilter via CSRI, but I'm afraid I can't tell you much more than that. |
The amount of CPU processing that will actually be needed remains to be seen. We should off-load as much as possible to the GPU. Thanks to @arch1t3cht I have a fairly good idea how this will work but until I see the actual outputs and I can play with libass and render some of these frames myself, it really is hard to just imagine it and figure out the best approach :) |
http://streams.videolan.org/samples/sub/SSA/subtitle_testing_complex.mkv and a 4k version created like this:
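(The exact command used isn't preserved here; an upscale along these lines, keeping the audio and the ASS subtitle track intact, would produce such a file - the parameters are illustrative:)

```
ffmpeg -i subtitle_testing_complex.mkv -map 0 -vf "scale=3840:2160:flags=lanczos" -c:v libx264 -crf 18 -c:a copy -c:s copy subtitle_testing_complex_4k.mkv
```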
It looks like MPC-HC is rendering the subtitles at a lower resolution and upscales them for display.
Can you tell what CSRI is? Never heard of that.. Finally, the question of all questions: What are you working on, the most recent version of Aegisub is 3.2.2 from 2014... Or is there any newer somewhere? 😆 |
VSFilter has many interfaces.
XySubFilter uses SubRenderIntf, originally designed for madVR but nowadays also supported by MPC-HC, which outputs a list of RGBA bitmaps. I don’t know how/whether XySubFilter actually combines small bitmaps into bigger RGBA ones, but at any rate, the final blending onto video is done by the consumer. MPC-HC’s internal VSFilter may have something similar of its own.
When using the internal renderer as arch1t3cht suggested (or when using XySubFilter), it doesn’t. You may be using an external VSFilter/DirectVobSub: check your settings in Options → Playback → Output (or in older versions, directly in Options → Playback).
Assuming you’re using the latest version from clsid2, the libass checkbox is tucked away in Options → Subtitles → Default style. |
I made some comparisons:
So, the places to look at are MPV and VLC. MPV does a lot of things with shaders, which puts significant load on GPUs; VLC is the most efficient player among all. Their use of libass might be more straightforward, but it's just a guess. In terms of what ASS rendering adds to the CPU and GPU loads, they appear to be similar.
Performance aside, it may also be less correct. It certainly has been in the past. Exercise caution. mpv is the exemplary existing user of libass that’s known to configure and use everything correctly. |
@astiob - Thanks a lot for the comment! You were right, I needed to block loading of external VSFilter implementations, then it played fluently with the internal renderer and also with libass enabled (I've updated my post above accordingly). In both cases I've seen very high CPU load, very different from VLC and MPV.
Yup, latest from clsid2. Found it, thanks, awkward placement indeed.
Then it's definitely worth looking at it. I'm only familiar with ffmpeg's way of using it. |
In particular, VLC always renders its subtitles at the video's storage resolution and blends them to a single RGBA image, which is then scaled to the display resolution. This can cause artifacts, in particular when the display resolution is lower than the storage resolution (edit: I mean VLC's scaling specifically here. In general there can be good reasons for rendering at storage resolution, in particular for typesetting). This may also be the reason why VLC appears faster than mpv to you: If you're watching subtitles on a 1080p video in fullscreen on a 4k display, VLC will render subtitles at 1080p while mpv will render at 4k, which is slower. (You can make mpv render subtitles at the video's storage resolution using |
It didn't. I said it seems equal.
Right, I've seen that before. It's a bad scaling algorithm in place.
From the screenshot images, you can see that what you said doesn't apply to my test - in case you know that video: I had created a version upscaled to 4k, to avoid players rendering the subs at the original video resolution 😄
Hi,
I'm using FFmpeginteropX for a long time, thank you for your great work.
It seems ShiftMediaProject updated all the libraries to the latest versions around 3 weeks ago.
I read in #384 that @lukasf said:
Since it's updated, can we at least use the libass version and make it work in FFmpegInteropX without touching the FFmpeg builds? I tried to compile ShiftMediaProject's libass version and all its dependencies, and I managed to build them all, but there is no winmd file in the output folder.
Can you please help me in this?
Thanks in advance.