libass support #439
Comments
Hi, may I ask why you need libass? I don't think you can use libass easily without modifying the ffmpeg builds, because all of the add-on libraries need to be linked into ffmpeg, otherwise ffmpeg wouldn't know it can use them. You technically can use libass in FFmpegInteropX, but that would require a code change that's not trivial.
That's because ffmpeg is a C library and does not provide .NET interfaces, which is what .winmd files are for.
The av-libs built by Shift-Media-Project include all this.
It would be rather easy, but there's a little caveat to that: you'll lose hardware acceleration, because the subtitle "burn-in" can only happen in a sw filter. Even "hw decoding to cpu mem" is pointless to do, because experience has shown that software decoding is less resource intensive than "hw-decode, hwdownload, hwupload" in most cases. The hwupload alone is already a killer for high-res videos like 4k. That little caveat is a KO for the idea, unfortunately.
I might have misunderstood, but I was under the impression that OP wanted libass from SMP with our own ffmpeg builds, which is ofc not possible. Filter burn-in isn't the only possible way to work with libass. We can expose ssa/srt as image cues and render the images ourselves to feed into the sub stream. Then the media sink will handle rendering/burn-in/whatever.
Hi @rmtjokar, long time no see!

I think libass has recently added a meson build system. It should be pretty easy now to directly integrate it, without having to resort to a SMP fork. The SMP build system has some disadvantages for us: not all of its libs have UWP targets, and if they do, they usually don't have ARM targets. And their project files are horribly messy, which makes it hard to maintain a fork with added WinRT+ARM configs. So using SMP is always a last resort for me.

As others have noted, the question is: what are you trying to accomplish with adding libass? You won't be able to use it without bigger changes in our lib. Subtitles are not normal streams, and our effects system does not work with them. We would need explicit code to post-process subs with libass (through ffmpeg filters).

A big problem is that libass does not support GPU rendering, and copying frames from GPU to CPU memory for rendering is very expensive (and afterwards they need to be copied back!). Which means that we cannot really use libass to directly render the subs into video frames.

And the subtitle rendering system in Windows is rather poorly implemented. The bitmap subtitle rendering is intended for static bitmap (text) frames, not animated live frames. I am pretty sure that it would be horribly jaggy if we'd try to feed animated subs into it. We had to use quite elaborate workarounds only to get clean, flicker-free static subs.

The only use I could currently see is for transcoding static (non-animated) ssa/ass subs into bitmap subtitles, using the libass rendering engine, which is surely better than the Windows rendering system. The downside is that we don't know which target size to render to, which might lead to more or less noticeable scaling artifacts. And I am not sure how flexible the libass filter in ffmpeg is - we would somehow have to disable animations.

Oh, and keep in mind that bitmap subtitle rendering is still broken in WinUI. Or has this bug "already" been resolved? I guess not, but I did not check for a long time. Has anyone tried with a recent version @brabebhin @softworkz?
Hi @lukasf I haven't heard anything on the MPE front, but I would guess the bug is still there as all the winui 3 effort seems to have gone into supporting AOT and designers. I have since developed my own MPE based on directx and frame server mode, as well as custom sub rendering with win2D. But this shouldn't be a show stopper for us. We can use UWP as a benchmark, since both UWP and winUI use the same MF interfaces. As an ugly workaround for size rendering we could simply ask the user to provide a size for us to render against and have the user update it on resize. |
Sure we could pass in a size, but at least when using the Windows rendering, it is not even clear at which exact position and size a sub is rendered. Of course this is not a problem when using custom subtitle rendering (which is probably the better approach anyways). |
For sub animations, custom rendering would be the only way to do it. Windows rendering is just 50% arcane. The regions containing a cue direct where subtitles are rendered. The region itself has coordinates that determine where it will be rendered on screen (these may be absolute positions or percentages). IIRC, image cues are always rendered in their own region and have pretty much absolute positioning and size. For text cues it gets more arcane, because Windows groups them by regions, and then inside the region you have some sort of flow directions. For whatever ungodly reason they also use XAML composition for rendering, which I would guess is why we observe flickering.

The MF interfaces provide A LOT of customization and allow applications fine-grained control over subtitles, but it seems MPE chooses to only implement a few of the combinations. Which is why it seems arcane, as it only implements whatever suits them. It's almost as if the MF team had nothing to do with MPE.

At this point I think the conversation moves towards whether we want to also become a rendering library and not just demux+decoding. In the end I think creating our own MPE isn't that hard. If we do want to create our own MPE, we could completely forget about MF's way of doing subtitles and just render them directly inside FFmpegInteropX. We'd still have to support Windows rendering too.
My Subtitle Filtering patchset includes a text2graphicsub filter, which allows converting text subtitles (including ASS) to graphical subtitles like dvd, dvb or x-subs, so all you need to do is add some filters, and at the end you get graphical subs like from any other file. It also has an option to strip animations. Yet my general view on this is: most of those who have ASS subtitles expect animations to work. For non-animated subtitles we are not 100% accurate, but pretty close to libass. Starting any work to integrate libass without animation support is rather pointless, as it won't make anybody happy. So either go for making libass fully work, including animations, or just leave it. IMO, putting effort into this is only justified when it gets it working in full effect.
Hi, sorry, I had to go out of the city for the past week.

@brabebhin Using libass is mainly for supporting ASS effects and animations. Its renderer is quite fast and smooth, and other player apps like PotPlayer, KMPlayer, and even MXPlayer on Android use this library. Maybe we can do the same.

@softworkz Thanks for the information. As I saw in PotPlayer, there are two ways of showing subtitles: "Vector Text Renderer" and "Image Text Renderer," both of which have better quality than displaying text in a UWP app (using SubtitleCue or even a Win2D Canvas with the same font/style). It's strange that they have better text rendering quality. I was thinking maybe I can use libass.dll directly in my app (just giving the whole ASS text to libass, for external subtitles only), so I created a wrapper around the libass.dll x64 version with P/Invoke in a WPF app (because I couldn't with UWP), but I got stuck.

Another option I've been using for the past eight years is Win2D and its CanvasControl. I created a new renderer using ChatGPT, and it seems you can manipulate the fading effect via the color's alpha channel without touching anything else. (Take a look at this:)

Rec.0002.mp4
So I found a project which uses libass: and I could've used it in UWP with an Image control, and this is the result: Pot Player > UWP 2 with sound > It's actually quite good. It's not as smooth as PotPlayer, but it's still very good.
I think this ultimately comes down to whether we want to venture into the land of rendering subtitles. So far we've been strictly a demux+decoding library. MPE is quite limited when it comes to subtitles, and not much can be done about it. I could devote some time to this once a decision is made.
@softworkz Your changeset is exactly what would be needed for rendering static subtitle frames with libass. But I think we all agree that static subs are not the intention here. If we'd add libass, it is for the animated subtitles.

I am not such a big fan of writing our own renderers. While it is very flexible, it makes it more difficult to use our lib. Currently, our output can just be put in a MPE and it all works. If some of our features require custom renderers, we would break with that concept and require devs to migrate their apps from MPE to our custom rendering. Also, I think it is difficult to synchronize the subtitle renderer with the video renderer. Decoding is decoupled from rendering, and we do not get much information about the actual playback position. Text sub rendering is easier, since we get events when a new sub is to be shown.

It would be great if we could find a way to use libass and somehow integrate it in our decoding chain - without killing performance. I am trying to brainstorm in that direction.

Idea 1: ffmpeg recently added vulkan as a new cross-platform hw decoder type, and included a bunch of filters which can run directly on vulkan frames. There is a filter called "hwmap", which can be used to map frames from one hw decoder type to a different hw decoder type (or a different gpu device of the same type). It has an option to "derive" its output hw context from the device which was used in the input frames. It seems that if the underlying device on input and output hw context is the same and compatible, then hwmap can directly map the frame, without copying. Setting the mode to "direct" can enforce this. If this would indeed work, then we could achieve hw accelerated gpu rendering: we could render the frames from libass onto a transparent hw texture and overlay this on the video using hwmap(vulkan) -> overlay_vulkan -> hwmap(d3d11). Of course, I don't even know if the hwmap stuff really works like that, and how easy it is to get a ffmpeg build with vulkan support. A downside of the approach is that we would be locked to the video output resolution (a sketch of such a filter chain is below).

Idea 2: We could create a second MediaSource, which just contains the rendered subtitle as a video stream with alpha channel. Users of the lib would have to add a second MPE layered above the first one, and link them using a MediaTimelineController. I never tried MediaTimelineController, so I don't know how well it actually works. But at least theoretically, it should take care of the syncing. Frankly, this also requires quite some modification on the app side.

Not sure if any of this makes sense, just trying to explore some alternative approaches...
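For illustration only, Idea 1 expressed as an ffmpeg filter chain could look roughly like the sketch below. The hwmap, hwupload and overlay_vulkan filters do exist in ffmpeg, but whether a d3d11va-to-vulkan mapping is actually supported is exactly the open question here (see the following comments), so treat this as a thought experiment, not a working graph:

```
# hypothetical filtergraph: [vid] = d3d11va-decoded frames, [subs] = libass output as RGBA frames
[vid]hwmap=derive_device=vulkan:mode=direct[v];
[subs]format=rgba,hwupload[s];
[v][s]overlay_vulkan[o];
[o]hwmap=derive_device=d3d11va[out]
```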
Here's the full tree of options from my point of view:

```mermaid
mindmap
  root((Subtitle<br>Overlay))
    Burn into video
      **B1**<br>hw decode<br>hw download<br>sw burn-in<br>hw upload
      **B2**<br>sw decode<br>sw burn-in<br>hw upload
      **B3**<br>sw render blank frame<br>hw upload<br>++++++++++<br>hw decode<br>hw overlay
      **B4**<br>sw render half-size frame<br>hw upload<br>hw upscale<br>++++++++++<br>hw decode<br>hw overlay
      **B5**<br>sw render partial sprites<br>hw upload<br>hw upscale<br>++++++++++<br>hw decode<br>hw overlay
    Presentation Layering
      **L1**<br>render full frames<br>Copy to D3D surface<br>overlay manually
      **L2**<br>render partial sprites<br>Copy to D3D surfaces<br>overlay manually
```
**Burn into video options**

**B1** — This is the worst of all options, because you need to copy every single uncompressed frame from GPU to CPU memory and then again from CPU memory to GPU memory.

**B2** — The advantage of hw decoding is often over-estimated. Other than in the case of encoding, video decoding can easily be done by CPUs as well. The big advantage of using hw decoding is that the large amounts of data never need to be copied between system and gpu memory, because gpu memory is always the eventual target. HW decoding followed by immediate downloading to cpu memory almost never makes sense. But for this case of doing sw burn-in of subtitles, B2 is significantly better than B1, because the memory transfer hits much harder than the sw (instead of hw) decoding.

**B3** — This is another scenario which requires my subtitles patchset and which will go into our server transcoding process shortly (it's there already, just not unlocked for the public). While it still involves uploading the overlay frames from cpu to gpu memory, like in case of B2, there's still a massive advantage: you don't need to upload at the same rate as the video fps.

Unfortunately, in the context of FFmpegInteropX it's not straightforward to go that way, because you cannot use this with the D3D11VA decoders. Instead, you would need to use the vendor-specific hw contexts and filters (like overlay_qsv or overlay_cuda). AFAIK, Vulkan is not stable enough on Windows yet; at least it doesn't work reliably with MPV player, even though it supports it. The one other (stable) option is to use OpenCL. There's also an overlay_opencl filter in ffmpeg, I'm just not sure whether you can hwmap from a d3d11va context to OpenCL. I know that d3d11va-to-opencl works for AMD, and I know that it works from qsv-to-opencl and from cuda-to-opencl. But I'm not sure whether d3d11va-to-opencl works with Intel and Nvidia gpus.

**B4** — This is a low-profile variant of B3. By using only half-size frames for the subtitle overlay, you save 75% of the memory bandwidth. You wouldn't do that for 720p or lower-res videos, but for 4k it's a good way to optimize for performance.

**B5** — That would be the "holy grail": instead of full frames, you would use one or more smaller surfaces to cover only the regions with subtitle content. It's difficult to implement though, because you are typically working with a pool of D3D surfaces, and that becomes difficult to manage when the sizes change dynamically. This would require modifications to the overlay filters in ffmpeg.

**Presentation Layering options**

**Basic** — The idea of not touching the video frames at all has a lot of appeal, as the nature of subtitles is significantly different from the actual video. This means that while the case of full-screen + full-fps needs to be accounted for, the implementation doesn't need to permanently create full-size overlays at the same rate as the video.

**ffmpeg side** — I don't think it's required to have a totally separate ffmpeg instance for subtitles. This would cause a lot of problems and a lot of work. ffmpeg can have more than a single video output, so for this case, one output would be the video frames as D3D surfaces and the other the rendered subtitles as software frames (L1) or a collection of multiple areas per frame (L2). How these eventually get on screen is an open question at this point. Maybe it's feasible to expose a secondary MediaSource from the main session? Another thought I had at some point is whether the 3d stereo capability of Windows.Media could be "misused" for displaying a secondary layer on top of the main video...

**"Rendering"** — It's not clear to me what has been referred to as a "custom renderer". The renderer is always libass: you give it a range of memory representing the video frame, and then it renders the subtitles into that memory - pixel by pixel. The "only" thing that's left to do is to bring this rendered image onto the screen.

**Canvas2d** — I've never used it and don't know about its abilities. Maybe it's worth trying out whether the images generated by libass can be copied onto the canvas, but it's not clear to me how to do the switch from one image to the next one at the right moment in time. That's rather the domain of a swapchain.

**SwapChain** — Having a second swapchain on top of the video swapchain seems to be the most natural approach. Whether that is possible would need to be found out, because if not, then the only way would be to create a XAML island hwnd window on a separate thread, or use any other non-WinUI3 technique, to render the subs in a win32 window on top of the video so that the DWM (desktop window manager) does the composition (typically using a gpu overlay).

**L1, L2** — Same as with B5, a perfect implementation would work with individual areas rather than full-size frames, but it also adds a lot of complication.
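As a point of reference for the "sw burn-in" step in B1/B2: mainline ffmpeg already exposes libass through its subtitles/ass filters, so outside of FFmpegInteropX the burn-in stage corresponds to something like the command below. File names and encoder settings are placeholders; this only illustrates what the sw filter stage does:

```
# sw decode + libass burn-in (B2-style); the subtitles/ass filters are libass-backed
ffmpeg -i input.mkv -vf "subtitles=input.mkv" -c:v libx264 -crf 18 -c:a copy burned_in.mkv
```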
Here's another idea. If we keep this feature only for the DirectX decoders, then we can use compute shaders to burn the image into the HW AVFrame before we send it to the MF pipeline. This will include an additional step which entails a CPU->GPU memory copy of the subtitle, and another GPU->GPU operation. Might kill a frame or two. Compute shaders should be available on all Windows devices that we target, since their feature level is a hard requirement for Windows support. This is similar to @softworkz's L1 (nice graph btw), except it is done on our side. @lukasf your first idea is technically possible. We can share memory between Vulkan and DirectX, there's something called VK_NV_external_memory in Vulkan which allows this kind of thing.
Then you'll have to deal with all the HW formats that are being used for video frames on d3d surfaces when overlaying the subs.
It's similar to B3 because it would be applied into the video frame. L1/L2 means that video and subs remain separate and are blended only during presentation (like when you have one semi-transparent window on top of another).
It's not a bitmap. Use the three-dot menu and choose edit to see it 😄
ffmpeg has hw mapping to Vulkan currently only for VAAPI and CUDA. |
I'm not sure whether shaders are even needed for a trivial overlay. |
The elephant mama in the room here is a GPU->CPU memory copy, which is what's going to kill performance no matter how you spin it. I guess we wouldn't need compute shaders, but the compute shader has the advantage that you sort of know when it will run. For a custom MPE, we can stick to the same interface of the official MPE and only add new stuff, this will allow a drop-in replacement and ease adoption. We can use frame server mode to detect video position and render subtitles accordingly. |
Yes, it's this PLUS re-uploading again (B1). In case of B2, it's just one direction, so half of B1.
Like I said above:
and this includes 4k videos. SW decoding alone is not an elephant (of any age ;-). You can easily verify this by yourself. Just call ffmpeg like this:
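(The exact command isn't preserved in this thread; a decode-only run along these lines - no -hwaccel, so pure software decoding, output discarded - shows the Speed value being referred to. The file name is just a placeholder:)

```
ffmpeg -benchmark -i some_4k_video.mkv -f null -
```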
Then you need to watch the "Speed" value in the output and also your CPU usage (because it's often not going to 100%). So for example, when you see 50% CPU usage and Speed of 6.0x, this means that your CPU is 12 times faster than needed for decoding the video in realtime (presentation at 1.0x). |
You might want to take a look at these two videos, demonstrating dozens of ways for doing subtitle burn-in: https://github.com/softworkz/SubtitleFilteringDemos/tree/master/TestRun1
Sure, a desktop CPU or a plugged in laptop CPU will deal with 4K just fine in software mode. However, as soon as you factor in mobile devices that are not always plugged in and older CPUs, things get complicated. Ideally we should support both software and hardware anyways (we can skip over the system decoders as these are black boxes and only a MF filter will help us there). |
Yes, but that's because of the memory transfer. I'm sure it will run the pure decoding (the ffmpeg command above) of 4k video comfortably above 1.0x speed, even on batteries.
A laptop on batteries will hardly ever be used to drive a 4k display. Even full HD can be considered a bit too much for a typical laptop screen. But still, FFmpegInteropX is moving around 4k frames when the source video is 4k, which is pretty bad, obviously. But there exists another trick for those cases, which isn't even specific to subtitle overlay but can generally improve performance in case of 4k playback when the output display isn't 4k anyway:

Above I said that there are no filtering capabilities available for the D3D11VA hw context. That applies to ffmpeg, but it's not the full truth. In fact, there exists an API for video hw processing for D3D11VA, and it's supported by all major GPU vendors (they probably wouldn't get certified for Windows without it). It's just that ffmpeg doesn't have an implementation for it. The two most important filtering capabilities you get from this are hw deinterlacing and hw scaling, but let's forget about deinterlacing for now and focus on scaling. Having such a filter would allow optimizing performance significantly in all cases where the source video resolution is larger than the presentation (or the max presentation) resolution. The detailed conditions need to be decided upon by every developer individually, but examples would be like:

Like I mentioned above for B2: as soon as you are doing something with the data in hardware before downloading, the cost balance changes, which means: B1 with hw downscaling before hw download becomes better than B2. And even outside of the subtitles subject, this would massively improve performance and reduce energy consumption for 4k playback on non-4k screens.
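For context, the D3D11 hw processing API referred to here is (to my understanding) the ID3D11VideoProcessor interface. Below is a minimal sketch of a fixed-function downscale blit; the function name and texture parameters are placeholders, and error handling and format negotiation are omitted. It is only meant to show the shape of the API, not a drop-in implementation for FFmpegInteropX:

```cpp
#include <d3d11.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Sketch: downscale a decoded frame (e.g. 3840x2160 NV12) to the presentation
// size (e.g. 1920x1080) using the GPU's fixed-function video processor.
// 'device'/'context' are assumed to be the same D3D11 device/context used for
// decoding; srcTex/dstTex are existing textures with suitable bind flags.
HRESULT ScaleWithVideoProcessor(ID3D11Device* device, ID3D11DeviceContext* context,
                                ID3D11Texture2D* srcTex, UINT srcW, UINT srcH,
                                ID3D11Texture2D* dstTex, UINT dstW, UINT dstH)
{
    ComPtr<ID3D11VideoDevice> videoDevice;
    ComPtr<ID3D11VideoContext> videoContext;
    device->QueryInterface(IID_PPV_ARGS(&videoDevice));
    context->QueryInterface(IID_PPV_ARGS(&videoContext));

    D3D11_VIDEO_PROCESSOR_CONTENT_DESC desc = {};
    desc.InputFrameFormat = D3D11_VIDEO_FRAME_FORMAT_PROGRESSIVE;
    desc.InputWidth  = srcW;  desc.InputHeight  = srcH;
    desc.OutputWidth = dstW;  desc.OutputHeight = dstH;
    desc.InputFrameRate  = { 30, 1 };   // nominal, not critical for a plain blit
    desc.OutputFrameRate = { 30, 1 };
    desc.Usage = D3D11_VIDEO_USAGE_PLAYBACK_NORMAL;

    ComPtr<ID3D11VideoProcessorEnumerator> enumerator;
    ComPtr<ID3D11VideoProcessor> processor;
    videoDevice->CreateVideoProcessorEnumerator(&desc, &enumerator);
    videoDevice->CreateVideoProcessor(enumerator.Get(), 0, &processor);

    D3D11_VIDEO_PROCESSOR_INPUT_VIEW_DESC inDesc = {};
    inDesc.ViewDimension = D3D11_VPIV_DIMENSION_TEXTURE2D;
    ComPtr<ID3D11VideoProcessorInputView> inView;
    videoDevice->CreateVideoProcessorInputView(srcTex, enumerator.Get(), &inDesc, &inView);

    D3D11_VIDEO_PROCESSOR_OUTPUT_VIEW_DESC outDesc = {};
    outDesc.ViewDimension = D3D11_VPOV_DIMENSION_TEXTURE2D;
    ComPtr<ID3D11VideoProcessorOutputView> outView;
    videoDevice->CreateVideoProcessorOutputView(dstTex, enumerator.Get(), &outDesc, &outView);

    // The scaling itself happens in this blit, on the GPU's video engine.
    D3D11_VIDEO_PROCESSOR_STREAM stream = {};
    stream.Enable = TRUE;
    stream.pInputSurface = inView.Get();
    return videoContext->VideoProcessorBlt(processor.Get(), outView.Get(), 0, 1, &stream);
}
```

The blit runs on the video engine, so the decoded 4k surface never has to leave GPU memory; only the already-downscaled surface continues through the pipeline.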
And I completely agree. However, any real world scenario of decoding involves memory transfer at some point. And the CPU does bear some responsibility for it. Caching, memory controllers etc will all play a part in it.
You can always find those ridiculously specced "business" laptops that rock a 4k display with an iGPU that can barely handle Windows animations smoothly at that resolution xD

We could technically implement that scaling optimization at our level. We know we can dynamically change the resolution of the video stream descriptors and MF will obey. However, wouldn't this downscaling already happen anyway? We are basically zero-memory-copy throughout all the decoding loops; MF will do the downscaling as it has to. I am not sure if we would actually win anything from this? It would just be us doing the downscaling instead of MF. I am speaking about the general implementation of this, not specifically for sub animations (that part is pretty clear).
Media Foundation? How does that come into play? AFAIU, FFmpegInteropX is decoding via ffmpeg using D3D11VA hw decoders, and the output from ffmpeg is D3D surfaces. Each time the media player element fires its event, we give them one of the D3D surfaces. Is that not right?
Yes, that's a really good question. Using the hw scaling right after decoding has two advantages:

1. It reduces GPU memory consumption

There's always a pool of hw frames involved in decoding. The decoder needs to have a certain number of full-size frames to resolve references (forward and backward). These frames are a fixed requirement. The decoder doesn't produce exactly one frame right at the moment when it needs to be displayed, so there's another number of frames which are needed for queuing up between the decoder (between possible filters) and the final output of ffmpeg before they are actually provided for display. And this second number of hw frames is where GPU memory is reduced when scaling down each frame immediately after it gets out of the decoder. Scaling down 4k to FHD reduces the amount of memory by 75%.

2. Fixed-function block scaling: you can't get it any cheaper

Zero-copy sounds great, because copying is expensive, but what's even more expensive is scaling. When you supply the D3D surfaces to the media player element for display, and those are 4k while the display is just 1920, these surfaces need to be downscaled to the exact size of the element's panel. And who performs that scaling? The GPU. We probably cannot prevent the GPU scaling from happening at all (or maybe there's a property in the mp element?).
Without access to the MS source code it is impossible to know, but I believe the inner workings are something similar to this: FFmpegInteropX --> MediaPlayer --> MediaPlayerElement. A MediaElement would basically be a MediaPlayerElement with an abstracted MediaPlayer attached to it. MediaPlaybackItem will match MediaTopology. I am pretty sure MediaPlayer will do the scaling you are referring to.
@brabebhin - I believe there are a number of inaccuracies in your post. Let's just wait for @lukasf to clear things up. 😄 |
There is a mistake, which I have since corrected ^^ |
It is absolutely clear that MediaPlayer is based on MF. MF is the way how media is done in Windows, it is the replacement of DirectShow. All the error messages you get from MediaPlayer have MF error codes, you can register IMFByteStreamHandlers and they will be automatically pulled in by the MF engine. You can even obtain some of the MF services from the MediaSource, which is how we get the D3D device. I also assume that internally the MediaPlayer is a wrapper around IMFMediaEngine, which has a very similar API surface and was introduced in a similar time frame, as a replacement for the older MFPlay APIs.

Sure the GPU can do scaling in no time. But of course, the same super fast scaling is used when rendering the HW frames on the same HW device. You won't gain any performance benefit by forcing a downscale after decode. In fact you will lose a (tiny) bit, because that means there will be two scale operations, one after decode, and a second scale to the actual target size (unless you know it exactly upfront). If you don't know the exact size, the double scaling will not only cost performance, but it will also introduce scaling artifacts which you don't have if you only scale once, directly to the final target size (that would be the bigger concern for me here). VRAM is really not an issue, it is only a bunch of frames that are decoded upfront, so even 4K video is easily handled on iGPUs without any issues.

I totally disagree that HW decoding is overrated. Sure my high power dev machine can easily do it. But a vast majority of devices out there are old and rather poorly powered and will never be able to decode a high bitrate 4K HEVC on the CPU. A lot of devices are sold even with Celeron CPUs. HW decoding is the only way to bring smooth high res video to those devices. And even if a device can SW decode, it will use at least 10x more CPU power compared to the dedicated HW decoder engines. They are so much more efficient. That means, a laptop that has enough battery to easily play 2h of video on the HW decoder will probably be out of battery after half an hour of SW decoding. And it will make a lot more noise. I would never use a player which cannot do HW decoding on my laptop, because of noise and battery lifetime concerns.
There's no doubt about that. But @brabebhin wrote that MF would downscale the video which can be understood in two ways:
Of course not. Incorrect.
Incorrect. You do.
These are impossible to compare, and that factor is pure fantasy. It appears that you have mistakenly assumed that I'd have been spilling out some opinions and assumptions above. You can pick any of the details I stated above and I'll take you into that subject as deeply as necessary until you acknowledge that I'm right about it. My intention was to share some of the knowledge I have gained over time, especially on things that are not like you would normally think they are. Don't know how I seemingly created the impression of doing some gossip talk.
This is not what I mean, but the claim that there's no graph is likely incorrect.
This is also a claim that you cannot really make unless you have access to MS's source control, in which case I will likely bombard you with more questions haha. MF does have something to do with the presentation layer. For non-frame-server implementations, MediaPlayer likely uses something like this: Just because MediaPlayer isn't by itself a UI element, it doesn't mean it doesn't have anything to do with the presentation layer. Taking in some parameters to render to, as opposed to encapsulating them, is simply a separation of concerns thing.
Not sure what you mean by that..
No. The rule is: new frame - new game! These "bitmaps" are not independent from one another. Neither does any of the bitmaps correspond to a specific element (like a letter). Please look at my image above with the grey background and the black/white regions. I've taken great effort in creating it to make this more understandable. The only thing we can get from libass is an indication whether there has been a change from the previous frame. If not, we can re-use the overlay image that we have generated for the previous frame.
One way to view that multi-bitmap output from libass is to view it as an "awkward" - yet compact - representation of an overlay image. If you look at the "do some math" figures I've given above, you need to acknowledge that - even though the number of 300 bitmaps looks weird - it takes only 15 MB of data to represent an overlay image which would normally have 120 MB. Another way to look at it would need to remind us a bit about how UI drawing was done in earlier days. For example in case of Windows GDI drawing, you had to create a "Pen" of a distinct color and then you could draw pixels with that Pen. Once you are done, you release that Pen and then you create a Pen of a different color and do the painting for that color, and so on. Creating full bitmaps for overlay would have never been feasible in earlier days, and even today, you would rather want to avoid creating a 120 MB overlay image for each frame in a 4k video. |
At this point I think we're just mumbling in the dark. The first step that needs to be done here is to integrate libass in our build system. I currently don't have much time to look at this, so maybe either @lukasf or @softworkz can take this on. Otherwise I'll have to postpone it to around Christmas time (assuming anarchy doesn't take over here). Once we have libass in, we can better understand what's going on, because we can actually see what it outputs.

Dusting off my compute shader skills: we can use structured buffers with the exact struct layout from the libass image struct. Yes, we need to pass a contiguous array and not a linked list, but that shouldn't be a problem. Then we can use compute shaders and dispatch 1 kernel for each bitmap in the array to do the job. We can set a kernel size of 32, which should maximize hardware usage: Context->Dispatch(num_bitmaps, 1, 1). The texture2D will be created and will stay in GPU memory (so we save the 4K frame data transfer from CPU to GPU). A rough sketch of the host side is below.
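Purely as an illustration of the idea described above (which is revisited further down in the thread), the host side of such a dispatch could look roughly like this. The AssImageRecord layout, the variable names and the shader itself are hypothetical; only the D3D11 calls are real API:

```cpp
#include <d3d11.h>
#include <wrl/client.h>
#include <cstdint>
#include <vector>
using Microsoft::WRL::ComPtr;

// Hypothetical flattened copy of the libass image list (one record per bitmap).
struct AssImageRecord
{
    uint32_t dstX, dstY;      // placement inside the video frame
    uint32_t width, height;   // bitmap dimensions
    uint32_t stride;          // bytes per row in the alpha buffer
    uint32_t color;           // RGBA color as provided by libass
    uint32_t alphaOffset;     // offset into a packed alpha byte buffer
    uint32_t reserved;
};

// Upload the records as a structured buffer and dispatch one thread group per bitmap.
void DispatchSubtitleBlend(ID3D11Device* device, ID3D11DeviceContext* context,
                           ID3D11ComputeShader* blendCS, ID3D11UnorderedAccessView* frameUAV,
                           const std::vector<AssImageRecord>& records)
{
    D3D11_BUFFER_DESC desc = {};
    desc.ByteWidth = UINT(records.size() * sizeof(AssImageRecord));
    desc.Usage = D3D11_USAGE_DEFAULT;
    desc.BindFlags = D3D11_BIND_SHADER_RESOURCE;
    desc.MiscFlags = D3D11_RESOURCE_MISC_BUFFER_STRUCTURED;
    desc.StructureByteStride = sizeof(AssImageRecord);
    D3D11_SUBRESOURCE_DATA init = { records.data(), 0, 0 };

    ComPtr<ID3D11Buffer> buffer;
    device->CreateBuffer(&desc, &init, &buffer);

    D3D11_SHADER_RESOURCE_VIEW_DESC srvDesc = {};
    srvDesc.Format = DXGI_FORMAT_UNKNOWN;
    srvDesc.ViewDimension = D3D11_SRV_DIMENSION_BUFFER;
    srvDesc.Buffer.FirstElement = 0;
    srvDesc.Buffer.NumElements = UINT(records.size());
    ComPtr<ID3D11ShaderResourceView> srv;
    device->CreateShaderResourceView(buffer.Get(), &srvDesc, &srv);

    context->CSSetShader(blendCS, nullptr, 0);
    context->CSSetShaderResources(0, 1, srv.GetAddressOf());
    context->CSSetUnorderedAccessViews(0, 1, &frameUAV, nullptr);
    context->Dispatch(UINT(records.size()), 1, 1);   // one group per bitmap, as suggested above
}
```

Note that, as pointed out a few comments later, overlapping bitmaps have to be blended in order, so a one-group-per-bitmap dispatch is not the final answer; this only shows the mechanics.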
Can you elaborate on that? If I remember correctly, that sample calculation was about a text area covering 15% of a 4K image. A 4K image in ARGB32 takes 24MB of memory. 15% of that area is 3.7MB. And that's full true color image quality including alpha, compared to the strange 15MB output we get from libass, which sure has worse quality (though probably not very noticeable). How is the latter more efficient? And it's not only the amount of memory, but also the tedious creation of all those bitmap layers in the lib, and then having the client reading out all these bitmaps and applying to a different, temporary bitmap over and over before rendering. It would be vastly more efficient to just output an ARGB32 for every changed area. Less memory and a simple copy operation vs the tedious copying of about 4x as much memory. Or am I missing something? |
I'm not a libass developer, but I do work with related software (Aegisub) so I can clear up a couple of the misconceptions here. Like shown further above, libass returns a linked list of
So this is not true (and if it was, it would indeed be extremely inefficient). Rather, this event would generate three bitmaps: One for the fill, one for the border, and one for the shadow.
These subtitles use karaoke text to achieve per-syllable karaoke styling. Because of this, the text cannot be rendered to a single combined bitmap, and is instead split into individual syllables, with four bitmaps (fill, border, shadow, and karaoke fill) being output per syllable. However, these bitmaps are also only as large as the characters they contain, so this does not actually generate any significant overhead (compared to having four combined bitmaps for the entire line). As the statistics show, there are a total of 40.000 pixels, which corresponds to (e.g.) about four 20 x 500 images. If you're looking for efficient ways to blend the images returned by libass, you could look at mpv (specifically https://github.com/mpv-player/mpv/blob/a283f66ede58e0182ac8cd4c930238144427fa74/sub/sd_ass.c). mpv packs all the alpha bitmaps into a single image, which is then passed to the GPU together with information about which rectangles should be blended where in which color. |
Hi @arch1t3cht, thanks a lot for the detailed clarifications! This is actually how I imagined it to work and it just makes so much more sense. I will also check out the mpv references. It sounds like the most efficient way to do the upload and blending in one operation. Thanks for the hints! |
I don't think we can use parallelization on that layer, since there is a strict order in which the bitmaps must be applied. But anyways, like @arch1t3cht just mentioned, we have a lot less images to render. We should look at how mpv does this, maybe we can follow a similar approach. I guess a pixel shader could be better suited here. I can try to take a go on getting a libass build running, I hope I can find some spare time for that in the next weeks. |
Thanks @arch1t3cht ! @lukasf - The 120 MB I had in memory were for 4K 10bit HDR frames. 4k 8bit RGBA is not 24 but 3840 * 2160 * 4 = 32 MB |
HDR typically has 10bit instead of 8bit per channel. Even in case of rare 12bit HDR, you get nowhere near that number, even for the full frame (and we were talking about the 15% overlay area).
I guess the confusion all comes from this ass_image header definition, which clearly states that the buffer is a 1bpp alpha buffer. 1bpp means 1bpp. But it seems that in reality this is a 8bpp alpha buffer, and this is the good news and saves us from a lot of pain and headaches! ^^ I had looked into the libass alloc functions, where it already seemed to be 8bpp to me. That's why I came up with it. |
Yes, what I said was based on the original info we had from that header. Since that is no longer true, then the approach doesn't make sense anymore. |
And I had taken a quick look at whether ffmpeg uses the alpha from the RGBA color value (but no further). So there are two alpha values - the one from the color and the ones from the bitmap - which I don't quite understand then how they relate to each other. |
They should just both be applied (i.e. multiplied). The alpha in the bitmap comes from rasterization, blurring, and clipping, while the alpha part of the color is the one set by alpha tags or fades. Separating the two makes it easier for libass to cache the results of rasterization and blurring. |
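To make the "both alphas are applied" point concrete, a plain CPU reference blend could look like the sketch below. It mirrors how ffmpeg's libass-backed filters combine the two values; the function name and the BGRA frame layout are my own assumptions, and it is deliberately unoptimized:

```cpp
#include <cstdint>
#include <ass/ass.h>

// Blend one frame's worth of libass output onto an 8-bit BGRA buffer.
// 'img' is the linked list returned by ass_render_frame();
// 'frame' has 'frameStride' bytes per row.
static void BlendAssImages(const ASS_Image* img, uint8_t* frame, int frameStride)
{
    for (; img != nullptr; img = img->next)
    {
        // img->color is packed RGBA, where the last byte is *transparency*
        // (0 = opaque), so the effective color alpha is 255 minus that byte.
        const uint8_t r = (img->color >> 24) & 0xFF;
        const uint8_t g = (img->color >> 16) & 0xFF;
        const uint8_t b = (img->color >> 8)  & 0xFF;
        const uint8_t colorAlpha = 255 - (img->color & 0xFF);

        for (int y = 0; y < img->h; ++y)
        {
            const uint8_t* src = img->bitmap + y * img->stride;
            uint8_t* dst = frame + (img->dst_y + y) * frameStride + img->dst_x * 4;

            for (int x = 0; x < img->w; ++x, ++src, dst += 4)
            {
                // Per-pixel coverage from the 8-bit alpha bitmap, multiplied with
                // the alpha carried in the color value (fades, alpha tags).
                const uint32_t a = uint32_t(*src) * colorAlpha / 255;
                if (a == 0) continue;

                dst[0] = uint8_t((b * a + dst[0] * (255 - a)) / 255); // B
                dst[1] = uint8_t((g * a + dst[1] * (255 - a)) / 255); // G
                dst[2] = uint8_t((r * a + dst[2] * (255 - a)) / 255); // R
                dst[3] = uint8_t(a + dst[3] * (255 - a) / 255);       // A ("over")
            }
        }
    }
}
```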
I meant uncompressed and unpacked frames like used when processing video (even though it's 60MB only, probably the 120 came from the calculation of data transfer that is needed for dealing with such frames with hwdownload/hwupload).
Sure, but the comparison I made was against the case when libass would output full overlay frames, and the linked list can have content for the full frame area, not for a fixed subset of 15%.
Yesterday I added some code to measure the timing for blending the ass output onto an image:
What it shows is that the number of pixels affected plays a much larger role than the number of ass images. |
@arch1t3cht - If you are working on Aegisub, you must be familiar with VSFilter as well (which Aegisub appears to be using by default). Since FFmpegInteropX is not cross-platform, what would be your advice regarding it, do you think it could be a better choice?
VSFilter is the reference implementation for the ASS format (though the situation is somewhat complicated since there exist various different versions of it), but it is not really actively developed any more. Libass on the other hand is actively maintained and in general faster than VSFilter. In particular, it outsources the blending to the user, who can then decide to blend on the GPU if they want - to my knowledge this is not possible with VSFilter. So my recommendation would be to use libass, but it's possible that other libass developers have other thoughts on this (and note that I don't know much about FFmpegInteropX).
Blending in the GPU is probably the right approach for us as well. |
Like I said above, I don't think it makes sense to send all individual bitmaps to the GPU, there needs to be some preprocessing on the CPU side. Look at these figures from overlaying some more complex animated ass subtitles onto a 4k video:
The count of those bitmaps alone is so huge (and proves that I didn't exaggerate) that it makes gpu processing inefficient, as this cannot be parallelized reasonably. It's like when you have a fleet of riding mowers (shader units): you can cut the grass of a football field very quickly (letting 10 of them drive in parallel), but it doesn't help for your family garden at home. It also doesn't help when you need to treat a hundred family gardens but are forced to do one after another. Most of those individual bitmaps are too small for any parallel processing to help.
I compared CPU usage when playing complex ass subs on a 4k screen with Aegisub (the video preview maximized) and there doesn't seem to be much difference between libass and VSFilter..?
Do you know what kind of output VSFilter provides? Normally it's a DirectShow filter but I couldn't detect an active filtergraph, it seems Aegisub is using it in a different way?
If it would provide RGBA region images, as fast as libass creates its output, we would save the step of re-assembling the ass images, that's what I'm wondering about. |
Aegisub is actually not too efficient at subtitle rendering since it blends libass subtitles on the CPU without any big optimizations (i.e. no SIMD, etc). Benchmarking subtitle rendering is hard, especially when comparing different renderers, but one option is to play the subtitles on MPC-HC (clsid's fork) with the internal subtitle renderer (a VSFilter variant) and libass, and compare the number of dropped frames. It's also worth noting that this very much depends on what "complex" means here. As has also been noticed above, the main performance bottleneck in rendering is the total bitmap size. Movement and transforms are often seen as "complex" but aren't necessarily any more performance-heavy than static subtitles. To give a concrete example of where libass is faster than VSFilter: Its blurring implementation is much more efficient. VSFilter's blurring gets much slower with increasing blur radius, while libass's blur implementation is effectively constant time with respect to the blur radius. Moreover, libass has much more extensive caching, so effects like clip-based color gradients may be more efficient.
Aegisub uses VSFilter via CSRI, but I'm afraid I can't tell you much more than that. |
The amount of CPU processing that will actually be needed remains to be seen. We should off-load as much as possible to the GPU. Thanks to @arch1t3cht I have a fairly good idea how this will work but until I see the actual outputs and I can play with libass and render some of these frames myself, it really is hard to just imagine it and figure out the best approach :) |
http://streams.videolan.org/samples/sub/SSA/subtitle_testing_complex.mkv and a 4k version created like this:
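(The exact command used isn't preserved here; an upscale along these lines, keeping the audio and the ASS subtitle track intact, would produce such a file - the parameters are illustrative:)

```
ffmpeg -i subtitle_testing_complex.mkv -map 0 -vf "scale=3840:2160:flags=lanczos" -c:v libx264 -crf 18 -c:a copy -c:s copy subtitle_testing_complex_4k.mkv
```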
It looks like MPC-HC is rendering the subtitles at a lower resolution and upscales them for display.
Can you tell what CSRI is? Never heard of that.. Finally, the question of all questions: What are you working on, the most recent version of Aegisub is 3.2.2 from 2014... Or is there any newer somewhere? 😆 |
VSFilter has many interfaces.
XySubFilter uses SubRenderIntf, originally designed for madVR but nowadays also supported by MPC-HC, which outputs a list of RGBA bitmaps. I don’t know how/whether XySubFilter actually combines small bitmaps into bigger RGBA ones, but at any rate, the final blending onto video is done by the consumer. MPC-HC’s internal VSFilter may have something similar of its own.
When using the internal renderer as arch1t3cht suggested (or when using XySubFilter), it doesn’t. You may be using an external VSFilter/DirectVobSub: check your settings in Options → Playback → Output (or in older versions, directly in Options → Playback).
Assuming you’re using the latest version from clsid2, the libass checkbox is tucked away in Options → Subtitles → Default style. |
I made some comparisons:
So, the places to look at are MPV and VLC. MPV does a lot of things with shaders, which puts significant load on GPUs; VLC is the most efficient player among all. Their use of libass might be more straightforward, but it's just a guess. In terms of what ASS rendering adds to the CPU and GPU loads, they appear to be similar.
Performance aside, it may also be less correct. It certainly has been in the past. Exercise caution. mpv is the exemplary existing user of libass that’s known to configure and use everything correctly. |
@astiob - Thanks a lot for the comment! You were right, I needed to block loading of external VSFilter implementations, then it played fluently with the internal renderer and also with libass enabled (I've updated my post above accordingly). In both cases I've seen very high CPU load, very different from VLC and MPV.
Yup, latest from clsid2. Found it, thanks, awkward placement indeed.
Then it's definitely worth looking at it. I'm only familiar with ffmpeg's way of using it. |
In particular, VLC always renders its subtitles at the video's storage resolution and blends them to a single RGBA image, which is then scaled to the display resolution. This can cause artifacts, in particular when the display resolution is lower than the storage resolution (edit: I mean VLC's scaling specifically here. In general there can be good reasons for rendering at storage resolution, in particular for typesetting). This may also be the reason why VLC appears faster than mpv to you: If you're watching subtitles on a 1080p video in fullscreen on a 4k display, VLC will render subtitles at 1080p while mpv will render at 4k, which is slower. (You can make mpv render subtitles at the video's storage resolution using |
It didn't. I said it seems equal.
Right, I've seen that before. It's a bad scaling algorithm in place.
From the screenshot images, you can see that what you said doesn't apply to my test - in case you know that video: I had created a version upscaled to 4k, to avoid players rendering the subs at the original video resolution 😄
Hi,
I'm using FFmpeginteropX for a long time, thank you for your great work.
It seems ShiftMediaProject updated all the libraries to the latest versions around 3 weeks ago.
I read in #384 that @lukasf said:
Since it's updated, can we at least use the libass version and make it work in FFmpegInteropX without touching the FFmpeg builds? I tried to compile ShiftMediaProject's libass version and all its dependencies, and I managed to build them all, but there is no winmd file in the output folder.
Can you please help me in this?
Thanks in advance.