Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make fix_outline always produce correct visual output without aliasing or gaps #483

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

astiob
Copy link
Member

@astiob astiob commented Mar 4, 2021

I’ve ported my ancient alpha branch onto current master and fixed karaoke, so this should now really solve the problem for good.

This would supersede and close #480. It would fix #76 and fix #473.

My only question is whether this PR’s current approaches to caching and producing ASS_Bitmaps are right.

An alternative approach I considered was to include effect_timing in the cache key and have fix_outline take effect_timing and both alphas and internally process the bitmap in two parts like karaoke does it. This would have required no changes to render_text and avoided phantom ASS_Images. The big drawback would be that \kf syllables would effectively never be cached (except for very slowly-moving \kf).

Ideally, combined glyph/fill bitmaps would be cached separately from border and shadow bitmaps:

  • After this PR, borders and shadows of identical text that differs only in \1a and/or \2a would share (and avoid recomputing) their glyph bitmaps.
  • After this PR, border and shadow bitmaps could be optimally shared/cached regardless of karaoke states (by making simply a single alpha value as part of the key).
  • Regardless of this PR, glyph bitmaps could be shared for identical text that has some border and differs only in \be or \blur or fill-in flags.

To get rid of the phantom transparent ASS_Images that render_text now outputs for borders with \kf, render_glyph could skip images whose alpha is 255 (fully transparent). This could slightly help other cases, too, such as when the script has an explicit 255 alpha. On the other hand, if there is any client out there that wants to do something nontrivial with the ASS_Images it receives such as draw bounding boxes around them, it would then miss these images entirely, which seems undesirable. Alternatively, render_glyph could get an extra argument pointing to a second source bitmap, which would work well but require extra (easy) code changes in render_text.

else if (info->effect_type == EF_NONE || info->effect_timing > 0)
info->filter.alpha_secondary = info->filter.alpha_primary;
else
info->filter.alpha_primary = info->filter.alpha_secondary;
Copy link
Member Author

@astiob astiob Mar 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For \k and \ko, I’m setting alpha_primary and alpha_secondary to the same number in the cache key. This means that the bitmap will be re-rendered when the karaoke time arrives (which happens at most once), but on the other hand this can avoid unnecessary re-renders (arbitrarily many) when the other alpha value changes. This may or may not be a desirable trade-off.

copy_bitmap(render_priv->engine, &v->bm_o_k, &v->bm_o);
fix_outline(&v->bm, &v->bm_o_k, k->filter.alpha_secondary, blurred_o);
}
fix_outline(&v->bm, &v->bm_o, k->filter.alpha_primary, blurred_o);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: I’m not a huge fan of the code duplication here.

@astiob
Copy link
Member Author

astiob commented Mar 4, 2021

Also, after this PR, this code from #371 (comment):

if ((flags & FILTER_NONZERO_BORDER &&
     info->a_pre_fade[0] == 0 &&
     info->a_pre_fade[1] == 0 &&
     info->fade == 0) ||
    ...)
    flags |= FILTER_FILL_IN_BORDER;

technically starts to harm quality if \bord is small (smaller than a pixel), although I haven’t done any tests and wouldn’t be surprised if the difference was unnoticeable.

@astiob
Copy link
Member Author

astiob commented Mar 4, 2021

And yes, as mentioned on IRC, this PR does have the issue that it requires divisions. Many divisions. Slow.™ It’s not obvious how to alleviate this, although to me it’s also not necessarily obvious that this is actually a problem.

@astiob astiob force-pushed the alpha branch 2 times, most recently from 711f798 to ccb6b73 Compare March 5, 2021 18:29
@astiob
Copy link
Member Author

astiob commented Mar 5, 2021

After a long discussion on IRC and some performance tests by @MrSmile, I’ve replaced the integer division code with float division, because it’s actually faster, with an extra slight twist to try and assist autovectorization (which is entirely impossible for integer division on x86). I’ve updated the first commit message accordingly. I’ve also used this opportunity to reduce noisy changes between the two commits.

I’ve also added a code comment where I previously posted a GitHub comment.

// The cost (which we accept) with IEEE float32 is that just two out
// of the 33 million possible operand tuples round in the "wrong"
// direction after evaluating to almost exactly an integer and a half.
const float over65025 = nextafterf(65025, FLT_MAX);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ultimately decided to just go with nextafterf as the universally safest choice (with no assumptions about the floating point format). The drawbacks I considered are:

  • It may be absent on some older systems. But we make extensive use of lrint and I invoke lround once in karaoke code, and I’d like to hope that libraries that contain those functions also have nextafterf.
  • It’s not necessarily evaluated as a compile-time constant. But I’ve hoisted it out of the pixel-wise loop, so it’s now a single call per bitmap, which shouldn’t hurt.

I tried to emulate this particular nextafterf call through basic arithmetic, but couldn’t make it work without assuming things about the floating point format. Which is likely very safe in practice, but if I can avoid it at no significant cost, I’d rather avoid it.

Copy link
Member

@MrSmile MrSmile left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the main problem is outputting of 4 bitmaps instead of 2 for \kf. Also a bunch of nitpicks.

libass/ass_bitmap.c Outdated Show resolved Hide resolved
@@ -30,7 +30,7 @@ typedef struct cache Cache;
// cache values

typedef struct {
Bitmap bm, bm_o, bm_s;
Bitmap bm, bm_o, bm_o_k, bm_s;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think simple symmetric bm_k is better.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don’t disagree that this naming is odd, but will it not be a concern that bm_k doesn’t quite explain what it actually is? Like, bitmap of karaoke what? (It’s specifically the outline, like bm_o, but exclusively for karaoke.) I suppose it should be fine with the comment edited…

@@ -106,6 +106,8 @@ START(filter, filter_desc)
GENERIC(int, be)
GENERIC(int, blur)
VECTOR(shadow)
GENERIC(uint8_t, alpha_primary)
GENERIC(uint8_t, alpha_secondary)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alpha_* correspond to c[0] and c[1]. I think it's better to keep 0 and 1 literally instead of confusing off-by-one primary and secondary. Either in the form of alpha[2] (but I think cache macro doesn't support arrays) or alpha0/alpha1 or even a0/a1.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

“Primary” and “secondary” come from the ASS style field names PrimaryColour and SecondaryColour, which denote their meanings. But sure, alpha0 and alpha1 could work too. At one point I also considered alpha_left and alpha_right.

@@ -789,6 +789,15 @@ static ASS_Image *render_text(ASS_Renderer *render_priv)
if ((info->effect_type == EF_KARAOKE_KO)
&& (info->effect_timing <= 0)) {
// do nothing
} else if ((info->effect_type == EF_KARAOKE_KF) && info->bm_o_k) {
tail =
render_glyph(render_priv, info->bm_o, info->x, info->y,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better to rewrite render_glyph() to produce one output bitmap instead of two.

Bitmap *bm_s_source = &v->bm_o;
bool share_fix_outline =
!(flags & FILTER_FILL_IN_BORDER) && !(flags & FILTER_FILL_IN_SHADOW) &&
(k->filter.alpha_primary == 0xFF || k->filter.alpha_secondary == 0xFF);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it's better to put 0xFF - alpha into filter? There are checks for 0xFF only and alpha gets reversed inside of fix_outline() too.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use the ASS-native alpha scale where 255 means transparent and 0 means opaque absolutely everywhere, I believe, so I think flipping it here would be confusing. For example, compare this to the code that originally decides whether to set FILTER_FILL_IN_SHADOW.

@@ -112,7 +112,7 @@ typedef struct {
BitmapRef *bitmaps;

int x, y;
Bitmap *bm, *bm_o, *bm_s; // glyphs, outline, shadow bitmaps
Bitmap *bm, *bm_o, *bm_o_k, *bm_s; // glyphs, outline, shadow bitmaps
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment should be updated too.

@@ -60,24 +60,25 @@
#endif


void ass_synth_blur(const BitmapEngine *engine, Bitmap *bm,
bool ass_synth_blur(const BitmapEngine *engine, Bitmap *bm,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better to not change this function and handle blurriness check outside of it. At least threshold here (0.001) is arbitrary in contrast with blur quantization that is based on mathematical considerations.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I really want to know is what the actual bitmap contains by the time I get to fix_outline: whether it’s proper coverage values, which allow me to use the perfect formula, or approximate and nonmonotonic signal-processed samples, which necessitate the “border is already pixelated” formula. I don’t care so much (or at all) about whether blur was requested in the script, but rather about whether it was actually applied and applied successfully. Even in the rare case where we try to apply blur and fail due to lack of memory or something, I can still apply the perfect formula to the result, although it probably doesn’t really matter.

(Also, Gaussian blur already returns a bool status and we’ve been ignoring it all this time.)

libass/ass_bitmap.c Show resolved Hide resolved
Copy link
Member

@TheOneric TheOneric left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current version leads to an “use after free” in ART's karaoke tests: https://github.com/TheOneric/libass-regression-tests-tmp/runs/2319224287?check_suite_focus=true#step:10:85 (for some reason GitHub requires one to be logged in to see any logs)

To recreate enable at least ASAN; ART uses -fsanitize=address -fsanitize=undefined -fsanitize=float-cast-overflow -fno-sanitize-recover=all.

@astiob
Copy link
Member Author

astiob commented Apr 29, 2021

Fixed the memory bug. (Good catch!)

By the way, 216-vertical-2252.png in ART will need to be regenerated: it exhibits pretty bad aliasing that is fixed by this PR. (But it’s close enough that it earns it a “BAD” success rather than an outright failure: that’s kinda nice.)

Haven’t redone render_bitmap yet, though. I’m still at a loss as to the best way to do it.

@rcombs
Copy link
Member

rcombs commented Aug 12, 2021

My biggest question here is what this does to performance, since it potentially runs on quite a lot of data. In practice, are compilers smart enough to extract the blurred check? Does the use of floating-point division have any negative effects? Does this end up taking a significant amount of time on complex files now?

If it's fine on those fronts, I'm 👍 on this.

@astiob
Copy link
Member Author

astiob commented Aug 12, 2021

what this does to performance […] are compilers smart enough to extract the blurred check?

Back when this was discussed on IRC, MrSmile ran tests and saw autovectorization kick in on amd64, which I think implies hoisting of the blurred check. (But I’ll note that autovectorization requires -O3, while we currently default to -O2. Maybe it’s time to revisit that default.) And even if it isn’t hoisted, (as MrSmile also noted,) the branch should be predicted correctly almost all the time. The only potential danger is if a compiler is smart enough to autovectorize in general but fails to hoist this check and therefore fails to vectorize.

Quoting the IRC log with the exact numbers below.

@MrSmile Do you remember what the 10s number quoted at the beginning was? Looking at the log now, I’m not sure what the baseline number is: 10s or 16s? … Oh, was it perhaps 10s vectorized and 16s non-vectorized? Does the old/current code vectorize at all?

18:39:39 < MrSmile> Okay, I managed to create sample for fix_outline()
18:40:26 < MrSmile> after PR #483: 10s -> 26s and fix_outline() 71% in perf
18:40:43 < MrSmile> mostly on idiv instruction
18:50:11 < MrSmile> table-based approach isn't much better: 21s and 63%
18:54:51 < rcombs> how's it do if you switch to float
19:03:41 < MrSmile> much better: 12s and 25%
19:06:37 < MrSmile> but I should probably disable -Ofast
19:09:28 < MrSmile> without -ffast-math there is no sane float -> int conversion
19:23:39 < MrSmile> for -O3 it's the same and looks like for -O2 vectorization is disabled altogether
19:26:44 < MrSmile> the fastest method (at least for x86 with AVX): o[x] = num / (65025.0001f - alpha_g * g[x]) + 0.5f

20:05:08 < Oneric> yes, how much does #483 with floats  impact performance for -fno-fast-math builds?
20:33:16 < MrSmile> 10s -> 12s, 8% -> 23% fix_outline()

00:23:32 < Oneric> I think float-div should be faster than int-div on all CPU-Archs with a hardware FPU
00:25:24 < Chortos-2> MrSmile: so how does it actually fare without the vectorization?
00:25:46 < MrSmile> with -O2?
00:26:26 < Oneric> non -ffast-math, -O2
00:30:48 < MrSmile> FP: 20s 60%, int: 26s 72%
00:32:30 < MrSmile> table: 18s 56%
00:33:58 < Oneric> The 256 or the full 256^2 table?
00:34:09 < MrSmile> 256
00:34:24 < MrSmile> with x2 error from ideal aproach
00:35:14 < MrSmile> o[x] = num * div_table[(65025 + 128 - alpha_g * g[x]) >> 8] >> 10;
00:36:22 < Oneric> For reference: what's the time before the patch with same compiler option? (-O2 (no fast-math) …)
00:43:42 < MrSmile> o[x] = (o[x] > g[x]) ? o[x] - (g[x] / 2) : 0  ===> 16s 46%
00:45:10 < MrSmile> o[x] = (o[x] > g[x]) ? o[x] - g[x] : 0  ===> 14s 36%
00:56:04 < Oneric> thanks. so for default-options and fix_outline heavy samples either
00:56:04 < Oneric> 62% penalty with only int
00:56:04 < Oneric> 25% but pretty exact (float)
00:56:05 < Oneric> 13% with some approximation and table
00:57:42 < Chortos-2> to be clear, is this latest float result with 65025.004f and +0.5f, or with the FFMAX(den,1) & all-integers-and-only-division-in-float approach I wondered about?
00:58:03 < MrSmile> 65025.004f
01:21:38 < MrSmile> o[x] = num / (float) FFMAX(den, 1) + 0.5f;  ===> 20.3s 61%
01:22:42 < Chortos-2> there's no need for the +0.5f if you add den/2 to num first
01:23:46 < MrSmile> o[x] = num / (float) FFMAX(den, 1);  ===> 19.5s 58%
01:24:59 < Chortos-2> overall, it looks like all of these float variants are about the same speed?
01:25:19 < Chortos-2> just the autovectorizability differs
01:26:29 < MrSmile> o[x] = (num + den / 2) / (float) FFMAX(den, 1);  ===> 20.8s 61%
01:28:34 < MrSmile> o[x] = num / (65025.004f - alpha_g * g[x]) + 0.5f;  ===> 19.9s 60%
01:29:01 < MrSmile> slightly better and also can vectorize

@TheOneric
Copy link
Member

TheOneric commented Aug 12, 2021

But I’ll note that autovectorization requires -O3, while we currently default to -O2. Maybe it’s time to revisit that default.

If I'm not mistaken -ftree-vectorize should be sufficient to auto-vectorise assuming the instruction baseline contains vector-operations. (For gcc this is currently enabled by default for -O3 but not -O2, not sure about clang)

@astiob
Copy link
Member Author

astiob commented Aug 12, 2021

Tentative bottom line, assuming that 10s/16s was resp. vectorized/unvectorized:

  • The consensus is that hardware float is faster than any equivalent int code in any case.
  • This PR slows down fix_outline in any case.
  • In a constructed heavy sample, full rendering takes extra 20–25% time.
  • Measured alone, fix_outline takes extra 60% time unvectorized (vs unvectorized old code) or extra 250% vectorized (vs vectorized old code, which must’ve been pretty fast indeed!).
  • The new vectorized code is faster than the old unvectorized code.
  • Vectorization helps immensely, so we should probably enable it in our default CFLAGS.

@MrSmile
Copy link
Member

MrSmile commented Aug 13, 2021

Oh, was it perhaps 10s vectorized and 16s non-vectorized?

Yes, 10s for -Ofast and 16s for -O2.

But I’ll note that autovectorization requires -O3, while we currently default to -O2. Maybe it’s time to revisit that default.

I'm planning to create assembly version when this PR lands, so It's not a strict necessity. On the other hand I usually build libass with -Ofast and haven't encountered problems so far, maybe it's worth to increase default optimization level, especially for non-x86 architectures.

@astiob
Copy link
Member Author

astiob commented Aug 23, 2022

Oh man, I’ve just rediscovered that this was in my very first batch of patches that was my first attempt to contribute to libass 😂 Even before other things that I thought came earlier. And even listed as the very first item. And it still isn’t finished.

@TheOneric
Copy link
Member

But I’ll note that autovectorization requires -O3, while we currently default to -O2. Maybe it’s time to revisit that default

Update on this: since GCC 12 autovectorization is part of -O2 (with -fvect-cost-model=very-cheap which is switched for -fvect-cost-model=dynamic in -O3).
clang appears to also enable some forms of autovectorisation with -O2 already.

Whenever a glyph has a border, the original glyph's own ASS_Image
is applied on top of the border's ASS_Image. Let's mathematically
devise such alpha values for the border image that the resulting
colors exactly match (modulo rounding) the colors produced by a
hypothetical ASS_Image that contains both the glyph and the border.

Let a_i be the opacity of a single pixel within an ASS_Image,
where i is either g for glyph or o for outline (border). Further,
let A_i be the opacity and c_i the color of the entire ASS_Image.

This commit only considers cases when no blur is applied,
so that the border is a solid, expanded copy of the glyph.
Blur will be addressed in a separate commit.

Think of a_i as the _area_ covered by the glyph/border,
measured in pixels. We have 0 <= a_g <= a_o <= 1.
We also have 0 <= A_g, A_o <= 1.

We need a background color. Let's denote it as c_b. The background
color includes any effects from applying the shadow of the glyph.

The final color we want to achieve is:
  * on the area covered by the original glyph,
                                       A_g c_g + (1 - A_g) c_b.
  * on the area covered by the border, A_o c_o + (1 - A_o) c_b.
  * on the remaining area,             c_b.
Averaging the color over the pixel's area gives:
           a_g  (A_g c_g + (1 - A_g) c_b) +
    (a_o - a_g) (A_o c_o + (1 - A_o) c_b) +
      (1 - a_o)  c_b

The color we actually get after blending the border with the
background and then blending the original glyph with the result is:
        (A_g a_g) c_g +
    (1 - A_g a_g) ((A_o a_o) c_o + (1 - A_o a_o) c_b)

We want to replace a_o in the actual color's expression with a
value that makes it match the expected color. Let's call this value
a_o'. Equating both expressions and solving for a_o' gives:
       A_o (c_o - c_b) = 0 and a_o' does not matter
    or A_g a_g = 1 and a_o' does not matter
    or a_o' = (a_o - a_g) / (1 - A_g a_g)

Use the solution formula from the third case to calculate a_o',
as the first two cases allow a_o' to be anything.

The calculation itself is performed and rounded (half towards
+infinity) using floating-point division because this is
actually faster than integer division and because this is
vectorizable on x86 (massively improving the speed further).

The only issue is if the 1 - A_g a_g denominator should be zero.
A_g a_g can only equal 1 if A_g = a_g = 1. Since a_g <= a_o <= 1,
a_o must also be 1. As it happens, libass already handles this case
separately: as a precaution, it checks that a_g < a_o and sets a_o'
to zero if this condition fails to hold. Retain this check for the
numerator, still as a precaution. Unfortunately, using this check
for the denominator or the whole formula seems to break automatic
vectorization of the loop, so use a trick: instead of 1 - A_g a_g,
use nextafter(1) - A_g a_g and accept the very rare and minor
precision loss. This is just enough to avoid division by zero
and allows the loop to be autovectorized. Since in this case a_o'
can be any value, this does not adversely affect our equation
solving. Moreover, if A_g a_g = 1, then the numerator must be 0,
so we will, in fact, produce the same value (0) as we do now with
the a_g < a_o check.

The formula requires knowledge of A_g. To enable this, the opacity of
the original glyph is passed to fix_outline and (via CompositeHashKey)
to ass_composite_construct. Generated composite bitmaps depend on it,
so composites with different primary/secondary color alpha (depending
on karaoke) are now distinguished by the composite cache.

NB: this commit produces phantom, fully transparent ASS_Images
that duplicate parts of border images when \kf is in effect.
This may be a problem if any consumer assigns any meaning
to fully transparent ASS_Images.
Blurred alphas are not coverage values. The area that is not
covered by the original glyph is a mix of the border and the
background, governed by the blurred border opacities. To get
an exact result, we would need to integrate the blurred
opacities on fragments of pixels that exclude the original
glyph (or, equivalently, on fragments that are covered by it,
as these fragments are complementary). This is too hard,
and in fact blur itself already employs approximations
to avoid integration even on whole pixels.

Give up on exact values when blur is applied, and treat
each pixel of the blurred border image as a uniformly flat
translucent square.

Note that we can in this case get non-zero opacities even when
the glyph's coverage is larger than the border's uncorrected
opacity, invalidating our current o[x] > g[x] sanity check.

Writing and solving equations like before, we want:
         a_g  (A_g c_g + (1 - A_g) c_b) +
    (1 - a_g) ((A_o a_o) c_o + (1 - A_o a_o) c_b)

and we obtain:
       A_o (c_o - c_b) = 0 and a_o' does not matter
    or A_g a_g = 1 and a_o' does not matter
    or a_o' = (a_o - a_g a_o) / (1 - A_g a_g)

Handle the first case like the third,
and handle the second case by setting a_o' = 0.

The observable difference for non-blurred bitmaps
was not large in my tests, but the code is simple,
so the exact non-blurred calculations are retained.
@astiob
Copy link
Member Author

astiob commented Feb 16, 2023

Rebased on master and addressed two nits. (More substantial changes still pending.)

Meanwhile, we’ve run performance tests on several other CPUs. Full times of profile built from this source code run on this test script:

AMD Ryzen 7 5800H (Zen 3; released in early 2021), GCC 12.2.0 (Rev10, Built by MSYS2 project), Windows 11 (x86_64-w64-mingw32, UCRT):

For this test, I recompiled only ass_bitmap.c; the rest of libass used the default -O2. For ass_bitmap.c, all tests included -march=native.

  • vectorized master: 12.4 s
  • vectorized (& loop-unswitched) float: 14.1 s (-ffast-math didn’t show a significant difference)
  • nonvectorized master: 17.4 s
  • nonvectorized float: 22.1 s
  • (nonvectorized) integer: 24.9 s, and statistical tests claim it’s significantly different from float, but there was more variance in the measurements than I’d have liked, so I’m not entirely convinced this difference truly isn’t just an unfortunate coincidence

Despite vectorization being on by default in GCC 12, it doesn’t actually show any benefit here at -O2 until two additional flags (which are part of -O3) are enabled: -funswitch-loops -fvect-cost-model=dynamic. No other flags I tried showed any significant difference.

Yeah, turns out -funswitch-loops is only on at -O3 in GCC. So maybe I really should unswitch that blurred check manually… or (gasp) accept mathematical imperfection and get rid of the !blurred path altogether.

On another note, neither GCC nor Clang (according to Godbolt) ever try using 32-by-16-bit integer division instructions in this code on x86; instead, they use 64-by-32-bit division. There’s a chance that they’d be faster, although I’m guessing (without knowing anything) maybe not on recent CPUs.

M1 Max (AArch64/ARMv8.5-A), Apple clang version 14.0.0 (clang-1400.0.29.202), arm64-apple-darwin22.3.0 (tests by @rcombs):

branchless int, -O3:       15.33s user 3.75s system 19.109 total
branching  int, -O3:       16.84s user 3.61s system 20.471 total

branching  int, -O2:       20.27s user 6.26s system 26.648 total
branchless int, -O2:       27.52s user 4.82s system 32.397 total

float, -O3:                 9.23s user 3.01s system 12.307 total
float, -O2 (vectorized):   11.82s user 3.76s system 15.620 total

float, -O3 -fno-vectorize: 18.51s user 4.51s system 23.056 total
float, -Os:                23.89s user 3.20s system 27.107 total
float, -O2 -fno-vectorize: 24.11s user 4.08s system 28.295 total

Elsewhere on the Internet, microbenchmarks have shown ARM64 to have slower float div than int div:

It is possible that these benchmarks are setup incorrectly per se, or perhaps they’re just not directly applicable to our use case because we do additional arithmetic here besides the division itself (even if not a lot)—or perhaps even because of the need to work around division by zero in one way or another, which adds instructions in the integer case but comes for free with floats.

Overall, it seems that float continues to fare better, or at least not worse, than integer division in all cases. Actually, non-vectorized float on the M1 does seem slower than int? Whew, that restores a bit of my sanity.

On the other hand, we should probably have hand-written assembly code for this to avoid depending on finicky compiler optimizations that have an effect this big. And at that point, maybe it would be wiser, actually, to keep the C code based on pure integers? Or not?

We haven’t tried LUTs here (as @MrSmile did last time); maybe we should.

@MrSmile
Copy link
Member

MrSmile commented Feb 16, 2023

I really should unswitch that blurred check manually…

Actually it's not a boolean parameter: there're small enough blurs to be between true and false. I'm thinking of replacing boolean with interpolation parameter (0–255):

unsigned num = 255 * FFMAX(o[x] - g[x], 0);
num = 255 * num + (o[x] * (255 - g[x]) - num) * blurred;

If there's no significant performance impact, I would prefer continuous version instead of two discrete (also it's one asm function instead of two).

neither GCC nor Clang (according to Godbolt) ever try using 32-by-16-bit integer division instructions in this code on x86

It's executed on the same hardware anyway, so mostly instructions size matter in this case.

maybe it would be wiser, actually, to keep the C code based on pure integers?

There's a tendency to vectorize only float divisions, so I think generic float version would be better.

I leave here my old table version if anybody wants to experiment with it.

@astiob
Copy link
Member Author

astiob commented Feb 16, 2023

Actually it's not a boolean parameter: there're small enough blurs to be between true and false. I'm thinking of replacing boolean with interpolation parameter (0–255)

What would be the meaning of this parameter, and how would you compute it? I don’t know how to assign meaning to anything other than a Boolean.

It's executed on the same hardware anyway, so mostly instructions size matter in this case.

Old CPUs have distinctly lower timings for shorter-operand variants in Agner Fog’s tables. Recent CPUs, though, have them mostly the same (for all lengths except 128-div-64) or even a little worse for the 32-div-16 variant than for 64-div-32; maybe that’s why GCC and Clang aren’t using it.

@astiob
Copy link
Member Author

astiob commented Feb 16, 2023

It has just occurred to me that there’s another opportunity to explore here. Most of the pixels have either g[x] == 255, where the entire pixel is covered by the glyph body and the outline is invisible, or g[x] == 0, where the pixel is fully outside the glyph body and contains only the border. In either case, division can be avoided entirely.

Code here. This does prevent any kind of vectorization, but it is almost as fast with integers as the vectorized float version and faster than current unvectorized master on my Ryzen:

  • vectorized float: 14.1 s
  • nonvectorized int with short-circuit branches avoiding division: 15.1 s

@MrSmile
Copy link
Member

MrSmile commented Feb 16, 2023

What would be the meaning of this parameter, and how would you compute it?

It would be "effective blurriness". I'm going to do some math to find a good approximation for it depending on blur radius, but we can use only 0 and 255 initially. Simplified lerp formula:

unsigned num = FFMAX(65025 * (o[x] - g[x]) + blurred * (255 - o[x]) * g[x], 0);

It has just occurred to me that there’s another opportunity to explore here.

Not sure that more complicated source worth the gain. I think that better to postpone all optimizations til assembly implementation. On the current stage we should decide a good reference algorithm to simplify future assembly testing. Do float and int branch produce exactly the same results?

Comment on lines +236 to +237
float den = over65025 - alpha_g * g[x];
o[x] = num / den + 0.5f;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can match the integer version perfectly, at negligible perf impact, like this:

Suggested change
float den = over65025 - alpha_g * g[x];
o[x] = num / den + 0.5f;
float den = 65025 - alpha_g * g[x];
o[x] = FFMIN(num / den, 255.f) + 0.5f;

Copy link
Member Author

@astiob astiob Feb 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assumes we have IEEE… uh… no, does this even work?

If den is 0, then num is 0, too. With IEEE float, num / den is NaN, and this FFMIN returns NaN (if I’m reading its definition correctly—as much as it doesn’t expect NaNs as inputs in the first place), and the conversion to integer upon assignment is UB by the standard (and not-zero in practice: mind, the exact result value doesn’t even matter if the ASS_Images are overlaid one on top of the other as this code assumes, but this doesn’t match the integer code).

TL;DR isn’t the FFMIN useless due to NaN

Copy link
Member Author

@astiob astiob Feb 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and not-zero in practice

Actually, it probably is zero, at least on x86? The conversion probably goes through int32, giving INT32_MIN on x86, and then truncated to int8, giving zero?

In that case, this is actually a valid trick to use in hand-written assembler code that could target any C implementation that explicitly guards against division by zero.

But in C itself, no, I don’t think this is allowed. In fact, §6.5.5 Multiplicative operators explicitly declares any division by zero UB in any case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, instead of FFMIN we could use

float den = 65025 - alpha_g * g[x];
o[x] = num / (den + 1e-30f) + 0.5f;

This should have a significantly lower impact than over65025.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or we could do float divres = (num / den); o[x] = (divres >= 0.f ? divres : 0.f) + 0.5f;.

Are we aware of any real targets we care about that don't provide IEEE floats? I'm perfectly comfortable having those as a requirement.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don’t know, but I’m not a specialist in other platforms. My only worthless contribution is that IIRC GPUs don’t (or didn’t), but I’d be surprised if libass could be compiled for a GPU.

But for starters, -ffast-math, which e. g. MrSmile uses for his own builds, is a common way to throw NaN explicitly out the window. It probably wouldn’t affect this line simply because the most trivial thing for the compiler to do is exactly the thing we find most desirable, but I’m not a fan of “probably usually maybe correct but explicitly unguaranteed” code. This function doesn’t seem worth bumping libass’s requirements to include IEEE compliance, and it might well harm us in other places (I assume MrSmile uses -ffast-math not for nothing).

For another thing, neither GCC nor Clang (nor MSVC) actually fully support Annex F nor define __STDC_IEC_559__ (although some platforms have system headers that define it instead—which isn’t correct), so this remains technically undefined behaviour. Indeed, Clang even used to complain about it with -fsanitize=undefined. But again, it’s true that, in practice, their optimizers do ensure 0/0=NaN, except when this is disabled via -ffast-math etc. (and so Clang’s UBSan excludes this check by default since Clang 9). [17000] [19535]

den + 1e-30f

I like this idea (if we stick to dividing every pixel).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should have a significantly lower impact than over65025.

Remind me: what is the impact of over65025 in this context? The nextafterf call itself? But it happens only once, whereas + 1e-30f would have to happen in each iteration.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

iirc the call to nextafterf was very expensive, while the constant adds are quite cheap? But also, the nextafterf doesn't quite match the int implementation in all cases, while the add does.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remind me: what is the impact of over65025 in this context?

Result change. Basically + 1e-30 only matters for den == 0, while over65025 can shift values even in nonzero case.

@astiob
Copy link
Member Author

astiob commented Feb 16, 2023

I’ve rerun some tests in a slightly more stable environment. I ran each test 30 times on the Ryzen. Most of the results still annoyingly failed normality tests, so take my significance tests with a grain pile of salt, but I tried dropping some (at worst as many as half) of the worst times to get the rest to resemble normality and analysing just them. Statistical tests keep saying all sets were significantly different, but I don’t trust that the smaller differences were actually caused by the changes in the code:

  • 11.3 s: unconditional vectorized division as in this PR, think (float) num / nextafter_den + 0.5f, with -fno-fast-math (using vdivps)
  • 11.4 s: conditional division with (float) (num + den / 2) / den
  • 11.7 s: conditional division with (float) num / den + 0.5f
  • 11.7 s: unconditional vectorized division as in this PR with -ffast-math (using vrcpps)
  • 13.2 s: conditional division with (num + den / 2) / den in integers

Note how integer division yet again continues to appear slower than floating-point division even when no vectorization is in play.


It would be "effective blurriness". I'm going to do some math to find a good approximation for it depending on blur radius

This is the part I don’t understand: what is the physical meaning of this? what is the reference value you’re going to try approximating? how?

If it’s just “what subjectively looks good”, in all honesty I expect the current blurred branch already probably looks plenty good in all cases. But if it’s something more physically profound, it isn’t obvious to me how anything better can be achieved here with a single parameter per image.

Not sure that more complicated source worth the gain. I think that better to postpone all optimizations til assembly implementation. On the current stage we should decide a good reference algorithm to simplify future assembly testing.

I disagree. Surely we want the fastest code, not the simplest? We’ve already added the float stuff only for this reason, which IMO makes it more complicated: just look at the size of that comment explaining over65025 alone. And if the assembler code is gonna be fast, does that mean we can make the C code slow? In fact, shouldn’t we be doing the opposite and optimizing the C code for maximum performance—but on platforms where we aren’t gonna have any hand-written assembler code? Branches may be “slow”, but they’re surely faster than division, plus some platforms don’t even have hardware division.

By the way, if it seems scarily complicated due to the switch, I could also rewrite it using if (I figured the switch was clearer) and/or even keep the variable initializers unconditional (and hope the optimizer reorders them) and use an inline conditional in the assignment.

And given just how fast it is, maybe we don’t actually need any hand-written assembler code at all?

Then again, I suppose you could hand-write vectorized assembly that also performs this check against the whole vector: if it’s all zeros or all 255s, no division is necessary. That might be faster still.

Do float and int branch produce exactly the same results?

No, not with the nextafter(65025) to avoid the division by zero, which causes some quotients to be rounded towards zero in the float code but away from zero in the integer code. We may or may not be able to replace it by an added conditional somewhere like @rcombs tried in the inline comment above.

All pure-integer variants do match each other, including the conditional-division code I proposed in my last comment, and so does the conditional-division code using float for the division itself.

Speaking of rounding, it probably isn’t even optimal to always round to the nearest integer to begin with: ideally we’d round to the nearby integer that gives the more accurate value for the final colour after both images are blended together. But I haven’t tried doing the maths for this.

On another note, @rcombs reported another speedup (77% total time, or 1.3× total speed) on the M1 Max by doing all computations in float16 in the range 0‥1. Of course, this implies some precision loss, and given that we’re working with just 8 bits of precision to begin with, I’m slightly reluctant to accept it, but it’s probably entirely fine.

@astiob
Copy link
Member Author

astiob commented Feb 16, 2023

On the M1:

vectorized float:   9.27s user 3.06s system 99% cpu 12.365 total
conditional int:   11.35s user 2.59s system 99% cpu 13.957 total
conditional float: 11.51s user 2.67s system 99% cpu 14.233 total

@TheOneric
Copy link
Member

For this test, I recompiled only ass_bitmap.c; the rest of libass used the default -O2. For ass_bitmap.c, all tests included -march=native.

fyi, gcc has a optimize function attribute to apply a different optimisation level and supposedly also many of the -f… option only to a specific function. Perhaps it’s worth using this to avoid tainting the result by changes in other bitmap functions?
For affecting all following functions there’s also pragma GCC optimize("O0") for gcc and #pragma clang optimize off for clang (see fuzz/fuzz.c). I don't know whether clang also supports optimize as a function attribute.

@astiob
Copy link
Member Author

astiob commented Feb 17, 2023

I suppose you could hand-write vectorized assembly that also performs this check against the whole vector: if it’s all zeros or all 255s, no division is necessary. That might be faster still.

FWIW I’ve now tried and measured this. With a simple 8-pixel handling loop (using AVX2 intrinsics) and with the added conditional skip on “all eight bytes = 0 or all eight bytes = 255”, the time has gone down to 10.4 s (vs 11.3–11.7 s for the unconditional vectorized code). And I won’t be surprised if it can be written better and faster still by someone who is better at SIMD assembly.

@MrSmile
Copy link
Member

MrSmile commented Feb 17, 2023

the time has gone down to 10.4 s

It's mostly due to synthetic nature of sample. I suspect that in real world cases with letters it would be actually worse (for small letters due to never hitting uniform 0/255 case and for large letters due to branch mispredictions).

Actually, even scalar 0/255 variant can be slower on real world cases, better to test it first.

@TheOneric
Copy link
Member

TheOneric commented Feb 18, 2023

Summarising 32bit ARM tests, mostly already discussed on IRC:
(Skip to the conclusion if it’s too much text)

On all 32bit ARM except armv7ve (extended version of armv7-a), there exists no native instruction for integer division; instead a library routine gets called. Any ARM CPU with a hardware FPU provides an instruction for floating point division.

Since we already concluded a FPU is required to use libass at reasonable performance anyway, we probably don't need to be concerned about performance on software-float CPUs.
At this point it might seem like for 32bit ARM, fp would be better since surely a native fp div instruction should be faster than a library routine implementing integer division... However it turns out there is actually at least one common'ish CPUs(+OS) for which this does not hold:

For all tests:

  • Compiler options applied to the whole build
  • for libass tests: times are in milliseconds and averages of several runs

32bit tests

  • Alpine Linux 3.17.2 "armv7" userland (armv7-a with hard-float ABI)
  • gcc 12.2.1_git20220924-r4
  • clang 15.0.7
  • profile settings: start=0, end=3, fps=10
  • per config identical profile binaries were used across all machines

Config options:

armv7-a+fp+neon_O3_clang        ./configure --enable-profile CC="clang -O3 -march=armv7-a+fp -mfpu=neon" CFLAGS="-O3 -g" --host="armv7-unknown-linux-musl"
armv7-a+fp+neon_O3_gcc          ./configure --enable-profile CC="gcc -O3 -march=armv7-a+fp+simd -mfpu=neon" CFLAGS="-O3 -g" --host="armv7-unknown-linux-musl"
armv7ve+fp+neon_O3_clang        ./configure --enable-profile CC="clang -O3 -march=armv7ve+fp -mfpu=neon" CFLAGS="-O3 -g" --host="armv7-unknown-linux-musl"
armv7ve+fp+neon_O3_gcc          ./configure --enable-profile CC="gcc -O3 -march=armv7ve+fp+simd -mfpu=neon" CFLAGS="-O3 -g" --host="armv7-unknown-linux-musl"

clang doesn't accept +simd; gcc docs state this enables "Advanced SIMD (Neon)"; -mfpu=neon should still ensure it is part of the arch baseline.

.

Average Timings on the Cortex-A73 have a run-to-run variance of <30ms. For the A53 it’s a bit more but not enough to change the relative order between int and float.

Cortex-A53 running a 32bit (RaspberryPi 3 B+) @ 1.4GHz:
libass:

armv7-a+fp+neon_O3_clang_float          9173
armv7-a+fp+neon_O3_clang_int-noBranch   6861
armv7-a+fp+neon_O3_gcc_float            9416
armv7-a+fp+neon_O3_gcc_int-noBranch     7148

armv7ve+fp+neon_O3_clang_float          9096
armv7ve+fp+neon_O3_clang_int-noBranch   4924
armv7ve+fp+neon_O3_gcc_float            9413
armv7ve+fp+neon_O3_gcc_int-noBranch     5394

The results have been omited, but even int_withBranching managed to beat float here — even without a native int div instruction!

lemire.me:

armv7ve+fp
    int32_constant  0.007 clock units
    int64           0.295
    float64         0.048
    int32           0.017 (!!)
    int32_viaf64    0.055
    float32         0.045
armv7-a+fp
    int32_constant  0.009 clock units
    int64           0.301
    float64         0.047
    int32           0.096 (!!)
    int32_viaf64    0.055
    float32         0.045

As in previous benchmarks, the results from the simple testcase don't replicate the ordering of actual ass_fix_outline implementations. Just this time it goes in the other direction for armv7-a.

Cortex-A73 running 32bit @ 1.8GHz

armv7-a+fp+neon_O3_clang_float          2360
armv7-a+fp+neon_O3_clang_int-noBranch   3125
armv7-a+fp+neon_O3_gcc_float            2594
armv7-a+fp+neon_O3_gcc_int-noBranch     3421

armv7ve+fp+neon_O3_clang_float          2388
armv7ve+fp+neon_O3_clang_int-noBranch   2154
armv7ve+fp+neon_O3_gcc_float            2597
armv7ve+fp+neon_O3_gcc_int-noBranch     2512

Notes:

  • We afterwards noticed both compilers never autovectorised the code. Looking over the docs again I discovered:

    If the selected floating-point hardware includes the NEON extension (e.g. -mfpu=neon), note that floating-point operations are not generated by GCC’s auto-vectorization pass unless -funsafe-math-optimizations is also specified. This is because NEON hardware does not fully implement the IEEE 754 standard for floating-point arithmetic (in particular denormal values are treated as zero), so the use of NEON instructions may lead to a loss of precision.

  • Tests on Godbolt still showed no usage of q* or d* registers for the float division even with an additional -funsafe-math-optimizations parameter.

  • Looking at the disassembled __udivsi3 routine used in armv7-a int implementations, it doesn't seem like it detects availablitiy of and just uses the native instruction.

  • Yes, the binaries do use hardware floating point instructions.

64bit test

For comparison, the same Cortex-A73 running an aarch64 build:

  • -O3 -g
  • clang 15.0.7
  • bionic libc
  • profile settings: start=0, end=10, fps=20

Cortex-A73 aarch64 @ 1.8GHz
libass:

float:          13906
int_noBranch:   14945
int_withBranch: 19791

I didn't record disassembly for those binaries, but tests on Godbolt suggests no autovectorisation of the float div occured. Perhaps the better float performance is due to fundamental differences between 32bit and 64bit ARM.

lemire.me:

    int32_constant  0.001 clock units
    int64           0.004
    float64         0.010
    int32           0.004
    int32_viaf64    0.010
    float32         0.004

Here too, the simple div-only-function benchmark from lemire does not correlate to the performance of the actual int/float-based ass_fix_outline.

Conclusion?

Cortex-A53’s FPU performs surprisingly poor, falling behind even software integer division. A more recent Cortex-A73 fares better in 32bit mode, beating non-native int and almost matching native int. In 64bit mode float again outperforms (native) int on the A73 (no 64bit tests easily possible atm for the A53).
Both tested CPUs support aarch64.

Compilers apparently won't use NEON for autovectorisation by default due to incompatibilities with IEEE fp.

Ordering in simple float/int div benchmarks doesn't map to the same ordering in actual float/int-based ass_fix_outline.

Cortex-A53 constitutes the first instance we've seen where float performs worse than int on real-world hardware with hardware float support. Yet, this will in the end probably depend on the particular model even for 32bit-only CPUs.

As it can go either way this probably doesn't help in deciding on whether to use integers or float for the C implementation.

@TheOneric
Copy link
Member

Now also results for a genuine 32-bit-only CPU (Cortex-A9, ARMv7-a+fp+neon+sec+mp, no fp16), though I doubt this could handle video playback and non-trivial subs at 720p. Average timings for libass tests vary by about 50ms between sets of runs. Within a run set, sometimes a few runs took 100-250ms longer, presumably due to some service process(es) competing for CPU.

perf settings: start=0, end=3, fps=3

armv7-a+fp+neon_O3_clang_int-noBranch  2506.00
armv7-a+fp+neon_O3_gcc_int-noBranch    2733.33
armv7-a+fp+neon_O3_clang_float         3221.67
armv7-a+fp+neon_O3_gcc_float           3243.33

    short add: 2.729927
    short sub: 2.702403
    short mul: 3.256053
    short div: 6.704432
     long add: 1.623491
     long sub: 1.617539
     long mul: 2.691853
     long div: 5.008054
long long add: 2.876212
long long sub: 3.234222
long long mul: 5.526823
long long div: 8.787050
    float add: 3.226642
    float sub: 3.230616
    float mul: 3.672505
    float div: 5.923851
   double add: 3.224513
   double sub: 3.228489
   double mul: 3.766996
   double div: 8.625440

@TheOneric
Copy link
Member

Not directly related to the patch itself, but mentioning here since all discussion about future default compiler flags also happened in this PR.

With checkasm --bench I just encountered an instance in libass where -O3 signififcantly regresses performance over -O2 (+ auto-vec) with gcc and slightly with clang (but clang didn't perform as good to begin with):

Rough timing averages ± about one second

gcc 10
======
-O2 -g (default)        75s (does *not* include any auto-vectorisation)
-O3 -g -ffast-math      47s (with and without LTO)
-O3 -g                  48s
-O2 -ftree-vectorize    26s (default cost-model was "cheap"

gcc 12
======
-O2 -g (default)        30s (includes auto-vectorisation with cost-model "very-cheap")
-O2 -g -ftree-vectorize -fvect-cost-modely=cheap
                        27s
-O2 -g -ftree-vectorize -fvect-cost-modely=dynamic
                        25s
-O2 -g -ftree-vectorize -fvect-cost-modely=dynamic -march=native -mtune=native
                        23s
-O3 -g                  51s
-O3 -g -march=native -mtune=native
                        49s

clang-15
========
-O2 -g (default)        34s
-O2 -g + LTO            32s
-O3 -g                  36s
-O3 -g -march=native -mtune=native
                        37s

To be fair, I haven't seen -O3 regresses performance in actual profile runs before — but those all used assembly, so the perf-critical functions tested in checkasm weren't influenced by the compiler to begin with.

Full logs with timings of individual functions for gcc-12 -O(2|3) -g -fvect-cost-model=dynamic and clang-15 -O(2|3) -g in case anyone’s interested: 483_checkasm-bench-logs.tar.gz

@rcombs
Copy link
Member

rcombs commented Mar 19, 2023

Ultimately, this just means this function is an even stronger candidate for explicit vectorization in assembly.

@astiob
Copy link
Member Author

astiob commented Apr 7, 2024

It would be "effective blurriness". I'm going to do some math to find a good approximation for it depending on blur radius, but we can use only 0 and 255 initially.

@MrSmile Is this still something you’re willing to look into—is it worth implementing this into the code (C and assembly) now? I’m guessing it’s an additional minor slowdown, so it would be a bit of a waste to add it and not use it; but on the other hand, modifying the several implementations across C & assembly later also wouldn’t be too fun.

@MrSmile
Copy link
Member

MrSmile commented Apr 8, 2024

I think for now it would be better to use blur radius (well, we have two, so some average value like (blur_r2x + blur_r2y) / 2) directly instead of boolean blurred. At first we can trigger one of the two implementations based on some separation value (I'll calculate the exact value later). In the future we can switch to more continuous approach.

For 1D case the exact solution would be something like this.
Helper function:
$$F(x)=\frac x2 + \frac x2\mathrm{erf}\left(\frac x{\sqrt{2}}\right)+\frac1{\sqrt{2\pi}}e^{-\frac{x^2}2}$$
Solve for $x$:
$$F\left(x+\frac1r\right)-F(x)=\frac or$$
Calculate output value:
$$F\left(x+\frac{1-g}r\right)-F(x)=\frac {o_1}r$$
Here $o$ and $g$ are input values, $o_1$ — output, $r$ — blur radius.
I'll try to devise some crude approximation for this later, but I think PR can be merged without that.

UPD: switching between blurred = false and blurred = true should occur at $\sigma\approx0.34$ (standard deviation of gaussian blur) and maximal approximation error (in case model assumptions holds) would be ≤ 32 pixel value.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
4 participants