Fix fuzzer-found issues (Part 1) #126

LaurenzV · 2024-07-09T18:45:27Z

Okay, so what I've been doing is the following: The basic idea is to try to fuzz against as many fonts as possible and check whether the output from harfbuzz matches the output from rustybuzz. For that, we need two things:

1. A corpus of fonts
2. A collection of inputs

Regarding 1., the most obvious choice was to download the Google Fonts collection, since it's freely available. The disadvantage is that we are not including CFF fonts this way, but it's still a very solid starting point. So I basically downloaded the fonts and excluded any fonts that contain keywords such as "bold", "italic" etc., so as not to test several variants of the same font. Overall, this still led to 1000+ (more or less) unique fonts to choose from.
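The keyword-based exclusion can be sketched roughly as follows. This is only an illustration, not the script actually used: the keyword list beyond "bold" and "italic", the matching against file names, and the helper names `is_base_variant` and `select_fonts` are all assumptions.

```python
from pathlib import Path

# Assumed keyword list; the description only names "bold" and "italic" explicitly.
STYLE_KEYWORDS = ("bold", "italic")


def is_base_variant(font_path: Path) -> bool:
    """Return True if the file name contains none of the style keywords."""
    name = font_path.stem.lower()
    return not any(keyword in name for keyword in STYLE_KEYWORDS)


def select_fonts(paths):
    """Keep only fonts that do not look like a style variant, in sorted order."""
    return sorted(p for p in paths if is_base_variant(p))


# Hypothetical file names, used only to demonstrate the filter.
candidates = [
    Path("Roboto-Regular.ttf"),
    Path("Roboto-Bold.ttf"),
    Path("Roboto-BoldItalic.ttf"),
    Path("NotoSansTaiTham-Regular.ttf"),
]
selected = select_fonts(candidates)
print([p.name for p in selected])
```

A substring match like this is crude (it would also drop a font whose family name happens to contain "bold"), which is acceptable when the goal is only to thin out a large corpus.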

Regarding 2., in the beginning my idea was to basically generate random sequences of Unicode inputs based on the cmap table in the font. However, after a while I realized that this probably wouldn't be very efficient, because the odds of a random sequence triggering specific lookups in GPOS/GSUB are rather low. This made me realize that we should probably base our inputs on real words, since this is what will be shaped in practice after all. Luckily for us, I chanced upon a test corpus by harfbuzz which contains extensive word lists in many different languages and scripts, scraped from Wikipedia. A perfect input for what we want to achieve.

I removed some Latin-based scripts because they are more or less the same, and I also truncated some Latin-based input files because they were just so long (the English one had 22 million lines). In the end, I used the following languages with the following corresponding number of lines (=words):

[('or.txt', 42329),
 ('lo.txt', 53644),
 ('pa.txt', 60747),
 ('as.txt', 67460),
 ('dv.txt', 98777),
 ('am.txt', 118901),
 ('ur.txt', 189985),
 ('bo.txt', 208469),
 ('si.txt', 271847),
 ('km.txt', 299124),
 ('mr.txt', 344638),
 ('bn.txt', 354188),
 ('hy.txt', 360014),
 ('gu.txt', 366457),
 ('en.txt', 500000),
 ('ru.txt', 500000),
 ('sr.txt', 500000),
 ('tr.txt', 500000),
 ('vi.txt', 500000),
 ('uk.txt', 500000),
 ('ko.txt', 500000),
 ('is.txt', 500000),
 ('ka.txt', 644301),
 ('hi.txt', 707394),
 ('el.txt', 897133),
 ('kn.txt', 951913),
 ('te.txt', 970573),
 ('ml.txt', 1048334),
 ('ta.txt', 1091754),
 ('bg.txt', 1117542),
 ('my.txt', 1123883),
 ('he.txt', 1332080),
 ('fa.txt', 1360750),
 ('ar.txt', 1850081),
 ('th.txt', 4000000)]
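A summary like the list above could be produced along these lines. This is a sketch under assumptions: it keeps the first N lines of each list (the description does not say how the truncation was done), it works on in-memory lists instead of files, and the helper names are made up for illustration.

```python
from itertools import islice


def cap_word_list(words, limit):
    """Keep at most `limit` entries from an iterable of words."""
    return list(islice(words, limit))


def summarize(word_lists, limit):
    """Cap every word list and return (file name, line count) pairs sorted by count."""
    capped = {name: cap_word_list(words, limit) for name, words in word_lists.items()}
    return sorted(((name, len(words)) for name, words in capped.items()),
                  key=lambda item: item[1])


# Tiny made-up word lists; the real cap in the list above is 500000.
demo = {
    "en.txt": ["a", "b", "c", "d", "e"],
    "or.txt": ["x", "y"],
    "ru.txt": ["p", "q", "r", "s"],
}
summary = summarize(demo, limit=3)
print(summary)
```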

The ones with 500,000 lines are the ones I truncated because they would otherwise just be too long. You might ask why I then kept some other languages with more than a million, and the simple answer is that

Those languages have fewer fonts in the corpus, so to counterbalance this we use more inputs for each font.
I felt like the odds of finding errors for Latin-based scripts are not much higher with much larger inputs, since the script is pretty simple... Other languages are way more interesting and more error-prone.
See below.

After I had this, the most challenging part was to figure out which font to use for which languages. Trying all combinations is not feasible time-wise and also a waste (e.g. if we tried to shape some Arabic text with a font that only covers Latin characters, we would only get .notdefs anyway).

My basic approach for this was: For each text file, I took a sample of 100 lines and collected all of the characters that appeared in them. For each font, I checked whether its cmap table covers more than 80% of those characters, and if so I used this combination as a test case. Overall that seemed to work pretty nicely, but a problem was that nearly all fonts contain Latin characters, so any Latin-based language would get a lot of fonts. That's another reason I excluded many Latin-based scripts. I also ensured that fonts in general are only matched with one language, excluding a number of languages that didn't have many fonts assigned to them. By doing this, I still had a lot of "garbage assignments" (e.g. NotoSansTaiTham being used for English), but at least I could ensure that every font that does support one of the smaller languages is also used for it.
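The coverage check described above can be sketched like this. It is a minimal illustration, not the actual matching script: the cmap is modelled as a plain set of code points (in practice it would be read from the font with a font-parsing library), the sampling is seeded for reproducibility, and all function names are hypothetical.

```python
import random

SAMPLE_SIZE = 100          # lines sampled per text file, as described above
COVERAGE_THRESHOLD = 0.8   # the cmap must cover more than this fraction


def sample_characters(lines, sample_size=SAMPLE_SIZE, seed=0):
    """Collect the set of characters appearing in a random sample of lines."""
    rng = random.Random(seed)
    sample = rng.sample(lines, min(sample_size, len(lines)))
    return {ch for line in sample for ch in line.strip()}


def coverage(cmap_codepoints, characters):
    """Fraction of the sampled characters whose code point is in the cmap."""
    if not characters:
        return 0.0
    covered = sum(1 for ch in characters if ord(ch) in cmap_codepoints)
    return covered / len(characters)


def is_test_case(cmap_codepoints, characters, threshold=COVERAGE_THRESHOLD):
    """A font/text pairing qualifies if coverage exceeds the threshold."""
    return coverage(cmap_codepoints, characters) > threshold


# Made-up stand-in for a Latin-only font's cmap.
latin_cmap = {ord(c) for c in "abcdefghijklmnopqrstuvwxyz"}
english_chars = sample_characters(["hello", "world", "fuzz"])
arabic_chars = sample_characters(["سلام", "خط"])

print(is_test_case(latin_cmap, english_chars))  # Latin font, English words
print(is_test_case(latin_cmap, arabic_chars))   # Latin font, Arabic words
```

Sampling only 100 lines keeps the check cheap, at the cost of occasionally missing rare characters; the 80% threshold leaves some slack for exactly that.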

You can find the resulting pairings here: https://gist.github.com/LaurenzV/1d528deabfe4e7d00d248e2f7281482a

And now the last step is to just go through all those combinations and compare the outputs! So far, I've already been able to find around 8 bugs in rustybuzz and (potentially?) 2 bugs in harfbuzz, which is not too impressive but not too bad either. Some of those bugs were really niche though (for example, one was caused by one wrong letter in the indic table!), so I do feel like this is a pretty effective method for testing the crate and should give us much more confidence about its correctness. I'm still far from done, although I think the remaining languages are mostly "simple" ones where I don't expect too many bugs to be present, but we will see. I will probably split the work over multiple PRs, depending on how many bugs I can still find.

For each bug I find, I'm also adding a new test case. I try my best to always subset them, but unfortunately so far subsetting the font nearly always "destroyed" the bug, so I had to include the full one. But they are pretty small anyway, so I hope that's okay.

Future work (sorted by priority, although no promises when or even if I will work on it) includes:

Finish going through all combinations above.
Try fuzzing against Windows system fonts.
Try fuzzing against macOS fonts.
Try to vary other aspects in the fuzzer, such as variation coordinates for variable fonts or text direction.

LaurenzV · 2024-07-10T08:15:50Z

@RazrFalcon Do you know if there is a particular reason we don't use the unicode_normalization crate for composing/decomposing characters? Or has just no one bothered to switch to it?

LaurenzV · 2024-07-10T08:28:03Z

Looks like you consciously removed it: f0e5a766

However, there is a new crate from the icu4x folks, anything speaking against using it directly? The reason I'm asking is that there is something wrong with our current table. 😅 I presume this could be fixed by improving the generation, but I don't see why we should do that if someone else already did it. It does depend on tinyvec, but no dependencies otherwise.

RazrFalcon · 2024-07-10T08:33:13Z

As you can guess, I do not remember. It was a long time ago. But I do remember that we had some issues with external crates. Either they weren't low-level enough or were producing different output to HB.

If you can replace embedded Unicode tables - I'm all for it.

In general, a rule of thumb when it comes to RB: if something is strange then it's because we had to match HB output.

RazrFalcon · 2024-07-10T08:34:48Z

Also remember that HB/RB has its own unicode normalization algorithm. We cannot use a third-party crate for that.

LaurenzV · 2024-07-10T08:36:50Z

> Also remember that HB/RB has its own unicode normalization algorithm. We cannot use a third-party crate for that.

Yep, that I know. But perhaps I now know the reason why: it seems like harfbuzz always decomposes a character into 2 units, while the unicode_normalization crate always decomposes as far as possible, which could yield more than 2... So I'll have to see if I can figure it out.
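The difference between a single decomposition step and a full decomposition can be illustrated with Python's standard `unicodedata` module. This is only an illustration of the general Unicode behaviour, not harfbuzz's or rustybuzz's code; the helper names are made up. A character's canonical decomposition mapping has at most two code points, while full NFD applies the mappings recursively.

```python
import unicodedata


def single_step(ch):
    """Apply the character's canonical decomposition mapping once."""
    mapping = unicodedata.decomposition(ch)
    if not mapping or mapping.startswith("<"):
        # No mapping, or a compatibility mapping (tagged like "<compat>"): leave as-is.
        return [ch]
    return [chr(int(code_point, 16)) for code_point in mapping.split()]


def full_decomposition(ch):
    """Recursively decompose via NFD normalization."""
    return list(unicodedata.normalize("NFD", ch))


ch = "\u01d5"  # LATIN CAPITAL LETTER U WITH DIAERESIS AND MACRON
one_step = single_step(ch)           # U+00DC, U+0304
all_steps = full_decomposition(ch)   # U+0055, U+0308, U+0304

print([f"U+{ord(c):04X}" for c in one_step])
print([f"U+{ord(c):04X}" for c in all_steps])
```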

LaurenzV · 2024-07-10T10:15:49Z

@behdad Is it expected that HB_NO_OT_RULESETS_FAST_PATH changes the shaping result? With the following font when running

hb-shape NotoSerifGujarati-VariableFont_wght.ttf --no-glyph-names --unicodes U+0ABE,U+0AA8,U+0ACD,U+200D,U+0AA4,U+0ABF
I get
[414=0+596|60=0+251|61=1+251|186=1+293|3=1+0|38=1+543]

while if I enable HB_NO_OT_RULESETS_FAST_PATH I get
[414=0+596|60=0+251|102=1+251|186=1+293|3=1+0|38=1+543]
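To pinpoint where two such results diverge, the bracketed output can be parsed and compared item by item. The sketch below assumes the `glyph=cluster+advance` form shown above (with an optional `@x,y` offset); it is not a complete parser for every hb-shape output option, and the function names are hypothetical.

```python
import re

# glyph=cluster, optional @x_offset,y_offset, then +x_advance with optional ,y_advance
ITEM_RE = re.compile(
    r"^(?P<glyph>[^=]+)=(?P<cluster>\d+)"
    r"(?:@(?P<offset>-?\d+,-?\d+))?"
    r"\+(?P<advance>-?\d+(?:,-?\d+)?)$"
)


def parse_shape_output(text):
    """Parse '[g=c+a|g=c+a|...]' into (glyph, cluster, offset, advance) tuples."""
    items = []
    for part in text.strip().strip("[]").split("|"):
        match = ITEM_RE.match(part)
        if match is None:
            raise ValueError(f"unrecognized item: {part!r}")
        items.append((match["glyph"], int(match["cluster"]),
                      match["offset"], match["advance"]))
    return items


def first_difference(left, right):
    """Return (index, left_item, right_item) for the first mismatch, or None."""
    for index, (a, b) in enumerate(zip(left, right)):
        if a != b:
            return index, a, b
    if len(left) != len(right):
        return min(len(left), len(right)), None, None
    return None


default_run = "[414=0+596|60=0+251|61=1+251|186=1+293|3=1+0|38=1+543]"
no_fast_path_run = "[414=0+596|60=0+251|102=1+251|186=1+293|3=1+0|38=1+543]"

diff = first_difference(parse_shape_output(default_run),
                        parse_shape_output(no_fast_path_run))
print(diff)
```

For the two results above, the only mismatch is the third item: glyph 61 versus glyph 102, with the same cluster and advance.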

RazrFalcon · 2024-07-10T15:01:44Z

> it seems like harfbuzz always decomposes a character into 2 units

Yes, this rings a bell.

LaurenzV · 2024-07-12T15:05:21Z

@RazrFalcon See the description at the top for a more in-depth explanation, I think this first part PR should be ready now (also if possible merge unsquashed, as I tried my best to make each fix a separate commit).

Blocked by harfbuzz/ttf-parser#164.

Fixes for harfbuzz/rustybuzz#126

RazrFalcon · 2024-07-13T07:17:59Z

Once again I cannot thank you enough for your work.

I completely agree with your methodology. I tried fuzzing RB a long time ago via AFL fuzz, but it was mostly useless. Simply throwing random data at a shaper doesn't work that well. And guided fuzzing is beyond my level.

If only we had something like resvg-test-suite, but for shaping. HB test suite is close, but as you saw barely scratches the surface.

Also, some of the bugs you have fixed are very strange. No idea how I was able to mess up feature flags like F_MANUAL_ZWJ. This was mostly copy-pasted code with Rust flavor. Either it was changed later or I've messed up badly.

And no, even a single fixed bug is more than enough. 8 is beyond good.
After all, the goal of RB is to be 1:1 with HB.

> The disadvantage is that we are not including CFF fonts this way

Google fonts do not use CFF? That's news to me.
On the other hand glyf/CFF should not affect shaping in 99% of the cases.

> Overall, this still led to 1000+ (more or less) unique fonts to choose from.

macOS alone has like 800 fonts pre-installed and most of them are insane and worth testing against. You will not be able to include them into tests, aka subset, but it's still worth testing.

LaurenzV · 2024-07-13T07:26:20Z

> You will not be able to include them into tests, aka subset, but it's still worth testing.

We can't include them in the repo, but since we have a macOS CI now we can test them there. :) But one step at a time. 😄

LaurenzV added 2 commits July 9, 2024 20:52

Fix incorrect assignment of glyph props

c201d0d

Update unicode_norm table

7997bbe

LaurenzV force-pushed the fuzzer-fixes branch from 0897d80 to cd799f7 Compare July 9, 2024 18:52

Remove exclusions to unicode norm table

3f602c6

LaurenzV force-pushed the fuzzer-fixes branch from cd799f7 to 3f602c6 Compare July 10, 2024 09:26

LaurenzV added 7 commits July 10, 2024 19:53

Fix bug in anchor matrix handling

73af5c9

Regenerate machines

a38d39f

Fix wrong category override in gen-indic-table.py

fda9bcc

Fix wrong assignment of flags in myanmar features

9d217f9

Fix another wrong feature assignment

8752ca7

Fix discrepancy in glyph bounding box calculation

3cae96a

Add missing font

c717d5b

qarmin mentioned this pull request Jul 11, 2024

Problematic files found by fuzzer #129

Open

LaurenzV added 3 commits July 12, 2024 14:56

Map glyphs in out info as well

5c1d655

Fix bug in buffer sync

3981f30

Add licenses

c5ac0e4

LaurenzV marked this pull request as ready for review July 12, 2024 15:04

LaurenzV changed the title ~~Fix fuzzer-found issues~~ Fix fuzzer-found issues (Part 1) Jul 12, 2024

RazrFalcon added a commit to harfbuzz/ttf-parser that referenced this pull request Jul 13, 2024

Merge pull request #164 from LaurenzV/fixes

85c1ff2

Fixes for harfbuzz/rustybuzz#126

Update ttf-parser dependency

dc61469

RazrFalcon merged commit eb9638d into harfbuzz:master Jul 15, 2024
2 checks passed

LaurenzV deleted the fuzzer-fixes branch July 15, 2024 08:02

LaurenzV mentioned this pull request Jul 15, 2024

Could rustybuzz be ported to run on fontations? googlefonts/fontations#956

Open
