Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

std.zig.tokenizer: simplification and spec conformance #20885

Merged
merged 3 commits into from
Aug 1, 2024

Conversation

andrewrk
Copy link
Member

@andrewrk andrewrk commented Jul 31, 2024

I pointed a fuzzer at the tokenizer and it crashed immediately. Upon inspection, I was dissatisfied with the implementation. This commit removes several mechanisms:

  • Removes the "invalid byte" compile error note.
  • Dramatically simplifies tokenizer recovery by making recovery always occur at newlines, and never otherwise.
  • Removes UTF-8 validation.
  • Moves some character validation logic to std.zig.parseCharLiteral.

Removing UTF-8 validation is a regression of #663, however, the existing implementation was already buggy. When adding this functionality back, it must be fuzz-tested while checking the property that it matches an independent Unicode validation implementation on the same file. While we're at it, fuzzing should check the other properties of that proposal, such as no ASCII control characters existing inside the source code, \r always followed by \n, etc.

Other changes included in this commit:

  • Deprecate std.unicode.utf8Decode and its WTF-8 counterpart. This function has an awkward API that is too easy to misuse.
  • Make utf8Decode2 and friends use arrays as parameters, eliminating a runtime assertion in favor of using the type system.

After this commit, the crash found by fuzzing, which was "\x07\xd5\x80\xc3=o\xda|a\xfc{\x9a\xec\x91\xdf\x0f\\\x1a^\xbe;\x8c\xbf\xee\xea" no longer causes a crash. However, I did not feel the need to add this test case because the simplified logic eradicates crashes of this nature.

Benchmark 1 (100 runs): before/zig ast-check /home/andy/dev/zig/src/Sema.zig
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          50.0ms ± 2.04ms    48.7ms … 57.4ms         14 (14%)        0%
  peak_rss           60.0MB ±  147KB    59.4MB … 60.2MB          3 ( 3%)        0%
  cpu_cycles          232M  ±  745K      230M  …  234M           3 ( 3%)        0%
  instructions        522M  ± 24.3       522M  …  522M           1 ( 1%)        0%
  cache_references   6.55M  ±  120K     6.39M  … 7.45M           2 ( 2%)        0%
  cache_misses        205K  ± 3.47K      198K  …  215K           1 ( 1%)        0%
  branch_misses      2.86M  ± 10.3K     2.80M  … 2.87M           9 ( 9%)        0%
Benchmark 2 (104 runs): after/zig ast-check /home/andy/dev/zig/src/Sema.zig
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          48.3ms ±  250us    48.1ms … 50.0ms         10 (10%)        ⚡-  3.3% ±  0.8%
  peak_rss           62.4MB ±  142KB    62.1MB … 62.6MB          0 ( 0%)        💩+  4.1% ±  0.1%
  cpu_cycles          227M  ±  637K      226M  …  230M           7 ( 7%)        ⚡-  1.9% ±  0.1%
  instructions        501M  ± 44.8       501M  …  501M           7 ( 7%)        ⚡-  4.0% ±  0.0%
  cache_references   6.65M  ±  141K     6.45M  … 7.67M           4 ( 4%)          +  1.5% ±  0.5%
  cache_misses        208K  ± 3.79K      201K  …  226K           3 ( 3%)          +  1.3% ±  0.5%
  branch_misses      2.84M  ± 8.62K     2.81M  … 2.86M           1 ( 1%)          -  0.4% ±  0.1%

Spec Conformance

I also noticed that the tokenizer did not conform to ziglang/zig-spec#38 so I fixed it.

@andrewrk andrewrk added breaking Implementing this issue could cause existing code to no longer compile or have different behavior. standard library This issue involves writing Zig code for the standard library. labels Jul 31, 2024
@Rexicon226
Copy link
Contributor

riscv backend failures unblocked by #20888

@andrewrk andrewrk changed the title std.zig.tokenizer: simplify std.zig.tokenizer: simplification and spec conformance Jul 31, 2024
@andrewrk andrewrk force-pushed the simplify-tokenizer branch from e31b4a8 to b61d9db Compare July 31, 2024 21:44
andrewrk added 3 commits July 31, 2024 16:57
I pointed a fuzzer at the tokenizer and it crashed immediately. Upon
inspection, I was dissatisfied with the implementation. This commit
removes several mechanisms:
* Removes the "invalid byte" compile error note.
* Dramatically simplifies tokenizer recovery by making recovery always
  occur at newlines, and never otherwise.
* Removes UTF-8 validation.
* Moves some character validation logic to `std.zig.parseCharLiteral`.

Removing UTF-8 validation is a regression of #663, however, the existing
implementation was already buggy. When adding this functionality back,
it must be fuzz-tested while checking the property that it matches an
independent Unicode validation implementation on the same file. While
we're at it, fuzzing should check the other properties of that proposal,
such as no ASCII control characters existing inside the source code.

Other changes included in this commit:

* Deprecate `std.unicode.utf8Decode` and its WTF-8 counterpart. This
  function has an awkward API that is too easy to misuse.
* Make `utf8Decode2` and friends use arrays as parameters, eliminating a
  runtime assertion in favor of using the type system.

After this commit, the crash found by fuzzing, which was
"\x07\xd5\x80\xc3=o\xda|a\xfc{\x9a\xec\x91\xdf\x0f\\\x1a^\xbe;\x8c\xbf\xee\xea"
no longer causes a crash. However, I did not feel the need to add this
test case because the simplified logic eradicates most crashes of this
nature.
these are illegal according to the spec
@andrewrk andrewrk force-pushed the simplify-tokenizer branch from b61d9db to c2b8afc Compare July 31, 2024 23:57
@andrewrk andrewrk added the release notes This PR should be mentioned in the release notes. label Aug 1, 2024
@andrewrk andrewrk enabled auto-merge August 1, 2024 02:22
@andrewrk andrewrk merged commit eb1a199 into master Aug 1, 2024
10 checks passed
@andrewrk andrewrk deleted the simplify-tokenizer branch August 1, 2024 02:52
Techatrix added a commit to zigtools/zls that referenced this pull request Aug 2, 2024
Techatrix added a commit to zigtools/zls that referenced this pull request Aug 2, 2024
sin-ack added a commit to sin-ack/zig-gobject that referenced this pull request Sep 14, 2024
sin-ack added a commit to sin-ack/zig-gobject that referenced this pull request Sep 14, 2024
sin-ack added a commit to sin-ack/zig-gobject that referenced this pull request Sep 15, 2024
sin-ack added a commit to sin-ack/zig-gobject that referenced this pull request Sep 15, 2024
sin-ack added a commit to sin-ack/zig-gobject that referenced this pull request Sep 15, 2024
sin-ack added a commit to sin-ack/zig-gobject that referenced this pull request Sep 15, 2024
ianprime0509 pushed a commit to ianprime0509/zig-gobject that referenced this pull request Sep 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking Implementing this issue could cause existing code to no longer compile or have different behavior. release notes This PR should be mentioned in the release notes. standard library This issue involves writing Zig code for the standard library.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants