std.zig.tokenizer: simplification and spec conformance #20885

andrewrk · 2024-07-31T19:06:25Z

I pointed a fuzzer at the tokenizer and it crashed immediately. Upon inspection, I was dissatisfied with the implementation. This commit removes several mechanisms:

Removes the "invalid byte" compile error note.
Dramatically simplifies tokenizer recovery by making recovery always occur at newlines, and never otherwise.
Removes UTF-8 validation.
Moves some character validation logic to std.zig.parseCharLiteral.

Removing UTF-8 validation is a regression of #663, however, the existing implementation was already buggy. When adding this functionality back, it must be fuzz-tested while checking the property that it matches an independent Unicode validation implementation on the same file. While we're at it, fuzzing should check the other properties of that proposal, such as no ASCII control characters existing inside the source code, \r always followed by \n, etc.

Other changes included in this commit:

Deprecate std.unicode.utf8Decode and its WTF-8 counterpart. This function has an awkward API that is too easy to misuse.
Make utf8Decode2 and friends use arrays as parameters, eliminating a runtime assertion in favor of using the type system.

After this commit, the crash found by fuzzing, which was "\x07\xd5\x80\xc3=o\xda|a\xfc{\x9a\xec\x91\xdf\x0f\\\x1a^\xbe;\x8c\xbf\xee\xea" no longer causes a crash. However, I did not feel the need to add this test case because the simplified logic eradicates crashes of this nature.

Benchmark 1 (100 runs): before/zig ast-check /home/andy/dev/zig/src/Sema.zig
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          50.0ms ± 2.04ms    48.7ms … 57.4ms         14 (14%)        0%
  peak_rss           60.0MB ±  147KB    59.4MB … 60.2MB          3 ( 3%)        0%
  cpu_cycles          232M  ±  745K      230M  …  234M           3 ( 3%)        0%
  instructions        522M  ± 24.3       522M  …  522M           1 ( 1%)        0%
  cache_references   6.55M  ±  120K     6.39M  … 7.45M           2 ( 2%)        0%
  cache_misses        205K  ± 3.47K      198K  …  215K           1 ( 1%)        0%
  branch_misses      2.86M  ± 10.3K     2.80M  … 2.87M           9 ( 9%)        0%
Benchmark 2 (104 runs): after/zig ast-check /home/andy/dev/zig/src/Sema.zig
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          48.3ms ±  250us    48.1ms … 50.0ms         10 (10%)        ⚡-  3.3% ±  0.8%
  peak_rss           62.4MB ±  142KB    62.1MB … 62.6MB          0 ( 0%)        💩+  4.1% ±  0.1%
  cpu_cycles          227M  ±  637K      226M  …  230M           7 ( 7%)        ⚡-  1.9% ±  0.1%
  instructions        501M  ± 44.8       501M  …  501M           7 ( 7%)        ⚡-  4.0% ±  0.0%
  cache_references   6.65M  ±  141K     6.45M  … 7.67M           4 ( 4%)          +  1.5% ±  0.5%
  cache_misses        208K  ± 3.79K      201K  …  226K           3 ( 3%)          +  1.3% ±  0.5%
  branch_misses      2.84M  ± 8.62K     2.81M  … 2.86M           1 ( 1%)          -  0.4% ±  0.1%

Spec Conformance

I also noticed that the tokenizer did not conform to ziglang/zig-spec#38 so I fixed it.

Rexicon226 · 2024-07-31T21:02:15Z

riscv backend failures unblocked by #20888

I pointed a fuzzer at the tokenizer and it crashed immediately. Upon inspection, I was dissatisfied with the implementation. This commit removes several mechanisms: * Removes the "invalid byte" compile error note. * Dramatically simplifies tokenizer recovery by making recovery always occur at newlines, and never otherwise. * Removes UTF-8 validation. * Moves some character validation logic to `std.zig.parseCharLiteral`. Removing UTF-8 validation is a regression of #663, however, the existing implementation was already buggy. When adding this functionality back, it must be fuzz-tested while checking the property that it matches an independent Unicode validation implementation on the same file. While we're at it, fuzzing should check the other properties of that proposal, such as no ASCII control characters existing inside the source code. Other changes included in this commit: * Deprecate `std.unicode.utf8Decode` and its WTF-8 counterpart. This function has an awkward API that is too easy to misuse. * Make `utf8Decode2` and friends use arrays as parameters, eliminating a runtime assertion in favor of using the type system. After this commit, the crash found by fuzzing, which was "\x07\xd5\x80\xc3=o\xda|a\xfc{\x9a\xec\x91\xdf\x0f\\\x1a^\xbe;\x8c\xbf\xee\xea" no longer causes a crash. However, I did not feel the need to add this test case because the simplified logic eradicates most crashes of this nature.

these are illegal according to the spec

See ziglang/zig#20885

Ref: ziglang/zig#20885

andrewrk added breaking Implementing this issue could cause existing code to no longer compile or have different behavior. standard library This issue involves writing Zig code for the standard library. labels Jul 31, 2024

andrewrk changed the title ~~std.zig.tokenizer: simplify~~ std.zig.tokenizer: simplification and spec conformance Jul 31, 2024

andrewrk force-pushed the simplify-tokenizer branch from e31b4a8 to b61d9db Compare July 31, 2024 21:44

andrewrk added 3 commits July 31, 2024 16:57

remove hard tabs from source code

a702949

these are illegal according to the spec

tokenizer: tabs and carriage returns spec conformance

c2b8afc

andrewrk force-pushed the simplify-tokenizer branch from b61d9db to c2b8afc Compare July 31, 2024 23:57

andrewrk added the release notes This PR should be mentioned in the release notes. label Aug 1, 2024

andrewrk enabled auto-merge August 1, 2024 02:22

andrewrk merged commit eb1a199 into master Aug 1, 2024
10 checks passed

andrewrk deleted the simplify-tokenizer branch August 1, 2024 02:52

Techatrix mentioned this pull request Aug 1, 2024

unhelpful error message on invalid bytes inside multi line string literal #20900

Open

Raiden1411 mentioned this pull request Aug 1, 2024

build: update to latest master karlseguin/websocket.zig#40

Merged

Techatrix added a commit to zigtools/zls that referenced this pull request Aug 2, 2024

update tests for sourceIndexToTokenIndex and tokenToIndex

7c38ff9

See ziglang/zig#20885

Techatrix added a commit to zigtools/zls that referenced this pull request Aug 2, 2024

update tests for sourceIndexToTokenIndex and tokenToIndex

88bc2b1

See ziglang/zig#20885

karlseguin mentioned this pull request Aug 3, 2024

bizarre zig compilation error on commented out Go code in zig src file karlseguin/zul#5

Closed

ianprime0509 mentioned this pull request Aug 12, 2024

Tried compiling zig master from source using the lastest compiled zig version. Got a bunch of sub-compilation errors #21055

Closed

castholm mentioned this pull request Aug 18, 2024

Invalid token in commented code #21126

Closed

sin-ack added a commit to sin-ack/zig-gobject that referenced this pull request Sep 14, 2024

zig_writer: Replace tabs with spaces when outputting

1e5985d

Ref: ziglang/zig#20885

sin-ack added a commit to sin-ack/zig-gobject that referenced this pull request Sep 14, 2024

zig_writer: Replace tabs with spaces when outputting

e1f45f0

Ref: ziglang/zig#20885

sin-ack added a commit to sin-ack/zig-gobject that referenced this pull request Sep 15, 2024

zig_writer: Replace tabs with spaces when outputting

7e882d7

Ref: ziglang/zig#20885

sin-ack added a commit to sin-ack/zig-gobject that referenced this pull request Sep 15, 2024

zig_writer: Replace tabs with spaces when outputting

c0b42ff

Ref: ziglang/zig#20885

sin-ack added a commit to sin-ack/zig-gobject that referenced this pull request Sep 15, 2024

zig_writer: Replace tabs with spaces when outputting

c382658

Ref: ziglang/zig#20885

sin-ack added a commit to sin-ack/zig-gobject that referenced this pull request Sep 15, 2024

zig_writer: Replace tabs with spaces when outputting

2c3a7b8

Ref: ziglang/zig#20885

ianprime0509 pushed a commit to ianprime0509/zig-gobject that referenced this pull request Sep 18, 2024

zig_writer: Replace tabs with spaces when outputting

808db93

Ref: ziglang/zig#20885

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

std.zig.tokenizer: simplification and spec conformance #20885

std.zig.tokenizer: simplification and spec conformance #20885

andrewrk commented Jul 31, 2024 •

edited

Loading

Rexicon226 commented Jul 31, 2024

std.zig.tokenizer: simplification and spec conformance #20885

std.zig.tokenizer: simplification and spec conformance #20885

Conversation

andrewrk commented Jul 31, 2024 • edited Loading

Spec Conformance

Rexicon226 commented Jul 31, 2024

andrewrk commented Jul 31, 2024 •

edited

Loading