std.zig.tokenizer: simplification and spec conformance #20885
Merged
andrewrk added labels on Jul 31, 2024:
breaking: Implementing this issue could cause existing code to no longer compile or have different behavior.
standard library: This issue involves writing Zig code for the standard library.
riscv backend failures unblocked by #20888
andrewrk changed the title from "std.zig.tokenizer: simplify" to "std.zig.tokenizer: simplification and spec conformance" on Jul 31, 2024
andrewrk force-pushed the simplify-tokenizer branch from e31b4a8 to b61d9db on July 31, 2024 21:44
I pointed a fuzzer at the tokenizer and it crashed immediately. Upon inspection, I was dissatisfied with the implementation. This commit removes several mechanisms:

* Removes the "invalid byte" compile error note.
* Dramatically simplifies tokenizer recovery by making recovery always occur at newlines, and never otherwise.
* Removes UTF-8 validation.
* Moves some character validation logic to `std.zig.parseCharLiteral`.

Removing UTF-8 validation is a regression of #663; however, the existing implementation was already buggy. When adding this functionality back, it must be fuzz-tested while checking the property that it matches an independent Unicode validation implementation on the same file. While we're at it, fuzzing should check the other properties of that proposal, such as no ASCII control characters existing inside the source code.

Other changes included in this commit:

* Deprecate `std.unicode.utf8Decode` and its WTF-8 counterpart. This function has an awkward API that is too easy to misuse.
* Make `utf8Decode2` and friends use arrays as parameters, eliminating a runtime assertion in favor of using the type system.

After this commit, the crash found by fuzzing, which was `"\x07\xd5\x80\xc3=o\xda|a\xfc{\x9a\xec\x91\xdf\x0f\\\x1a^\xbe;\x8c\xbf\xee\xea"`, no longer causes a crash. However, I did not feel the need to add this test case because the simplified logic eradicates most crashes of this nature.
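The point about array parameters can be illustrated with a minimal sketch. The two functions below are hypothetical stand-ins, not the actual `std.unicode` implementations, and they deliberately skip validation of the lead and continuation bytes; they only contrast a slice parameter, whose length must be asserted at runtime, with a `[2]u8` parameter, whose length is part of the type.

```zig
const std = @import("std");

// Hypothetical slice-based variant: the caller can pass any length,
// so the two-byte requirement has to be asserted at runtime.
fn decode2Slice(bytes: []const u8) u21 {
    std.debug.assert(bytes.len == 2);
    return (@as(u21, bytes[0] & 0b0001_1111) << 6) | (bytes[1] & 0b0011_1111);
}

// Hypothetical array-based variant: the length is encoded in the
// parameter type, so a wrong-length argument is a compile error
// and no runtime assertion is needed.
fn decode2Array(bytes: [2]u8) u21 {
    return (@as(u21, bytes[0] & 0b0001_1111) << 6) | (bytes[1] & 0b0011_1111);
}
```

With the array form, a call such as `decode2Array(.{ 0xC3, 0xA9 })` type-checks, while passing three bytes is rejected by the compiler rather than tripping an assertion in a debug build.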
these are illegal according to the spec
andrewrk force-pushed the simplify-tokenizer branch from b61d9db to c2b8afc on July 31, 2024 23:57
andrewrk added the release notes label on Aug 1, 2024
release notes: This PR should be mentioned in the release notes.
Techatrix added a commit to zigtools/zls that referenced this pull request on Aug 2, 2024
Techatrix added a commit to zigtools/zls that referenced this pull request on Aug 2, 2024
sin-ack added a commit to sin-ack/zig-gobject that referenced this pull request on Sep 14, 2024
sin-ack added a commit to sin-ack/zig-gobject that referenced this pull request on Sep 14, 2024
sin-ack added a commit to sin-ack/zig-gobject that referenced this pull request on Sep 15, 2024
sin-ack added a commit to sin-ack/zig-gobject that referenced this pull request on Sep 15, 2024
sin-ack added a commit to sin-ack/zig-gobject that referenced this pull request on Sep 15, 2024
sin-ack added a commit to sin-ack/zig-gobject that referenced this pull request on Sep 15, 2024
ianprime0509 pushed a commit to ianprime0509/zig-gobject that referenced this pull request on Sep 18, 2024
I pointed a fuzzer at the tokenizer and it crashed immediately. Upon inspection, I was dissatisfied with the implementation. This commit removes several mechanisms:

* Removes the "invalid byte" compile error note.
* Dramatically simplifies tokenizer recovery by making recovery always occur at newlines, and never otherwise.
* Removes UTF-8 validation.
* Moves some character validation logic to `std.zig.parseCharLiteral`.

Removing UTF-8 validation is a regression of #663; however, the existing implementation was already buggy. When adding this functionality back, it must be fuzz-tested while checking the property that it matches an independent Unicode validation implementation on the same file. While we're at it, fuzzing should check the other properties of that proposal, such as no ASCII control characters existing inside the source code, `\r` always followed by `\n`, etc.

Other changes included in this commit:

* Deprecate `std.unicode.utf8Decode` and its WTF-8 counterpart. This function has an awkward API that is too easy to misuse.
* Make `utf8Decode2` and friends use arrays as parameters, eliminating a runtime assertion in favor of using the type system.

After this commit, the crash found by fuzzing, which was `"\x07\xd5\x80\xc3=o\xda|a\xfc{\x9a\xec\x91\xdf\x0f\\\x1a^\xbe;\x8c\xbf\xee\xea"`, no longer causes a crash. However, I did not feel the need to add this test case because the simplified logic eradicates crashes of this nature.

Spec Conformance

I also noticed that the tokenizer did not conform to ziglang/zig-spec#38, so I fixed it.
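The property described above, that a validator must agree with an independent implementation on the same input, could be checked along these lines. This is only a sketch under stated assumptions: `isValidUtf8` is a hypothetical hand-written validator standing in for whatever the tokenizer would eventually use, `std.unicode.utf8ValidateSlice` plays the role of the reference, and a fixed list of inputs stands in for fuzzer-generated data.

```zig
const std = @import("std");

/// Hypothetical independent validator: rejects stray or missing
/// continuation bytes, overlong encodings, surrogate halves, and
/// code points above U+10FFFF.
fn isValidUtf8(bytes: []const u8) bool {
    var i: usize = 0;
    while (i < bytes.len) {
        const b0 = bytes[i];
        // Sequence length implied by the lead byte.
        const len: usize = if (b0 < 0x80)
            1
        else if ((b0 & 0xE0) == 0xC0)
            2
        else if ((b0 & 0xF0) == 0xE0)
            3
        else if ((b0 & 0xF8) == 0xF0)
            4
        else
            return false;
        if (len == 1) {
            i += 1;
            continue;
        }
        if (bytes.len - i < len) return false; // truncated sequence
        // Smallest code point that legitimately needs `len` bytes.
        const min: u21 = switch (len) {
            2 => 0x80,
            3 => 0x800,
            else => 0x10000,
        };
        var cp: u21 = switch (len) {
            2 => b0 & 0x1F,
            3 => b0 & 0x0F,
            else => b0 & 0x07,
        };
        for (bytes[i + 1 .. i + len]) |b| {
            if ((b & 0xC0) != 0x80) return false; // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min) return false; // overlong encoding
        if (cp >= 0xD800 and cp <= 0xDFFF) return false; // surrogate half
        if (cp > 0x10FFFF) return false; // beyond the Unicode range
        i += len;
    }
    return true;
}

/// The differential property: both validators give the same verdict.
fn agreesWithReference(input: []const u8) bool {
    return isValidUtf8(input) == std.unicode.utf8ValidateSlice(input);
}
```

In a real fuzzing setup the input would come from the fuzzer rather than a fixed list, and any input for which `agreesWithReference` returns `false` would be reported as a failure.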