Ascii #2171
Digit, // '0'...'9'
Lower, // 'a'...'z'
Upper, // 'A'...'Z'
Punct, // ASCII and !DEL and !AlNum
Nitpick: I recommend not abbreviating names such as Punctuation/Control to Punct/Cntrl; it may make the source unnecessarily harder for non-native English speakers to understand.
These abbreviations come from the C99 standard.
Ah, my mistake 🎩
On second thought, I still stand by my original suggestion because these are not constants which map to any C code values (from what I can tell). I feel like abbreviations like that are usually only good when you are exposing/wrapping around an already defined C value like EPERM, whereas a newly defined value should probably be ErrorNotPermitted. Anyway, not to get too sucked into a style discussion, I'm done :)
26281e5 to 1f97324
Why remove support for ignoring characters? Base64 text often gets hard-wrapped (see, for example, PEM-encoded certificates).
It doesn't support comptime either, due to limitations in the compiler. I will save that patch for later.
I think it would be cleaner to have that functionality separate, but in a way that could be composed with any decoder.
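One way to picture the "separate but composable" idea is a small pre-filter that strips ignored characters before the text reaches whichever decoder follows. The sketch below is written in C purely for illustration and is not taken from this pull request; `strip_ignored` and its parameters are hypothetical names.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of std.base64 or this PR): copy `src` to
 * `dst`, skipping every byte that appears in the NUL-terminated `ignored`
 * set. `dst` must have room for at least `len` bytes.
 * Returns the number of bytes kept. */
size_t strip_ignored(const char *src, size_t len, char *dst, const char *ignored)
{
    size_t out = 0;
    for (size_t i = 0; i < len; i++) {
        /* strchr also matches the terminating NUL, so exclude '\0' explicitly. */
        if (src[i] != '\0' && strchr(ignored, src[i]) != NULL)
            continue;
        dst[out++] = src[i];
    }
    return out;
}
```

A hard-wrapped PEM body could be passed through `strip_ignored(text, n, buf, "\r\n")` and the result handed to any Base64 decoder that has no ignore support of its own.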
3571ab3 to 35dbe50
2c8cb30 to 9bdc69e
std/ascii.zig
Outdated
    return inTable(c, tIndex.Blank);
}

pub fn isZig(c: u8) bool {
What is this? Can you document this better?
It is documented elsewhere in the file, but I pushed some doc-friendly documentation. I was using it in the self-hosted compiler to validate Zig source code encoding (plus other code to validate the UTF-8/Unicode).
const value = swtch[c];
optimizations must come with tests
That was in the commit description. I feel some of the problems come from GitHub's interface. I am splitting things into different commits for a reason.
Also, the way you are lowering ranged switch statements inside Zig means that LLVM can never optimize this sort of thing (even if it doesn't do so well currently).
Tested against glibc.
Benchmarks are here: ziglang#2128 (comment)
isspace accepts a few more whitespace characters than were previously considered (they are not valid in Zig code, so this has no effect).
Unless I am missing something, it appears that the self-hosted compiler was not compliant, as it did not accept upper-case hex digits.
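A check of the "tested against glibc" kind can be run exhaustively, because there are only 256 possible byte values. The C sketch below is an illustration, not the test from this pull request: the table layout and the names `CLS_SPACE`, `CLS_XDIGIT`, `class_table`, `build_table` and `count_mismatches` are assumptions. It fills a flag table and counts disagreements with `isspace` and `isxdigit` from `<ctype.h>` in the default "C" locale.

```c
#include <ctype.h>
#include <stdint.h>

/* Hypothetical class bits; one bit per character class. */
enum { CLS_SPACE = 1 << 0, CLS_XDIGIT = 1 << 1 };

uint8_t class_table[256];

/* Mark the characters that belong to each class. */
void build_table(void)
{
    const char *space = " \t\n\v\f\r";
    const char *hex = "0123456789abcdefABCDEF";
    for (const char *p = space; *p; p++)
        class_table[(unsigned char)*p] |= CLS_SPACE;
    for (const char *p = hex; *p; p++)
        class_table[(unsigned char)*p] |= CLS_XDIGIT;
}

/* Compare the table with the C library for every byte value and
 * return the number of disagreements. */
int count_mismatches(void)
{
    int bad = 0;
    for (int c = 0; c < 256; c++) {
        if (((class_table[c] & CLS_SPACE) != 0) != (isspace(c) != 0))
            bad++;
        if (((class_table[c] & CLS_XDIGIT) != 0) != (isxdigit(c) != 0))
            bad++;
    }
    return bad;
}
```

Because the loop covers every byte, upper-case hex digits and every whitespace character that `isspace` accepts are checked along with everything else.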
/// see doc/langref.html.in online at https://ziglang.org/documentation/master/#Source-Encoding
/// Does not validate UTF-8 or check for prohibited Unicode code-points,
/// which is why it is called isntZig() rather than isZig().
pub fn isntZig(c: u8) bool {
What's the reasoning behind adding this? It is surprising that there would be std.ascii.isntZig. What is the intended usage? Self-hosted tokenizer? What would a different language tokenizer be expected to do, since there wouldn't be, for example, std.ascii.java, std.ascii.perl, etc.?
Self-hosted tokenizer?
Yes.
Putting it here saves 256 bytes. It isn't much, but it is something. If it is too ugly, then so be it. Zig is unlikely to ever tokenize Java or Perl, and no other language I know of has character requirements quite like Zig's. For C, for example, you need to support trigraphs.
Also, by putting this stuff together, a future vectorized streaming version can use the same code for all of it.
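The "saves 256 bytes" remark presumably refers to sharing one lookup table: if each class is a single bit in one 256-byte table, an extra class costs no additional storage, whereas a dedicated byte-per-character table would cost another 256 bytes. A minimal C sketch of that trade-off follows. The names (`CLS_DIGIT`, `CLS_UPPER`, `CLS_EXTRA`, `shared_table`, `in_class`) are hypothetical, and the rule used for the extra class is only a stand-in, not Zig's source-encoding rule.

```c
#include <stdint.h>

/* Hypothetical class bits; the PR's real table layout is not shown in this thread. */
enum { CLS_DIGIT = 1 << 0, CLS_UPPER = 1 << 1, CLS_EXTRA = 1 << 2 };

/* One shared table: each byte value maps to a set of class bits. */
uint8_t shared_table[256];

/* Stand-in rule for the extra class (an assumption for illustration only):
 * printable ASCII plus '\n'. */
void init_shared_table(void)
{
    for (int c = '0'; c <= '9'; c++) shared_table[c] |= CLS_DIGIT;
    for (int c = 'A'; c <= 'Z'; c++) shared_table[c] |= CLS_UPPER;
    for (int c = 0x20; c <= 0x7e; c++) shared_table[c] |= CLS_EXTRA;
    shared_table['\n'] |= CLS_EXTRA;
}

/* Membership test: a single load and mask, whatever the class. */
int in_class(unsigned char c, unsigned cls)
{
    return (shared_table[c] & cls) != 0;
}
```

Adding `CLS_EXTRA` leaves `sizeof shared_table` at 256 bytes; a separate `uint8_t[256]` for that class alone would add another 256.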
I don't see a reason to expose this in the stdlib. If it's for parsing Zig, it should be a private implementation detail in std.zig.parse.
There is too much unrelated stuff in this pull request, and it introduces this "is/isnt zig" API that doesn't seem to belong in this module. I'm starting to get a lot of pull requests to zig, and so I need them to become easier for me to review/edit/merge. This pull request is Too Hard for me to review/edit/merge, and so I'm going to close it. You are welcome to open a new pull request with the following criteria:
I will also reiterate my suggestion from IRC: I think you want to go fast and write some exploratory code. That's great. I think you should maintain a fork of zig with all your experiments. Periodically you could demo some cool stuff that your fork is capable of that upstream is not, and entice me, or others, to upstream some of your code. However it's not going to be possible for you to go as fast as you want to go, directly in upstream Zig. You're going to have to meet the criteria outlined above.
This removes the IgnoreCharacter version of Base64.
It also adds an ascii.isZig function to be used by stage2. (I was also working on a vectorized version.)