Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ascii #2171

Closed
wants to merge 6 commits into from
Closed

Ascii #2171

wants to merge 6 commits into from

Conversation

shawnl
Copy link
Contributor

@shawnl shawnl commented Apr 2, 2019

This removes the IgnoreCharacter version of Base64.

It also adds a ascii.isZIg function to be used by stage2. (I was also working on a vectorized version)

Digit, // '0'...'9'
Lower, // 'a'...'z'
Upper, // 'A'...'Z'
Punct, // ASCII and !DEL and !AlNum
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick: I recommend not making abbreviations like Punctuation/Control -> Punct/Cntrl — it may make it unnecessarily more difficult for non-native english speakers to understand the source.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These abbreviations come from the C99 standard.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, my mistake 🎩

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On second thought, I still stand by my original suggestion because these are not constants which map to any C code values (from what I can tell). I feel like abbreviations like that are usually only good when you are exposing/wrapping around an already defined C value like EPERM, whereas a newly defined value should probably be ErrorNotPermitted. Anyway, not to get too sucked into a style discussion, I'm done :)

@shawnl shawnl force-pushed the ascii branch 4 times, most recently from 26281e5 to 1f97324 Compare April 3, 2019 14:30
@daurnimator
Copy link
Contributor

Why remove support for ignoring characters? e.g. base64 code often gets hard-word-wrapped (see e.g. PEM encoded certificates)

@shawnl
Copy link
Contributor Author

shawnl commented Apr 3, 2019

It doesn't support comptime either due to limitations in the compiler. I will save that patch for later.

@andrewrk andrewrk self-requested a review April 3, 2019 16:47
@jayschwa
Copy link
Contributor

jayschwa commented Apr 6, 2019

Why remove support for ignoring characters?

I think it would be cleaner to have that functionality separate, but in a way that could be composed with any decoder.

@shawnl shawnl force-pushed the ascii branch 4 times, most recently from 3571ab3 to 35dbe50 Compare April 10, 2019 12:38
@shawnl shawnl changed the title Ascii and Base64 Ascii Apr 10, 2019
@shawnl shawnl force-pushed the ascii branch 4 times, most recently from 2c8cb30 to 9bdc69e Compare April 11, 2019 02:26
std/ascii.zig Outdated
return inTable(c, tIndex.Blank);
}

pub fn isZig(c: u8) bool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this? Can you document this better?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is documented elsewhere in the file, but I pushed some doc friendly documentation. I was using it in the self-hosted compiler to validate zig source code encoding. (plus other stuff to validate the utf-8/unicode)


const value = swtch[c];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

optimizations must come with tests

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#2128 (comment)

Which was in the commit description. I feel some of the problems come from github's interface. I am splitting things into different commits for a reason.

Also, the way you are lowering ranged switch statements inside zig means that llvm can never optimize these sort of things. (even if it doesn't well currently)

shawnl added 5 commits April 16, 2019 19:37
isspace considers a few more white space characters that were not considered
(and are not valid in zig code, so will have no effect).
unless I am missing something it appears that the self-hosted compiler
was not compliant as it did not take upper case hex digits
/// see doc/langref.html.in online at https://ziglang.org/documentation/master/#Source-Encoding
/// Does not validate UTF-8 or check for prohibited Unicode code-points,
/// is why it is called isntZig() rather than isZig().
pub fn isntZig(c: u8) bool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the reasoning behind adding this? It is surprising that there would be std.ascii.isntZig. What is the intended usage? Self-hosted tokenizer? What would a different language tokenizer be expected to do, since there wouldn't be, for example, std.ascii.java, std.ascii.perl, etc.?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Self-hosted tokenizer?

Yes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Putting it here saves 256 bytes. It isn't much, but it is something. If it is too ugly, then so be it. Zig is unlikely to ever tokenize java or perl, and no other language I know of has character requirements quite like zig. for C you need to support tri-graphs for example.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, by putting this stuff together, a future vectored streaming version can all use the same code.

Copy link
Contributor

@emekoi emekoi Apr 25, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't see a reason to expose this in the stdlib. if it's for parsing zig it should be a private implementation detail in std.zig.parse.

@andrewrk andrewrk added the work in progress This pull request is not ready for review yet. label Apr 25, 2019
@andrewrk
Copy link
Member

There is too much unrelated stuff in this pull request, and it introduces this "is/isnt zig" API that doesn't seem to belong in this module. I'm starting to get a lot of pull requests to zig, and so I need them to become easier for me to review/edit/merge. This pull request is Too Hard for me to review/edit/merge, and so I'm going to close it.

You are welcome to open a new pull request with the following criteria:

  • The PR description describes everything that the PR does. Nothing is in the code changes that isn't mentioned in the description.
  • If it changes behavior, it comes with tests, or otherwise explains why automated tests are impractical, and explains what testing was performed.
  • If it deals with performance, it comes with benchmarks & timing outputs in the description. A good example of that is here: Add Sha2 functions #687
  • Before you make the pull request, review your own code for mistakes. Try to save me, and others, some time here. Try to predict what I'm going to say, and fix it ahead of time, or address it in the description.

I will also reiterate my suggestion from IRC: I think you want to go fast and write some exploratory code. That's great. I think you should maintain a fork of zig with all your experiments. Periodically you could demo some cool stuff that your fork is capable of that upstream is not, and entice me, or others, to upstream some of your code.

However it's not going to be possible for you to go as fast as you want to go, directly in upstream Zig. You're going to have to meet the criteria outlined above.

@andrewrk andrewrk closed this Apr 29, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
work in progress This pull request is not ready for review yet.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants