Changes in Ascii #14496


Closed · wants to merge 4 commits

Conversation

@aochagavia (Contributor)

Breaking changes:

  • Rename to_byte to as_byte
  • Rename to_char to as_char
  • Deprecate eq_ignore_case (I don't think we need a special function for this when it is so easy to write a.to_lowercase() == b.to_lowercase()).
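For illustration, the comparison this deprecation relies on can be sketched in today's Rust on plain u8 bytes (the std Ascii type discussed here was later removed; this is a sketch, not the PR's actual code):

```rust
fn main() {
    let (a, b) = (b'A', b'a');

    // The pattern the PR suggests instead of a dedicated method:
    // lowercase both sides, then compare.
    assert_eq!(a.to_ascii_lowercase(), b.to_ascii_lowercase());

    // Today's std ended up keeping a dedicated method after all:
    assert!(a.eq_ignore_ascii_case(&b));
}
```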

Other changes (functions copied from char):

  • Added to_digit function
  • Added is_digit_radix function
  • Added tests for these functions
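A rough sketch of what these two functions do, expressed with today's char API (char::to_digit and char::is_digit are the surviving equivalents; the Ascii versions proposed here did not land):

```rust
fn main() {
    // to_digit: Some(value) when the character is a digit in the
    // given radix, None otherwise.
    assert_eq!('f'.to_digit(16), Some(15));
    assert_eq!('f'.to_digit(10), None);

    // is_digit_radix survives in today's std as char::is_digit.
    assert!('7'.is_digit(10));
    assert!(!'g'.is_digit(16));
}
```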

Questions (I didn't know where to ask, so I'm writing them here):

  • Should we derive Copy for Ascii? It sounds reasonable, since it is just a u8.
  • Should we remove to_ascii (which fails if the character is not ASCII) and replace it with to_ascii_opt?
  • Many of the functions available for Ascii are also available for char. Maybe we should unify them in a trait. At the moment we have the Char trait, but it has some functions specifically meant for UTF-8, like is_XID_start and is_XID_continue. Maybe we should split this trait into two: Char and Utf8Char.
  • I think there is some chaos regarding strings of ASCII characters (there are traits implemented on &[Ascii] slices and on Vec<Ascii> which offer conversion to normal strings or bytes). I would like to implement an AsciiString to bring order to this. Should I just do it and open a PR?

[breaking-change]

@lilyball (Contributor)

Note that I already have an outstanding PR #14436 that makes changes to Ascii.

@lilyball (Contributor)

  • Rename to_byte to as_byte
  • Rename to_char to as_char

.as_* methods are generally used when converting types without changing the in-memory representation. For example, String.as_bytes() gives a &[u8] that refers to the underlying memory of the String, whereas .to_* methods are used for conversions that may allocate and/or copy data.

It's not so obvious what's correct when talking about these small primitive values, as all conversion is cheap, but .to_* is still technically the correct naming for .to_char(), as it does copy data to a new memory representation (even if it doesn't allocate data).

For .to_byte() it's less clear, because Ascii actually is a u8 under the hood, but that's really an implementation detail. There have been proposals to allow integral types of all widths instead of just 8/16/32/64, and if we ever do that, Ascii would be a u7 instead. But ignoring that for a second: arguments can be made for both .to_byte() and .as_byte(), there's no clear winner, and we already have .to_byte() (which is consistent with .to_char()), so my recommendation is to avoid the breaking change and just leave it as .to_byte().
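The as_*/to_* distinction described above can be seen in today's String API (a minimal illustration, not code from this PR):

```rust
fn main() {
    let s = String::from("ascii");

    // as_*: a cheap view of the same memory, no allocation.
    let bytes: &[u8] = s.as_bytes();
    assert_eq!(bytes[0], b'a');

    // to_*: a conversion that copies into a new, owned value.
    let upper: String = s.to_uppercase();
    assert_eq!(upper, "ASCII");
    assert_eq!(s, "ascii"); // the original is untouched
}
```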

@lilyball (Contributor)

  • Added to_digit function
  • Added is_digit_radix function

Why? I can already say a.to_char().to_digit(10).

@aochagavia (Contributor, Author)

@kballard I thought it would be interesting to have to_digit and is_digit_radix as functions of Ascii. However, I see now that there is no real benefit (no performance gain).

What you say about as_byte and as_char is true. The naming is debatable and we don't need to change it.

What do you think about the questions at the end?

@lilyball (Contributor)

  • Should we derive Copy for Ascii? It sounds reasonable since it is just a u8.

We don't have opt-in built-in bounds yet; that's still pending, see #13231. Today Ascii is already Copy.

Many of the functions available for Ascii are also available for char. Maybe we should unify them in a trait.

There apparently already is one: core::char::Char. Ascii should just implement that.

@aochagavia (Contributor, Author)

@kballard Thanks for your feedback!

The problem with the current Char trait is that it has some functions which don't make sense for Ascii. That is why I suggested splitting it into two traits.

@lilyball (Contributor)

I would like to implement an AsciiString to bring order to this.

What do you mean? std::ascii::AsciiStr already exists and is implemented on &[Ascii]. If you have Vec<Ascii> and want to use these methods, you can say v.as_slice().to_lower(). Post-DST, Vec<T> will almost certainly implement Deref<[T]> which means that Vec<Ascii> will then be able to call all of the &[Ascii] methods without .as_slice().
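The Deref prediction did come true: since Rust 1.0, Vec<T> derefs to [T]. A minimal modern illustration, with Vec<u8> standing in for the Vec<Ascii> under discussion:

```rust
fn main() {
    let v: Vec<u8> = b"Hello".to_vec();

    // Vec<u8> derefs to &[u8], so slice methods work directly,
    // with no as_slice() call -- the post-DST behavior the
    // comment anticipates.
    assert!(v.is_ascii());
    assert_eq!(v.to_ascii_lowercase(), b"hello".to_vec());
}
```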

@lilyball (Contributor)

@aochagavia The only reason to use a trait instead of just implementing the same methods is if you want to be generic over the trait. Char exists right now just because you can't impl methods on char directly. Personally, I don't see that there's a need for a trait to unify this, because I don't think it's very likely that you'll need a function that's generic over both char and Ascii using just these functions, but if you do want to use a trait, then we may as well use the one we already have.

Looking at it, the only methods that I'm guessing you don't think make sense are len_utf8_bytes(), encode_utf8(), and encode_utf16() (note that the two escape methods actually do make sense for Ascii). But if you're going to be generic over a trait, being able to encode the value as UTF-8/UTF-16 actually seems like a reasonable thing to want, so I don't see the problem with providing this functionality for Ascii.
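To make the trade-off concrete, here is a sketch of the kind of function that only a shared trait makes possible: generic over both character types. The CharClassify trait and the Ascii newtype below are hypothetical stand-ins, not the std types under discussion:

```rust
// A hypothetical classification trait in the spirit of the thread;
// `Ascii` here is a local newtype, not std::ascii::Ascii.
trait CharClassify {
    fn is_digit_char(&self) -> bool;
}

struct Ascii(u8);

impl CharClassify for char {
    fn is_digit_char(&self) -> bool {
        self.is_ascii_digit()
    }
}

impl CharClassify for Ascii {
    fn is_digit_char(&self) -> bool {
        self.0.is_ascii_digit()
    }
}

// Generic over the trait, so it accepts &[char] and &[Ascii] alike.
fn count_digits<C: CharClassify>(items: &[C]) -> usize {
    items.iter().filter(|c| c.is_digit_char()).count()
}

fn main() {
    assert_eq!(count_digits(&['a', '1', '2']), 2);
    assert_eq!(count_digits(&[Ascii(b'7'), Ascii(b'x')]), 1);
}
```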

@aochagavia (Contributor, Author)

Actually my concern is naming. If you implement the same trait, then you can be sure that naming will remain consistent.

@aochagavia (Contributor, Author)

Also, when I wrote about AsciiString I was actually thinking of a struct (similar to the way in which String is implemented). I think that would be more clear than implementing traits on &[Ascii].

Thanks again for your feedback!

@lilyball (Contributor)

Actually my concern is naming. If you implement the same trait, then you can be sure that naming will remain consistent.

Ok, I suppose that's a good reason. Given that, I guess we can split char::Char into two traits, one for e.g. CharType containing all the is_* methods (minus is_digit_radix, and we should add is_hex which Ascii currently has), and then leave the rest in the current Char.
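A rough sketch of what the proposed split might look like, using hypothetical names from the thread (nothing here is the actual std API): classification methods go in one trait, which an ASCII type could also implement, while UTF-8 concerns stay char-only.

```rust
// Hypothetical split: CharType holds the is_* classification
// methods; Utf8Char keeps the encoding-related ones.
trait CharType {
    fn is_alpha(&self) -> bool;
    fn is_hex(&self) -> bool;
}

trait Utf8Char: CharType {
    fn len_utf8_bytes(&self) -> usize;
}

impl CharType for char {
    fn is_alpha(&self) -> bool { self.is_alphabetic() }
    fn is_hex(&self) -> bool { self.is_ascii_hexdigit() }
}

impl Utf8Char for char {
    fn len_utf8_bytes(&self) -> usize { self.len_utf8() }
}

fn main() {
    assert!('é'.is_alpha());
    assert!('f'.is_hex());
    assert_eq!('é'.len_utf8_bytes(), 2); // U+00E9 takes 2 UTF-8 bytes
}
```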

@lilyball (Contributor)

@aochagavia What would this proposed AsciiString struct do? Reimplement all of String methods on top of Vec<Ascii> instead of Vec<u8>? That seems complicated, hard to keep in sync with String (and no, putting all String methods in a trait to enforce naming is not a good idea), and just largely doesn't seem very useful compared to using Vec<Ascii> directly.
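For context, a minimal sketch of what such an AsciiString newtype would involve (hypothetical; nothing like this landed in std). Every String convenience it wants has to be re-exposed by hand, which is the duplication objected to here:

```rust
// Hypothetical AsciiString newtype over Vec<u8>; the invariant is
// that every stored byte is ASCII.
struct AsciiString(Vec<u8>);

impl AsciiString {
    fn from_bytes(bytes: Vec<u8>) -> Option<AsciiString> {
        if bytes.is_ascii() { Some(AsciiString(bytes)) } else { None }
    }

    // Each method below merely forwards to Vec -- and a real type
    // would need dozens more of these to match String.
    fn len(&self) -> usize { self.0.len() }

    fn push(&mut self, b: u8) {
        assert!(b.is_ascii(), "non-ASCII byte");
        self.0.push(b);
    }
}

fn main() {
    let mut s = AsciiString::from_bytes(b"abc".to_vec()).unwrap();
    s.push(b'!');
    assert_eq!(s.len(), 4);
    assert!(AsciiString::from_bytes(vec![0xFF]).is_none());
}
```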

@lilyball (Contributor)

  • Should we remove to_ascii (which fails if the character is not ascii) and replace it by to_ascii_opt?

We probably should. There's precedent in the various other primitive conversion methods (e.g. ToPrimitive, char::from_u32(), etc.) for only exposing an Option variant.

@lilyball (Contributor)

Because that's not really a character type classification. But I don't really have a compelling reason for omitting it, besides the fact that it doesn't correspond with any of the normal character classes (such as alpha, digit, punct, etc).

@aochagavia (Contributor, Author)

Sorry, I deleted my comment before you answered.

I think I will open a new pull request... Thanks!

@lilyball (Contributor)

@aochagavia Have you looked at my #14436? I'm concerned that submitting a separate PR will create conflicts. Perhaps we should build your proposed changes on top of my PR instead of competing.

@aochagavia (Contributor, Author)

I think it is a good idea, but I have never done it before. Can I just push commits to your pull request?

@aochagavia closed this May 28, 2014
@lilyball (Contributor)

@aochagavia You can't, but you can fetch my branch from my fork, add commits to it, push that to your own fork, and send me a link and I can update my PR.

@aochagavia (Contributor, Author)

I think I will wait until your PR is merged. Alternatively, you can implement the changes I mentioned, if you want.

Thanks again for your help with this!

@lilyball (Contributor)

@aochagavia I'm on board with implementing the changes myself. Just to clarify, this is what you want to do:

  • Split char::Char into two traits, one of them being char::CharType (or maybe CharClassify?)
  • Implement CharType (or CharClassify) on Ascii
  • Deprecate Ascii.eq_ignore_case

Anything else that I'm missing? I think we should probably also go ahead and remove .to_ascii() in favor of .to_ascii_opt(), as you had suggested.

Also, opinion on is_digit_radix()? Does it make sense to put it in CharType, because it's potentially useful, or leave it out because it doesn't correspond to one of the common character classes?

@aochagavia deleted the pr branch May 29, 2014 08:34
@aochagavia (Contributor, Author)

@kballard No problem, go ahead!

Actually, I would like the Char trait to be implemented by both char and Ascii, and a new Utf8Char trait implemented only by char. However, I guess your suggestion of a CharType trait would be good as well. My only concern here is naming, because CharType seems redundant and CharClassify doesn't convince me (those functions are not only used to classify chars).

Additionally, I think it would be good to have is_digit_radix(). If you have it, you can define is_digit() and is_hex() in terms of it. This way you would only need to implement one function, and you would get the other two for free.

Another concern is that is_digit_radix() will fail if the radix is greater than 36. The problem is that returning None could be confusing, because it would look like the character is not a digit, while the real problem is that the radix is invalid. IMO returning a Result would complicate things too much, so I think we should keep failing (I can't see a better alternative). If you can find a better solution for this, please add it to your PR.
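Both points can be illustrated with today's char API, where is_digit_radix survives as char::is_digit and still panics for radices above 36 (a sketch; is_digit and is_hex below are local helpers, not the proposed trait methods):

```rust
use std::panic;

fn main() {
    // Defining the specialised checks in terms of the radix one,
    // as suggested above:
    fn is_digit(c: char) -> bool { c.is_digit(10) }
    fn is_hex(c: char) -> bool { c.is_digit(16) }
    assert!(is_digit('9'));
    assert!(is_hex('f'));
    assert!(!is_digit('f'));

    // A radix above 36 still fails (panics) in today's std,
    // treated as a programmer error rather than an input error.
    panic::set_hook(Box::new(|_| {})); // silence the panic message
    let result = panic::catch_unwind(|| '1'.is_digit(37));
    assert!(result.is_err());
}
```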

@lilyball (Contributor)

Failing when the radix is greater than 36 is deliberate, because that's considered a programmer error, not an input error.

I'm still not convinced that is_digit_radix() really makes sense here, but I suppose I'll add it for convenience.
