Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tracking Issue for ASCII trim functions on byte slices #94035

Closed
2 of 5 tasks
dbrgn opened this issue Feb 15, 2022 · 18 comments · Fixed by #124928
Closed
2 of 5 tasks

Tracking Issue for ASCII trim functions on byte slices #94035

dbrgn opened this issue Feb 15, 2022 · 18 comments · Fixed by #124928
Labels
A-Unicode Area: Unicode C-tracking-issue Category: An issue tracking the progress of sth. like the implementation of an RFC disposition-merge This issue / PR is in PFCP or FCP with a disposition to merge it. finished-final-comment-period The final comment period is finished for this PR / Issue. relnotes Marks issues that should be documented in the release notes of the next release. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.

Comments

@dbrgn
Copy link
Contributor

dbrgn commented Feb 15, 2022

Feature gate: #![feature(byte_slice_trim_ascii)]

This is a tracking issue for ASCII trim functions on byte slices.

Public API

The feature adds three new methods to byte slices ([u8]):

  • const fn trim_ascii_start(&self) -> &[u8]: Remove leading ASCII whitespace
  • const fn trim_ascii_end(&self) -> &[u8]: Remove trailing ASCII whitespace
  • const fn trim_ascii(&self) -> &[u8]: Remove leading and trailing ASCII whitespace

For deciding what bytes to treat as whitespace, u8::is_ascii_whitespace is used. See the linked docs for more details.

Examples:

assert_eq!(b" \t hello world\n".trim_ascii_start(), b"hello world\n");
assert_eq!(b"\r hello world\n ".trim_ascii_end(), b"\r hello world");
assert_eq!(b"\r hello world\n ".trim_ascii(), b"hello world");

Steps / History

Unresolved Questions

@dbrgn dbrgn added C-tracking-issue Category: An issue tracking the progress of sth. like the implementation of an RFC T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. labels Feb 15, 2022
@nagisa
Copy link
Member

nagisa commented Feb 16, 2022

If we're adding ascii variants to [u8], we should also add equivalent operations to str at the same time. I feel this is pretty important because otherwise the APIs made available by libstd can come across like a grab bag of random stuff.

It seems reasonable enough to want to trim ASCII whitespace from a &str too (there's already split_ascii_whitespace), and so today one would resort to trim_matches or with this implementation they could convert between [u8] and str. The need to do latter can come across as a particularly frustrating API surface omission.


naming bikeshed: might trim_ascii_start and trim_ascii_end be more consistent with other functions already in libstd? There's some prior art in naming of str::split_ascii_whitespace vs str::split_whitespace, {char,u8}::is_ascii_* vs {char,u8}::is_* etc.

@dbrgn
Copy link
Contributor Author

dbrgn commented Feb 16, 2022

If we're adding ascii variants to [u8], we should also add equivalent operations to str at the same time. I feel this is pretty important because otherwise the APIs made available by libstd can come across like a grab bag of random stuff.

Since str is validated unicode, trim_start and trim_end could be used, which uses the unicode definition of whitespace. On the other hand, we have char::is_ascii_whitespace, so it should be fairly easy to add these functions to string slices as well. However, I don't really see a use case for this. Why would someone use trim_start_ascii over trim_start?

naming bikeshed: might trim_ascii_start and trim_ascii_end be more consistent with other functions already in libstd? There's some prior art in naming of str::split_ascii_whitespace vs str::split_whitespace, {char,u8}::is_ascii_* vs {char,u8}::is_* etc.

Yes, you have a point there! There's also str::to_lowercase and str::to_ascii_lowercase. I will update the proposal and implementation PR.

matthiaskrgr added a commit to matthiaskrgr/rust that referenced this issue Feb 18, 2022
…iplett

core: Implement ASCII trim functions on byte slices

Hi `@rust-lang/libs!` This is a feature that I wished for when implementing serial protocols with microcontrollers. Often these protocols may contain leading or trailing whitespace, which needs to be removed. Because oftentimes drivers will operate on the byte level, decoding to unicode and checking for unicode whitespace is unnecessary overhead.

This PR adds three new methods to byte slices:

- `trim_ascii_start`
- `trim_ascii_end`
- `trim_ascii`

I did not find any pre-existing discussions about this, which surprises me a bit. Maybe I'm missing something, and this functionality is already possible through other means? There's rust-lang/rfcs#2547 ("Trim methods on slices"), but that has a different purpose.

As per the [std dev guide](https://std-dev-guide.rust-lang.org/feature-lifecycle/new-unstable-features.html), this is a proposed implementation without any issue / RFC. If this is the wrong process, please let me know. However, I thought discussing code is easier than discussing a mere idea, and hacking on the stdlib was fun.

Tracking issue: rust-lang#94035
matthiaskrgr added a commit to matthiaskrgr/rust that referenced this issue Feb 19, 2022
…iplett

core: Implement ASCII trim functions on byte slices

Hi ``@rust-lang/libs!`` This is a feature that I wished for when implementing serial protocols with microcontrollers. Often these protocols may contain leading or trailing whitespace, which needs to be removed. Because oftentimes drivers will operate on the byte level, decoding to unicode and checking for unicode whitespace is unnecessary overhead.

This PR adds three new methods to byte slices:

- `trim_ascii_start`
- `trim_ascii_end`
- `trim_ascii`

I did not find any pre-existing discussions about this, which surprises me a bit. Maybe I'm missing something, and this functionality is already possible through other means? There's rust-lang/rfcs#2547 ("Trim methods on slices"), but that has a different purpose.

As per the [std dev guide](https://std-dev-guide.rust-lang.org/feature-lifecycle/new-unstable-features.html), this is a proposed implementation without any issue / RFC. If this is the wrong process, please let me know. However, I thought discussing code is easier than discussing a mere idea, and hacking on the stdlib was fun.

Tracking issue: rust-lang#94035
matthiaskrgr added a commit to matthiaskrgr/rust that referenced this issue Feb 19, 2022
…iplett

core: Implement ASCII trim functions on byte slices

Hi ```@rust-lang/libs!``` This is a feature that I wished for when implementing serial protocols with microcontrollers. Often these protocols may contain leading or trailing whitespace, which needs to be removed. Because oftentimes drivers will operate on the byte level, decoding to unicode and checking for unicode whitespace is unnecessary overhead.

This PR adds three new methods to byte slices:

- `trim_ascii_start`
- `trim_ascii_end`
- `trim_ascii`

I did not find any pre-existing discussions about this, which surprises me a bit. Maybe I'm missing something, and this functionality is already possible through other means? There's rust-lang/rfcs#2547 ("Trim methods on slices"), but that has a different purpose.

As per the [std dev guide](https://std-dev-guide.rust-lang.org/feature-lifecycle/new-unstable-features.html), this is a proposed implementation without any issue / RFC. If this is the wrong process, please let me know. However, I thought discussing code is easier than discussing a mere idea, and hacking on the stdlib was fun.

Tracking issue: rust-lang#94035
matthiaskrgr added a commit to matthiaskrgr/rust that referenced this issue Feb 19, 2022
…iplett

core: Implement ASCII trim functions on byte slices

Hi ````@rust-lang/libs!```` This is a feature that I wished for when implementing serial protocols with microcontrollers. Often these protocols may contain leading or trailing whitespace, which needs to be removed. Because oftentimes drivers will operate on the byte level, decoding to unicode and checking for unicode whitespace is unnecessary overhead.

This PR adds three new methods to byte slices:

- `trim_ascii_start`
- `trim_ascii_end`
- `trim_ascii`

I did not find any pre-existing discussions about this, which surprises me a bit. Maybe I'm missing something, and this functionality is already possible through other means? There's rust-lang/rfcs#2547 ("Trim methods on slices"), but that has a different purpose.

As per the [std dev guide](https://std-dev-guide.rust-lang.org/feature-lifecycle/new-unstable-features.html), this is a proposed implementation without any issue / RFC. If this is the wrong process, please let me know. However, I thought discussing code is easier than discussing a mere idea, and hacking on the stdlib was fun.

Tracking issue: rust-lang#94035
matthiaskrgr added a commit to matthiaskrgr/rust that referenced this issue Feb 19, 2022
…iplett

core: Implement ASCII trim functions on byte slices

Hi `````@rust-lang/libs!````` This is a feature that I wished for when implementing serial protocols with microcontrollers. Often these protocols may contain leading or trailing whitespace, which needs to be removed. Because oftentimes drivers will operate on the byte level, decoding to unicode and checking for unicode whitespace is unnecessary overhead.

This PR adds three new methods to byte slices:

- `trim_ascii_start`
- `trim_ascii_end`
- `trim_ascii`

I did not find any pre-existing discussions about this, which surprises me a bit. Maybe I'm missing something, and this functionality is already possible through other means? There's rust-lang/rfcs#2547 ("Trim methods on slices"), but that has a different purpose.

As per the [std dev guide](https://std-dev-guide.rust-lang.org/feature-lifecycle/new-unstable-features.html), this is a proposed implementation without any issue / RFC. If this is the wrong process, please let me know. However, I thought discussing code is easier than discussing a mere idea, and hacking on the stdlib was fun.

Tracking issue: rust-lang#94035
matthiaskrgr added a commit to matthiaskrgr/rust that referenced this issue Feb 19, 2022
…iplett

core: Implement ASCII trim functions on byte slices

Hi ``````@rust-lang/libs!`````` This is a feature that I wished for when implementing serial protocols with microcontrollers. Often these protocols may contain leading or trailing whitespace, which needs to be removed. Because oftentimes drivers will operate on the byte level, decoding to unicode and checking for unicode whitespace is unnecessary overhead.

This PR adds three new methods to byte slices:

- `trim_ascii_start`
- `trim_ascii_end`
- `trim_ascii`

I did not find any pre-existing discussions about this, which surprises me a bit. Maybe I'm missing something, and this functionality is already possible through other means? There's rust-lang/rfcs#2547 ("Trim methods on slices"), but that has a different purpose.

As per the [std dev guide](https://std-dev-guide.rust-lang.org/feature-lifecycle/new-unstable-features.html), this is a proposed implementation without any issue / RFC. If this is the wrong process, please let me know. However, I thought discussing code is easier than discussing a mere idea, and hacking on the stdlib was fun.

Tracking issue: rust-lang#94035
matthiaskrgr added a commit to matthiaskrgr/rust that referenced this issue Feb 19, 2022
…iplett

core: Implement ASCII trim functions on byte slices

Hi ```````@rust-lang/libs!``````` This is a feature that I wished for when implementing serial protocols with microcontrollers. Often these protocols may contain leading or trailing whitespace, which needs to be removed. Because oftentimes drivers will operate on the byte level, decoding to unicode and checking for unicode whitespace is unnecessary overhead.

This PR adds three new methods to byte slices:

- `trim_ascii_start`
- `trim_ascii_end`
- `trim_ascii`

I did not find any pre-existing discussions about this, which surprises me a bit. Maybe I'm missing something, and this functionality is already possible through other means? There's rust-lang/rfcs#2547 ("Trim methods on slices"), but that has a different purpose.

As per the [std dev guide](https://std-dev-guide.rust-lang.org/feature-lifecycle/new-unstable-features.html), this is a proposed implementation without any issue / RFC. If this is the wrong process, please let me know. However, I thought discussing code is easier than discussing a mere idea, and hacking on the stdlib was fun.

Tracking issue: rust-lang#94035
matthiaskrgr added a commit to matthiaskrgr/rust that referenced this issue Feb 19, 2022
…iplett

core: Implement ASCII trim functions on byte slices

Hi ````````@rust-lang/libs!```````` This is a feature that I wished for when implementing serial protocols with microcontrollers. Often these protocols may contain leading or trailing whitespace, which needs to be removed. Because oftentimes drivers will operate on the byte level, decoding to unicode and checking for unicode whitespace is unnecessary overhead.

This PR adds three new methods to byte slices:

- `trim_ascii_start`
- `trim_ascii_end`
- `trim_ascii`

I did not find any pre-existing discussions about this, which surprises me a bit. Maybe I'm missing something, and this functionality is already possible through other means? There's rust-lang/rfcs#2547 ("Trim methods on slices"), but that has a different purpose.

As per the [std dev guide](https://std-dev-guide.rust-lang.org/feature-lifecycle/new-unstable-features.html), this is a proposed implementation without any issue / RFC. If this is the wrong process, please let me know. However, I thought discussing code is easier than discussing a mere idea, and hacking on the stdlib was fun.

Tracking issue: rust-lang#94035
@Kage-Yami
Copy link

Kage-Yami commented Jun 19, 2022

On the other hand, we have char::is_ascii_whitespace, so it should be fairly easy to add these functions to string slices as well. However, I don't really see a use case for this. Why would someone use trim_start_ascii over trim_start?

Naively (without knowing the internals at all), I could think of "performance" as a reason.

That being said, does this unresolved question really need to block this from being stabilised? After all, it's specifically about &[u8], not slices in general. If we can agree to that, then what's stopping this from proceeding to FCP? (I'm having trouble finding documentation for the process for tracking issue FCPs, could only find something for RFCs...)

@BurntSushi
Copy link
Member

Both byte slices and string slices can have both Unicode-aware trimming and ASCII-aware trimming. Unicode support needn't be limited to str and ASCII support needn't be limited to [u8].

With that said, I think adding these methods starts us down the path of treating byte slices as byte strings. I think I am in general in favor of that, but I do think we want to proceed deliberately here. It might make sense to block this on a higher level policy question of 1) do we want byte slices to be byte strings? and 2) if so, what do we want to add? (2) is particularly relevant to make sure we end up with good names I think.

@golddranks
Copy link
Contributor

golddranks commented Oct 10, 2022

I have a use case where I have to trim a bunch of b'\0's from the end of a byte slice. To me, trim_matches-like methods for slices in all generality would be warranted. I wonder whether people agree with that?

@kornelski
Copy link
Contributor

kornelski commented May 10, 2023

I think these would be useful (including also on str). I often find myself writing mini-parsers that deal with identifiers and syntaxes that are definitely ASCII-only. I don't want to pay code size increase for proper Unicode processing. I also don't want to create APIs or data formats that unnecessarily allow complexity of all of Unicode whitespace.

@workingjubilee workingjubilee added the A-Unicode Area: Unicode label Jul 22, 2023
@c410-f3r
Copy link
Contributor

What is the status of this feature?

@dbrgn
Copy link
Contributor Author

dbrgn commented Aug 25, 2023

With that said, I think adding these methods starts us down the path of treating byte slices as byte strings. I think I am in general in favor of that, but I do think we want to proceed deliberately here.
(...)
do we want byte slices to be byte strings?

I don't think that this a policy decision like this is needed. Byte slices may contain byte strings, or something else. Of course trimming ASCII whitespace results in nonsensical results, if the slices aren't actually ASCII strings, but that's all.

Note that there are already other slice methods that treat bytes as potential strings, for example:

What would be needed to move this proposal into FCP?

@okaneco
Copy link
Contributor

okaneco commented Dec 16, 2023

I filed a libs ACP, which was accepted, for extending these functions to &str rust-lang/libs-team#313

Implementation PR #118523

  • Add trim_ascii_start, trim_ascii_end, and trim_ascii functions to &str for trimming ASCII whitespace
  • Add #[inline] to [u8] trim_ascii functions

I was instructed to use this tracking issue.

workingjubilee added a commit to workingjubilee/rustc that referenced this issue Dec 16, 2023
Add ASCII whitespace trimming functions to `&str`

- Add `trim_ascii_start`, `trim_ascii_end`, and `trim_ascii` functions to `&str` for trimming ASCII whitespace
- Add `#[inline]` to `[u8]` `trim_ascii` functions

These functions are feature-gated by `#![feature(byte_slice_trim_ascii)]` rust-lang#94035
rust-timer added a commit to rust-lang-ci/rust that referenced this issue Dec 16, 2023
Rollup merge of rust-lang#118523 - okaneco:trim_ascii, r=Mark-Simulacrum

Add ASCII whitespace trimming functions to `&str`

- Add `trim_ascii_start`, `trim_ascii_end`, and `trim_ascii` functions to `&str` for trimming ASCII whitespace
- Add `#[inline]` to `[u8]` `trim_ascii` functions

These functions are feature-gated by `#![feature(byte_slice_trim_ascii)]` rust-lang#94035
@okaneco
Copy link
Contributor

okaneco commented Feb 7, 2024

Stabilization Report

Implementation History

API Summary

The following six functions would be stabilized: three functions on &[u8] and three functions on &str.
u8::is_ascii_whitespace is used to determine whitespace.

  • Remove leading and trailing ASCII whitespace
    • const fn trim_ascii(&self) -> &[u8]
    • const fn trim_ascii(&self) -> &str
  • Remove trailing ASCII whitespace
    • const fn trim_ascii_end(&self) -> &[u8]
    • const fn trim_ascii_end(&self) -> &str
  • Remove leading ASCII whitespace
    • const fn trim_ascii_start(&self) -> &[u8]
    • const fn trim_ascii_start(&self) -> &str

Examples:

// &[u8]
assert_eq!(b"\r hello world\n ".trim_ascii(), b"hello world");
assert_eq!(b"\r hello world\n ".trim_ascii_end(), b"\r hello world");
assert_eq!(b" \t hello world\n".trim_ascii_start(), b"hello world\n");
// &str
assert_eq!("\r hello world\n ".trim_ascii(), "hello world");
assert_eq!("\r hello world\n ".trim_ascii_end(), "\r hello world");
assert_eq!(" \t hello world\n".trim_ascii_start(), "hello world\n");

Possibly Unresolved Questions

Question: Does a policy decision need to be made about treating byte slices as byte strings? What methods should be added?

With that said, I think adding these methods starts us down the path of treating byte slices as byte strings. I think I am in general in favor of that, but I do think we want to proceed deliberately here. It might make sense to block this on a higher level policy question of 1) do we want byte slices to be byte strings? and 2) if so, what do we want to add? (2) is particularly relevant to make sure we end up with good names I think.

#94035 (comment)

Response:

I don't think that this a policy decision like this is needed. Byte slices may contain byte strings, or something else. Of course trimming ASCII whitespace results in nonsensical results, if the slices aren't actually ASCII strings, but that's all.

Note that there are already other slice methods that treat bytes as potential strings, for example:

#94035 (comment)

For more context, escape_ascii was stabilized in 1.60 and the other four methods were stabilized in 1.23.


If this is adequate to start FCP, I can make the stabilization PR.

@vacuus
Copy link
Contributor

vacuus commented Apr 28, 2024

Seems like this is more than ready for stabilization.

@jhpratt jhpratt added the I-libs-api-nominated Nominated for discussion during a libs-api team meeting. label Apr 28, 2024
@jhpratt
Copy link
Member

jhpratt commented Apr 28, 2024

Nominating for T-libs-api, as there seem to be no blockers to stabilization, with a report already being written.

@BurntSushi
Copy link
Member

This overall seems reasonable to me. We've clearly already started down the path of "&[u8] is sometimes a byte string." I also think we should continue to that path.

@rfcbot fcp merge

@rfcbot
Copy link

rfcbot commented Apr 29, 2024

Team member @BurntSushi has proposed to merge this. The next step is review by the rest of the tagged team members:

No concerns currently listed.

Once a majority of reviewers approve (and at most 2 approvals are outstanding), this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up!

See this document for info about what commands tagged team members can give me.

@rfcbot rfcbot added proposed-final-comment-period Proposed to merge/close by relevant subteam, see T-<team> label. Will enter FCP once signed off. disposition-merge This issue / PR is in PFCP or FCP with a disposition to merge it. final-comment-period In the final comment period and will be merged soon unless new substantive objections are raised. and removed proposed-final-comment-period Proposed to merge/close by relevant subteam, see T-<team> label. Will enter FCP once signed off. labels Apr 29, 2024
@rfcbot
Copy link

rfcbot commented Apr 29, 2024

🔔 This is now entering its final comment period, as per the review above. 🔔

@m-ou-se m-ou-se removed the I-libs-api-nominated Nominated for discussion during a libs-api team meeting. label Apr 30, 2024
@rfcbot rfcbot added finished-final-comment-period The final comment period is finished for this PR / Issue. and removed final-comment-period In the final comment period and will be merged soon unless new substantive objections are raised. labels May 9, 2024
@rfcbot
Copy link

rfcbot commented May 9, 2024

The final comment period, with a disposition to merge, as per the review above, is now complete.

As the automated representative of the governance process, I would like to thank the author for their work and everyone else who contributed.

This will be merged soon.

@rfcbot rfcbot added the to-announce Announce this issue on triage meeting label May 9, 2024
@apiraino apiraino removed the to-announce Announce this issue on triage meeting label May 10, 2024
@bors bors closed this as completed in 03ff775 May 11, 2024
rust-timer added a commit to rust-lang-ci/rust that referenced this issue May 11, 2024
Rollup merge of rust-lang#124928 - okaneco:trim_ascii, r=workingjubilee

Stabilize `byte_slice_trim_ascii` for `&[u8]`/`&str`

Remove feature from documentation examples
Update intra-doc link for `u8::is_ascii_whitespace` on `&[u8]` functions

Closes rust-lang#94035

FCP has successfully completed rust-lang#94035 (comment)
@jonhoo
Copy link
Contributor

jonhoo commented May 12, 2024

🎉 Is this worth an explicit relnotes tag?

@jhpratt jhpratt added the relnotes Marks issues that should be documented in the release notes of the next release. label May 12, 2024
@jhpratt
Copy link
Member

jhpratt commented May 12, 2024

@jonhoo Definitely; I've added it.

RalfJung pushed a commit to RalfJung/miri that referenced this issue May 12, 2024
Stabilize `byte_slice_trim_ascii` for `&[u8]`/`&str`

Remove feature from documentation examples
Update intra-doc link for `u8::is_ascii_whitespace` on `&[u8]` functions

Closes #94035

FCP has successfully completed rust-lang/rust#94035 (comment)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-Unicode Area: Unicode C-tracking-issue Category: An issue tracking the progress of sth. like the implementation of an RFC disposition-merge This issue / PR is in PFCP or FCP with a disposition to merge it. finished-final-comment-period The final comment period is finished for this PR / Issue. relnotes Marks issues that should be documented in the release notes of the next release. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging a pull request may close this issue.