Incorrect size_hint() on EncodeUtf16 #113897

Closed
ajtribick opened this issue Jul 20, 2023 · 1 comment · Fixed by #113898
Labels
A-iterators Area: Iterators C-bug Category: This is a bug. T-libs Relevant to the library team, which will review and decide on the PR/issue.

Comments

@ajtribick
Contributor

I tried this code:

println!("{:?}", "12345678901234".encode_utf16().size_hint());

let mut it = "\u{101234}".encode_utf16();
it.next().unwrap();
println!("{:?}", it.size_hint());

I expected to see this happen:

(5, Some(14))
(1, Some(1))

Instead, this happened:

(4, Some(28))
(0, Some(0))

Meta

rustc --version --verbose:

rustc 1.73.0-nightly (39f42ad9e 2023-07-19)
binary: rustc
commit-hash: 39f42ad9e8430a8abb06c262346e89593278c515
commit-date: 2023-07-19
host: x86_64-pc-windows-msvc
release: 1.73.0-nightly
LLVM version: 16.0.5

The reason is that the EncodeUtf16 iterator calculates its size hint from the size hint of the contained Chars iterator, assuming that each character corresponds to either 1 or 2 code units.

In the case that the iterator is NOT in the middle of a surrogate pair, this leads to lower bounds that are too low and upper bounds that are too high.
In the case that the iterator IS in the middle of a surrogate pair, the remaining code unit is not taken into account, as the Chars iterator has already advanced past that character.

The actual calculation should be done in terms of the remaining bytes:

  • The lower bound is achieved by assuming the remaining bytes consist of as many 3-byte sequences as possible (each yields a single code unit), optionally followed by a 1- or 2-byte sequence, leading to a lower bound of (bytes_remaining + 2) / 3.
  • The upper bound is achieved by assuming the remaining bytes consist only of 1-byte sequences, leading to an upper bound of bytes_remaining.

In the case of the iterator being positioned in the middle of a surrogate pair, both these values should be increased by 1.
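
The calculation proposed above can be sketched as a small standalone function. This is only an illustration of the arithmetic, not the standard library's implementation: the name utf16_size_hint and its two parameters (the number of undecoded UTF-8 bytes, and whether a trailing surrogate is still pending) are hypothetical.

```rust
/// Hypothetical helper (an assumption for illustration, not std's code):
/// computes the size hint proposed in this issue.
///
/// `bytes_remaining` is the number of UTF-8 bytes not yet decoded;
/// `pending_surrogate` is true when the iterator has emitted the first half
/// of a surrogate pair and still owes the second half.
fn utf16_size_hint(bytes_remaining: usize, pending_surrogate: bool) -> (usize, Option<usize>) {
    // One extra code unit is owed when we are in the middle of a surrogate pair.
    let extra = usize::from(pending_surrogate);
    // Fewest code units: as many 3-byte sequences as possible (1 unit each),
    // i.e. ceil(bytes_remaining / 3).
    let lower = (bytes_remaining + 2) / 3 + extra;
    // Most code units: every remaining byte is a 1-byte sequence (1 unit each).
    let upper = bytes_remaining + extra;
    (lower, Some(upper))
}
```

For the two cases in this report, utf16_size_hint(14, false) gives (5, Some(14)) and utf16_size_hint(0, true) gives (1, Some(1)), matching the expected output.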

@ajtribick ajtribick added the C-bug Category: This is a bug. label Jul 20, 2023
@rustbot rustbot added the needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. label Jul 20, 2023
@the8472
Member

the8472 commented Jul 20, 2023

This is not a bug; size hints are allowed to be looser than necessary. As the documentation says:

The default implementation returns (0, None) which is correct for any iterator.

But if it can be made tighter while still being correct for all values that's fine.

Edit:

(0, Some(0))

Oh, yeah. This one is actually a bug since the iterator is not exhausted at that point.
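
The distinction between a loose hint and an incorrect one can be checked mechanically: a hint is correct as long as the number of items actually yielded lies between the lower bound and the upper bound (when there is one). A minimal sketch, where hint_is_consistent is a hypothetical helper name and not part of the standard library:

```rust
/// Hypothetical helper: returns true if the size hint reported before
/// iterating brackets the number of items the iterator actually yields.
fn hint_is_consistent<I: Iterator>(it: I) -> bool {
    let (lower, upper) = it.size_hint();
    // Consume the iterator to learn the real number of remaining items.
    let actual = it.count();
    // An upper bound of None means "unknown", which can never be violated.
    lower <= actual && upper.map_or(true, |u| actual <= u)
}
```

Under this check, (4, Some(28)) for a 14-code-unit string is loose but consistent, whereas (0, Some(0)) for an iterator that still yields one code unit is not.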

@the8472 the8472 added C-enhancement Category: An issue proposing an enhancement or a PR with one. T-libs Relevant to the library team, which will review and decide on the PR/issue. A-iterators Area: Iterators C-bug Category: This is a bug. and removed C-bug Category: This is a bug. needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. C-enhancement Category: An issue proposing an enhancement or a PR with one. labels Jul 20, 2023
@bors bors closed this as completed in 65b5cba Jul 22, 2023