Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize pointer alignment in utf8 validation #61339

Merged
merged 2 commits into from
Jul 17, 2019

Conversation

jridgewell
Copy link
Contributor

@jridgewell jridgewell commented May 30, 2019

This uses (and reuses) the u8 arrays's inherent block alignment when checking whether the current index is block aligned.

I initially thought that this would just move the expensive align_offset call out of the while loop and replace it with a subtraction and bitwise AND. But it appears this optimizes much better, too...

before: https://rust.godbolt.org/z/WIPvWl
after: https://rust.godbolt.org/z/-jBPoW

Benchmarks

https://github.com/jridgewell/faster-from_utf8/tree/pointer-alignment

test from_utf8_2_bytes_fast      ... bench:         310 ns/iter (+/- 42) = 1290 MB/s
test from_utf8_2_bytes_regular   ... bench:         309 ns/iter (+/- 24) = 1294 MB/s

test from_utf8_3_bytes_fast      ... bench:       1,027 ns/iter (+/- 62) = 1168 MB/s
test from_utf8_3_bytes_regular   ... bench:       1,513 ns/iter (+/- 611) = 793 MB/s

test from_utf8_4_bytes_fast      ... bench:       1,788 ns/iter (+/- 26) = 1342 MB/s
test from_utf8_4_bytes_regular   ... bench:       1,907 ns/iter (+/- 181) = 1258 MB/s

test from_utf8_all_bytes_fast    ... bench:       3,463 ns/iter (+/- 97) = 1155 MB/s
test from_utf8_all_bytes_regular ... bench:       4,083 ns/iter (+/- 89) = 979 MB/s

test from_utf8_ascii_fast        ... bench:          88 ns/iter (+/- 4) = 28988 MB/s
test from_utf8_ascii_regular     ... bench:          88 ns/iter (+/- 8) = 28988 MB/s

test from_utf8_cyr_fast          ... bench:       7,707 ns/iter (+/- 531) = 665 MB/s
test from_utf8_cyr_regular       ... bench:       8,202 ns/iter (+/- 135) = 625 MB/s

test from_utf8_enwik8_fast       ... bench:   1,135,756 ns/iter (+/- 84,450) = 8804 MB/s
test from_utf8_enwik8_regular    ... bench:   1,145,468 ns/iter (+/- 79,601) = 8730 MB/s

test from_utf8_jawik10_fast      ... bench:  12,723,844 ns/iter (+/- 473,247) = 785 MB/s
test from_utf8_jawik10_regular   ... bench:  13,384,596 ns/iter (+/- 666,997) = 747 MB/s

test from_utf8_mixed_fast        ... bench:       2,321 ns/iter (+/- 123) = 2081 MB/s
test from_utf8_mixed_regular     ... bench:       2,702 ns/iter (+/- 408) = 1788 MB/s

test from_utf8_mostlyasc_fast    ... bench:         249 ns/iter (+/- 10) = 14666 MB/s
test from_utf8_mostlyasc_regular ... bench:         276 ns/iter (+/- 5) = 13231 MB/s

@rust-highfive
Copy link
Collaborator

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @TimNN (or someone else) soon.

If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes.

Please see the contribution instructions for more information.

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label May 30, 2019
@rust-highfive
Copy link
Collaborator

The job x86_64-gnu-llvm-6.0 of your PR failed on Travis (raw log). Through arcane magic we have determined that the following fragments from the build log may contain information about the problem.

Click to expand the log.
travis_time:end:25b5bbb8:start=1559187520651196332,finish=1559187521388138197,duration=736941865
$ git checkout -qf FETCH_HEAD
travis_fold:end:git.checkout

Encrypted environment variables have been removed for security reasons.
See https://docs.travis-ci.com/user/pull-requests/#pull-requests-and-security-restrictions
$ export SCCACHE_BUCKET=rust-lang-ci-sccache2
$ export SCCACHE_REGION=us-west-1
$ export GCP_CACHE_BUCKET=rust-lang-ci-cache
$ export AWS_ACCESS_KEY_ID=AKIA46X5W6CZEJZ6XT55
---
[00:04:36]    Compiling backtrace v0.3.25
[00:04:37] error[E0425]: cannot find value `ptr` in this scope
[00:04:37]     --> src/libcore/str/mod.rs:1420:17
[00:04:37]      |
[00:04:37] 1420 |     let align = ptr.align_offset(usize_bytes);
[00:04:37] 
[00:04:40]    Compiling compiler_builtins v0.1.15
[00:04:40]    Compiling cmake v0.1.38
[00:04:40]    Compiling backtrace-sys v0.1.27

I'm a bot! I can only do what humans tell me to, so if this was not helpful or you have suggestions for improvements, please ping or otherwise contact @TimNN. (Feature Requests)

This uses (and reuses) the u8 arrays's inherent block alignment when checking whether the current index is block aligned.

I initially thought that this would just move the expensive `align_offset` call out of the while loop and replace it with a subtraction and bitwise AND. But it appears this optimizes much better, too...

before: https://rust.godbolt.org/z/WIPvWl
after: https://rust.godbolt.org/z/-jBPoW

https://github.com/jridgewell/faster-from_utf8/tree/pointer-alignment

```
test from_utf8_2_bytes_fast      ... bench:         310 ns/iter (+/- 42) = 1290 MB/s
test from_utf8_2_bytes_regular   ... bench:         309 ns/iter (+/- 24) = 1294 MB/s

test from_utf8_3_bytes_fast      ... bench:       1,027 ns/iter (+/- 62) = 1168 MB/s
test from_utf8_3_bytes_regular   ... bench:       1,513 ns/iter (+/- 611) = 793 MB/s

test from_utf8_4_bytes_fast      ... bench:       1,788 ns/iter (+/- 26) = 1342 MB/s
test from_utf8_4_bytes_regular   ... bench:       1,907 ns/iter (+/- 181) = 1258 MB/s

test from_utf8_all_bytes_fast    ... bench:       3,463 ns/iter (+/- 97) = 1155 MB/s
test from_utf8_all_bytes_regular ... bench:       4,083 ns/iter (+/- 89) = 979 MB/s

test from_utf8_ascii_fast        ... bench:          88 ns/iter (+/- 4) = 28988 MB/s
test from_utf8_ascii_regular     ... bench:          88 ns/iter (+/- 8) = 28988 MB/s

test from_utf8_cyr_fast          ... bench:       7,707 ns/iter (+/- 531) = 665 MB/s
test from_utf8_cyr_regular       ... bench:       8,202 ns/iter (+/- 135) = 625 MB/s

test from_utf8_enwik8_fast       ... bench:   1,135,756 ns/iter (+/- 84,450) = 8804 MB/s
test from_utf8_enwik8_regular    ... bench:   1,145,468 ns/iter (+/- 79,601) = 8730 MB/s

test from_utf8_jawik10_fast      ... bench:  12,723,844 ns/iter (+/- 473,247) = 785 MB/s
test from_utf8_jawik10_regular   ... bench:  13,384,596 ns/iter (+/- 666,997) = 747 MB/s

test from_utf8_mixed_fast        ... bench:       2,321 ns/iter (+/- 123) = 2081 MB/s
test from_utf8_mixed_regular     ... bench:       2,702 ns/iter (+/- 408) = 1788 MB/s

test from_utf8_mostlyasc_fast    ... bench:         249 ns/iter (+/- 10) = 14666 MB/s
test from_utf8_mostlyasc_regular ... bench:         276 ns/iter (+/- 5) = 13231 MB/s
```
@jridgewell jridgewell force-pushed the pointer-alignment branch from 9792561 to 3d2c4ff Compare May 30, 2019 03:52
@Centril
Copy link
Contributor

Centril commented May 30, 2019

r? @SimonSapin

@rust-highfive rust-highfive assigned SimonSapin and unassigned TimNN May 30, 2019
@jridgewell
Copy link
Contributor Author

We could probably make this alignment calculation into a real API? Something like:

struct Alignment {
    block_size: usize,
    align: usize,
}

impl Alignment {
    pub fn is_aligned(&self, offset: usize) -> bool {
        self.align.wrapping_sub(offset) % self.block_size == 0
    }
}

And tied to the lifetime of the pointer.

verification of alignment calculation
[src/main.rs:22] index = 0
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 1
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 1

[src/main.rs:22] index = 1
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 0
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 0

[src/main.rs:22] index = 2
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 7
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 7

[src/main.rs:22] index = 3
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 6
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 6

[src/main.rs:22] index = 4
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 5
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 5

[src/main.rs:22] index = 5
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 4
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 4

[src/main.rs:22] index = 6
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 3
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 3

[src/main.rs:22] index = 7
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 2
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 2

[src/main.rs:22] index = 8
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 1
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 1

[src/main.rs:22] index = 9
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 0
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 0

[src/main.rs:22] index = 10
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 7
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 7

[src/main.rs:22] index = 11
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 6
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 6

[src/main.rs:22] index = 12
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 5
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 5

[src/main.rs:22] index = 13
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 4
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 4

[src/main.rs:22] index = 14
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 3
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 3

[src/main.rs:22] index = 15
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 2
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 2

[src/main.rs:22] index = 16
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 1
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 1

[src/main.rs:22] index = 17
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 0
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 0

[src/main.rs:22] index = 18
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 7
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 7

[src/main.rs:22] index = 19
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 6
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 6

[src/main.rs:22] index = 20
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 5
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 5

[src/main.rs:22] index = 21
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 4
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 4

[src/main.rs:22] index = 22
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 3
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 3

[src/main.rs:22] index = 23
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 2
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 2

[src/main.rs:22] index = 24
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 1
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 1

[src/main.rs:22] index = 25
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 0
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 0

[src/main.rs:22] index = 26
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 7
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 7

[src/main.rs:22] index = 27
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 6
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 6

[src/main.rs:22] index = 28
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 5
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 5

[src/main.rs:22] index = 29
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 4
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 4

[src/main.rs:22] index = 30
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 3
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 3

[src/main.rs:22] index = 31
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 2
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 2

[src/main.rs:22] index = 32
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 1
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 1

[src/main.rs:22] index = 33
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 0
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 0

[src/main.rs:22] index = 34
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 7
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 7

[src/main.rs:22] index = 35
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 6
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 6

[src/main.rs:22] index = 36
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 5
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 5

[src/main.rs:22] index = 37
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 4
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 4

[src/main.rs:22] index = 38
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 3
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 3

[src/main.rs:22] index = 39
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 2
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 2

[src/main.rs:22] index = 40
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 1
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 1

[src/main.rs:22] index = 41
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 0
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 0

[src/main.rs:22] index = 42
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 7
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 7

[src/main.rs:22] index = 43
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 6
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 6

[src/main.rs:22] index = 44
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 5
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 5

[src/main.rs:22] index = 45
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 4
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 4

[src/main.rs:22] index = 46
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 3
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 3

[src/main.rs:22] index = 47
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 2
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 2

[src/main.rs:22] index = 48
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 1
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 1

[src/main.rs:22] index = 49
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 0
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 0

[src/main.rs:22] index = 50
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 7
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 7

[src/main.rs:22] index = 51
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 6
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 6

[src/main.rs:22] index = 52
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 5
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 5

[src/main.rs:22] index = 53
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 4
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 4

[src/main.rs:22] index = 54
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 3
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 3

[src/main.rs:22] index = 55
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 2
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 2

[src/main.rs:22] index = 56
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 1
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 1

[src/main.rs:22] index = 57
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 0
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 0

[src/main.rs:22] index = 58
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 7
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 7

[src/main.rs:22] index = 59
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 6
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 6

[src/main.rs:22] index = 60
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 5
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 5

[src/main.rs:22] index = 61
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 4
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 4

[src/main.rs:22] index = 62
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 3
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 3

[src/main.rs:22] index = 63
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 2
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 2

[src/main.rs:22] index = 64
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 1
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 1

[src/main.rs:22] index = 65
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 0
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 0

[src/main.rs:22] index = 66
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 7
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 7

[src/main.rs:22] index = 67
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 6
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 6

[src/main.rs:22] index = 68
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 5
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 5

[src/main.rs:22] index = 69
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 4
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 4

[src/main.rs:22] index = 70
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 3
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 3

[src/main.rs:22] index = 71
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 2
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 2

[src/main.rs:22] index = 72
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 1
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 1

[src/main.rs:22] index = 73
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 0
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 0

[src/main.rs:22] index = 74
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 7
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 7

[src/main.rs:22] index = 75
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 6
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 6

[src/main.rs:22] index = 76
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 5
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 5

[src/main.rs:22] index = 77
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 4
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 4

[src/main.rs:22] index = 78
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 3
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 3

[src/main.rs:22] index = 79
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 2
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 2

[src/main.rs:22] index = 80
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 1
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 1

[src/main.rs:22] index = 81
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 0
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 0

[src/main.rs:22] index = 82
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 7
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 7

[src/main.rs:22] index = 83
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 6
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 6

[src/main.rs:22] index = 84
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 5
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 5

[src/main.rs:22] index = 85
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 4
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 4

[src/main.rs:22] index = 86
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 3
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 3

[src/main.rs:22] index = 87
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 2
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 2

[src/main.rs:22] index = 88
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 1
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 1

[src/main.rs:22] index = 89
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 0
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 0

[src/main.rs:22] index = 90
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 7
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 7

[src/main.rs:22] index = 91
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 6
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 6

[src/main.rs:22] index = 92
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 5
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 5

[src/main.rs:22] index = 93
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 4
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 4

[src/main.rs:22] index = 94
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 3
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 3

[src/main.rs:22] index = 95
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 2
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 2

[src/main.rs:22] index = 96
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 1
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 1

[src/main.rs:22] index = 97
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 0
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 0

[src/main.rs:22] index = 98
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 7
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 7

[src/main.rs:22] index = 99
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 6
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 6

[src/main.rs:22] index = 100
[src/main.rs:24] align.wrapping_sub(index) % usize_bytes = 5
[src/main.rs:25] unsafe { v.as_ptr().add(index).align_offset(usize_bytes) } = 5

@SimonSapin
Copy link
Contributor

I’m on vacation this week and this looks non-obvious.

r? @BurntSushi

@jridgewell
Copy link
Contributor Author

Friendly ping @BurntSushi

Sample test to prove that the calculation is the same
#[test]
fn test_pointer_alignment() {
    let bytes = ASCII.as_bytes();
    assert_ne!(intrinsic_alignment(bytes), 0);
    test(bytes);
}

fn intrinsic_alignment(bytes: &[u8]) -> usize {
    let usize_bytes = std::mem::size_of::<usize>();
    bytes.as_ptr().align_offset(usize_bytes)
}

fn test(bytes: &[u8]) {
    let usize_bytes = std::mem::size_of::<usize>();
    let align = bytes.as_ptr().align_offset(usize_bytes);
    let len = bytes.len();
    let mut index = 0;

    while index < len {
        let wrapping_sub = align.wrapping_sub(index) % usize_bytes;
        let align_offset = unsafe { bytes.as_ptr().add(index).align_offset(usize_bytes) };
        assert_eq!(wrapping_sub, align_offset);

        index += 1;
    }
}

static ASCII: &'static str = "\
hello this is a test \
hello this is a test \
hello this is a test \
hello this is a test \
hello this is a test \
hello this is a test \
";

@Dylan-DPC-zz
Copy link

ping from triage @BurntSushi awaiting your review on this

@BurntSushi
Copy link
Member

I buy it! @bors r+

Note that it is fairly straight-forward to write a SIMD-accelerated version of an ASCII check using only SSE2 instructions, which means that should work in core on x86_64 only. Here's my version of said algorithm: https://github.com/BurntSushi/bstr/blob/ac4e670d95ab88e41375e0e4f7914e995e93fbed/src/ascii.rs#L118-L197

Also, more generally, I've found that code like this results in both better readability and better codegen if one just commits to using raw pointers everywhere, instead of trying to manage a back-and-forth between raw pointers and indices.

@tesuji
Copy link
Contributor

tesuji commented Jul 5, 2019

Bors doesn't respond to the approvement action.

@Centril Centril closed this Jul 5, 2019
@Centril Centril reopened this Jul 5, 2019
@Centril
Copy link
Contributor

Centril commented Jul 5, 2019

@bors r=BurntSushi

@bors
Copy link
Contributor

bors commented Jul 5, 2019

📌 Commit 3d2c4ff has been approved by BurntSushi

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Jul 5, 2019
@bors
Copy link
Contributor

bors commented Jul 5, 2019

⌛ Testing commit 3d2c4ff with merge 1437609c16a41fee5391dc06309830cd38b72854...

Centril added a commit to Centril/rust that referenced this pull request Jul 5, 2019
…ntSushi

Optimize pointer alignment in utf8 validation

This uses (and reuses) the u8 arrays's inherent block alignment when checking whether the current index is block aligned.

I initially thought that this would just move the expensive `align_offset` call out of the while loop and replace it with a subtraction and bitwise AND. But it appears this optimizes much better, too...

before: https://rust.godbolt.org/z/WIPvWl
after: https://rust.godbolt.org/z/-jBPoW

## Benchmarks

https://github.com/jridgewell/faster-from_utf8/tree/pointer-alignment

```
test from_utf8_2_bytes_fast      ... bench:         310 ns/iter (+/- 42) = 1290 MB/s
test from_utf8_2_bytes_regular   ... bench:         309 ns/iter (+/- 24) = 1294 MB/s

test from_utf8_3_bytes_fast      ... bench:       1,027 ns/iter (+/- 62) = 1168 MB/s
test from_utf8_3_bytes_regular   ... bench:       1,513 ns/iter (+/- 611) = 793 MB/s

test from_utf8_4_bytes_fast      ... bench:       1,788 ns/iter (+/- 26) = 1342 MB/s
test from_utf8_4_bytes_regular   ... bench:       1,907 ns/iter (+/- 181) = 1258 MB/s

test from_utf8_all_bytes_fast    ... bench:       3,463 ns/iter (+/- 97) = 1155 MB/s
test from_utf8_all_bytes_regular ... bench:       4,083 ns/iter (+/- 89) = 979 MB/s

test from_utf8_ascii_fast        ... bench:          88 ns/iter (+/- 4) = 28988 MB/s
test from_utf8_ascii_regular     ... bench:          88 ns/iter (+/- 8) = 28988 MB/s

test from_utf8_cyr_fast          ... bench:       7,707 ns/iter (+/- 531) = 665 MB/s
test from_utf8_cyr_regular       ... bench:       8,202 ns/iter (+/- 135) = 625 MB/s

test from_utf8_enwik8_fast       ... bench:   1,135,756 ns/iter (+/- 84,450) = 8804 MB/s
test from_utf8_enwik8_regular    ... bench:   1,145,468 ns/iter (+/- 79,601) = 8730 MB/s

test from_utf8_jawik10_fast      ... bench:  12,723,844 ns/iter (+/- 473,247) = 785 MB/s
test from_utf8_jawik10_regular   ... bench:  13,384,596 ns/iter (+/- 666,997) = 747 MB/s

test from_utf8_mixed_fast        ... bench:       2,321 ns/iter (+/- 123) = 2081 MB/s
test from_utf8_mixed_regular     ... bench:       2,702 ns/iter (+/- 408) = 1788 MB/s

test from_utf8_mostlyasc_fast    ... bench:         249 ns/iter (+/- 10) = 14666 MB/s
test from_utf8_mostlyasc_regular ... bench:         276 ns/iter (+/- 5) = 13231 MB/s
```
@Centril
Copy link
Contributor

Centril commented Jul 5, 2019

@bors retry rolled up.

@bors
Copy link
Contributor

bors commented Jul 5, 2019

⌛ Testing commit 3d2c4ff with merge d8a21c4d14accefd54b2483f4bae01ca4045fb3e...

@alexreg
Copy link
Contributor

alexreg commented Jul 5, 2019

@bors retry (same reason as @Centril's above)

@bors
Copy link
Contributor

bors commented Jul 5, 2019

⌛ Testing commit 3d2c4ff with merge 4ef8922d6e2cab03a7329c185fd87f9860a2080e...

Centril added a commit to Centril/rust that referenced this pull request Jul 5, 2019
…ntSushi

Optimize pointer alignment in utf8 validation

This uses (and reuses) the u8 arrays's inherent block alignment when checking whether the current index is block aligned.

I initially thought that this would just move the expensive `align_offset` call out of the while loop and replace it with a subtraction and bitwise AND. But it appears this optimizes much better, too...

before: https://rust.godbolt.org/z/WIPvWl
after: https://rust.godbolt.org/z/-jBPoW

## Benchmarks

https://github.com/jridgewell/faster-from_utf8/tree/pointer-alignment

```
test from_utf8_2_bytes_fast      ... bench:         310 ns/iter (+/- 42) = 1290 MB/s
test from_utf8_2_bytes_regular   ... bench:         309 ns/iter (+/- 24) = 1294 MB/s

test from_utf8_3_bytes_fast      ... bench:       1,027 ns/iter (+/- 62) = 1168 MB/s
test from_utf8_3_bytes_regular   ... bench:       1,513 ns/iter (+/- 611) = 793 MB/s

test from_utf8_4_bytes_fast      ... bench:       1,788 ns/iter (+/- 26) = 1342 MB/s
test from_utf8_4_bytes_regular   ... bench:       1,907 ns/iter (+/- 181) = 1258 MB/s

test from_utf8_all_bytes_fast    ... bench:       3,463 ns/iter (+/- 97) = 1155 MB/s
test from_utf8_all_bytes_regular ... bench:       4,083 ns/iter (+/- 89) = 979 MB/s

test from_utf8_ascii_fast        ... bench:          88 ns/iter (+/- 4) = 28988 MB/s
test from_utf8_ascii_regular     ... bench:          88 ns/iter (+/- 8) = 28988 MB/s

test from_utf8_cyr_fast          ... bench:       7,707 ns/iter (+/- 531) = 665 MB/s
test from_utf8_cyr_regular       ... bench:       8,202 ns/iter (+/- 135) = 625 MB/s

test from_utf8_enwik8_fast       ... bench:   1,135,756 ns/iter (+/- 84,450) = 8804 MB/s
test from_utf8_enwik8_regular    ... bench:   1,145,468 ns/iter (+/- 79,601) = 8730 MB/s

test from_utf8_jawik10_fast      ... bench:  12,723,844 ns/iter (+/- 473,247) = 785 MB/s
test from_utf8_jawik10_regular   ... bench:  13,384,596 ns/iter (+/- 666,997) = 747 MB/s

test from_utf8_mixed_fast        ... bench:       2,321 ns/iter (+/- 123) = 2081 MB/s
test from_utf8_mixed_regular     ... bench:       2,702 ns/iter (+/- 408) = 1788 MB/s

test from_utf8_mostlyasc_fast    ... bench:         249 ns/iter (+/- 10) = 14666 MB/s
test from_utf8_mostlyasc_regular ... bench:         276 ns/iter (+/- 5) = 13231 MB/s
```
@Centril
Copy link
Contributor

Centril commented Jul 5, 2019

@bors retry rolled up.

@Centril
Copy link
Contributor

Centril commented Jul 5, 2019

Let's wait until the issue has resolved itself, @bors r-

@bors bors added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. labels Jul 5, 2019
@RalfJung
Copy link
Member

RalfJung commented Jul 5, 2019

I opened an issue for proper discussion: #62420

In this PR, we might discuss whether it is possible to change it such that it will fall back to the unaligned case if align_offset failed.

Copy link
Contributor Author

@jridgewell jridgewell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've found that code like this results in both better readability and better codegen if one just commits to using raw pointers everywhere, instead of trying to manage a back-and-forth between raw pointers and indices.

I also think it might be better as pointers, but that's a larger code change to do. I can try that in a follow up PR?

ptr.add(index).align_offset(usize_bytes)
};
if align == 0 {
if align.wrapping_sub(index) % usize_bytes == 0 {
Copy link
Contributor Author

@jridgewell jridgewell Jul 5, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should have read align_offset's doc better, I didn't realize it might not be possible to align the offsets. I can manually check for align != usize::max_value() here to avoid this. Interestingly, it seems that check will disappear entirely on supported platforms? Compare with and without, which generate the same assembly.

This could also be added to a Alignment impl, if we wanted to go that route instead.

@BurntSushi
Copy link
Member

I also think it might be better as pointers, but that's a larger code change to do. I can try that in a follow up PR?

Oh yeah, of course! I was just throwing it out there. :)

@RalfJung
Copy link
Member

RalfJung commented Jul 5, 2019

I should have read align_offset's doc better, I didn't realize it might not be possible to align the offsets. I can manually check for align != usize::max_value() here to avoid this. Interestingly, it seems that check will disappear entirely on supported platforms? Compare with and without, which generate the same assembly.

Thanks! Good to hear that this optimizes away, that means there should be no perf cost. :)

This new version looks good to me from a Miri perspective.

@jridgewell
Copy link
Contributor Author

Does this need anything more from me? I think the miri issue is resolved, just needs a final review and merge.

@RalfJung
Copy link
Member

Agreed. @BurntSushi the changes since your last r+ are tiny, could you have a look?

@BurntSushi
Copy link
Member

@bors r+

@bors
Copy link
Contributor

bors commented Jul 17, 2019

📌 Commit 3b6e8ed has been approved by BurntSushi

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Jul 17, 2019
@bors
Copy link
Contributor

bors commented Jul 17, 2019

⌛ Testing commit 3b6e8ed with merge e34d4ae...

bors added a commit that referenced this pull request Jul 17, 2019
Optimize pointer alignment in utf8 validation

This uses (and reuses) the u8 arrays's inherent block alignment when checking whether the current index is block aligned.

I initially thought that this would just move the expensive `align_offset` call out of the while loop and replace it with a subtraction and bitwise AND. But it appears this optimizes much better, too...

before: https://rust.godbolt.org/z/WIPvWl
after: https://rust.godbolt.org/z/-jBPoW

## Benchmarks

https://github.com/jridgewell/faster-from_utf8/tree/pointer-alignment

```
test from_utf8_2_bytes_fast      ... bench:         310 ns/iter (+/- 42) = 1290 MB/s
test from_utf8_2_bytes_regular   ... bench:         309 ns/iter (+/- 24) = 1294 MB/s

test from_utf8_3_bytes_fast      ... bench:       1,027 ns/iter (+/- 62) = 1168 MB/s
test from_utf8_3_bytes_regular   ... bench:       1,513 ns/iter (+/- 611) = 793 MB/s

test from_utf8_4_bytes_fast      ... bench:       1,788 ns/iter (+/- 26) = 1342 MB/s
test from_utf8_4_bytes_regular   ... bench:       1,907 ns/iter (+/- 181) = 1258 MB/s

test from_utf8_all_bytes_fast    ... bench:       3,463 ns/iter (+/- 97) = 1155 MB/s
test from_utf8_all_bytes_regular ... bench:       4,083 ns/iter (+/- 89) = 979 MB/s

test from_utf8_ascii_fast        ... bench:          88 ns/iter (+/- 4) = 28988 MB/s
test from_utf8_ascii_regular     ... bench:          88 ns/iter (+/- 8) = 28988 MB/s

test from_utf8_cyr_fast          ... bench:       7,707 ns/iter (+/- 531) = 665 MB/s
test from_utf8_cyr_regular       ... bench:       8,202 ns/iter (+/- 135) = 625 MB/s

test from_utf8_enwik8_fast       ... bench:   1,135,756 ns/iter (+/- 84,450) = 8804 MB/s
test from_utf8_enwik8_regular    ... bench:   1,145,468 ns/iter (+/- 79,601) = 8730 MB/s

test from_utf8_jawik10_fast      ... bench:  12,723,844 ns/iter (+/- 473,247) = 785 MB/s
test from_utf8_jawik10_regular   ... bench:  13,384,596 ns/iter (+/- 666,997) = 747 MB/s

test from_utf8_mixed_fast        ... bench:       2,321 ns/iter (+/- 123) = 2081 MB/s
test from_utf8_mixed_regular     ... bench:       2,702 ns/iter (+/- 408) = 1788 MB/s

test from_utf8_mostlyasc_fast    ... bench:         249 ns/iter (+/- 10) = 14666 MB/s
test from_utf8_mostlyasc_regular ... bench:         276 ns/iter (+/- 5) = 13231 MB/s
```
@bors
Copy link
Contributor

bors commented Jul 17, 2019

☀️ Test successful - checks-azure
Approved by: BurntSushi
Pushing e34d4ae to master...

@bors bors added the merged-by-bors This PR was explicitly merged by bors. label Jul 17, 2019
@bors bors merged commit 3b6e8ed into rust-lang:master Jul 17, 2019
@jridgewell jridgewell deleted the pointer-alignment branch July 17, 2019 16:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
merged-by-bors This PR was explicitly merged by bors. S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion.
Projects
None yet
Development

Successfully merging this pull request may close these issues.