Use memchr for str::find(char) #46735

Merged
merged 14 commits into from
Jan 1, 2018

Conversation

Manishearth
Member

@Manishearth Manishearth commented Dec 14, 2017

This is roughly a 10x improvement for str::find(char): 593 ns to 57 ns per iteration in the benchmark below.

This also contains the patches from #46713 . Feel free to land both separately or together.

cc @mystor @alexcrichton

r? @bluss

fixes #46693

@Manishearth
Member Author

I haven't really tested this much, so there are probably failures. I'll do a second pass of self-review once I know we pass all tests on Travis.

@Manishearth
Member Author

The memchr crate is even faster because it links to glibc's memchr, which uses SIMD and other fancy stuff. libcore can't link to glibc, so to get these wins we'd have to write a SIMD implementation ourselves.

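#![feature(test)] // nightly-only: enables the unstable `test` crate
extern crate memchr;
extern crate test;

use test::Bencher;

#[bench]
fn find_char(b: &mut Bencher) {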
    let x = test::black_box("Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.");
    b.iter(|| test::black_box(x.find('/')));
}


#[bench]
fn find_char_memchr(b: &mut Bencher) {
    let x = test::black_box("Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.");
    b.iter(|| test::black_box(memchr::memchr(b'/', x.as_bytes())));
}

Before:

running 2 tests
test find_char        ... bench:         593 ns/iter (+/- 201)
test find_char_memchr ... bench:           9 ns/iter (+/- 1)

After:

running 2 tests
test find_char        ... bench:          57 ns/iter (+/- 12)
test find_char_memchr ... bench:           9 ns/iter (+/- 1)
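
As a rough illustration of what a byte search without libc can look like, here is a minimal sketch of a portable word-at-a-time scan. It is not the code in this PR and not glibc's implementation; `memchr_swar` and `contains_zero_byte` are hypothetical names chosen for the example, and a real SIMD version would be considerably more involved.

```rust
const LO: u64 = 0x0101_0101_0101_0101;
const HI: u64 = 0x8080_8080_8080_8080;

/// True if any of the eight bytes in `x` is zero (classic bit trick).
fn contains_zero_byte(x: u64) -> bool {
    x.wrapping_sub(LO) & !x & HI != 0
}

/// Hypothetical word-at-a-time byte search: index of the first `needle`.
fn memchr_swar(needle: u8, haystack: &[u8]) -> Option<usize> {
    // `needle` repeated in every byte of a word; XOR turns matches into zeros.
    let repeated = LO * needle as u64;
    let mut chunks = haystack.chunks_exact(8);
    let mut offset = 0;
    for chunk in &mut chunks {
        let mut buf = [0u8; 8];
        buf.copy_from_slice(chunk);
        let word = u64::from_ne_bytes(buf);
        if contains_zero_byte(word ^ repeated) {
            // Some byte in this word matches; locate it with a short scan.
            if let Some(i) = chunk.iter().position(|&b| b == needle) {
                return Some(offset + i);
            }
        }
        offset += 8;
    }
    // Fewer than eight bytes left: scan them one by one.
    chunks
        .remainder()
        .iter()
        .position(|&b| b == needle)
        .map(|i| offset + i)
}
```

The idea is that most eight-byte words are rejected with a handful of arithmetic operations, so the per-byte comparison only runs on a word that actually contains the needle.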

@Manishearth
Member Author

This does not bring improvements for multibyte chars or for str::find(str). We could improve those too, but it's tricky.

For a str that starts with an ASCII char we can do something similar to what's done here, and then use the original algorithm to finish the match.

When the thing we're searching for is not ASCII we can still search for the first byte. However, for most UTF-8 text the first byte will generally be pretty uniform; e.g. for Arabic text it will usually be 0xD8 or 0xD9, for Korean 0xEA, 0xEB, 0xEC, or 0xED, for Devanagari usually 0xE0, etc. This means that memchr will have lots of false positives; we'll get lots of hits on the first byte and then have to check the second byte. This amount of stutter will probably make memchr's (minor) fixed overhead significant, and destroy any perf gains we might get.

Searching for the second byte or even better, the last byte, might work better. But I'm not sure if I want to write that code right now, and the tradeoffs are a bit trickier there :)
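
To make the last-byte idea above concrete, here is a minimal sketch, assuming a naive stand-in for memchr so the example needs no external crate. `memchr_naive` and `find_char_by_last_byte` are hypothetical names, and this is not necessarily how the PR implements it.

```rust
/// Naive stand-in for a real memchr: index of the first `byte`.
fn memchr_naive(byte: u8, haystack: &[u8]) -> Option<usize> {
    haystack.iter().position(|&b| b == byte)
}

/// Find `needle` by scanning for the LAST byte of its UTF-8 encoding,
/// then checking that the bytes ending there are the whole encoding.
fn find_char_by_last_byte(haystack: &str, needle: char) -> Option<usize> {
    let mut buf = [0u8; 4];
    let encoded = needle.encode_utf8(&mut buf).as_bytes();
    let last = encoded[encoded.len() - 1];
    let bytes = haystack.as_bytes();

    let mut start = 0;
    while let Some(i) = memchr_naive(last, &bytes[start..]) {
        let end = start + i + 1; // one past the candidate last byte
        if end >= encoded.len() && &bytes[end - encoded.len()..end] == encoded {
            return Some(end - encoded.len());
        }
        // False positive: resume the scan right after this candidate.
        start = end;
    }
    None
}
```

Because UTF-8 is self-synchronizing, a byte-for-byte match of the full encoding can only occur at a character boundary, so the returned index is a valid `str` index.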

@kennytm kennytm added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. labels Dec 15, 2017
@Manishearth
Member Author

Manishearth commented Dec 18, 2017

Bench numbers do not materially change with the UTF-8 changes. I did come up with a pathological case: searching a Devanagari string for ä, whose last byte (0xA4) also appears in the encoding of many Devanagari characters. It ends up more than 2x slower (620 ns to 1,672 ns in the numbers below) because every other character is a false-positive hit, entirely negating memchr's win.

I think this pathological case is OK; it will only arise when mixing languages, and only for very specific characters.
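
A small sketch of why this case is pathological: it counts how often the needle's last byte shows up in the haystack, since each occurrence is a candidate that a last-byte scan has to verify and reject. `last_byte_candidates` is a hypothetical helper written for this illustration only.

```rust
/// Count occurrences of the last UTF-8 byte of `needle` in `haystack`.
/// Each one is a candidate that a last-byte memchr would have to verify.
fn last_byte_candidates(haystack: &str, needle: char) -> usize {
    let mut buf = [0u8; 4];
    let encoded = needle.encode_utf8(&mut buf).as_bytes();
    let last = encoded[encoded.len() - 1];
    haystack.bytes().filter(|&b| b == last).count()
}
```

For example, ä encodes as 0xC3 0xA4 and ज as 0xE0 0xA4 0x9C, so `last_byte_candidates("जलद", 'ä')` reports one candidate per character even though ä never occurs in the string.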

I can check some form of these benchmarks into tree if y'all feel it necessary.

$ cargo bench
test find_char                            ... bench:         603 ns/iter (+/- 203)
test find_char_memchr                     ... bench:          10 ns/iter (+/- 3)
test find_multibyte_char_found            ... bench:         376 ns/iter (+/- 67)
test find_multibyte_char_notfound         ... bench:         618 ns/iter (+/- 129)
test find_multibyte_string_multibyte_char ... bench:         719 ns/iter (+/- 137)
test find_multibyte_string_pathological   ... bench:         620 ns/iter (+/- 98)

$ cargo +x-stage2 bench
test find_char                            ... bench:          67 ns/iter (+/- 45)
test find_char_memchr                     ... bench:          10 ns/iter (+/- 1)
test find_multibyte_char_found            ... bench:          50 ns/iter (+/- 12)
test find_multibyte_char_notfound         ... bench:          74 ns/iter (+/- 20)
test find_multibyte_string_multibyte_char ... bench:          74 ns/iter (+/- 20)
test find_multibyte_string_pathological   ... bench:       1,672 ns/iter (+/- 348)

Code:

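#![feature(test)] // nightly-only: enables the unstable `test` crate
extern crate memchr;
extern crate test;

use test::Bencher;

#[bench]
fn find_char(b: &mut Bencher) {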
    let x = test::black_box("Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.");
    b.iter(|| test::black_box(x.find('/')));
}

#[bench]
fn find_char_memchr(b: &mut Bencher) {
    let x = test::black_box("Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.");
    b.iter(|| test::black_box(memchr::memchr(b'/', x.as_bytes())));
}

#[bench]
fn find_multibyte_char_found(b: &mut Bencher) {
    let x = test::black_box("Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, ก remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.");
    b.iter(|| test::black_box(x.find('ก')));
}

#[bench]
fn find_multibyte_char_notfound(b: &mut Bencher) {
    let x = test::black_box("Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.");
    b.iter(|| test::black_box(x.find('ก')));
}

#[bench]
fn find_multibyte_string_multibyte_char(b: &mut Bencher) {
    let x = test::black_box("जलद कोल्हा आळशी कुत्रा वरुन उडी मारली, जलद कोल्हा आळशी कुत्रा वरुन उडी मारली, जलद कोल्हा आळशी कुत्रा वरुन उडी मारली, जलद कोल्हा आळशी कुत्रा वरुन उडी मारली, जलद कोल्हा आळशी कुत्रा वरुन उडी मारली, जलद कोल्हा आळशी कुत्रा वरुन उडी मारली, जलद कोल्हा आळशी कुत्रा वरुन उडी मारली");
    b.iter(|| test::black_box(x.find('ग'))); // not in the string
}

#[bench]
fn find_multibyte_string_pathological(b: &mut Bencher) {
    let x = test::black_box("जलद कोल्हा आळशी कुत्रा वरुन उडी मारली, जलद कोल्हा आळशी कुत्रा वरुन उडी मारली, जलद कोल्हा आळशी कुत्रा वरुन उडी मारली, जलद कोल्हा आळशी कुत्रा वरुन उडी मारली, जलद कोल्हा आळशी कुत्रा वरुन उडी मारली, जलद कोल्हा आळशी कुत्रा वरुन उडी मारली, जलद कोल्हा आळशी कुत्रा वरुन उडी मारली");
    b.iter(|| test::black_box(x.find('ä'))); // ä's last byte is found often in Devanagari text
}

@Manishearth
Member Author

Manishearth commented Dec 18, 2017

If we really care about the pathological case, we can avoid it with a check in the loop that falls back to the regular "loop on next" behavior after X false positives.

I don't think we should, though.

We could also write some monster SSE-enabled memchr that can search for up to 4 byte units. I'm not doing that.
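
For illustration only, the fallback idea could look roughly like the sketch below. The threshold `MAX_FALSE_POSITIVES` and the function name `find_char_with_fallback` are assumptions made for this example, a plain iterator scan stands in for memchr, and none of this is code from the PR.

```rust
/// Hypothetical threshold ("X" in the comment above); the value is arbitrary.
const MAX_FALSE_POSITIVES: usize = 8;

fn find_char_with_fallback(haystack: &str, needle: char) -> Option<usize> {
    let mut buf = [0u8; 4];
    let encoded = needle.encode_utf8(&mut buf).as_bytes();
    let last = encoded[encoded.len() - 1];
    let bytes = haystack.as_bytes();

    let mut start = 0;
    let mut misses = 0;
    while misses < MAX_FALSE_POSITIVES {
        // Stand-in for memchr: next occurrence of the needle's last byte.
        let i = bytes[start..].iter().position(|&b| b == last)?;
        let end = start + i + 1;
        if end >= encoded.len() && &bytes[end - encoded.len()..end] == encoded {
            return Some(end - encoded.len());
        }
        misses += 1;
        start = end;
    }

    // Too many false positives: back up to a char boundary and fall back
    // to a plain char-by-char scan of the remainder.
    while !haystack.is_char_boundary(start) {
        start -= 1;
    }
    haystack[start..]
        .char_indices()
        .find(|&(_, c)| c == needle)
        .map(|(i, _)| start + i)
}
```

Backing up to the previous char boundary (rather than skipping forward) matters: the scan position can sit in the middle of a character whose own encoding repeats the last byte, such as त (0xE0 0xA4 0xA4), and skipping forward would miss it.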

@Manishearth
Member Author

@bors-servo try

@bors
Contributor

bors commented Dec 21, 2017

⌛ Trying commit 9b92a44 with merge afb0c20...

bors added a commit that referenced this pull request Dec 21, 2017
Use memchr for str::find(char)

@bors
Contributor

bors commented Dec 21, 2017

☀️ Test successful - status-travis
State: approved= try=True

@Manishearth
Member Author

Manishearth commented Dec 21, 2017 via email

@nagisa
Member

nagisa commented Jan 1, 2018

cc @Mark-Simulacrum ^

@BurntSushi
Member

BurntSushi commented Jan 1, 2018

When the thing we're searching for is not ASCII we can still search for the first byte. However, for most UTF-8 text the first byte will generally be pretty uniform; e.g. for Arabic text it will usually be 0xD8 or 0xD9, for Korean 0xEA, 0xEB, 0xEC, or 0xED, for Devanagari usually 0xE0, etc. This means that memchr will have lots of false positives; we'll get lots of hits on the first byte and then have to check the second byte. This amount of stutter will probably make memchr's (minor) fixed overhead significant, and destroy any perf gains we might get.

Searching for the second byte or even better, the last byte, might work better. But I'm not sure if I want to write that code right now, and the tradeoffs are a bit trickier there :)

Searching for the last byte is indeed a better heuristic on UTF-8 than searching for the first byte. You'd be in good company (GNU grep does that). But the last byte is still arbitrary. This is why the regex crate ranks every byte by how rare it believes that byte to be. Leading UTF-8 bytes are considered common while trailing bytes aren't. But you also get things like "z is rarer than a," which it commonly is. So the memchr is applied to the rarest byte in the pattern. Of course, you still wind up with pathological cases when the frequency rank doesn't match the corpus, but this will always be true when using memchr without analyzing the haystack beforehand (which obviously doesn't make sense in this specific domain of text search). That code is here: https://github.com/rust-lang/regex/blob/9c790659c4e83e3497c6f2d14a818b3a69654d5f/src/literals.rs#L379-L514

(To be clear, I think the frequency rank stuff is probably overkill for searching a single char and would probably just stick to the last byte. Different story if you tackled str::find(str) though. Do we really not already use memchr in str::find(str) though?)
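
A toy sketch of the rarest-byte heuristic described above, for a substring search. The ranking in `assumed_frequency` is invented for this example and is not the regex crate's table; `rarest_byte` and `find_with_rare_byte` are likewise hypothetical names, and an iterator scan stands in for memchr.

```rust
/// Made-up rank: higher means "assumed more common". Only encodes the
/// rough ordering from the comment above, not any real frequency data.
fn assumed_frequency(byte: u8) -> u8 {
    match byte {
        b'z' | b'q' | b'x' | b'j' => 20,   // letters assumed rare
        b' ' | b'e' | b't' | b'a' => 255,  // assumed very common
        b'a'..=b'z' => 180,
        0xC2..=0xF4 => 240,                // UTF-8 leading bytes: common
        0x80..=0xBF => 100,                // continuation bytes: rarer
        _ => 150,
    }
}

/// Offset and value of the byte in `pattern` assumed to be rarest.
fn rarest_byte(pattern: &[u8]) -> Option<(usize, u8)> {
    pattern
        .iter()
        .copied()
        .enumerate()
        .min_by_key(|&(_, b)| assumed_frequency(b))
}

/// Substring search that scans for the rarest byte, then verifies.
fn find_with_rare_byte(haystack: &str, pattern: &str) -> Option<usize> {
    let (hay, pat) = (haystack.as_bytes(), pattern.as_bytes());
    let (offset, rare) = match rarest_byte(pat) {
        Some(found) => found,
        None => return Some(0), // empty pattern matches at the start
    };
    // Inside a match the rare byte sits at `offset`, so start there.
    let mut from = offset;
    while from < hay.len() {
        let hit = from + hay[from..].iter().position(|&b| b == rare)?;
        let start = hit - offset;
        if hay[start..].starts_with(pat) {
            return Some(start);
        }
        from = hit + 1;
    }
    None
}
```

With this ranking, `rarest_byte(b"pizza")` picks the first `z`, so the scan skips over the common letters and only verifies at positions where a `z` appears.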

#[inline]
fn next(&mut self) -> SearchStep {
    let old_finger = self.finger;
    let slice = unsafe { self.haystack.get_unchecked(old_finger..self.haystack.len()) };
Member

Do the various bounds check elisions actually help here? I've tried eliding them in my own substring search algorithms, with variable success.

Member Author

I don't think they do, but I haven't checked and it seemed pretty easy to keep that invariant. I can check if you want.

Member

My general position has been to not elide bounds checks unless I'm pretty sure that it matters. If it were me, I'd remove the unsafe. :)
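
For reference, the two options being weighed look roughly like this in isolation. `tail_checked` and `tail_unchecked` are hypothetical helpers for this sketch, not functions from the PR; whether the unchecked form is measurably faster would have to be benchmarked.

```rust
/// Safe version: panics if `finger` is out of bounds or not on a char boundary.
fn tail_checked(haystack: &str, finger: usize) -> &str {
    &haystack[finger..]
}

/// Unchecked version, as in the snippet under review.
///
/// # Safety
/// `finger` must be at most `haystack.len()` and on a char boundary.
unsafe fn tail_unchecked(haystack: &str, finger: usize) -> &str {
    unsafe { haystack.get_unchecked(finger..) }
}
```

The safe form costs a bounds and boundary check per call; the unsafe form shifts that obligation onto the caller's invariant.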

Member Author

I'll do some checking later this week, for now I'll land it.

@BurntSushi
Member

This LGTM! Nice work @Manishearth :-)

@rust-lang rust-lang deleted a comment from BubbaSheen Jan 1, 2018
@Manishearth
Member Author

You'd be in good company (GNU grep does that).

yay :)

To be clear, I think the frequency rank stuff is probably overkill for searching a single char and would probably just stick to the last byte.

phew

that sounds trickier to get right 😄

Do we really not already use memchr in str::find(str) though?

Yeah, we do an interesting but non-memchry algorithm. I considered retrofitting the existing memchr'd .find(char) into .find(str), but that would mean losing the existing algorithm, which makes the wins iffier (not to mention that memchr wins very little if you're stuttering the algorithm all the time, which is far likelier with a .find(str) built on top of .find(char)).

This LGTM! Nice work @Manishearth :-)

Can this be landed r=you? I've made a small mistake which I need to rectify; aside from that it seems basically ready. Or should we wait for a second review?

@BurntSushi
Member

@Manishearth Yeah r=me sounds great.

@Mark-Simulacrum
Member

Perf queued; in the future please ping me directly.

@Manishearth
Member Author

@bors r=burntsushi

@bors
Contributor

bors commented Jan 1, 2018

📌 Commit 5cf5516 has been approved by burntsushi

@bors
Contributor

bors commented Jan 1, 2018

⌛ Testing commit 5cf5516 with merge b65f0be...

bors added a commit that referenced this pull request Jan 1, 2018
Use memchr for str::find(char)

@bors
Contributor

bors commented Jan 1, 2018

☀️ Test successful - status-appveyor, status-travis
Approved by: burntsushi
Pushing b65f0be to master...

@bors bors merged commit 5cf5516 into rust-lang:master Jan 1, 2018
rep
}

/// Return the first index matching the byte `a` in `text`.

a is meant to be x?

Member Author

yeah, fixing

Member Author

someone fixed it already

Successfully merging this pull request may close these issues.

str::find(char) is slower than it ought ot be
8 participants