-
-
Notifications
You must be signed in to change notification settings - Fork 158
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Faster String#squish #1159
Faster String#squish #1159
Conversation
squish original ascii-only whitespace (600 bytes) 42.92k ( 23.30µs) (± 0.55%) 5.92kB/op 8.58× slower squish optimized ascii-only whitespace (600 bytes) 368.48k ( 2.71µs) (± 3.49%) 1.19kB/op fastest squish original w/ unicode whitespace (60 bytes) 528.12k ( 1.89µs) (± 2.31%) 624B/op 3.39× slower squish optimized w/ unicode whitespace (60 bytes) 1.79M (557.76ns) (± 2.61%) 225B/op fastest
spec/charms/string_spec.cr
Outdated
@@ -2,11 +2,10 @@ require "../spec_helper" | |||
|
|||
describe "String charm" do | |||
describe "squish" do | |||
it "squishes the text by removing newlines and extra whitespace" do | |||
og_string = " foo bar \n \t boo" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we also leave this in? I dig the new spec, but I'm thinking leave both of them in just for the extra safety. What are your thoughts?
end | ||
end | ||
|
||
puts "Sanity check the return output is consistent:" | ||
example = " f f\u00A0\u00A0\u00A0f f \n \t \v\v \f\f 11111 a l0* あ\u00A0\u00A0\u00A0 " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🚀
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome! Thanks so much for adding this.
Purpose
Resolves #1157
Description
Optimizes String#squish for both ascii and unicode text. There is some unavoidable (I think) duplication to optimize the ascii-only case. The change could be simplified to only use
Char#whitespace?
(which works for both ascii and unicode) but it's slower for ascii-only strings. However, it's still significantly faster than using a regular expression.I replaced bench.cr with my benchmark per @jwoertink 's comment in the Github issue. It shows the difference using the ascii-only optimization (Char#ascii_whitespace? vs Char#whitespace?). Here are the results from my machine (2014 macbook):
Lastly, there appears to be a unintended bugfix for consecutive unicode whitespace. The original regex doesn't appear to match them (libpcre issue?). The benchmark will show the difference if you run it locally.
Checklist
crystal tool format spec src
./script/setup
./script/test