Faster String#squish #1159

Merged
merged 3 commits into from
May 30, 2020

Conversation

jamescook
Contributor

Purpose

Resolves #1157

Description

Optimizes String#squish for both ASCII and Unicode text. There is some duplication, unavoidable as far as I can tell, to optimize the ASCII-only case. The change could be simplified to use only Char#whitespace? (which works for both ASCII and Unicode), but that is slower for ASCII-only strings. Even so, it is still significantly faster than using a regular expression.
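
For readers who want a picture of the "simplified" variant described above, here is a minimal sketch of a single-pass squish built on Char#whitespace?. The method name `squish_sketch` and its structure are assumptions made for illustration; this is not the code merged in this PR.

```crystal
# Hypothetical helper for illustration only; not the implementation from this PR.
# Collapses each run of whitespace into one space and drops leading/trailing runs.
def squish_sketch(str : String) : String
  String.build(str.bytesize) do |io|
    pending_space = false # a whitespace run was seen after some output
    wrote_any = false     # at least one non-whitespace char has been written
    str.each_char do |char|
      if char.whitespace?
        # Only remember the run if something precedes it (skips leading whitespace).
        pending_space = wrote_any
      else
        # Emit a single separator for the whole preceding run, then the char.
        io << ' ' if pending_space
        io << char
        pending_space = false
        wrote_any = true
      end
    end
    # A trailing run leaves pending_space set but is never written out.
  end
end

puts squish_sketch(" foo bar \n \t boo")
```

An ASCII-only fast path along the lines the description mentions would presumably swap `whitespace?` for `ascii_whitespace?` on a separate code path, which is where the duplication comes from.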

I replaced bench.cr with my benchmark per @jwoertink's comment in the GitHub issue. It shows the difference made by the ASCII-only optimization (Char#ascii_whitespace? vs Char#whitespace?). Here are the results from my machine (2014 MacBook):

     squish regex ascii-only whitespace (600 bytes)  42.17k ( 23.71µs) (± 1.57%)  5.92kB/op   8.97× slower
 squish optimized ascii-only whitespace (600 bytes) 378.22k (  2.64µs) (± 0.54%)  1.19kB/op        fastest
squish simplified ascii-only whitespace (600 bytes) 277.88k (  3.60µs) (± 1.24%)  1.19kB/op   1.36× slower

     squish regex w/ unicode whitespace (630 bytes)  63.52k ( 15.74µs) (± 0.95%)  4.48kB/op   3.39× slower
 squish optimized w/ unicode whitespace (630 bytes) 214.08k (  4.67µs) (± 1.42%)  1.41kB/op   1.01× slower
squish simplified w/ unicode whitespace (630 bytes) 215.32k (  4.64µs) (± 0.69%)  1.41kB/op        fastest
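
A comparison of this shape can be sketched with Crystal's Benchmark.ips. Everything named below (`squish_regex`, `squish_scan`, `SAMPLE`, and the `/\s+/` pattern) is a stand-in assumed for illustration; the actual bench.cr and the original regex are not reproduced here, so the numbers will not match the tables above.

```crystal
require "benchmark"

# Hypothetical regex-based squish; the pattern is an assumption, not the original.
def squish_regex(str : String) : String
  str.gsub(/\s+/, " ").strip
end

# Hypothetical character-scanning squish restricted to ASCII whitespace.
def squish_scan(str : String) : String
  String.build(str.bytesize) do |io|
    pending_space = false
    wrote_any = false
    str.each_char do |char|
      if char.ascii_whitespace?
        pending_space = wrote_any
      else
        io << ' ' if pending_space
        io << char
        pending_space = false
        wrote_any = true
      end
    end
  end
end

# ASCII-only sample input, repeated to get a string of a few hundred bytes.
SAMPLE = "  foo   bar \n \t boo  " * 30

Benchmark.ips do |x|
  x.report("squish regex") { squish_regex(SAMPLE) }
  x.report("squish scan") { squish_scan(SAMPLE) }
end
```

Building with `crystal run --release` matters for figures like these, since unoptimized builds can distort the comparison.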

Lastly, there appears to be an unintended bugfix for consecutive Unicode whitespace. The original regex doesn't appear to match it (a libpcre issue?). The benchmark will show the difference if you run it locally.

Checklist

  • - An issue already exists detailing the issue/or feature request that this PR fixes
  • - All specs are formatted with crystal tool format spec src
  • - Inline documentation has been added and/or updated
  • - Lucky builds on docker with ./script/setup
  • - All builds and specs pass on docker with ./script/test

  squish original ascii-only whitespace (600 bytes)  42.92k ( 23.30µs) (± 0.55%)  5.92kB/op   8.58× slower
 squish optimized ascii-only whitespace (600 bytes) 368.48k (  2.71µs) (± 3.49%)  1.19kB/op        fastest

  squish original w/ unicode whitespace (60 bytes) 528.12k (  1.89µs) (± 2.31%)  624B/op   3.39× slower
squish optimized w/ unicode whitespace (60 bytes)   1.79M (557.76ns) (± 2.61%)  225B/op        fastest
@@ -2,11 +2,10 @@ require "../spec_helper"

describe "String charm" do
describe "squish" do
it "squishes the text by removing newlines and extra whitespace" do
og_string = " foo bar \n \t boo"
Member

Can we also leave this in? I dig the new spec, but I'm thinking leave both of them in just for the extra safety. What are your thoughts?

end
end

puts "Sanity check the return output is consistent:"
example = " f f\u00A0\u00A0\u00A0f f \n \t \v\v \f\f 11111 a l0* あ\u00A0\u00A0\u00A0 "
Member

🚀

Member

@jwoertink left a comment

Awesome! Thanks so much for adding this.

@jwoertink merged commit d1a2352 into luckyframework:master on May 30, 2020