Faster String#squish #1159

jamescook · 2020-05-29T11:37:43Z

Purpose

Resolves #1157

Description

Optimizes String#squish for both ASCII and Unicode text. There is some (I think unavoidable) duplication to optimize the ASCII-only case. The change could be simplified to use only Char#whitespace?, which works for both ASCII and Unicode, but that is slower for ASCII-only strings. Even so, the simplified version is still significantly faster than using a regular expression.
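To make the trade-off described above concrete, here is a minimal sketch of a character scan with a separate ASCII-only branch. It is not the code merged in this PR: the name `squish_sketch` and the two flag variables are hypothetical, and the only assumption carried over from the description is that Char#ascii_whitespace? is the cheaper check and Char#whitespace? the general one.

```crystal
# Hypothetical sketch, not the code merged in this PR.
def squish_sketch(str : String) : String
  String.build do |io|
    pending_space = false # whitespace seen since the last written character
    wrote_any = false     # stays false until the first non-whitespace character

    if str.ascii_only?
      # ASCII fast path: each byte is one character, so the cheaper
      # Char#ascii_whitespace? check is enough.
      str.each_byte do |byte|
        char = byte.unsafe_chr
        if char.ascii_whitespace?
          pending_space = true
        else
          io << ' ' if pending_space && wrote_any
          io << char
          pending_space = false
          wrote_any = true
        end
      end
    else
      # General path: Char#whitespace? also recognizes Unicode whitespace.
      str.each_char do |char|
        if char.whitespace?
          pending_space = true
        else
          io << ' ' if pending_space && wrote_any
          io << char
          pending_space = false
          wrote_any = true
        end
      end
    end
  end
end

puts squish_sketch(" foo   bar    \n   \t   boo") # prints: foo bar boo
```

The two branches differ only in the iteration method and the whitespace predicate, which is the kind of duplication the description mentions. Dropping the first branch would give the "simplified" variant.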

I replaced bench.cr with my benchmark per @jwoertink's comment in the GitHub issue. It shows the difference made by the ASCII-only optimization (Char#ascii_whitespace? vs Char#whitespace?). Here are the results from my machine (2014 MacBook):

     squish regex ascii-only whitespace (600 bytes)  42.17k ( 23.71µs) (± 1.57%)  5.92kB/op   8.97× slower
 squish optimized ascii-only whitespace (600 bytes) 378.22k (  2.64µs) (± 0.54%)  1.19kB/op        fastest
squish simplified ascii-only whitespace (600 bytes) 277.88k (  3.60µs) (± 1.24%)  1.19kB/op   1.36× slower

     squish regex w/ unicode whitespace (630 bytes)  63.52k ( 15.74µs) (± 0.95%)  4.48kB/op   3.39× slower
 squish optimized w/ unicode whitespace (630 bytes) 214.08k (  4.67µs) (± 1.42%)  1.41kB/op   1.01× slower
squish simplified w/ unicode whitespace (630 bytes) 215.32k (  4.64µs) (± 0.69%)  1.41kB/op        fastest
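For readers who want to produce a comparison of this shape themselves, the following is a rough sketch using Crystal's standard Benchmark.ips. It does not reproduce the PR's bench.cr. The helpers `squish_regex` and `squish_split` and the input string are hypothetical stand-ins, and the `/\s+/` pattern is an assumption rather than the regex Lucky originally used.

```crystal
require "benchmark"

# Hypothetical stand-in for a regex-based squish (the pattern is an assumption).
def squish_regex(str : String) : String
  str.gsub(/\s+/, " ").strip
end

# Hypothetical stand-in for a non-regex squish; String#split with no
# arguments splits on runs of whitespace.
def squish_split(str : String) : String
  str.split.join(" ")
end

# Hypothetical ASCII-only input, not the text used in the PR's benchmark.
ascii_text = " foo   bar    \n   \t   boo " * 20

# Shorter warmup and measurement windows than the defaults, to keep the run brief.
Benchmark.ips(warmup: 1.second, calculation: 2.seconds) do |x|
  x.report("squish regex ascii-only") { squish_regex(ascii_text) }
  x.report("squish split ascii-only") { squish_split(ascii_text) }
end
```

Benchmark.ips reports iterations per second, mean time per iteration, relative deviation, memory per operation, and a comparison against the fastest entry, which matches the columns shown above.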

Lastly, there appears to be an unintended bug fix for consecutive Unicode whitespace. The original regex doesn't appear to match it (libpcre issue?). The benchmark will show the difference if you run it locally.

Checklist

- An issue already exists detailing the issue/or feature request that this PR fixes
- All specs are formatted with crystal tool format spec src
- Inline documentation has been added and/or updated
- Lucky builds on docker with ./script/setup
- All builds and specs pass on docker with ./script/test

 squish original ascii-only whitespace (600 bytes)  42.92k ( 23.30µs) (± 0.55%)  5.92kB/op   8.58× slower
squish optimized ascii-only whitespace (600 bytes) 368.48k (  2.71µs) (± 3.49%)  1.19kB/op        fastest

 squish original w/ unicode whitespace (60 bytes) 528.12k (  1.89µs) (± 2.31%)    624B/op   3.39× slower
squish optimized w/ unicode whitespace (60 bytes)   1.79M (557.76ns) (± 2.61%)    225B/op        fastest

jwoertink · 2020-05-29T16:28:41Z

spec/charms/string_spec.cr

@@ -2,11 +2,10 @@ require "../spec_helper"

 describe "String charm" do
  describe "squish" do
-    it "squishes the text by removing newlines and extra whitespace" do
-      og_string = " foo   bar    \n   \t   boo"


Can we also leave this in? I dig the new spec, but I'm thinking leave both of them in just for the extra safety. What are your thoughts?

jwoertink · 2020-05-29T16:29:40Z

bench.cr

  end
 end
+
+puts "Sanity check the return output is consistent:"
+example = " f f\u00A0\u00A0\u00A0f f \n \t \v\v \f\f 11111  a l0* あ\u00A0\u00A0\u00A0 "


jwoertink

Awesome! Thanks so much for adding this.

jamescook added 2 commits May 29, 2020 06:14

Replace existing benchmark with one for String#squish per comment in

2560b10

jwoertink reviewed May 29, 2020

Include original spec for String#squish

f163f5e

jwoertink approved these changes May 30, 2020

jwoertink merged commit d1a2352 into luckyframework:master May 30, 2020
