
Conversation

@nevans (Collaborator) commented Nov 22, 2025

Not duplicating the data in `@tuples` and `@string` saves memory. For large sequence sets, the memory savings can be substantial.

But this is a tradeoff: it saves time when the string is not used, but uses more time when the string is used more than once. Working with set operations can create many ephemeral sets, so avoiding unintentional string generation can save a lot of time.
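
To make the tradeoff concrete, here is a minimal sketch, assuming the documented `Net::IMAP::SequenceSet` API (`SequenceSet::[]`, `#-`, and `#string`); the comments restate this PR's intended caching behavior rather than verified output:

```ruby
require "net/imap"

SequenceSet = Net::IMAP::SequenceSet

# An ephemeral set produced by a set operation is already normalized, so
# (with this change) no string copy is stored alongside the internal tuples.
working = SequenceSet["1:1000"] - SequenceSet["250:300"]

# Each call regenerates the string from the tuples; callers that need the
# string many times can cache it themselves.
first_use  = working.string
second_use = working.string
```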

Also, by quickly scanning the entries after a string is parsed, we can bypass the merge algorithm for normalized strings. But this does cause a small penalty for non-normalized strings.
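
As a hypothetical illustration of that scan (the tuple layout and helper name are assumptions, not the library's internals), checking whether parsed entries are already in normal form is a single linear pass:

```ruby
# Hypothetical sketch: `tuples` is an array of [first, last] pairs as produced
# by the parser ("*" handling is omitted). The input was already normalized if
# every pair is ordered and each pair begins at least 2 past the end of the
# previous one (adjacent or overlapping entries would need coalescing, so they
# are not normal).
def parsed_tuples_normalized?(tuples)
  tuples.all? { |first, last| first <= last } &&
    tuples.each_cons(2).all? { |(_, prev_last), (first, _)| prev_last + 1 < first }
end

parsed_tuples_normalized?([[1, 3], [5, 9], [20, 20]]) # => true
parsed_tuples_normalized?([[1, 3], [4, 9]])           # => false ("1:3,4:9" should be "1:9")
```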

**Please note:** It _is still possible_ to create a memoized string on a normalized SequenceSet with `#append`. For example: create a monotonically sorted SequenceSet with a non-normal final entry, then call `#append` with an adjacently following entry. `#append` coalesces the final entry and converts it into normal form, but doesn't check whether the _preceding entries_ of the SequenceSet are normalized.
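
A hypothetical illustration of that caveat (the exact input strings and the memoization outcome are assumptions based on the description above, not verified behavior):

```ruby
# Monotonically sorted input whose final entry "6:5" is not in normal form,
# so the original string is memoized when it is parsed (assumption).
set = Net::IMAP::SequenceSet.new("1,3,6:5")

# Appending the adjacent value 7 coalesces the final entry into normal form
# ("5:7"), but #append doesn't re-check the preceding entries, so the set can
# end up fully normalized while still carrying a memoized string.
set.append(7)
set.string
```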

Benchmarks

Results from benchmarks/sequence_set-normalize.yml

There is still room for improvement here, because `#normalize` generates the normalized string for comparison rather than just reparsing the string.

```
                           normal
               local:     19938.9 i/s
             v0.5.12:      2988.7 i/s - 6.67x  slower

                frozen and normal
               local:  17011413.5 i/s
             v0.5.12:      3574.4 i/s - 4759.30x  slower

                         unsorted
               local:     19434.9 i/s
             v0.5.12:      2957.5 i/s - 6.57x  slower

                         abnormal
               local:     19835.9 i/s
             v0.5.12:      3037.1 i/s - 6.53x  slower
```

Results from benchmarks/sequence_set-new.yml

Note that this benchmark doesn't use `SequenceSet::new`; it uses `SequenceSet::[]`, which freezes the result. In this case, the benchmark result differences are mostly driven by improved performance of `#freeze`.
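
For reference, the constructor difference the note refers to (a usage sketch; the freezing behavior of `SequenceSet::[]` is as described above):

```ruby
require "net/imap"

# SequenceSet::[] validates and freezes the result, which is what this
# benchmark constructs; SequenceSet::new returns a mutable set that can be
# frozen separately.
frozen_set  = Net::IMAP::SequenceSet["1:100"]
mutable_set = Net::IMAP::SequenceSet.new("1:100")

frozen_set.frozen?  # => true
mutable_set.frozen? # => false
```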

```
             n=     10   ints   (sorted)
                      local:    118753.9 i/s
                    v0.5.12:     85411.4 i/s - 1.39x  slower

             n=     10 string   (sorted)
                    v0.5.12:    123087.2 i/s
                      local:    122746.3 i/s - 1.00x  slower

             n=     10   ints (shuffled)
                      local:    105919.2 i/s
                    v0.5.12:     79294.5 i/s - 1.34x  slower

             n=     10 string (shuffled)
                    v0.5.12:    114826.6 i/s
                      local:    108086.2 i/s - 1.06x  slower

             n=    100   ints   (sorted)
                      local:     16418.4 i/s
                    v0.5.12:     11864.2 i/s - 1.38x  slower

             n=    100 string   (sorted)
                      local:     18161.7 i/s
                    v0.5.12:     15219.3 i/s - 1.19x  slower

             n=    100   ints (shuffled)
                      local:     16640.1 i/s
                    v0.5.12:     11815.8 i/s - 1.41x  slower

             n=    100 string (shuffled)
                    v0.5.12:     14755.8 i/s
                      local:     14512.8 i/s - 1.02x  slower

             n=  1,000   ints   (sorted)
                      local:      1722.2 i/s
                    v0.5.12:      1229.0 i/s - 1.40x  slower

             n=  1,000 string   (sorted)
                      local:      1862.1 i/s
                    v0.5.12:      1543.2 i/s - 1.21x  slower

             n=  1,000   ints (shuffled)
                      local:      1684.9 i/s
                    v0.5.12:      1252.3 i/s - 1.35x  slower

             n=  1,000 string (shuffled)
                    v0.5.12:      1467.3 i/s
                      local:      1424.6 i/s - 1.03x  slower

             n= 10,000   ints   (sorted)
                      local:       158.1 i/s
                    v0.5.12:       127.9 i/s - 1.24x  slower

             n= 10,000 string   (sorted)
                      local:       187.7 i/s
                    v0.5.12:       143.4 i/s - 1.31x  slower

             n= 10,000   ints (shuffled)
                      local:       145.8 i/s
                    v0.5.12:       114.5 i/s - 1.27x  slower

             n= 10,000 string (shuffled)
                    v0.5.12:       138.4 i/s
                      local:       136.9 i/s - 1.01x  slower

             n=100,000   ints   (sorted)
                      local:        14.9 i/s
                    v0.5.12:        10.6 i/s - 1.40x  slower

             n=100,000 string   (sorted)
                      local:        19.2 i/s
                     v0.5.12:        14.0 i/s - 1.37x  slower
```

The new code is ~1-6% slower for shuffled strings, but ~30-40% faster for sorted sets (note that unsorted non-string inputs create a sorted set).

@nevans changed the title from "⚡️ Don't store SequenceSet#string when normalized" to "⚡️ Don't memoize SequenceSet#string on normalized sets" on Nov 22, 2025
@nevans force-pushed the sequence_set/drop-normalized-string branch 4 times, most recently from 20ac793 to 2baf04d on November 24, 2025 22:32
📚 Update SequenceSet#normalize rdoc
@nevans force-pushed the sequence_set/drop-normalized-string branch from 2baf04d to 8ed52bf on November 25, 2025 14:54