
Conversation

@nevans (Collaborator) commented Nov 22, 2025

Not duplicating the data in `@tuples` and `@string` saves memory. For large sequence sets, the memory savings can be substantial.

But this is a tradeoff: it saves time when the string is not used, but uses more time when the string is used more than once. Working with set operations can create many ephemeral sets, so avoiding unintentional string generation can save a lot of time.
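
To make the tradeoff concrete, here is a minimal sketch, assuming the documented `Net::IMAP::SequenceSet` API (`SequenceSet::[]`, `#-`, and `#string`); the comments restate this PR's intended caching behavior rather than verified output:

```ruby
require "net/imap"

SequenceSet = Net::IMAP::SequenceSet

# An ephemeral set produced by a set operation is already normalized, so
# (with this change) no string copy is stored alongside the internal tuples.
working = SequenceSet["1:1000"] - SequenceSet["250:300"]

# Each call regenerates the string from the tuples; callers that need the
# string many times can cache it themselves.
first_use  = working.string
second_use = working.string
```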

Also, by quickly scanning the entries after a string is parsed, we can bypass the merge algorithm for normalized strings. But this does cause a small penalty for non-normalized strings.
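
As a hypothetical illustration of that scan (the tuple layout and helper name are assumptions, not the library's internals), checking whether parsed entries are already in normal form is a single linear pass:

```ruby
# Hypothetical sketch: `tuples` is an array of [first, last] pairs as produced
# by the parser ("*" handling is omitted). The input was already normalized if
# every pair is ordered and each pair begins at least 2 past the end of the
# previous one (adjacent or overlapping entries would need coalescing, so they
# are not normal).
def parsed_tuples_normalized?(tuples)
  tuples.all? { |first, last| first <= last } &&
    tuples.each_cons(2).all? { |(_, prev_last), (first, _)| prev_last + 1 < first }
end

parsed_tuples_normalized?([[1, 3], [5, 9], [20, 20]]) # => true
parsed_tuples_normalized?([[1, 3], [4, 9]])           # => false ("1:3,4:9" should be "1:9")
```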

**Please note:** It _is still possible_ to create a memoized string on a normalized SequenceSet with `#append`. For example: create a monotonically sorted SequenceSet with a non-normal final entry, then call `#append` with an adjacently following entry. `#append` coalesces the final entry and converts it into normal form, but doesn't check whether the _preceding entries_ of the SequenceSet are normalized.
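
A hypothetical illustration of that caveat (the exact input strings and the memoization outcome are assumptions based on the description above, not verified behavior):

```ruby
# Monotonically sorted input whose final entry "6:5" is not in normal form,
# so the original string is memoized when it is parsed (assumption).
set = Net::IMAP::SequenceSet.new("1,3,6:5")

# Appending the adjacent value 7 coalesces the final entry into normal form
# ("5:7"), but #append doesn't re-check the preceding entries, so the set can
# end up fully normalized while still carrying a memoized string.
set.append(7)
set.string
```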

Benchmarks

Results from benchmarks/sequence_set-normalize.yml

There is still room for improvement here, because `#normalize` generates the normalized string for comparison rather than just reparsing the string.

```
                           normal
               local:     19938.9 i/s
             v0.5.12:      2988.7 i/s - 6.67x  slower

                frozen and normal
               local:  17011413.5 i/s
             v0.5.12:      3574.4 i/s - 4759.30x  slower

                         unsorted
               local:     19434.9 i/s
             v0.5.12:      2957.5 i/s - 6.57x  slower

                         abnormal
               local:     19835.9 i/s
             v0.5.12:      3037.1 i/s - 6.53x  slower
```

Results from benchmarks/sequence_set-new.yml

Note that this benchmark doesn't use `SequenceSet::new`; it uses `SequenceSet::[]`, which freezes the result. In this case, the benchmark result differences are mostly driven by improved performance of `#freeze`.
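
For reference, the constructor difference the note refers to (a usage sketch; the freezing behavior of `SequenceSet::[]` is as described above):

```ruby
require "net/imap"

# SequenceSet::[] validates and freezes the result, which is what this
# benchmark constructs; SequenceSet::new returns a mutable set that can be
# frozen separately.
frozen_set  = Net::IMAP::SequenceSet["1:100"]
mutable_set = Net::IMAP::SequenceSet.new("1:100")

frozen_set.frozen?  # => true
mutable_set.frozen? # => false
```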

```
             n=     10   ints   (sorted)
                      local:    118753.9 i/s
                    v0.5.12:     85411.4 i/s - 1.39x  slower

             n=     10 string   (sorted)
                    v0.5.12:    123087.2 i/s
                      local:    122746.3 i/s - 1.00x  slower

             n=     10   ints (shuffled)
                      local:    105919.2 i/s
                    v0.5.12:     79294.5 i/s - 1.34x  slower

             n=     10 string (shuffled)
                    v0.5.12:    114826.6 i/s
                      local:    108086.2 i/s - 1.06x  slower

             n=    100   ints   (sorted)
                      local:     16418.4 i/s
                    v0.5.12:     11864.2 i/s - 1.38x  slower

             n=    100 string   (sorted)
                      local:     18161.7 i/s
                    v0.5.12:     15219.3 i/s - 1.19x  slower

             n=    100   ints (shuffled)
                      local:     16640.1 i/s
                    v0.5.12:     11815.8 i/s - 1.41x  slower

             n=    100 string (shuffled)
                    v0.5.12:     14755.8 i/s
                      local:     14512.8 i/s - 1.02x  slower

             n=  1,000   ints   (sorted)
                      local:      1722.2 i/s
                    v0.5.12:      1229.0 i/s - 1.40x  slower

             n=  1,000 string   (sorted)
                      local:      1862.1 i/s
                    v0.5.12:      1543.2 i/s - 1.21x  slower

             n=  1,000   ints (shuffled)
                      local:      1684.9 i/s
                    v0.5.12:      1252.3 i/s - 1.35x  slower

             n=  1,000 string (shuffled)
                    v0.5.12:      1467.3 i/s
                      local:      1424.6 i/s - 1.03x  slower

             n= 10,000   ints   (sorted)
                      local:       158.1 i/s
                    v0.5.12:       127.9 i/s - 1.24x  slower

             n= 10,000 string   (sorted)
                      local:       187.7 i/s
                    v0.5.12:       143.4 i/s - 1.31x  slower

             n= 10,000   ints (shuffled)
                      local:       145.8 i/s
                    v0.5.12:       114.5 i/s - 1.27x  slower

             n= 10,000 string (shuffled)
                    v0.5.12:       138.4 i/s
                      local:       136.9 i/s - 1.01x  slower

             n=100,000   ints   (sorted)
                      local:        14.9 i/s
                    v0.5.12:        10.6 i/s - 1.40x  slower

             n=100,000 string   (sorted)
                      local:        19.2 i/s
                     v0.5.12:        14.0 i/s - 1.37x  slower
```

The new code is ~1-6% slower for shuffled strings, but ~30-40% faster for sorted sets (note that unsorted non-string inputs create a sorted set).

@nevans changed the title from "⚡️ Don't store SequenceSet#string when normalized" to "⚡️ Don't memoize SequenceSet#string on normalized sets" on Nov 22, 2025
@nevans force-pushed the sequence_set/drop-normalized-string branch 4 times, most recently from 20ac793 to 2baf04d on November 24, 2025 22:32
📚 Update SequenceSet#normalize rdoc
@nevans force-pushed the sequence_set/drop-normalized-string branch from 2baf04d to 8ed52bf on November 25, 2025 14:54