✨ Improve SequenceSet with Set, Range, Enumerable methods #239

Merged
merged 1 commit into from
Dec 11, 2023

Conversation

nevans
Collaborator

@nevans nevans commented Nov 25, 2023

The version of SequenceSet in net-imap prior to this PR is merely a placeholder, needed in order to complete tagged-ext for #225.

This updates it with a full API, inspired by Set, Range, and Array. This allows it to be more broadly useful, e.g. for storing and working with mailbox state. A better API for working with sequence-set is also a prerequisite for properly supporting ESEARCH (#115, #121). NOTE: The API added here is documented as experimental, but I'd like to remove that label when v0.5.0 is released.

In addition to Integer, Range, and enumerables, any object with #to_sequence_set can now be used to create a sequence set. For compatibility with MessageSet, ThreadMember#to_sequence_set collects all child seqno into a SequenceSet.
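A rough illustration of the duck-typed coercion described above. `TinySeqSet` and `ThreadNode` are hypothetical stand-ins written for this sketch, not net-imap's `SequenceSet` or `ThreadMember`, and having a node include its own seqno alongside its children's is an assumption.

```ruby
# Hypothetical stand-ins, not net-imap's classes: a tiny set type and a
# duck-typed coercion helper in the spirit of ::try_convert.
class TinySeqSet
  attr_reader :numbers

  def initialize(numbers)
    @numbers = numbers.sort.uniq.freeze
  end

  # A set converts to itself.
  def to_sequence_set
    self
  end

  # Returns a TinySeqSet when obj knows how to convert itself, else nil.
  def self.try_convert(obj)
    return obj if obj.is_a?(self)
    return nil unless obj.respond_to?(:to_sequence_set)

    converted = obj.to_sequence_set
    return converted if converted.is_a?(self)

    raise TypeError, "#{obj.class}#to_sequence_set returned #{converted.class}"
  end
end

# Hypothetical thread node. Assumption: it collects its own seqno plus the
# seqnos of every descendant.
ThreadNode = Struct.new(:seqno, :children) do
  def to_sequence_set
    TinySeqSet.new(all_seqnos)
  end

  def all_seqnos
    [seqno, *children.flat_map(&:all_seqnos)]
  end
end

thread = ThreadNode.new(3, [
  ThreadNode.new(7, []),
  ThreadNode.new(5, [ThreadNode.new(4, [])]),
])
set = TinySeqSet.try_convert(thread)
```

Here `set.numbers` holds every seqno in the thread, sorted, while an object without `#to_sequence_set` simply yields `nil`.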

Because mailbox state can be very large, inputs are stored in an internal sorted array of ranges. These are stored as [start, stop] tuples, not Range objects, for simpler manipulation. A future optimization could convert all tuples to a flat one-dimensional Array (to reduce object allocations). Storing the data in sorted range tuples allows many of the important operations to be O(lg n).
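To make the O(lg n) claim concrete, here is a minimal sketch of a membership test over sorted, non-overlapping `[start, stop]` tuples using `Array#bsearch`. The helper names and sample data are hypothetical; this is not the code in the PR.

```ruby
# Hypothetical sketch, not net-imap's implementation: sorted, non-overlapping
# [start, stop] tuples and an O(lg n) membership test.
SAMPLE_TUPLES = [[1, 5], [9, 9], [20, 40], [100, 250]].freeze

# Find-minimum bsearch: the block must be false for earlier tuples and true
# for later ones, which holds because the tuples are sorted and disjoint.
def covering_tuple(tuples, number)
  tuple = tuples.bsearch { |t| number <= t.last }
  tuple if tuple && tuple.first <= number
end

def include_number?(tuples, number)
  !covering_tuple(tuples, number).nil?
end
```

The binary search finds the first tuple whose `stop` is at or past the number; the number is a member only if that tuple's `start` is not beyond it.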

Although updates do use Array#insert and Array#slice!—which are technically O(n)—they tend to be fast until the number of elements is very large. Count and index-based methods are also O(n). A future optimization could cache the count and compose larger sets from a balanced tree of smaller sets, to preserve O(lg n) for most operations.
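A sketch of how such an update might look, assuming the same sorted tuple layout: the insertion point is located with a binary search, but `Array#slice!` and `Array#insert` still shift elements, and counting walks every tuple. `add_range!` and `count_numbers` are hypothetical names, not methods from the PR.

```ruby
# Hypothetical sketch: add a range to sorted, disjoint [start, stop] tuples,
# merging any tuples that overlap or are adjacent to it.
def add_range!(tuples, start, stop)
  # Index of the first tuple that overlaps or touches the new range.
  lower = tuples.bsearch_index { |t| start - 1 <= t.last } || tuples.size
  # Index of the first tuple that lies entirely beyond the new range.
  upper = tuples.bsearch_index { |t| stop + 1 < t.first } || tuples.size

  merged = tuples.slice!(lower, upper - lower) # O(n): shifts later elements
  unless merged.empty?
    start = [start, merged.first.first].min
    stop  = [stop, merged.last.last].max
  end
  tuples.insert(lower, [start, stop])          # O(n): shifts later elements
  tuples
end

# Counting numbers visits every tuple, so it is O(n) in the number of tuples.
def count_numbers(tuples)
  tuples.sum { |t| t.last - t.first + 1 }
end
```

Both searches are O(lg n); only the array surgery is linear, which is consistent with updates staying fast until the tuple array becomes very large.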

SequenceSet will be used to replace MessageSet (which is used internally to validate, format, and send certain command args). Some notable differences between the two:

  • Most validation is done up-front, when initializing or adding values.
  • A ThreadMember to sequence-set bug has been fixed.
  • The generated string is sorted and adjacent ranges are combined.
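The first and last differences can be illustrated with a small sketch that validates its inputs immediately and then emits a sorted string with adjacent ranges combined. `normalized_string` is a hypothetical helper, not part of net-imap; it assumes inclusive Integer ranges and does not handle `*`.

```ruby
# Hypothetical helper, not net-imap's API: build a normalized sequence-set
# string from integers and inclusive ranges given in any order.
def normalized_string(*inputs)
  tuples = inputs.map { |i| i.is_a?(Range) ? [i.begin, i.end] : [i, i] }

  # Validate up-front, before any formatting happens.
  tuples.each do |start, stop|
    valid = start.is_a?(Integer) && stop.is_a?(Integer) &&
            start >= 1 && start <= stop
    raise ArgumentError, "invalid sequence range #{start}:#{stop}" unless valid
  end

  # Sort, then fold overlapping or adjacent tuples together.
  merged = tuples.sort.each_with_object([]) do |(start, stop), acc|
    if acc.any? && start <= acc.last.last + 1
      acc.last[1] = [acc.last.last, stop].max
    else
      acc << [start, stop]
    end
  end

  merged.map { |start, stop| start == stop ? start.to_s : "#{start}:#{stop}" }
        .join(",")
end

normalized_string(9, 2..4, 1, 5..6, 12..13) # => "1:6,9,12:13"
```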

TODO list:

  • Documentation
    • Many basic improvements
    • The API is big: add a "What's here?" TOC
    • Identify unstable API: either split into future PRs or document as "experimental".
      For now, everything except for ::[] and #valid_string is marked as "experimental". I don't expect any big changes though.
  • Improve test suite
  • Add many many useful methods
    • Creation
      • ::[] to create a validated immutable sequence set
      • ::try_convert to safely coerce other objects into SequenceSet
      • SequenceSet#to_sequence_set -- itself
    • (Undocumented) Compatibility with MessageSet
      • #validate
      • #send_data
      • ThreadMember#to_sequence_set
    • Equality and comparison predicates
      • #==
      • #eql? and #hash
      • #===
      • #cover?
      • #include?, #member?
      • #include_star?
      • #intersect?
      • #disjoint?
      • #empty?
      • #valid? -- i.e., not empty
      • #full?
    • Mutable Set API
      • #replace
      • #complement!
      • #add and #<<
      • #add?
      • #merge
      • #clear
      • #delete
      • #delete?
      • #subtract
      • #limit!
    • Immutable Set API
      • #freeze
      • lhs + rhs, lhs | rhs (#union)
      • lhs - rhs (#difference)
      • lhs & rhs (#intersection)
      • lhs ^ rhs (#xor)
      • ~set (#complement)
      • #limit
    • Iteration
      • #elements, #to_a
        • #each_element
      • #ranges
        • #each_range
      • #numbers
        • #each_number
      • #to_set
    • #min, #max, #minmax
    • Index based operations
      • #count
      • #at
      • #[], #slice
        • integer offset (non-negative)
        • negative integer offset
        • range of offsets
        • starting offset and length
      • #slice!: remove a number or numbers at a given index or indices
      • #find_index: get the index of a number in the set
      • #delete_at: remove a number or numbers at a given index or indices
    • string methods
      • #valid_string - raises an exception when the set is empty
      • #string - is nil when the set is empty
      • #string= - can be set to nil to clear the set
      • #to_s - is empty string when empty
      • #normalize, #normalize!
      • #inspect - special cases for frozen and empty
  • Refactor (simplify) implementation, without sacrificing O(lg n) performance.
    • This could go on indefinitely... I'll just declare this version "good enough".

TODO (later):

  • #pretty_print
  • #find_index_lte, #find_index_gte
  • Completely replace MessageSet.
  • Replace or supplement the UID set implementation in UIDPlusData.

@nevans nevans changed the title ✨ Improve SequenceSet with useful Set, Range, Enumerable (etc) methods ✨ Improve SequenceSet with Set, Range, Enumerable methods Nov 25, 2023
@nevans nevans force-pushed the sequence-set branch 2 times, most recently from 59afa64 to d191357 Compare November 27, 2023 04:13
@nevans nevans force-pushed the sequence-set branch 5 times, most recently from fdb847d to 60664a5 Compare December 11, 2023 03:24
@nevans nevans marked this pull request as ready for review December 11, 2023 03:43
@nevans nevans force-pushed the sequence-set branch 2 times, most recently from 43b900d to 8a7604f Compare December 11, 2023 07:49
The version of SequenceSet in net-imap prior to this commit was merely a
placeholder, needed in order to complete `tagged-ext` for ruby#225.

This updates it with a full API, inspired by Set, Range, and Array.
This allows it to be more broadly useful, e.g. for storing and working
with mailbox state.

In addition to Integer, Range, and enumerables, any object with
`#to_sequence_set` can now be used to create a sequence set.  For
compatibility with MessageSet, `ThreadMember#to_sequence_set` collects
all child seqno into a SequenceSet.

Because mailbox state can be _very_ large, inputs are stored in an
internal sorted array of ranges.  These are stored as `[start, stop]`
tuples, not Range objects, for simpler manipulation.  A future
optimization could convert all tuples to a flat one-dimensional Array
(to reduce object allocations).  Storing the data in sorted range tuples
allows many of the important operations to be `O(lg n)`.

Although updates do use `Array#insert` and `Array#slice!`—which are
technically `O(n)`—they tend to be fast until the number of elements is
very large.  Count and index-based methods are also `O(n)`.  A future
optimization could cache the count and compose larger sets from a sorted
tree of smaller sets, to preserve `O(lg n)` for most operations.

SequenceSet can be used to replace MessageSet (which is used internally
to validate, format, and send certain command args).  Some notable
differences between the two:
* Most validation is done up-front, when initializing or adding values.
* A ThreadMember to `sequence-set` bug has been fixed.
* The generated string is sorted and adjacent ranges are combined.

TODO in future PRs:
* #index_lte => get the index of a number in the set, or if the number
  isn't in the set, the number before it.
* Replace or supplement the UID set implementation in UIDPlusData.
* fully replace MessageSet (probably not before v0.5.0)
@nevans nevans merged commit c0cadb1 into ruby:master Dec 11, 2023
11 checks passed
@nevans nevans deleted the sequence-set branch December 11, 2023 14:53
@nevans nevans requested a review from shugo December 11, 2023 23:41