Conversation

@jmschonfeld (Contributor) commented Nov 9, 2023

This implements the API described by SE-0270 Add Collection Operations on Noncontiguous Elements. Specifically, this adds the RangeSet and DiscontiguousSlice types and associated APIs to the stdlib.
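For readers unfamiliar with the proposal, a rough usage sketch of the new types (the `indices(where:)` entry point and `RangeSet` subscript below reflect the API surface as landed in this PR; treat the exact names as assumptions):

```swift
var numbers = [10, 20, 30, 40, 50, 60]

// A RangeSet collects the (possibly discontiguous) positions of matches.
let bigIndices = numbers.indices(where: { $0 > 20 })

// Subscripting with a RangeSet yields a DiscontiguousSlice view of the base.
let big = numbers[bigIndices]
print(Array(big)) // [30, 40, 50, 60]
```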

@jmschonfeld (Contributor Author)

@swift-ci please smoke test

1 similar comment

@jmschonfeld (Contributor Author)

@swift-ci please smoke test

@jmschonfeld (Contributor Author)

@swift-ci please smoke test

@jmschonfeld (Contributor Author)

@swift-ci please build toolchain macOS platform

@jmschonfeld (Contributor Author)

@swift-ci please smoke test

3 similar comments

@jmschonfeld (Contributor Author)

@swift-ci please smoke test

@jmschonfeld (Contributor Author)

@swift-ci please smoke test

@jmschonfeld (Contributor Author)

@swift-ci please smoke test

@jmschonfeld (Contributor Author)

@swift-ci please smoke test macOS platform

@jmschonfeld jmschonfeld marked this pull request as ready for review November 28, 2023 20:16
@jmschonfeld jmschonfeld requested review from a team, hborla and xedin as code owners November 28, 2023 20:16
  hasher.combine(element)
  count += 1
}
hasher.combine(count) // discriminator
Contributor
What does including count actually get us here?

jmschonfeld (Contributor Author)
@lorentey please correct me if I'm wrong - I believe we added this while discussing this type and comparing it to existing collection types. I believe this is a discriminator for cases where you may have a Hashable parent type such as the following:

struct Parent<Base: Collection>: Hashable where Base.Element: Hashable {
    let slice1: DiscontiguousSlice<Base>
    let slice2: DiscontiguousSlice<Base>
}

Using the count of the slice as a discriminator ensures that two unequal parents such as Parent(slice1: [1], slice2: [2, 3]) and Parent(slice1: [1, 2], slice2: [3]) produce different hash values, resulting in a better hash function.
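A toy model of the collision being avoided (hypothetical helper names, standalone Swift, not the stdlib implementation):

```swift
// Hash a list of integer slices by feeding raw elements only (no delimiting).
// The two unequal parents then feed the hasher the identical stream 1, 2, 3.
func undelimitedHash(_ slices: [[Int]]) -> Int {
  var hasher = Hasher()
  for slice in slices {
    for element in slice { hasher.combine(element) }
  }
  return hasher.finalize()
}

// Prefixing each slice's contribution with its count keeps the streams distinct.
func delimitedHash(_ slices: [[Int]]) -> Int {
  var hasher = Hasher()
  for slice in slices {
    hasher.combine(slice.count)
    for element in slice { hasher.combine(element) }
  }
  return hasher.finalize()
}

// undelimitedHash([[1], [2, 3]]) == undelimitedHash([[1, 2], [3]]) -- always collides
// delimitedHash([[1], [2, 3]])   != delimitedHash([[1, 2], [3]])   -- distinct streams
```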

@lorentey (Member) Nov 29, 2023
Yes -- variable-sized values must use a self-delimiting hash encoding, to prevent (accidental or deliberate) hash collisions for aggregate types that include such values as components, such as @jmschonfeld's struct Parent.

On a second look, I believe combining the count as the last step isn't quite good enough, though -- we'll need to either swallow the cost of retrieving the count in a separate pass and combining it up front, or we need to insert discriminators before each item and a different delimiter at the end:

public func hash(into hasher: inout Hasher) {
  for element in self {
    hasher.combine(1 as UInt8) // discriminator
    hasher.combine(element)
  }
  hasher.combine(0 as UInt8) // delimiter
}

I think retrieving the count up front is generally going to be cheaper. I'll submit a suggestion implementing this in a separate thread.

The expectation is that we would in theory be able to reconstruct/decode the hashed value if we are given its hash encoding, even if some unrelated bytes got appended at the end of it. (If we want to delimit the encoding using a count, then the count needs to appear first, as it wouldn't be distinguishable from actual elements later.)
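A sketch of the count-first variant being suggested, in the same method-fragment shape as the code above (assumes count is cheap enough to retrieve in a separate pass):

```swift
public func hash(into hasher: inout Hasher) {
  // Combining the count first makes the encoding self-delimiting: a decoder
  // reading front-to-back knows where this value's contribution ends, even
  // if unrelated bytes are appended afterward.
  hasher.combine(count)
  for element in self {
    hasher.combine(element)
  }
}
```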

Contributor
This seems like we're taking it too far; you cannot actually prevent all collisions, only make a reasonable best-effort. I buy the rationale as far as using some delimiter, but it's not at all clear to me that we gain anything by using count instead of an arbitrarily chosen delimiter byte.

Member
Getting hashing right is an important detail, because it has security implications. Playing loose with the rules can lead to exploitable vulnerabilities; we should not let the Standard Library go down that path.

For Set/Dictionary to work as promised, we must avoid allowing repeatable hash collisions. Two distinct pieces of input data must not ever reliably hash to the same value.

When hashing general purpose container types, care must be taken to make sure the encoding remains robust even when concatenated with arbitrary other encodings. Delimiting the hash encoding with a specific bit pattern usually doesn't work in this case, as we cannot prevent the element type from emitting the same pattern. (Unlike with special-purpose collections where the Element type is known, such as String.) The two approaches above (counting elements or combining per-instance discriminators) are the two basic techniques we can choose from.

Hashing the count of items requires two passes in this case; using per-instance discriminators would feed significantly more data to the hasher. Of these two options, the former seems preferable.

Member
(The current implementation is great! It explicitly feeds the count to the hasher before hashing any of the items. Accordingly, this can be marked resolved, unless anyone has an objection.)

@jmschonfeld (Contributor Author)

@swift-ci please smoke test

@lorentey (Member) left a comment
I made lots of notes, but most of them are tiny formatting nits with suggestions.

for range in inversion.ranges {
  result.append(contentsOf: self[range])
}
self = result
@lorentey (Member) Nov 29, 2023
Observation: This implementation will not work for noncopyable elements. The most general RRC.removeAll(where:) implementation shares this problem, though, so I don't think it is a real objection, even if we end up generalizing the existing protocols for noncopyable support.

@jmschonfeld (Contributor Author) Nov 30, 2023
Ah, good point. I'd be happy to update it to something compatible, but I can't think of a good way to remove multiple subranges at once otherwise, because any mutation to the collection will invalidate the remaining subrange indices we have yet to process.

Member
I don't think this is implementable in a reasonable way without self-assignment. The question is: would it be better to reformulate this entry point as a nonmutating func removingSubranges API? That would be a more honest description of how it works, and it would directly translate to a consuming algorithm.

Update: I see we already have a removingSubranges API on Collection, but that returns a DiscontiguousSlice, and it would clash with this one. Bummer! (I can't judge if we need to keep both...)

jmschonfeld (Contributor Author)
Yeah, as you mentioned, the current API under review has both, but the removing variant returns a DiscontiguousSlice in constant time, wrapping the base collection rather than creating a copy and removing elements from it.
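A quick sketch of the behavioral difference between the two entry points (assuming the indices(where:), removingSubranges(_:), and removeSubranges(_:) names under review):

```swift
var numbers = [10, 20, 30, 40, 50]
let smallIndices = numbers.indices(where: { $0 < 30 })

// Nonmutating: a constant-time DiscontiguousSlice view over the base;
// no elements are copied or moved.
let remaining = numbers.removingSubranges(smallIndices)
print(Array(remaining)) // [30, 40, 50]

// Mutating: rebuilds the collection in place from the inverted ranges.
numbers.removeSubranges(smallIndices)
print(numbers) // [30, 40, 50]
```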

@jmschonfeld (Contributor Author)

@swift-ci please smoke test

2 similar comments

@jmschonfeld (Contributor Author)

@swift-ci please smoke test

@jmschonfeld (Contributor Author)

@swift-ci please smoke test

@jmschonfeld (Contributor Author)

@swift-ci please build toolchain macOS Platform

@jmschonfeld (Contributor Author)

@swift-ci please smoke test Linux platform

@jmschonfeld jmschonfeld requested a review from lorentey December 13, 2023 17:39
@lorentey (Member) left a comment
I did another review pass -- this still looks good. I added some new notes, incl. some API-level observations.

Because we aren't entirely sure if we'll be able to keep things opaque forever, I recommend exposing ABI entry points for at least the most important high-level helper functions. (If we do end up wishing to make more things @inlinable later, not having the symbols exposed would be extremely painful.)

@jmschonfeld (Contributor Author)

@swift-ci please smoke test

@jmschonfeld (Contributor Author)

@swift-ci please smoke test macOS platform

@jmschonfeld (Contributor Author)

@swift-ci please smoke test

@jmschonfeld (Contributor Author)

Rebased to pick up the new ABI checker tests and added the symbols to fix the macOS test failure

@jmschonfeld (Contributor Author)

@swift-ci please smoke test

@jmschonfeld (Contributor Author)

@swift-ci please smoke test

@available(SwiftStdlib 5.11, *)
extension DiscontiguousSlice.Index: CustomStringConvertible {
  public var description: String {
    "<base: \(String(reflecting: base)), rangeOffset: \(_rangeOffset)>"
Member
Observation: this will be quite unreadable when the reflection of base produces a structured multi-line string. I'm not sure what we can practically do about that here -- it is a solvable problem, but it requires adding machinery for context-sensitive string embedding that would not be appropriate to implement in this PR.

@lorentey (Member) left a comment
Looks good!

The big caveat is that this is an opaque Collection type, a big deviation from most previous stdlib additions. (The closest existing analogue is probably CollectionDifference.) I have precious little working experience with these, so I can't fully foresee what problems we'll need to prepare for. However, I think the current code does as good a job as can be expected.

My second worry is about test coverage, especially of corner cases. The test-to-code ratio is pretty low -- the 586 lines of new tests are unlikely to cover all of the new functionality. (Especially as these are all concrete test cases.) Without exhaustive tests, we do not know whether this will actually work as intended.

We should probably have some combinatorial tests to validate the consistency of the protocol conformances. Collection is a very tricky protocol, and this feature is adding multiple new implementations of it, some quite subtle.

The standard checkHashable/checkCollection/checkBidirectionalCollection/checkMutableCollection/etc. routines implement some exhaustive checks to try to catch typical implementation issues. They only validate a subset of our semantic requirements, but they are great at checking things we'd not usually think to explicitly test for.

The StdlibUnittest modules also include some test collection implementations (Minimal*Collection) that are intended to exercise hidden corners in collection algorithms and lazy transformations like DiscontiguousSlice. (They implement protocol requirements with no shortcuts (sometimes in the most awkward/unhelpful manner), and they come with extra hooks to e.g. verify that expected calls do occur.)
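To make the corner-case concern concrete, here are a few of the cheap invariant checks worth covering, written as plain assertions (the StdlibUnittest check* harnesses and Minimal*Collection types mentioned above would do this exhaustively in-tree; the entry-point names are assumed from the API under review):

```swift
let numbers = [10, 20, 30, 40]

// An empty RangeSet removes nothing.
assert(Array(numbers.removingSubranges(RangeSet())) == numbers)

// A RangeSet covering every index removes everything.
let all = numbers.indices(where: { _ in true })
assert(Array(numbers.removingSubranges(all)).isEmpty)

// The matched slice and the removing slice partition the base collection.
let twenties = numbers.indices(where: { $0.isMultiple(of: 20) })
assert(Array(numbers[twenties]) + Array(numbers.removingSubranges(twenties))
         == [20, 40, 10, 30])
```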

@jmschonfeld jmschonfeld merged commit 2404013 into swiftlang:main Jan 9, 2024
@jmschonfeld jmschonfeld deleted the rangeset-revival branch January 10, 2024 16:47
Catfish-Man pushed a commit to Catfish-Man/swift that referenced this pull request Jan 19, 2024
…ang#69766)

* Adds RangeSet/DiscontiguousSlice to the stdlib

* Remove redundant DiscontiguousSlice.Index: Comparable conformance

* Attempt to fix embedded build

* Attempt to fix macOS test failures

* Fix Constaints/members.swift failure on linux

* Add exceptions to ABI/source checker to fix macOS tests

* Fix incremental dependency test failure

* Remove inlining/unfreeze implementation for future improvements

* Simplify indices(where:) implementation

* Address review feedback

* Add test for underscored, public slice members

* Address feedback on inlining, hashing, and initializing with unordered arrays

* Fix ABI checker issues

* Remove MutableCollection extension for DiscontiguousSlice

* Make insertion return a discardable Bool

* Fix ABI checker tests

* Fix other ABI checker tests due to dropping MutableCollection subscript