[SE-0270] Add Collection Operations on Noncontiguous Elements #69766
Conversation
@swift-ci please smoke test
(force-pushed from 90dadbe to 215329b)
@swift-ci please smoke test
@swift-ci please build toolchain macOS platform
@swift-ci please smoke test
@swift-ci please smoke test
@swift-ci please smoke test macOS platform
```swift
    hasher.combine(element)
    count += 1
}
hasher.combine(count) // discriminator
```
What does including `count` actually get us here?
@lorentey please correct me if I'm wrong - I believe we added this while discussing this type and comparing it to existing collection types. I believe this is a discriminator for cases where you may have a `Hashable` parent type such as the following:

```swift
struct Parent<Base: Collection>: Hashable where Base.Element: Hashable {
    let slice1: DiscontiguousSlice<Base>
    let slice2: DiscontiguousSlice<Base>
}
```

Using the `count` of the slice as a discriminator here causes the two unequal parents `Parent(slice1: [1], slice2: [2, 3])` and `Parent(slice1: [1, 2], slice2: [3])` to have different hash values, resulting in a better hash function.
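To make the collision concrete, here is a small self-contained sketch (the `encoding(of:withCount:)` helper is hypothetical, and plain `[Int]` arrays stand in for the slices) modeling the stream of values fed to the hasher:

```swift
// Hypothetical model of the data fed to the hasher for a pair of slices:
// without a discriminator the stream is just the concatenated elements;
// with the count combined first, each slice's encoding is self-delimiting.
func encoding(of slices: [[Int]], withCount: Bool) -> [Int] {
    var stream: [Int] = []
    for slice in slices {
        if withCount {
            stream.append(slice.count) // discriminator
        }
        stream.append(contentsOf: slice)
    }
    return stream
}

let a: [[Int]] = [[1], [2, 3]]
let b: [[Int]] = [[1, 2], [3]]

// Without the discriminator, both unequal parents feed the identical
// stream 1, 2, 3 to the hasher -- a guaranteed collision:
print(encoding(of: a, withCount: false)) // [1, 2, 3]
print(encoding(of: b, withCount: false)) // [1, 2, 3]

// With the count combined first, the streams differ:
print(encoding(of: a, withCount: true)) // [1, 1, 2, 2, 3]
print(encoding(of: b, withCount: true)) // [2, 1, 2, 1, 3]
```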
I believe `Array` has this discriminator for a similar reason: https://github.com/apple/swift/blob/6af9c99d66554ec9925125ed2afc4c74162e757c/stdlib/public/core/Array.swift#L1801-L1814
Yes -- variable-sized values must use a self-delimiting hash encoding, to prevent (accidental or deliberate) hash collisions for aggregate types that include such values as components, such as @jmschonfeld's `struct Parent`.

On a second look, I believe combining the count as the last step isn't quite good enough, though -- we'll need to either swallow the cost of retrieving the `count` in a separate pass and combining it up front, or we need to insert discriminators before each item and a different delimiter at the end:
```swift
public func hash(into hasher: inout Hasher) {
    for element in self {
        hasher.combine(1 as UInt8) // discriminator
        hasher.combine(element)
    }
    hasher.combine(0 as UInt8) // delimiter
}
```
I think retrieving the `count` up front is generally going to be cheaper. I'll submit a suggestion implementing this in a separate thread.
The expectation is that, in theory, we would be able to reconstruct/decode the hashed value from its hash encoding, even if some unrelated bytes got appended at the end of it. (If we want to delimit the encoding using a count, this means the count needs to appear first, as it wouldn't be distinguishable from actual elements later.)
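A runnable sketch of the count-first approach, using a hypothetical `Slice` wrapper over `[Int]` in place of `DiscontiguousSlice` (names here are illustrative, not the PR's actual implementation):

```swift
// Toy Hashable wrapper that combines the element count up front,
// making its hash encoding self-delimiting when embedded in a parent.
struct Slice: Hashable {
    var elements: [Int]

    func hash(into hasher: inout Hasher) {
        hasher.combine(elements.count) // self-delimiting prefix
        for element in elements {
            hasher.combine(element)
        }
    }
}

// Equatable/Hashable conformances for Parent are compiler-synthesized.
struct Parent: Hashable {
    let slice1: Slice
    let slice2: Slice
}

let a = Parent(slice1: Slice(elements: [1]), slice2: Slice(elements: [2, 3]))
let b = Parent(slice1: Slice(elements: [1, 2]), slice2: Slice(elements: [3]))
print(a == b)                     // false
print(a.hashValue == b.hashValue) // false with overwhelming probability
```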
This seems like we're taking it too far; you cannot actually prevent all collisions, only make a reasonable best effort. I buy the rationale as far as using some delimiter, but it's not at all clear to me that we gain anything by using `count` instead of an arbitrarily chosen delimiter byte.
Getting hashing right is an important detail, because it has security implications. Playing loose with the rules can lead to exploitable vulnerabilities; we should not let the Standard Library go down that path.
For Set/Dictionary to work as promised, we must avoid allowing repeatable hash collisions. Two distinct pieces of input data must not ever reliably hash to the same value.
When hashing general purpose container types, care must be taken to make sure the encoding remains robust even when concatenated with arbitrary other encodings. Delimiting the hash encoding with a specific bit pattern usually doesn't work in this case, as we cannot prevent the element type from emitting the same pattern. (Unlike with special-purpose collections where the Element type is known, such as String.) The two approaches above (counting elements or combining per-instance discriminators) are the two basic techniques we can choose from.
Hashing the count of items requires two passes in this case; using per-instance discriminators would feed significantly more data to the hasher. Of these two options, the former seems preferable.
(The current implementation is great! It explicitly feeds the count to the hasher before hashing any of the items. Accordingly, this can be marked resolved, unless anyone has an objection.)
@swift-ci please smoke test
I made lots of notes, but most of them are tiny formatting nits with suggestions.
```swift
for range in inversion.ranges {
    result.append(contentsOf: self[range])
}
self = result
```
Observation: This implementation will not work for noncopyable elements. The most general `RRC.removeAll(where:)` implementation shares this problem, though, so I don't think it is a real objection, even if we end up generalizing the existing protocols for noncopyable support.
Ah, good point. I'd be happy to update it to something that is compatible, but I can't think of a good way to remove multiple subranges at once otherwise, because any mutation to the collection will invalidate any remaining subrange indices we have yet to process.
I don't think this is implementable in a reasonable way without self-assignment. The question is: would it be better to reformulate this entry point as a nonmutating `removingSubranges` API? That would be a more honest description of how it works, and it would directly translate to a `consuming` algorithm.

Update: I see we already have a `removingSubranges` API on `Collection`, but that returns a `DiscontiguousSlice`, and it would clash with this one. Bummer! (I can't judge if we need to keep both...)
Yeah, as you mentioned, the current API under review has both, but the `removing` one returns a `DiscontiguousSlice` in constant time, wrapping the base collection rather than creating a copy and removing various elements from it.
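The invert-and-copy technique under discussion can be sketched on a plain `Array`, with `[Range<Int>]` standing in for `RangeSet` and a hypothetical `removeSubrangesSketch` name (not the PR's actual implementation):

```swift
// Hedged sketch: remove multiple subranges at once by walking the
// complement of the removed ranges and concatenating the kept slices
// into a fresh copy, which then replaces self by assignment.
extension Array {
    mutating func removeSubrangesSketch(_ subranges: [Range<Int>]) {
        // Assumes `subranges` is sorted and non-overlapping,
        // mirroring the invariant RangeSet maintains.
        var result: [Element] = []
        var cursor = 0
        for range in subranges {
            result.append(contentsOf: self[cursor..<range.lowerBound])
            cursor = range.upperBound
        }
        result.append(contentsOf: self[cursor...])
        self = result
    }
}

var numbers = Array(0..<10)
numbers.removeSubrangesSketch([2..<4, 6..<8])
print(numbers) // [0, 1, 4, 5, 8, 9]
```

Because the kept elements are copied out before `self` is reassigned, no index into the original collection is ever used after a mutation, which is the invalidation hazard described above.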
test/api-digester/Outputs/stability-stdlib-source-x86_64.swift.expected (outdated review thread, resolved)
@swift-ci please smoke test
@swift-ci please build toolchain macOS Platform
@swift-ci please smoke test Linux platform
I did another review pass -- this still looks good. I added some new notes, incl. some API-level observations.
Because we aren't entirely sure if we'll be able to keep things opaque forever, I recommend exposing ABI entry points for at least the most important high-level helper functions. (If we do end up wishing to make more things `@inlinable` later, not having the symbols exposed will otherwise be extremely painful.)
@swift-ci please smoke test
@swift-ci please smoke test macOS platform
(force-pushed from 6675bd9 to f8ae577)
@swift-ci please smoke test
Rebased to pick up the new ABI checker tests and added the symbols to fix the macOS test failure
@swift-ci please smoke test
@swift-ci please smoke test
```swift
@available(SwiftStdlib 5.11, *)
extension DiscontiguousSlice.Index: CustomStringConvertible {
    public var description: String {
        "<base: \(String(reflecting: base)), rangeOffset: \(_rangeOffset)>"
```
Observation: this will be quite unreadable when the reflection of base produces a structured multi-line string. I'm not sure what we can practically do about that here -- it is a solvable problem, but it requires adding machinery for context-sensitive string embedding that would not be appropriate to implement in this PR.
Looks good!
The big caveat is that this is an opaque Collection type, a big deviation from most previous stdlib additions. (The closest existing analogue is probably `CollectionDifference`.) I have precious little working experience with these, so I can't fully foresee what problems we'll need to prepare for. However, I think the current code does as good a job as can be expected.
My second worry is about test coverage, especially of corner cases. The test-to-code ratio is pretty low -- the 586 lines of new tests are unlikely to cover all of the new functionality. (Especially as these are all concrete test cases.) Without exhaustive tests, we do not know whether this will actually work as intended.
We should probably have some combinatorial tests to validate the consistency of the protocol conformances. `Collection` is a very tricky protocol, and this feature is adding multiple new implementations of it, some quite subtle.
The standard `checkHashable`/`checkCollection`/`checkBidirectionalCollection`/`checkMutableCollection`/etc. routines implement some exhaustive checks to try to catch typical implementation issues. They only validate a subset of our semantic requirements, but they are great at checking things we'd not usually think to explicitly test for.
The StdlibUnittest modules also include some test collection implementations (`Minimal*Collection`) that are intended to exercise hidden corners in collection algorithms and lazy transformations like `DiscontiguousSlice`. (They implement protocol requirements with no shortcuts (sometimes in the most awkward/unhelpful manner), and they come with extra hooks to e.g. verify that expected calls do occur.)
…ang#69766)
* Adds RangeSet/DiscontiguousSlice to the stdlib
* Remove redundant DiscontiguousSlice.Index: Comparable conformance
* Attempt to fix embedded build
* Attempt to fix macOS test failures
* Fix Constaints/members.swift failure on linux
* Add exceptions to ABI/source checker to fix macOS tests
* Fix incremental dependency test failure
* Remove inlining/unfreeze implementation for future improvements
* Simplify indices(where:) implementation
* Address review feedback
* Add test for underscored, public slice members
* Address feedback on inlining, hashing, and initializing with unordered arrays
* Fix ABI checker issues
* Remove MutableCollection extension for DiscontiguousSlice
* Make insertion return a discardable Bool
* Fix ABI checker tests
* Fix other ABI checker tests due to dropping MutableCollection subscript
This implements the API described by SE-0270 Add Collection Operations on Noncontiguous Elements. Specifically, this adds the `RangeSet` and `DiscontiguousSlice` types and associated APIs to the stdlib.
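As a rough end-to-end sketch of the surface this PR adds (assuming a toolchain whose standard library already ships `RangeSet` and `DiscontiguousSlice`; the values are illustrative):

```swift
// SE-0270 in brief: collect noncontiguous indices into a RangeSet,
// view them as a DiscontiguousSlice, and remove them in one operation.
var numbers = [10, 12, -5, 14, -3, -9, 15]

// indices(where:) gathers the indices of all matching elements.
let negatives = numbers.indices(where: { $0 < 0 })

// Subscripting with a RangeSet yields a DiscontiguousSlice wrapping
// the base collection, without copying the elements.
let slice = numbers[negatives]
print(Array(slice)) // [-5, -3, -9]

// removeSubranges(_:) mutates the collection in place.
numbers.removeSubranges(negatives)
print(numbers) // [10, 12, 14, 15]
```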