-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Expand documentation of Set #45416
Expand documentation of Set #45416
Conversation
Hash sets have properties that can be counter-intuitive to people not familiar with computer science: They provide constant-time membership testing, are unordered, and deduplicate their elements as determined by `isequal` and `hash`. This commit expands the documentation of `Set` to mention the above properties, and also that it is a subtype of `AbstractSet`.
* Remove redundant first line * Re-add removed "See also" references * Removed redundant empty Set instantiation example * Have example show mutability * Have example show deduplication * Have example show behaviour under `isequal`
I've addressed your comments with the following changes
|
base/set.jl
Outdated
0.0 | ||
-0.0 | ||
|
||
julia> push!(s, 'a'); push!(s, 0.0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about pushing 0
to better highlight the mutability?
julia> push!(s, 'a'); push!(s, 0.0) | |
julia> push!(s, 'a'); push!(s, 0) |
... and than 0.0 in s
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issue with that is that I think the current behaviour where 0
overwrites 0.0
is an implementation detail (i.e. I think keeping 0.0
would be documented behaviour, too), so it's better to not have it in the doc example.
can we finally close #23269 while at it? |
I added more examples by recommendation - if this docstring is too long let me know. |
|
||
# Examples | ||
```jldoctest filter = r"^\\S.+" | ||
julia> s = Set("aaBca") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
julia> s = Set("aaBca") | |
julia> s = Set("aaBca"); push!(s, 'b') |
If you are worried it to be too long, I guess you can connect these together without losing any vital information.
base/set.jl
Outdated
|
||
`Set`s have efficient implementations of set operations such as `in`, `union` and `intersect`. | ||
Elements in a `Set` are unique, as determined by the elements' definitions of | ||
`isequal` and `hash`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't really get the addition of hash
here. You could have a set with a bunch of elements that all hash to 1
, it would just be inefficient.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea is to prevent a user overloading isequal
without overloading hash
(e,g, having a bunch of not-equal elements that hash to 1
is an implementation error). This is not legal according to isequal
docs though, so it might be redundant to also state it here. I'll remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here, the problem is kind of the opposite, i.e., having multiple isequal
elements with different hashes. So, omitting hash
may easily lead to hard-to-debug bugs, like
julia> import Base: isequal, hash
julia> struct X; x; end
julia> isequal(X(0.0), X(0))
false
julia> isequal(a::X, b::X) = a.x == b.x;
julia> isequal(X(0.0), X(0))
true
julia> s = Set([X(0.0), X(0)]) # Incorrect
Set{X} with 2 elements:
X(0)
X(0.0)
julia> hash.([X(0.0), X(0)])
2-element Vector{UInt64}:
0x29f6c96c4b9e8bb8
0x0792f7046dd5cf45
julia> hash(x::X) = hash(x.x);
julia> s = Set([X(0.0), X(0)]) # Correct
Set{X} with 1 element:
X(0)
Btw, hash
is usually mention for other languages too. Just to name few: C++, Rust, Java ... unless the set is BST-based, of course.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as determined by the elements' definitions of
isequal
andhash
This "and" doesn't seem quite accurate. I think it only calls hash
, although that should always match isequal
. Maybe we can just say that? The goal is always to agree with isequal
, but the mechanism is hash
. Making it sound like the mechanism is isequal
seems misleading.
I think the logic is the same as unique
, whose description does not mention hash
, but (IMO) probably should. Mentioning that they agree, and linking to unique
here, might also be a good idea.
help?> unique
search: unique unique! allunique
unique(itr)
Return an array containing only the unique elements of collection itr, as determined by isequal,
in the order that the first of each set of equivalent elements originally appears. The element
type of the input is preserved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isequal
is actually called to check hash
collisions for Dict
(which implements Set
internally) in
Line 273 in 0062c26
if (key === k || isequal(key, k)) |
How about writing the assumption explicitly, like in Rust?
isequal(a, b) -> hash(a) == hash(b)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah ok. So it does depend on both.
Maybe the description should first say what the contract is (unique according to isequal
, just like unique
) and later say that the fast implementation needs hash
? Uniqueness in the 1st paragraph, what's efficient in the 2nd?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is that really necessary for the docstring, though? To me, mentioning hash
seem like an optimization based on the implementation detail of Set, and that is usually outside the scope of a docstring.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO, yes! These two functions are entangled tightly. So, mentioning only one seems dangerous to me because it motivates the user to re-define isequal
only, and it may lead to an incorrect/undefined behavior.
OK so it's clear to me that we don't all exactly agree on what should and shouldn't go in the docstring. But let's not let the perfect be the enemy of the good and let's make sure this PR doesn't stall out and get nowhere due to bikeshedding. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Any objections to this? Otherwise I'd like to merge this tmrw.
Hash sets have properties that can be counter-intuitive to people not familiar
with computer science: They provide constant-time membership testing, are unordered,
and deduplicate their elements as determined by
isequal
andhash
.This commit expands the documentation of
Set
to mention the above properties,and also that it is a subtype of
AbstractSet
.