Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expand documentation of Set #45416

Merged
merged 9 commits into from
Jun 1, 2022
Merged

Conversation

jakobnissen
Copy link
Contributor

Hash sets have properties that can be counter-intuitive to people not familiar
with computer science: They provide constant-time membership testing, are unordered,
and deduplicate their elements as determined by isequal and hash.

This commit expands the documentation of Set to mention the above properties,
and also that it is a subtype of AbstractSet.

Hash sets have properties that can be counter-intuitive to people not familiar
with computer science: They provide constant-time membership testing, are unordered,
and deduplicate their elements as determined by `isequal` and `hash`.

This commit expands the documentation of `Set` to mention the above properties,
and also that it is a subtype of `AbstractSet`.
base/set.jl Outdated Show resolved Hide resolved
@giordano giordano added docs This change adds or pertains to documentation collections Data structures holding multiple items, e.g. sets labels May 22, 2022
base/set.jl Show resolved Hide resolved
base/set.jl Outdated Show resolved Hide resolved
base/set.jl Outdated Show resolved Hide resolved
base/set.jl Outdated Show resolved Hide resolved
* Remove redundant first line
* Re-add removed "See also" references
* Removed redundant empty Set instantiation example
* Have example show mutability
* Have example show deduplication
* Have example show behaviour under `isequal`
@jakobnissen
Copy link
Contributor Author

I've addressed your comments with the following changes

  • Remove redundant first line
  • Re-add removed "See also" references
  • Removed redundant empty Set instantiation example
  • Have example show mutability
  • Have example show deduplication
  • Have example show behaviour under isequal

base/set.jl Outdated
0.0
-0.0

julia> push!(s, 'a'); push!(s, 0.0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about pushing 0 to better highlight the mutability?

Suggested change
julia> push!(s, 'a'); push!(s, 0.0)
julia> push!(s, 'a'); push!(s, 0)

... and than 0.0 in s

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue with that is that I think the current behaviour where 0 overwrites 0.0 is an implementation detail (i.e. I think keeping 0.0 would be documented behaviour, too), so it's better to not have it in the doc example.

base/set.jl Outdated Show resolved Hide resolved
@Moelf
Copy link
Contributor

Moelf commented May 24, 2022

can we finally close #23269 while at it?

@jakobnissen
Copy link
Contributor Author

I added more examples by recommendation - if this docstring is too long let me know.


# Examples
```jldoctest filter = r"^\\S.+"
julia> s = Set("aaBca")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
julia> s = Set("aaBca")
julia> s = Set("aaBca"); push!(s, 'b')

If you are worried it to be too long, I guess you can connect these together without losing any vital information.

base/set.jl Outdated

`Set`s have efficient implementations of set operations such as `in`, `union` and `intersect`.
Elements in a `Set` are unique, as determined by the elements' definitions of
`isequal` and `hash`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really get the addition of hash here. You could have a set with a bunch of elements that all hash to 1, it would just be inefficient.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea is to prevent a user overloading isequal without overloading hash (e,g, having a bunch of not-equal elements that hash to 1 is an implementation error). This is not legal according to isequal docs though, so it might be redundant to also state it here. I'll remove it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, the problem is kind of the opposite, i.e., having multiple isequal elements with different hashes. So, omitting hash may easily lead to hard-to-debug bugs, like

julia> import Base: isequal, hash

julia> struct X; x; end

julia> isequal(X(0.0), X(0))
false

julia> isequal(a::X, b::X) = a.x == b.x;

julia> isequal(X(0.0), X(0))
true

julia> s = Set([X(0.0), X(0)]) # Incorrect
Set{X} with 2 elements:
  X(0)
  X(0.0)

julia> hash.([X(0.0), X(0)])
2-element Vector{UInt64}:
 0x29f6c96c4b9e8bb8
 0x0792f7046dd5cf45

julia> hash(x::X) = hash(x.x);

julia> s = Set([X(0.0), X(0)]) # Correct
Set{X} with 1 element:
  X(0)

Btw, hash is usually mention for other languages too. Just to name few: C++, Rust, Java ... unless the set is BST-based, of course.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as determined by the elements' definitions of isequal and hash

This "and" doesn't seem quite accurate. I think it only calls hash, although that should always match isequal. Maybe we can just say that? The goal is always to agree with isequal, but the mechanism is hash. Making it sound like the mechanism is isequal seems misleading.

I think the logic is the same as unique, whose description does not mention hash, but (IMO) probably should. Mentioning that they agree, and linking to unique here, might also be a good idea.

help?> unique
search: unique unique! allunique

  unique(itr)

  Return an array containing only the unique elements of collection itr, as determined by isequal,
  in the order that the first of each set of equivalent elements originally appears. The element
  type of the input is preserved.

Copy link
Member

@petvana petvana May 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isequal is actually called to check hash collisions for Dict (which implements Set internally) in

if (key === k || isequal(key, k))

How about writing the assumption explicitly, like in Rust?

isequal(a, b) -> hash(a) == hash(b)

Copy link
Contributor

@mcabbott mcabbott May 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah ok. So it does depend on both.

Maybe the description should first say what the contract is (unique according to isequal, just like unique) and later say that the fast implementation needs hash? Uniqueness in the 1st paragraph, what's efficient in the 2nd?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that really necessary for the docstring, though? To me, mentioning hash seem like an optimization based on the implementation detail of Set, and that is usually outside the scope of a docstring.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO, yes! These two functions are entangled tightly. So, mentioning only one seems dangerous to me because it motivates the user to re-define isequal only, and it may lead to an incorrect/undefined behavior.

@jakobnissen
Copy link
Contributor Author

OK so it's clear to me that we don't all exactly agree on what should and shouldn't go in the docstring. But let's not let the perfect be the enemy of the good and let's make sure this PR doesn't stall out and get nowhere due to bikeshedding.

Copy link
Member

@aviatesk aviatesk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Any objections to this? Otherwise I'd like to merge this tmrw.

@KristofferC KristofferC merged commit cbcb359 into JuliaLang:master Jun 1, 2022
@jakobnissen jakobnissen deleted the setdocs branch June 1, 2022 13:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
collections Data structures holding multiple items, e.g. sets docs This change adds or pertains to documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants