Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

issetequal behavior with duplicate elements #32550

Closed
pearlzli opened this issue Jul 10, 2019 · 11 comments
Closed

issetequal behavior with duplicate elements #32550

pearlzli opened this issue Jul 10, 2019 · 11 comments
Labels
bug Indicates an unexpected problem or unintended behavior collections Data structures holding multiple items, e.g. sets good first issue Indicates a good issue for first-time contributors to Julia help wanted Indicates that a maintainer wants help on an issue or pull request

Comments

@pearlzli
Copy link

As I mentioned on Discourse, the issetequal docstring says that issetequal(a, b) is equivalent to a ⊆ b && b ⊆ a, but this isn't the case when there are duplicate elements:

julia> versioninfo()
Julia Version 1.0.0
Commit 5d4eaca0c9 (2018-08-08 20:58 UTC)
Platform Info:
OS: macOS (x86_64-apple-darwin14.5.0)
CPU: Intel(R) Core(TM) i5-5257U CPU @ 2.70GHz
WORD_SIZE: 64
LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, broadwell)

julia> a = [1,2,3];

julia> b = [1,1,2,3];

julia> a ⊆ b && b ⊆ a
true

julia> issetequal(a, b)
false

This is because the implementation assumes no duplicate elements:

issetequal(l, r) = length(l) == length(r) && l ⊆ r
@StefanKarpinski StefanKarpinski added bug Indicates an unexpected problem or unintended behavior collections Data structures holding multiple items, e.g. sets good first issue Indicates a good issue for first-time contributors to Julia help wanted Indicates that a maintainer wants help on an issue or pull request labels Jul 10, 2019
@StefanKarpinski
Copy link
Member

Good catch. That algorithm does work for sets but is incorrect for any collection that can have duplicates. One option would be to specialize the current implementation for AbstractSet and to implement the more generic version as a ⊆ b && b ⊆ a but that is a bit inefficient, so I have to wonder if we can't do a bit better here, perhaps iterating the two collections only a single time.

@nalimilan
Copy link
Member

There could be an allunique keyword argument that one would set to true when the inputs are known to contain only unique values, in which case the current fast algorithm would be used. Otherwise, the slower approach would have to be taken.

@mcognetta
Copy link
Contributor

mcognetta commented Jul 11, 2019

Why does this even allow duplicates? Do any other languages have a standard implementation of set that behaves like this? Seems like issetequal should only work for things of type <:AbstractSet. If it really needs to work on any collection then add a fallback like issetequal(l, r) = issetequal(Set(l), Set(r)).

@pearlzli
Copy link
Author

@mcognetta My reading is that issetequal checks a specific notion called "set equality", i.e. whether two collections have the same elements. If we restricted it to work only on AbstractSets, then it would seem redundant to have both issetequal and regular isequal for AbstractSets.

@mcognetta
Copy link
Contributor

mcognetta commented Jul 11, 2019

@pearlzli issetequal and == seem to do the exact same thing for AbstractSet so it makes sense that we can just define one and have the other fall back to it. I am not sure it is redundant to have both defined though since you could have a case where you don't know whether or not you are using a set or some other collection and want to use == instead of issetequal to avoid the temporary construction of a set (or an invalid result if you are anticipating duplicates).

issetequal(l, r) = length(l) == length(r) && l r

==(l::AbstractSet, r::AbstractSet) = length(l) == length(r) && l r

@pearlzli
Copy link
Author

pearlzli commented Jul 11, 2019

I think what I meant is: suppose we restricted issetequal to work only on AbstractSets. I agree that there are situations in which you'd want to use isequal rather than issetequal, but are there any in which you'd prefer issetequal over isequal?

I should specify that when I say "restrict to only work on sets", I mean the case where we only define issetequal(l::AbstractSet, r::AbstractSet) and no other methods.

@StefanKarpinski
Copy link
Member

I think the right thing to do here is just to fix the implementation for non-sets.

@bermani
Copy link

bermani commented Jul 12, 2019

Hello, I am a first-time contributor. I read through this issue and the implementation and I have ideas on how to implement issetequal with the possibility of duplicates in mind.

The l ⊆ r && r ⊆ l implementation is inefficient as stated above since it iterates over both collections twice. isequal(Set(l), Set(r)) is better, it iterates over both l and r once to construct each set and then over l once more to check for equality. But I can think of an algorithm that iterates over l and r only once each:

  1. Construct a new Set with the contents of r. I'll call it r for simplicity.
  2. Initialize an empty Set, let's call it s.
  3. For each element in l, perform the following checks:
    • If the element is in neither r nor s, we've found an item that exists in l but not r. Return false immediately
    • If the element is in r, remove it from r and put it in s. This means we've found an item that we haven't seen yet in l, continue to the next iteration
    • If the element is in s, it's an element that is a duplicate within l (we put it into s in a previous iteration), continue to the next iteration
  4. If there are elements in r after this process, then those elements exist in r but not l, but if not, then l and r contain the same elements. At this point we return length(r) == 0

We should test the runtime similarly to #26198 to compare different algorithms and input sizes. The difference in speed might not be significant since all of these approaches are O(n + m), and the overhead of creating two sets is most likely more inefficient than a simpler approach for small inputs.

@bermani
Copy link

bermani commented Jul 26, 2019

Hello, I got around to testing the runtimes. My code is in this jupyter notebook. Essentially,
Screen Shot 2019-07-25 at 4 59 04 PM
the isequal(Set(l), Set(r)) method is the fastest. My algorithm is not good.

Unless anyone has any other ideas on how to implement issetequal or has any other methods or ideas for testing runtimes, I think the best option is to change the implementation to issetequal(l,r) = Set(l) == Set(r). If no one responds within a day or two I'll submit a PR.

@StefanKarpinski
Copy link
Member

I've got an in-progress PR for this that I just need to finish up. It's trickier than it might seem :)

StefanKarpinski added a commit that referenced this issue Aug 8, 2019
add tests that set ops fail for non-sets (#32550)
@StefanKarpinski
Copy link
Member

I put up a PR that checks some cases where you can avoid constructing a set: basically, when one side is a set and the other side has length, you can check if the set has too many unique values and return false early. It's a slight optimization but it's the best I could come up with. Last resort is to just make sets and thereby guarantee unique elements counts.

StefanKarpinski added a commit that referenced this issue Aug 8, 2019
fix #32550: issetequal with duplicate values
KristofferC pushed a commit that referenced this issue Aug 26, 2019
add tests that set ops fail for non-sets (#32550)

(cherry picked from commit a135040)
SaschaMann added a commit to exercism/julia that referenced this issue Nov 29, 2019
Broken by JuliaLang/julia#32550

Fixed by enforcing uniqueness of CustomSet elements when adding to them
with push!

This internal representation of the Set as an Array is horrible and
should be fixed at some point.
SaschaMann added a commit to exercism/julia that referenced this issue Nov 29, 2019
Broken by JuliaLang/julia#32550

Fixed by enforcing uniqueness of CustomSet elements when adding to them
with push!

This internal representation of the Set as an Array is horrible and
should be fixed at some point.
SaschaMann added a commit to exercism/julia that referenced this issue Nov 29, 2019
Broken by JuliaLang/julia#32550

Fixed by enforcing uniqueness of CustomSet elements when adding to them
with push!

This internal representation of the Set as an Array is horrible and
should be fixed at some point.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Indicates an unexpected problem or unintended behavior collections Data structures holding multiple items, e.g. sets good first issue Indicates a good issue for first-time contributors to Julia help wanted Indicates that a maintainer wants help on an issue or pull request
Projects
None yet
Development

No branches or pull requests

5 participants