issetequal behavior with duplicate elements #32550

pearlzli · 2019-07-10T21:11:39Z

As I mentioned on Discourse, the issetequal docstring says that issetequal(a, b) is equivalent to a ⊆ b && b ⊆ a, but this isn't the case when there are duplicate elements:

julia> versioninfo()
Julia Version 1.0.0
Commit 5d4eaca0c9 (2018-08-08 20:58 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i5-5257U CPU @ 2.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, broadwell)

julia> a = [1,2,3];

julia> b = [1,1,2,3];

julia> a ⊆ b && b ⊆ a
true

julia> issetequal(a, b)
false

This is because the implementation assumes no duplicate elements:

issetequal(l, r) = length(l) == length(r) && l ⊆ r


StefanKarpinski · 2019-07-10T22:24:20Z

Good catch. That algorithm works for sets but is incorrect for any collection that can contain duplicates. One option would be to specialize the current implementation for AbstractSet and implement the generic version as a ⊆ b && b ⊆ a. That is a bit inefficient, though, so I wonder whether we can do better here, perhaps by iterating over each collection only once.
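A minimal sketch of the first option mentioned here, assuming a hypothetical function name `issetequal_twoway` so that nothing in Base is overwritten. It keeps the length-based shortcut for `AbstractSet` arguments and falls back to checking both inclusions for everything else:

```julia
# Hypothetical name, chosen to avoid clashing with Base.issetequal.
# Sets cannot hold duplicates, so equal lengths plus one inclusion suffice.
issetequal_twoway(l::AbstractSet, r::AbstractSet) = length(l) == length(r) && l ⊆ r

# Generic fallback: check both inclusions, which tolerates duplicates.
issetequal_twoway(l, r) = l ⊆ r && r ⊆ l

a = [1, 2, 3]
b = [1, 1, 2, 3]
issetequal_twoway(a, b)            # true
issetequal_twoway(Set(a), Set(b))  # true
issetequal_twoway(a, [1, 2])       # false
```

The generic method pays for two subset checks, which is the inefficiency raised in the comment.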

nalimilan · 2019-07-11T07:35:38Z

There could be an allunique keyword argument that one would set to true when the inputs are known to contain only unique values, in which case the current fast algorithm would be used. Otherwise, the slower approach would have to be taken.
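A rough sketch of how such a keyword could look. The function name `issetequal_kw` is hypothetical, and so is the choice of slow path (comparing two freshly built sets); `Base.issetequal` accepts no such keyword.

```julia
# Hypothetical function and keyword; Base.issetequal takes no such keyword.
function issetequal_kw(l, r; allunique::Bool = false)
    if allunique
        # Fast path, valid only when neither input repeats an element.
        return length(l) == length(r) && l ⊆ r
    end
    # Slow path (an assumption of this sketch): deduplicate both sides first.
    return Set(l) == Set(r)
end

issetequal_kw([1, 2, 3], [3, 2, 1]; allunique = true)     # true
issetequal_kw([1, 2, 3], [1, 1, 2, 3])                    # true
issetequal_kw([1, 2, 3], [1, 1, 2, 3]; allunique = true)  # false: the caller's promise was wrong
```

The last call shows the cost of this design: a caller who passes `allunique = true` for inputs that do contain duplicates gets the original incorrect answer back.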

mcognetta · 2019-07-11T15:18:34Z

Why does this even allow duplicates? Do any other languages have a standard implementation of set that behaves like this? Seems like issetequal should only work for things of type <:AbstractSet. If it really needs to work on any collection then add a fallback like issetequal(l, r) = issetequal(Set(l), Set(r)).
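The fallback suggested here could be sketched as follows, using a hypothetical name `issetequal_viaset` rather than extending `Base.issetequal`. It assumes that `==` on two `AbstractSet`s already compares by content:

```julia
# Hypothetical name to avoid overwriting Base.issetequal.
issetequal_viaset(l::AbstractSet, r::AbstractSet) = l == r

# Anything else is converted to a Set first, which discards duplicates.
issetequal_viaset(l, r) = issetequal_viaset(Set(l), Set(r))

issetequal_viaset([1, 2, 3], [1, 1, 2, 3])          # true
issetequal_viaset((x for x in 1:3), [3, 2, 1])      # true
issetequal_viaset([1, 2, 3], [1, 2])                # false
```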

pearlzli · 2019-07-11T15:35:39Z

@mcognetta My reading is that issetequal checks a specific notion called "set equality", i.e. whether two collections have the same elements. If we restricted it to work only on AbstractSets, then it would seem redundant to have both issetequal and regular isequal for AbstractSets.
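A small illustration of the distinction, using inputs without duplicates so that it does not depend on the bug reported above:

```julia
a = [1, 2, 3]
c = [3, 2, 1]

isequal(a, c)            # false: arrays are compared element by element, in order
issetequal(a, c)         # true: same elements, order ignored
isequal(Set(a), Set(c))  # true: sets have no order
```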

mcognetta · 2019-07-11T15:44:53Z

@pearlzli issetequal and == seem to do exactly the same thing for AbstractSet, so it makes sense to define one and have the other fall back to it. I am not sure it is redundant to have both defined, though: you could have a case where you don't know whether you are working with a set or some other collection and want to use == instead of issetequal, to avoid constructing a temporary set (or getting an invalid result if you anticipate duplicates).

julia/base/abstractset.jl

Line 291 in f5a50be

issetequal(l, r) = length(l) == length(r) && l ⊆ r

julia/base/abstractset.jl

Line 226 in f5a50be

==(l::AbstractSet, r::AbstractSet) = length(l) == length(r) && l ⊆ r

pearlzli · 2019-07-11T15:51:15Z

I think what I meant is: suppose we restricted issetequal to work only on AbstractSets. I agree that there are situations in which you'd want to use isequal rather than issetequal, but are there any in which you'd prefer issetequal over isequal?

I should specify that when I say "restrict to only work on sets", I mean the case where we only define issetequal(l::AbstractSet, r::AbstractSet) and no other methods.

StefanKarpinski · 2019-07-11T17:14:19Z

I think the right thing to do here is just to fix the implementation for non-sets.

bermani · 2019-07-12T22:49:47Z

Hello, I am a first-time contributor. I read through this issue and the implementation and I have ideas on how to implement issetequal with the possibility of duplicates in mind.

The l ⊆ r && r ⊆ l implementation is inefficient, as stated above, since it iterates over both collections twice. isequal(Set(l), Set(r)) is better: it iterates over l and r once each to construct the sets and then over l once more to check for equality. But I can think of an algorithm that iterates over l and r only once each:

1. Construct a new Set with the contents of r. I'll call it r for simplicity.
2. Initialize an empty Set; let's call it s.
3. For each element in l, perform the following checks:
- If the element is in neither r nor s, we've found an item that exists in l but not r. Return false immediately.
- If the element is in r, remove it from r and put it in s. This means we've found an item that we haven't seen yet in l; continue to the next iteration.
- If the element is in s, it's a duplicate within l (we put it into s in a previous iteration); continue to the next iteration.
4. If there are elements left in r after this process, those elements exist in r but not l; otherwise l and r contain the same elements. At this point we return length(r) == 0.
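The steps above can be sketched in Julia as follows. The function name `issetequal_onepass` and the variable names `remaining` and `seen` (standing in for the set built from r and for s) are hypothetical:

```julia
# Hypothetical name; a sketch of the steps described above.
function issetequal_onepass(l, r)
    remaining = Set(r)               # elements of r not yet encountered in l
    seen = Set{eltype(remaining)}()  # elements of r already matched by l
    for x in l
        if x in remaining
            # First time we meet this element in l: move it to `seen`.
            delete!(remaining, x)
            push!(seen, x)
        elseif !(x in seen)
            # In neither set: x exists in l but not in r.
            return false
        end
        # Otherwise x is a duplicate within l; nothing to do.
    end
    # Anything left over exists in r but not in l.
    return isempty(remaining)
end

issetequal_onepass([1, 2, 3], [1, 1, 2, 3])  # true
issetequal_onepass([1, 2], [1, 2, 3])        # false
```

Each collection is traversed once, but the loop performs up to two set lookups plus a deletion and an insertion per element, so a single pass does not by itself guarantee a faster result.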

We should test the runtime similarly to #26198 to compare different algorithms and input sizes. The difference in speed might not be significant, since all of these approaches are O(n + m), and for small inputs the overhead of creating two sets most likely makes this less efficient than a simpler approach.

bermani · 2019-07-26T00:12:24Z

Hello, I got around to testing the runtimes. My code is in this Jupyter notebook. Essentially, the isequal(Set(l), Set(r)) method is the fastest. My algorithm is not good.

Unless anyone has any other ideas on how to implement issetequal or has any other methods or ideas for testing runtimes, I think the best option is to change the implementation to issetequal(l,r) = Set(l) == Set(r). If no one responds within a day or two I'll submit a PR.
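For anyone who wants to repeat such a comparison, here is a rough timing sketch using only `@elapsed` from Base. It is not the code from the notebook; the helper names `twoway` and `viaset` and the input sizes are assumptions made for illustration, and the numbers will vary with machine, Julia version, and input.

```julia
# Hypothetical helper names for two of the candidates discussed in this thread.
twoway(l, r) = l ⊆ r && r ⊆ l
viaset(l, r) = Set(l) == Set(r)

# Random inputs with many duplicates; sizes are arbitrary.
l = rand(1:1_000, 10_000)
r = rand(1:1_000, 10_000)

# Call each candidate once first so compilation is not included in the timing.
twoway(l, r)
viaset(l, r)

t_twoway = @elapsed twoway(l, r)
t_viaset = @elapsed viaset(l, r)
println("two-way subset: ", t_twoway, " s")
println("via Set:        ", t_viaset, " s")
```

A single `@elapsed` measurement is noisy; the BenchmarkTools.jl package gives more reliable numbers if it is available.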

StefanKarpinski · 2019-07-31T17:58:36Z

I've got an in-progress PR for this that I just need to finish up. It's trickier than it might seem :)


StefanKarpinski · 2019-08-08T02:50:16Z

I put up a PR that checks some cases where you can avoid constructing a set: basically, when one side is a set and the other side has a length, you can check whether the set has too many unique values and return false early. It's a slight optimization, but it's the best I could come up with. The last resort is to just make sets and thereby guarantee that element counts are unique-element counts.
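The early-exit idea can be sketched like this. It illustrates the description above and is not the code from the PR; the names `issetequal_early` and `haslen` are hypothetical.

```julia
# Hypothetical names; a sketch of the idea, not the code from the PR.
# True when the iterator advertises a length through the IteratorSize trait.
haslen(x) = Base.IteratorSize(typeof(x)) isa Union{Base.HasLength, Base.HasShape}

issetequal_early(l::AbstractSet, r::AbstractSet) = l == r

function issetequal_early(l::AbstractSet, r)
    # l holds length(l) unique values; r cannot cover them with fewer elements.
    haslen(r) && length(l) > length(r) && return false
    return l == Set(r)
end

issetequal_early(l, r::AbstractSet) = issetequal_early(r, l)

# Last resort: build sets on both sides so that lengths count unique elements.
issetequal_early(l, r) = Set(l) == Set(r)

issetequal_early(Set([1, 2, 3]), [1, 1, 2, 3])  # true
issetequal_early(Set([1, 2, 3]), [1, 2])        # false, without building Set(r)
```

The shortcut only works in one direction: a non-set that is longer than the set may still be set-equal to it because of duplicates, so that case still has to build a set.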


Broken by JuliaLang/julia#32550. Fixed by enforcing uniqueness of CustomSet elements when adding to them with push!. This internal representation of the Set as an Array is horrible and should be fixed at some point.

StefanKarpinski added a commit that referenced this issue Aug 8, 2019

fix #32550: issetequal with duplicate values

d959500

add tests that set ops fail for non-sets (#32550)

StefanKarpinski mentioned this issue Aug 8, 2019

fix #32550: issetequal with duplicate values #32826

Merged

StefanKarpinski closed this as completed in a135040 Aug 8, 2019

StefanKarpinski added a commit that referenced this issue Aug 8, 2019

Merge pull request #32826 from JuliaLang/sk/issetequal

e1520ba

fix #32550: issetequal with duplicate values

KristofferC mentioned this issue Aug 26, 2019

WIP: Backports for 1.2.1 #33073

Closed

9 tasks

KristofferC pushed a commit that referenced this issue Aug 26, 2019

fix #32550: issetequal with duplicate values

03c9c72

add tests that set ops fail for non-sets (#32550) (cherry picked from commit a135040)

KristofferC mentioned this issue Aug 26, 2019

WIP: Backports for 1.0.5 #33075

Merged

55 tasks

KristofferC mentioned this issue Dec 3, 2019

WIP: Backports release 1.0.6 #34011

Closed

56 tasks
