Implement `isdisjoint` #13192

eschnett · 2015-09-17T18:39:08Z

This implements the functionality. Which files do I need to change manually to document it?

eschnett · 2015-09-17T18:39:34Z

Addresses #13189.

tkelman · 2015-09-17T18:45:44Z

See https://github.com/JuliaLang/julia/blob/master/CONTRIBUTING.md#improving-documentation - docstring either in helpdb or inline, and add the signature to the rst docs so doc/genstdlib.jl populates it properly

StefanKarpinski · 2015-09-17T18:56:28Z

I feel like there has to be a much more efficient way to implement this. Checking, for each element in the first set, whether it is in the second set seems like it would be more efficient than constructing the intersection. This appears to be checking for pairwise disjointness, which is not obviously the only correct meaning for n-ary disjointness. A definition equivalent to isempty(intersect(S...)) would make sense too.

eschnett · 2015-09-17T19:20:51Z

@StefanKarpinski An implementation based on in -- yes, that's feasible.

A definition equivalent to isempty(intersect(S...)): what would this be called? I can't think of a name for it. The current definition is called "mutually disjoint".

StefanKarpinski · 2015-09-17T19:26:10Z

It's standard in mathematical texts to explicitly say "pairwise disjoint" or "collectively disjoint" to distinguish these two cases. I would argue that the term "disjoint" by itself is ambiguous for more than two arguments.

eschnett · 2015-09-17T19:31:44Z

@StefanKarpinski Is this a generic comment, or do you suggest to rename the function to is_pairwise_disjoint? Or do you suggest to change its behaviour to mean "at least two of the sets are disjoint"?

StefanKarpinski · 2015-09-17T19:36:07Z

ispairwisedisjoint is kind of a mouthful, but I do think that multi-argument isdisjoint is not good. Both pairwise disjointnes and collective disjointness are interesting things that people might want to check for and which can probably be done more efficiently with a little bit of cleverness.

eschnett · 2015-09-17T19:56:57Z

Efficiency is not an argument; these are generic implementations that can be overwritten for specific datatypes, for example if elements are kept in a certain order. For example, the existing generic implementation of symdiff states:

# symdiff is associative, so a relatively clean
# way to implement this is by using setdiff and union, and
# recursing. Has the advantage of keeping order, too, but
# not as fast as other methods that make a single pass and
# store counts with a Dict.

It seems you're arguing to omit the multi-argument version of isdisjoint.

eschnett · 2015-09-17T20:10:07Z

@tkelman I didn't understand the text in CONTRIBUTING.md. Apparently it's dangerous to edit documentation since it could lead to losing documentation if one makes a typo. There's also a distinction between changing the "docstring signature" and "docstring body", and things have to be done in a particular order; however, it doesn't say how to initially add a docstring. There seem to be "definitions" and "autogenerated content", but from reading this text I don't know which is which.

It seems that the structure of the documentation is completely decoupled from the structure of the source code. This is strange. Wouldn't it be easier to generate a reference manual by following the source code structure, and by putting all functions there to ensure no one is missing, wrong, or a ghost?

I was trying to follow e.g. the definition of union. This function is defined in array.jl as union(vs...). It is documented in stdlib/collections.rst as union(s1,s2...).

So here is what I did:

I added the docstring for isdisjoint to base/array.jl where the function is defined
I made up a function signature, which I put into the docstring, instead of documenting the several methods in the source
I did not edit base/doc/helpdb.jl, since this should (over time) go away
I added a .. function:: block to doc/stdlib/collections.rst, since this is where it should logically belong. I also added the text .. Docstring generated from Julia source there, but I don't know whether this was necessary.
I rebuilt Julia
I ran doc/genstdlib.jl
I repeated this several times with slight variations until the docstring appeared in the rst file.

I don't understand how I could lose documentation with this. I assume that genstdlib.jl will not touch the signatures I put in by hand, and I also assume it won't change the actual source code? What is the big, bold warning about?

Did I miss something?

tkelman · 2015-09-17T20:50:55Z

Yeah the docsystem had a major revamp recently and things are in a still-confusing state right now. Sorry.

Apparently it's dangerous to edit documentation since it could lead to losing documentation if one makes a typo.

I'm not sure I follow you here.

The "docstring body" is the documentation that you add, and gets spliced into the RST under an "autogenerated from julia source" comment.

Wouldn't it be easier to generate a reference manual by following the source code structure, and by putting all functions there to ensure no one is missing, wrong, or a ghost?

Yes, probably. There's still some awkwardness with how multiple dispatch and macro-generated methods interact with documentation, in terms of where to put docstrings and how many of them to write. There are several issues about continuing to move more of the doc system into Julia/Markdown and away from the existing sphinx manual which should eventually get rid of the awkwardness of having separate md and rst representations of the docstrings, but we're not there yet.

StefanKarpinski · 2015-09-17T20:51:24Z

Efficiency is an argument when it comes to what should or should not go in a standard library – if writing isempty(intersect(A,B)) was sufficiently fast, we wouldn't be having this discussion; it only make sense to include it because it can be implemented significantly more efficiently.

I have no position on wether the multi-argument form should check for pairwise disjointness of collective disjointness – both are valid and useful operations. I am saying that it would be wrong to have a multiargument form of isdisjoint that does either of these things since it's non-obvious which one it would do. But that's kind of unsatisfying since if you have a two-argument form, you naturally want to extend it. So I'm not sure what to do. Spitballing some possibilities:

Define isdisjoint(A,B) only and leave the multiargument verison undefined
Have ispairwisedisjoint(A...) and iscollectivelydisoint functions
Expose both bits of functionality through some other API – keywords?
Don't include this functionality in the standard library at all

eschnett · 2015-09-17T20:58:33Z

Going with Python here and punting and multiple arguments for isdisjoint.

Regarding efficiency: The standard library has a generic implementation of symdiff (and now isdisjoint), and particular datatypes can provide more specific methods if these are e.g. more efficient. Of course it's up to the respective datatypes whether this is done. However, the isempty(intersect(a, b)) construct precludes any such optimization since it forces the intersection to be calculated.

My isdisjoint implementation is reasonable efficient for many datatypes; I would expect to to be O(n log n) for a tree. However, if you e.g. store the set as ordered array, then you can achieve this in O(n) with a different algorithm.

ScottPJones · 2015-09-17T22:41:51Z

My OCDness wants those to be are instead of is! 😀

StefanKarpinski · 2015-09-17T23:11:43Z

Yeah, this is where the is* predicate naming scheme breaks down.

ScottPJones · 2015-09-18T00:52:53Z

is* is OK, if at least the common use is with a single argument, but since these are always 2 or more, could we possibly call them are*? All predicates don't start with is either, there is haskey, for example. Or else simply drop the is, i.e. disjoint(...), pairwisedisjoint(...), ..., to avoid the bad grammar.

Another thing - I was wondering if for more than 2 arguments that iteratively checking all elements of A to see if they are in B for all pairs would be rather expensive. I can vaguely remember algorithms for these sorts of things from 6.046 with Rivest and Leiserson, but that was > 30 years ago!
I was thinking, maybe for pairwise disjoint, you could take the largest of the sets, make a copy, and then, for each of the other sets, see if the key was present, return false if so, otherwise add the key, and keep testing. May be totally bogus, but might be reasonable.

eschnett · 2015-09-18T01:02:48Z

Regarding the improved algorithm: Whether this makes sense depends on the collection. The implementation here is a "reasonable" default algorithm for many kinds of collections. I expect various collections to provide specialized implementations that are more efficient. See the comment for the default implementation of symdiff.

stevengj · 2015-09-18T02:42:40Z

base/array.jl

+Determine whether the collections `v1` and `v2` are disjoint, using
+the function `in` to check.
+"""
+isdisjoint(v1, v2) = all(v -> v ∉ v1, v2)


Probably iterate over the smaller of the two sets? If one of the sets is much bigger than the other that could make a huge difference.

Also, if the point of putting it in the standard library is efficiency, probably an explicit loop would be better than a higher-order function.

eschnett · 2015-09-18T15:24:07Z

@stevengj I addressed your comments.

stevengj · 2015-09-18T16:03:49Z

base/array.jl

+"""
+function _isdisjoint(v1, v2)
+    for v in v1
+        if v in v2 return false end


These kinds of one-liners are usually written v in v2 && return false elsewhere in Base.

StefanKarpinski · 2015-09-18T17:21:22Z

I have to wonder if there isn't a better way to generically express this. It seems like we have a mixture of things here: universal or existential, pairwise versus collective, some pairwise operation on collections.

eschnett · 2015-09-18T17:53:43Z

@StefanKarpinski You mean something like allpairs(isdisjoint, vs...) and anypair(isdisjoint, vs...)? Those would be easy to implement. Of course, their generic implementation would be O(n^2) in the number of arguments given.

StefanKarpinski · 2015-09-18T17:55:38Z

The O(n^2) complexity indicates to me that such an API isn't quite right yet, but I'm not sure what's better.

eschnett · 2015-09-18T18:31:10Z

I think algorithms with better complexity must make stronger assumptions about the collections involved. That is, you want to use other operations than just in. For example, if you have access to union or setdiff or symdiff, then you can create something more efficient. For bonus points, create a way for collections to describe which such operations are available, and what complexity they have (!), and then choose an optimal algorithm for isdisjoint...

eschnett · 2015-09-22T14:39:47Z

This patch is ready to be merged.

@StefanKarpinski's comments above refer to a proposed extension of the current API (allowing more than two sets to be passed) and does not apply to the current patch. Also, the n in O(n^2) refers to the number of sets passed, not to the number of elements in the set.

vtjnash

Needs a rebase. Looks like it was consensus accepted for merging, then just got forgotten and neglected.

vtjnash · 2019-06-26T22:57:41Z

base/array.jl

+function isdisjoint(v1, v2)
+    # Iterate over the smaller set
+    # Use a function call to ensure type stability
+    length(v1)<=length(v2) ? _isdisjoint(v1,v2) : _isdisjoint(v2,v1)


Suggested change

length(v1)<=length(v2) ? _isdisjoint(v1,v2) : _isdisjoint(v2,v1)

if IteratorSize(v1) isa HasLength && IteratorSize(v2) isa HasLength && length(v2) < length(v1)

return _isdisjoint(v2, v1)

end

return _isdisjoint(v1, v2)

Also you may want to handle AbstractSets specially since it's much faster to check containment in an AbstractSet.

bicycle1885 · 2019-09-13T02:38:42Z

@eschnett Do you have time to finish this? I sometimes implement this for my own, so I wish this be in the standard library.

eschnett · 2019-09-13T03:07:10Z

@bicycle1885 Oh wow, that was four years ago... Not sure if I'll have time in the coming week.

This was added in #13192, but the PR became stale. We've also changed a bit how we write these kinds of algorithms, so make use of the tools we have (e.g. `fastin`).

Keno · 2020-01-19T02:43:28Z

Rebased in #34427.

This was added in #13192, but the PR became stale. We've also changed a bit how we write these kinds of algorithms, so make use of the tools we have (e.g. `fastin`).

eschnett force-pushed the isdisjoint branch from 70f5e76 to 2aee9ed Compare September 17, 2015 20:15

eschnett changed the title ~~Implement isdisjoint, both generically, and optimized for Set~~ Implement isdisjoint Sep 17, 2015

eschnett force-pushed the isdisjoint branch from 2aee9ed to 5609ef1 Compare September 18, 2015 01:14

stevengj reviewed Sep 18, 2015
View reviewed changes

eschnett force-pushed the isdisjoint branch from 5609ef1 to 95b4631 Compare September 18, 2015 15:23

stevengj reviewed Sep 18, 2015
View reviewed changes

Implement isdisjoint

63bcfa4

eschnett force-pushed the isdisjoint branch from 95b4631 to 63bcfa4 Compare September 18, 2015 17:55

vtjnash approved these changes Jun 26, 2019

View reviewed changes

Keno added a commit that referenced this pull request Jan 17, 2020

Add isdisjoint

35853c5

This was added in #13192, but the PR became stale. We've also changed a bit how we write these kinds of algorithms, so make use of the tools we have (e.g. `fastin`).

Keno added a commit that referenced this pull request Jan 17, 2020

Add isdisjoint

ef8f932

This was added in #13192, but the PR became stale. We've also changed a bit how we write these kinds of algorithms, so make use of the tools we have (e.g. `fastin`).

Keno mentioned this pull request Jan 18, 2020

Add isdisjoint #34427

Merged

Keno added a commit that referenced this pull request Jan 18, 2020

Add isdisjoint

3c2d1f6

This was added in #13192, but the PR became stale. We've also changed a bit how we write these kinds of algorithms, so make use of the tools we have (e.g. `fastin`).

Keno added a commit that referenced this pull request Jan 18, 2020

Add isdisjoint

f3b7768

This was added in #13192, but the PR became stale. We've also changed a bit how we write these kinds of algorithms, so make use of the tools we have (e.g. `fastin`).

Keno closed this Jan 19, 2020

Keno added a commit that referenced this pull request Jan 20, 2020

Add isdisjoint (#34427)

bc0e5d1

This was added in #13192, but the PR became stale. We've also changed a bit how we write these kinds of algorithms, so make use of the tools we have (e.g. `fastin`).

KristofferC pushed a commit that referenced this pull request Apr 11, 2020

Add isdisjoint (#34427)

53f9d1c

This was added in #13192, but the PR became stale. We've also changed a bit how we write these kinds of algorithms, so make use of the tools we have (e.g. `fastin`).

eschnett deleted the isdisjoint branch August 14, 2023 17:04

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement `isdisjoint` #13192

Implement `isdisjoint` #13192

eschnett commented Sep 17, 2015 •

edited by vtjnash

Loading

eschnett commented Sep 17, 2015

tkelman commented Sep 17, 2015

StefanKarpinski commented Sep 17, 2015

eschnett commented Sep 17, 2015

StefanKarpinski commented Sep 17, 2015

eschnett commented Sep 17, 2015

StefanKarpinski commented Sep 17, 2015

eschnett commented Sep 17, 2015

eschnett commented Sep 17, 2015

tkelman commented Sep 17, 2015

StefanKarpinski commented Sep 17, 2015

eschnett commented Sep 17, 2015

ScottPJones commented Sep 17, 2015

StefanKarpinski commented Sep 17, 2015

ScottPJones commented Sep 18, 2015

eschnett commented Sep 18, 2015

stevengj Sep 18, 2015

eschnett commented Sep 18, 2015

stevengj Sep 18, 2015

StefanKarpinski commented Sep 18, 2015

eschnett commented Sep 18, 2015

StefanKarpinski commented Sep 18, 2015

eschnett commented Sep 18, 2015

eschnett commented Sep 22, 2015

vtjnash left a comment

vtjnash Jun 26, 2019

StefanKarpinski Sep 24, 2019

bicycle1885 commented Sep 13, 2019

eschnett commented Sep 13, 2019

Keno commented Jan 19, 2020

-    length(v1)<=length(v2) ? _isdisjoint(v1,v2) : _isdisjoint(v2,v1)
+if IteratorSize(v1) isa HasLength && IteratorSize(v2) isa HasLength && length(v2) < length(v1)
+    return _isdisjoint(v2, v1)
+end
+return _isdisjoint(v1, v2)

Implement isdisjoint #13192

Implement isdisjoint #13192

Conversation

eschnett commented Sep 17, 2015 • edited by vtjnash Loading

eschnett commented Sep 17, 2015

tkelman commented Sep 17, 2015

StefanKarpinski commented Sep 17, 2015

eschnett commented Sep 17, 2015

StefanKarpinski commented Sep 17, 2015

eschnett commented Sep 17, 2015

StefanKarpinski commented Sep 17, 2015

eschnett commented Sep 17, 2015

eschnett commented Sep 17, 2015

tkelman commented Sep 17, 2015

StefanKarpinski commented Sep 17, 2015

eschnett commented Sep 17, 2015

ScottPJones commented Sep 17, 2015

StefanKarpinski commented Sep 17, 2015

ScottPJones commented Sep 18, 2015

eschnett commented Sep 18, 2015

stevengj Sep 18, 2015

Choose a reason for hiding this comment

eschnett commented Sep 18, 2015

stevengj Sep 18, 2015

Choose a reason for hiding this comment

StefanKarpinski commented Sep 18, 2015

eschnett commented Sep 18, 2015

StefanKarpinski commented Sep 18, 2015

eschnett commented Sep 18, 2015

eschnett commented Sep 22, 2015

vtjnash left a comment

Choose a reason for hiding this comment

vtjnash Jun 26, 2019

Choose a reason for hiding this comment

StefanKarpinski Sep 24, 2019

Choose a reason for hiding this comment

bicycle1885 commented Sep 13, 2019

eschnett commented Sep 13, 2019

Keno commented Jan 19, 2020

Implement `isdisjoint` #13192

Implement `isdisjoint` #13192

eschnett commented Sep 17, 2015 •

edited by vtjnash

Loading