Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement isdisjoint #13192

Closed
wants to merge 1 commit into from
Closed

Implement isdisjoint #13192

wants to merge 1 commit into from

Conversation

eschnett
Copy link
Contributor

@eschnett eschnett commented Sep 17, 2015

This implements the functionality. Which files do I need to change manually to document it?

fixes #13189

@eschnett
Copy link
Contributor Author

Addresses #13189.

@tkelman
Copy link
Contributor

tkelman commented Sep 17, 2015

See https://github.com/JuliaLang/julia/blob/master/CONTRIBUTING.md#improving-documentation - docstring either in helpdb or inline, and add the signature to the rst docs so doc/genstdlib.jl populates it properly

@StefanKarpinski
Copy link
Member

I feel like there has to be a much more efficient way to implement this. Checking, for each element in the first set, whether it is in the second set seems like it would be more efficient than constructing the intersection. This appears to be checking for pairwise disjointness, which is not obviously the only correct meaning for n-ary disjointness. A definition equivalent to isempty(intersect(S...)) would make sense too.

@eschnett
Copy link
Contributor Author

@StefanKarpinski An implementation based on in -- yes, that's feasible.

A definition equivalent to isempty(intersect(S...)): what would this be called? I can't think of a name for it. The current definition is called "mutually disjoint".

@StefanKarpinski
Copy link
Member

It's standard in mathematical texts to explicitly say "pairwise disjoint" or "collectively disjoint" to distinguish these two cases. I would argue that the term "disjoint" by itself is ambiguous for more than two arguments.

@eschnett
Copy link
Contributor Author

@StefanKarpinski Is this a generic comment, or do you suggest to rename the function to is_pairwise_disjoint? Or do you suggest to change its behaviour to mean "at least two of the sets are disjoint"?

@StefanKarpinski
Copy link
Member

ispairwisedisjoint is kind of a mouthful, but I do think that multi-argument isdisjoint is not good. Both pairwise disjointnes and collective disjointness are interesting things that people might want to check for and which can probably be done more efficiently with a little bit of cleverness.

@eschnett
Copy link
Contributor Author

Efficiency is not an argument; these are generic implementations that can be overwritten for specific datatypes, for example if elements are kept in a certain order. For example, the existing generic implementation of symdiff states:

# symdiff is associative, so a relatively clean
# way to implement this is by using setdiff and union, and
# recursing. Has the advantage of keeping order, too, but
# not as fast as other methods that make a single pass and
# store counts with a Dict.

It seems you're arguing to omit the multi-argument version of isdisjoint.

@eschnett
Copy link
Contributor Author

@tkelman I didn't understand the text in CONTRIBUTING.md. Apparently it's dangerous to edit documentation since it could lead to losing documentation if one makes a typo. There's also a distinction between changing the "docstring signature" and "docstring body", and things have to be done in a particular order; however, it doesn't say how to initially add a docstring. There seem to be "definitions" and "autogenerated content", but from reading this text I don't know which is which.

It seems that the structure of the documentation is completely decoupled from the structure of the source code. This is strange. Wouldn't it be easier to generate a reference manual by following the source code structure, and by putting all functions there to ensure no one is missing, wrong, or a ghost?

I was trying to follow e.g. the definition of union. This function is defined in array.jl as union(vs...). It is documented in stdlib/collections.rst as union(s1,s2...).

So here is what I did:

  • I added the docstring for isdisjoint to base/array.jl where the function is defined
  • I made up a function signature, which I put into the docstring, instead of documenting the several methods in the source
  • I did not edit base/doc/helpdb.jl, since this should (over time) go away
  • I added a .. function:: block to doc/stdlib/collections.rst, since this is where it should logically belong. I also added the text .. Docstring generated from Julia source there, but I don't know whether this was necessary.
  • I rebuilt Julia
  • I ran doc/genstdlib.jl
  • I repeated this several times with slight variations until the docstring appeared in the rst file.

I don't understand how I could lose documentation with this. I assume that genstdlib.jl will not touch the signatures I put in by hand, and I also assume it won't change the actual source code? What is the big, bold warning about?

Did I miss something?

@eschnett eschnett changed the title Implement isdisjoint, both generically, and optimized for Set Implement isdisjoint Sep 17, 2015
@tkelman
Copy link
Contributor

tkelman commented Sep 17, 2015

Yeah the docsystem had a major revamp recently and things are in a still-confusing state right now. Sorry.

Apparently it's dangerous to edit documentation since it could lead to losing documentation if one makes a typo.

I'm not sure I follow you here.

The "docstring body" is the documentation that you add, and gets spliced into the RST under an "autogenerated from julia source" comment.

Wouldn't it be easier to generate a reference manual by following the source code structure, and by putting all functions there to ensure no one is missing, wrong, or a ghost?

Yes, probably. There's still some awkwardness with how multiple dispatch and macro-generated methods interact with documentation, in terms of where to put docstrings and how many of them to write. There are several issues about continuing to move more of the doc system into Julia/Markdown and away from the existing sphinx manual which should eventually get rid of the awkwardness of having separate md and rst representations of the docstrings, but we're not there yet.

@StefanKarpinski
Copy link
Member

Efficiency is an argument when it comes to what should or should not go in a standard library – if writing isempty(intersect(A,B)) was sufficiently fast, we wouldn't be having this discussion; it only make sense to include it because it can be implemented significantly more efficiently.

I have no position on wether the multi-argument form should check for pairwise disjointness of collective disjointness – both are valid and useful operations. I am saying that it would be wrong to have a multiargument form of isdisjoint that does either of these things since it's non-obvious which one it would do. But that's kind of unsatisfying since if you have a two-argument form, you naturally want to extend it. So I'm not sure what to do. Spitballing some possibilities:

  1. Define isdisjoint(A,B) only and leave the multiargument verison undefined
  2. Have ispairwisedisjoint(A...) and iscollectivelydisoint functions
  3. Expose both bits of functionality through some other API – keywords?
  4. Don't include this functionality in the standard library at all

@eschnett
Copy link
Contributor Author

Going with Python here and punting and multiple arguments for isdisjoint.

Regarding efficiency: The standard library has a generic implementation of symdiff (and now isdisjoint), and particular datatypes can provide more specific methods if these are e.g. more efficient. Of course it's up to the respective datatypes whether this is done. However, the isempty(intersect(a, b)) construct precludes any such optimization since it forces the intersection to be calculated.

My isdisjoint implementation is reasonable efficient for many datatypes; I would expect to to be O(n log n) for a tree. However, if you e.g. store the set as ordered array, then you can achieve this in O(n) with a different algorithm.

@ScottPJones
Copy link
Contributor

My OCDness wants those to be are instead of is! 😀

@StefanKarpinski
Copy link
Member

Yeah, this is where the is* predicate naming scheme breaks down.

@ScottPJones
Copy link
Contributor

is* is OK, if at least the common use is with a single argument, but since these are always 2 or more, could we possibly call them are*? All predicates don't start with is either, there is haskey, for example. Or else simply drop the is, i.e. disjoint(...), pairwisedisjoint(...), ..., to avoid the bad grammar.

Another thing - I was wondering if for more than 2 arguments that iteratively checking all elements of A to see if they are in B for all pairs would be rather expensive. I can vaguely remember algorithms for these sorts of things from 6.046 with Rivest and Leiserson, but that was > 30 years ago!
I was thinking, maybe for pairwise disjoint, you could take the largest of the sets, make a copy, and then, for each of the other sets, see if the key was present, return false if so, otherwise add the key, and keep testing. May be totally bogus, but might be reasonable.

@eschnett
Copy link
Contributor Author

Regarding the improved algorithm: Whether this makes sense depends on the collection. The implementation here is a "reasonable" default algorithm for many kinds of collections. I expect various collections to provide specialized implementations that are more efficient. See the comment for the default implementation of symdiff.

Determine whether the collections `v1` and `v2` are disjoint, using
the function `in` to check.
"""
isdisjoint(v1, v2) = all(v -> v ∉ v1, v2)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably iterate over the smaller of the two sets? If one of the sets is much bigger than the other that could make a huge difference.

Also, if the point of putting it in the standard library is efficiency, probably an explicit loop would be better than a higher-order function.

@eschnett
Copy link
Contributor Author

@stevengj I addressed your comments.

"""
function _isdisjoint(v1, v2)
for v in v1
if v in v2 return false end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These kinds of one-liners are usually written v in v2 && return false elsewhere in Base.

@StefanKarpinski
Copy link
Member

I have to wonder if there isn't a better way to generically express this. It seems like we have a mixture of things here: universal or existential, pairwise versus collective, some pairwise operation on collections.

@eschnett
Copy link
Contributor Author

@StefanKarpinski You mean something like allpairs(isdisjoint, vs...) and anypair(isdisjoint, vs...)? Those would be easy to implement. Of course, their generic implementation would be O(n^2) in the number of arguments given.

@StefanKarpinski
Copy link
Member

The O(n^2) complexity indicates to me that such an API isn't quite right yet, but I'm not sure what's better.

@eschnett
Copy link
Contributor Author

I think algorithms with better complexity must make stronger assumptions about the collections involved. That is, you want to use other operations than just in. For example, if you have access to union or setdiff or symdiff, then you can create something more efficient. For bonus points, create a way for collections to describe which such operations are available, and what complexity they have (!), and then choose an optimal algorithm for isdisjoint...

@eschnett
Copy link
Contributor Author

This patch is ready to be merged.

@StefanKarpinski's comments above refer to a proposed extension of the current API (allowing more than two sets to be passed) and does not apply to the current patch. Also, the n in O(n^2) refers to the number of sets passed, not to the number of elements in the set.

Copy link
Member

@vtjnash vtjnash left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs a rebase. Looks like it was consensus accepted for merging, then just got forgotten and neglected.

function isdisjoint(v1, v2)
# Iterate over the smaller set
# Use a function call to ensure type stability
length(v1)<=length(v2) ? _isdisjoint(v1,v2) : _isdisjoint(v2,v1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
length(v1)<=length(v2) ? _isdisjoint(v1,v2) : _isdisjoint(v2,v1)
if IteratorSize(v1) isa HasLength && IteratorSize(v2) isa HasLength && length(v2) < length(v1)
return _isdisjoint(v2, v1)
end
return _isdisjoint(v1, v2)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also you may want to handle AbstractSets specially since it's much faster to check containment in an AbstractSet.

@bicycle1885
Copy link
Member

@eschnett Do you have time to finish this? I sometimes implement this for my own, so I wish this be in the standard library.

@eschnett
Copy link
Contributor Author

@bicycle1885 Oh wow, that was four years ago... Not sure if I'll have time in the coming week.

Keno added a commit that referenced this pull request Jan 17, 2020
This was added in #13192, but the PR became stale. We've also changed
a bit how we write these kinds of algorithms, so make use of the tools
we have (e.g. `fastin`).
Keno added a commit that referenced this pull request Jan 17, 2020
This was added in #13192, but the PR became stale. We've also changed
a bit how we write these kinds of algorithms, so make use of the tools
we have (e.g. `fastin`).
@Keno Keno mentioned this pull request Jan 18, 2020
Keno added a commit that referenced this pull request Jan 18, 2020
This was added in #13192, but the PR became stale. We've also changed
a bit how we write these kinds of algorithms, so make use of the tools
we have (e.g. `fastin`).
Keno added a commit that referenced this pull request Jan 18, 2020
This was added in #13192, but the PR became stale. We've also changed
a bit how we write these kinds of algorithms, so make use of the tools
we have (e.g. `fastin`).
@Keno
Copy link
Member

Keno commented Jan 19, 2020

Rebased in #34427.

@Keno Keno closed this Jan 19, 2020
Keno added a commit that referenced this pull request Jan 20, 2020
This was added in #13192, but the PR became stale. We've also changed
a bit how we write these kinds of algorithms, so make use of the tools
we have (e.g. `fastin`).
KristofferC pushed a commit that referenced this pull request Apr 11, 2020
This was added in #13192, but the PR became stale. We've also changed
a bit how we write these kinds of algorithms, so make use of the tools
we have (e.g. `fastin`).
@eschnett eschnett deleted the isdisjoint branch August 14, 2023 17:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Set function to check for intersection?
8 participants