-
Notifications
You must be signed in to change notification settings - Fork 194
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make pairwise
multi-threaded and make corkendall
and corspearman
wrap pairwise
.
#923
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really great. Thanks for contributing such a PR. I'm happy with it but lets keep if open for a couple of days to see if anybody else would like to comment.
src/rankcorr.jl
Outdated
Xjrank = tiedrank(Xj) | ||
C[j,1] = cor(Xjrank, yrank) | ||
function corspearman(x::AbstractMatrix, y::AbstractVector; skipmissing::Symbol=:none) | ||
return corspearman(x, reshape(y, (length(y), 1)); skipmissing) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the 1.0 tests are failing because of this way of passing the keyword argument. I'd be okay with bumping the minimum version for StatsBase since 1.0 is ancient but maybe it's better to just make it skipmissing=skipmissing
here for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When writing and testing the code (in KendallTau) I was unaware that support for Julia 1.0.5 was necessary, so the test failures when I submitted the PR came as a surprise. Changing how keyword arguments are handled would be easy, as would avoiding trailing backslashes within string literals. But the fact that 1.0.5 does not support eachcol
is a bit more difficult. eachcol
was added in 1.1. For the time being I will work on responding to @devmotion's comments and wait for your guidance as to whether bumping the minimum version might happen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've done some more investigation:
Other popular packages (the compat section of Project.toml) have the following minimum versions:
1.9.4 for Statistics
1.6 for Plots, DataFrames, Revise, Optim, ForwardDiff, OhMyREPL, CSV
1.3 for Distributions
1 for StatsBase
So upping the minimum version for StatsBase seems sensible.
But as I remarked above, eachcol
is crucial to how corkendall
and corspearman
are mere wrappers to pairwise
. eachcol
was introduced in Julia 1.1, so I imagined that upping the minimum Julia version for StatsBase to 1.1 would suffice. But no - the code appears to require Julia 1.9.4, and as of now does not work with Julia 1.8.5. For example the method check_vectors
fails because:
# Julia 1.9.4
julia> keys((eachcol([1 2;3 4]))[2])
2-element LinearIndices{1, Tuple{Base.OneTo{Int64}}}:
1
2
# Julia 1.8.5
julia> keys((eachcol([1 2;3 4]))[2])
ERROR: MethodError: no method matching getindex(::Base.Generator{Base.OneTo{Int64}, Base.var"#242#243"{Matrix{Int64}}}, ::Int64)
Stacktrace:
[1] top-level scope
@ REPL[1]:1
So the alternatives are:
- A. We increase the minimum version to 1.9.4, which happens to be that for Statistics, but is more modern than the current LTS.
- B. I find a workaround so that this PR requires either no increase in the minimum version number or an increase to 1.6 which is more in line with other popular packages.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's now the case that increasing the minimum version to 1.6 would work. See here.
src/pairwise.jl
Outdated
#cor and friends are special-cased. | ||
iscor = (f in (corkendall, corspearman, cor)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is a bit unfortunate that this means other functions can't benefit from the improvements (?) since it is not extendable. Maybe an additional argument would be better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can I check that you've understood the purpose of the iscor
variable (which might be better named diag_is_1
)?
It's a Boolean that answers the question "Does x === y
imply f(x,y) == 1.0
, irrespective of NaN
, missing
and Inf
entries, and that therefore there is a speedup to be had by not bothering to call f
when x[i] === y[i]
".
The upshot is that if f
was some new correlation estimator for which the answer to the question above was "yes" but where we failed to update line 15 then we'd miss out on the "diagonal elements are near-zero-cost to calculate" but still have the benefit of multithreading. Of course, an additional argument is possible, but it makes the interface to pairwise
a bit more complicated.
There's also the fact that the author of funky_cor
could simply include the line x===y && return 1.0
in its definition, which remark has got me wondering whether I should do that for corspearman
and corkendall
...
src/pairwise.jl
Outdated
[26, 25, 16, 15, 6, 5] | ||
``` | ||
""" | ||
function equal_sum_subsets(n::Int, num_subsets::Int)::Vector{Vector{Int}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In case this function is kept (see comments above), is it necessary to return a vector of vectors or could one return an iterator of vectors instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm reasonably sure that equal_sum_subsets
works well in the sense of doing a good job of load balancing. The only reason it currently returns a vector of vectors is that's what I know how to do, so I would need to read up on iterators.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In commit a1d9ae9 I have replaced the function equal_sum_subsets
with an iterator EqualSumSubsets
whose element type is another iterator TwoStepRange
. It seems to work well and leads to a small reduction in allocations. But please review as this is the first time I've written code to create an iterator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, didn't work on 32 bit. Maybe I need to replace use of Int64 with Int?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Commit 951b391 fixes EqualSumSubsets
on 32-bit.
No longer use eachcol when VERSION < v"1.9"
I've now made changes so that Would this meet with approval from the StatsBase maintainers? |
The failed checks after commit [46d5655] are all in the codecov step, with error:
IIUC (this page) I need to wait and try again. |
It's been a while... May I ask why this PR has not been merged? Is it: |
I know people have many obligations and PRs/emails get ignored sometimes, but this is a bit of an unfortunate situation. @PGS62 was rebuffed from registering a new package, but now this PR sits languishing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay! This is a significant refactoring so there are many things to discuss. My main remarks are (see details in inline comments):
- I think the threading strategy can be simplified a lot. Task local storage shouldn't be needed (it's almost never the case), using
@spawn
is the recommended pattern and it composes with nested threaded calls contrary to@threads
. Also I wonder whether we can find a simpler solution of splitting indices across tasks thanEqualSumSubsets
(see here). - I wouldn't add a
skipmissing
argument tocorkendall
andcorspeaman
, as this makes the code significantly more verbose and the signature inconsistent withcor
. We havepairwise
for that. (And anyway it's better to make separate PRs for distinct features as in my experience large PRs are more likely to become unmanageable and stall.) - There are lots of unrelated formatting changes, please revert these so that the diff is cleaner and easier to review. Also in the changed code the style isn't always consistent with StatsBase's (in particular, arguments should be aligned in method definitions, no empty initial lines, space after
#
in comments).
|
||
#cor and friends are special-cased. | ||
diag_is_1 = (f in (corkendall, corspearman, cor)) | ||
(diag_is_1 || f == cov) && (symmetric = x === y) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(diag_is_1 || f == cov) && (symmetric = x === y) | |
if (diag_is_1 || f === cov) && x === y | |
symmetric = true | |
end |
diag_is_1 = (f in (corkendall, corspearman, cor)) | ||
(diag_is_1 || f == cov) && (symmetric = x === y) | ||
#cov(x) is faster than cov(x, x) | ||
(f == cov) && (f = ((x, y) -> x === y ? cov(x) : cov(x, y))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(f == cov) && (f = ((x, y) -> x === y ? cov(x) : cov(x, y))) | |
if f === cov | |
f = (x, y) -> x === y ? cov(x) : cov(x, y) | |
end |
end | ||
end | ||
symmetric && LinearAlgebra.copytri!(dest, 'L') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAICT this is an internal function, better copy using an explicit loop as before. Anyway that function just does that, without any particular optimization.
(f == cov) && (f = ((x, y) -> x === y ? cov(x) : cov(x, y))) | ||
|
||
Threads.@threads for subset in EqualSumSubsets(nc, Threads.nthreads()) | ||
for j in subset |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for j in subset | |
@inbounds for j in subset |
function _pairwise!(::Val{:none}, f, dest::AbstractMatrix{V}, x, y, | ||
symmetric::Bool) where {V} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
function _pairwise!(::Val{:none}, f, dest::AbstractMatrix{V}, x, y, | |
symmetric::Bool) where {V} | |
function _pairwise!(::Val{:none}, f, dest::AbstractMatrix{V}, x, y, | |
symmetric::Bool) where {V} |
# eachcol was added in Julia 1.1 but for Julia 1.8, keys(eachcol(x)) fails for any Matrix x | ||
# which causes `check_vectors` to fail. Solution is to use the comprehension below. | ||
_eachcol(x) = VERSION < v"1.9" ? [view(x, :, i) for i in axes(x, 2)] : eachcol(x) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I though check_vectors
allowed any iterator, as the docstring for pairwise
indicates. It doesn't seem to call keys
on the input. Things like pairwise(cor, eachcol(rand(4, 5)), eachcol(rand(4, 5)))
work on Julia 1.6.
Anyway even if it works around an error, this function doesn't fix it when users pass eachcol(...)
so we need a better solution.
# _isnan required so that corkendall and corspearman have correct handling of NaNs and | ||
# can also accept arguments with element type for which isnan is not defined but isless is | ||
# is defined, so that rank correlation makes sense. | ||
_isnan(x::T) where {T<:Number} = isnan(x) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_isnan(x::T) where {T<:Number} = isnan(x) | |
_isnan(x::Number) = isnan(x) |
end | ||
|
||
if missing isa T || missing isa V | ||
sortedx, permutedy = handle_pairwise(sortedx, scratch_py; scratch_fx=scratch_fx, scratch_fy=scratch_fy) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This assumes pairwise deletion but what about skipmissing=:listwise
?
#cov(x) is faster than cov(x, x) | ||
(f == cov) && (f = ((x, y) -> x === y ? cov(x) : cov(x, y))) | ||
|
||
Threads.@threads for subset in EqualSumSubsets(nc, Threads.nthreads()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Continuing #923 (comment))
This approach is clever, but I also wonder whether it wouldn't be possible to achieve the same result with less code. If you identify a cell in the upper triangle using an integer index, you can just split these equally across threads. Then you can compute back what row and column the integer index corresponds to (e.g. via i, j = Tuple(CartesianIndices(dest))
). If that's simpler you could even attribute indices from the full table to tasks, and skip those from the lower triangle from inside tasks. Am I missing something?
It could also make sense to spawn more tasks than Threads.nthreads
when the input matrix is really large. Indeed, we don't have any guarantee that there are Threads.nthreads()
idle threads. We may be operating within a context which is already threaded, and for example one thread, or two thirds of the threads may already be busy; another app may also be using cores. So when we're sure that the overhead of spawning threads is small compared to the total cost of the operation it's worth spawning lots of tasks that can then be attributed to idle threads as appropriate. This is what we do in DataFrames.jl, with a threshold so that we spawn a new task for each batch of N rows.
My thanks to @nalimilan for the response! As it happens I've been travelling for the last three weeks with almost no internet (Nepal), but I will reply as soon as I can. |
This pull request follows up on #849. In that discussion I envisaged amending
corkendall
to be multi-threaded. Instead I have done the following:pairwise
andpairwise!
multi-threaded, and in the case ofskipmissing = :pairwise
, greatly reduced allocations which is needed for good multi-threaded performance. A nice illustration of the performance improvements this leads to is obtained withf = LinearAlgebra.dot
. The example here shows a 31.5 times speedup on a 12 core PC.corkendall
now simply wrapspairwise
but with added methods of_pairwise!
fortypeof(corkendall)
to retain the speedups tocorkendall
that were a result of Speeding up Kendall Correlationcorkendall
#634.corspearman
also wrapspairwise
with added methods of_pairwise!
fortypeof(corspearman)
. The purpose of these added methods is to reduce the number of calls totiedrank
from2nm
to2(n+m)
whenn
andm
are respectively the number of columns in the argumentsx
andy
to the call tocorspearman(x,y...
.Since
corkendall
andcorspearman
now wrappairwise
they have askipmissing
argument, which introduces an additional inconsistency betweencorkandall
/corspearman
andcov
/cor
.I developed the new code in my package KendallTau. Please see the README file for further examples of the performance improvements achieved, which for all but the smallest input data exceed the number of cores on the PC (thanks to reduced allocations).
I have amended the tests as needed, and I believe that test coverage is as close to 100% as possible.
I apologise that there is quite a lot for maintainers to review, and I'll be happy to answer questions.