Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change corkendall to be multi-threaded and handle missing values #849

Open
PGS62 opened this issue Feb 16, 2023 · 2 comments
Open

Change corkendall to be multi-threaded and handle missing values #849

PGS62 opened this issue Feb 16, 2023 · 2 comments

Comments

@PGS62
Copy link
Contributor

PGS62 commented Feb 16, 2023

Back in Feb 2021, with help from @nalimilan, I re-wrote corkendall to improve its speed by a factor of about seven (see # 634).

More recently I have been working on a multi-threaded version, and have released a version which I'd like to propose should replace corkendall here in StatsBase. As well as being multi-threaded, the new version has support for missing values, though it does not do this via pairwise.

The speed up is broadly equal to the number of cores available. No speed up is achievable if both x and y are vectors.

Could I request that a StatsBase maintainer looks at the code and lets me know their thoughts?

https://github.com/PGS62/KendallTau.jl/blob/main/src/corkendall.jl

Philip

julia> using StatsBase,KendallTau,Random #StatsBase v0.33.21
[ Info: Precompiling KendallTau [aff0c2a8-f755-4235-a8c7-7336d8be0b73]

julia> x = rand(1000,10);StatsBase.corkendall(x)==KendallTau.corkendall(x)#compile
true

julia> x = rand(1000,1000);

julia> @time res_sb = StatsBase.corkendall(x);
 21.393938 seconds (3.00 M allocations: 17.082 GiB, 5.36% gc time)

julia> @time res_kt = KendallTau.corkendall(x);
  1.780313 seconds (2.28 k allocations: 8.876 MiB, 0.14% gc time)
  
julia> 21.393938/1.780313
12.016953198679108

julia> res_sb == res_kt
true

julia> Threads.nthreads()#12 cores, 20 logical processors
20
@nalimilan
Copy link
Member

Sorry for the delay. Sounds interesting. A few questions/remarks:

  • How does this interact with Improve corspearman performance (Matrix & Matrix-Matrix cases) #850? AFAICT the code mentioned here does not benefit from this optimization, meaning it's slower when there's a single thread, right?
  • Isn't there a slowdown when the number of rows or columns is small? Generally, spawning tasks has a significant overhead.
  • I'd rather avoid adding a skipmissing argument to corspearman and corkendall, as then for consistency we would need to add it to cor, cov, var, etc. Instead I think you can define specialized methods for pairwise(::typeof(corspearman), ...) and so on. Anyway, it's probably better to handle this in a separate PR to keep things simple.

@PGS62
Copy link
Contributor Author

PGS62 commented Mar 13, 2023

Thanks for your questions.

  • There's basically no interaction between this issue and Improve corspearman performance (Matrix & Matrix-Matrix cases) #850 . I noticed that corspearman (for which there's no handling of missing values and no use of threads) is inefficient since it makes more calls than needs be to tiedrank. Fixing that, while being careful to maintain the desired handling of nans, was straightforward, and I think performance is improved for all possible inputs. The amended corspearman does not use threads or handle missings.
  • On the question of slowdown for small number of rows\columns in the output, I investigated this for the case of x having 1,000 rows. My finding was that the threaded version is slower if x has two columns (2 by 2 output correlation matrix) and faster if x has three or more columns. I think that indicates that spawning tasks is a manageable overhead compared with sorting a 1,000 element array. This finding is illustrated in the plot here. I expect the situation would be different when x has substantially fewer rows, but I've not quantified that.
  • I understand your point about adding a skipmissing argument. On the other hand, a concern I have about pairwise is that it's not very discoverable for a new user of StatsBase. Perhaps that could be fixed with a brief "See also" mention of pairwise in the documentation of corkendall and friends?

This is how we might proceed from here:

  1. I raise a PR to amend pairwise so that it works with the current implementation both corspearman and corkendall (not the case today). What I have in mind is to make it so that when _pairwise! calls f it passes arguments (xnm and ynm) of type AbstractVector{<:Real} as opposed to the current set up where the vectors passed have an element type which permits missings.

  2. Once PR 1 is merged, I raise a second PR that exposes a new version of corkendall that uses threads (and no skipmissing argument) and also contains a specialised method pairwise(::typeof(corkendall), ...).

Does that sound sensible?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants