-
Notifications
You must be signed in to change notification settings - Fork 226
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Chatterjee Correlation Coefficient (#770)
* Implement rank vector [ci skip] * Add documentation. Admittedly terrible. * Add unit tests. * Cleanup method of detecting if execution policies are valid or not [ci skip] * Implement and test chatterjee correlation [ci skip] * Add spot checks and special handling for constant Y [ci skip] * Add performance file [ci skip] * Add execution policy support to rank [ci skip] * Remove duplicates from v when generating the order vector [ci skip] * Fix macro error for use of <execution> [ci skip] * Use explicit types instead of auto to avoid warnings [ci skip] * Add execution policy testing to rank [ci skip] * Add threaded implementation [ci skip] * Added threaded testing * Fix formatting and ASCII issues in test * Fix more ASCII issues * refactoring * Fix threaded impl * Remove non-ASCII apostrophe [ci skip] * Doc fixes and add test comparing generally to paper values * Significantly tighten tolerance around expected values from paper * Change tolerance for sin comparison Co-authored-by: Nick Thompson <nathompson7@protonmail.com>
- Loading branch information
1 parent
4d5cc69
commit e5eae18
Showing
13 changed files
with
719 additions
and
15 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,74 @@ | ||
[/ | ||
Copyright 2022 Matt Borland | ||
|
||
Distributed under the Boost Software License, Version 1.0. | ||
(See accompanying file LICENSE_1_0.txt or copy at | ||
http://www.boost.org/LICENSE_1_0.txt). | ||
] | ||
|
||
[section:chatterjee_correlation Chatterjee Correlation] | ||
|
||
[heading Synopsis] | ||
|
||
`` | ||
#include <boost/math/statistics/chatterjee_correlation.hpp> | ||
|
||
namespace boost::math::statistics { | ||
|
||
C++17: | ||
template <typename ExecutionPolicy, typename Container> | ||
auto chatterjee_correlation(ExecutionPolicy&& exec, const Container& u, const Container& v); | ||
|
||
C++11: | ||
template <typename Container> | ||
auto chatterjee_correlation(const Container& u, const Container& v); | ||
} | ||
`` | ||
|
||
[heading Description] | ||
|
||
The classical correlation coefficients like the Pearson's correlation are useful primarily for distinguishing when one dataset depends linearly on another. | ||
However, Pearson's correlation coefficient has a known weakness in that when the dependent variable has an obvious functional relationship with the independent variable, the value of the correlation coefficient can take on any value. | ||
As Chatterjee says: | ||
|
||
> Ideally, one would like a coefficient that approaches | ||
its maximum value if and only if one variable looks more and more like a | ||
noiseless function of the other, just as Pearson correlation is close to its maximum value if and only if one variable is close to being a noiseless linear function of the other. | ||
|
||
This is the problem Chatterjee's coefficient solves. | ||
Let X and Y be random variables, where Y is not constant, and let (X_i, Y_i) be samples from this distribution. | ||
Rearrange these samples so that X_(0) < X_{(1)} < ... X_{(n-1)} and create (R(X_{(i)}), R(Y_{(i)})). | ||
The Chatterjee correlation is then given by | ||
|
||
[$../equations/chatterjee_correlation.svg] | ||
|
||
In the limit of an infinite amount of i.i.d data, the statistic lies in [0, 1]. | ||
However, if the data is not infinite, the statistic may be negative. | ||
If X and Y are independent, the value is zero, and if Y is a measurable function of X, then the statistic is unity. | ||
The complexity is O(n log n). | ||
|
||
An example is given below: | ||
|
||
std::vector<double> X{1,2,3,4,5}; | ||
std::vector<double> Y{1,2,3,4,5}; | ||
using boost::math::statistics::chatterjee_correlation; | ||
double coeff = chatterjee_correlation(X, Y); | ||
|
||
The implementation follows [@https://arxiv.org/pdf/1909.10140.pdf Chatterjee's paper]. | ||
|
||
/Nota bene:/ If the input is an integer type the output will be a double precision type. | ||
|
||
[heading Invariants] | ||
|
||
The function expects at least two samples, a non-constant vector Y, and the same number of X's as Y's. | ||
If Y is constant, the result is a quiet NaN. | ||
The data set must be sorted by X values. | ||
If there are ties in the values of X, then the statistic is random due to the random breaking of ties. | ||
Of course, random numbers are not used internally, but the result is not guaranteed to be identical on different systems. | ||
|
||
[heading References] | ||
|
||
* Chatterjee, Sourav. "A new coefficient of correlation." Journal of the American Statistical Association 116.536 (2021): 2009-2022. | ||
|
||
[endsect] | ||
[/section:chatterjee_correlation Chatterjee Correlation] |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
159 changes: 159 additions & 0 deletions
159
include/boost/math/statistics/chatterjee_correlation.hpp
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,159 @@ | ||
// (C) Copyright Matt Borland 2022. | ||
// Use, modification and distribution are subject to the | ||
// Boost Software License, Version 1.0. (See accompanying file | ||
// LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt) | ||
|
||
#ifndef BOOST_MATH_STATISTICS_CHATTERJEE_CORRELATION_HPP | ||
#define BOOST_MATH_STATISTICS_CHATTERJEE_CORRELATION_HPP | ||
|
||
#include <cstdint> | ||
#include <cmath> | ||
#include <algorithm> | ||
#include <iterator> | ||
#include <vector> | ||
#include <limits> | ||
#include <utility> | ||
#include <type_traits> | ||
#include <boost/math/tools/assert.hpp> | ||
#include <boost/math/tools/config.hpp> | ||
#include <boost/math/statistics/detail/rank.hpp> | ||
|
||
#ifdef BOOST_MATH_EXEC_COMPATIBLE | ||
#include <execution> | ||
#include <future> | ||
#include <thread> | ||
#endif | ||
|
||
namespace boost { namespace math { namespace statistics { | ||
|
||
namespace detail { | ||
|
||
template <typename BDIter> | ||
std::size_t chatterjee_transform(BDIter begin, BDIter end) | ||
{ | ||
std::size_t sum = 0; | ||
|
||
while(++begin != end) | ||
{ | ||
if(*begin > *std::prev(begin)) | ||
{ | ||
sum += *begin - *std::prev(begin); | ||
} | ||
else | ||
{ | ||
sum += *std::prev(begin) - *begin; | ||
} | ||
} | ||
|
||
return sum; | ||
} | ||
|
||
template <typename ReturnType, typename ForwardIterator> | ||
ReturnType chatterjee_correlation_seq_impl(ForwardIterator u_begin, ForwardIterator u_end, ForwardIterator v_begin, ForwardIterator v_end) | ||
{ | ||
using std::abs; | ||
|
||
BOOST_MATH_ASSERT_MSG(std::is_sorted(u_begin, u_end), "The x values must be sorted in order to use this functionality"); | ||
|
||
const std::vector<std::size_t> rank_vector = rank(v_begin, v_end); | ||
|
||
std::size_t sum = chatterjee_transform(rank_vector.begin(), rank_vector.end()); | ||
|
||
ReturnType result = static_cast<ReturnType>(1) - (static_cast<ReturnType>(3 * sum) / static_cast<ReturnType>(rank_vector.size() * rank_vector.size() - 1)); | ||
|
||
// If the result is 1 then Y is constant and all the elements must be ties | ||
if (abs(result - static_cast<ReturnType>(1)) < std::numeric_limits<ReturnType>::epsilon()) | ||
{ | ||
return std::numeric_limits<ReturnType>::quiet_NaN(); | ||
} | ||
|
||
return result; | ||
} | ||
|
||
} // Namespace detail | ||
|
||
template <typename Container, typename Real = typename Container::value_type, | ||
typename ReturnType = typename std::conditional<std::is_integral<Real>::value, double, Real>::type> | ||
inline ReturnType chatterjee_correlation(const Container& u, const Container& v) | ||
{ | ||
return detail::chatterjee_correlation_seq_impl<ReturnType>(std::begin(u), std::end(u), std::begin(v), std::end(v)); | ||
} | ||
|
||
}}} // Namespace boost::math::statistics | ||
|
||
#ifdef BOOST_MATH_EXEC_COMPATIBLE | ||
|
||
namespace boost::math::statistics { | ||
|
||
namespace detail { | ||
|
||
template <typename ReturnType, typename ExecutionPolicy, typename ForwardIterator> | ||
ReturnType chatterjee_correlation_par_impl(ExecutionPolicy&& exec, ForwardIterator u_begin, ForwardIterator u_end, | ||
ForwardIterator v_begin, ForwardIterator v_end) | ||
{ | ||
using std::abs; | ||
BOOST_MATH_ASSERT_MSG(std::is_sorted(std::forward<ExecutionPolicy>(exec), u_begin, u_end), "The x values must be sorted in order to use this functionality"); | ||
|
||
auto rank_vector = rank(std::forward<ExecutionPolicy>(exec), v_begin, v_end); | ||
|
||
const auto num_threads = std::thread::hardware_concurrency() == 0 ? 2u : std::thread::hardware_concurrency(); | ||
std::vector<std::future<std::size_t>> future_manager {}; | ||
const auto elements_per_thread = std::ceil(static_cast<double>(rank_vector.size()) / num_threads); | ||
|
||
auto it = rank_vector.begin(); | ||
auto end = rank_vector.end(); | ||
for(std::size_t i {}; i < num_threads - 1; ++i) | ||
{ | ||
future_manager.emplace_back(std::async(std::launch::async | std::launch::deferred, [it, elements_per_thread]() -> std::size_t | ||
{ | ||
return chatterjee_transform(it, std::next(it, elements_per_thread)); | ||
})); | ||
it = std::next(it, elements_per_thread - 1); | ||
} | ||
|
||
future_manager.emplace_back(std::async(std::launch::async | std::launch::deferred, [it, end]() -> std::size_t | ||
{ | ||
return chatterjee_transform(it, end); | ||
})); | ||
|
||
std::size_t sum {}; | ||
for(std::size_t i {}; i < future_manager.size(); ++i) | ||
{ | ||
sum += future_manager[i].get(); | ||
} | ||
|
||
ReturnType result = static_cast<ReturnType>(1) - (static_cast<ReturnType>(3 * sum) / static_cast<ReturnType>(rank_vector.size() * rank_vector.size() - 1)); | ||
|
||
// If the result is 1 then Y is constant and all the elements must be ties | ||
if (abs(result - static_cast<ReturnType>(1)) < std::numeric_limits<ReturnType>::epsilon()) | ||
{ | ||
return std::numeric_limits<ReturnType>::quiet_NaN(); | ||
} | ||
|
||
return result; | ||
} | ||
|
||
} // Namespace detail | ||
|
||
template <typename ExecutionPolicy, typename Container, typename Real = typename Container::value_type, | ||
typename ReturnType = std::conditional_t<std::is_integral_v<Real>, double, Real>> | ||
inline ReturnType chatterjee_correlation(ExecutionPolicy&& exec, const Container& u, const Container& v) | ||
{ | ||
if constexpr (std::is_same_v<std::remove_reference_t<decltype(exec)>, decltype(std::execution::seq)>) | ||
{ | ||
return detail::chatterjee_correlation_seq_impl<ReturnType>(std::cbegin(u), std::cend(u), | ||
std::cbegin(v), std::cend(v)); | ||
} | ||
else | ||
{ | ||
return detail::chatterjee_correlation_par_impl<ReturnType>(std::forward<ExecutionPolicy>(exec), | ||
std::cbegin(u), std::cend(u), | ||
std::cbegin(v), std::cend(v)); | ||
} | ||
} | ||
|
||
} // Namespace boost::math::statistics | ||
|
||
#endif | ||
|
||
#endif // BOOST_MATH_STATISTICS_CHATTERJEE_CORRELATION_HPP |
Oops, something went wrong.