Chatterjee Correlation Coefficient (#770)

* Implement rank vector [ci skip] * Add documentation. Admittedly terrible. * Add unit tests. * Cleanup method of detecting if execution policies are valid or not [ci skip] * Implement and test chatterjee correlation [ci skip] * Add spot checks and special handling for constant Y [ci skip] * Add performance file [ci skip] * Add execution policy support to rank [ci skip] * Remove duplicates from v when generating the order vector [ci skip] * Fix macro error for use of <execution> [ci skip] * Use explicit types instead of auto to avoid warnings [ci skip] * Add execution policy testing to rank [ci skip] * Add threaded implementation [ci skip] * Added threaded testing * Fix formatting and ASCII issues in test * Fix more ASCII issues * refactoring * Fix threaded impl * Remove non-ASCII apostrophe [ci skip] * Doc fixes and add test comparing generally to paper values * Significantly tighten tolerance around expected values from paper * Change tolerance for sin comparison Co-authored-by: Nick Thompson <nathompson7@protonmail.com>
boostorg · May 25, 2022 · e5eae18 · e5eae18
1 parent 4d5cc69
commit e5eae18
Show file tree

Hide file tree

Showing 13 changed files with 719 additions and 15 deletions.
diff --git a/doc/statistics/chatterjee_correlation.qbk b/doc/statistics/chatterjee_correlation.qbk
@@ -0,0 +1,74 @@
+[/
+  Copyright 2022 Matt Borland
+
+  Distributed under the Boost Software License, Version 1.0.
+  (See accompanying file LICENSE_1_0.txt or copy at
+  http://www.boost.org/LICENSE_1_0.txt).
+]
+
+[section:chatterjee_correlation Chatterjee Correlation]
+
+[heading Synopsis]
+
+``
+#include <boost/math/statistics/chatterjee_correlation.hpp>
+
+namespace boost::math::statistics {
+
+    C++17:
+    template <typename ExecutionPolicy, typename Container>
+    auto chatterjee_correlation(ExecutionPolicy&& exec, const Container& u, const Container& v);
+
+    C++11:
+    template <typename Container>
+    auto chatterjee_correlation(const Container& u, const Container& v);
+}
+``
+
+[heading Description]
+
+The classical correlation coefficients like the Pearson's correlation are useful primarily for distinguishing when one dataset depends linearly on another.
+However, Pearson's correlation coefficient has a known weakness in that when the dependent variable has an obvious functional relationship with the independent variable, the value of the correlation coefficient can take on any value.
+As Chatterjee says:
+
+> Ideally, one would like a coefficient that approaches
+its maximum value if and only if one variable looks more and more like a
+noiseless function of the other, just as Pearson correlation is close to its maximum value if and only if one variable is close to being a noiseless linear function of the other.
+
+This is the problem Chatterjee's coefficient solves.
+Let X and Y be random variables, where Y is not constant, and let (X_i, Y_i) be samples from this distribution.
+Rearrange these samples so that X_(0) < X_{(1)} < ... X_{(n-1)} and create (R(X_{(i)}), R(Y_{(i)})).
+The Chatterjee correlation is then given by
+
+[$../equations/chatterjee_correlation.svg]
+
+In the limit of an infinite amount of i.i.d data, the statistic lies in [0, 1].
+However, if the data is not infinite, the statistic may be negative.
+If X and Y are independent, the value is zero, and if Y is a measurable function of X, then the statistic is unity.
+The complexity is O(n log n).
+
+An example is given below:
+
+    std::vector<double> X{1,2,3,4,5};
+    std::vector<double> Y{1,2,3,4,5};
+    using boost::math::statistics::chatterjee_correlation;
+    double coeff = chatterjee_correlation(X, Y);
+
+The implementation follows [@https://arxiv.org/pdf/1909.10140.pdf Chatterjee's paper].
+
+/Nota bene:/ If the input is an integer type the output will be a double precision type.
+
+[heading Invariants]
+
+The function expects at least two samples, a non-constant vector Y, and the same number of X's as Y's.
+If Y is constant, the result is a quiet NaN.
+The data set must be sorted by X values.
+If there are ties in the values of X, then the statistic is random due to the random breaking of ties.
+Of course, random numbers are not used internally, but the result is not guaranteed to be identical on different systems.
+
+[heading References]
+
+* Chatterjee, Sourav. "A new coefficient of correlation." Journal of the American Statistical Association 116.536 (2021): 2009-2022.
+
+[endsect]
+[/section:chatterjee_correlation Chatterjee Correlation]
diff --git a/include/boost/math/statistics/bivariate_statistics.hpp b/include/boost/math/statistics/bivariate_statistics.hpp
@@ -18,13 +18,10 @@
 #include <boost/math/tools/assert.hpp>
 #include <boost/math/tools/config.hpp>
 
-// Support compilers with P0024R2 implemented without linking TBB
-// https://en.cppreference.com/w/cpp/compiler_support
-#if !defined(BOOST_NO_CXX17_HDR_EXECUTION) && defined(BOOST_HAS_THREADS)
+#ifdef BOOST_MATH_EXEC_COMPATIBLE
 #include <execution>
 #include <future>
 #include <thread>
-#define EXEC_COMPATIBLE
 #endif
 
 namespace boost{ namespace math{ namespace statistics { namespace detail {
@@ -60,7 +57,7 @@ ReturnType means_and_covariance_seq_impl(ForwardIterator u_begin, ForwardIterato
     return std::make_tuple(mu_u, mu_v, cov/i, Real(i));
 }
 
-#ifdef EXEC_COMPATIBLE
+#ifdef BOOST_MATH_EXEC_COMPATIBLE
 
 // Numerically stable parallel computation of (co-)variance
 // https://dl.acm.org/doi/10.1145/3221269.3223036
@@ -154,7 +151,7 @@ ReturnType means_and_covariance_parallel_impl(ForwardIterator u_begin, ForwardIt
     return std::make_tuple(mu_u_a, mu_v_a, cov_a, n_a);
 }
 
-#endif // EXEC_COMPATIBLE
+#endif // BOOST_MATH_EXEC_COMPATIBLE
 
 template<typename ReturnType, typename ForwardIterator>
 ReturnType correlation_coefficient_seq_impl(ForwardIterator u_begin, ForwardIterator u_end, ForwardIterator v_begin, ForwardIterator v_end)
@@ -204,7 +201,7 @@ ReturnType correlation_coefficient_seq_impl(ForwardIterator u_begin, ForwardIter
     return std::make_tuple(mu_u, Qu, mu_v, Qv, cov, rho, Real(i));
 }
 
-#ifdef EXEC_COMPATIBLE
+#ifdef BOOST_MATH_EXEC_COMPATIBLE
 
 // Numerically stable parallel computation of (co-)variance:
 // https://dl.acm.org/doi/10.1145/3221269.3223036
@@ -324,11 +321,11 @@ ReturnType correlation_coefficient_parallel_impl(ForwardIterator u_begin, Forwar
     return std::make_tuple(mu_u_a, Qu_a, mu_v_a, Qv_a, cov_a, rho, n_a);
 }
 
-#endif // EXEC_COMPATIBLE
+#endif // BOOST_MATH_EXEC_COMPATIBLE
 
 } // namespace detail
 
-#ifdef EXEC_COMPATIBLE
+#ifdef BOOST_MATH_EXEC_COMPATIBLE
 
 template<typename ExecutionPolicy, typename Container, typename Real = typename Container::value_type>
 inline auto means_and_covariance(ExecutionPolicy&& exec, Container const & u, Container const & v)

diff --git a/include/boost/math/statistics/chatterjee_correlation.hpp b/include/boost/math/statistics/chatterjee_correlation.hpp
@@ -0,0 +1,159 @@
+//  (C) Copyright Matt Borland 2022.
+//  Use, modification and distribution are subject to the
+//  Boost Software License, Version 1.0. (See accompanying file
+//  LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)
+
+#ifndef BOOST_MATH_STATISTICS_CHATTERJEE_CORRELATION_HPP
+#define BOOST_MATH_STATISTICS_CHATTERJEE_CORRELATION_HPP
+
+#include <cstdint>
+#include <cmath>
+#include <algorithm>
+#include <iterator>
+#include <vector>
+#include <limits>
+#include <utility>
+#include <type_traits>
+#include <boost/math/tools/assert.hpp>
+#include <boost/math/tools/config.hpp>
+#include <boost/math/statistics/detail/rank.hpp>
+
+#ifdef BOOST_MATH_EXEC_COMPATIBLE
+#include <execution>
+#include <future>
+#include <thread>
+#endif
+
+namespace boost { namespace math { namespace statistics {
+
+namespace detail {
+
+template <typename BDIter>
+std::size_t chatterjee_transform(BDIter begin, BDIter end)
+{
+    std::size_t sum = 0;
+
+    while(++begin != end)
+    {
+        if(*begin > *std::prev(begin))
+        {
+            sum += *begin - *std::prev(begin);
+        }
+        else
+        {
+            sum += *std::prev(begin) - *begin;
+        }
+    }
+
+    return sum;
+}
+
+template <typename ReturnType, typename ForwardIterator>
+ReturnType chatterjee_correlation_seq_impl(ForwardIterator u_begin, ForwardIterator u_end, ForwardIterator v_begin, ForwardIterator v_end)
+{
+    using std::abs;
+
+    BOOST_MATH_ASSERT_MSG(std::is_sorted(u_begin, u_end), "The x values must be sorted in order to use this functionality");
+
+    const std::vector<std::size_t> rank_vector = rank(v_begin, v_end);
+
+    std::size_t sum = chatterjee_transform(rank_vector.begin(), rank_vector.end());
+
+    ReturnType result = static_cast<ReturnType>(1) - (static_cast<ReturnType>(3 * sum) / static_cast<ReturnType>(rank_vector.size() * rank_vector.size() - 1));
+
+    // If the result is 1 then Y is constant and all the elements must be ties
+    if (abs(result - static_cast<ReturnType>(1)) < std::numeric_limits<ReturnType>::epsilon())
+    {
+        return std::numeric_limits<ReturnType>::quiet_NaN();
+    }
+
+    return result;
+}
+
+} // Namespace detail
+
+template <typename Container, typename Real = typename Container::value_type, 
+          typename ReturnType = typename std::conditional<std::is_integral<Real>::value, double, Real>::type>
+inline ReturnType chatterjee_correlation(const Container& u, const Container& v)
+{
+    return detail::chatterjee_correlation_seq_impl<ReturnType>(std::begin(u), std::end(u), std::begin(v), std::end(v));
+}
+
+}}} // Namespace boost::math::statistics
+
+#ifdef BOOST_MATH_EXEC_COMPATIBLE
+
+namespace boost::math::statistics {
+
+namespace detail {
+
+template <typename ReturnType, typename ExecutionPolicy, typename ForwardIterator>
+ReturnType chatterjee_correlation_par_impl(ExecutionPolicy&& exec, ForwardIterator u_begin, ForwardIterator u_end,
+                                                                   ForwardIterator v_begin, ForwardIterator v_end)
+{
+    using std::abs;
+    BOOST_MATH_ASSERT_MSG(std::is_sorted(std::forward<ExecutionPolicy>(exec), u_begin, u_end), "The x values must be sorted in order to use this functionality");
+
+    auto rank_vector = rank(std::forward<ExecutionPolicy>(exec), v_begin, v_end);
+
+    const auto num_threads = std::thread::hardware_concurrency() == 0 ? 2u : std::thread::hardware_concurrency();
+    std::vector<std::future<std::size_t>> future_manager {};
+    const auto elements_per_thread = std::ceil(static_cast<double>(rank_vector.size()) / num_threads);
+
+    auto it = rank_vector.begin();
+    auto end = rank_vector.end();
+    for(std::size_t i {}; i < num_threads - 1; ++i)
+    {
+        future_manager.emplace_back(std::async(std::launch::async | std::launch::deferred, [it, elements_per_thread]() -> std::size_t
+        {
+            return chatterjee_transform(it, std::next(it, elements_per_thread));
+        }));
+        it = std::next(it, elements_per_thread - 1);
+    }
+
+    future_manager.emplace_back(std::async(std::launch::async | std::launch::deferred, [it, end]() -> std::size_t
+    {
+        return chatterjee_transform(it, end);
+    }));
+
+    std::size_t sum {};
+    for(std::size_t i {}; i < future_manager.size(); ++i)
+    {
+        sum += future_manager[i].get();
+    }
+
+    ReturnType result = static_cast<ReturnType>(1) - (static_cast<ReturnType>(3 * sum) / static_cast<ReturnType>(rank_vector.size() * rank_vector.size() - 1));
+
+    // If the result is 1 then Y is constant and all the elements must be ties
+    if (abs(result - static_cast<ReturnType>(1)) < std::numeric_limits<ReturnType>::epsilon())
+    {
+        return std::numeric_limits<ReturnType>::quiet_NaN();
+    }
+
+    return result;
+}
+
+} // Namespace detail
+
+template <typename ExecutionPolicy, typename Container, typename Real = typename Container::value_type,
+          typename ReturnType = std::conditional_t<std::is_integral_v<Real>, double, Real>>
+inline ReturnType chatterjee_correlation(ExecutionPolicy&& exec, const Container& u, const Container& v)
+{
+    if constexpr (std::is_same_v<std::remove_reference_t<decltype(exec)>, decltype(std::execution::seq)>)
+    {
+        return detail::chatterjee_correlation_seq_impl<ReturnType>(std::cbegin(u), std::cend(u),
+                                                                   std::cbegin(v), std::cend(v));
+    }
+    else
+    {
+        return detail::chatterjee_correlation_par_impl<ReturnType>(std::forward<ExecutionPolicy>(exec),
+                                                                   std::cbegin(u), std::cend(u),
+                                                                   std::cbegin(v), std::cend(v));
+    }
+}
+
+} // Namespace boost::math::statistics
+
+#endif
+
+#endif // BOOST_MATH_STATISTICS_CHATTERJEE_CORRELATION_HPP