Kolmogorov-Smirnov distribution #422
Conversation
The build failures indicate that
I've added a closed form expression for
If not I'll take another stab at a closed form this weekend.
Found my mistake! Closed-form kurtosis is implemented now. The latest commit does not say [CI SKIP], so I will check the build result tomorrow.
Can someone assist me with the multiprecision Travis errors? I am pulling my hair out trying to figure them out.
This is a template argument deduction issue, somewhere (if you follow the instantiation stack back up you will find where, but looks like kolmogorov_smirnov.hpp:402) you have:
But
Or:
The two ultimately end up the same.
@jzmaddock Thanks, I'll try it out.
I'm now seeing some C++03-related errors originating in
@evanmiller : None of the quadrature stuff works in C++03 mode; in addition, C++03 support is being dropped, so I wouldn't worry about supporting it in new code.
@NAThompson Understood. How should I deal with the failing Travis test then? (Quadrature is only used in the K-S test code, not in the implementation itself.)
@evanmiller : Change
to
(Check that syntax before kicking off a build!)
@NAThompson I think it needs a few colons, but I will try it out.
The tests are now a sea of green (modulo a couple of apt-get errors on Travis and pre-existing failures on Appveyor).
Oh, it's a scourge, isn't it? I'll try to take a look soon.
Much of Boost.Math has working examples in the /example folder. I believe that they are very valuable to readers, many of whom are novice programmers but experts in other fields. There are many complexities in Boost.Math, such as allowing the use of different floating-point types, that have caused these users trouble. The /test folder provides usage examples, but they are often obfuscated by the use of Boost.Test, and are deliberately obscure to test corner and edge cases. What may seem obvious and trivially simple is often puzzling to newcomers. Real-life data is often available and would make the examples more realistic. So I feel it would be very useful to continue to provide simple working examples of functions, distributions, etc., heavily commented and with typical output, with a link from the docs (and perhaps also code snippets in the docs). IMO, the K-S distribution (and the similar Anderson-Darling) would benefit from this type of example.

I feel it is time to get the K-S prototype into its own branch off develop. It would probably save @evanmiller a lot of time if I take his documentation comments in code, references, etc. and convert them to Quickbook format as a prototype that can then be refined and expanded. But to do this I would like to work on a ks branch.
You can do this with github cli; do:
@pabristow I agree that fleshed-out examples are very useful to practitioners. In the case of K-S, it would probably also be useful to provide a function for computing the test statistic D. This requires a vector of data and a theoretical distribution to compare the data against. (It's relatively easy to calculate - just the supremum of the CDF differences.) Then the example could walk through constructing a theoretical distribution, computing the test statistic, and generating a critical value and p-value for the null hypothesis. Doing that for Anderson-Darling would be useful as well.

Just so you know where I'm coming from, my interest in Boost.Math is on the software development end - i.e. providing functions and documentation useful to professional statistical application developers rather than novice end users. So while I do see value in extended examples, they will be less of a priority to me compared to the basic documentation and meeting Boost's other minimum requirements for inclusion. Barring a special branch here, you're welcome to send mini-PRs to the branch on my account. I did the Quickbook docs for my Jacobi Theta work and can do the same here - I was just waiting for a general OK on the API first.
One of the curious features of writing for Boost is that you have little idea who your 'customers' are. The only clues come when they find bugs, often surprisingly obscure, or when they ask questions on the Boost lists, often quite elementary. Providing both examples and links to the example code has reduced the number of cries for help. Another clue to users is provided indirectly by Google. I'm pleased to see how high Boost.Math often comes in the Google suggested links, and interpret this to mean that other people have clicked on these links (and arrogantly assume that they liked them too ;-) ).

But that isn't to say that 'professional' statisticians are not important customers, especially as a USP of Boost.Math is that it meshes well with Boost.Multiprecision, allowing much higher precision when this is vital. (Though I do wonder if pro statisticians don't prefer to use less picky languages like R and Python.) An important attraction of using Boost.Math is that one can add to an existing program without writing data out and reading it into another statistics tool. So I'm sure we agree that there is a place for both novice and expert examples. If I can help by providing novice examples, please tell me. (In the past this has also smoked out some buglets in the process.)
Thanks for the offer @pabristow. I think after the main documentation is done we can think about useful examples and workflows. Does anyone object to the current API? If not, I will take a crack at a Quickbook.
@evanmiller : I just read through your unit tests and think the API looks good. Also, you might want to rebase on master to get a clean build.
Okay, I've added some user documentation, let me know if it's intelligible! It's not nearly as comprehensive as the code comments, but I don't think it's really worth enumerating all of the ways the function is not implemented (which is what the comments do). @pabristow It might be a useful exercise to see if the docs give you enough rope to write an example program. If not then it means there are some holes. I do think a complete tutorial will necessitate additional functions provided by Boost e.g. computing the K-S sample statistic, which isn't out-of-this-world difficult, but which is not totally trivial. I'm not sure what the right API for that function would be – and it's a bit further afield than what I'm trying to achieve right now. |
Add a new distribution, kolmogorov_smirnov_distribution, which takes a parameter that represents the number of observations used in a Kolmogorov-Smirnov test. (The K-S test is a popular test for comparing two CDFs, but the test statistic is not implemented here.) This implementation includes Kolmogorov's original 1st order Taylor expansion. There is a literature on the distribution's other mathematical properties (higher order terms and exact version); this literature is summarized in the main header file for anyone who may want to expand the implementation later. The CDF is implemented using a Jacobi theta function, and the PDF is a hand-rolled derivative of that function. Quantiles plug the CDF and PDF into a Newton-Raphson iteration. The mean and variance have nice closed-form expressions, and the mode uses a dumb run-time maximizer. This commit includes graphs, a ULP plotter for the PDF, and the usual compilation and numerical tests. The test file is on the small side, but it integrates the distribution from zero to infinity, and covers the quantiles pretty well. As of now the numerical tests only verify self-consistency (e.g. distribution moments and CDF-quantile relations), so there's room to add some external checks. I will add user-facing documentation after the API is approved and the implementation is finalized.
The third moment integrates nicely with the help of Apery's constant (zeta_three). Verify the result via quadrature.
Verify the result via quadrature.
        return RealType(1.8) - 5 * (1 - p);
    if (p < 0.3)
        return p + RealType(0.45);
    return p + RealType(0.3);
I suppose it's low priority, but casting a double-precision number like 0.3 to RealType is a little inelegant, since it's not representable. Not really a big deal, but I'd use RealType(3)/RealType(10).
Well, the whole function is inelegant – it's basically me drawing three straight lines by hand to fit the curve for an initial guess. So I'm not too worried about epsilon-sized round-off errors.
Happy to change it, but note that many/most of the tests in that folder include |
Passes the K-S test on msvc 14.2 using the Visual Studio IDE OK, with minor moaning:

test_kolmogorov_smirnov.cpp(17): warning C4100: 'T': unreferenced formal parameter

which looks spurious. (The curious passing of a zero type dates from 2005, when compiler templates didn't yet work properly.) I also needed to add an include of math/test, where the annoying pch_light.hpp lives, to find it from #include <pch_light>. (Aside: I note that you are not yet up-to-date with Boost 1.75, so I had to build the unit_test_framework library for 1.74. But updating is always tiresome.)

I also added a multiprecision test (this usually smokes out any potential conversion problems) and it mostly passed, but appears to fail these two tests:

BOOST_TEST_CHECK(pdf(dist, mode) >= pdf(dist, mode - 100 * tol));

However the difference is only a bit or two, so the test is just too tight:

1>I:/Cpp/math/test_kolmogorov_smirnov/test_kolmogorov_smirnov_mp/test_kolmogorov_smirnov_mp.cpp(61): error : in "test_main": check pdf(dist, mode) >= pdf(dist, mode - 100 * tol) has failed [16.8683936202546321115046170516878736 < 16.8683936202546321115046170516878798]

We haven't run multiprecision math tests routinely as they take too long (~30 sec for me), but it is nice to see that it works OK. Using the B2 jamfile, I get a failure to link and run the test and will investigate tomorrow. WIP but looking good. Will start on an example of what I am told is called the 'Vodka Test' by some novice statisticians.
Thanks for the comments @pabristow. There's some overlap with @NAThompson's feedback, so some of the minor issues have already been addressed. I plan to change
I'll take a look into updating to Boost 1.75, and will plan to kick off a CI build later today to confirm the latest round of fixes hasn't broken anything.
100*tol fails on multiprecision (I am told). sqrt(eps) should be larger and work better across precisions.
Can someone clarify this comment? I'm not sure what's out of date exactly, or how I update my repos to the mythical Boost 1.75.
If it's built from libs/multiprecision/test/Jamfile.v2 then libs/multiprecision/test will be in the include path automatically.
@NAThompson Are you running the test via |
@evanmiller : I don't use |
I have git pulled the boost /develop branch to get all the libraries fully up-to-date. I think you will have been testing against this using Travis. I don't think this will cause trouble. FWIW I use:

cd i:/boost
git checkout develop
I have now run the tests OK on Windows with mingw64 using the Visual Studio IDE and the latest MSVC preview, and also using the Codeblocks IDE with Clang 10.0.0 and GCC 10.0.1, and with b2 using the jamfile (except that Clang doesn't work quite right, failing to link with unit_test_framework, a bug in the Clang jamfiles that I have yet to fix). So all looking good so far :-) You might like to follow @jzmaddock's view in #431 in the tests_spots. But that's a cosmetic issue. A working example is WIP.
@pabristow : I merged this because I thought it looked good; could you open up a new PR for the example? @evanmiller : Nice work. I guess the next phase is extracting a p-value from the test statistic? This was something I think I got bogged down with on the Anderson-Darling test, IIRC.
Thanks for the report @pabristow - it looks like we're merged in.

Thanks to everyone for the feedback and assistance on this and #394! My "journey" here began with a desire to replace about 50 lines of wobbly Kolmogorov-Smirnov code with something heavily tested and peer-reviewed. Ten weeks, two pull requests, 1000 LoC, and several pages of documentation later, I can replace those 50 lines with a few calls to Boost :-). The expertise here is an invaluable resource. While I'm here, I will note a few "pain points" as a new contributor:
And some unexpected enjoyments:
I think that about covers it. I'll be available for questions about the contributed code and docs, but otherwise I will be checking out for a while to focus on other tasks. I do have another 50 lines of jury-rigged numerical code that I might try to replace with a Boost contribution - but that will have to wait until another day!
I'm sure @pabristow is on the case but the p-value will just be the upper tail of the distribution. I.e.
I think we're talking about two different things. The test statistic needs to be computed from data, correct? Your expression doesn't have an empirical cumulative distribution function.
Yes - the test statistic is computed from the data (or empirical CDF) and from the theoretical distribution being compared.
I agree; I have a branch locally with a plotting function that has roughly the same interface as the ULPs plots. Sadly it has been sitting on my machine for quite some time. The issue is that I have no competitive advantage with these plots. With the ULPs plots I had a unique insight (that the condition number could form an envelope indicating whether the function was capably implemented), but spitting out an svg with linear path info is kinda bush-league. It's annoying to me that there's no interpolating spline for svg; the Catmull-Rom curve would work wonders. The approximating Bezier splines supported by svg are ridiculous: I'd have to back out the control points from interpolating points, which is a mathematics I have no interest in, since the Catmull-Rom curve made that superfluous.
@evanmiller Thanks for all your valuable work on the K-S distribution. Hope you had some fun and found some of it educational, and sorry for the un-fun bit - the broken (by me) svg_plot (I will fix it).

Before you depart this (Boost) life completely, I'd be interested to know your views on the two more recent papers, and how they compare with the method you used. Although n in the 100s is useful, there will be many wanting more. Did you explore this region?

https://arxiv.org/pdf/1802.06966.pdf - Paul van Mulbregt, which also looks promising by rewriting equations to avoid cancellation error; the Python package SciPy (v0.19.1) [1] provides scipy.stats.ksone.

https://core.ac.uk/download/pdf/25787785.pdf - Evaluating Kolmogorov's Distribution
@pabristow No regrets at all - and I was glad to scrape some of the rust off of my C++ skills. Regarding the papers you linked: the first paper deals only with a one-sided statistic, which is a different distribution from the two-sided original. The paper claims it is easier to implement - but it is probably less useful in practice than a two-sided statistic. The MTW paper is included in the overview article that I linked in #421, which I think you're referring to: https://www.jstatsoft.org/article/view/v039i11

Carvalho found an improvement on MTW, but I suspect it can be improved more. I had a brief email correspondence with Carvalho about it, but he has moved on to other research areas. Basically: MTW is an implementation of the Durbin formula, which requires raising a KxK matrix to the Nth power and taking the bottom-right cell of the result. MTW performs the Nth power computation with divide-and-conquer (log(N) matrix multiplications). Carvalho improves it by removing some of the multiplications that will ultimately be discarded.

I suspect MTW can be improved further via eigendecomposition, the old trick where you compute the eigenvalues, put them in a diagonal matrix, and then raise the entries of the diagonal to the Nth power to get a matrix power. Durbin's matrix is nearly lower-triangular - so I further suspect it would be possible to come up with a general expression for the Kth-order characteristic polynomial of the Durbin matrix via Gaussian elimination (i.e. by hand), and then compute eigenvalues for specific cases with a root finder (i.e. without LAPACK). For the yucks I tried deriving that characteristic polynomial; I made a little bit of progress but got distracted with some other things. If anybody stumbles across this and wants to see my notes, I'm happy to share.
Probably a better place to start would be the FFT solution that I linked in the other thread, which I'll link again here: I haven't delved into that paper, so am unable to comment on the computational complexity (or numerical stability) vis-a-vis MTW, Carvalho, or my proposed eigendecomposition.

Besides N < 100, another possible direction for K-S would be a K-sample version, which for posterity I will note can be found in this paper: https://projecteuclid.org/euclid.aoms/1177706261 Boost has all the building blocks for that, including Bessel zeroes, but I will not admit whether or not I have a working double-precision implementation.
    RealType eps = policies::get_epsilon<RealType, Policy>();
    int i = 0;
    RealType pi2 = constants::pi_sqr<RealType>();
    RealType x2n = x*x*n;
A very minor thing, but I would slap a const on these constants to make it easier for the reader and, possibly, the optimizing compiler.
Oh, this is already merged, never mind. :)
This is causing testing failures in develop, see #437. Can someone please take a look? Thanks!
@jzmaddock I think the sqrts here (math/test/test_kolmogorov_smirnov.cpp, lines 59 to 60 in 2dec5f6) are promoting the 2nd argument to a double, so std::sqrt should fix it.
Add a new distribution, kolmogorov_smirnov_distribution, which takes a parameter that represents the number of observations used in a Kolmogorov-Smirnov test. (The K-S test is a popular test for comparing two CDFs, but the test statistic is not implemented here.)

This implementation includes Kolmogorov's original 1st-order Taylor expansion, which itself is an infinite sum. There is a literature on the distribution's other mathematical properties (higher-order terms and an exact version); this literature is summarized in the main header file for anyone who may want to expand the implementation later.
The CDF is implemented using a Jacobi theta function, and the PDF is a hand-rolled derivative of that function. Quantiles plug the CDF and PDF into a Newton-Raphson iteration. The mean, variance, skewness, and kurtosis have nice closed-form expressions, and the mode uses a dumb run-time maximizer.
This commit includes graphs, a ULP plotter for the PDF, and the usual compilation and numerical tests. The test file is on the small side, but it verifies the first four moments by integrating the entire distribution, and also covers the quantiles pretty well. As of now the numerical tests only verify self-consistency (e.g. distribution moments and CDF-quantile relations), so there's room to add some external checks.
I will add user-facing documentation after the API is approved and the implementation is finalized. (Hence WIP)
See #421 for previous discussion and below for relevant graphs.