Kolmogorov-Smirnov distribution #422
Conversation
The build failures indicate that
I've added a closed form expression for
If not I'll take another stab at a closed form this weekend.
Found my mistake! Closed-form kurtosis is implemented now. The latest commit does not say [CI SKIP], so I will check the build result tomorrow.
Can someone assist me with the multiprecision Travis errors? I am pulling my hair out trying to figure them out.
This is a template argument deduction issue, somewhere (if you follow the instantiation stack back up you will find where, but looks like kolmogorov_smirnov.hpp:402) you have:
But
Or:
The two ultimately end up the same.
@jzmaddock Thanks, I'll try it out.
I'm now seeing some C++03-related errors originating in
@evanmiller : None of the quadrature stuff works in C++03 mode; in addition, C++03 support is being dropped, so I wouldn't worry about supporting it in new code.
@NAThompson Understood. How should I deal with the failing Travis test then? (Quadrature is only used in the K-S test code, not in the implementation itself.)
@evanmiller : Change
to
(Check that syntax before kicking off a build!)
@NAThompson I think it needs a few colons, but I will try it out.
The tests are now a sea of green (modulo a couple of apt-get errors on Travis and pre-existing failures on Appveyor).
Oh, it's a scourge, isn't it? I'll try to take a look soon.
Much of Boost.Math has working examples in the /example folder. I believe that they are very valuable to readers, many of whom are novice programmers but experts in other fields. There are many complexities in Boost.Math, such as allowing the use of different floating-point types, that have caused these users trouble. The /test folder provides usage examples, but they are often obfuscated by the use of Boost.Test, and are deliberately obscure to test corner and edge cases. What may seem obvious and trivially simple is often puzzling to newcomers. Real-life data is often available and would make the examples more realistic. So I feel it would be very useful to continue to provide simple working examples of functions, distributions, etc., heavily commented and with typical output, with a link from the docs (and perhaps also code snippets in the docs). IMO, the K-S distribution (and the similar Anderson-Darling) would benefit from this type of example.

I feel it is time to get the K-S prototype into its own branch off develop. It would probably save @evanmiller a lot of time if I take his documentation comments in code, references, etc. and convert them to Quickbook format as a prototype that can then be refined and expanded. But to do this I would like to work on a ks branch.
You can do this with github cli; do:
@pabristow I agree that fleshed-out examples are very useful to practitioners. In the case of K-S, it would probably also be useful to provide a function for computing the test statistic D. This requires a vector of data and a theoretical distribution to compare the data against. (It's relatively easy to calculate - just the supremum of the CDF differences.) Then the example could walk through constructing a theoretical distribution, computing the test statistic, and generating a critical value and p-value for the null hypothesis. Doing that for Anderson-Darling would be useful as well.

Just so you know where I'm coming from, my interest in Boost.Math is on the software development end - i.e. providing functions and documentation useful to professional statistical application developers rather than novice end users. So while I do see value in extended examples, they will be less of a priority to me compared to the basic documentation and meeting Boost's other minimum requirements for inclusion. Barring a special branch here, you're welcome to send mini-PRs to the branch on my account. I did the Quickbook docs for my Jacobi Theta work and can do the same here - I was just waiting for a general OK on the API first.
One of the curious features of writing for Boost is that you have little idea who your 'customers' are. The only clues come when they find bugs, often surprisingly obscure, or when they ask questions on the Boost lists, often quite elementary. Providing both examples and links to the example code has reduced the number of cries for help. Another clue to users is provided indirectly by Google. I'm pleased to see how high Boost.Math often comes in the Google suggested links, and interpret this to mean that other people have clicked on these links (and arrogantly assume that they liked them too ;-) ).

But that isn't to say that 'professional' statisticians are not important customers, especially as a USP of Boost.Math is that it meshes well with Boost.Multiprecision, allowing much higher precision when this is vital. (Though I do wonder if pro statisticians don't prefer to use less picky languages like R and Python.) An important attraction of using Boost.Math is that one can add to an existing program without writing data out and reading it into another statistics tool. So I'm sure we agree that there is a place for both novice and expert examples. If I can help by providing novice examples, please tell me. (In the past this has also smoked out some buglets in the process.)
Thanks for the offer @pabristow. I think after the main documentation is done we can think about useful examples and workflows. Does anyone object to the current API? If not, I will take a crack at a Quickbook.
@evanmiller : I just read through your unit tests and think the API looks good. Also, you might want to rebase on master to get a clean build.
Okay, I've added some user documentation, let me know if it's intelligible! It's not nearly as comprehensive as the code comments, but I don't think it's really worth enumerating all of the ways the function is not implemented (which is what the comments do). @pabristow It might be a useful exercise to see if the docs give you enough rope to write an example program. If not then it means there are some holes. I do think a complete tutorial will necessitate additional functions provided by Boost e.g. computing the K-S sample statistic, which isn't out-of-this-world difficult, but which is not totally trivial. I'm not sure what the right API for that function would be – and it's a bit further afield than what I'm trying to achieve right now. |
Add a new distribution, kolmogorov_smirnov_distribution, which takes a parameter that represents the number of observations used in a Kolmogorov-Smirnov test. (The K-S test is a popular test for comparing two CDFs, but the test statistic is not implemented here.) This implementation includes Kolmogorov's original 1st order Taylor expansion. There is a literature on the distribution's other mathematical properties (higher order terms and exact version); this literature is summarized in the main header file for anyone who may want to expand the implementation later. The CDF is implemented using a Jacobi theta function, and the PDF is a hand-rolled derivative of that function. Quantiles plug the CDF and PDF into a Newton-Raphson iteration. The mean and variance have nice closed-form expressions, and the mode uses a dumb run-time maximizer. This commit includes graphs, a ULP plotter for the PDF, and the usual compilation and numerical tests. The test file is on the small side, but it integrates the distribution from zero to infinity, and covers the quantiles pretty well. As of now the numerical tests only verify self-consistency (e.g. distribution moments and CDF-quantile relations), so there's room to add some external checks. I will add user-facing documentation after the API is approved and the implementation is finalized.
The third moment integrates nicely with the help of Apery's constant (zeta_three). Verify the result via quadrature.
Verify the result via quadrature.
        return RealType(1.8) - 5 * (1 - p);
    if (p < 0.3)
        return p + RealType(0.45);
    return p + RealType(0.3);
I suppose it's low priority, but casting a double-precision number like 0.3 to RealType is a little inelegant, since it's not representable. Not really a big deal, but I'd use RealType(3)/RealType(10).
Well, the whole function is inelegant – it's basically me drawing three straight lines by hand to fit the curve for an initial guess. So I'm not too worried about epsilon-sized round-off errors.
Happy to change it, but note that many/most of the tests in that folder include |
Passes the K-S test on msvc 14.2 using the Visual Studio IDE OK, with minor moaning:

test_kolmogorov_smirnov.cpp(17): warning C4100: 'T': unreferenced formal parameter

which looks spurious. (The curious passing of a zero type dates from 2005, when compiler templates didn't yet work properly.) I also needed to add an include of math/test, where the annoying pch_light.hpp lives, to find it from #include <pch_light>. (Aside: I note that you are not yet up-to-date with Boost 1.75, so I had to build the unit_test_framework library for 1.74. But updating is always tiresome.)

I also added a multiprecision test (this usually smokes out any potential conversion problems) and it mostly passed, but appears to fail these two tests:

BOOST_TEST_CHECK(pdf(dist, mode) >= pdf(dist, mode - 100 * tol));

However the difference is only a bit or two, so the test is just too tight:

1>I:/Cpp/math/test_kolmogorov_smirnov/test_kolmogorov_smirnov_mp/test_kolmogorov_smirnov_mp.cpp(61): error : in "test_main": check pdf(dist, mode) >= pdf(dist, mode - 100 * tol) has failed [16.8683936202546321115046170516878736 < 16.8683936202546321115046170516878798]

We haven't run multiprecision math tests routinely as they take too long (~30 sec for me), but it is nice to see that it works OK. Using the B2 jamfile, I get a failure to link and run the test and will investigate tomorrow. WIP but looking good. Will start on an example of what I am told is called the 'Vodka Test' by some novice statisticians.
Thanks for the comments @pabristow. There's some overlap with @NAThompson's feedback, so some of the minor issues have already been addressed. I plan to change
I'll take a look into updating to Boost 1.75, and will plan to kick off a CI build later today to confirm the latest round of fixes hasn't broken anything.
100*tol fails on multiprecision (I am told). sqrt(eps) should be larger and work better across precisions.
Can someone clarify this comment? I'm not sure what's out of date exactly, or how I update my repos to the mythical Boost 1.75.
If it's built from libs/multiprecision/test/Jamfile.v2 then libs/multiprecision/test will be in the include path automatically.
@NAThompson Are you running the test via |
@evanmiller : I don't use |
I have git pulled the boost /develop branch to get all the libraries fully up-to-date. I think you will have been testing against this using Travis. I don't think this will cause trouble. FWIW I use:

cd i:/boost
git checkout develop
I have now run the tests OK on Windows with mingw64 using the Visual Studio IDE and the latest MSVC preview, and also using the Codeblocks IDE with Clang 10.0.0 and GCC 10.0.1, and with b2 using the jamfile (except that Clang doesn't work quite right, failing to link with unit_test_framework, a bug in the Clang jamfiles that I have yet to fix). So all looking good so far :-) You might like to follow @jzmaddock's view in #431 in the tests_spots. But that's a cosmetic issue. A working example is WIP.
@pabristow : I merged this because I thought it looked good; could you open up a new PR for the example? @evanmiller : Nice work. I guess the next phase is extracting a p-value from the test statistic? This was something I think I got bogged down with on the Anderson-Darling test, IIRC.
Thanks for the report @pabristow - it looks like we're merged in.

Thanks to everyone for the feedback and assistance on this and #394! My "journey" here began with a desire to replace about 50 lines of wobbly Kolmogorov-Smirnov code with something heavily tested and peer-reviewed. Ten weeks, two pull requests, 1000 LoC, and several pages of documentation later, I can replace those 50 lines with a few calls to Boost :-). The expertise here is an invaluable resource. While I'm here, I will note a few "pain points" as a new contributor:
And some unexpected enjoyments:
I think that about covers it. I'll be available for questions about the contributed code and docs, but otherwise I will be checking out for a while to focus on other tasks. I do have another 50 lines of jury-rigged numerical code that I might try to replace with a Boost contribution - but that will have to wait until another day!
I'm sure @pabristow is on the case but the p-value will just be the upper tail of the distribution. I.e.
I think we're talking about two different things. The test statistic needs to be computed from data, correct? Your expression doesn't have an empirical cumulative distribution function.
Yes - the test statistic is computed from the data (or empirical CDF) and from the theoretical distribution being compared.
I agree; I have a branch locally with a plotting function that has roughly the same interface as the ULPs plots. Sadly it has been sitting on my machine for quite some time. The issue is that I have no competitive advantage with these plots. With the ULPs plots I had a unique insight (that the condition number could form an envelope indicating whether the function was capably implemented), but spitting out an svg with linear path info is kinda bush-league. It's annoying to me that there's no interpolating spline for svg; the Catmull-Rom curve would work wonders. The approximating Bezier splines supported by svg are ridiculous: I'd have to back out the control points from interpolating points, which is a mathematics I have no interest in, since the Catmull-Rom curve made that superfluous.
@evanmiller Thanks for all your valuable work on the K-S distribution. Hope you had some fun and found some of it educational, and sorry for the un-fun bit - the broken (by me) svg_plot (I will fix it).

Before you depart this (Boost) life completely, I'd be interested to know your views on the two more recent papers, and how they compare with the method you used. Although n in the 100s is useful, there will be many wanting more. Did you explore this region?

https://arxiv.org/pdf/1802.06966.pdf - Paul van Mulbregt, which also looks promising by rewriting equations to avoid cancellation error; the Python package SciPy (v0.19.1) [1] provides scipy.stats.ksone.

https://core.ac.uk/download/pdf/25787785.pdf - Evaluating Kolmogorov's Distribution
@pabristow No regrets at all - and I was glad to scrape some of the rust off of my C++ skills. Regarding the papers you linked: the first paper deals only with a one-sided statistic, which is a different distribution from the two-sided original. The paper claims it is easier to implement - but it is probably less useful in practice than a two-sided statistic. The MTW paper is included in the overview article that I linked in #421, which I think you're referring to: https://www.jstatsoft.org/article/view/v039i11

Carvalho found an improvement on MTW, but I suspect it can be improved more. I had a brief email correspondence with Carvalho about it, but he has moved on to other research areas. Basically: MTW is an implementation of the Durbin formula, which requires raising a KxK matrix to the Nth power and taking the bottom-right cell of the result. MTW performs the Nth power computation with divide-and-conquer (log(N) matrix multiplications). Carvalho improves it by removing some of the multiplications that will ultimately be discarded.

I suspect MTW can be improved further via eigendecomposition, the old trick where you compute the eigenvalues, put them in a diagonal matrix, and then raise the entries of the diagonal to the Nth power to get a matrix power. Durbin's matrix is nearly lower-triangular - so I further suspect it would be possible to come up with a general expression for the Kth-order characteristic polynomial of the Durbin matrix via Gaussian elimination (i.e. by hand), and then compute eigenvalues for specific cases with a root finder (i.e. without LAPACK). For the yucks I tried deriving that characteristic polynomial; I made a little bit of progress but got distracted with some other things. If anybody stumbles across this and wants to see my notes, I'm happy to share.
Probably a better place to start would be the FFT solution that I linked in the other thread, which I'll link again here: I haven't delved into that paper, so am unable to comment on the computational complexity (or numerical stability) vis-a-vis MTW, Carvalho, or my proposed eigendecomposition.

Besides N < 100, another possible direction for K-S would be a K-sample version, which for posterity I will note can be found in this paper: https://projecteuclid.org/euclid.aoms/1177706261 Boost has all the building blocks for that, including Bessel zeroes, but I will not admit whether or not I have a working double-precision implementation.
    RealType eps = policies::get_epsilon<RealType, Policy>();
    int i = 0;
    RealType pi2 = constants::pi_sqr<RealType>();
    RealType x2n = x*x*n;
A very minor thing, but I would slap a const on these constants to make it easier for the reader and, possibly, the optimizing compiler.
Oh, this is already merged, never mind. :)
This is causing testing failures in develop, see #437. Can someone please take a look? Thanks!
@jzmaddock I think the sqrts here (math/test/test_kolmogorov_smirnov.cpp, lines 59 to 60 in 2dec5f6) are promoting the 2nd argument to a double, so std::sqrt should fix it.
Add a new distribution, kolmogorov_smirnov_distribution, which takes a parameter that represents the number of observations used in a Kolmogorov-Smirnov test. (The K-S test is a popular test for comparing two CDFs, but the test statistic is not implemented here.)

This implementation includes Kolmogorov's original 1st-order Taylor expansion, which itself is an infinite sum. There is a literature on the distribution's other mathematical properties (higher-order terms and an exact version); this literature is summarized in the main header file for anyone who may want to expand the implementation later.
The CDF is implemented using a Jacobi theta function, and the PDF is a hand-rolled derivative of that function. Quantiles plug the CDF and PDF into a Newton-Raphson iteration. The mean, variance, skewness, and kurtosis have nice closed-form expressions, and the mode uses a dumb run-time maximizer.
This commit includes graphs, a ULP plotter for the PDF, and the usual compilation and numerical tests. The test file is on the small side, but it verifies the first four moments by integrating the entire distribution, and also covers the quantiles pretty well. As of now the numerical tests only verify self-consistency (e.g. distribution moments and CDF-quantile relations), so there's room to add some external checks.
I will add user-facing documentation after the API is approved and the implementation is finalized. (Hence WIP)
See #421 for previous discussion and below for relevant graphs.