Random distribution sampling #416

Bzero · 2024-04-27T14:00:25Z

This PR adds the function rand_uniform(a, b) which samples the continuous uniform distribution using inversion sampling. The function takes care of argument reversal.

Partially addresses #413.

sharkdp · 2024-04-27T14:55:24Z

Thank you!

I'd be very glad to accept more functions like these, especially if they can be implemented in Numbat. Before we do so, let's maybe discuss a minimum standard:

Names: it would be great if we can choose a naming scheme that is close to something existing (e.g. numpy, scipy, matlab, R, …). This will make it easier for users to find what they want.
Documentation: We do not have @name/@url etc. for functions, but I think it would be good to add a tiny bit of documentation for each function, and maybe a URL as a reference if someone wants to know more. I know that we don't have this for other functions in the prelude, but I very much would like to add this in the future, so this seems like a good place to start. Especially since some of those functions will not be universally known like sin or sqrt. This will make it easier for us to add proper "doc comments" for functions in the future.
Discoverability: it would be great if you could list those functions in book/src/list-functions.md. Feel free to add a new section. random is also missing at the moment.
Tests: This is hard to do for non-deterministic functions. I'd be okay with leaving out tests. We could add simple dummy tests like x = rand_uniform(5, 7); assert(x >= 5 && x <= 7) just to make sure that those functions will continue to work in the future. If you do so, please add them in a new examples/random_tests.nbt file (similar to what we do for datetime). For distributions without had boundary conditions, there's really not a lot we could test, but maybe it still makes sense to at least call them. Not sure.

Please feel free to add more of these in a single PR if that is easier for you.

Bzero · 2024-04-27T15:58:36Z

Thanks a lot for your feedback!

Names: Yes, I agree that this would be convenient. I would suggest to base the names on scipy.stats and to prefix them with rand_ (or random_ if you prefer) to make clear that these are non-deterministic functions.
Documentation: Sounds good, I will add documentation as if there were @name, @url, maybe @usage function decorators, so it will be easy to convert them if that feature will be added in the future.
Discoverability: Agreed.
Tests: So, I had actually written a small script which samples a large number of values and compares them to the distribution function (using scipy.stats) to check that the functions behave as intended. I think this method could serve as a test, I will see if I can do something similar with rust tests.

Please feel free to add more of these in a single PR if that is easier for you.

Yes, that might be better. I will bundle them into a single PR.

numbat/modules/core/random.nbt

Bzero · 2024-04-29T16:30:45Z

Since you are already reviewing, here are a few more notes on the implementations that might be relevant which I had not written up yet because I have not yet had time to look into the testing question:

If a function receives an invalid parameter it returns an error if it is a scalar function and NAN otherwise. I hope that is okay this way, if there is a way to return errors for non-scalar functions in the future we can change that.
The normal distribution is implemented slightly inefficiently as it has to discard every second sample.
The method used to sample the Bernoulli distribution has suboptimal linear runtime in n. Unless people start doing heavy number crunching this should not be a problem and for large n the approximation by the normal distribution can be used.
Similarly, the implementation of the Poisson distribution has linear expected runtime in λ. Again, in practice this should not be an issue and for large λ the normal distribution can be used too.

numbat/modules/core/random.nbt

sharkdp · 2024-04-29T18:30:41Z

numbat/modules/core/random.nbt

+# name: Normal distribution sampling
+# url: https://en.wikipedia.org/wiki/Normal_distribution
+# description: Samples a normal distribution with mean μ and standard deviation σ using the Box-Muller transform.
+fn rand_norm<T>(μ: T, σ: T) -> T =
+    μ + sqrt(-2 σ² × ln(random())) × sin(2π × random())


numbat/modules/core/random.nbt

sharkdp · 2024-04-29T18:59:23Z

numbat/modules/core/random.nbt

+fn rand_pareto<T>(α: Scalar, min: T) -> T =
+    if value_of(min) > 0 && α > 0
+    then min / ((1-random())^(1/α))
+    else NaN * unit_of(min)


We should really make 0, NaN and inf polymorphic (#37).

sharkdp · 2024-04-29T19:05:32Z

Thank you very much — this looks fantastic!

I wrote a small example program (in Rust... we really need a list data type in Numbat!) to sample large amounts of numbers from those distributions and compared the normalized histograms with the PDFs of those distributions. Everything looked correct 👍

use numbat::{
    module_importer::FileSystemImporter, resolver::CodeSource, Context, InterpreterResult,
};
use std::path::Path;

fn get_value(ctx: &mut Context, code: &str) -> f64 {
    match ctx.interpret(code, CodeSource::Internal).unwrap().1 {
        InterpreterResult::Value(v) => v.unsafe_as_quantity().unsafe_value().0,
        _ => todo!(),
    }
}

fn main() {
    let module_path = Path::new(&std::env::var_os("CARGO_MANIFEST_DIR").unwrap()).join("modules");

    let mut importer = FileSystemImporter::default();
    importer.add_path(module_path);
    let mut ctx = Context::new(importer);
    let _result = ctx.interpret("use all", CodeSource::Internal).unwrap();

    let code = std::env::args().nth(1).unwrap();

    let results = (0..10000)
        .map(|_| get_value(&mut ctx, code.as_str()))
        .collect::<Vec<_>>();
    for r in results {
        println!("{}", r);
    }
}

And a Python script to plot a histogram:

import matplotlib.pyplot as plt
import numpy as np
import sys

data = np.loadtxt(sys.argv[1])

weights = np.ones_like(data) / len(data)
plt.hist(data, bins=50, weights=weights)
plt.show()

Since you are already reviewing, here are a few more notes on the implementations that might be relevant which I had not written up yet because I have not yet had time to look into the testing question:

If a function receives an invalid parameter it returns an error if it is a scalar function and NAN otherwise. I hope that is okay this way, if there is a way to return errors for non-scalar functions in the future we can change that.

👍

The normal distribution is implemented slightly inefficiently as it has to discard every second sample.

I think that is completely acceptable. Changing that would require us to keep state somewhere (like in a static variable), right?

The method used to sample the Bernoulli distribution has suboptimal linear runtime in n. Unless people start doing heavy number crunching this should not be a problem and for large n the approximation by the normal distribution can be used.

👍

Similarly, the implementation of the Poisson distribution has linear expected runtime in λ. Again, in practice this should not be an issue and for large λ the normal distribution can be used too.

👍

Co-authored-by: David Peter <sharkdp@users.noreply.github.com>

sharkdp · 2024-04-29T20:02:01Z

If a function receives an invalid parameter it returns an error if it is a scalar function and NAN otherwise. I hope that is okay this way, if there is a way to return errors for non-scalar functions in the future we can change that.

I just implemented a proper version of error(…) here: #420. It would be great if you could rebase/merge with master and use error(…) instead of NaN.

Bzero · 2024-04-29T20:05:25Z

Thanks a lot for all your comments and catching my mistakes!

I wrote a small example program (in Rust... we really need a list data type in Numbat!) to sample large amounts of numbers from those distributions and compared the normalized histograms with the PDFs of those distributions. Everything looked correct 👍

use numbat::{
    module_importer::FileSystemImporter, resolver::CodeSource, Context, InterpreterResult,
};
use std::path::Path;

fn get_value(ctx: &mut Context, code: &str) -> f64 {
    match ctx.interpret(code, CodeSource::Internal).unwrap().1 {
        InterpreterResult::Value(v) => v.unsafe_as_quantity().unsafe_value().0,
        _ => todo!(),
    }
}

fn main() {
    let module_path = Path::new(&std::env::var_os("CARGO_MANIFEST_DIR").unwrap()).join("modules");

    let mut importer = FileSystemImporter::default();
    importer.add_path(module_path);
    let mut ctx = Context::new(importer);
    let _result = ctx.interpret("use all", CodeSource::Internal).unwrap();

    let code = std::env::args().nth(1).unwrap();

    let results = (0..10000)
        .map(|_| get_value(&mut ctx, code.as_str()))
        .collect::<Vec<_>>();
    for r in results {
        println!("{}", r);
    }
}

And a Python script to plot a histogram:

import matplotlib.pyplot as plt
import numpy as np
import sys

data = np.loadtxt(sys.argv[1])

weights = np.ones_like(data) / len(data)
plt.hist(data, bins=50, weights=weights)
plt.show()

Thanks, this is really helpful! I think if we wanted to integrate this in the normal testing one should be able to do a Kolmogorov-Smirnov test to automatically judge whether the expected pdf fits the sampled data. I will have a look if there is a rust library that would let us do this easily.

The normal distribution is implemented slightly inefficiently as it has to discard every second sample.

I think that is completely acceptable. Changing that would require us to keep state somewhere (like in a static variable), right?

Yes, the Box-Muller transform provides two independent random numbers at a time and we only use one of them.

Bzero · 2024-04-29T20:09:28Z

I just implemented a proper version of error(…) here: #420. It would be great if you could rebase/merge with master and use error(…) instead of NaN.

Oh perfect, thanks a lot, this will make it a lot more user friendly!

sharkdp · 2024-04-29T20:12:12Z

Thanks, this is really helpful! I think if we wanted to integrate this in the normal testing one should be able to do a Kolmogorov-Smirnov test to automatically judge whether the expected pdf fits the sampled data. I will have a look if there is a rust library that would let us do this easily.

I'd be okay with merging this without any additional tests. If you want to work on that, please feel free to go ahead. ~~But let's also keep test execution time in mind. Drawing 10000 samples in debug mode (which is what we usually have for cargo test) took ~seconds. Not sure how many samples would be needed for a proper test, but I'd rather not have any tests that take >1s, if possible.~~

Edit: nevermind. if you want to implement this, let's just do it. Worst case, we can deactivate those tests locally and only run them in CI, if they take too long.

Bzero · 2024-04-29T20:57:17Z

I'd be okay with merging this without any additional tests. If you want to work on that, please feel free to go ahead. ~~But let's also keep test execution time in mind. Drawing 10000 samples in debug mode (which is what we usually have for cargo test) took ~seconds. Not sure how many samples would be needed for a proper test, but I'd rather not have any tests that take >1s, if possible.~~

Edit: nevermind. if you want to implement this, let's just do it. Worst case, we can deactivate those tests locally and only run them in CI, if they take too long.

Well, I am not totally sure it is worth the effort then. Implementation would not be straight forward for me as I am not familiar with any of the rust libraries that can do K-S tests and these functions are not likely going to be modified a lot or rely on parts that are likely change a lot so it may be good enough to manually compare the histograms instead of doing it automatically.

sharkdp · 2024-04-30T15:43:55Z

numbat/modules/core/random.nbt

+fn rand_expon<T>(λ: T) -> 1/T =
+    if value_of(λ) > 0
+    then - ln(1-random()) / λ
+    else error("Argument λ must not be positive.")


Suggested change

else error("Argument λ must not be positive.")

else error("Argument λ must be positive.")

Thanks for catching this, that was a silly mistake.

…andom sampling functions.

sharkdp · 2024-05-01T07:05:06Z

Thank you!

Bzero · 2024-05-01T11:36:16Z

Welcome, thanks a lot for your help!

Add continuous uniform distribution sampling.

08c6c41

Bzero marked this pull request as draft April 27, 2024 15:59

Bzero added 12 commits April 28, 2024 14:38

Add description for random().

0729b6a

Add description for rand_uniform().

256b87a

Add discrete uniform distribution sampling

009b871

Add normal distribution sampling

ffe845a

Add geometric distribution sampling

16a7985

Add exponential distribution sampling

25b77c8

Add log-normal distribution sampling

d839647

Add pareto distribution sampling

18ccb0c

Add Bernoulli distribution sampling

1246a9a

Add Binomial distribution sampling

d7ecad4

Add Poisson distribution sampling

7f6c416

Add random sampling functions to the list of predefined functions

8b92a04

Bzero changed the title ~~Add continuous uniform distribution sampling~~ Random distribution sampling Apr 28, 2024