Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Random distribution sampling #416

Merged

Conversation

Bzero
Copy link
Contributor

@Bzero Bzero commented Apr 27, 2024

This PR adds the function rand_uniform(a, b) which samples the continuous uniform distribution using inversion sampling. The function takes care of argument reversal.

Partially addresses #413.

@sharkdp
Copy link
Owner

sharkdp commented Apr 27, 2024

Thank you!

I'd be very glad to accept more functions like these, especially if they can be implemented in Numbat. Before we do so, let's maybe discuss a minimum standard:

  • Names: it would be great if we can choose a naming scheme that is close to something existing (e.g. numpy, scipy, matlab, R, …). This will make it easier for users to find what they want.
  • Documentation: We do not have @name/@url etc. for functions, but I think it would be good to add a tiny bit of documentation for each function, and maybe a URL as a reference if someone wants to know more. I know that we don't have this for other functions in the prelude, but I very much would like to add this in the future, so this seems like a good place to start. Especially since some of those functions will not be universally known like sin or sqrt. This will make it easier for us to add proper "doc comments" for functions in the future.
  • Discoverability: it would be great if you could list those functions in book/src/list-functions.md. Feel free to add a new section. random is also missing at the moment.
  • Tests: This is hard to do for non-deterministic functions. I'd be okay with leaving out tests. We could add simple dummy tests like x = rand_uniform(5, 7); assert(x >= 5 && x <= 7) just to make sure that those functions will continue to work in the future. If you do so, please add them in a new examples/random_tests.nbt file (similar to what we do for datetime). For distributions without had boundary conditions, there's really not a lot we could test, but maybe it still makes sense to at least call them. Not sure.

Please feel free to add more of these in a single PR if that is easier for you.

@Bzero
Copy link
Contributor Author

Bzero commented Apr 27, 2024

Thanks a lot for your feedback!

  • Names: Yes, I agree that this would be convenient. I would suggest to base the names on scipy.stats and to prefix them with rand_ (or random_ if you prefer) to make clear that these are non-deterministic functions.
  • Documentation: Sounds good, I will add documentation as if there were @name, @url, maybe @usage function decorators, so it will be easy to convert them if that feature will be added in the future.
  • Discoverability: Agreed.
  • Tests: So, I had actually written a small script which samples a large number of values and compares them to the distribution function (using scipy.stats) to check that the functions behave as intended. I think this method could serve as a test, I will see if I can do something similar with rust tests.

Please feel free to add more of these in a single PR if that is easier for you.

Yes, that might be better. I will bundle them into a single PR.

@Bzero Bzero marked this pull request as draft April 27, 2024 15:59
@Bzero Bzero changed the title Add continuous uniform distribution sampling Random distribution sampling Apr 28, 2024
@Bzero
Copy link
Contributor Author

Bzero commented Apr 29, 2024

Since you are already reviewing, here are a few more notes on the implementations that might be relevant which I had not written up yet because I have not yet had time to look into the testing question:

  • If a function receives an invalid parameter it returns an error if it is a scalar function and NAN otherwise. I hope that is okay this way, if there is a way to return errors for non-scalar functions in the future we can change that.

  • The normal distribution is implemented slightly inefficiently as it has to discard every second sample.

  • The method used to sample the Bernoulli distribution has suboptimal linear runtime in n. Unless people start doing heavy number crunching this should not be a problem and for large n the approximation by the normal distribution can be used.

  • Similarly, the implementation of the Poisson distribution has linear expected runtime in λ. Again, in practice this should not be an issue and for large λ the normal distribution can be used too.

Comment on lines +47 to +51
# name: Normal distribution sampling
# url: https://en.wikipedia.org/wiki/Normal_distribution
# description: Samples a normal distribution with mean μ and standard deviation σ using the Box-Muller transform.
fn rand_norm<T>(μ: T, σ: T) -> T =
μ + sqrt(-2 σ² × ln(random())) × sin(2π × random())
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😍

fn rand_pareto<T>(α: Scalar, min: T) -> T =
if value_of(min) > 0 && α > 0
then min / ((1-random())^(1/α))
else NaN * unit_of(min)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should really make 0, NaN and inf polymorphic (#37).

@sharkdp
Copy link
Owner

sharkdp commented Apr 29, 2024

Thank you very much — this looks fantastic!

I wrote a small example program (in Rust... we really need a list data type in Numbat!) to sample large amounts of numbers from those distributions and compared the normalized histograms with the PDFs of those distributions. Everything looked correct 👍

use numbat::{
    module_importer::FileSystemImporter, resolver::CodeSource, Context, InterpreterResult,
};
use std::path::Path;

fn get_value(ctx: &mut Context, code: &str) -> f64 {
    match ctx.interpret(code, CodeSource::Internal).unwrap().1 {
        InterpreterResult::Value(v) => v.unsafe_as_quantity().unsafe_value().0,
        _ => todo!(),
    }
}

fn main() {
    let module_path = Path::new(&std::env::var_os("CARGO_MANIFEST_DIR").unwrap()).join("modules");

    let mut importer = FileSystemImporter::default();
    importer.add_path(module_path);
    let mut ctx = Context::new(importer);
    let _result = ctx.interpret("use all", CodeSource::Internal).unwrap();

    let code = std::env::args().nth(1).unwrap();

    let results = (0..10000)
        .map(|_| get_value(&mut ctx, code.as_str()))
        .collect::<Vec<_>>();
    for r in results {
        println!("{}", r);
    }
}

And a Python script to plot a histogram:

import matplotlib.pyplot as plt
import numpy as np
import sys

data = np.loadtxt(sys.argv[1])

weights = np.ones_like(data) / len(data)
plt.hist(data, bins=50, weights=weights)
plt.show()

Since you are already reviewing, here are a few more notes on the implementations that might be relevant which I had not written up yet because I have not yet had time to look into the testing question:

If a function receives an invalid parameter it returns an error if it is a scalar function and NAN otherwise. I hope that is okay this way, if there is a way to return errors for non-scalar functions in the future we can change that.

👍

The normal distribution is implemented slightly inefficiently as it has to discard every second sample.

I think that is completely acceptable. Changing that would require us to keep state somewhere (like in a static variable), right?

The method used to sample the Bernoulli distribution has suboptimal linear runtime in n. Unless people start doing heavy number crunching this should not be a problem and for large n the approximation by the normal distribution can be used.

👍

Similarly, the implementation of the Poisson distribution has linear expected runtime in λ. Again, in practice this should not be an issue and for large λ the normal distribution can be used too.

👍

Bzero and others added 2 commits April 29, 2024 21:19
Co-authored-by: David Peter <sharkdp@users.noreply.github.com>
@sharkdp
Copy link
Owner

sharkdp commented Apr 29, 2024

  • If a function receives an invalid parameter it returns an error if it is a scalar function and NAN otherwise. I hope that is okay this way, if there is a way to return errors for non-scalar functions in the future we can change that.

I just implemented a proper version of error(…) here: #420. It would be great if you could rebase/merge with master and use error(…) instead of NaN.

@Bzero
Copy link
Contributor Author

Bzero commented Apr 29, 2024

Thanks a lot for all your comments and catching my mistakes!

I wrote a small example program (in Rust... we really need a list data type in Numbat!) to sample large amounts of numbers from those distributions and compared the normalized histograms with the PDFs of those distributions. Everything looked correct 👍

use numbat::{
    module_importer::FileSystemImporter, resolver::CodeSource, Context, InterpreterResult,
};
use std::path::Path;

fn get_value(ctx: &mut Context, code: &str) -> f64 {
    match ctx.interpret(code, CodeSource::Internal).unwrap().1 {
        InterpreterResult::Value(v) => v.unsafe_as_quantity().unsafe_value().0,
        _ => todo!(),
    }
}

fn main() {
    let module_path = Path::new(&std::env::var_os("CARGO_MANIFEST_DIR").unwrap()).join("modules");

    let mut importer = FileSystemImporter::default();
    importer.add_path(module_path);
    let mut ctx = Context::new(importer);
    let _result = ctx.interpret("use all", CodeSource::Internal).unwrap();

    let code = std::env::args().nth(1).unwrap();

    let results = (0..10000)
        .map(|_| get_value(&mut ctx, code.as_str()))
        .collect::<Vec<_>>();
    for r in results {
        println!("{}", r);
    }
}

And a Python script to plot a histogram:

import matplotlib.pyplot as plt
import numpy as np
import sys

data = np.loadtxt(sys.argv[1])

weights = np.ones_like(data) / len(data)
plt.hist(data, bins=50, weights=weights)
plt.show()

Thanks, this is really helpful! I think if we wanted to integrate this in the normal testing one should be able to do a Kolmogorov-Smirnov test to automatically judge whether the expected pdf fits the sampled data. I will have a look if there is a rust library that would let us do this easily.

The normal distribution is implemented slightly inefficiently as it has to discard every second sample.

I think that is completely acceptable. Changing that would require us to keep state somewhere (like in a static variable), right?

Yes, the Box-Muller transform provides two independent random numbers at a time and we only use one of them.

@Bzero
Copy link
Contributor Author

Bzero commented Apr 29, 2024

I just implemented a proper version of error(…) here: #420. It would be great if you could rebase/merge with master and use error(…) instead of NaN.

Oh perfect, thanks a lot, this will make it a lot more user friendly!

@sharkdp
Copy link
Owner

sharkdp commented Apr 29, 2024

Thanks, this is really helpful! I think if we wanted to integrate this in the normal testing one should be able to do a Kolmogorov-Smirnov test to automatically judge whether the expected pdf fits the sampled data. I will have a look if there is a rust library that would let us do this easily.

I'd be okay with merging this without any additional tests. If you want to work on that, please feel free to go ahead. ~~But let's also keep test execution time in mind. Drawing 10000 samples in debug mode (which is what we usually have for cargo test) took ~seconds. Not sure how many samples would be needed for a proper test, but I'd rather not have any tests that take >1s, if possible.~~

Edit: nevermind. if you want to implement this, let's just do it. Worst case, we can deactivate those tests locally and only run them in CI, if they take too long.

@Bzero Bzero force-pushed the continuous_uniform_distribution_sampling branch 2 times, most recently from 1136774 to f5c1475 Compare April 29, 2024 20:49
@Bzero
Copy link
Contributor Author

Bzero commented Apr 29, 2024

I'd be okay with merging this without any additional tests. If you want to work on that, please feel free to go ahead. ~~But let's also keep test execution time in mind. Drawing 10000 samples in debug mode (which is what we usually have for cargo test) took ~seconds. Not sure how many samples would be needed for a proper test, but I'd rather not have any tests that take >1s, if possible.~~

Edit: nevermind. if you want to implement this, let's just do it. Worst case, we can deactivate those tests locally and only run them in CI, if they take too long.

Well, I am not totally sure it is worth the effort then. Implementation would not be straight forward for me as I am not familiar with any of the rust libraries that can do K-S tests and these functions are not likely going to be modified a lot or rely on parts that are likely change a lot so it may be good enough to manually compare the histograms instead of doing it automatically.

fn rand_expon<T>(λ: T) -> 1/T =
if value_of(λ) > 0
then - ln(1-random()) / λ
else error("Argument λ must not be positive.")
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
else error("Argument λ must not be positive.")
else error("Argument λ must be positive.")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching this, that was a silly mistake.

@Bzero Bzero force-pushed the continuous_uniform_distribution_sampling branch from f5c1475 to d373f78 Compare April 30, 2024 15:53
@sharkdp sharkdp marked this pull request as ready for review May 1, 2024 07:04
@sharkdp sharkdp merged commit 026342f into sharkdp:master May 1, 2024
15 checks passed
@sharkdp
Copy link
Owner

sharkdp commented May 1, 2024

Thank you!

@Bzero
Copy link
Contributor Author

Bzero commented May 1, 2024

Welcome, thanks a lot for your help!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants