Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace min/max, make quantile_mut return None for empty arrays, and fix docs #13

Merged
merged 4 commits into from
Nov 19, 2018

Conversation

jturner314
Copy link
Member

@jturner314 jturner314 commented Nov 18, 2018

This PR does a few things:

  • Replace the old min/max methods with the old min_partialord/max_partialord. (Rename min_partialord/max_partialord to min/max.) See this comment for more explanation. The primary issue with the old min/max methods is that they panicked on empty arrays.
  • Make quantile_mut return None for empty arrays. It's easy to forget about the possibility of empty arrays, so it's important to remind the user of the possibility and force them to handle it. See this comment for more justification.
  • Fix the conditions for panicking in the docs of histogram strategies and quantile_mut.

Edit: I'm uncomfortable with the histogram strategies panicking in so many cases, but we can fix that in a later version.

The problem with the old methods was that they panicked when the array
was empty, which was very problematic. (See rust-ndarray/ndarray#512
for discussion.) The old `min_partialord` and `max_partialord` have
been renamed to `min`/`max`.
@jturner314 jturner314 mentioned this pull request Nov 18, 2018
17 tasks
@LukeMathWalker
Copy link
Member

I'll review it tomorrow morning.

On histogram's panics: should we change the signature to return a Result?
We panic in those occasions where it is not possible to provide an answer that makes sense:

  • what is the optimal bin width for a constant/empty array?
  • how do I handle degenerate cases (e.g. IQR=0 in Freedman Diaconis)?

Returning a Result would probably give a chance to the user to specify a default that makes sense to them, especially if we provide a useful error.

@jturner314
Copy link
Member Author

For empty arrays, I think the right thing to do is to have Edges be an empty Vec (no bins).

For constant arrays, I think the right thing to do is have Edges be vec![the_constant_value] (no bins) or vec![the_constant_value, the_constant_value + 1] (one bin). Alternatively, if we change Edges to allow duplicate edges, we could use vec![the_constant_value; 2] (one bin).

For Freedman Diaconis with IQR=0, I'm not sure; that requires some thought. Panicking definitely isn't the right thing, because IMO panics shouldn't be a function of the data in the array. I think I'd return an Option/Result or have a fallback strategy. At first glance, the Option/Result appeals to me more.

By the way, in a future version, I'm thinking about replacing GridBuilder with a fit_array method on Grid and simplifying the BinsBuildingStrategy trait:

impl<A: Ord> Grid<A> {
    fn fit_array<S, B>(array: &ArrayBase<S, Ix2>, strategy: &B) -> Result<Self, Error>
    where
        S: Data<Elem = A>,
        B: BinsBuildingStrategy<A>,
    {
        // ...
    }
}

pub trait BinsBuildingStrategy<A: Ord> {
    fn fit_array<S>(&self, array: &ArrayBase<S, Ix1>) -> Result<Bins<A>, Error>
    where
        S: Data<Elem = A>;
}

The BinsBuildingStrategy types wouldn't contain the min/max/n_bins/etc. derived from a specific array; they'd contain whatever metadata would be necessary to calculate the Bins in fit_array. This would simplify handling of special cases and would simplify the public API a little. What do you think about this?

@LukeMathWalker
Copy link
Member

For empty arrays, I think the right thing to do is to have Edges be an empty Vec (no bins).

It makes sense, I agree.

For constant arrays, I think the right thing to do is have Edges be vec![the_constant_value] (no bins) or vec![the_constant_value, the_constant_value + 1] (one bin). Alternatively, if we change Edges to allow duplicate edges, we could use vec![the_constant_value; 2] (one bin).

We could use a "default" width of 1, that could be the way. I'd avoid duplicate edges since they would still lead to empty bins and a degenerate grid (no observation would be captured), thus surprising the user.

By the way, in a future version, I'm thinking about replacing GridBuilder with a fit_array method on Grid and simplifying the BinsBuildingStrategy trait:

impl<A: Ord> Grid<A> {
    fn fit_array<S, B>(array: &ArrayBase<S, Ix2>, strategy: &B) -> Result<Self, Error>
    where
        S: Data<Elem = A>,
        B: BinsBuildingStrategy<A>,
    {
        // ...
    }
}

pub trait BinsBuildingStrategy<A: Ord> {
    fn fit_array<S>(&self, array: &ArrayBase<S, Ix1>) -> Result<Bins<A>, Error>
    where
        S: Data<Elem = A>;
}

The BinsBuildingStrategy types wouldn't contain the min/max/n_bins/etc. derived from a specific array; they'd contain whatever metadata would be necessary to calculate the Bins in fit_array. This would simplify handling of special cases and would simplify the public API a little. What do you think about this?

A fit_array method sounds good to me, as a shortcut on Grid to simplify the public API, but you still need extremes and a bin width to execute all the strategies we have implemented, hence I think the new strategy would end up quite similar to the ones we have right now. They would probably just squeeze everything in a single method.

@LukeMathWalker LukeMathWalker merged commit 882a411 into master Nov 19, 2018
@LukeMathWalker LukeMathWalker deleted the rework-min-max branch November 19, 2018 08:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants