Fix handling of -0.0 in histograms #768

Merged (3 commits) on Mar 26, 2022
Conversation

@nalimilan (Member)

`searchsortedfirst` and `searchsortedlast` use `isless` for comparisons and therefore consider `-0.0` to be different from `0.0`. This means that these two values do not end up in the same bin when an edge is 0.
This does not make much sense statistically, but even worse, when an extreme edge is 0, `-0.0` is not counted at all.

Fix this by replacing `-0.0` with `0.0` before the search.
Closes #766.

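To illustrate the issue described above, a minimal REPL sketch (illustrative, not part of the PR):

```julia
julia> isless(-0.0, 0.0)   # isless orders -0.0 strictly before 0.0
true

julia> searchsortedlast([0.0, 1.0, 2.0], 0.0)   # 0.0 lands in the first bin
1

julia> searchsortedlast([0.0, 1.0, 2.0], -0.0)  # -0.0 falls before the first edge
0
```

An index of 0 is outside the valid bin range, so when the leftmost edge is `0.0`, an observation of `-0.0` is simply dropped.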
@nalimilan (Member Author)

Cc: @Moelf, @oschulz

@Moelf (Contributor) commented Feb 20, 2022

Are `-0.0` and `0.0` the only edge case? I think so. The only other weird thing that comes to mind is `-NaN` and `NaN`, but they can't be bin edges.

src/hist.jl Outdated
@@ -226,11 +226,17 @@ binindex(h::AbstractHistogram{T,1}, x::Real) where {T} = binindex(h, (x,))[1]
binindex(h::Histogram{T,N}, xs::NTuple{N,Real}) where {T,N} =
map((edge, x) -> _edge_binindex(edge, h.closed, x), h.edges, xs)

_normalize_zero(x::AbstractFloat) = isequal(x, -0.0) ? oftype(x, 0.0) : x
Contributor

Suggested change:
- _normalize_zero(x::AbstractFloat) = isequal(x, -0.0) ? oftype(x, 0.0) : x
+ _normalize_zero(x::AbstractFloat) = ifelse(isequal(x, -0.0), oftype(x, 0.0), x)

For performance? Also, I hope it's automatically inlined.

@nalimilan (Member Author)

AFAICT LLVM is smart enough that this gives exactly the same code.
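If you want to verify this claim locally, a quick sketch comparing the two candidate definitions (illustrative only, using stand-in names `f`/`g`):

```julia
using InteractiveUtils  # provides @code_llvm outside the REPL

f(x::AbstractFloat) = isequal(x, -0.0) ? oftype(x, 0.0) : x        # branching version
g(x::AbstractFloat) = ifelse(isequal(x, -0.0), oftype(x, 0.0), x)  # ifelse version

@code_llvm f(1.0)   # compare the emitted LLVM IR of both variants
@code_llvm g(1.0)
```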


_normalize_zero(x::AbstractFloat) = isequal(x, -0.0) ? oftype(x, 0.0) : x
_normalize_zero(x::Any) = x
Contributor

Suggested change:
- _normalize_zero(x::Any) = x
+ _normalize_zero(x) = x

@Moelf (Contributor) commented Feb 20, 2022

I thought we needed

`ifelse(isequal(x, -0.0), zero(x), x)`

for type stability of `_edge_binindex`?

@nalimilan (Member Author)

`oftype(x, 0.0)` is equivalent to `zero(x)`, but yeah, the latter is probably nicer.
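For example, both spellings produce a positive zero of the argument's type (REPL sketch):

```julia
julia> oftype(-0.0f0, 0.0), zero(-0.0f0)   # both yield a Float32 positive zero
(0.0f0, 0.0f0)
```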

Do you think the spirit of this PR is right? Have you encountered this problem in FHist.jl too?

@nalimilan (Member Author)

Are `-0.0` and `0.0` the only edge case? I think so. The only other weird thing that comes to mind is `-NaN` and `NaN`, but they can't be bin edges.

Yes, I think zero is the only tricky case apart from `NaN`. Currently, `NaN`s in the data trigger an error if you don't specify edges manually, and if you do, they are just ignored, which is dangerous and differs from what we do elsewhere. They are also accepted as edges, which seems absurd. Maybe we should disallow this in another PR.

@nalimilan (Member Author)

Unfortunately, passing `by=_normalize_zero` seems to incur a 30% performance penalty on a random `Vector{Float64}`. It would be better to check edges in the `Histogram` constructor, but given that the edges vector is passed by the caller, mutating it to replace `-0.0` with `0.0` could be problematic. Maybe we should just print a deprecation warning if edges contain `-0.0`, and later throw an error telling the caller to fix this. This is a relatively unlikely case, I guess.

@Moelf (Contributor) commented Feb 20, 2022

Do you think the spirit of this PR is right?

Yeah, totally. For physics we don't run into `-0.0` much, and we also don't care about losing a few events (they are weighted to 0.0001 or something).

Unfortunately, passing `by=_normalize_zero` seems to incur a 30% performance penalty on a random `Vector{Float64}`.

Yeah, I don't completely understand the performance model of `searchsorted*`. The way we have histograms implemented, with recursive pushing and `map()` to find dimensions etc., may also be a problem when everything is compiled together: I'd expect performance to be much better if we simply had a for loop with `searchsorted*` inlined, together with the `by=` in the loop body.
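A rough sketch of the loop shape described above (purely illustrative; `simple_hist` is a hypothetical stand-in, not code from StatsBase.jl or FHist.jl):

```julia
# Fill a 1-D histogram with a plain loop, normalizing -0.0 before the search
# so it lands in the same bin as 0.0 (left-closed bins, last edge excluded).
function simple_hist(xs::AbstractVector{<:AbstractFloat},
                     edges::AbstractVector{<:AbstractFloat})
    counts = zeros(Int, length(edges) - 1)
    for x in xs
        x = ifelse(isequal(x, -0.0), zero(x), x)
        i = searchsortedlast(edges, x)
        1 <= i <= length(counts) && (counts[i] += 1)
    end
    return counts
end

simple_hist(randn(10^4), -3.0:0.5:3.0)   # 12 bin counts
```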

@oschulz (Contributor) commented Feb 21, 2022

Thanks for this!

@nalimilan (Member Author)

I've found a solution, though it's unfortunately a bit complex. It turns out that the overhead of normalizing `-0.0` to `0.0` only appears when edges are a range, as passing `by` to `searchsortedfirst`/`searchsortedlast` forces using the fallback `AbstractVector` method. So I figured we can avoid normalizing zeros when edges are a range.
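For reference, a rough way to observe the difference (hypothetical benchmark sketch, not measurements from this PR; `norm_zero` stands in for `_normalize_zero`):

```julia
using BenchmarkTools  # assumed available

norm_zero(x) = ifelse(isequal(x, -0.0), zero(x), x)

edges = 0.0:0.01:1.0
xs = rand(10^6)

# Specialized method for ranges computes the index arithmetically:
@btime [searchsortedlast($edges, x) for x in $xs];

# Passing `by` falls back to the generic AbstractVector binary search:
@btime [searchsortedlast($edges, x, by=$norm_zero) for x in $xs];
```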

Luckily, it's almost impossible to construct a range which includes `-0.0`. Even `-0.0:1.0` starts with `0.0` rather than `-0.0`, so we never build a range containing `-0.0` when edges are omitted. Neither `range`, `x:y`, nor `x:y:z` seems to allow creating a range containing `-0.0`: the only ways I've found are `UnitRange(-0.0, 1.0)` and `LinRange(-1.0, -0.0, 2)`. This is so unlikely that I've added a check in the constructor to throw an error if we encounter such a case, so that people can fix it or report it instead of silently getting incorrect results. For standard range types we could automatically substitute a new range with `-0.0` replaced by `0.0`, but I'm not sure it's worth it.
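A REPL sketch of those cases (outputs as I'd expect them, worth double-checking on your Julia version):

```julia
julia> first(-0.0:1.0)               # colon ranges normalize the start to 0.0
0.0

julia> first(UnitRange(-0.0, 1.0))   # but an explicit UnitRange keeps -0.0
-0.0

julia> last(LinRange(-1.0, -0.0, 2)) # as does LinRange with a -0.0 endpoint
-0.0
```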

src/hist.jl Outdated
# so check the former just in case as it is cheap
foreach(edges) do e
e isa AbstractRange &&
(isequal(first(e), -0.0) || isequal(last(e), -0.0)) &&
Contributor

Why not `any(isequal(-0.0), e)`? This would be a bit safer, and the cost is negligible, I think.

@nalimilan (Member Author)

Yeah, I guess in normal use the number of bins is much lower than the number of observations, so checking all of them has a negligible cost. I've pushed a commit to do that.
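For concreteness, a hedged sketch of what the constructor-side check could look like after this suggestion (helper name hypothetical; see the PR diff for the actual code):

```julia
# Reject ranges that carry a -0.0 edge, since zero normalization is skipped
# for ranges to keep the fast searchsorted* path.
function _check_edges(edges)
    foreach(edges) do e
        e isa AbstractRange && any(isequal(-0.0), e) &&
            throw(ArgumentError("ranges containing -0.0 as an edge are not supported; " *
                                "use 0.0 instead"))
    end
end
```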

Successfully merging this pull request may close these issues.

Histogram dropping values when dealing with signed zero