Fix handling of -0.0 in histograms #768

Merged (3 commits) on Mar 26, 2022
Conversation

@nalimilan (Member)

`searchsortedfirst` and `searchsortedlast` use `isless` for comparisons and therefore consider `-0.0` to be different from `0.0`. This means that these two values do not end up in the same bin when an edge is 0.
This does not make much sense statistically, but even worse, when an extreme edge is 0, `-0.0` is not counted at all.

Fix this by replacing `-0.0` with `0.0` before the search.
Closes #766.

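To illustrate the issue described above, a minimal REPL sketch (illustrative, not part of the PR):

```julia
julia> isless(-0.0, 0.0)   # isless orders -0.0 strictly before 0.0
true

julia> searchsortedlast([0.0, 1.0, 2.0], 0.0)   # 0.0 lands in the first bin
1

julia> searchsortedlast([0.0, 1.0, 2.0], -0.0)  # -0.0 falls before the first edge
0
```

An index of 0 is outside the valid bin range, so when the leftmost edge is `0.0`, an observation of `-0.0` is simply dropped.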
@nalimilan (Member Author)

Cc: @Moelf, @oschulz

@Moelf (Contributor) commented Feb 20, 2022

Are `-0.0` and `0.0` the only edge case? I think so. The only other weird thing that comes to mind is `-NaN` and `NaN`, but they can't be bin edges.

src/hist.jl Outdated
@@ -226,11 +226,17 @@ binindex(h::AbstractHistogram{T,1}, x::Real) where {T} = binindex(h, (x,))[1]
binindex(h::Histogram{T,N}, xs::NTuple{N,Real}) where {T,N} =
map((edge, x) -> _edge_binindex(edge, h.closed, x), h.edges, xs)

_normalize_zero(x::AbstractFloat) = isequal(x, -0.0) ? oftype(x, 0.0) : x
Contributor

Suggested change:
- _normalize_zero(x::AbstractFloat) = isequal(x, -0.0) ? oftype(x, 0.0) : x
+ _normalize_zero(x::AbstractFloat) = ifelse(isequal(x, -0.0), oftype(x, 0.0), x)

For performance? Also, I hope it's automatically inlined.

@nalimilan (Member Author)

AFAICT LLVM is smart enough that this gives exactly the same code.
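If you want to verify this claim locally, a quick sketch comparing the two candidate definitions (illustrative only, using stand-in names `f`/`g`):

```julia
using InteractiveUtils  # provides @code_llvm outside the REPL

f(x::AbstractFloat) = isequal(x, -0.0) ? oftype(x, 0.0) : x        # branching version
g(x::AbstractFloat) = ifelse(isequal(x, -0.0), oftype(x, 0.0), x)  # ifelse version

@code_llvm f(1.0)   # compare the emitted LLVM IR of both variants
@code_llvm g(1.0)
```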


_normalize_zero(x::AbstractFloat) = isequal(x, -0.0) ? oftype(x, 0.0) : x
_normalize_zero(x::Any) = x
Contributor

Suggested change:
- _normalize_zero(x::Any) = x
+ _normalize_zero(x) = x

@Moelf (Contributor) commented Feb 20, 2022

I thought we needed

`ifelse(isequal(x, -0.0), zero(x), x)`

for type stability of `_edge_binindex`?

@nalimilan (Member Author)

`oftype(x, 0.0)` is equivalent to `zero(x)`, but yeah, the latter is probably nicer.
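For example, both spellings produce a positive zero of the argument's type (REPL sketch):

```julia
julia> oftype(-0.0f0, 0.0), zero(-0.0f0)   # both yield a Float32 positive zero
(0.0f0, 0.0f0)
```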

Do you think the spirit of this PR is right? Have you encountered this problem in FHist.jl too?

@nalimilan (Member Author)

Are `-0.0` and `0.0` the only edge case? I think so. The only other weird thing that comes to mind is `-NaN` and `NaN`, but they can't be bin edges.

Yes, I think zero is the only tricky case apart from `NaN`. Currently, `NaN`s in the data trigger an error if you don't specify edges manually, and if you do, they are just ignored, which is dangerous and differs from what we do elsewhere. They are also accepted as edges, which seems absurd. Maybe we should disallow this in another PR.

@nalimilan (Member Author)

Unfortunately, passing `by=_normalize_zero` seems to incur a 30% performance penalty on a random `Vector{Float64}`. It would be better to check edges in the `Histogram` constructor, but given that the edges vector is passed by the caller, mutating it to replace `-0.0` with `0.0` could be problematic. Maybe we should just print a deprecation warning if edges contain `-0.0`, and later throw an error telling the caller to fix this. This is a relatively unlikely case, I guess.

@Moelf (Contributor) commented Feb 20, 2022

Do you think the spirit of this PR is right?

Yeah, totally. For physics we don't run into `-0.0` much, and we also don't care about losing a few events (they are weighted to 0.0001 or something).

Unfortunately, passing `by=_normalize_zero` seems to incur a 30% performance penalty on a random `Vector{Float64}`.

Yeah, I don't completely understand the performance model of `searchsorted*`. The way we have histograms implemented, with recursive pushing and `map()` to find dimensions etc., may also be a problem when everything is compiled together: I'd expect performance to be much better if we simply had a for loop with `searchsorted*` inlined, together with the `by=` in the loop body.
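A rough sketch of the loop shape described above (purely illustrative; `simple_hist` is a hypothetical stand-in, not code from StatsBase.jl or FHist.jl):

```julia
# Fill a 1-D histogram with a plain loop, normalizing -0.0 before the search
# so it lands in the same bin as 0.0 (left-closed bins, last edge excluded).
function simple_hist(xs::AbstractVector{<:AbstractFloat},
                     edges::AbstractVector{<:AbstractFloat})
    counts = zeros(Int, length(edges) - 1)
    for x in xs
        x = ifelse(isequal(x, -0.0), zero(x), x)
        i = searchsortedlast(edges, x)
        1 <= i <= length(counts) && (counts[i] += 1)
    end
    return counts
end

simple_hist(randn(10^4), -3.0:0.5:3.0)   # 12 bin counts
```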

@oschulz (Contributor) commented Feb 21, 2022

Thanks for this!

@nalimilan (Member Author)

I've found a solution, though it's unfortunately a bit complex. It turns out that the overhead of normalizing `-0.0` to `0.0` only appears when edges are a range, as passing `by` to `searchsortedfirst`/`searchsortedlast` forces using the fallback `AbstractVector` method. So I figured we can avoid normalizing zeros when edges are a range.
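For reference, a rough way to observe the difference (hypothetical benchmark sketch, not measurements from this PR; `norm_zero` stands in for `_normalize_zero`):

```julia
using BenchmarkTools  # assumed available

norm_zero(x) = ifelse(isequal(x, -0.0), zero(x), x)

edges = 0.0:0.01:1.0
xs = rand(10^6)

# Specialized method for ranges computes the index arithmetically:
@btime [searchsortedlast($edges, x) for x in $xs];

# Passing `by` falls back to the generic AbstractVector binary search:
@btime [searchsortedlast($edges, x, by=$norm_zero) for x in $xs];
```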

Luckily, it's almost impossible to construct a range which includes `-0.0`. Even `-0.0:1.0` starts with `0.0` rather than `-0.0`, so we never build a range containing `-0.0` when edges are omitted. Neither `range`, `x:y`, nor `x:y:z` seems to allow creating a range containing `-0.0`: the only ways I've found are `UnitRange(-0.0, 1.0)` and `LinRange(-1.0, -0.0, 2)`. This is so unlikely that I've added a check in the constructor to throw an error if we encounter such a case, so that people can fix it or report it instead of silently getting incorrect results. For standard range types we could automatically substitute a new range with `-0.0` replaced by `0.0`, but I'm not sure it's worth it.
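A REPL sketch of those cases (outputs as I'd expect them, worth double-checking on your Julia version):

```julia
julia> first(-0.0:1.0)               # colon ranges normalize the start to 0.0
0.0

julia> first(UnitRange(-0.0, 1.0))   # but an explicit UnitRange keeps -0.0
-0.0

julia> last(LinRange(-1.0, -0.0, 2)) # as does LinRange with a -0.0 endpoint
-0.0
```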

src/hist.jl Outdated
# so check the former just in case as it is cheap
foreach(edges) do e
e isa AbstractRange &&
(isequal(first(e), -0.0) || isequal(last(e), -0.0)) &&
Contributor

Why not `any(isequal(-0.0), e)`? This would be a bit safer, and the cost is negligible, I think.

@nalimilan (Member Author)

Yeah, I guess in normal use the number of bins is much lower than the number of observations, so checking all of them has a negligible cost. I've pushed a commit to do that.
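For concreteness, a hedged sketch of what the constructor-side check could look like after this suggestion (helper name hypothetical; see the PR diff for the actual code):

```julia
# Reject ranges that carry a -0.0 edge, since zero normalization is skipped
# for ranges to keep the fast searchsorted* path.
function _check_edges(edges)
    foreach(edges) do e
        e isa AbstractRange && any(isequal(-0.0), e) &&
            throw(ArgumentError("ranges containing -0.0 as an edge are not supported; " *
                                "use 0.0 instead"))
    end
end
```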

Successfully merging this pull request may close these issues.

Histogram dropping values when dealing with signed zero