Description
Continuation of #558, which fixed the most annoying bugs related to `describe`.
See #558 for more information.
Our statistics functions need some more love. We used to have many missing types (mostly fixed by #937), but there are still some inconsistencies to be solved:
- As mentioned in #543, some functions like `median(ints)` might return an unexpectedly rounded `Int`. It might be better to let all functions return `Double` and handle `BigInteger`/`BigDecimal` separately for now, as they're Java-specific.
- There are plenty of public overloads on `Iterable` and `Sequence`. It's fine to have them internally, but I feel like we're clogging the public scope here. `mean`, for instance, is already covered in the stdlib (as `average`).
- We'll need to hide public functions that are not on `DataColumn`, as @AndreiKingsley will probably make a statistics library for that anyway.
- We need to honor some conversion table (see below).
- We won't support `UByte`, `UShort`, `UInt`, and `ULong` since they don't inherit `Number`.
- We also drop support for `BigNumber` and `BigDecimal`, as this makes generic typing and conversion very difficult and unpredictable.
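
The rounding concern around `median(ints)` can be illustrated with a minimal sketch. The helpers `medianRoundedToInt` and `medianAsDouble` are hypothetical names invented for this example and are not the library's API; they only contrast an `Int`-returning median with a `Double`-returning one. The last lines of `main` show that a boxed unsigned value is not a `Number`.

```kotlin
// Hypothetical helpers for illustration only; not part of the DataFrame API.

// Median that stays in Int: the mean of the two middle values is truncated.
fun medianRoundedToInt(values: List<Int>): Int? {
    if (values.isEmpty()) return null
    val sorted = values.sorted()
    val mid = sorted.size / 2
    return if (sorted.size % 2 == 1) sorted[mid] else (sorted[mid - 1] + sorted[mid]) / 2
}

// Median that always returns Double, so the even-sized case keeps its fraction.
fun medianAsDouble(values: List<Int>): Double? {
    if (values.isEmpty()) return null
    val sorted = values.sorted()
    val mid = sorted.size / 2
    return if (sorted.size % 2 == 1) {
        sorted[mid].toDouble()
    } else {
        (sorted[mid - 1].toDouble() + sorted[mid].toDouble()) / 2.0
    }
}

fun main() {
    val ints = listOf(1, 2, 3, 4)
    println(medianRoundedToInt(ints)) // 2
    println(medianAsDouble(ints))     // 2.5

    // Unsigned types do not inherit Number.
    val boxed: Any = 1u
    println(boxed is Number)          // false
}
```

With an even number of elements, the `Int` variant silently drops the `.5`, which is the surprise described in the first bullet.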
Progress:
- underlying fixes: Aggregator implementation rework #1078
- median: Median overhaul #1122
- percentile: Percentile #1149
- cumSum: CumSum #1152
| Function | Conversion | Extra information | Nulls in input |
|---|---|---|---|
| mean | Int -> Double | For all: Double.NaN if no elements | All nulls are filtered out |
| | Short -> Double | | |
| | Byte -> Double | | |
| | Long -> Double | | |
| | Double -> Double | skipNaN option, false by default | |
| | Float -> Double | skipNaN option, false by default | |
| | Number -> Conversion(Common number type) -> Double | skipNaN option, false by default | |
| | Nothing / no values -> Double.NaN | | |
| sum | Int -> Int | All default to zero if no values | All nulls are filtered out |
| | Short -> Int | | |
| | Byte -> Int | | |
| | Long -> Long | | |
| | Double -> Double | skipNaN option, false by default | |
| | Float -> Float | skipNaN option, false by default | |
| | Number -> Conversion(Common number type) -> Number | skipNaN option, false by default | |
| | Nothing / no values -> Double (0.0) | | |
| cumSum | Int -> Int | All default to zero if no values | All can optionally skip nulls in input with skipNull option, true by default |
| | Short -> Int | important because order matters with cumSum | |
| | Byte -> Int | | |
| | Long -> Long | | |
| | Double -> Double | skipNaN option, true by default | |
| | Float -> Float | skipNaN option, true by default | |
| | Number -> Conversion(Common number type) -> Number | skipNaN option, true by default | |
| | Nothing / no values -> Double (0.0) | | |
| min/max | T -> T? where T : Comparable<T> | For all: null if no elements, has -OrNull overloads | All nulls are filtered out |
| | Int -> Int? | | |
| | Short -> Short? | | |
| | Byte -> Byte? | | |
| | Long -> Long? | | |
| | Double -> Double? | skipNaN option, false by default, returns NaN when in the input | |
| | Float -> Float? | skipNaN option, false by default, returns NaN when in the input | |
| | Would need more overloads and more work | | |
| | Nothing / no values -> Nothing? (null) | | |
| median/percentile | T -> T? where T : Comparable<T> | For all: median of even list will cause conversion to Double if possible, else lower middle | All nulls are filtered out |
| | Int -> Double? | null if no elements | |
| | Short -> Double? | | |
| | Byte -> Double? | | |
| | Long -> Double? | | |
| | Double -> Double? | | |
| | Float -> Double? | | |
| | Would need more overloads and more work | | |
| | Nothing / no values -> Nothing? (null) | | |
| std | Int -> Double | All have DDoF (Delta Degrees of Freedom) argument | All nulls are filtered out |
| | Short -> Double | and Double.NaN if no elements | |
| | Byte -> Double | | |
| | Long -> Double | | |
| | Double -> Double | skipNaN option, false by default | |
| | Float -> Double | skipNaN option, false by default | |
| | Number -> Conversion(Common number type) -> Double | skipNaN option, false by default | |
| | Nothing / no values -> Double.NaN | | |
| var (want to add?) | same as std | | |
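
To make a few rows of the table concrete, here is a minimal sketch of how the `mean` and `sum` rules could read in plain Kotlin. The function names `meanOrNaN` and `sumOfShorts` are hypothetical and chosen for this example only; the actual aggregator implementation may look quite different. The sketch assumes nulls are always filtered out, `skipNaN` defaults to `false`, `mean` yields `Double.NaN` for empty input, and `Short` values sum to `Int`.

```kotlin
// Hypothetical helpers sketching the table's rules; not the library implementation.

// mean: nulls are filtered out, NaN is skipped only when skipNaN = true,
// and an input without usable values yields Double.NaN.
fun meanOrNaN(values: Iterable<Double?>, skipNaN: Boolean = false): Double {
    var sum = 0.0
    var count = 0
    for (v in values) {
        if (v == null) continue
        if (skipNaN && v.isNaN()) continue
        sum += v
        count++
    }
    return if (count == 0) Double.NaN else sum / count
}

// sum: Short -> Int, nulls are filtered out, zero if there are no values.
fun sumOfShorts(values: Iterable<Short?>): Int {
    var total = 0
    for (v in values) {
        if (v != null) total += v.toInt()
    }
    return total
}

fun main() {
    println(meanOrNaN(listOf(1.0, null, 3.0)))                  // 2.0
    println(meanOrNaN(listOf(1.0, Double.NaN)))                 // NaN
    println(meanOrNaN(listOf(1.0, Double.NaN), skipNaN = true)) // 1.0
    println(meanOrNaN(emptyList()))                             // NaN
    println(sumOfShorts(listOf(1.toShort(), null, 2.toShort()))) // 3
    println(sumOfShorts(emptyList()))                           // 0
}
```

Widening `Short` and `Byte` sums to `Int` mirrors what Kotlin arithmetic does on those types anyway, which may be why the table picks `Int` as the result type there.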
Activity
`Number` column #558
Jolanrensen commented on Jan 22, 2025
Also see #961
`median` is broken for "mixed" number types #566
Jolanrensen commented on Feb 14, 2025
Check all `AnyRow.rowXXX` functions, like `rowMean`, `rowMin`, etc. The current definition of `rowMin`, for instance, will break if you have a Number and a String column in your row. While they are both `Comparable`, they are not comparable to each other. We probably need to expand the `interComparableColumns()` or `valuesAreComparable()` function for these cases.
AndreiKingsley commented on Feb 14, 2025
#1060 adds `percentile`, which is similar to all these functions and inherits all the above problems. After the merge, we will have to fix all of this for it as well.
Iterable #1065
Jolanrensen commented on Feb 18, 2025
I've adjusted the table. We can support auto-conversion of mixed number types to `Double`, except when a `BigInteger` or `BigDecimal` is among the values. Converting a big number to `Double` is lossy and can result in infinities. Best to throw an exception and tell users to first convert all their values to `BigDecimal` and then call the `BigDecimal -> BigDecimal` overload of the function.
`rowSum()` breaks for `Int` + `Float` #1068
Jolanrensen commented on Feb 19, 2025
#1068
Jolanrensen commented on Apr 9, 2025
#1121
Jolanrensen commented on Apr 22, 2025
For the compiler plugin, `calculateReturnType` etc. should be public for Aggregators. Need to decide what to do with this.
Jolanrensen commented on Apr 22, 2025
We also need to decide (but probably later) whether all overloads on DataFrame-like objects that take multiple columns and generate a single value should be kept or removed. It's for cases like `df.sum { age1 and age2 } -> Int` etc.
`sum` operation with tests #1148
Jolanrensen commented on Apr 27, 2025
Everything is merged :D
Now all that needs to be done is a revisit of the (K)Docs