add percentile and p25 and p75 to describe #1060

AndreiKingsley · 2025-02-13T12:08:46Z

Closes #1054 (Add 25 and 75 percentiles in describe).

- add percentile function (similar to median)
- add percentile site docs
- add simple percentile unit tests
- add p25 and p75 statistics to describe
- add describe KDocs
- update describe site docs
- update describe tests

Jolanrensen

Thanks a lot! good addition :)

Jolanrensen · 2025-02-14T12:05:00Z

desc.ipynb

this should not be here, right?

Jolanrensen · 2025-02-14T12:06:57Z

core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/api/describe.kt

+ * such as `mean` and `std` will return `null`. If a column is not [Comparable],
+ * percentile values (`min`, `p25`, `median`, `p75`, `max`) will also return `null`.
+ */
+internal interface SummaryMetrics


You can mark this documentation interface as @ExcludeFromSources, just like the interface Describe below, since their contents are only included in other places.

oh btw, probably link to ColumnDescription so users can click on it and explore the DataSchema result.

Jolanrensen · 2025-02-14T12:10:43Z

core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/api/describe.kt

+ *
+ * This function provides a statistical summary for each column, including its type, count, uniqueness,
+ * missing values, most frequent values, and statistical measures if applicable.
+ * It automatically traverses nested column groups to include all non-grouped columns in the summary.


I don't understand this sentence

Jolanrensen · 2025-02-14T12:11:11Z

core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/api/describe.kt

 public fun <T> DataColumn<T>.describe(): DataFrame<ColumnDescription> = describeImpl(listOf(this))

 // endregion

 // region DataFrame

+/**
+ * {@include [Describe]}
+ * (


there's a "(" that doesn't do anything

Jolanrensen · 2025-02-14T12:12:23Z

core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/api/describe.kt

 public fun <T> DataFrame<T>.describe(): DataFrame<ColumnDescription> =
    describe {
        colsAtAnyDepth { !it.isColumnGroup() }
    }

+/**


very nice usage of the template :D

Jolanrensen · 2025-02-14T12:26:48Z

core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/api/percentile.kt

+
+public fun AnyRow.rowPercentileOrNull(percentile: Double): Any? =
+    Aggregators.percentile(percentile).aggregate(
+        values().filterIsInstance<Comparable<Any?>>().toValueColumn(),


This will break when you have a String and Number column for instance. While they both are Comparable, they are not comparable to each other. I see we make the same mistake in rowMin etc. (it's noted). It's probably best to filter for Number values, like rowMean.

Let's add a unified solution for all these functions.

sure! I'd probably go with a route like this, similar to describeImpl with convertToComparableOrNull:

1. First filter for all Comparable<*> values in the row.
2. If all these values are comparable to each other, run the statistic function on them.
3. If not, filter for Number values. These are not directly comparable to each other but can be made comparable by converting them to Double/BigDecimal, and thus all statistics functions will work. Though, then we skip any other potentially comparable values, like String.
4. Else return null.

I think this would produce the most expected behavior. wdyt?

Looks nice! But let's do it in a new PR for all statistical functions.

sure! Then I'd make this Number for now until we fix it for all
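The fallback route proposed in this thread could look roughly like the sketch below. It is plain Kotlin with no DataFrame dependency. `rowStatisticOrNull` and `mutuallyComparable` are hypothetical names, "comparable to each other" is approximated as "all values share one runtime class", and numbers are converted to `Double` only (the thread also mentions `BigDecimal`). None of this is the library's actual implementation.

```kotlin
// Hypothetical helper: treats values as mutually comparable only when they all
// share one runtime class. This is an assumption made for the sketch.
fun mutuallyComparable(values: List<Comparable<*>>): Boolean =
    values.map { it::class }.distinct().size == 1

// Hypothetical sketch of the proposed fallback for row statistics.
@Suppress("UNCHECKED_CAST")
fun rowStatisticOrNull(
    rowValues: List<Any?>,
    statistic: (List<Comparable<Any>>) -> Any?,
): Any? {
    // Step 1: keep every Comparable value in the row (nulls are dropped by the type filter).
    val comparables = rowValues.filterIsInstance<Comparable<*>>()
    if (comparables.isEmpty()) return null

    // Step 2: if the values can be compared to each other, use them as they are.
    if (mutuallyComparable(comparables)) {
        return statistic(comparables as List<Comparable<Any>>)
    }

    // Step 3: otherwise keep only numbers and convert them to Double,
    // which makes them comparable but skips values such as String.
    val numbers: List<Double> = rowValues.filterIsInstance<Number>().map { it.toDouble() }
    if (numbers.isNotEmpty()) {
        return statistic(numbers as List<Comparable<Any>>)
    }

    // Step 4: nothing usable is left.
    return null
}
```

Under these assumptions, `rowStatisticOrNull(listOf(1, 2.5, "a")) { it.minOrNull() }` takes the numeric branch and ignores the `String`, while a row holding only a `String` and a `Char` yields `null` instead of throwing a `ClassCastException`.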

Jolanrensen · 2025-02-14T12:31:20Z

core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/math/median.kt

-    }
-    return list[0]
-}
+internal inline fun <reified T : Comparable<T>> Iterable<T?>.median(type: KType): T? = percentile(50.0, type)


Jolanrensen · 2025-02-14T12:34:32Z

core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/math/percentile.kt

+import kotlin.reflect.KType
+import kotlin.reflect.typeOf
+
+public inline fun <reified T : Comparable<T>> Iterable<T>.percentileOrNull(percentile: Double): T? =


I don't think DataFrame should have these functions public. It pollutes the scope on Iterables and it's outside the scope of DataFrame to offer these to users. They should be part of a statistics library if you want them publicly accessible :). (mentioned here #961)

yeah, that's true. Let's get rid of all of them in the other PR?

Jolanrensen · 2025-02-14T12:37:07Z

core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/math/percentile.kt

+internal inline fun <reified T : Comparable<T>> Iterable<T?>.q3(type: KType): T? = percentile(75.0, type)
+
+@PublishedApi
+internal inline fun <reified T : Comparable<T>> Iterable<T?>.percentile(percentile: Double, type: KType): T? {


It's fine for now to return the same type it receives, as that's what median did too. However, I'll likely change it pretty soon according to this table: #961

Yes, we could reconsider this logic. I added all of this in accordance with the existing median.
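For readers following along, here is a minimal sketch of a percentile that returns the same type it receives, with the median expressed as the 50th percentile as in the `median.kt` change above. It uses the nearest-rank rule, which is an assumption chosen for illustration; the PR excerpt does not show which estimation rule the library applies. `percentileSketch` and `medianSketch` are hypothetical names.

```kotlin
import kotlin.math.ceil

// Hypothetical sketch: nearest-rank percentile that always returns an element of the
// input, so the result has the same type as the values.
fun <T : Comparable<T>> Iterable<T?>.percentileSketch(percentile: Double): T? {
    require(percentile in 0.0..100.0) { "Percentile must be in 0..100, but was $percentile" }
    // Nulls are skipped before sorting.
    val sorted = filterNotNull().sorted()
    if (sorted.isEmpty()) return null
    // Nearest rank: the smallest 1-based rank that covers at least `percentile` percent of the values.
    val rank = ceil(percentile * sorted.size / 100.0).toInt().coerceAtLeast(1)
    return sorted[rank - 1]
}

// The median expressed as the 50th percentile.
fun <T : Comparable<T>> Iterable<T?>.medianSketch(): T? = percentileSketch(50.0)
```

Because the result is always one of the input elements, no averaging or interpolation happens, which is what keeps the return type equal to the element type. A rule that interpolates between neighbours would need a numeric result type instead.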

Jolanrensen · 2025-02-14T12:42:36Z

docs/StardustDocs/topics/describe.md

+- **`max`** — The maximum value in the column.
+
+For non-numeric columns, statistical metrics
+such as `mean` and `std` will return `null`. If a column is not `Comparable`,


maybe specify it like "if the values in a column are incomparable" which is closer to the truth. A column can be Comparable<Nothing> but then these statistics will fail.

Jolanrensen · 2025-02-14T13:39:21Z

core/src/test/kotlin/org/jetbrains/kotlinx/dataframe/statistics/percentile.kt

+            2, 4, 10,
+            7, 7, 1,
+        )
+        df.mapToColumn("", Infer.Type) { it.rowPercentile(25.0) } shouldBe columnOf(1, 2, 1)


Btw. Do you think it would be a good idea to offer an enum? Like Percentile.Q1, Percentile.Q2, etc.

or probably an object with val Q1 = 0.25, etc.

Well, it could be a good idea, but for a statistics library :). I believe a simple percentile is enough for now.

zaleslaw · 2025-02-14T17:39:32Z

docs/StardustDocs/topics/percentile.md

+
+```kotlin
+df.percentile(25.0)
+df.age.percentile(25.0)


Is it possible to embed a table here, like in valueCounts (https://kotlin.github.io/dataframe/valuecounts.html)?

Other statistics are poorly covered with examples, should be improved

We could review all statistical functions documentation.

# Conflicts: # core/api/core.api

AndreiKingsley added 3 commits February 13, 2025 02:42

add percentile and p25 and p75 to describe

6f5bc25

rollback SampleAggregator.kt

b07dec3

update generated sources

efbe105

zaleslaw requested review from zaleslaw and Jolanrensen and removed request for zaleslaw and Jolanrensen February 13, 2025 17:37

AndreiKingsley requested review from Jolanrensen, zaleslaw and koperagen February 13, 2025 18:25

fix code in describe.md

e4a26e9

Jolanrensen requested review from Jolanrensen and zaleslaw and removed request for Jolanrensen, zaleslaw and koperagen February 14, 2025 12:03

Jolanrensen requested changes Feb 14, 2025

AndreiKingsley added 2 commits February 14, 2025 17:01

remove desc.ipynb

b225aab

improve describe docs

ae49b41

Jolanrensen reviewed Feb 14, 2025

AndreiKingsley mentioned this pull request Feb 14, 2025

☂ Statistics streamlining #961

Open

9 tasks

AndreiKingsley requested a review from Jolanrensen February 14, 2025 14:50

zaleslaw reviewed Feb 14, 2025

zaleslaw approved these changes Feb 14, 2025

Jolanrensen approved these changes Feb 17, 2025

AndreiKingsley added 3 commits February 17, 2025 21:37

fix describe kdoc

8430df9

Merge branch 'master' into percentile

e833fde

# Conflicts: # core/api/core.api

fix p25 and p75

fd9c950

AndreiKingsley merged commit ead907e into master Feb 18, 2025
3 checks passed

AndreiKingsley deleted the percentile branch February 18, 2025 10:00

Jolanrensen mentioned this pull request Mar 5, 2025

Add support for percentiles #543

Closed

Jolanrensen mentioned this pull request Apr 9, 2025

Percentile/quantile estimation types #1121

Open
