Skip to content

add percentile and p25 and p75 to describe #1060

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Feb 18, 2025
Merged

Conversation

AndreiKingsley
Copy link
Collaborator

  • closes Add 25 and 75 percentiles in describe #1054.
  • add percentile function (similar to median)
  • add percentile site docs
  • add simple percentile unit tests
  • add p25 and p75 statistics to describe
  • add describe KDocs
  • update describe site docs
  • update describe tests

@zaleslaw zaleslaw requested review from zaleslaw and Jolanrensen and removed request for zaleslaw and Jolanrensen February 13, 2025 17:37
@Jolanrensen Jolanrensen requested review from Jolanrensen and zaleslaw and removed request for Jolanrensen, zaleslaw and koperagen February 14, 2025 12:03
Copy link
Collaborator

@Jolanrensen Jolanrensen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot! good addition :)

desc.ipynb Outdated
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should not be here, right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

* such as `mean` and `std` will return `null`. If a column is not [Comparable],
* percentile values (`min`, `p25`, `median`, `p75`, `max`) will also return `null`.
*/
internal interface SummaryMetrics
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice :D

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can mark this documentation interface as @ExcludeFromSources, just like the interface Describe below, since their contents are only included in other places.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh btw, probably link to ColumnDescription so users can click on it and explore the DataSchema result.

*
* This function provides a statistical summary for each column, including its type, count, uniqueness,
* missing values, most frequent values, and statistical measures if applicable.
* It automatically traverses nested column groups to include all non-grouped columns in the summary.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this sentence

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

public fun <T> DataColumn<T>.describe(): DataFrame<ColumnDescription> = describeImpl(listOf(this))

// endregion

// region DataFrame

/**
* {@include [Describe]}
* (
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's a "(" that doesn't do anything

public fun <T> DataFrame<T>.describe(): DataFrame<ColumnDescription> =
describe {
colsAtAnyDepth { !it.isColumnGroup() }
}

/**
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very nice usage of the template :D


public fun AnyRow.rowPercentileOrNull(percentile: Double): Any? =
Aggregators.percentile(percentile).aggregate(
values().filterIsInstance<Comparable<Any?>>().toValueColumn(),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will break when you have a String and Number column for instance. While they both are Comparable, they are not comparable to each other. I see we make the same mistake in rowMin etc. (it's noted). It's probably best to filter for Number values, like rowMean.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add an unified solution for all these functions

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure! I'd probably go with a route like this, similar to describeImpl with convertToComparableOrNull:

  • First filter for all Comparable<*> values in the row
  • If all these values are comparable to each other, run the statistic function on it
  • If not, filter for Number values. These are not directly comparable to each other but can be made comparable by converting them to Double/BigDecimal, and thus all statistics functions will work. Though, then we skip any other potentially comparable values, like String.
  • Else return null

I think this would produce the most expected behavior. wdyt?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks nice! But let's do it in a new PR for all statistical functions.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure! Then I'd make this Number for now until we fix it for all

}
return list[0]
}
internal inline fun <reified T : Comparable<T>> Iterable<T?>.median(type: KType): T? = percentile(50.0, type)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice

import kotlin.reflect.KType
import kotlin.reflect.typeOf

public inline fun <reified T : Comparable<T>> Iterable<T>.percentileOrNull(percentile: Double): T? =
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think DataFrame should have these functions public. It pollutes the scope on Iterables and it's outside the scope of DataFrame to offer these to users. They should be part of a statistics library if you want them publicly accessible :). (mentioned here #961)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, that's true. Let's get rid of all of them In the other PR?

internal inline fun <reified T : Comparable<T>> Iterable<T?>.q3(type: KType): T? = percentile(75.0, type)

@PublishedApi
internal inline fun <reified T : Comparable<T>> Iterable<T?>.percentile(percentile: Double, type: KType): T? {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's fine for now to return the same type it receives, as that's what median did too. However, I'll likely change it pretty soon according to this table: #961

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we could reconsider this logic, I add all of this in accordance with existing median

- **`max`** — The maximum value in the column.

For non-numeric columns, statistical metrics
such as `mean` and `std` will return `null`. If a column is not `Comparable`,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe specify it like "if the values in a column are incomparable" which is closer to the truth. A column can be Comparable<Nothing> but then these statistics will fail.

2, 4, 10,
7, 7, 1,
)
df.mapToColumn("", Infer.Type) { it.rowPercentile(25.0) } shouldBe columnOf(1, 2, 1)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw. Do you think it would be a good idea to offer an enum? Like Percentile.Q1, Percentile.Q2, etc.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or probably an object with val Q1 = 0.25, etc.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, it could be a good idea, but for statistics library :). I believe simple percentile is enough for now.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

true


```kotlin
df.percentile(25.0)
df.age.percentile(25.0)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it possible to embed a table here, like in valueColumns https://kotlin.github.io/dataframe/valuecounts.html

Other statistics are poorly covered with examples, should be improved

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could review all statistical functions documentation.

@AndreiKingsley AndreiKingsley merged commit ead907e into master Feb 18, 2025
3 checks passed
@AndreiKingsley AndreiKingsley deleted the percentile branch February 18, 2025 10:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add 25 and 75 percentiles in describe
3 participants