-
Notifications
You must be signed in to change notification settings - Fork 73
add percentile and p25 and p75 to describe #1060
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot! good addition :)
desc.ipynb
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should not be here, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
* such as `mean` and `std` will return `null`. If a column is not [Comparable], | ||
* percentile values (`min`, `p25`, `median`, `p75`, `max`) will also return `null`. | ||
*/ | ||
internal interface SummaryMetrics |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice :D
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can mark this documentation interface as @ExcludeFromSources
, just like the interface Describe
below, since their contents are only included in other places.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh btw, probably link to ColumnDescription
so users can click on it and explore the DataSchema result.
* | ||
* This function provides a statistical summary for each column, including its type, count, uniqueness, | ||
* missing values, most frequent values, and statistical measures if applicable. | ||
* It automatically traverses nested column groups to include all non-grouped columns in the summary. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand this sentence
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
public fun <T> DataColumn<T>.describe(): DataFrame<ColumnDescription> = describeImpl(listOf(this)) | ||
|
||
// endregion | ||
|
||
// region DataFrame | ||
|
||
/** | ||
* {@include [Describe]} | ||
* ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there's a "(" that doesn't do anything
public fun <T> DataFrame<T>.describe(): DataFrame<ColumnDescription> = | ||
describe { | ||
colsAtAnyDepth { !it.isColumnGroup() } | ||
} | ||
|
||
/** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
very nice usage of the template :D
|
||
public fun AnyRow.rowPercentileOrNull(percentile: Double): Any? = | ||
Aggregators.percentile(percentile).aggregate( | ||
values().filterIsInstance<Comparable<Any?>>().toValueColumn(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will break when you have a String
and Number
column for instance. While they both are Comparable
, they are not comparable to each other. I see we make the same mistake in rowMin
etc. (it's noted). It's probably best to filter for Number
values, like rowMean
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add an unified solution for all these functions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure! I'd probably go with a route like this, similar to describeImpl
with convertToComparableOrNull
:
- First filter for all
Comparable<*>
values in the row - If all these values are comparable to each other, run the statistic function on it
- If not, filter for
Number
values. These are not directly comparable to each other but can be made comparable by converting them to Double/BigDecimal, and thus all statistics functions will work. Though, then we skip any other potentially comparable values, likeString
. - Else return
null
I think this would produce the most expected behavior. wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks nice! But let's do it in a new PR for all statistical functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure! Then I'd make this Number
for now until we fix it for all
} | ||
return list[0] | ||
} | ||
internal inline fun <reified T : Comparable<T>> Iterable<T?>.median(type: KType): T? = percentile(50.0, type) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice
import kotlin.reflect.KType | ||
import kotlin.reflect.typeOf | ||
|
||
public inline fun <reified T : Comparable<T>> Iterable<T>.percentileOrNull(percentile: Double): T? = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think DataFrame should have these functions public. It pollutes the scope on Iterables and it's outside the scope of DataFrame to offer these to users. They should be part of a statistics library if you want them publicly accessible :). (mentioned here #961)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, that's true. Let's get rid of all of them In the other PR?
internal inline fun <reified T : Comparable<T>> Iterable<T?>.q3(type: KType): T? = percentile(75.0, type) | ||
|
||
@PublishedApi | ||
internal inline fun <reified T : Comparable<T>> Iterable<T?>.percentile(percentile: Double, type: KType): T? { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's fine for now to return the same type it receives, as that's what median did too. However, I'll likely change it pretty soon according to this table: #961
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we could reconsider this logic, I add all of this in accordance with existing median
docs/StardustDocs/topics/describe.md
Outdated
- **`max`** — The maximum value in the column. | ||
|
||
For non-numeric columns, statistical metrics | ||
such as `mean` and `std` will return `null`. If a column is not `Comparable`, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe specify it like "if the values in a column are incomparable" which is closer to the truth. A column can be Comparable<Nothing>
but then these statistics will fail.
2, 4, 10, | ||
7, 7, 1, | ||
) | ||
df.mapToColumn("", Infer.Type) { it.rowPercentile(25.0) } shouldBe columnOf(1, 2, 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw. Do you think it would be a good idea to offer an enum? Like Percentile.Q1
, Percentile.Q2
, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or probably an object
with val Q1 = 0.25
, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, it could be a good idea, but for statistics library :). I believe simple percentile is enough for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
true
|
||
```kotlin | ||
df.percentile(25.0) | ||
df.age.percentile(25.0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it possible to embed a table here, like in valueColumns https://kotlin.github.io/dataframe/valuecounts.html
Other statistics are poorly covered with examples, should be improved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could review all statistical functions documentation.
# Conflicts: # core/api/core.api
describe
#1054.percentile
function (similar tomedian
)percentile
site docsdescribe
describe
KDocsdescribe
site docsdescribe
tests