-
Notifications
You must be signed in to change notification settings - Fork 71
Add support for DataFrame sum
operation with tests
#1148
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Introduced the `sum` operation for DataFrames, supporting numerical columns aggregation. Updated relevant tests and added new test cases to verify functionality. Included schema modifications for handling numerical column operations.
Converted various internal classes, interfaces, and functions related to aggregation into public entities. This change expands their visibility, enabling external usage and facilitating integration with other modules or libraries.
…re compatibility and correctness in sum calculations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds support for the DataFrame sum operation along with comprehensive tests covering various summation scenarios and aggregation handlers. Key changes include:
- Adding a new sum operation test in both generated tests and dedicated test data for verifying correct summation.
- Extending the aggregation framework with new Sum0 and Sum1 implementations.
- Changing visibility modifiers from internal to public in several core aggregator components.
Reviewed Changes
Copilot reviewed 16 out of 16 changed files in this pull request and generated 1 comment.
Show a summary per file
File | Description |
---|---|
plugins/kotlin-dataframe/tests-gen/org/jetbrains/kotlin/fir/dataframe/DataFrameBlackBoxCodegenTestGenerated.java | Adds a new test method for the sum operation |
plugins/kotlin-dataframe/testData/box/sum.kt | Introduces test scenarios for sum over all, selective, and expression-based columns |
plugins/kotlin-dataframe/src/org/jetbrains/kotlinx/dataframe/plugin/loadInterpreter.kt | Updates load interpreter to support new Sum aggregators |
plugins/kotlin-dataframe/src/org/jetbrains/kotlinx/dataframe/plugin/impl/api/statistics.kt | Adds implementations for summation aggregators |
core/src/test/kotlin/org/jetbrains/kotlinx/dataframe/api/statistics.kt | Expands tests for groupBy operations to include new numerical columns |
core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/api/sum.kt | Annotates and exposes the new sum APIs |
Other core aggregator files | Adjusts visibility and typing to support public sum operation API |
Comments suppressed due to low confidence (1)
core/src/test/kotlin/org/jetbrains/kotlinx/dataframe/api/statistics.kt:571
- In the groupBy maxBy test, the column names are checked using 'res4' while the aggregation result is later accessed from 'res5'. This inconsistency might lead to erroneous test behavior; ensure the same result object is used throughout the test.
res4.columnNames() shouldBe listOf("city", "name", "age", "weight", "height", "yearsToRetirement", "workExperienceYears", "dependentsCount", "annualIncome")
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
val kClass = this.classifier as? KClass<*> ?: return null | ||
val classId = kClass.toClassId() ?: return null | ||
|
||
return ConeClassLikeTypeImpl( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's FirSession available inside Arguments scope to access primitive types
fun Arguments.test() {
session.builtinTypes.doubleType.coneType
}
For class id, there's StandardClassIds
And, if needed, there's ClassId.constructClassLikeType
instead of calling ConeClassLikeTypeImpl
directly
Please use these APIs
@@ -233,6 +233,12 @@ public void testFlexibleReturnType() { | |||
runTest("testData/box/flexibleReturnType.kt"); | |||
} | |||
|
|||
@Test | |||
@TestMetadata("sum.kt") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
val res11 = personsDf.sumFor { dependentsCount } | ||
val sum111: Int? = res11.dependentsCount | ||
|
||
// scenario #2: sum of all values in two columns of Int hierarchy of types |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's remove scenario 2 from plugin tests for DataFrame.sum? It returns Number so no schema changes
val inputHandler: AggregatorInputHandler<Value, Return>, | ||
val multipleColumnsHandler: AggregatorMultipleColumnsHandler<Value, Return>, | ||
val name: String, | ||
public class Aggregator<in Value : Any, Return : Any?>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why'd you remove out
? it should only appear in the out-position of the aggregation, at least, that's the idea
@@ -31,7 +31,7 @@ import kotlin.reflect.KType | |||
* If not supplied, the handler of the first step is reused. | |||
* @see [FlatteningMultipleColumnsHandler] | |||
*/ | |||
internal class TwoStepMultipleColumnsHandler<in Value : Any, out Return : Any?>( | |||
internal class TwoStepMultipleColumnsHandler<in Value : Any, Return : Any?>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here with out
, though, I could understand the removal here because Return
is also used as input for stepTwo
@@ -31,7 +30,7 @@ internal class AggregatorOptionSwitch1<in Param1, in Value : Any, out Return : A | |||
* MyAggregator.Factory(param1) | |||
* } | |||
*/ | |||
fun <Param1, Value : Any, Return : Any?> Factory( | |||
internal fun <Param1, Value : Any, Return : Any?> Factory( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we make everything public, the factory can also be public imo
// scenario #0: all numerical columns | ||
val res0 = personsDf.sum() | ||
|
||
val sum01: Int? = res0.age |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The sum of a column is never nullable, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, and i think it can't be helped
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
alternatively, do we really return nullable int for an empty dataframe? maybe we should throw an exception instead?
but we should be careful about how it will affect GroupBy.sum
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See #961
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For min/max/median/percentile, the result is nullable however, but only null if the input is empty.
@koperagen, is there a compile-time check for when a DF is empty or not? Otherwise we could generate the result-type more tightly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no such check
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
alternatively, do we really return nullable int for an empty dataframe? maybe we should throw an exception instead? but we should be careful about how it will affect GroupBy.sum
No we don't, the result of sum is never null, the sum of an empty Int column is 0. Even if we have a column of type Nothing?
we have a non-null return type, defaulting to 0.0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no such check
Okay then for those we can follow what I wrote here #1148 (comment)
): SimpleCol { | ||
val originalType = column.type.type | ||
val inputKType = originalType.toKType() | ||
val resultKType = inputKType?.let { statisticAggregator.calculateReturnType(it, emptyInput = false) } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might cause issues for min/max/median/percentile to set emptyInput = false
. I'd set it to true
. The name emptyInput
is also kind of a misnomer, it should be "inputPotentiallyEmpty" or something.
It changes intCol.min()
from Int
(when it's set to false) to Int?
(when it's set to true).
For intCol.sum()
the result will always be Int
, regardless of what it's set to.
Feel free to rename the argument everywhere it's used :)
No description provided.