Skip to content

Add support for DataFrame sum operation with tests #1148

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

zaleslaw
Copy link
Collaborator

No description provided.

Introduced the `sum` operation for DataFrames, supporting numerical columns aggregation. Updated relevant tests and added new test cases to verify functionality. Included schema modifications for handling numerical column operations.
Converted various internal classes, interfaces, and functions related to aggregation into public entities. This change expands their visibility, enabling external usage and facilitating integration with other modules or libraries.
…re compatibility and correctness in sum calculations.
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds support for the DataFrame sum operation along with comprehensive tests covering various summation scenarios and aggregation handlers. Key changes include:

  • Adding a new sum operation test in both generated tests and dedicated test data for verifying correct summation.
  • Extending the aggregation framework with new Sum0 and Sum1 implementations.
  • Changing visibility modifiers from internal to public in several core aggregator components.

Reviewed Changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
plugins/kotlin-dataframe/tests-gen/org/jetbrains/kotlin/fir/dataframe/DataFrameBlackBoxCodegenTestGenerated.java Adds a new test method for the sum operation
plugins/kotlin-dataframe/testData/box/sum.kt Introduces test scenarios for sum over all, selective, and expression-based columns
plugins/kotlin-dataframe/src/org/jetbrains/kotlinx/dataframe/plugin/loadInterpreter.kt Updates load interpreter to support new Sum aggregators
plugins/kotlin-dataframe/src/org/jetbrains/kotlinx/dataframe/plugin/impl/api/statistics.kt Adds implementations for summation aggregators
core/src/test/kotlin/org/jetbrains/kotlinx/dataframe/api/statistics.kt Expands tests for groupBy operations to include new numerical columns
core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/api/sum.kt Annotates and exposes the new sum APIs
Other core aggregator files Adjusts visibility and typing to support public sum operation API
Comments suppressed due to low confidence (1)

core/src/test/kotlin/org/jetbrains/kotlinx/dataframe/api/statistics.kt:571

  • In the groupBy maxBy test, the column names are checked using 'res4' while the aggregation result is later accessed from 'res5'. This inconsistency might lead to erroneous test behavior; ensure the same result object is used throughout the test.
res4.columnNames() shouldBe listOf("city", "name", "age", "weight", "height", "yearsToRetirement", "workExperienceYears", "dependentsCount", "annualIncome")

zaleslaw and others added 2 commits April 23, 2025 18:39
val kClass = this.classifier as? KClass<*> ?: return null
val classId = kClass.toClassId() ?: return null

return ConeClassLikeTypeImpl(
Copy link
Collaborator

@koperagen koperagen Apr 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's FirSession available inside Arguments scope to access primitive types

fun Arguments.test() {
session.builtinTypes.doubleType.coneType
}

For class id, there's StandardClassIds
And, if needed, there's ClassId.constructClassLikeType instead of calling ConeClassLikeTypeImpl directly
Please use these APIs

@@ -233,6 +233,12 @@ public void testFlexibleReturnType() {
runTest("testData/box/flexibleReturnType.kt");
}

@Test
@TestMetadata("sum.kt")
Copy link
Collaborator

@koperagen koperagen Apr 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After you create new file in testData, please use this task to update this
image

val res11 = personsDf.sumFor { dependentsCount }
val sum111: Int? = res11.dependentsCount

// scenario #2: sum of all values in two columns of Int hierarchy of types
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's remove scenario 2 from plugin tests for DataFrame.sum? It returns Number so no schema changes

val inputHandler: AggregatorInputHandler<Value, Return>,
val multipleColumnsHandler: AggregatorMultipleColumnsHandler<Value, Return>,
val name: String,
public class Aggregator<in Value : Any, Return : Any?>(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why'd you remove out? it should only appear in the out-position of the aggregation, at least, that's the idea

@@ -31,7 +31,7 @@ import kotlin.reflect.KType
* If not supplied, the handler of the first step is reused.
* @see [FlatteningMultipleColumnsHandler]
*/
internal class TwoStepMultipleColumnsHandler<in Value : Any, out Return : Any?>(
internal class TwoStepMultipleColumnsHandler<in Value : Any, Return : Any?>(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here with out, though, I could understand the removal here because Return is also used as input for stepTwo

@@ -31,7 +30,7 @@ internal class AggregatorOptionSwitch1<in Param1, in Value : Any, out Return : A
* MyAggregator.Factory(param1)
* }
*/
fun <Param1, Value : Any, Return : Any?> Factory(
internal fun <Param1, Value : Any, Return : Any?> Factory(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we make everything public, the factory can also be public imo

// scenario #0: all numerical columns
val res0 = personsDf.sum()

val sum01: Int? = res0.age
Copy link
Collaborator

@Jolanrensen Jolanrensen Apr 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sum of a column is never nullable, right?

Copy link
Collaborator

@koperagen koperagen Apr 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, and i think it can't be helped

Copy link
Collaborator

@koperagen koperagen Apr 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alternatively, do we really return nullable int for an empty dataframe? maybe we should throw an exception instead?
but we should be careful about how it will affect GroupBy.sum

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #961

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For min/max/median/percentile, the result is nullable however, but only null if the input is empty.

@koperagen, is there a compile-time check for when a DF is empty or not? Otherwise we could generate the result-type more tightly.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no such check

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alternatively, do we really return nullable int for an empty dataframe? maybe we should throw an exception instead? but we should be careful about how it will affect GroupBy.sum

No we don't, the result of sum is never null, the sum of an empty Int column is 0. Even if we have a column of type Nothing? we have a non-null return type, defaulting to 0.0.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no such check

Okay then for those we can follow what I wrote here #1148 (comment)

): SimpleCol {
val originalType = column.type.type
val inputKType = originalType.toKType()
val resultKType = inputKType?.let { statisticAggregator.calculateReturnType(it, emptyInput = false) }
Copy link
Collaborator

@Jolanrensen Jolanrensen Apr 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might cause issues for min/max/median/percentile to set emptyInput = false. I'd set it to true. The name emptyInput is also kind of a misnomer, it should be "inputPotentiallyEmpty" or something.

It changes intCol.min() from Int (when it's set to false) to Int? (when it's set to true).

For intCol.sum() the result will always be Int, regardless of what it's set to.

Feel free to rename the argument everywhere it's used :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants