JSON reading: unified numbers #1073


Merged
merged 9 commits into from
Feb 27, 2025

@Jolanrensen Jolanrensen commented Feb 21, 2025

Fixes #557
Helps #961 by preventing Number columns from appearing in the first place, and by making it easier to create columns with unified number types.
Also makes behavior more consistent with other reading options, such as CSV.

  • JSON can now be read as Float. This was not possible before, because double parsing was checked before float parsing; reading Float where possible can halve the memory usage of a DataFrame in many cases.
  • Introduces a unifyNumbers parameter to guessValueType and createColumnGuessingType.
  • Json reading now uses unifyNumbers = true
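
To illustrate what "unified number types" means here, the following is a minimal sketch using only the Kotlin standard library. `unifyNumbersSketch` is a hypothetical helper, not the library's implementation, and its promotion rules (Int + Long becomes Long, any mix involving Float or Double becomes Double) are assumptions chosen for illustration. It only considers Int, Long, Float and Double.

```kotlin
// Hypothetical helper, NOT the DataFrame implementation: converts a column of
// mixed numbers to a single common type so it does not end up typed as Number.
// Assumed rules: Int + Long -> Long; any mix involving Float/Double -> Double.
fun unifyNumbersSketch(values: List<Number?>): List<Number?> {
    val present = values.filterNotNull()
    val kinds = present.map { it::class }.toSet()
    val hasFloating = present.any { it is Float || it is Double }
    return when {
        kinds.size <= 1 -> values                    // already uniform, nothing to do
        !hasFloating -> values.map { it?.toLong() }  // only integral types present
        else -> values.map { it?.toDouble() }        // widen everything to Double
    }
}

fun main() {
    val mixed = listOf<Number?>(1, 2.5f, null, 3L)
    println(unifyNumbersSketch(mixed)) // [1.0, 2.5, null, 3.0]
}
```

Nulls are passed through untouched, so a nullable column stays nullable after unification.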

@Jolanrensen Jolanrensen marked this pull request as ready for review February 24, 2025 13:57
@Jolanrensen Jolanrensen added the enhancement New feature or request label Feb 24, 2025
@AndreiKingsley
Collaborator

@Jolanrensen maybe add unifyNumbers to readJson? I have had experience with APIs that return JSON with different numeric types inside one column 😒 (and it was important).

@Jolanrensen
Collaborator Author

Jolanrensen commented Feb 26, 2025

> @Jolanrensen maybe add unifyNumbers to readJson? I have had experience with APIs that return JSON with different numeric types inside one column 😒 (and it was important).

Definitely, that sounds like a good idea :) I will make it true by default, though. We generally want to avoid having Number columns unless it's absolutely necessary.

@Jolanrensen Jolanrensen requested review from zaleslaw and AndreiKingsley and removed request for zaleslaw and AndreiKingsley February 26, 2025 12:53
# Conflicts:
#	core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/util/deprecationMessages.kt
@@ -383,7 +410,7 @@ class JsonTests {
         ).alsoDebug("df:")

         val res = DataFrame.readJsonStr(df.toJson()).alsoDebug("res:")
-        res shouldBe df
+        res shouldBe df.convert { colsOf<Double?>() }.toFloat()
Collaborator

I see that we are reducing Double -> Float where possible, but why not Float -> Double, as the most common type?

Collaborator Author

There's no number unification happening in this test at all. We're not reducing Double to Float on purpose; however, because we write the Doubles 1.0 and 3.0 to JSON, they can be read back from JSON as Float. If any of the numbers were too large to fit in a Float, or if the column contained both floats and integers, the result would have been Double.

v.floatOrNull != null -> collector.add(v.float)

v.doubleOrNull != null -> collector.add(v.double)
Collaborator Author

@zaleslaw It's because of this change that we can now get floats out of json too instead of just doubles.
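
The effect of checking float before double can be sketched with the Kotlin standard library alone. `parseDecimalSketch` is a hypothetical stand-in, not the code from this PR, and the `isFinite()` check is an assumed way to detect literals that do not fit in a Float; the sketch also ignores precision loss for values that are representable in Double but not exactly in Float.

```kotlin
// Hypothetical stand-in for the branch order above, NOT the PR's code.
// Float is tried first; Double is only the fallback.
fun parseDecimalSketch(literal: String): Number? {
    val asFloat = literal.toFloatOrNull()
    // Assumption: a literal "fits" in a Float if parsing gives a finite value.
    if (asFloat != null && asFloat.isFinite()) return asFloat
    // Too large for Float (or not parseable as Float): fall back to Double.
    return literal.toDoubleOrNull()
}

fun main() {
    println(parseDecimalSketch("1.0") is Float)   // true
    println(parseDecimalSketch("1e40") is Double) // true, exceeds the Float range
    println(parseDecimalSketch("abc"))            // null
}
```

If the two checks were swapped, the Float branch would never be reached, because every literal that parses as a Float also parses as a Double.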

@Jolanrensen Jolanrensen merged commit 3674edd into master Feb 27, 2025
5 of 6 checks passed
Successfully merging this pull request may close these issues.

Handling of Number types can be unexpected
3 participants