Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fast double parser #935

Merged
merged 4 commits into from
Oct 31, 2024
Merged

Fast double parser #935

merged 4 commits into from
Oct 31, 2024

Conversation

Jolanrensen
Copy link
Collaborator

@Jolanrensen Jolanrensen commented Oct 28, 2024

Fixes #607

Created under csv umbrella issue: #827

Introduces FastDoubleParser as optional alternative double parser. It can be enabled as experimental option (for now) in ParserOptions.

The configurable version of this double parser was created for us :) and the way I implemented it, it can solve some of the locale-fallback-issues we were facing before.
Check the tests I created in this PR for more some examples.

It's also faster than NumberFormat!

FastDoubleParser is also used by Deephaven CSV, which helps with our new csv implementation.

@Jolanrensen Jolanrensen added csv CSV / delim related issues bug Something isn't working performance Something related to how fast the library can handle data labels Oct 28, 2024
This was referenced Oct 28, 2024
@Jolanrensen Jolanrensen marked this pull request as ready for review October 28, 2024 14:04
@@ -19,24 +19,31 @@ import kotlin.annotation.AnnotationTarget.VALUE_PARAMETER
* {@include [Indent]}
*
*/
@ExcludeFromSources
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These documentation interfaces can be excluded from the sources as they are only ever accessed from kdoc.

@Jolanrensen Jolanrensen self-assigned this Oct 30, 2024
private val logger = KotlinLogging.logger {}

// (lowercase) strings that are recognized to represent infinity and NaN in doubles in all locales
private val INFINITIES = arrayOf("∞", "inf", "infinity", "infty")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How did you collect this information? Are some standards somewhere available?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already had "inf" and "infinity". I added "infty" because it's used in the math world, like LaTeX, and finally, I supposed the infinity symbol itself should also be recognizable. Any others we should add?

*
* If [ParserOptions.useFastDoubleParser] is enabled, it will try to parse the input with an _EXPERIMENTAL_
* fast double parser, [FastDoubleParser](https://github.com/wrandelshofer/FastDoubleParser).
* If not, or if it fails, it will use [NumberFormat] to parse the input.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great approach!

.withPlusSign(setOf('+'))
.withDecimalSeparator(DecimalFormatSymbols::getDecimalSeparator.fromLocalWithFallBack())
.withGroupingSeparator(DecimalFormatSymbols::getGroupingSeparator.fromLocalWithFallBack())
.withExponentSeparator(DecimalFormatSymbols::getExponentSeparator.fromLocalWithFallBack())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does it mean set(get.fromLocalWithFallBack) construction?

Copy link
Collaborator Author

@Jolanrensen Jolanrensen Oct 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can read it as a sentence :) "::getDecimalSeparator.fromLocalWithFallBack()". This means it calls localDecimalFormatSymbols.getDecimalSeparator() first and then adds the fallback character from fallbackDecimalFormatSymbols.getDecimalSeparator(). All these characters are then provided to the NumberFormatSymbols builder here :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made it work on the function reference so that the function can be called from different DecimalFormatSymbols instances

in NANS -> Double.NaN

else -> {
// not thread safe; must be created here
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is thread safety important here? Or this is a reason to create an Instance of Number Format?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deephaven holds a single instance of this FastDoubleParser class but can parse columns in parallel. Previously I put numberFormat outside the function, but this caused bugs, so I put it inside here and it works fine :). I put the warning here because you might assume you could save an object creation by putting val numberFormat = NumberFormat.getInstance(locale) inside the class and reusing it, but no, you cannot.

}
}.also {
if (it == null) {
logger.debug { "Could not parse '$this' as Double with NumberFormat with locale '$locale'." }
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably we could also add some logs about experimental fast double parsing like a marker for our logs

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

KLogger already provides a marker. It recognizes the file/class it's called in. So this will look something like: DEBUG org.jetbrains.kotlinx.dataframe.impl.io.FastDoubleParser - Could not parse 'hello' as Double with NumberFormat with locale 'en_US'.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can just search for "FastDoubleParser" :)

"100123.35",
"-204,235.23",
"1.234e3",
"3e-04", // failed with old double parser
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will it fail, if we disable fast parsing? Or they aligned now in the terms of supported functionality, locales?
And one just is faster than another?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nope, it will fail in the old one. We cannot make the old one behave the same, as it's not supported by NumberFormat.


@Test
fun `can fast parse german locale`() {
val parser = FastDoubleParser(ParserOptions(locale = Locale.GERMANY, useFastDoubleParser = true))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Vielen Dank im Voraus!


@Test
fun `can fast parse french locale`() {
val parser = FastDoubleParser(ParserOptions(locale = Locale.FRANCE, useFastDoubleParser = true))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started to think: could we write a test, which just generates checks for all possible locales and constants are obtained from Double values -> String -> parseDouble again?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"all" locales? XD I can try, but that's like hundreds of locales and may be system-dependent.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, there are some locales like Arabic where NumberFormat itself fails to add "-" in front of negative numbers :/ It's fixed in later Java versions, but not 8 yet.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay this was actually a really good idea! It made me catch two separate bugs which I reported to FastDoubleParser.

  • Minus signs in RTL languages, like Arabic, are (in Java 8) marked with the Arabic letter mark instead of, you know, a minus sign... FastDoubleParser cannot handle this, because the character is a format symbol, not a normal one. NumberFormat does handle it, so to align our behavior with theirs, I now manually check for format symbol minus signs and replace them with '-' before parsing.
  • In Finnish and other langauges, they use exponentSeparator='e' and nan="epäluku", which breaks as both contain the letter 'e'. I've reported this and luckily our fallback method catches it.

@Jolanrensen Jolanrensen requested a review from zaleslaw October 31, 2024 13:57
# Conflicts:
#	core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/util/deprecationMessages.kt
@Jolanrensen
Copy link
Collaborator Author

Thanks for the review :) I'll merge it for now so I can continue with the csv parser

@Jolanrensen Jolanrensen merged commit 903f58b into master Oct 31, 2024
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working csv CSV / delim related issues performance Something related to how fast the library can handle data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Incorrect parsing of CSV file. Locale is ignored for some columns
2 participants