Skip to content

Transformation into data class corrupts TimeZones of Instant #1047

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
martinitus opened this issue Jan 31, 2025 · 3 comments · Fixed by #1057
Closed

Transformation into data class corrupts TimeZones of Instant #1047

martinitus opened this issue Jan 31, 2025 · 3 comments · Fixed by #1057
Assignees
Labels
bug Something isn't working csv CSV / delim related issues

Comments

@martinitus
Copy link

Reproduction:

@Test
fun foo() {

    data class TimeZonesTest(val with_timezone_offset: Instant, val without_timezone_offset: LocalDateTime)

    val csvContent =
        """
        with_timezone_offset,without_timezone_offset
        2024-12-12T13:00:00+01:00,2024-12-12T13:00:00
        """.trimIndent()

    val df = DataFrame.readCsv(
        csvContent.byteInputStream(),
//            colTypes = mapOf("with_timezone_offset" to ColType.Instant)   // *1
//            parserOptions = ParserOptions(dateTimeFormatter = ISO_OFFSET_DATE_TIME), // *2
    )

    println(df)
    println(df.schema())
    val parsed = df.toListOf<TimeZonesTest>().first()
    assertEquals(Instant.parse("2024-12-12T13:00:00+01:00"), parsed.with_timezone_offset)
}

This outputs:

with_timezone_offset without_timezone_offset
0     2024-12-12T12:00        2024-12-12T13:00

with_timezone_offset: kotlinx.datetime.LocalDateTime
without_timezone_offset: kotlinx.datetime.LocalDateTime

org.opentest4j.AssertionFailedError: 
Expected :2024-12-12T12:00:00Z
Actual   :2024-12-12T11:00:00Z

Changing the dateTimeFormatter (*2) has no effect on the test outcome.
Explicitly telling the parser to parse as Instant (*1) fixes the issue.

However, following the principle of least surprise IMHO it would be A LOT better to:

  • either have the conversion fail because the colum was processed as LocalDateTime, hence has no timezone information, and hence cannot be converted into an instant.
  • or automagically detect, that there is a timezone in the data, and directly parse the column as Instant.
@martinitus
Copy link
Author

I guess I could figure out how to change the implementation to fail (probably boils down to remove the respective converter somewhere). I am not sure if I could figure out how implement the "automagical instant detection" approach.

@Jolanrensen
Copy link
Collaborator

Hi @martinitus, thanks for letting us know!

I see you're using the new experimental DataFrame.readCsv() function, so I appreciate the feedback :)

Normally, DataFrame uses the parse operation in the order described in the docs, so Instant is checked before LocalDateTime. You can test this by running your input for a single column, like:

val column = columnOf("2024-12-12T13:00:00+01:00").parse()
println(column.type()) // Instant

However, for the new readCsv() implementation, we rely on Deephaven CSV which has its own type inference and an extremely fast date-time parser, which is enabled by default (for now). Deephaven parses 2024-12-12T13:00:00+01:00, and 2024-12-12T13:00:00 as LocalDateTime("2024-12-12T12:00"), and LocalDateTime("2024-12-12T13:00") respectively.

This looks correct to me; it's just that, since Deephaven cannot create Instants, it corrects the offset to create a LocalDateTime instead. This deletes the timezone information. If you then convert this LocalDateTime column to Instant with toListOf<TimeZonesTest>() you will thus get Instant("2024-12-12T12:00:00Z").

You can skip Deephaven's type inference for any column by providing a ColType.String, which allows you to call the .parse() operation manually afterwards, or you can supply any other type like ColType.Instant, like you did.

I do agree that it's confusing to have two different "defaults" for inference of types. The user shouldn't have to worry about which implementation is used to infer types and the results should be predictable. We already disable the Deephaven LocalDateTime parser whenever a user provides a non-US Locale, a dateTime pattern, or -formatter, but I suppose, we can disable it altogether unless a user explicitly provides ColType.LocalDateTime.

@Jolanrensen Jolanrensen self-assigned this Feb 10, 2025
@Jolanrensen Jolanrensen mentioned this issue Feb 10, 2025
28 tasks
@Jolanrensen Jolanrensen added bug Something isn't working csv CSV / delim related issues labels Feb 13, 2025
@martinitus
Copy link
Author

Hi @Jolanrensen thanks for the elaborate feedback - sorry for the late answer (vacation ;-))!

I like the changes in f42bd58! Not to do type inference and have the user do that explicitly to avoid nasty bugs caused by insensitive defaults always sounds good to me!

That said, I already suspected that this has to do with Java not providing an Instant type and hence doing a roundtrip through some "datetime without timezone" type.
While the LocalDateTime from deephaven is correct, the thing is, that I did not get Instant("2024-12-12T12:00:00Z") but 2024-12-12T11:00:00Z (i.e. 11 instead of 12). Maybe it is because my timezone setting is off by one, or because of german winter time - but to me it looks like in the conversion from LocalDateTime to Insant the direction is wrong, or some delta of 1 hour is applied twice leading to an "off by one error".

That would mean, that there might be another issue hidden in the .toListOf<TimeZonesTest>() code path where an implicit conversion from LocalDateTime to Insant is done - independent of the chosen CSV parser implementation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working csv CSV / delim related issues
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants