Timestamp filtering returns empty DataFrame #202

Closed
osopardo1 opened this issue Jul 17, 2023 · 2 comments · Fixed by #203
Labels
type: bug Something isn't working

Comments

@osopardo1
Member

osopardo1 commented Jul 17, 2023

What went wrong?

Since version 0.3.x we are able to index Timestamp and Date columns. But when we try to filter by those Timestamps, the query is not parsed correctly and returns an empty DataFrame.

After a few prints and tests, I discovered that the d.getTime we use in LinearTransformer to initialize the min and max of LinearTransformation returns a Long X, while the filter value coming from the user is parsed into a different Long Y.

This difference prevents the QuerySpecBuilder from building the correct space: QuerySpaceFromTo returns an EmptySpace, which matches 0 files to read.
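
For reference, a minimal sketch (assuming the filter side ends up using Spark's internal microsecond representation of TimestampType, as confirmed in the follow-up comment below) of how the same instant can yield two different Longs:

    import java.sql.Timestamp
    import org.apache.spark.sql.catalyst.util.DateTimeUtils

    val ts = Timestamp.valueOf("2017-01-02 12:02:00")

    // Long X: what LinearTransformer stores via d.getTime (milliseconds since epoch)
    val indexedLong = ts.getTime

    // Long Y: Spark's internal TimestampType value (microseconds since epoch)
    val filterLong = DateTimeUtils.fromJavaTimestamp(ts)

    println(s"indexed = $indexedLong, filter = $filterLong") // they differ by a factor of 1000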

How to reproduce?

1. Code that triggered the bug, or steps to reproduce:

Using some code in the TransformerIndexingTest:

    import spark.implicits._
    import org.apache.spark.sql.functions.to_timestamp

    // Six rows, three of which share the timestamp '2017-01-02 12:02:00'
    val df =
      Seq(
        "2017-01-03 12:02:00",
        "2017-01-02 12:02:00",
        "2017-01-02 12:02:00",
        "2017-01-02 12:02:00",
        "2017-01-01 12:02:00",
        "2017-01-01 12:02:00")
        .toDF("date")
        .withColumn("my_date", to_timestamp($"date"))

    // Index the timestamp column on write, then read back and filter on it
    df.write.format("qbeast").option("columnsToIndex", "my_date").save(tmpDir)

    val indexed = spark.read.format("qbeast").load(tmpDir)

    indexed.filter("my_date == '2017-01-02 12:02:00'").count() shouldBe 3

But the returned DataFrame contains 0 rows.

2. Branch and commit id:

I'm using the photon branch: photon-datasource-standalone

3. Spark version:

On the Spark shell, run spark.version.

Spark version 3.3.0 and Delta version 2.1.0

4. Hadoop version:

On the Spark shell, run org.apache.hadoop.util.VersionInfo.getVersion().

3.3.4

5. How are you running Spark?

Are you running Spark inside a container? Are you launching the app on a remote K8s cluster? Or are you just running the tests on a local computer?

Local

osopardo1 added the type: bug label on Jul 17, 2023
@osopardo1
Member Author

osopardo1 commented Jul 25, 2023

After more research, we found out that:

  • We are indexing in milliseconds, while Spark is filtering in microseconds (a 1000x difference).
  • Spark has two functions that convert a date string to a timestamp: unix_timestamp and to_timestamp. The first one produces the value in seconds, while the second one converts it to milliseconds (see the sketch after this list). This happens at write time.
  • While reading, we have no special treatment for the Timestamp type, so the filter gets parsed into microseconds and QuerySpecBuilder is unable to find any areas of the index that match the space.
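
As a quick illustration, a sketch (assuming the same spark session and spark.implicits._ import as in the snippet above) of the units each function produces:

    import org.apache.spark.sql.functions.{to_timestamp, unix_timestamp}

    val dates = Seq("2017-01-02 12:02:00").toDF("date")

    // unix_timestamp returns Unix time in seconds
    val seconds = dates.select(unix_timestamp($"date")).first.getLong(0)

    // to_timestamp returns a TimestampType; its java.sql.Timestamp value is in milliseconds
    val millis = dates.select(to_timestamp($"date")).first.getTimestamp(0).getTime

    println(s"unix_timestamp = $seconds, to_timestamp.getTime = $millis") // millis == seconds * 1000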

Possible solutions:

  1. Ensure that all writes are done in microseconds, or explicitly add an option for the preferred unit.
  2. Parse the Timestamp correctly when reading: detect the Timestamp data type in the filters coming from Spark and apply whatever transformation is needed to match the indexed space (rough sketch below).
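
For option 2, a rough sketch of a hypothetical helper (not the actual QuerySpecBuilder code) that normalizes a filter value on a Timestamp column to the milliseconds the index stores:

    import java.sql.Timestamp
    import java.time.Instant

    // Hypothetical helper: bring whatever value Spark hands us in a filter on a
    // Timestamp column down to milliseconds, the unit used by LinearTransformation.
    def toIndexedMillis(value: Any): Long = value match {
      case ts: Timestamp => ts.getTime      // java.sql.Timestamp is already in milliseconds
      case i: Instant    => i.toEpochMilli  // java.time.Instant, when spark.sql.datetime.java8API.enabled is set
      case micros: Long  => micros / 1000L  // assumed to be Catalyst's internal microsecond value
    }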

I am working on a document with a proper review of how to store time series. Thanks for your patience.

@osopardo1
Member Author

Since the JSON format only stores dates in milliseconds (https://blog.devgenius.io/the-optimal-way-to-include-dates-in-json-2220c2a355a8), I think the second one is the best approach.
