Timestamp filtering returns empty DataFrame #202

Closed
osopardo1 opened this issue Jul 17, 2023 · 2 comments · Fixed by #203
Labels
type: bug Something isn't working

Comments

@osopardo1
Member

osopardo1 commented Jul 17, 2023

What went wrong?

Since version 0.3.x we are able to index Timestamp and Date columns. But when we try to filter by those Timestamps, the query is not parsed correctly and returns an empty DataFrame.

After a few prints and tests, I discovered that the d.getTime we use in LinearTransformer to initialize the min and max of LinearTransformation returns a Long X, while the filter value coming from the user is parsed into a different Long Y.

This difference prevents the QuerySpecBuilder from building the correct space: QuerySpaceFromTo returns an EmptySpace, which matches 0 files to read.
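
For reference, a minimal sketch (assuming the filter side ends up using Spark's internal microsecond representation of TimestampType, as confirmed in the follow-up comment below) of how the same instant can yield two different Longs:

    import java.sql.Timestamp
    import org.apache.spark.sql.catalyst.util.DateTimeUtils

    val ts = Timestamp.valueOf("2017-01-02 12:02:00")

    // Long X: what LinearTransformer stores via d.getTime (milliseconds since epoch)
    val indexedLong = ts.getTime

    // Long Y: Spark's internal TimestampType value (microseconds since epoch)
    val filterLong = DateTimeUtils.fromJavaTimestamp(ts)

    println(s"indexed = $indexedLong, filter = $filterLong") // they differ by a factor of 1000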

How to reproduce?

1. Code that triggered the bug, or steps to reproduce:

Using some code in the TransformerIndexingTest:

    import spark.implicits._
    import org.apache.spark.sql.functions.to_timestamp

    // Six rows, three of which share the timestamp '2017-01-02 12:02:00'
    val df =
      Seq(
        "2017-01-03 12:02:00",
        "2017-01-02 12:02:00",
        "2017-01-02 12:02:00",
        "2017-01-02 12:02:00",
        "2017-01-01 12:02:00",
        "2017-01-01 12:02:00")
        .toDF("date")
        .withColumn("my_date", to_timestamp($"date"))

    // Index the timestamp column on write, then read back and filter on it
    df.write.format("qbeast").option("columnsToIndex", "my_date").save(tmpDir)

    val indexed = spark.read.format("qbeast").load(tmpDir)

    indexed.filter("my_date == '2017-01-02 12:02:00'").count() shouldBe 3

But the returned DataFrame contains 0 rows.

2. Branch and commit id:

I'm using the photon branch: photon-datasource-standalone

3. Spark version:

On the Spark shell, run spark.version.

Spark version 3.3.0 and Delta version 2.1.0

4. Hadoop version:

On the Spark shell, run org.apache.hadoop.util.VersionInfo.getVersion().

3.3.4

5. How are you running Spark?

Are you running Spark inside a container? Are you launching the app on a remote K8s cluster? Or are you just running the tests on a local computer?

Local

osopardo1 added the type: bug label on Jul 17, 2023
@osopardo1
Member Author

osopardo1 commented Jul 25, 2023

After more research, we found out that:

  • We are indexing in milliseconds, while Spark is filtering in microseconds (a 1000x difference).
  • Spark has two functions that convert a date string to a timestamp: unix_timestamp and to_timestamp. The first one produces the value in seconds, while the second one converts it to milliseconds (see the sketch after this list). This happens at write time.
  • While reading, we have no special treatment for the Timestamp type, so the filter gets parsed into microseconds and QuerySpecBuilder is unable to find any areas of the index that match the space.
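
As a quick illustration, a sketch (assuming the same spark session and spark.implicits._ import as in the snippet above) of the units each function produces:

    import org.apache.spark.sql.functions.{to_timestamp, unix_timestamp}

    val dates = Seq("2017-01-02 12:02:00").toDF("date")

    // unix_timestamp returns Unix time in seconds
    val seconds = dates.select(unix_timestamp($"date")).first.getLong(0)

    // to_timestamp returns a TimestampType; its java.sql.Timestamp value is in milliseconds
    val millis = dates.select(to_timestamp($"date")).first.getTimestamp(0).getTime

    println(s"unix_timestamp = $seconds, to_timestamp.getTime = $millis") // millis == seconds * 1000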

Possible solutions:

  1. Ensure that all writes are done in microseconds, or explicitly add an option for the preferred unit.
  2. Parse the Timestamp correctly when reading: detect the Timestamp data type in the filters coming from Spark and apply whatever transformation is needed to match the indexed space (rough sketch below).
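
For option 2, a rough sketch of a hypothetical helper (not the actual QuerySpecBuilder code) that normalizes a filter value on a Timestamp column to the milliseconds the index stores:

    import java.sql.Timestamp
    import java.time.Instant

    // Hypothetical helper: bring whatever value Spark hands us in a filter on a
    // Timestamp column down to milliseconds, the unit used by LinearTransformation.
    def toIndexedMillis(value: Any): Long = value match {
      case ts: Timestamp => ts.getTime      // java.sql.Timestamp is already in milliseconds
      case i: Instant    => i.toEpochMilli  // java.time.Instant, when spark.sql.datetime.java8API.enabled is set
      case micros: Long  => micros / 1000L  // assumed to be Catalyst's internal microsecond value
    }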

I am working on a document with a proper review of how to store time series. Thanks for your patience.

@osopardo1
Member Author

Since the JSON format only stores dates in milliseconds (https://blog.devgenius.io/the-optimal-way-to-include-dates-in-json-2220c2a355a8), I think the second one is the best approach.
