Spark 1.6.0, spark-redshift 0.6.0
A few different issues here, seemingly all in the same place in the code, so I'm lumping them together.
https://github.com/databricks/spark-redshift/blob/master/src/main/scala/com/databricks/spark/redshift/FilterPushdown.scala#L44
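For context, the value-conversion step around that line appears to quote only plain strings, letting Date/Timestamp values fall through to their toString rendering. A rough sketch of the pattern (hypothetical, not the actual source):

```scala
object FilterPushdownSketch {
  // Hypothetical sketch of the buggy pattern (not the actual spark-redshift
  // code): only String values get quoted; java.sql.Date and
  // java.sql.Timestamp fall through and are rendered via toString,
  // which yields bare tokens like 2016-01-01 in the WHERE clause.
  def compileValue(value: Any): Any = value match {
    case s: String => s"'$s'"
    case other     => other
  }
}
```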
DATEs not quoted
For columns of (Redshift) DATE type, when using the following syntax
t.filter("c_date = cast('2016-01-01' as DATE)")
or
t.filter("c_date = to_date('2016-01-01')")
or
t.filter("c_date < current_date()")
the filter is pushed down as part of the UNLOAD command, but the date value ends up unquoted on the SQL side:
WHERE "c_date" = 2016-01-01
which effectively evaluates to
WHERE "c_date" = 2014
because 2016-01-01 is interpreted as a numeric expression (2016 minus 1 minus 1), not a date string.
redshift=# select 2016-01-01;
?column?
----------
2014
(1 row)
Thus, the wrong results are returned to the user (bad).
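For reference, a minimal end-to-end reproduction against a table with a DATE column (the connection options and table name are placeholders; sqlContext is the Spark 1.6 shell context):

```scala
// Placeholders: url, dbtable, tempdir, and the c_date column are examples.
val t = sqlContext.read
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://host:5439/db?user=user&password=pass")
  .option("dbtable", "my_table")
  .option("tempdir", "s3n://bucket/tmp/")
  .load()

// The UNLOAD query contains: WHERE "c_date" = 2016-01-01  (i.e. = 2014)
t.filter("c_date = cast('2016-01-01' as DATE)").count()
```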
DATE strings not pushed
Interestingly enough using
t.filter($"c_date" >= "2016-01-01")
does not push any SQL filter into the UNLOAD command. While this returns the correct result, it scans much more data on the Redshift side than necessary and should also be fixed.
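Presumably the string comparison makes Spark wrap the column in a cast, and predicates over cast columns are not handed to the data source as pushable filters. Casting the literal instead does get the filter pushed down, though it then hits the unquoted-DATE bug above; a sketch, assuming the same table as before:

```scala
import sqlContext.implicits._
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.DateType

// Not pushed: the comparison involves a cast around the column.
t.filter($"c_date" >= "2016-01-01")

// Pushed (the literal folds to a date constant), but then produces the
// unquoted WHERE "c_date" >= 2016-01-01 from the first section.
t.filter($"c_date" >= lit("2016-01-01").cast(DateType))
```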
TIMESTAMPs not quoted
If the column type is TIMESTAMP, we have the same problem as with DATE when a string is passed in:
t.filter($"c_timestamp" === "2016-01-01 00:00:00")
results in the following SQL
WHERE "c_timestamp" = 2016-01-01 00:00:00.0
which simply throws a Redshift SQL parse error.
The same happens when a Spark TimestampType value is passed:
t.filter($"c_timestamp" < current_timestamp())
which converts to
WHERE "c_timestamp" < 2016-01-14 15:03:43.928
It seems the right thing to do here for DATE/TIMESTAMP is to check for both types and use the correct casting functions on the database side, with the corresponding string format masks.
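A minimal sketch of that idea (hypothetical, not a patch against the actual FilterPushdown code): quote temporal values and cast them explicitly on the Redshift side, relying on the default yyyy-MM-dd and yyyy-MM-dd HH:mm:ss.S renderings of java.sql.Date and java.sql.Timestamp:

```scala
import java.sql.{Date, Timestamp}

object FilterPushdownFixSketch {
  // Hypothetical fix: render Date/Timestamp as quoted literals with an
  // explicit cast. Date.toString gives yyyy-MM-dd and Timestamp.toString
  // gives yyyy-MM-dd HH:mm:ss.fff..., both of which Redshift parses.
  def compileValue(value: Any): String = value match {
    case s: String    => s"'${s.replace("'", "''")}'"
    case d: Date      => s"'${d.toString}'::date"
    case t: Timestamp => s"'${t.toString}'::timestamp"
    case other        => other.toString
  }
}
```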