Reading Parquet file with timestamp column with 9999 year results in overflow panic #982
fwiw, both spark and pyarrow give the wrong result in different ways.

pyarrow:

```python
import pyarrow.parquet

path = "data-dimension-vehicle-20210609T222533Z-4cols-14rows.parquet"
table = pyarrow.parquet.read_table(path)
print(table["dimension_load_date"])
```

spark: while it provides the correct result in your case, it only reads up to microseconds (i.e. it truncates nanoseconds). See the source code.

I do not think there is a correct answer here: "9999-12-31" is not representable by an i64 in nanoseconds. Given that int96's original scope was to support nanoseconds, pyarrow seems to preserve that behavior. OTOH, to avoid crashing, it spits out something, even if that something is meaningless in this context. Panicking is a bit too harsh, but at least it does not allow you to go back to the 19th century xD

Note that int96 has been deprecated.
Here are multiple things to discuss. Regarding the Spark implementation, here is a function that returns the nanos: https://github.com/apache/spark/blob/HEAD/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L192; the nanos precision is not discarded in Spark. Apache Flink maps the ...

Can you provide the part of the code where the overflow issue happens? I would like to understand more.
Here are the values that are returned in either case:

Pretty consistent.
Point taken wrt the int96 deprecation. The datetime "9999-12-31" is

```
$ python -c "import datetime; print(datetime.datetime(year=9999,month=12,day=31).timestamp())"
253402214400.0
```

seconds since epoch. In nanoseconds, this corresponds to 253402214400 * 1_000_000_000, roughly 2.5 * 10^20, which exceeds i64::MAX (9223372036854775807).

This was the rationale I used to conclude that we can't fit "9999-12-31" in an i64 of nanoseconds since epoch. Since Java's Long is also an i64 with the same maximum as Rust's, I concluded that Spark must be discarding something to fit such a date in a Long, since there is simply not enough precision to represent that date in i64 ns. So, I looked for what they did. I may be wrong.

The parquet code in Rust is here. Note that it only goes to millis. The conversion to ns is done here.
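As a quick sanity check of that bound (a minimal sketch, not code from the thread), here is the same arithmetic in Rust, done in i128 so nothing overflows while comparing against the i64 limit:

```rust
fn main() {
    // 9999-12-31 00:00:00 UTC in seconds since the Unix epoch
    // (the value printed by the Python one-liner above)
    let seconds_since_epoch: i128 = 253_402_214_400;

    let nanos = seconds_since_epoch * 1_000_000_000; // 253_402_214_400_000_000_000
    let micros = seconds_since_epoch * 1_000_000; //        253_402_214_400_000_000

    println!("i64::MAX           = {}", i64::MAX);
    println!("nanos fit in i64?    {}", nanos <= i64::MAX as i128); // false -> overflow
    println!("micros fit in i64?   {}", micros <= i64::MAX as i128); // true
}
```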
I boiled down the conversion in a minimal example (playground):

```rust
// to nanos
pub fn int96_to_i64_ns(value: [u32; 3]) -> i64 {
    let nanoseconds = value[0] as i64 + ((value[1] as i64) << 32);
    let days = value[2] as i64;

    const JULIAN_DAY_OF_EPOCH: i64 = 2_440_588;
    const SECONDS_PER_DAY: i64 = 86_400;
    const NANOS_PER_SECOND: i64 = 1_000_000_000;

    let seconds = (days - JULIAN_DAY_OF_EPOCH) * SECONDS_PER_DAY;
    // overflows i64 for dates far enough from the epoch (e.g. year 9999)
    seconds * NANOS_PER_SECOND + nanoseconds
}

// to micros
pub fn int96_to_i64_us(value: [u32; 3]) -> i64 {
    let nanoseconds = value[0] as i64 + ((value[1] as i64) << 32);
    let days = value[2] as i64;

    const JULIAN_DAY_OF_EPOCH: i64 = 2_440_588;
    const SECONDS_PER_DAY: i64 = 86_400;
    const MICROS_PER_SECOND: i64 = 1_000_000;

    let seconds = (days - JULIAN_DAY_OF_EPOCH) * SECONDS_PER_DAY;
    seconds * MICROS_PER_SECOND + nanoseconds / 1000
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_nanos() {
        // panics when overflow checks are enabled (e.g. debug builds)
        let value = [1, 0, 2_440_588 + 2_932_896];
        let _a = int96_to_i64_ns(value);
    }

    #[test]
    fn test_micros() {
        // the same date fits in an i64 of microseconds
        let value = [1, 0, 2_440_588 + 2_932_896];
        let _a = int96_to_i64_us(value);
    }
}
```

where `2_440_588 + 2_932_896` is the Julian day for 9999-12-31.

EDIT: changed ...
Thank you @jorgecarleitao! I do understand that the nanos value is expressed as an i64. From my calculations, the current implementation overflows for such dates. Now, what can we do?

I hope I ain't a nuisance for you. 😀
(Transferred to arrow-rs, since this is not related to datafusion per se but to the parquet reading.)
I was trying to communicate that what Spark does does not solve the problem in general, just the particular date 9999; it essentially shifts the problem by a factor of 1000x (a back-of-the-envelope check is sketched after the list below). I would argue that 200 years for the world to migrate away from int96 should be enough, but probably someone said the same about the imperial system 200 years ago and here we are. ^_^

IMO panicking is not a valid behavior because it is susceptible to DoS (e.g. an application accepting parquet files from the internet will now panic and unwind on every request).

I think that there are 3 options within the current arrow specification:
1.
2.
3.
I am tempted to argue for 3 because it preserves the two important semantic properties (order and nano precision), but I would be fine with any of them, tbh.
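As a rough illustration of the 1000x shift mentioned above (my own back-of-the-envelope check, not from the thread): an i64 of nanoseconds reaches roughly 292 years from 1970, while an i64 of microseconds reaches roughly 292,000 years.

```rust
fn main() {
    const SECONDS_PER_YEAR: f64 = 365.25 * 86_400.0;

    // Farthest representable offset from the epoch at each resolution.
    let years_ns = i64::MAX as f64 / 1_000_000_000.0 / SECONDS_PER_YEAR;
    let years_us = i64::MAX as f64 / 1_000_000.0 / SECONDS_PER_YEAR;

    println!("i64 nanoseconds reach  ~{:.0} years from 1970", years_ns); // ~292
    println!("i64 microseconds reach ~{:.0} years from 1970", years_us); // ~292271
}
```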
Not a nuisance at all; quite a fun problem!
I'm also leaning toward 3. It covers multiple aspects, and one of the most important is backward compatibility.
#2481 adopts option 2, as it is consistent with how we handle overflow elsewhere, and more importantly doesn't have a runtime cost. Signed overflow in Rust is defined as two's complement; it isn't UB like it technically is in C++, and arrow-rs follows this example. I hope that is ok, I wasn't actually aware of this ticket before I implemented #2481 😅

FWIW we do now have the schema inference plumbing where we could opt to truncate to milliseconds or similar... 🤔
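For illustration only, here is a minimal sketch of what option 2 could look like, assuming it means letting the i64 arithmetic wrap in two's complement instead of panicking; the function name is hypothetical and this is not the code from #2481:

```rust
/// Hypothetical sketch: INT96 -> i64 nanoseconds, wrapping on overflow
/// (two's complement) instead of panicking under debug overflow checks.
pub fn int96_to_i64_ns_wrapping(value: [u32; 3]) -> i64 {
    const JULIAN_DAY_OF_EPOCH: i64 = 2_440_588;
    const SECONDS_PER_DAY: i64 = 86_400;
    const NANOS_PER_SECOND: i64 = 1_000_000_000;

    let nanoseconds = value[0] as i64 + ((value[1] as i64) << 32);
    let days = value[2] as i64;

    let seconds = (days - JULIAN_DAY_OF_EPOCH).wrapping_mul(SECONDS_PER_DAY);
    seconds
        .wrapping_mul(NANOS_PER_SECOND)
        .wrapping_add(nanoseconds)
}
```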
Describe the bug
Reading a Parquet file with a timestamp column containing a future date like `9999-12-31 02:00:00` results in an overflow panic with the following output:

To Reproduce
Steps to reproduce the behavior:

1. Download the `data-dimension-vehicle-20210609T222533Z-4cols-14rows.parquet` file.
2. Run `cargo new read-parquet`, create a `data` folder in your project and put the parquet file in the `data` folder inside your project.
3. Update the `Cargo.toml` file to contain the following:
4. Update `main.rs` to read the given parquet file (see the sketch after this list):
5. Run `cargo run`.
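Since the `Cargo.toml` and `main.rs` snippets are not reproduced above, here is a minimal sketch of such a reader, assuming a recent version of the `parquet` crate (with its default `arrow` feature); the file path and API shown are my assumptions, not the code from the original report:

```rust
use std::fs::File;

use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Assumed location from step 2; adjust to wherever the file was placed.
    let file = File::open("data/data-dimension-vehicle-20210609T222533Z-4cols-14rows.parquet")?;

    let reader = ParquetRecordBatchReaderBuilder::try_new(file)?.build()?;
    for batch in reader {
        // On versions affected by this issue, decoding the batches panics with
        // an overflow once the 9999-12-31 timestamps are reached.
        println!("read a batch with {} rows", batch?.num_rows());
    }
    Ok(())
}
```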
Expected behavior

To be able to read that parquet file. The parquet file can be read with the `parquet-tools` CLI and Apache Spark.

Additional context

The root cause is the fact that the parquet file contains some rows with `9999-12-31 02:00:00` in the `dimension_load_date` column. This future date is supported by Parquet and Spark.

The content of the parquet file is:
To find out more about how the root cause was detected, you can follow apache/datafusion#1359.