Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add hint to support parquet with nanosecond timestamps #1782

Closed
wants to merge 3 commits into from

Conversation

hntd187
Copy link
Collaborator

@hntd187 hntd187 commented Oct 28, 2023

Description

This gives a hint on the schema convert telling convert to delta that the timestamps are nanosecond and need to be converted to the appropriate microseconds, at which point the schema written should use the standard SchemaDataType::primitive("timestamp") style.

Related Issue(s)

For #1721 @junjunjd

Documentation

@github-actions github-actions bot added binding/python Issues for the Python package binding/rust Issues for the Rust crate rust labels Oct 28, 2023
@rtyler
Copy link
Member

rtyler commented Oct 30, 2023

Thanks for the pull request, I'm still mulling this over and kind of hoping somebody else chimes in with an opinion on the approach 🤔

@rtyler rtyler added this to the Correct timestamp handling milestone Oct 30, 2023
@hntd187
Copy link
Collaborator Author

hntd187 commented Oct 30, 2023

My own two cents is this is a very meh solution. But it's the only obvious way I saw to carry some context to someone converting an arrow schema to do a conver_to_delta It may make more sense for the conversion to delta to not use the standard schema conversion facilities and have something more purpose fit that gives the context on what if anything has to be converted.

@mightyshazam
Copy link
Contributor

What about modifying the conversion function to something like

impl TryFrom<&ArrowField> for schema::SchemaField {
    type Error = ArrowError;
    fn try_from(arrow_field: &ArrowField) -> Result<Self, ArrowError> {
        let mut metadata: HashMap<String, serde_json::Value> = arrow_field
            .metadata()
            .iter()
            .map(|(k, v)| (k.clone(), serde_json::Value::String(v.clone())))
            .collect();
        match arrow_field.data_type() {
            ArrowDataType::Timestamp(TimeUnit::Nanosecond, _) => {
                metadata.insert("_delta-rs.timestamp.convert".into(), "nano".into());
            }
            _ => {}
        };

        Ok(schema::SchemaField::new(
            arrow_field.name().clone(),
            arrow_field.data_type().try_into()?,
            arrow_field.is_nullable(),
            arrow_field
                .metadata()
                .iter()
                .map(|(k, v)| (k.clone(), serde_json::Value::String(v.clone())))
                .collect(),
        ))
    }
}

The magic string _delta-rs.timestamp.convert could come from a well known set of metadata that we use internally. It can be used for writing and value conversion, or any other thing someone can think of, but it should be ignored when read back from the table since it will have lost its value.

@rtyler rtyler marked this pull request as draft December 11, 2023 23:02
@rtyler rtyler self-assigned this Feb 11, 2024
@hntd187 hntd187 closed this Mar 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
binding/python Issues for the Python package binding/rust Issues for the Rust crate rust
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants