
Support native uploading of parquet files with Tuple columns as json #10272

Open
haydenflinner opened this issue Feb 28, 2023 · 2 comments

@haydenflinner

Summary
My Databend workflow is very simple: I create a pandas DataFrame of the data, call df.to_parquet to produce an in-memory Parquet object, then upload it via the streaming_load HTTP endpoint. This mostly just works, and with better error messages than most databases give for native SQL inserts. A sketch of the workflow is shown below.
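For reference, the workflow looks roughly like this (a sketch only; the host, credentials, table name, and the exact `insert_sql` header syntax follow the Databend streaming-load docs of this era and may need adjusting for your deployment):

```python
import io

import pandas as pd
import requests

# Build the DataFrame and serialize it to an in-memory Parquet object.
df = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})
buf = io.BytesIO()
df.to_parquet(buf)
buf.seek(0)

# Upload via the streaming_load endpoint. The URL, auth, and the
# `insert_sql` header are assumptions based on the Databend docs;
# adjust host, credentials, and table name as needed.
resp = requests.put(
    "http://localhost:8000/v1/streaming_load",
    headers={"insert_sql": "insert into mydb.mytable file_format = (type = 'PARQUET')"},
    files={"upload": ("data.parquet", buf, "application/octet-stream")},
    auth=("root", ""),
)
print(resp.status_code, resp.text)
```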

However, with a "JSON" (Variant) column in my table, I can't figure out how to insert this way. If a column holds Python dictionaries, it ends up as a struct (Tuple) type in the Parquet file, and uploading that yields this error:

{"error":{"code":"400","message":"execute fail: parquet schema mismatch for field 6(start from 0),
expect: TableField { name: \\"extra_json\\", default_expr: None, data_type: Nullable(Variant), column_id: 6 },
got TableField { name: \\"extra_json\\", default_expr: None, data_type: Nullable(Tuple { fields_name: [\\"field1\\", \\"myfield2\\"], fields_type: [Nullable(String), Nullable(String)] }), column_id: 0 }"}}
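For illustration, here is a minimal sketch (with made-up field names matching the error above) showing how a dict-valued column is written as a Parquet struct, which Databend then reads as a Tuple:

```python
import io

import pandas as pd
import pyarrow.parquet as pq

df = pd.DataFrame({"extra_json": [{"field1": "a", "myfield2": "b"}]})
buf = io.BytesIO()
df.to_parquet(buf)
buf.seek(0)

# The Parquet schema records the column as a struct, not as JSON/Variant:
# extra_json: struct<field1: string, myfield2: string>
print(pq.read_schema(buf))
```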

If I instead write the column as a JSON-encoded string, I get this error:

"error":{"code":"400","message":"execute fail: parquet schema mismatch for field 6(start from 0), expect: TableField { name: \\"extra_json\\", default_expr: None, data_type: Nullable(Variant), column_id: 6 }, got TableField { name: \\"extra_json\\", default_expr: None, data_type: Nullable(String), column_id: 0

It would be nice if the Tuple type were automatically walked and turned into a dictionary. Alternatively, if the column is a string containing valid JSON, Databend could just accept it; that is a slight departure from Databend's usual type safety, but it would be very convenient.

It seems Snowflake has a similar problem, which one user solved by modifying their INSERT statement (https://stackoverflow.com/q/70984773). I am not sure how to apply such a cast within the INSERT statement sent to streaming_load, though I tried.


sundy-li (Member) commented Mar 1, 2023

Parquet does not have a Variant data type; we should allow casting in streaming load.

Related #10173


sundy-li (Member) commented Mar 17, 2023

@haydenflinner Can you verify #10621? It adds an auto-cast transform when loading Parquet files.
