
Support native uploading of parquet files with Tuple columns as json #10272

Open
haydenflinner opened this issue Feb 28, 2023 · 2 comments

@haydenflinner

Summary
My Databend workflow is very simple: I create a pandas DataFrame of the data, call df.to_parquet to produce an in-memory Parquet object, then upload it via the streaming_load HTTP endpoint. This mostly just works, and with better error messages than most databases give for native SQL inserts. A sketch of the workflow is shown below.
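For reference, the workflow looks roughly like this (a sketch only; the host, credentials, table name, and the exact `insert_sql` header syntax follow the Databend streaming-load docs of this era and may need adjusting for your deployment):

```python
import io

import pandas as pd
import requests

# Build the DataFrame and serialize it to an in-memory Parquet object.
df = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})
buf = io.BytesIO()
df.to_parquet(buf)
buf.seek(0)

# Upload via the streaming_load endpoint. The URL, auth, and the
# `insert_sql` header are assumptions based on the Databend docs;
# adjust host, credentials, and table name as needed.
resp = requests.put(
    "http://localhost:8000/v1/streaming_load",
    headers={"insert_sql": "insert into mydb.mytable file_format = (type = 'PARQUET')"},
    files={"upload": ("data.parquet", buf, "application/octet-stream")},
    auth=("root", ""),
)
print(resp.status_code, resp.text)
```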

However, with a "JSON" (Variant) column in my table, I can't figure out how to insert this way. If a column holds Python dictionaries, it ends up as a struct (Tuple) type in the Parquet file, and uploading that yields this error:

{"error":{"code":"400","message":"execute fail: parquet schema mismatch for field 6(start from 0),
expect: TableField { name: \\"extra_json\\", default_expr: None, data_type: Nullable(Variant), column_id: 6 },
got TableField { name: \\"extra_json\\", default_expr: None, data_type: Nullable(Tuple { fields_name: [\\"field1\\", \\"myfield2\\"], fields_type: [Nullable(String), Nullable(String)] }), column_id: 0 }"}}
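For illustration, here is a minimal sketch (with made-up field names matching the error above) showing how a dict-valued column is written as a Parquet struct, which Databend then reads as a Tuple:

```python
import io

import pandas as pd
import pyarrow.parquet as pq

df = pd.DataFrame({"extra_json": [{"field1": "a", "myfield2": "b"}]})
buf = io.BytesIO()
df.to_parquet(buf)
buf.seek(0)

# The Parquet schema records the column as a struct, not as JSON/Variant:
# extra_json: struct<field1: string, myfield2: string>
print(pq.read_schema(buf))
```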

If I instead write the column as a JSON-encoded string, I get this error:

"error":{"code":"400","message":"execute fail: parquet schema mismatch for field 6(start from 0), expect: TableField { name: \\"extra_json\\", default_expr: None, data_type: Nullable(Variant), column_id: 6 }, got TableField { name: \\"extra_json\\", default_expr: None, data_type: Nullable(String), column_id: 0

It would be nice if the Tuple type were automatically walked and turned into a dictionary. Alternatively, if the column is a string containing valid JSON, Databend could just accept it; that is a slight departure from Databend's usual type safety, but it would be very convenient.

It seems Snowflake has a similar problem, which one user solved by modifying their INSERT statement (https://stackoverflow.com/q/70984773). I am not sure how to apply such a cast within the INSERT statement sent to streaming_load, though I tried.


sundy-li (Member) commented Mar 1, 2023

Parquet does not have a Variant data type; we should allow casting in streaming load.

Related #10173


sundy-li (Member) commented Mar 17, 2023

@haydenflinner Can you verify #10621? It adds an auto-cast transform when loading Parquet files.
