Add function to infer schema from files(CSV/Parquet) #6345

Closed
BohuTANG opened this issue Jun 30, 2022 · 3 comments
Labels
C-feature Category: feature good first issue Category: good first issue

Comments

Member

BohuTANG commented Jun 30, 2022

Summary

arrow2 already provides schema inference (infer_schema) for both formats:
Parquet:
https://github.com/jorgecarleitao/arrow2/blob/5725fa3c18830f88ed0ed67067dda5c1d2080dca/src/io/parquet/read/schema/mod.rs#L23-L31

CSV:
https://github.com/jorgecarleitao/arrow2/blob/79b87b629d8fdc3fa40b5ad7abc3889c5102a52b/src/io/csv/read/infer_schema.rs#L15

We can add an INFER_SCHEMA function that infers the schema of staged files, with syntax like:

INFER_SCHEMA(
  LOCATION => '{ internalStage | externalStage }'
  , FILE_FORMAT => '<format_name>'
)

Example (from Snowflake):

select * from infer_schema(location => '@mystage/geography/cities.parquet', file_format => 'parquet')


+-------------+---------+----------+---------------------+--------------------------+
| COLUMN_NAME | TYPE    | NULLABLE | EXPRESSION          | FILENAMES                |
|-------------+---------+----------+---------------------+--------------------------|
| continent   | TEXT    | True     | $1:continent::TEXT  | geography/cities.parquet |
| country     | VARIANT | True     | $1:country::VARIANT | geography/cities.parquet |
+-------------+---------+----------+---------------------+--------------------------+
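
For the CSV case, the inference step behind such a result could look roughly like the sketch below. It is an illustration in Python using only the standard library, not the arrow2 or Databend implementation: the function names (`infer_value_type`, `merge_types`, `infer_csv_schema`), the type names, and the rule "mixed INT and FLOAT widens to FLOAT, anything else falls back to TEXT" are all assumptions made for the example.

```python
import csv
import io


def infer_value_type(value: str) -> str:
    """Guess a type name for one non-empty CSV cell (type names are hypothetical)."""
    if value.lower() in ("true", "false"):
        return "BOOLEAN"
    try:
        int(value)
        return "INT"
    except ValueError:
        pass
    try:
        float(value)
        return "FLOAT"
    except ValueError:
        return "TEXT"


def merge_types(seen: set) -> str:
    """Collapse the set of types observed in one column into a single type."""
    if not seen:
        return "TEXT"  # column contained only empty cells
    if len(seen) == 1:
        return next(iter(seen))
    if seen == {"INT", "FLOAT"}:
        return "FLOAT"  # widen integers to floats
    return "TEXT"  # any other mix falls back to text


def infer_csv_schema(text: str, max_rows: int = 100):
    """Return (column_name, type, nullable) tuples from a CSV string with a header row."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    seen = [set() for _ in header]
    nullable = [False] * len(header)
    for i, row in enumerate(reader):
        if i >= max_rows:  # only sample the first max_rows data rows
            break
        for col, cell in enumerate(row[: len(header)]):
            if cell == "":
                nullable[col] = True  # an empty cell marks the column nullable
            else:
                seen[col].add(infer_value_type(cell))
    return [
        (name, merge_types(types), null)
        for name, types, null in zip(header, seen, nullable)
    ]
```

Calling `infer_csv_schema(csv_text)` on the contents of a staged file would yield one tuple per column, which maps onto the COLUMN_NAME, TYPE, and NULLABLE columns shown in the example output above. Sampling only the first `max_rows` rows keeps inference cheap on large files, at the cost of possibly missing a type that first appears later in the file.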

Reference:
https://docs.snowflake.com/en/sql-reference/functions/infer_schema.html

@BohuTANG BohuTANG added the C-feature Category: feature label Jun 30, 2022
@BohuTANG
Member Author

cc @Xuanwo

@Xuanwo Xuanwo self-assigned this Jun 30, 2022
@Xuanwo Xuanwo moved this to 📋 Backlog in Xuanwo's Work Jul 2, 2022
@BohuTANG BohuTANG added the good first issue Category: good first issue label Jul 21, 2022
@Xuanwo Xuanwo self-assigned this Sep 16, 2022
Member

Xuanwo commented Sep 16, 2022

I'm taking this issue now.

Member

Xuanwo commented Nov 3, 2023

Already implemented by @youngsofun

@Xuanwo Xuanwo closed this as completed Nov 3, 2023