Currently in AWS Glue, while running the crawler for Parquet files, Glue is only checking the metadata but not checking if those columns actually exist or not #2317
Sach1nAgarwal started this conversation in General
-
Thanks for asking this interesting question. What exactly is your use case for wanting the crawler to validate that all of the columns exist? I looked over the Glue docs and only see lists of which input types are valid. I think it assumes the input data is correct, and that to classify a file it only needs to check the first 4 and last 8 bytes. This makes it much faster over potentially very large data sets.
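For context, here is a minimal sketch of what such a header/footer-only classification could look like. The `PAR1` magic and the 8-byte trailer (little-endian footer length followed by the magic) come from the Parquet format specification; this is only an illustration of the general technique, not Glue's actual implementation, and `looks_like_parquet` is a hypothetical helper name:

```python
import struct

PARQUET_MAGIC = b"PAR1"

def looks_like_parquet(path: str) -> bool:
    """Cheap classification: inspect only the 4-byte header magic and the
    last 8 bytes (footer length + trailing magic), never the column data."""
    with open(path, "rb") as f:
        if f.read(4) != PARQUET_MAGIC:
            return False
        f.seek(0, 2)                      # jump to end of file
        file_size = f.tell()
        if file_size < 12:                # header magic + length + trailing magic
            return False
        f.seek(-8, 2)                     # read the 8-byte trailer
        (footer_len,) = struct.unpack("<I", f.read(4))
        if f.read(4) != PARQUET_MAGIC:
            return False
        # The Thrift-encoded FileMetaData must fit between header and trailer.
        return footer_len + 12 <= file_size
```

Note that this reads a constant number of bytes regardless of file size, which would explain why it scales well over very large data sets.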
-
Currently in AWS Glue, while running the crawler for Parquet files stored in S3, Glue only checks the first 4 bytes, the metadata, and the last 8 bytes to create the schema, partitions, and index; it does not check whether the columns present in the metadata actually exist in the Parquet file.
So should Glue also check whether the metadata columns actually exist?
That is, if I create a file (let's call it dummy.parquet) from a correct Parquet file by copying only the 4-byte header, the metadata, and the last 8 bytes, Glue will consider the new file (dummy.parquet) a correct Parquet file and produce the schema, partitions, and index.
So Glue will not check whether those columns actually exist.
Consider the attached file as dummy.parquet: it contains no column data, but it has the first 4 bytes, the metadata, and the last 8 bytes of a correct Parquet file that does contain columns. A sketch of how such a file can be constructed follows the attachment below.
c_dummy_parquet_first_4_bytes_and_last_full_metadata_and_last_8bytes.txt
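For illustration, a minimal Python sketch of how such a hollow file can be built, assuming a source file in standard Parquet layout (header magic, column data, Thrift-encoded FileMetaData footer, 8-byte trailer). `make_hollow_parquet` is a hypothetical helper name, not part of Glue or any Parquet library:

```python
import struct

PARQUET_MAGIC = b"PAR1"

def make_hollow_parquet(src_path: str, dst_path: str) -> None:
    # Keep only the pieces a metadata-level check inspects: the 4-byte
    # header magic, the Thrift-encoded FileMetaData footer, and the last
    # 8 bytes (footer length + trailing magic). All column data is dropped.
    with open(src_path, "rb") as f:
        data = f.read()
    assert data[:4] == PARQUET_MAGIC and data[-4:] == PARQUET_MAGIC
    (footer_len,) = struct.unpack("<I", data[-8:-4])
    # The footer metadata immediately precedes the 8-byte trailer.
    trailer = data[-(footer_len + 8):]
    with open(dst_path, "wb") as f:
        f.write(PARQUET_MAGIC + trailer)
```

A footer-only reader (for example, `pyarrow.parquet.ParquetFile(dst_path).schema`) will still report the full schema from such a file, while actually reading data (for example, `ParquetFile(dst_path).read_row_group(0)`) should fail, because the column-chunk offsets recorded in the metadata now point past the end of the file. That is exactly the gap between metadata-level and data-level validation being described here.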