Currently in AWS Glue, while running the crawler for Parquet files, Glue is only checking the metadata but not checking if those columns actually exist or not #2317
Sach1nAgarwal started this conversation in General
-
Thanks for asking this interesting question. What exactly is your use case for wanting the crawler to validate that all of the columns exist? I looked over the Glue docs and only see lists of which input types are valid. I think it assumes the input data is correct, and that to classify a file it only needs to check the first 4 and last 8 bytes. This makes it much faster over potentially very large data sets.
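For context, here is a minimal sketch of what such a header/footer-only classification could look like. The `PAR1` magic and the 8-byte trailer (little-endian footer length followed by the magic) come from the Parquet format specification; this is only an illustration of the general technique, not Glue's actual implementation, and `looks_like_parquet` is a hypothetical helper name:

```python
import struct

PARQUET_MAGIC = b"PAR1"

def looks_like_parquet(path: str) -> bool:
    """Cheap classification: inspect only the 4-byte header magic and the
    last 8 bytes (footer length + trailing magic), never the column data."""
    with open(path, "rb") as f:
        if f.read(4) != PARQUET_MAGIC:
            return False
        f.seek(0, 2)                      # jump to end of file
        file_size = f.tell()
        if file_size < 12:                # header magic + length + trailing magic
            return False
        f.seek(-8, 2)                     # read the 8-byte trailer
        (footer_len,) = struct.unpack("<I", f.read(4))
        if f.read(4) != PARQUET_MAGIC:
            return False
        # The Thrift-encoded FileMetaData must fit between header and trailer.
        return footer_len + 12 <= file_size
```

Note that this reads a constant number of bytes regardless of file size, which would explain why it scales well over very large data sets.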
-
Currently in AWS Glue, while running the crawler for Parquet files stored in S3, Glue only checks the first 4 bytes, the metadata, and the last 8 bytes to create the schema, partitions, and index; it does not check whether the columns present in the metadata actually exist in the Parquet file.
So should Glue also check whether the metadata columns actually exist?
That is, if I create a file (let's call it dummy.parquet) from a correct Parquet file by copying only the 4-byte header, the metadata, and the last 8 bytes, Glue will consider the new file (dummy.parquet) a correct Parquet file and produce the schema, partitions, and index.
So Glue will not check whether those columns actually exist.
Consider the attached file as dummy.parquet: it contains no column data, but it has the first 4 bytes, the metadata, and the last 8 bytes of a correct Parquet file that does contain columns. A sketch of how such a file can be constructed follows the attachment below.
c_dummy_parquet_first_4_bytes_and_last_full_metadata_and_last_8bytes.txt
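For illustration, a minimal Python sketch of how such a hollow file can be built, assuming a source file in standard Parquet layout (header magic, column data, Thrift-encoded FileMetaData footer, 8-byte trailer). `make_hollow_parquet` is a hypothetical helper name, not part of Glue or any Parquet library:

```python
import struct

PARQUET_MAGIC = b"PAR1"

def make_hollow_parquet(src_path: str, dst_path: str) -> None:
    # Keep only the pieces a metadata-level check inspects: the 4-byte
    # header magic, the Thrift-encoded FileMetaData footer, and the last
    # 8 bytes (footer length + trailing magic). All column data is dropped.
    with open(src_path, "rb") as f:
        data = f.read()
    assert data[:4] == PARQUET_MAGIC and data[-4:] == PARQUET_MAGIC
    (footer_len,) = struct.unpack("<I", data[-8:-4])
    # The footer metadata immediately precedes the 8-byte trailer.
    trailer = data[-(footer_len + 8):]
    with open(dst_path, "wb") as f:
        f.write(PARQUET_MAGIC + trailer)
```

A footer-only reader (for example, `pyarrow.parquet.ParquetFile(dst_path).schema`) will still report the full schema from such a file, while actually reading data (for example, `ParquetFile(dst_path).read_row_group(0)`) should fail, because the column-chunk offsets recorded in the metadata now point past the end of the file. That is exactly the gap between metadata-level and data-level validation being described here.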