Glue Table's schema, partition and index is not updating, after new file in s3 is added and re-ran the crawler #2322

Sach1nAgarwal · 2023-01-31T10:16:12Z

Suppose a set of Parquet files in an S3 folder all share the same metadata. After running the crawler on that folder, AWS Glue shows a single table (with its schema, partitions and indexes) covering all of the files.

Case 1:
Suppose a new .txt file (or a file of some other type, or a Parquet file with different metadata, e.g. different column names) is added to the same S3 folder. After re-running the crawler, nothing changes in the table's schema, partitions or indexes. That does not seem correct.

So is AWS Glue behaving incorrectly?

By contrast, if I create a new crawler for the S3 folder after uploading the .txt file, the crawler generates n tables, where n is the number of files in the S3 folder. I take this to be the correct behavior of AWS Glue, but I do not see it when re-running the existing crawler as in Case 1 above.
@jmklix
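
One way to check whether a re-run actually changed anything in the catalog is to snapshot the tables before and after the crawl and compare them. The following is only a minimal sketch, assuming boto3 is installed and credentials are configured; the helper names (`snapshot_from_tables`, `diff_snapshots`, `list_tables`) are hypothetical and not part of any AWS SDK. It compares column names and types only, not partitions or indexes.

```python
from typing import Any, Dict, Iterable, List, Tuple

# Maps table name -> list of (column name, column type).
Snapshot = Dict[str, List[Tuple[str, str]]]


def snapshot_from_tables(tables: Iterable[Dict[str, Any]]) -> Snapshot:
    """Reduce Glue table descriptions (items of "TableList") to their columns."""
    snap: Snapshot = {}
    for table in tables:
        columns = table.get("StorageDescriptor", {}).get("Columns", [])
        snap[table["Name"]] = [(c["Name"], c.get("Type", "")) for c in columns]
    return snap


def diff_snapshots(before: Snapshot, after: Snapshot) -> Dict[str, List[str]]:
    """Report which tables were added, removed, or had their columns changed."""
    common = set(before) & set(after)
    return {
        "added": sorted(set(after) - set(before)),
        "removed": sorted(set(before) - set(after)),
        "changed": sorted(name for name in common if before[name] != after[name]),
    }


def list_tables(database: str) -> List[Dict[str, Any]]:
    """Fetch every table in a Glue database (requires AWS access)."""
    import boto3  # imported lazily so the helpers above work without AWS access

    paginator = boto3.client("glue").get_paginator("get_tables")
    tables: List[Dict[str, Any]] = []
    for page in paginator.paginate(DatabaseName=database):
        tables.extend(page["TableList"])
    return tables
```

A possible usage: call `snapshot_from_tables(list_tables("my_database"))` before re-running the crawler and again after it finishes, then pass both results to `diff_snapshots`. Here `"my_database"` is a placeholder name.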

Sach1nAgarwal (Author) · 2023-02-05T10:48:29Z

@jmklix Do I need to delete the old table to get the updated table?

jmklix (Maintainer) · 2023-02-06T18:51:13Z

If the crawler is set up as a regular crawler, the newly added file is parsed, and if it does not match the schema of the expected table, it is marked as distinct.
If the crawler is set up as an incremental crawler, then even though the new file is scanned, it does not alter the table that is already in the Data Catalog.
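
To see which of the two modes a given crawler is in, one option is to inspect its recrawl policy. The sketch below is a rough illustration, not an official recipe: it assumes boto3 with configured credentials, and it assumes that "regular" corresponds to the `CRAWL_EVERYTHING` recrawl behavior and "incremental" to `CRAWL_NEW_FOLDERS_ONLY`. The function names are hypothetical.

```python
from typing import Any, Dict


def recrawl_mode(crawler: Dict[str, Any]) -> str:
    """Classify a crawler description (the "Crawler" dict from get_crawler)."""
    policy = crawler.get("RecrawlPolicy") or {}
    # Assumption: a missing policy is treated as the default, CRAWL_EVERYTHING.
    behavior = policy.get("RecrawlBehavior", "CRAWL_EVERYTHING")
    if behavior == "CRAWL_EVERYTHING":
        return "regular"
    if behavior == "CRAWL_NEW_FOLDERS_ONLY":
        return "incremental"
    return "other"  # e.g. CRAWL_EVENT_MODE


def fetch_crawler(name: str) -> Dict[str, Any]:
    """Fetch a crawler description from AWS Glue (requires AWS access)."""
    import boto3  # imported lazily so recrawl_mode works without AWS access

    return boto3.client("glue").get_crawler(Name=name)["Crawler"]
```

For example, `recrawl_mode(fetch_crawler("my-crawler"))` would report the mode of a crawler named `my-crawler` (a placeholder). The same `Crawler` dict also carries a `SchemaChangePolicy` entry, which is worth checking when a table is not being updated.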

github-actions[bot] · 2023-12-01T20:01:02Z

Hello! Reopening this discussion to make it searchable.
