Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Hive-compatible metastores, such as AWS Glue (#2206), do not store the individual files within a partition, and instead rely on listing the files in object storage at query time.
This becomes problematic when interacting with data that either:
Is not partitioned in the way that Hive expects
Is rewritten in a way that leaves Parquet files behind that no longer form part of the most recent snapshot (e.g. Delta Lake / IOx)
Describe the solution you'd like
Much like we currently support a FileFormat of CSV or Parquet, I would like to support a FileFormat of SymlinkTextInputFormat. This is just a newline-delimited list of files, stored in object storage alongside a table or partition.
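To make the "newline-delimited list of files" concrete, here is a minimal sketch of reading such a manifest. The function name `parse_symlink_manifest`, the bucket name, and the file names are all made up for illustration, and I am assuming blank lines should simply be skipped; the real format may have rules I am not aware of.

```python
from typing import List


def parse_symlink_manifest(text: str) -> List[str]:
    """Return the target paths listed in a newline-delimited manifest.

    Hypothetical helper: assumes one path per line and ignores blank lines.
    """
    paths = []
    for line in text.splitlines():
        line = line.strip()
        if line:  # skip blank lines, e.g. one produced by a trailing newline
            paths.append(line)
    return paths


# Made-up manifest contents: only the files in the latest snapshot are listed,
# so stale files left next to them in object storage are never read.
manifest = (
    "s3://example-bucket/table/part-00000.parquet\n"
    "s3://example-bucket/table/part-00003.parquet\n"
)
files = parse_symlink_manifest(manifest)
```

The point is that the set of files to scan comes from the manifest rather than from listing the directory.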
The best documentation for this functionality I can find is here, and there is documentation here on how it is used to enable inter-operation between Presto and Data Lake.
I'm not entirely sure how the query engine determines the format of the symlink targets, but my guess is that it uses the file suffix.
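If the suffix guess is right, the lookup could be sketched roughly as below. This is only an illustration of that guess, not how Hive or Presto are known to behave; `infer_format` and the `SUFFIX_TO_FORMAT` table are hypothetical names, and the table only covers the two formats mentioned above.

```python
from pathlib import PurePosixPath
from typing import Optional
from urllib.parse import urlparse

# Hypothetical mapping, limited to the formats mentioned in this issue.
SUFFIX_TO_FORMAT = {".parquet": "parquet", ".csv": "csv"}


def infer_format(target: str) -> Optional[str]:
    """Guess the format of a symlink target from its file suffix.

    Returns None when the suffix is missing or unrecognised.
    """
    # Take the path component of the URL and look at its final suffix.
    suffix = PurePosixPath(urlparse(target).path).suffix.lower()
    return SUFFIX_TO_FORMAT.get(suffix)
```

A target without a recognised suffix returns `None` here, which is exactly the case where a suffix-based approach would need some other source of truth, such as the table definition in the metastore.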
Describe alternatives you've considered
We could choose not to support this.
Additional context
I am not hugely familiar with the precise inner workings of the Hive ecosystem, as I've only interacted with tooling that uses it under the hood. I could therefore be mistaken on some aspect; if so, please feel free to correct me 😄