Support relative paths in Table Metadata #1617
Comments
I remember we discussed this use case some time ago in a sync-up meeting, and the related issue was #1531. The general feedback was that it is better to expose this as a Spark procedure and rewrite the manifests with new file path URIs to reflect the new location after replication or backup. This avoids the trouble of redefining the Iceberg spec. Any thoughts?
I don't think it is a good idea in general to use relative paths. We recently had an issue where using a wrong HDFS location resulted in data being incorrectly cleaned up. But there is still a way to do both: catalogs and tables can inject their own `FileIO`.
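(For illustration, a minimal sketch of the kind of `FileIO` injection mentioned above: a delegating `FileIO` that rewrites a path prefix before handing off to a concrete implementation. The property names and the `HadoopFileIO` delegate are assumptions made for this example, not something defined by Iceberg or proposed in this thread.)

```java
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.hadoop.HadoopFileIO;
import org.apache.iceberg.io.FileIO;
import org.apache.iceberg.io.InputFile;
import org.apache.iceberg.io.OutputFile;

// A delegating FileIO that rewrites a path prefix before delegating to a
// concrete implementation (HadoopFileIO here, purely as an example).
public class PrefixRewritingFileIO implements FileIO {
  private FileIO delegate;
  private String fromPrefix;
  private String toPrefix;

  @Override
  public void initialize(Map<String, String> properties) {
    this.delegate = new HadoopFileIO(new Configuration());
    // Hypothetical property names, not defined by Iceberg.
    this.fromPrefix = properties.getOrDefault("io.rewrite.from-prefix", "");
    this.toPrefix = properties.getOrDefault("io.rewrite.to-prefix", "");
  }

  // Replace the stored prefix with the current physical location.
  private String rewrite(String path) {
    return !fromPrefix.isEmpty() && path.startsWith(fromPrefix)
        ? toPrefix + path.substring(fromPrefix.length())
        : path;
  }

  @Override
  public InputFile newInputFile(String path) {
    return delegate.newInputFile(rewrite(path));
  }

  @Override
  public OutputFile newOutputFile(String path) {
    return delegate.newOutputFile(rewrite(path));
  }

  @Override
  public void deleteFile(String path) {
    delegate.deleteFile(rewrite(path));
  }
}
```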
@jackye1995 and @rdblue Our current thought process is as follows:
I don't think the scenario called out about a wrong HDFS location being cleaned will be worsened or improved as a result of this change. If the absolute path of the table is incorrectly interpreted to be something else, problems will still happen. We will explore the FileIO option in parallel while we discuss the complications of using relative paths further.

Relative paths would help us too. We certainly need support for a federated namespace with a virtual file system. We're moving to HDFS federation and moving our tables to a virtual filesystem (e.g., gridfs://cluster1/table) so that we can move the data around without changing all of the metadata. The challenge here is that we need to make sure that the manifests don't contain the physical HDFS path, which is likely to break. For data access control, it seems like a really good invariant that all data is under the table's path.

I'll add this as a discussion topic for the next Iceberg sync.

We've put together a design doc for this proposal. We thought through a few scenarios and added an implementation. We will be glad to consider other ideas as well.
Thanks for the design doc! I wonder if this can be done through changes in Catalog and FileIO without touching the Iceberg specification, which would be much simpler to ship without conflicting with the already numerous changes for the format v2 specification. The key difficulty I see is that table properties are part of the table metadata file. Therefore, we cannot know whether a table should use any special file path override without actually reading that file. If we have a way to know it without using FileIO, then we can decide that, for a specific table, all files should be read through a region-specific client for replication, or with a different URI scheme for backup and HDFS federation. So here is my alternative proposal:

Could this proposal satisfy your requirements?
Thanks for your comments @jackye1995. We discussed your proposal and here are our thoughts.
With our approach:
For the reasons above, we prefer the approach laid out in the design doc. We understand the concerns around the Iceberg spec, with the v2 format already changing a lot. We are open to ideas on shipping this with the least interference (even if it means we wait until v2 gets out). Please let us know your thoughts.
I read the doc through the Proposal and Approach sections and I am quite confused about what your proposal is. Could you clarify it a bit? It may be obvious from the following example sections what the changes are and why they are needed, but I think that information should be in the proposal/approach.
@flyrain, could you take a look at what can be done here?

@flyrain, in my latest comment on Ryan's comment in the design doc, I proposed starting with "supporting relative paths in metadata", which does not need changes to the spec. We can then look at refactoring the fields of the metadata.json file. This will make it easier to implement and review the change. What are your thoughts?

Thanks @anuragmantri for filing the PR and updating the design doc. It is great that there is no spec change in the latest design.
I looked through the design doc. It seems to be much better than the last version, so thank you for the update. Overall, I think it is still incorrect on a few points and is quite a bit longer than it will need to be in the end. The main confusion seems to be where the table location comes from. Iceberg table metadata tracks a location string that is written into all metadata.json files. This is the table's location. Iceberg does not require any location other than this one, and there isn't a way to pass in a different location. There are also a few table properties that can affect locations:

There are two broad classes of tables: path-based (Hadoop) tables and metastore tables.
For table metadata files, there is only an external location for Hadoop tables. Otherwise, the table location must come from the table's metadata.json file. That makes it difficult to make the metadata.json location itself relative. We could update the spec to allow relative metadata.json paths for Hadoop tables, but we would need to state how to handle a different `location` in the metadata.json file. Metadata locations aside, I think that it is a good plan to make the relative paths transparent. Most of the library should continue to operate on absolute paths; relative paths should be made while writing and made absolute while reading. That makes the changes fairly easy to test. If you do this, then I think there is a lot less to cover in the design doc.
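(A rough illustration of that write/read boundary, as a sketch only: the class, its method names, and the use of `java.net.URI` are assumptions for the example, not the proposed implementation.)

```java
import java.net.URI;

// Sketch: make paths relative at write time and absolute at read time, so the
// rest of the library keeps operating on absolute paths.
public class RelativePathExample {
  private final URI root;

  public RelativePathExample(String tableLocation) {
    // The root must end with '/' for URI.relativize/resolve to work as expected.
    this.root = URI.create(tableLocation.endsWith("/") ? tableLocation : tableLocation + "/");
  }

  // Called when writing a path into metadata.
  public String toStored(String absolutePath) {
    return root.relativize(URI.create(absolutePath)).toString();
  }

  // Called when reading a path out of metadata.
  public String toAbsolute(String storedPath) {
    return root.resolve(storedPath).toString();
  }

  public static void main(String[] args) {
    RelativePathExample paths = new RelativePathExample("s3://bucket/warehouse/db/tbl");
    String stored = paths.toStored("s3://bucket/warehouse/db/tbl/data/a.parquet");
    System.out.println(stored);                   // data/a.parquet
    System.out.println(paths.toAbsolute(stored)); // s3://bucket/warehouse/db/tbl/data/a.parquet
  }
}
```

A round trip like this is also what makes the change easy to test: relativizing and then resolving any path under the table location should return the original absolute path.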
@flyrain and @anuragmantri, see my comment above. Thanks for working on this!
@rdblue: We would like to use relative paths for replication of Iceberg tables between on-prem Hive instances, for moving on-prem Iceberg tables to the cloud, and sometimes even for moving data from the cloud to on-prem. While we can do this by rewriting the manifest and snapshot files, it would be much easier to synchronize only the files themselves, update the metadata_location / previous_metadata_location for the target table, and be done with it. Because of the use cases above, it would be useful to have relative paths in the Iceberg metadata for Metastore tables as well, and the absolute path could be generated using the Metastore table location. Maybe even the metadata_location could be made relative to the table location.
@pvary, we can explore that, but it is fairly easy to write a new metadata file and use that for a table. At least that's a small operation compared to rewriting all metadata and position delete files. My main point is that we need to have a well-documented plan for tracking the table location.
Thanks @rdblue for explaining how locations are tracked in Iceberg tables. Based on your and @pvary's comments, I updated the design doc to include
What are your thoughts?
Hi @rdblue, we might leave metadata_location and previous_metadata_location as is. They are table properties in a Metastore like HMS and are not affected when we just move files from the source table to the target. They are absolute paths, but they still point to the right place without any change.
@rdblue: I am not sure that I understand what you are proposing here. I might be mistaken, but AFAIK we store the locations of the files in the manifest and snapshot files as well.

Is there an easier way to create a full working replica of an Iceberg table where we do not use any files/data from the original table, and the two tables (original and new) can live independently after the creation of the replica?

Totally agree with this 👍 Thanks, Peter
@pvary, I'm talking only about the metadata.json files. I agree that using relative paths for manifest lists, manifests, and data files is a good idea. But we keep the table location in the root metadata file, and so it is difficult to make its location relative. I think we could do this for Hadoop tables by ignoring the location in metadata.json because it must match the table location. But Hadoop tables are mostly for testing and I don't recommend using them in practice. The bigger issue is how to track the table location for Hive or other metastore tables. In that case, I don't think it is unreasonable to use the table location from the metastore.
@pvary, ideally, table replication doesn't involve data file rewrites or metadata (manifest list, manifest, metadata.json) rewrites. The process would be as simple as the user copying all the files needed, then changing the target table properties to get the new status. That isn't the case in reality, though. In this issue thread, we were talking about two ways to replicate a table: 1. relative paths, 2. rebuilding the metadata files. Neither of them requires data file rewrites. However, the relative-path approach requires minimal metadata file rewrites, probably only metadata.json per our discussion, while the metadata-rebuild approach involves rewriting all three types of metadata files: metadata.json, manifest list, and manifest. Each of these file types stores table information that cannot be recreated just by looking at the data files, for example, the partition spec in metadata.json and its id in the manifest file, and the snapshot-related metadata. To your question, both source and target tables should be able to live independently after the replication. That's relatively easy to achieve. The hard part is enabling incremental sync-up between them and bidirectional replication, which are quite common DR (disaster recovery) use cases.
Thanks for the detailed answer @flyrain! This really helps to get a clear understanding of the tasks at hand. I would like to share my thoughts about 2 points:

I think that if we choose the relative path approach then replication becomes quite straightforward, since we do not have to handle a mix of different paths (source absolute paths for the new metadata files, destination absolute paths for the old metadata files). We just need to copy the metadata files for one-directional replication. Bidirectional replication is a different kettle of fish because of the commit resolution complexity, but I think it is also easier, since we do not have to care about manifests and manifest lists.

What do we want to change in the JSON? Is it only the path of the table, or do we have to rewrite something else as well? Could we use something like the LocationProvider (which generates new data file locations) to make the path resolution pluggable and store only the config in the table? Thanks, Peter
Just trying to catch up with the thread; I reread the design doc. For the design doc, I think we should not make any changes to the spec. To give a brief summary (and also potentially answer Peter's question): from my understanding, we are leaning towards a solution where the following items keep track of the true table roots and should be absolute, while all other paths can be relative:

Then these 4 properties are used to get 2 absolute root paths: the metadata root and the data root.

These should probably be exposed as new methods to make sure the same location derivation logic is used everywhere. Once the content of a file is written, its path is written to the metadata layer one level above, with or without an absolute path based on a flag proposed in the design doc, and the writer of this path needs to know the true root paths used in the 2 places above. On the read side, the absolute path is created based on the 2 absolute root paths. Those operations are done inside each specific IO method by passing the relative/absolute path, the correct absolute root path, and the boolean flag, to make sure operations are transparent above the IO level. All updates to the 4 locations at the top can be done in an UpdateProperties + UpdateLocation transaction, and when the metadata path is updated, the catalog's metadata pointer needs to be updated accordingly. I think if we go this route then it sounds good to me, and we can try to hash out some more details in PRs. Please let me know if this summary is accurate, or if I have any misunderstanding.
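(A sketch of what centralizing that derivation could look like. `write.metadata.path` and `write.folder-storage.path` are existing Iceberg table property names; the helper class itself and its defaults are hypothetical.)

```java
import java.util.Map;

// Sketch: derive the two absolute roots from the table location and the
// location-related table properties, so the same logic is used everywhere.
public class TableRoots {
  public static String metadataRoot(String tableLocation, Map<String, String> props) {
    return props.getOrDefault("write.metadata.path", tableLocation + "/metadata");
  }

  public static String dataRoot(String tableLocation, Map<String, String> props) {
    return props.getOrDefault("write.folder-storage.path", tableLocation + "/data");
  }
}
```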
Thanks @jackye1995 for highlighting the properties which we need to handle! I did not know about some of these properties before.

The process you have described above sounds useful to me if we have a place where both of the replicas are accessible, like a central node where we can initiate the copy of the data and then call UpdateProperties + UpdateLocation, so the replica could have its own metadata JSON which holds the correct info. In our case, for the on-prem to cloud migration, it is possible that the file copy is done by a different method and then we need to read and update the metadata JSON file from a place where the locations are not accessible. So it would be good if we could run UpdateProperties + UpdateLocation on an "invalid" table.
Yes, that is why I was proposing that, if we decide to go with relative paths, the metadata_location could be made relative as well. On the Trino side it's more complicated because some paths are hard-coded and generated directly based on Hive conventions, but I think that is an issue to fix on the Trino side: https://github.com/trinodb/trino/blob/master/plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergPageSink.java#L303-L308 But just keep in mind that there is always a risk that a compute engine does not stick to those conventions, and it will be hard to enforce.
@pvary, yes, the relative path approach looks more elegant than the metadata rebuild one. As I mentioned, it is close to the ideal case, which doesn't rewrite the metadata files.
We need to change the location in metadata.json, or a combination of the three table properties mentioned by @rdblue and @jackye1995. Thanks for @jackye1995's proposal. It makes sense to have a centralized place to get the roots for data and metadata, since there are multiple ways to specify their locations. We can also hide the logic of how to prioritize them. cc @anuragmantri.
Why just get the root for the data/metadata from the LocationProvider? Why not expose these methods instead:
Catching up on the comments from the past couple of days. Thanks everyone for providing input. @flyrain - Although it will increase the scope of this design to include bidirectional replication, we should consider covering that here. Overall, I agree with @jackye1995's proposal. Thanks for the detailed explanation.

In the initial design doc, my thinking was that the writes mentioned above did not need to change at all. However, reading the comments from @pvary above, we may indeed have to store a relative-path flag. My takeaways from this are:

An update transaction will change all the above except the relative-path boolean. Please correct me if my understanding is incorrect.
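(For reference, a sketch of such an update using the existing `Transaction` API; the new location and the property values are illustrative only, not a prescribed procedure.)

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.Transaction;

// Sketch: atomically update the table location and the location-related
// properties after the files have been copied to the new location.
public class RelocateTable {
  public static void relocate(Table table, String newLocation) {
    Transaction txn = table.newTransaction();
    txn.updateLocation()
        .setLocation(newLocation)
        .commit();
    txn.updateProperties()
        .set("write.metadata.path", newLocation + "/metadata")
        .set("write.folder-storage.path", newLocation + "/data")
        .commit();
    txn.commitTransaction();
  }
}
```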
@pvary - This may make the code change larger. Other than that, I don't see why this cannot be done unless I'm missing something.
One more use case, which also means that we have to manipulate paths in the snapshot/metadata files, is when we would like to enable temporary access for Iceberg tables stored on blobstores. In this case we have to create presigned URLs to access the specific files, and we have to use these URLs in the metadata of the table.
The design doc has been updated to add initial support for relative paths and multiple root locations for a table. Please review the design.
Can we use the migrate_table stored procedure for this?

@asheeshgarg, no, that procedure is mainly for migrating existing Hive or Spark tables to Iceberg: https://iceberg.apache.org/docs/latest/spark-procedures/#table-migration.
@rdblue We tried injecting our own FileIO via Spark config to replace the table/metadata path prefix with the new location. This works up to reading the table metadata, but while reading Parquet files it gets stuck in BatchDataReader.java in the following function. Please share your thoughts on whether there is some other way of achieving this.

```java
protected CloseableIterator<ColumnarBatch> open(FileScanTask task) {
  // The line below causes the issue because it looks up the path recorded in
  // the metadata in a map, while that path has already been replaced by the
  // custom FileIO. Changing this line to
  //   InputFile inputFile = table.io().newInputFile(filePath);
  // makes it work, but this removes the encryption logic. In short, we could
  // not make it work with only a custom FileIO.
  // ...
}
```
Hey, any news related to this subject? We would like to have a load process that writes to object storage (table A pointing to fs://location) and then creates a new table pointing to an Alluxio fs mounted on the object storage (table A1 pointing to alluxio://location) and reads from the same files. The problem today is that even if we get the new table location in Alluxio, the metadata files and manifest files will be pointing to the writing FS and not use the relative path from Alluxio to get the files via Alluxio and leverage the cache. Instead, the files will be read from the FS, ignoring Alluxio, which is only used to get the first metadata file. Does this make a case for relative paths for manifests and metadata files? Or do you think there is another way to achieve such an architecture?
@jotarada Yes, relative paths should help with this use case. However, you don't really need relative paths if you use the same paths in Alluxio. You can just configure the S3 endpoint to use the Alluxio endpoint and read from Alluxio; you don't even have to create table A1. For table A, set the S3 endpoint to Alluxio for reads: in Spark, set 'spark.fs.s3a.endpoint' and that should read from Alluxio instead of S3. No need for a new table or relative paths.
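(A sketch of that configuration; the Alluxio proxy URL is a placeholder, and note that in Spark the Hadoop S3A endpoint is usually set with the `spark.hadoop.` prefix.)

```java
import org.apache.spark.sql.SparkSession;

public class AlluxioEndpointExample {
  public static void main(String[] args) {
    // Point the S3A client at an Alluxio S3-compatible proxy endpoint so reads
    // go through Alluxio without changing any table paths. URL is a placeholder.
    SparkSession spark = SparkSession.builder()
        .config("spark.hadoop.fs.s3a.endpoint", "http://alluxio-proxy:39999/api/v1/s3")
        .config("spark.hadoop.fs.s3a.path.style.access", "true")
        .getOrCreate();
  }
}
```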
@abmo-x Can I do that on Trino as well?

Any updates on this?
Hi @jotarada, unfortunately, I had to move on to other priorities and could not continue working on this issue. If you are interested, feel free to contribute; I will be happy to help review changes and provide any other support.
@jotarada We encountered a similar issue when we switched S3 vendors, necessitating copying all data to another bucket and resulting in unreadable Iceberg tables. Currently, we have two options: either attempt to rewrite the metadata with custom code or inject a custom FileIO. We have experimented with both approaches, and I can provide you with our code for reference.
I'm trying to use Cloudflare R2 Sippy in front of S3, and I also run into this issue. Snowflake uses full absolute paths to S3 when generating manifests, and when I try querying the Iceberg table in DuckDB via its own R2 integration, it still goes to S3. I also tried downloading the files from S3 locally and querying them in DuckDB and ClickHouse, and they both try to pull the data from S3 while the copies exist locally. Unfortunately, rewriting the metadata is not an option as we don't write the data but provide a read-only solution.

That's right. The metadata files have absolute paths pointing to S3 locations. We need to revisit the relative path approach for your use case.

We also face the same issue.

For anyone who is interested: https://github.com/lightmelodies/iceberg-relative-io
I'm primarily interested in this as a means to enable unit testing; pyiceberg seems to be able to read and write relative paths of the form:

Refactored, now it should be very user-friendly with just a one-line change.
Background
The Iceberg specification captures file references in the following:

- `location` that determines the base location of the table.
- `manifests` and `manifest-list` to determine the manifests that make up the snapshot.
- `manifest_path` that identifies the location of a manifest file.
- `file_path` that identifies a data file.
- `file_path` that identifies a data file on which a position-based delete is to be applied.
All the file references are absolute paths right now.
Challenge
A table copy to another location could arise in use cases such as replication and backup. Absolute file references require rewriting the references to reflect the new target location of the table before the table can be consumed.
Solution Option
We could support relative paths as follows:

- `location` shall always reference the absolute location of the table.
- All other file references shall be resolved relative to the `location` from the table metadata.

We might further consider splitting the table metadata into two pieces:
- `format-version`
- `table-uuid`
- `location`
- `sequence-number`
- `current-snapshot-id`
- `schema`
This will ensure the following:

- Only the `location` attribute in the table metadata needs to change when the table is copied.