Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Core: Support relative paths in metadata #2658

Closed
wants to merge 13 commits into from

Conversation

anuragmantri
Copy link
Contributor

Add initial support for relative paths in metadata(#1617).

Summary of changes:

  • Introduce write.metadata.use.relative-path property that governs the use of relative paths
  • Modify metadata writers and readers to use relative paths when the above property is enabled
  • No changes were made to public APIs

@anuragmantri anuragmantri changed the title Support relative paths in metadata Core: Support relative paths in metadata Jun 1, 2021
@@ -177,6 +177,9 @@ private TableProperties() {
public static final String UPDATE_ISOLATION_LEVEL = "write.update.isolation-level";
public static final String UPDATE_ISOLATION_LEVEL_DEFAULT = "serializable";

public static final String WRITE_METADATA_USE_RELATIVE_PATH = "write.metadata.use.relative-path";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is worth to comment out in the code that this property is immutable once the table is created.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@anuragmantri
Copy link
Contributor Author

All tests in spark3/src/test/java/org/apache/iceberg/actions/TestRemoveOrphanFilesAction3.java are passing for me locally. However, there is a test failure (testHiveCatalogTable.java) in CI. Has this happed to anyone before?

@@ -122,7 +122,7 @@ public long targetSplitSize() {
@Override
public CloseableIterable<StructLike> rows() {
return CloseableIterable.transform(
ManifestFiles.read(manifest, io).project(schema),
ManifestFiles.read(manifest, io, null, null).project(schema),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we keep the original signatures as well, so we do not have to rewrite the method calls everywhere and add null, null? I think this could greatly reduce the changes needed in this patch

@@ -71,7 +71,8 @@ protected TableScan newRefinedScan(TableOperations ops, Table table, Schema sche
public CloseableIterable<FileScanTask> planFiles(TableOperations ops, Snapshot snapshot,
Expression rowFilter, boolean ignoreResiduals,
boolean caseSensitive, boolean colStats) {
ManifestGroup manifestGroup = new ManifestGroup(ops.io(), snapshot.dataManifests(), snapshot.deleteManifests())
ManifestGroup manifestGroup = new ManifestGroup(ops.io(), snapshot.dataManifests(), snapshot.deleteManifests(),
ops.current().location(), ops.current().properties())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we push the whole properties? Shall we just push a flag useRelative, or we can just push the location and if it is not null then we use it as a base for the relative paths?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of pushing properties around, I introduced a flag shouldUseRelativePaths.

}

@Override
public <F extends ContentFile<F>> ManifestEntry<F> apply(ManifestEntry<F> manifestEntry) {
if (manifestEntry.file() instanceof BaseFile) {
BaseFile<?> file = (BaseFile<?>) manifestEntry.file();
file.setSpecId(specId);
if (MetadataPaths.useRelativePath(tableProperties)) {
if (!file.path().toString().startsWith(tableLocation)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do not we push this logic inside the otAbsolutePath method?

This is only a question which we might to consider, I myself not sure about the answer yet.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. Pushed the logic into toAbsolutePath().

@@ -126,10 +126,10 @@ public long targetSplitSize() {
@Override
public CloseableIterable<StructLike> rows() {
if (manifest.content() == ManifestContent.DATA) {
return CloseableIterable.transform(ManifestFiles.read(manifest, io).project(fileSchema).entries(),
return CloseableIterable.transform(ManifestFiles.read(manifest, io, null, null).project(fileSchema).entries(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, keep the original signatures as well

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

/**
* Utility class that contains path conversion methods.
*/
public final class MetadataPaths {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe name it as MetadataPathUtils?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

return path;
}
// convert to absolute path by appending the table location
return useRelativePath(properties) && !path.startsWith(tableLocation) ? tableLocation + "/" + path : path;
Copy link
Contributor

@pvary pvary Jun 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For me extending the LocationProvider with additional default implementation which handles relative patch would be more natural than handling it in some random place in the code

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, the LocationProvider class is for data files and is used to provide locations to write the files. In this PR, we do not wish to change where data files and/or manifest files are written. We just want to change the references to any paths inside metadata files to relative paths. This way, if a table is relocated/moved, we don't have to change a bunch of metadata files.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. Would it be better to introduce something similar for metadata file locations as well? We can have 2 implementations:

  • AbsoluteMetadataLocationProvider
  • RelativeMetadataLocationProvider

Or even a single implementation at this point handling both?
This could hide the complexity of calculating the path whenever it is needed and we can just pass down this provider for every metadata file (even encapsulate the io, so we do not have to pass that around too)


import static org.apache.iceberg.types.Types.NestedField.optional;

public class RelativePathHiveFunctionalityTest extends HiveMetastoreTest {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I am correct then we try to name test to start with Test, like TestRelativePathHiveFunctionality

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. Renamed the class.

@pvary
Copy link
Contributor

pvary commented Jun 9, 2021

@anuragmantri: Tried to go through the PR, but this one is huge 😄

Some general takeaways:

  • Can we keep the original signatures for the extended methods as well? They still can call the new ones, but if we kept the original ones then we do not have to change the callers to add null, null which could reduce the number of changes greatly
  • Can we decrease the number of parameters we push down the new methods? I would prefer to send a boolean useRelativePath instead of the whole property list, or even only send just the baseLocation if we have to use it.
  • Would it be possible to hide the whole stuff inside a LocationProvider? It could be a base implementation for all of the LocationProviders, and would be a natural place in my opinion. We can avoid changing the public APIs if we provide a default implementation.
  • If there is any way to split this PR to smaller, more reviewer friendly ones that would be a big help for us, reviewers.

Thanks,
Peter

@anuragmantri
Copy link
Contributor Author

Thanks for the review @pvary :)

Would it be possible to hide the whole stuff inside a LocationProvider? It could be a base implementation for all of the LocationProviders, and would be a natural place in my opinion. We can avoid changing the public APIs if we provide a default implementation.

  • In this change, we only want to replace the references of paths in the content of metadata files. We do not wish to change where data/metadata is actually written. My understanding of LocationProviders is that they are used to get locations to write the data files and hence for this change, we don't want to change any LocationProvider implementation.

If there is any way to split this PR to smaller, more reviewer friendly ones that would be a big help for us, reviewers.

  • I tried to keep this change small initially. Specifically, in this PR, I tried to just introduce the new property and modify the read and write paths to use table property to do the necessary changes. Property is turned off my default. Yet, this turned out to be huge :) There will be several follow-up PRs such has handling metadata tables and writing unit tests. I will be happy to split this further if you think we can logically do so.

@anuragmantri anuragmantri force-pushed the support-relative-paths branch 2 times, most recently from 75a55d7 to de261fd Compare June 22, 2021 07:11
@@ -180,14 +210,20 @@ public String manifestListLocation() {
return manifestListLocation;
}

@Override
public FileIO io() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I leave a comment here, because I would rather not leak io here if possible

return path;
}
// convert to absolute path by appending the table location
return shouldUseRelativePaths && !path.startsWith(tableLocation) ? tableLocation + "/" + path : path;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not like concatenating "/" in path. Maybe using Path object would be better.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Paths.get?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need shouldUseRelativePaths here? Can't we just use the return !path.startsWith(tableLocation) ? Paths.join(tableLocation, path) : path?

* @param properties table properties
* @return true if "write.metadata.use.relative-path" property is true, false otherwise
*/
public static boolean shouldUseRelativePath(Map<String, String> properties) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe Table as a parameter as mentioned in the comments?

return path;
}
// convert to relative path by removing the table location
return shouldUseRelativePaths && path.startsWith(tableLocation) ?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we should use relative paths, it might be an error if the path is not starts with location, and it is not relative...
Do we want that check here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for both of these functions we are basically care more about the output than the input

toAbsolute
(Absolute Path | Relative to table Location) -> Absolute Path

toRelative
(Absolute Path | Relative to table location) -> relativizedPath to table location | absolutePath not including table location

As @pvary notes it's possible for AbsolutePath to be in a completely different file system, in that case this method sometimes returns relative paths, but sometimes returns absolute paths.

I think this is actually the right thing to do for this method since we can only ever deal with paths relative to the table location ... but maybe this needs another name though? Since currently it doesn't always return relative paths

@flyrain
Copy link
Contributor

flyrain commented Jun 22, 2021

Hi @aokolnychyi, @rdblue, can you take a look at this PR? Data replication is a pretty common request for any serious Iceberg user. Let's see how can we move forward with #1617. Thanks.

@anuragmantri
Copy link
Contributor Author

anuragmantri commented Jun 25, 2021

CC: @dpaani - Who has also contributed to this work.

@rdblue
Copy link
Contributor

rdblue commented Jun 25, 2021

@flyrain, thanks for pointing out this PR. Looks like the design doc was updated, so I'll make some time to take another look at that.

@@ -43,22 +47,54 @@ private ManifestLists() {
.reuseContainers(false)
.build()) {

return Lists.newLinkedList(files);
return shouldUseRelativePaths ? updateManifestFilePathsToAbsolutePaths(files,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this was commented on the issue as well, but shouldn't we always call updateManifestFilePathstoAbsolute here? Instead of passing through a boolean we could just see if we need to update any files and return the updated set? So a use with only absolute paths would essentially just pass through the list and not change anything?

This would remove our requirement to pass through "shouldUseRelativePaths"

* @param tableLocation the base table location
* @return relative path with respect to the base table location
*/
public static String toRelativePath(String path, String tableLocation, boolean shouldUseRelativePaths) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need shouldUseRelativePaths here? Shouldn't the function toRelativePath always return a relative path? The caller would know whether it needs to call toRelativePath or toAbsolutePath?

*
* @return file io used for this snapshot
*/
FileIO io();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this?

@anuragmantri anuragmantri changed the title Core: Support relative paths in metadata [WIP] Core: Support relative paths in metadata Mar 17, 2022
Copy link

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Jul 17, 2024
Copy link

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this Jul 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants