-
Notifications
You must be signed in to change notification settings - Fork 2.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Core: Support relative paths in metadata #2658
Conversation
@@ -177,6 +177,9 @@ private TableProperties() { | |||
public static final String UPDATE_ISOLATION_LEVEL = "write.update.isolation-level"; | |||
public static final String UPDATE_ISOLATION_LEVEL_DEFAULT = "serializable"; | |||
|
|||
public static final String WRITE_METADATA_USE_RELATIVE_PATH = "write.metadata.use.relative-path"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is worth to comment out in the code that this property is immutable once the table is created.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
f5f9855
to
fd894b6
Compare
All tests in |
@@ -122,7 +122,7 @@ public long targetSplitSize() { | |||
@Override | |||
public CloseableIterable<StructLike> rows() { | |||
return CloseableIterable.transform( | |||
ManifestFiles.read(manifest, io).project(schema), | |||
ManifestFiles.read(manifest, io, null, null).project(schema), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we keep the original signatures as well, so we do not have to rewrite the method calls everywhere and add null, null
? I think this could greatly reduce the changes needed in this patch
@@ -71,7 +71,8 @@ protected TableScan newRefinedScan(TableOperations ops, Table table, Schema sche | |||
public CloseableIterable<FileScanTask> planFiles(TableOperations ops, Snapshot snapshot, | |||
Expression rowFilter, boolean ignoreResiduals, | |||
boolean caseSensitive, boolean colStats) { | |||
ManifestGroup manifestGroup = new ManifestGroup(ops.io(), snapshot.dataManifests(), snapshot.deleteManifests()) | |||
ManifestGroup manifestGroup = new ManifestGroup(ops.io(), snapshot.dataManifests(), snapshot.deleteManifests(), | |||
ops.current().location(), ops.current().properties()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we push the whole properties
? Shall we just push a flag useRelative
, or we can just push the location
and if it is not null then we use it as a base for the relative paths?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of pushing properties
around, I introduced a flag shouldUseRelativePaths
.
} | ||
|
||
@Override | ||
public <F extends ContentFile<F>> ManifestEntry<F> apply(ManifestEntry<F> manifestEntry) { | ||
if (manifestEntry.file() instanceof BaseFile) { | ||
BaseFile<?> file = (BaseFile<?>) manifestEntry.file(); | ||
file.setSpecId(specId); | ||
if (MetadataPaths.useRelativePath(tableProperties)) { | ||
if (!file.path().toString().startsWith(tableLocation)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do not we push this logic inside the otAbsolutePath
method?
This is only a question which we might to consider, I myself not sure about the answer yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right. Pushed the logic into toAbsolutePath()
.
@@ -126,10 +126,10 @@ public long targetSplitSize() { | |||
@Override | |||
public CloseableIterable<StructLike> rows() { | |||
if (manifest.content() == ManifestContent.DATA) { | |||
return CloseableIterable.transform(ManifestFiles.read(manifest, io).project(fileSchema).entries(), | |||
return CloseableIterable.transform(ManifestFiles.read(manifest, io, null, null).project(fileSchema).entries(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, keep the original signatures as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
/** | ||
* Utility class that contains path conversion methods. | ||
*/ | ||
public final class MetadataPaths { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe name it as MetadataPathUtils
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
return path; | ||
} | ||
// convert to absolute path by appending the table location | ||
return useRelativePath(properties) && !path.startsWith(tableLocation) ? tableLocation + "/" + path : path; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For me extending the LocationProvider with additional default implementation which handles relative patch would be more natural than handling it in some random place in the code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC, the LocationProvider
class is for data files and is used to provide locations to write the files. In this PR, we do not wish to change where data files and/or manifest files are written. We just want to change the references to any paths inside metadata
files to relative paths. This way, if a table is relocated/moved, we don't have to change a bunch of metadata files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right. Would it be better to introduce something similar for metadata file locations as well? We can have 2 implementations:
- AbsoluteMetadataLocationProvider
- RelativeMetadataLocationProvider
Or even a single implementation at this point handling both?
This could hide the complexity of calculating the path whenever it is needed and we can just pass down this provider for every metadata file (even encapsulate the io, so we do not have to pass that around too)
|
||
import static org.apache.iceberg.types.Types.NestedField.optional; | ||
|
||
public class RelativePathHiveFunctionalityTest extends HiveMetastoreTest { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I am correct then we try to name test to start with Test
, like TestRelativePathHiveFunctionality
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right. Renamed the class.
@anuragmantri: Tried to go through the PR, but this one is huge 😄 Some general takeaways:
Thanks, |
fd894b6
to
61f0907
Compare
Thanks for the review @pvary :)
|
75a55d7
to
de261fd
Compare
@@ -180,14 +210,20 @@ public String manifestListLocation() { | |||
return manifestListLocation; | |||
} | |||
|
|||
@Override | |||
public FileIO io() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I leave a comment here, because I would rather not leak io
here if possible
return path; | ||
} | ||
// convert to absolute path by appending the table location | ||
return shouldUseRelativePaths && !path.startsWith(tableLocation) ? tableLocation + "/" + path : path; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not like concatenating "/" in path. Maybe using Path
object would be better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Paths.get?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need shouldUseRelativePaths here? Can't we just use the return !path.startsWith(tableLocation) ? Paths.join(tableLocation, path) : path?
* @param properties table properties | ||
* @return true if "write.metadata.use.relative-path" property is true, false otherwise | ||
*/ | ||
public static boolean shouldUseRelativePath(Map<String, String> properties) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe Table
as a parameter as mentioned in the comments?
return path; | ||
} | ||
// convert to relative path by removing the table location | ||
return shouldUseRelativePaths && path.startsWith(tableLocation) ? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we should use relative paths, it might be an error if the path is not starts with location, and it is not relative...
Do we want that check here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think for both of these functions we are basically care more about the output than the input
toAbsolute
(Absolute Path | Relative to table Location) -> Absolute Path
toRelative
(Absolute Path | Relative to table location) -> relativizedPath to table location | absolutePath not including table location
As @pvary notes it's possible for AbsolutePath to be in a completely different file system, in that case this method sometimes returns relative paths, but sometimes returns absolute paths.
I think this is actually the right thing to do for this method since we can only ever deal with paths relative to the table location ... but maybe this needs another name though? Since currently it doesn't always return relative paths
Hi @aokolnychyi, @rdblue, can you take a look at this PR? Data replication is a pretty common request for any serious Iceberg user. Let's see how can we move forward with #1617. Thanks. |
CC: @dpaani - Who has also contributed to this work. |
@flyrain, thanks for pointing out this PR. Looks like the design doc was updated, so I'll make some time to take another look at that. |
@@ -43,22 +47,54 @@ private ManifestLists() { | |||
.reuseContainers(false) | |||
.build()) { | |||
|
|||
return Lists.newLinkedList(files); | |||
return shouldUseRelativePaths ? updateManifestFilePathsToAbsolutePaths(files, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this was commented on the issue as well, but shouldn't we always call updateManifestFilePathstoAbsolute here? Instead of passing through a boolean we could just see if we need to update any files and return the updated set? So a use with only absolute paths would essentially just pass through the list and not change anything?
This would remove our requirement to pass through "shouldUseRelativePaths"
* @param tableLocation the base table location | ||
* @return relative path with respect to the base table location | ||
*/ | ||
public static String toRelativePath(String path, String tableLocation, boolean shouldUseRelativePaths) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need shouldUseRelativePaths here? Shouldn't the function toRelativePath always return a relative path? The caller would know whether it needs to call toRelativePath or toAbsolutePath?
* | ||
* @return file io used for this snapshot | ||
*/ | ||
FileIO io(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this?
30308fd
to
0bb516e
Compare
bedb035
to
02f650e
Compare
02f650e
to
92f402b
Compare
- Kept all the default signatures to prevent changes that are not necessary - Introduced a new flag shouldUseRelativePaths to determine if a table uses relative paths
92f402b
to
f710a8e
Compare
f5798bc
to
e5f02c5
Compare
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions. |
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time. |
Add initial support for relative paths in metadata(#1617).
Summary of changes:
write.metadata.use.relative-path
property that governs the use of relative paths