Core: Use InternalData with avro and common DataIterable for readers. #12476
base: main
.project(ManifestEntry.getSchema(Types.StructType.of()).select("status"))
.classLoader(GenericManifestEntry.class.getClassLoader())
.build()) {
metadata = headerReader.getMetadata();

if (headerReader instanceof InternalData.DataIterable) {
This is a little bit of a workaround since we can't switch fully over to DataIterable due to binary incompatibility, but this is about the only place other than tests where it's used.
I'm not sure that this is needed and I'd prefer not to add the interface if possible. We should not really be using the file metadata to recover things like the partition spec. I think that this is only used by older code paths that we didn't migrate to pass the ID-to-spec map through.
At this point, I think we should go see where those methods are used and try to remove them, instead of adding this.
.project(ManifestFile.schema())
.classLoader(GenericManifestFile.class.getClassLoader())
.reuseContainers(false)
This is just using the default, right?
@@ -163,6 +180,11 @@ public interface ReadBuilder {
/** Set a custom class for in-memory objects at the given field ID. */
ReadBuilder setCustomType(int fieldId, Class<? extends StructLike> structClass);

/** Set the classloader used for custom types. */
default ReadBuilder classLoader(ClassLoader classLoader) {
throw new UnsupportedOperationException("Classloader not supported");
I don't think that this is needed. I originally added it in the InternalData commit, but because we are passing the classes themselves (rather than loading them dynamically by name) they are already loaded.
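To illustrate the point above, here is a minimal, self-contained sketch (not Iceberg code) contrasting the two registration styles. `CustomTypeLookup`, its `rename` and `resolve` methods, and `PartitionSummary` are hypothetical names made up for this example; only the general idea of registering a class object versus a class name comes from the discussion.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; these are not Iceberg classes.
class CustomTypeLookup {
  /** Hypothetical stand-in for a custom in-memory struct class. */
  static class PartitionSummary {}

  // Registered as Class objects: the caller's class is already loaded.
  private final Map<Integer, Class<?>> byClass = new HashMap<>();
  // Registered by name: the class must be resolved later through a loader.
  private final Map<Integer, String> byName = new HashMap<>();

  CustomTypeLookup setCustomType(int fieldId, Class<?> structClass) {
    byClass.put(fieldId, structClass);
    return this;
  }

  CustomTypeLookup rename(int fieldId, String className) {
    byName.put(fieldId, className);
    return this;
  }

  /** Resolves the class for a field ID; the loader is consulted only for by-name entries. */
  Class<?> resolve(int fieldId, ClassLoader loader) {
    Class<?> direct = byClass.get(fieldId);
    if (direct != null) {
      return direct; // no class loading needed
    }
    String name = byName.get(fieldId);
    if (name == null) {
      throw new IllegalArgumentException("No custom type for field " + fieldId);
    }
    try {
      return Class.forName(name, true, loader);
    } catch (ClassNotFoundException e) {
      throw new IllegalStateException("Cannot load " + name, e);
    }
  }
}
```

With the by-class style, `resolve` never touches the loader argument, which is why a separate `classLoader(...)` builder option adds nothing in that case.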
.rename("r508", GenericPartitionFieldSummary.class.getName())
InternalData.read(FileFormat.AVRO, io.newInputFile(manifestListLocation))
.setRootType(GenericManifestFile.class)
.setCustomType(508, GenericPartitionFieldSummary.class)
Do we have a constant for this field id?
This PR expands the use of InternalData for metadata reads to ManifestLists and AllManifestsTable and uses a common iterable that exposes access to file metadata like Avro previously supported. Implements the new metadata read path for both Avro and Parquet.
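As a rough sketch of what "a common iterable that exposes access to file metadata" could look like, the following standalone Java example defines its own simplified `CloseableIterable` and `DataIterable` interfaces. These are assumptions for illustration and do not reproduce the interfaces in the PR; only the `getMetadata()` call and the `instanceof` check mirror the diff snippet quoted in the review. The helpers `of`, `plain`, and `metadataOrEmpty` are hypothetical.

```java
import java.io.Closeable;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; simplified stand-ins, not the PR's actual interfaces.
class MetadataIterableSketch {
  /** Simplified stand-in for a closeable row iterable. */
  interface CloseableIterable<T> extends Iterable<T>, Closeable {}

  /** An iterable that also exposes key/value metadata from the file it reads. */
  interface DataIterable<T> extends CloseableIterable<T> {
    Map<String, String> getMetadata();
  }

  /** Builds an in-memory DataIterable for demonstration. */
  static <T> DataIterable<T> of(List<T> rows, Map<String, String> metadata) {
    return new DataIterable<T>() {
      @Override public Iterator<T> iterator() { return rows.iterator(); }
      @Override public Map<String, String> getMetadata() { return metadata; }
      @Override public void close() {}
    };
  }

  /** Builds a plain iterable that carries no metadata. */
  static <T> CloseableIterable<T> plain(List<T> rows) {
    return new CloseableIterable<T>() {
      @Override public Iterator<T> iterator() { return rows.iterator(); }
      @Override public void close() {}
    };
  }

  /** Returns file metadata when the reader exposes it, otherwise an empty map. */
  static Map<String, String> metadataOrEmpty(CloseableIterable<?> reader) {
    if (reader instanceof DataIterable) {
      return ((DataIterable<?>) reader).getMetadata();
    }
    return Collections.emptyMap();
  }
}
```

The `instanceof` check in `metadataOrEmpty` is the same kind of workaround discussed in the review: callers that hold only the general iterable type can still reach the metadata without changing existing method signatures.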