Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Core: Use InternalData with avro and common DataIterable for readers. #12476

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

danielcweeks
Copy link
Contributor

@danielcweeks danielcweeks commented Mar 7, 2025

This PR expands the use of InternalData for metadata reads to ManifestsLists and AllManifestsTable and uses a common iterable that exposes access to file metadata like Avro previously supported. Implements the new metadata read path for both Avro and Parquet.

@danielcweeks danielcweeks marked this pull request as draft March 7, 2025 17:27
@danielcweeks danielcweeks changed the title Core: Use InternalData with avro with common DataIterable for readers. Core: Use InternalData with avro and common DataIterable for readers. Mar 7, 2025
.project(ManifestEntry.getSchema(Types.StructType.of()).select("status"))
.classLoader(GenericManifestEntry.class.getClassLoader())
.build()) {
metadata = headerReader.getMetadata();

if (headerReader instanceof InternalData.DataIterable) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a little bit of a workaround since we can't switch fully over to DataIterable due to binary incompatibility, but this is about the only place other than tests where it's used.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that this is needed and I'd prefer not to add the interface if possible. We should not really be using the file metadata to recover things like the partition spec. I think that this is only used by older code paths that we didn't migrate to pass the id to spec map through.

At this point, I think we should go see where those methods are used and try to remove them, instead of adding this.

@danielcweeks danielcweeks marked this pull request as ready for review March 7, 2025 21:54
.project(ManifestFile.schema())
.classLoader(GenericManifestFile.class.getClassLoader())
.reuseContainers(false)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just using the default, right?

@@ -163,6 +180,11 @@ public interface ReadBuilder {
/** Set a custom class for in-memory objects at the given field ID. */
ReadBuilder setCustomType(int fieldId, Class<? extends StructLike> structClass);

/** Set the classloader used for custom types. */
default ReadBuilder classLoader(ClassLoader classLoader) {
throw new UnsupportedOperationException("Classloader not supported");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that this is needed. I originally added it in the InternalData commit, but because we are passing the classes themselves (rather than loading them dynamically by name) they are already loaded.

.rename("r508", GenericPartitionFieldSummary.class.getName())
InternalData.read(FileFormat.AVRO, io.newInputFile(manifestListLocation))
.setRootType(GenericManifestFile.class)
.setCustomType(508, GenericPartitionFieldSummary.class)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have a constant for this field id?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants