Skip to content

Conversation

@rdblue
Copy link
Contributor

@rdblue rdblue commented Jul 14, 2019

This removes the partition column from the files and entries metadata tables when the underlying table is not partitioned. For unpartitioned tables, this is an empty struct and cannot be projected in Spark.

In support of suppressing the partition column, this also updates ManifestReader and FilteredManifest to support schema-based projection. Because this is already updating the Filterable API, this fixes #145 and adds case sensitivity methods.

Schema schema = new Schema(DataFile.getType(table.spec().partitionType()).fields());
if (table.spec().fields().size() < 1) {
// avoid returning an empty struct, which is not always supported. instead, drop the partition field (id 102)
return TypeUtil.selectNot(schema, Sets.newHashSet(102));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we enumerate these fields rather than using magic numbers (i.e. 102)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These IDs are part of the spec, so I think it is better to use the IDs than to use names. That's why I added a comment to explain which field is being removed.


public static class FilesTableScan extends BaseTableScan {
private static final long TARGET_SPLIT_SIZE = 32 * 1024 * 1024; // 32 MB
private final Schema fileSchema;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like we're just double capturing a field from BaseTableScan here. Seem like we should just expose a way to get at the original schema or use the refined schema.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now, data tasks aren't responsible for projection because that's done easily by the engines. So we don't want to use the refined schema.

We could use the base table's schema without passing it through by making this a non-static inner class, but doing that seems odd with the refinement pattern to me. I thought it would be better to explicitly pass this through.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking more of exposing the ability get the original, unrefined schema from the base table, which wouldn't require making this an inner static class.

I'll leave it up to you though as I don't have a strong opinion about it.

@danielcweeks
Copy link
Contributor

+1 LGTM

@rdblue rdblue merged commit 33a3882 into apache:master Jul 16, 2019
@rdblue rdblue deleted the fix-data-files-table branch July 16, 2019 16:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Filterable does not support case insensitivity

2 participants