Update metadata tables for unpartitioned tables #285

rdblue · 2019-07-14T01:48:59Z

This removes the partition column from the files and entries metadata tables when the underlying table is not partitioned. For unpartitioned tables, that column is an empty struct, which cannot be projected in Spark.

To support suppressing the partition column, this also updates ManifestReader and FilteredManifest to support schema-based projection. Because this already updates the Filterable API, it also fixes #145 and adds case sensitivity methods.

danielcweeks · 2019-07-16T15:43:48Z

core/src/main/java/org/apache/iceberg/DataFilesTable.java

+    Schema schema = new Schema(DataFile.getType(table.spec().partitionType()).fields());
+    if (table.spec().fields().size() < 1) {
+      // avoid returning an empty struct, which is not always supported. instead, drop the partition field (id 102)
+      return TypeUtil.selectNot(schema, Sets.newHashSet(102));


Should we enumerate these fields rather than using magic numbers (i.e. 102)?

rdblue · 2019-07-16

These IDs are part of the spec, so I think it is better to use the IDs than to use names. That's why I added a comment to explain which field is being removed.
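To illustrate the point about IDs versus names, here is a minimal, self-contained sketch of excluding a field by ID. It does not use Iceberg's classes: `SelectNotSketch`, `Field`, and this `selectNot` are hypothetical stand-ins, and every ID and name other than the partition field's ID 102 (taken from the comment in the diff) is illustrative. The exclusion keeps working if the field is renamed, because only the ID is compared.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; not Iceberg's Schema, Types.NestedField, or TypeUtil.
final class SelectNotSketch {
  // ID of the partition field, as stated in the code comment under review.
  static final int PARTITION_ID = 102;

  // Simplified stand-in for a schema field: an ID plus a display name.
  static final class Field {
    final int id;
    final String name;

    Field(int id, String name) {
      this.id = id;
      this.name = name;
    }
  }

  // Returns the fields whose IDs are NOT in excludedIds, preserving order.
  static List<Field> selectNot(List<Field> fields, Set<Integer> excludedIds) {
    List<Field> kept = new ArrayList<>();
    for (Field field : fields) {
      if (!excludedIds.contains(field.id)) {
        kept.add(field);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    // Illustrative fields; only ID 102 is taken from the diff above.
    List<Field> fields = Arrays.asList(
        new Field(100, "file_path"),
        new Field(PARTITION_ID, "partition"),
        new Field(103, "record_count"));

    // Drop the partition field by ID and print what remains.
    for (Field field : selectNot(fields, Collections.singleton(PARTITION_ID))) {
      System.out.println(field.id + ": " + field.name);
    }
  }
}
```

A name-based variant would need updating whenever a field is renamed, which is the trade-off being discussed here.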

danielcweeks · 2019-07-16T15:58:46Z

core/src/main/java/org/apache/iceberg/DataFilesTable.java


  public static class FilesTableScan extends BaseTableScan {
    private static final long TARGET_SPLIT_SIZE = 32 * 1024 * 1024; // 32 MB
+    private final Schema fileSchema;


Seems like we're just double-capturing a field from BaseTableScan here. It seems like we should either expose a way to get at the original schema or use the refined schema.

rdblue · 2019-07-16

Right now, data tasks aren't responsible for projection because that's done easily by the engines. So we don't want to use the refined schema.

We could use the base table's schema without passing it through by making this a non-static inner class, but doing that seems odd with the refinement pattern to me. I thought it would be better to explicitly pass this through.

danielcweeks · 2019-07-16

I was thinking more of exposing the ability to get the original, unrefined schema from the base table, which wouldn't require making this a non-static inner class.

I'll leave it up to you though as I don't have a strong opinion about it.
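For readers following this thread, a minimal sketch of the "explicitly pass it through" option may help. It is a simplified model, not Iceberg's code: `ScanSketch`, `FilesScanSketch`, `newRefinedScan`, and the use of plain column-name lists in place of `Schema` are all assumptions made for illustration. The idea is that each refinement returns a new scan, and the static nested-style class carries the original schema alongside the refined projection.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified model of a refinable scan; not Iceberg's BaseTableScan.
abstract class ScanSketch {
  private final List<String> projected; // columns selected by refinement

  ScanSketch(List<String> projected) {
    this.projected = projected;
  }

  List<String> projected() {
    return projected;
  }

  // Refinement pattern: return a new scan instead of mutating this one.
  ScanSketch select(String... columns) {
    return newRefinedScan(Arrays.asList(columns));
  }

  protected abstract ScanSketch newRefinedScan(List<String> columns);
}

// Hypothetical counterpart of FilesTableScan that keeps the unrefined schema.
final class FilesScanSketch extends ScanSketch {
  private final List<String> fileSchema; // original, unrefined columns

  FilesScanSketch(List<String> fileSchema, List<String> projected) {
    super(projected);
    this.fileSchema = fileSchema;
  }

  List<String> fileSchema() {
    return fileSchema;
  }

  @Override
  protected ScanSketch newRefinedScan(List<String> columns) {
    // Pass the original schema through so it survives every refinement.
    return new FilesScanSketch(fileSchema, columns);
  }
}
```

The alternative raised above, a getter for the unrefined schema on the base class, would remove the extra field at the cost of widening the base class's surface.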

danielcweeks · 2019-07-16T16:32:33Z

+1 LGTM

Fix data files table for unpartitioned tables.

1d5185c

danielcweeks reviewed Jul 16, 2019


rdblue merged commit 33a3882 into apache:master Jul 16, 2019

rdblue deleted the fix-data-files-table branch July 16, 2019 16:45
