Contributor

@yyanyy yyanyy commented Nov 21, 2020

public void testReadEntriesWithFilterAndSelectIncludesFullStats() throws IOException {
ManifestFile manifest = writeManifest(1000L, FILE);
try (ManifestReader<DataFile> reader = ManifestFiles.read(manifest, FILE_IO)
.select(ImmutableSet.of("record_count"))
Contributor Author

If I change record_count here to another column, the test fails with an NPE: InclusiveMetricsEvaluator.eval needs the record count, but STATS_COLUMNS in the manifest reader doesn't include it. I know the reader is normally only used internally, so we don't expect to run into this often, but I wonder if we should ensure record_count is always added when populating stats.
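
A minimal sketch of that idea, not the actual ManifestReader code: the StatsProjection class, the withStats method, and the contents of STATS_COLUMNS below are assumptions made for illustration. It shows a projection helper that adds record_count whenever the stats columns are added.

```java
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; not part of the manifest reader.
final class StatsProjection {
  // Assumed stand-in for the reader's STATS_COLUMNS; the real list may differ.
  static final List<String> STATS_COLUMNS = List.of(
      "value_counts", "null_value_counts", "nan_value_counts",
      "lower_bounds", "upper_bounds");

  static final String RECORD_COUNT = "record_count";

  private StatsProjection() {
  }

  // Returns the caller's selected columns plus every stats column,
  // and always adds record_count so a metrics evaluator can read it.
  static Set<String> withStats(Collection<String> selected) {
    Set<String> projected = new LinkedHashSet<>(selected);
    projected.addAll(STATS_COLUMNS);
    projected.add(RECORD_COUNT);
    return projected;
  }
}
```

With this sketch, StatsProjection.withStats(List.of("file_path")) projects file_path, the stats columns, and record_count, so a filter that needs stats never sees a missing record count.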

Contributor

Yes, I think we do. Maybe we should do that in a separate update, though.

Contributor Author

Sounds good, I'll create a separate PR for that.

Contributor Author

PR: #1820

assertBounds(6, BinaryType.get(),
ByteBuffer.wrap("A".getBytes()), ByteBuffer.wrap("A".getBytes()), metrics);
assertCounts(7, 1L, 0L, 1L, metrics);
assertBounds(7, DoubleType.get(), Double.NaN, Double.NaN, metrics);
Contributor

Why are NaN values getting into the lower and upper bounds?

Contributor Author

This is because I added NaN as the only value in this column when creating the record in buildNestedTestRecord, and currently that results in the upper and lower bounds both being NaN (similar to the behavior in this test). I added this extra column to test NaN handling in metrics modes, and the change to this test was a side effect. Do you want me to remove the bound check in this test?

Contributor

So I guess this will continue to happen until we ignore NaN values and keep track of the lower and upper bounds ourselves for Parquet and ORC?

This is fine for now, but I would want this to be correct eventually.
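
A rough sketch of what tracking the bounds ourselves could look like, under stated assumptions: NaNAwareBounds and its methods are hypothetical names, not the Parquet, ORC, or Iceberg metrics API. NaN values are counted separately and never become a bound, so a column holding only NaN ends up with no lower or upper bound.

```java
// Hypothetical accumulator for illustration only; not an existing metrics API.
final class NaNAwareBounds {
  private long valueCount = 0L;
  private long nanCount = 0L;
  private Double lower = null; // stays null until a non-NaN value is seen
  private Double upper = null;

  void add(double value) {
    valueCount++;
    if (Double.isNaN(value)) {
      // Count NaN separately and keep it out of the bounds.
      nanCount++;
      return;
    }
    if (lower == null || Double.compare(value, lower) < 0) {
      lower = value;
    }
    if (upper == null || Double.compare(value, upper) > 0) {
      upper = value;
    }
  }

  long valueCount() {
    return valueCount;
  }

  long nanCount() {
    return nanCount;
  }

  // Returns null when no non-NaN value was added.
  Double lower() {
    return lower;
  }

  // Returns null when no non-NaN value was added.
  Double upper() {
    return upper;
  }
}
```

Returning null for an all-NaN column is one possible design choice; it lets a reader tell "no usable bound" apart from a real value, instead of reporting NaN as both bounds.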

@rdblue rdblue merged commit b1296bc into apache:master Nov 25, 2020
Contributor

rdblue commented Nov 25, 2020

Nice work. Thanks @yyanyy!
