-
Notifications
You must be signed in to change notification settings - Fork 2.9k
Add NaN value count to content file #1803
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
yyanyy
commented
Nov 21, 2020
- counterpart of API: add isNaN and notNaN predicates #1747
- will only add the new count in V2 table
- also added some tests for stats handling in manifest reader
| public void testReadEntriesWithFilterAndSelectIncludesFullStats() throws IOException { | ||
| ManifestFile manifest = writeManifest(1000L, FILE); | ||
| try (ManifestReader<DataFile> reader = ManifestFiles.read(manifest, FILE_IO) | ||
| .select(ImmutableSet.of("record_count")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I change this record_count to something else it will result in NPE due to InclusiveMetrisEvaluator.eval needing record count, however STATS_COLUMNS in manifest reader doesn't have it. I know the reader normally will only be used internally so we don't expect to run into this often, but wonder if we want to ensure record_count is always added when populating stats.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think we do. Maybe we should do that in a separate update, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, I'll create a separate pr for that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PR: #1820
core/src/test/java/org/apache/iceberg/TestManifestWriterVersions.java
Outdated
Show resolved
Hide resolved
core/src/test/java/org/apache/iceberg/TestManifestWriterVersions.java
Outdated
Show resolved
Hide resolved
| assertBounds(6, BinaryType.get(), | ||
| ByteBuffer.wrap("A".getBytes()), ByteBuffer.wrap("A".getBytes()), metrics); | ||
| assertCounts(7, 1L, 0L, 1L, metrics); | ||
| assertBounds(7, DoubleType.get(), Double.NaN, Double.NaN, metrics); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are NaN values getting into the lower and upper bounds?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was because I added NaN as the only value in this column during the creation of the record in buildNestedTestRecord, and currently this will result in upper and lower bound being both NaN (similar behavior as in this test. I added this extra column in order to test NaN handling in metrics modes, and change to this test was a side effect. Do you want me to remove the bound check in this test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I guess this will continue to happen until we ignore NaN values and keep track of the lower and upper bounds ourselves for Parquet and ORC?
This is fine for now, but I would want this to be correct eventually.
|
Nice work. Thanks @yyanyy! |