-
Notifications
You must be signed in to change notification settings - Fork 2.9k
Core: update record_count behavior, include in manifest reader #1820
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
|
||
| // the difference between the two stats set below is to support ContentFile.copyWithoutStats(), which | ||
| // still keeps record count. | ||
| private static final Set<String> STATS_COLUMNS = Sets.newHashSet( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Keeps record count or discards record count?
I think it was an oversight to not include record count in stats. I think we should just have one list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think copyWithoutStats doesn't discard record count will discarding all column-specific stats.
I do agree that having one list is simpler, the reason for me to do this is
- If we add
record_countto this list then it will result in a behavior change, that if people selectrecord_countwithout other stats listed here, earlier they will not receive those stats, but now they will receive a full list. This is becausedropStatsrelies on this list. - Alternatively we can stop copying
recordCountover withincopyWithoutStatsbut I'm not entirely sure if we want to do that since currently the metrics that can be discarded are all map, andrecordCountislong; and I guess if we no longer copyrecordCountwe may as well not copyfileSizeInByteswhich is anotherlong. After this change since these two attributes return primitive type, they will return -1, which I'm not sure if it's the best thing to do.
I think the first approach is safer, but I wasn't sure if it's worth changing the behavior to keep the code simpler. Do you have a recommendation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we special-case record_count? I don't think that record_count should be dropped in copyWithoutStats, but I also agree that simply selecting record_count should not select all stats columns.
This set is primarily for ensuring that all stats required to filter the instances of DataFile or DeleteFile stored in a manifest are present. I think we should limit it to that purpose. I don't think that this list should affect the behavior or copyWithoutStats.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, I'll add record_count to the set so that the set tracks all stats required for filtering data/delete file, and special-case record_count in dropStats() to account for the case that selecting only record_count will still result in copyWithoutStats instead of a full copy. (This is actually the same as the current logic, just consolidate the two sets into one)
| } | ||
|
|
||
| private static Collection<String> withStatsColumns(Collection<String> columns) { | ||
| static Collection<String> withStatsColumns(Collection<String> columns) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this returned List, then we wouldn't need to copy the list in ManifestGroup
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In one of the usage columns directly comes from the field column which is a collection so I didn't change that, but it seems that from all its current usage this field could be changed to a list. I'll need to update a few tests after this but I think it's doable. Will do!
| private FileFormat format = null; | ||
| private PartitionData partitionData = null; | ||
| private Long recordCount = null; | ||
| private long recordCount = -1L; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should make this change. I like that the NullPointerException ensures that an incorrect record count just can't be used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mostly did this to simplify testing as I noticed fileSizeInBytes which is also a long has the pattern of defaulting to -1, but I don't have strong opinion either way. Will revert!
| } | ||
|
|
||
| private void assertNoStats(DataFile dataFile) { | ||
| Assert.assertEquals(-1L, dataFile.recordCount()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should always contain the record count, even after copyWithoutStats. That's primarily to drop the stats maps, which can be really large.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code actually is not testing results from copyWithoutStats, copyWithoutStats results are actually tested by another method assertStatsDropped which still keeps recordCount. In the test case that uses this assertNoStats, only select but no filter operation is applied to the manifest reader, so the reader doesn't project stats columns when reading (since manifest entries don't have to go through evaluators when no filter is applied), and thus only return the field being selected (in this case file_path). But I can see that this name is confusing so I updated it a bit to hopefully makes things clearer.
bb09e68 to
76d6056
Compare
| } | ||
|
|
||
| public ManifestReader<F> select(Collection<String> newColumns) { | ||
| public ManifestReader<F> select(List<String> newColumns) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why was it necessary to change this to List instead of Collection? That's not a binary-compatible change. Couldn't we just make a copy of the collection if we need the field to be a list? I'm also not sure why columns can't be a Collection.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I modified this when trying to address this comment, as I noticed that all usage of columns could be done via list so I directly changed the type to avoid the list copying, without thinking about backward compatibility of this method. I guess your original suggestion was to move the copying from ManifestGroup to ManifestReader, but I misinterpreted it to get rid of the list copy completely?
jackye1995
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few small details, otherwise it looks good to me.
| static final Set<String> STATS_COLUMNS = Sets.newHashSet( | ||
| "value_counts", "null_value_counts", "nan_value_counts", "lower_bounds", "upper_bounds"); | ||
|
|
||
| private static final Set<String> STATS_COLUMNS = Sets.newHashSet( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: use immutable set
| Assert.assertNull(dataFile.nanValueCounts()); | ||
| Assert.assertNull(dataFile.lowerBounds()); | ||
| Assert.assertNull(dataFile.upperBounds()); | ||
| Assert.assertNull(dataFile.nanValueCounts()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: to minimize changes, no need to move line for this. Same comment for L206
|
Looks great, thanks @yyanyy! |
Please note that this changes the behavior ofrecordCountinBaseFile; originally ifBaseFilewas created by avro schema reflection without populatingrecordCount, callingrecordCount()will throw NPE because its return type is primitive. I'm currently following the same style asfileSizeInBytesto return -1 when it is not populated.One implication of this is that the NPE problem described in the original comment will no longer exist, instead metrics evaluators will not filter out anything.Alternatively I can refrain from changing this and accept thatdata.recordCount()could throw NPE in tests, or change the return type ofrecordCount()to beLong; I don't really have a strong preference so suggestions are welcome!