Core: fix NPE in manifests table for contains_nan column, update spec #2521
RussellSpitzer
left a comment
I think this is the right approach for this fix. One of the spec changes seems unrelated to this issue ("field field"), but I have no problem with fixing typos in the same PR.
Should we add a unit test to read a v1 manifest file?
I would also like to see this, but I think we probably need a dedicated backwards compatibility suite somewhere.
+1. As we are thinking about v3 now, I am trying to implement this in a way that can be flexible for future spec versions. I will send a PR about this soon.
Thank you for the quick review and feedback everyone! Regarding backward compatibility tests for reading manifest files, I think there are a few things worth mentioning:
I think the follow-up item after this PR should be to create an issue for adding metadata table tests, to make sure they don't break when reading different versions of tables. Any comments, feedback, or suggestions?
chenjunjiedada
left a comment
Agreed, let's follow up with a separate PR.
I was wondering about the other NPE in #2495; I was hoping that would have been caught by some backwards compatibility tests. I also have no problem with doing more testing and such in another PR.
I think the hard part is that this specific change (and some other spec changes) adds a field regardless of table version. To ensure backward compatibility, we need to test on a table written before this logic was introduced, which requires fine-grained control over how the metadata files are created, so that we can mimic the old behavior and drop anything introduced by the new logic. This basically means creating the table from scratch.
I think one way of achieving this could be to add resource files (including the data file and all metadata files, all the way up to the JSON file) with all optional or later-introduced fields missing, so that these resource files are readily readable as an existing table (instead of creating tables via code, as those tests currently do). And then we let tests that extend
I'm merging in this change to unblock bug fixes, but please feel free to continue the discussion here or in #2542, or ping me directly on Slack. Thank you again @chenjunjiedada @RussellSpitzer @jackye1995 for the review and quick response! And thanks Jack for creating the issue!
When changing the `containsNaN` field within the partition field summary from a primitive to a boxed boolean in Core: add contains_nan to field_summary #1872, I forgot to change the other places that refer to the primitive boolean, causing an NPE when reading metadata for manifests. Related: #2405 and Core: Fix NPE caused by Unboxing a Null in ManifestFileUtil (#2492) #2495 (thanks @RussellSpitzer again for the fix!)

An alternative is to make `containsNaN` a primitive to avoid a potential NPE. This may have the following implications:
- We would have to default this column to `true` so that we don't falsely skip it, but then non-floating-point columns would also have `containsNaN=true` in memory, which doesn't make sense. This field isn't used anywhere for non-floating-point columns, so there is no correctness concern. However, when displaying the manifests table this default value would be used, and for any files written before `containsNaN` was populated in the field summary, `containsNaN` would be displayed as `true`, which may not be correct.
- We could not distinguish a `true` that means the file actually has NaN from a `true` that is only the default value.

In addition to fixing the NPE problem, this PR follows the current approach (nullable `containsNaN`) and updates the spec to reflect the current state. We can continue the discussion of how to handle it here, and I'll make sure to reflect it in the actual spec before merge.
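
For readers less familiar with the failure mode, here is a minimal sketch of how unboxing a null `Boolean` produces the NPE, and of one null-safe way to read the value. `FieldSummarySketch` and its methods are hypothetical stand-ins written for illustration; they are not Iceberg's actual classes or the code changed in this PR.

```java
/** Hypothetical stand-in for a partition field summary; not Iceberg's real class. */
final class FieldSummarySketch {
  // Boxed so that "unknown" can be represented: null when the manifest
  // was written before the field was populated.
  private final Boolean containsNaN;

  FieldSummarySketch(Boolean containsNaN) {
    this.containsNaN = containsNaN;
  }

  Boolean containsNaN() {
    return containsNaN;
  }

  // Buggy pattern: the primitive return type forces auto-unboxing,
  // which throws NullPointerException when the boxed value is null.
  static boolean unsafeContainsNaN(FieldSummarySketch summary) {
    return summary.containsNaN();
  }

  // Null-safe read for filtering (an assumed policy): treat "unknown"
  // as "may contain NaN" so that a file is never skipped by mistake.
  static boolean mayContainNaN(FieldSummarySketch summary) {
    Boolean value = summary.containsNaN();
    return value == null || value;
  }

  // Null-preserving read for display: keep "unknown" visible
  // instead of substituting a default.
  static String display(FieldSummarySketch summary) {
    Boolean value = summary.containsNaN();
    return value == null ? "null" : value.toString();
  }
}
```

Keeping the field nullable lets the filtering path stay conservative while the display path still shows that the value is unknown, which is the distinction the primitive alternative would lose.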