-
Notifications
You must be signed in to change notification settings - Fork 21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Declare enums explicitly, fix type hints #74
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM, thanks @vnlitvinov. Just one question about ColumnNullType
I'm converting this to draft until I finish migrating the prototype to pandas - I've already had to change a few things here... |
I hope this is enough to get chewing, I'll probably polish the formatting so at least PEP-8 checks would be happy. |
Do note three changes to the spec:
Any feedback on those changes is welcome! |
Thanks @vnlitvinov!
This seems fine to me.
These are the implementations:
The docstrings also talk about a dictionary in both places; for cuDF the type annotation matches the code (tuple), in Vaex it matches the docs. I think we're fine making backwards incompatible changes at this point. So I'd vote for using a dictionary, it's not much more verbose, and makes it possible to add another return value in case that would ever be needed. @maartenbreddels, @shwina what do you think? |
This was inherited from:
Both of those define this, and it kinda makes sense to have it. There was discussion elsewhere too that I cannot find right now. The rationale for removing it is unclear, and you are also removing docs and keywords that are needed. Can you please explain in more detail why you want to do this @vnlitvinov? |
The comment you're linking at describes something which is somewhat different from what we have now in the spec. My main point is Alternatives to removing the method on exchange object are:
I am personally inclined to just remove the method on the exchange object, but I'm curios what others think (cc @jorisvandenbossche @shwina @jreback @aregm) Oh, forgot to add... as for changing the |
I reviewed this again:
I think
The point of this would be that if you want code that handles a mix of dataframe types/objects, the only guarantee you have that if you call
Not sure about that, that'd be quite verbose and quite a bit of churn. Given that we'd have to version the protocol to do anything with it, and can deal with a backwards-compatible extension via versioning, I'd suggest to leave it as is. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed again, all looks fine except for my one comment about not deleting docs and the _nan_as_null
and _allow_copy
attributes.
@rgommers I've updated the PR again following my logic at Pandas PR I've changed I don't like the idea of saving the arguments as attributes - those should be implementation-specific IMHO, hence I'm marking it as |
""" | ||
|
||
@property | ||
def size(self) -> Optional[int]: | ||
@abstractmethod | ||
def size(self) -> int: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason this may be None
is that the size may be unknown, or expensive to calculate. IIRC there was a fair bit of discussion about this, and this was on purpose. Same as null_count
further down.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rgommers, could you elaborate when the size could be unknow? Also, if the size is expensive to calculate, why should we return None
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Think of a library using delayed evaluation of method calls that can give a variable size (filter/select/etc.). Accessing .size
will trigger immediate evaluation, which can take an arbitrary amount of time. Dask is an example of such a library.
And why: a property lookup is supposed to be fast, and not trigger expensive computations. If so, it should be converted to a method. But .size
is well-established, and the preference was for this property to remain but for libraries to allow returning None
if the size is unknown without triggering a computation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if the user eventually wants to get the size of a column? Does the user have to use a different/extra API to get it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also note that Buffer
has .bufsize
which is somewhat equal to this size computing.
Hence maybe we should introduce some separate method of getting the size?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rgommers ping on the questions above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We decided here that None
(and hence Optional
) can indeed be dropped. So the current change here is fine.
Let me point out that we then indeed change .size
from a property to a method. I'll not resolve this issue yet to give folks a chance to respond.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me point out that we then indeed change
.size
from a property to a method. I'll not resolve this issue yet to give folks a chance to respond.
Which is not what this PR does right now, this declares a property in an ABC:
@property
@abstractmethod
Going from .size
to .size()
would be a pretty annoying bc-breaking change at this point (goes for null_count
as well).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Re-discussed once more (apologies for the churn): consensus was to turn these into methods indeed, to indicate that they may trigger a potentially expensive computation to return the size as an integer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pushed a commit to turn .size
into a method (with an explanation in its docstring).
null_count
remains Optional[int]
, so I did not touch it.
Thanks @vnlitvinov. LGTM modulo my one comment. |
Note: this PR is aiming to change the spec, so #69 is also related. |
I would say it's blocked by #69, let's merge that and then update this and merge it as well. |
protocol/dataframe_protocol.py
Outdated
NON_NULLABLE : int | ||
Non-nullable column. | ||
USE_NAN : int | ||
Use explicit float NaN/NaT value. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is NaT a thing outside of numpy? As far as I know that's just a sentinel value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have much insight here, was just moving the enum descriptions from .describe_null
docstring. @rgommers might know more about that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes indeed, it's not a standardized thing, it's a sentinel value that behaves similar to nan
in comparisons and sort
. So this was already a mistake in the content.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
USE_NAN
is just for floating point data then. NaT
should be removed here.
In the USE_SENTINEL
value it should be usable - it's just a particular value of the same data type as the rest of the data in the column, which is always understandable by the receiving library I'd say. So it can stay there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The pandas implementation would also need to be updated (it currently uses USE_NAN
for datetime data):
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed a commit removing NaT
.
Signed-off-by: Vasily Litvinov <vasilij.n.litvinov@intel.com>
Signed-off-by: Vasily Litvinov <vasilij.n.litvinov@intel.com>
Signed-off-by: Vasily Litvinov <vasilij.n.litvinov@intel.com>
Signed-off-by: Vasily Litvinov <vasilij.n.litvinov@intel.com>
This address the review comment that NaT is not a thing outside of NumPy. Hence for not-a-datetime, all implementers should be using sentinel values, because those are explicit.
I pushed a rebase after fixing conflicts introduced after merging gh-69 (in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All comments have been resolved now, so I will merge this. Thanks a lot @vnlitvinov and everyone else who commented.
Between this PR and gh-69, there are now two bc-breaking changes. We've agreed that those changes will be synced quickly to Pandas (by @mroeschke), Vaex (by @honno), Modin by (@vnlitvinov) and cuDF (by @shwina), so that won't give problems with implementations that are not in sync with each other.
Pull request for syncing Modin with new spec: modin-project/modin#4763 |
vaex PR for syncing with new spec vaexio/vaex#2150 |
pandas PR for syncing with new spec |
Follow-up question - |
In terms of necessity in interchange, I don't think so, as yeah you can just go |
Signed-off-by: Vasily Litvinov vasilij.n.litvinov@intel.com
This is related to #73