Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[REVIEW] Adding support for unsigned int #5431

Merged
merged 16 commits into from
Jun 13, 2020

Conversation

rgsl888prabhu
Copy link
Contributor

@rgsl888prabhu rgsl888prabhu commented Jun 9, 2020

Adds support of unsigned int to cudf, updated validations and test cases to accommodate it.

And instead of -1 as sentinel value, max(int64) is being used in most of the cases.

@rgsl888prabhu rgsl888prabhu added 2 - In Progress Currently a work in progress Python Affects Python cuDF API. 4 - Needs cuDF (Python) Reviewer labels Jun 9, 2020
@rgsl888prabhu rgsl888prabhu requested review from shwina and kkraus14 June 9, 2020 20:25
@rgsl888prabhu rgsl888prabhu requested review from a team as code owners June 9, 2020 20:25
@rgsl888prabhu rgsl888prabhu self-assigned this Jun 9, 2020
@rgsl888prabhu rgsl888prabhu changed the title [WIP] Adding support for unsigned int [REVIEW] Adding support for unsigned int Jun 9, 2020
@rgsl888prabhu rgsl888prabhu added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Jun 9, 2020
Copy link
Contributor

@cwharris cwharris left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🙌

Copy link
Collaborator

@kkraus14 kkraus14 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm super uncomfortable with using 0 as a sentinel for nulls throughout the tests. It seems very error prone and fragile. Any ideas on what we can do instead? If we wait for the Pandas 1.0 PR do the new Pandas nullable types help?

@rgsl888prabhu
Copy link
Contributor Author

Will this close #2819?

It will, but there are other issues such as #5352 and #5351 which still requires changes in libcudf side.

@rgsl888prabhu
Copy link
Contributor Author

I'm super uncomfortable with using 0 as a sentinel for nulls throughout the tests. It seems very error prone and fragile. Any ideas on what we can do instead? If we wait for the Pandas 1.0 PR do the new Pandas nullable types help?

It might help, but that would require lot of additional changes to testing to accommodate that change and how close are we with supporting pandas 1.0?

@shwina
Copy link
Contributor

shwina commented Jun 11, 2020

I'm super uncomfortable with using 0 as a sentinel for nulls throughout the tests. It seems very error prone and fragile. Any ideas on what we can do instead?

I agree 100% here. Here are some additional considerations:

  1. 0 should be considered as good a sentinel value as any other, i.e., just changing our sentinel value to something else doesn't guarantee our tests aren't broken.

  2. Regarding Pandas 1.0, do they support nulls for all the operations that we have? If so, do their null semantics match ours? If not, this will be difficult to incorporate in the short term.

I think that our feature set is at a point where comparing with Pandas isn't sufficient and/or possible in many cases. For these cases, we should determine what we think is the right behaviour and test against that, rather than what Pandas does for some different (but closely related) case.

That being said, I'm not advocating for not testing against Pandas. Where possible, we should definitely still do that.

@rgsl888prabhu
Copy link
Contributor Author

@kkraus14 @shwina As a temporary solution till we get Pandas 1.0 support, shall we use max of the col dtype to signify null. At least this value will not be as common as 0.

@kkraus14
Copy link
Collaborator

@kkraus14 @shwina As a temporary solution till we get Pandas 1.0 support, shall we use max of the col dtype to signify null. At least this value will not be as common as 0.

It would be slightly better, but it still feels extremely fragile

@shwina
Copy link
Contributor

shwina commented Jun 11, 2020

@kkraus14 @shwina As a temporary solution till we get Pandas 1.0 support, shall we use max of the col dtype to signify null. At least this value will not be as common as 0.

Practically speaking, yes, I think this is better than 0, but theoretically, max(int64) isn't any better than 0. Neither is 17 or anything else for that matter :)

Suppose we decided to re-write the tests to use Pandas nullable types where possible, and our own test data otherwise, what's the level of effort involved? How many tests would this touch? Is this something that a couple of us could pool effort on and pull off?

@rgsl888prabhu
Copy link
Contributor Author

Just to test, I just changed sentinel value to something random, and I can observe around 10 - 15 tests failing, there can be more than that. But it feels like doable. As @brandon-b-miller mentioned in #5388 , we might have to introduce dtypes to handle nulls, which might add additional burden on modifying these test cases.

@kkraus14
Copy link
Collaborator

Lets use max(uint64) as the sentinel value for now (where we can) for the tests, and we'll file issues and follow up in using Pandas nullable types / rolling our own null behaviors in tests as needed to move away from the sentinel value based testing.

@rgsl888prabhu
Copy link
Contributor Author

Changed the sentinel value to be max(int64) rather than max(uint64) as the value of an int can't be greater than sys.maxsize, else any conversion will fail with following error.

>>> np.int8(np.iinfo(np.uint64).max)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OverflowError: Python int too large to convert to C long

I have used this sentinel value in most of the places apart from places were float column is being created and then it is casted to integer.

@kkraus14
Copy link
Collaborator

@shwina can you review again before we merge?

@kkraus14 kkraus14 added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team 4 - Needs cuDF (Python) Reviewer labels Jun 13, 2020
@kkraus14
Copy link
Collaborator

@shwina can you review again before we merge?

connected offline and confirmed this is good to go so merging!

@kkraus14 kkraus14 merged commit bd974e0 into rapidsai:branch-0.15 Jun 13, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants