Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[REVIEW] Fix inplace update of data and add Series.update #7201

Merged
merged 35 commits into from
Mar 31, 2021

Conversation

galipremsagar
Copy link
Contributor

@galipremsagar galipremsagar commented Jan 23, 2021

Fixes: #7187

This PR:

  • Fixes inplace manipulation of columns.
  • Introduces Series.update
  • Fixes incorrect dtype handling in Frame.where

@galipremsagar galipremsagar self-assigned this Jan 23, 2021
@harrism harrism changed the title [WIP] Fix inplace updation of data and add Series.update [WIP] Fix inplace update of data and add Series.update Jan 25, 2021
@kkraus14
Copy link
Collaborator

Fixes incorrect dtype handling in Frame.where[WIP]

Before you dive deep into this, can you explain the situation this is handling?

@galipremsagar
Copy link
Contributor Author

Fixes incorrect dtype handling in Frame.where[WIP]

Before you dive deep into this, can you explain the situation this is handling?

So the issue here is we are not correctly handling the dtype situation, where we need to compare the self & other dtypes and decide what the destination dtype has to be, currently we straight away type-cast other to self.dtype:

>>> import cudf
>>> s = cudf.Series([False, True, True])
>>> other = cudf.Series([3.0, 4.5, 7.0])
>>> s.where([True, True, True], other)
0    False
1     True
2     True
dtype: bool
>>> s.where([False, False, False], other)
0    True
1    True
2    True
dtype: bool
>>> s.to_pandas().where([False, False, False], other.to_pandas())
0      3
1    4.5
2      7
dtype: object
>>> s.to_pandas().where([False, True, False], other.to_pandas())
0       3
1    True
2       7
dtype: object

We should perform the correct dtype calcaulation for numeric types and raise an error when there could be a situation of mixed object dtypes like in pandas. Does this sound okay by you?

@kkraus14
Copy link
Collaborator

kkraus14 commented Jan 25, 2021

We should perform the correct dtype calcaulation for numeric types and raise an error when there could be a situation of mixed object dtypes like in pandas. Does this sound okay by you?

Okay, that sounds good. What about the situation where self is int8 and someone passes other with int32. What would the result type be or should we throw?

@galipremsagar
Copy link
Contributor Author

galipremsagar commented Jan 25, 2021

What about the situation where self is int8 and someone passes other with int32. What would the result type be or should we throw?

Yup that's when the common dtype finding logic should kick in and result in final dtype as int32, for example:

>>> s
0    3
1    4
2    5
dtype: int8
>>> other = pd.Series([10, 11, 12], dtype='int32')
>>> other
0    10
1    11
2    12
dtype: int32
>>> s.where([True, False, True], other)
0     3
1    11
2     5
dtype: int32

@kkraus14
Copy link
Collaborator

Yup that's when the common dtype finding will kick in and result in final dtype as int32

Is that the expected behavior for end users? I.E. if I did something like:

s = cudf.Series([1, 2, 3], dtype='int8')
other = cudf.Series([11, 12, 13], dtype='int32')
s.where([False, True, True], other)

The output would be int32 while the data still nicely fits in int8. Would this be expected behavior for an end user? There's also the inplace flag, which would then either error in the above case or we'd need to introspect the data in other to determine if it can properly fit in the dtype of s.

@galipremsagar
Copy link
Contributor Author

The dtype selection is also based on inplace, where if inplace is True the type-cast is not being performed and the number is probably being allowed to overflow:

>>> s
0     3
1    11
2     5
dtype: int8
>>> other = pd.Series([10, 10000, 12120], dtype='int32')
>>> s.where([True, False, True], other, inplace=True)
>>> s
0     3
1    16
2     5
dtype: int8
>>> s.where([True, False, True], other, inplace=False)
0        3
1    10000
2        5
dtype: int32
  1. If inplace=False, I think we can go ahead with changing of dtype based on both self & other.
  2. If inplace=True, we can return self.dtype itself instead of up-casting(but raise a warning stating there could be potential data loss due to down-casting) as we allow down-casting:
>>> import cudf
>>> s = cudf.Series([10, 10000, 12120], dtype='int32')
>>> s
0       10
1    10000
2    12120
dtype: int32
>>> s.astype('int8')
0    10
1    16
2    88
dtype: int8

@kkraus14
Copy link
Collaborator

Sounds like a good path forward to me.

@kkraus14 kkraus14 added Python Affects Python cuDF API. non-breaking Non-breaking change feature request New feature or request labels Jan 28, 2021
@galipremsagar galipremsagar added the 2 - In Progress Currently a work in progress label Jan 28, 2021
@galipremsagar galipremsagar added 3 - Ready for Review Ready for review by team 4 - Needs cuDF (Python) Reviewer and removed 2 - In Progress Currently a work in progress labels Feb 2, 2021
@galipremsagar galipremsagar marked this pull request as ready for review February 2, 2021 01:56
@galipremsagar galipremsagar requested a review from a team as a code owner February 2, 2021 01:56
@galipremsagar galipremsagar changed the title [WIP] Fix inplace update of data and add Series.update [REVIEW] Fix inplace update of data and add Series.update Feb 2, 2021
python/cudf/cudf/utils/dtypes.py Show resolved Hide resolved
python/cudf/cudf/utils/dtypes.py Outdated Show resolved Hide resolved
python/cudf/cudf/utils/dtypes.py Show resolved Hide resolved
python/cudf/cudf/utils/dtypes.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/_internals/where.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/_internals/where.py Show resolved Hide resolved
python/cudf/cudf/core/_internals/where.py Show resolved Hide resolved
python/cudf/cudf/core/_internals/where.py Show resolved Hide resolved
python/cudf/cudf/core/_internals/where.py Outdated Show resolved Hide resolved
@galipremsagar galipremsagar changed the title [WIP] Fix inplace update of data and add Series.update [REVIEW] Fix inplace update of data and add Series.update Mar 30, 2021
@galipremsagar galipremsagar added 3 - Ready for Review Ready for review by team 4 - Needs cuDF (Python) Reviewer and removed 2 - In Progress Currently a work in progress labels Mar 30, 2021
Copy link
Contributor

@shwina shwina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@galipremsagar
Copy link
Contributor Author

rerun tests

@kkraus14 kkraus14 added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team 4 - Needs cuDF (Python) Reviewer labels Mar 31, 2021
@kkraus14
Copy link
Collaborator

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 684bb14 into rapidsai:branch-0.19 Mar 31, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge feature request New feature or request non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] DataFrame.update() should modify the memory inplace as opposed to replacing the buffers
3 participants