
Add support for numpy 2.0.0 #2269

Merged: 14 commits merged into main on Oct 30, 2024
Conversation

R-Palazzo (Contributor)
Resolves #2078
CU-86b0y0uu0

@R-Palazzo R-Palazzo removed the request for review from a team October 24, 2024 20:47
codecov bot commented Oct 25, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.63%. Comparing base (8651241) to head (a1f160f).
Report is 1 commit behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2269   +/-   ##
=======================================
  Coverage   98.63%   98.63%           
=======================================
  Files          58       58           
  Lines        6016     6026   +10     
=======================================
+ Hits         5934     5944   +10     
  Misses         82       82           
Flag        | Coverage Δ
integration | 82.17% <100.00%> (+0.02%) ⬆️
unit        | 97.46% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.


pvk-developer (Member) commented Oct 28, 2024

The reason pa.float32 is failing with IDs is this:

*** pyarrow.lib.ArrowInvalid: Integer value 640058449 not in range: -16777216 to 16777216

This also raises a new issue for the benchmarking: it should verify that the sampled range is the expected one. Currently there is no such check; the same number of errors is generated, the test fails, and we fall back to object because we couldn't cast it.

PS: We can disable pa.float32 (since it is not yet officially supported) in both these cases, but we should open an issue to keep track of this and fix it, especially once sdv-dev/RDT#887 is merged and released.
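A minimal numpy sketch of where that pyarrow bound comes from (illustrative values; pyarrow enforces the same limit when safely casting integers to float32): float32 has a 24-bit significand, so integers are only represented exactly up to 2**24 = 16777216, and the failing ID above is far outside that range.

```python
import numpy as np

# 2**24 is the largest integer a float32 stores exactly.
assert np.float32(16_777_216) == 16_777_216

# One past the limit silently rounds down: precision is lost.
assert int(np.float32(16_777_217)) == 16_777_216

# The ID from the error message cannot round-trip through float32.
assert int(np.float32(640_058_449)) != 640_058_449
```

This is why pyarrow refuses the cast outright instead of producing a corrupted ID.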

pvk-developer (Member) left a comment:

Left a minor comment but otherwise it looks good 👍🏻

tests/benchmark/numpy_dtypes.py (outdated, resolved)
Comment on lines 337 to 342
# To make the NaN to None mapping work for pd.Categorical data, we need to convert
# the columns to object before replacing NaNs with None.
for column in self._columns:
if pd.api.types.is_categorical_dtype(table_data[column]):
table_data[column] = table_data[column].astype(object)

R-Palazzo (Contributor, Author):

Adding np.bytes revealed an actual bug in FixedCombinations with pd.Categorical and NaNs. It was not caught before because of reject sampling: previously enough rows were generated, but none with a combination that included a NaN in the categorical column. After adding np.bytes, somewhat randomly, the synthesizer was only able to generate 9 of the 10 requested rows. Changing the number of rows to sample also made the test pass, but that did not fix the underlying bug. This change fixes it, and I added a test for it. Let me know if it makes sense.
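A standalone sketch of the Categorical pitfall this change works around (an illustration of the pandas behavior, not the SDV code itself): writing None into a categorical column is coerced back to NaN, so the NaN-to-None mapping only sticks after the column is cast to object.

```python
import pandas as pd

s = pd.Series(pd.Categorical(['a', None, 'b']))

# On the categorical itself, None is coerced back to NaN:
still_nan = s.where(s.notna(), None)
assert pd.isna(still_nan[1])

# Casting to object first lets the NaN -> None mapping stick:
mapped = s.astype(object).where(s.notna(), None)
assert mapped[1] is None
```

This is the same reason the fix converts the categorical columns to object before replacing NaNs with None.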

Contributor:

I'm confused why adding bytes revealed this when the check is for is_categorical_dtype. Bytes are not a categorical dtype.

pvk-developer (Member) left a comment:

Yes, the change about the bug you found makes sense.

amontanez24 (Contributor) left a comment:

Looks good, just one comment

Comment on lines +339 to +343
table_data[self._columns] = table_data[self._columns].astype({
col: object
for col in self._columns
if pd.api.types.is_categorical_dtype(table_data[col])
})
Contributor:

Based on this discussion, should we just use fillna instead of replace? Then we don't have to convert

Contributor:

@R-Palazzo I forgot that fillna cannot be used with None which is probably why we have this line of code in the first place. This solution is good
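A quick standalone illustration of the constraint mentioned above (hypothetical example, not SDV code): pandas rejects fillna(None) outright, while replace can still map NaN to None on an object column, which is why the astype-then-replace approach stands.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan], dtype=object)

# fillna cannot take None as the fill value; pandas raises.
try:
    s.fillna(None)
    raised = False
except (ValueError, TypeError):
    raised = True
assert raised

# replace, by contrast, can map NaN to an actual None on object dtype.
assert s.replace({np.nan: None})[1] is None
```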


@R-Palazzo R-Palazzo merged commit 5d76780 into main Oct 30, 2024
41 checks passed
@R-Palazzo R-Palazzo deleted the issue-2078-numpy2.0 branch October 30, 2024 13:54
Development: successfully merging this pull request may close the issue "Add support for numpy 2.0.0".
4 participants