
Add support for numpy 2.0.0 #2269

Merged: 14 commits merged into main on Oct 30, 2024
Conversation

R-Palazzo (Contributor)
Resolves #2078
CU-86b0y0uu0

@R-Palazzo R-Palazzo removed the request for review from a team October 24, 2024 20:47
codecov bot commented Oct 25, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.63%. Comparing base (8651241) to head (a1f160f).
Report is 1 commit behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2269   +/-   ##
=======================================
  Coverage   98.63%   98.63%           
=======================================
  Files          58       58           
  Lines        6016     6026   +10     
=======================================
+ Hits         5934     5944   +10     
  Misses         82       82           
Flag        | Coverage Δ
integration | 82.17% <100.00%> (+0.02%) ⬆️
unit        | 97.46% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.


pvk-developer (Member) commented Oct 28, 2024

The reason pa.float32 is failing with IDs is this:

*** pyarrow.lib.ArrowInvalid: Integer value 640058449 not in range: -16777216 to 16777216

This also raises a new issue for the benchmarking: it should verify that the sampled range is the expected one. Currently there is no such check; the same number of errors is generated, the test fails, and we fall back to object because we couldn't cast it.

PS: We can disable pa.float32 (since it is not yet officially supported) in both these cases, but we should open an issue to keep track of this and fix it, especially once sdv-dev/RDT#887 is merged and released.
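A minimal numpy sketch of where that pyarrow bound comes from (illustrative values; pyarrow enforces the same limit when safely casting integers to float32): float32 has a 24-bit significand, so integers are only represented exactly up to 2**24 = 16777216, and the failing ID above is far outside that range.

```python
import numpy as np

# 2**24 is the largest integer a float32 stores exactly.
assert np.float32(16_777_216) == 16_777_216

# One past the limit silently rounds down: precision is lost.
assert int(np.float32(16_777_217)) == 16_777_216

# The ID from the error message cannot round-trip through float32.
assert int(np.float32(640_058_449)) != 640_058_449
```

This is why pyarrow refuses the cast outright instead of producing a corrupted ID.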

pvk-developer (Member) left a comment:

Left a minor comment but otherwise it looks good 👍🏻

tests/benchmark/numpy_dtypes.py (outdated, resolved)
Comment on lines 337 to 342
# To make the NaN to None mapping work for pd.Categorical data, we need to convert
# the columns to object before replacing NaNs with None.
for column in self._columns:
if pd.api.types.is_categorical_dtype(table_data[column]):
table_data[column] = table_data[column].astype(object)

R-Palazzo (Contributor, Author):

Adding np.bytes revealed an actual bug in FixedCombinations with pd.Categorical and NaNs. It was not caught before because of reject sampling: previously enough rows were generated, but none with a combination that included a NaN in the categorical column. After adding np.bytes, somewhat randomly, the synthesizer was only able to generate 9 of the 10 requested rows. Changing the number of rows to sample also made the test pass, but that did not fix the underlying bug. This change fixes it, and I added a test for it. Let me know if it makes sense.
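A standalone sketch of the Categorical pitfall this change works around (an illustration of the pandas behavior, not the SDV code itself): writing None into a categorical column is coerced back to NaN, so the NaN-to-None mapping only sticks after the column is cast to object.

```python
import pandas as pd

s = pd.Series(pd.Categorical(['a', None, 'b']))

# On the categorical itself, None is coerced back to NaN:
still_nan = s.where(s.notna(), None)
assert pd.isna(still_nan[1])

# Casting to object first lets the NaN -> None mapping stick:
mapped = s.astype(object).where(s.notna(), None)
assert mapped[1] is None
```

This is the same reason the fix converts the categorical columns to object before replacing NaNs with None.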

Contributor:

I'm confused why adding bytes revealed this when the check is for is_categorical_dtype. Bytes are not a categorical dtype.

pvk-developer (Member) left a comment:

Yes, the change about the bug you found makes sense.

amontanez24 (Contributor) left a comment:

Looks good, just one comment

Comment on lines +339 to +343
table_data[self._columns] = table_data[self._columns].astype({
col: object
for col in self._columns
if pd.api.types.is_categorical_dtype(table_data[col])
})
Contributor:

Based on this discussion, should we just use fillna instead of replace? Then we don't have to convert

Contributor:

@R-Palazzo I forgot that fillna cannot be used with None which is probably why we have this line of code in the first place. This solution is good
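A quick standalone illustration of the constraint mentioned above (hypothetical example, not SDV code): pandas rejects fillna(None) outright, while replace can still map NaN to None on an object column, which is why the astype-then-replace approach stands.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan], dtype=object)

# fillna cannot take None as the fill value; pandas raises.
try:
    s.fillna(None)
    raised = False
except (ValueError, TypeError):
    raised = True
assert raised

# replace, by contrast, can map NaN to an actual None on object dtype.
assert s.replace({np.nan: None})[1] is None
```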


@R-Palazzo R-Palazzo merged commit 5d76780 into main Oct 30, 2024
41 checks passed
@R-Palazzo R-Palazzo deleted the issue-2078-numpy2.0 branch October 30, 2024 13:54
Development: successfully merging this pull request may close the issue "Add support for numpy 2.0.0".
4 participants