[enhancement] remove contiguous check from `_check_array` #2185

icfaust · 2024-11-23T10:03:54Z

Description

This PR makes to_table handle non-contiguous arrays natively. The checks in _check_array are hardcoded for numpy inputs, and are applied regardless of the circumstance. This makes the check seamless, will remove a blocker for other non-numpy data types, and will help in the rollout of the new finite checker, where a future refactor will remove _check_array entirely and enforce the use of validate_data and _check_sample_weight in sklearnex. This also removes three TODOs listed in the codebase.

Checklist to comply with before moving PR from draft:

PR completeness and readability

I have reviewed my changes thoroughly before submitting this pull request.
I have commented my code, particularly in hard-to-understand areas.
I have updated the documentation to reflect the changes or created a separate PR with update and provided its number in the description, if necessary.
Git commit message contains an appropriate signed-off-by string (see CONTRIBUTING.md for details).
I have added a respective label(s) to PR if I have a permission for that.
I have resolved any merge conflicts that might occur with the base branch.

Testing

I have run it locally and tested the changes extensively.
All CI jobs are green or I have provided justification why they aren't.
I have extended testing suite if new functionality was introduced in this PR.

Performance

I have measured performance for affected algorithms using scikit-learn_bench and provided at least summary table with measured data, if performance change is expected.
I have provided justification why performance has changed or why changes are not expected.
I have provided justification why quality metrics have changed or why changes are not expected.
I have extended benchmarking suite and provided corresponding scikit-learn_bench PR if new measurable functionality was introduced in this PR.

icfaust · 2024-11-24T09:56:03Z

/intelci: run

icfaust · 2024-11-24T20:12:38Z

onedal/utils/validation.py

@@ -153,15 +153,6 @@ def _check_array(

    if sp.issparse(array):
        return array
-
-    # TODO: Convert this kind of arrays to a table like in daal4py
-    if not array.flags.aligned and not array.flags.writeable:


From the numpy documentation (https://numpy.org/devdocs/dev/alignment.html): only structured arrays will cause this, meaning that this is a non-issue for the dtypes oneDAL uses (float32, float64, int32, and int64), and such arrays would fail to convert in table.cpp anyway.

I think numpy also allows creating non-aligned arrays out of custom non-owned pointers, for example through PyArray_SimpleNewFromData, so in theory there could be a non-aligned float32 array or similar. But it'd be a very unlikely input, since default allocators in most platforms are aligned.

Also an easier conversion could be through np.require.
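
A minimal sketch of the np.require route mentioned above, assuming a plain numpy input. The misaligned float64 view is built by hand from an offset byte buffer purely for illustration (the names raw, unaligned and fixed are hypothetical). Whether that view is actually misaligned depends on where the allocator places the buffer, so the example only relies on the properties of the result.

```python
import numpy as np

# Build a float64 view that starts one byte into a byte buffer.
# On typical allocators this makes the view misaligned, but that
# is an assumption about the platform, not a guarantee.
raw = np.zeros(8 * 4 + 1, dtype=np.uint8)
unaligned = raw[1:].view(np.float64)

# Ask numpy for a C-contiguous ("C"), aligned ("A") array that owns
# its data ("O"); a copy is made only if a requirement is not met.
fixed = np.require(unaligned, requirements=["C", "A", "O"])

print(unaligned.flags.aligned, unaligned.flags.owndata)
print(fixed.flags.aligned, fixed.flags.owndata, fixed.flags.c_contiguous)
```

Because the view never owns its data, the "O" requirement forces a fresh, aligned copy here regardless of how the original buffer happened to be placed.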

Good point. Because of this, I will probably also make a stricter check on the C++ side (https://github.com/intel/scikit-learn-intelex/blob/main/onedal/decomposition/pca.py#L150), probably using https://numpy.org/devdocs/reference/c-api/array.html#c.PyArray_FromArray, where I can specify the requirements to be aligned and owned (rather than using PyArray_GETCONTIGUOUS).

@david-cortes-intel I looked into it: the checks PyArray_ISFARRAY_RO and PyArray_ISCARRAY_RO check for alignment natively (https://numpy.org/devdocs/reference/c-api/array.html#c.PyArray_ISFARRAY_RO), and PyArray_GETCONTIGUOUS also returns an aligned (well-behaved) copy regardless of ownership. The change in the checks was to make sure that numpy objects wouldn't cause an infinite recursion. So the issue was taken care of, even if I didn't realize it. Just as a note, I am trying to minimize the number of numpy calls because of upcoming array_api support, where we cannot rely on direct numpy calls for non-numpy arrays (say, dpctl tensors).
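
As a rough Python-level illustration of what those C-API checks look at (a sketch only, not the code in data_conversion.cpp): the helper name is_well_behaved_ro is hypothetical, and np.ascontiguousarray is used here as a stand-in that plays a similar role to PyArray_GETCONTIGUOUS.

```python
import numpy as np


def is_well_behaved_ro(arr: np.ndarray) -> bool:
    """Rough analog of PyArray_ISCARRAY_RO(arr) || PyArray_ISFARRAY_RO(arr)."""
    contiguous = arr.flags.c_contiguous or arr.flags.f_contiguous
    # The C macros also require alignment and machine byte order.
    return bool(contiguous and arr.flags.aligned and arr.dtype.isnative)


base = np.arange(12, dtype=np.float64).reshape(3, 4)
strided = base[:, ::2]  # non-contiguous view with shape (3, 2)

print(is_well_behaved_ro(base))     # True
print(is_well_behaved_ro(strided))  # False

# A contiguous copy passes the check again.
fixed = np.ascontiguousarray(strided)
print(is_well_behaved_ro(fixed))    # True
```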

With respect to non-ownership, I don't want to step on anyone's toes, as changes to PCA are occurring here which may impact the ownership check there: #2106

icfaust · 2024-11-25T06:29:10Z

/intelci: run

samir-nasibli · 2024-11-25T18:00:55Z

onedal/datatypes/data_conversion.cpp

+        if (!PyArray_ISCARRAY_RO(ary) && !PyArray_ISFARRAY_RO(ary)) {
+            // NOTE: this will make a C-contiguous deep copy of the data
+            // this is expected to be a special case
+            ary = PyArray_GETCONTIGUOUS(ary);


For clarity, could you explain what would happen if an error occurs here? Additionally, is there a risk of a memory leak in this scenario?
For example, is there some error flag that should be checked?
Wouldn't it be safer to use the Python NumPy API for such checks and conversions?

The Decref is necessary to prevent a memory leak (otherwise leaks in BasicStatistics occur in all CIs). The memory is effectively owned by the table object, which will call its destructor with the pycapsule. This work makes to_table support non-contiguous arrays, which will simplify the Python code that runs before to_table. It will also enable _check_array to be moved from onedal to sklearnex, simplifying the rollout of the new finite checker to SVM and neighbors algorithms. There is no error handling available (that I can find) in the numpy C-API, and numpy generally doesn't raise errors like that. The checks that the input is 1) a numpy array and 2) of certain alignment and type characteristics will prevent errors, and the use of the numpy C-API is very standardized.

It looks like it can fail to allocate, in which case it will return NULL:
https://github.com/numpy/numpy/blob/cf9598572528318a54489b3c9ed5f65ef042e8c8/numpy/_core/src/multiarray/convert.c#L495

Ahhh, thank you for doing that research @david-cortes-intel. That would mean that if the allocation fails, it will fail to decref. I will add a check there.

samir-nasibli · 2024-11-25T18:11:02Z

onedal/datatypes/data_conversion_sua_iface.cpp

+    if (layout == dal::data_layout::unknown){
+        py::object copy;
+        if (py::hasattr(obj, "copy")){
+            copy = obj.attr("copy")();
+        }
+        else if (py::hasattr(obj, "__array_namespace__")){
+            const auto space = obj.attr("__array_namespace__")();
+            copy = space.attr("asarray")(obj, "copy"_a = true);
+        }
+        else {
+            throw std::runtime_error("Wrong strides");
+        }
+        res = convert_to_homogen_impl<Type>(copy);
+        copy.dec_ref();
+        return res;
+    }


Please add comments here for other reviewers
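
The fallback in the hunk above can be sketched in Python as follows. This is only an illustration of the branching logic, assuming a numpy input for the demonstration; the function name contiguous_copy is hypothetical and is not part of onedal.

```python
import numpy as np


def contiguous_copy(obj):
    """Hypothetical Python mirror of the C++ fallback for unknown layouts."""
    if hasattr(obj, "copy"):
        # e.g. numpy arrays expose a copy() method
        return obj.copy()
    if hasattr(obj, "__array_namespace__"):
        # array API route: ask the object's own namespace for a copy
        xp = obj.__array_namespace__()
        return xp.asarray(obj, copy=True)
    # Nothing known can produce a copy, so give up as the C++ code does.
    raise RuntimeError("Wrong strides")


strided = np.arange(12, dtype=np.float32).reshape(3, 4)[:, ::2]
copied = contiguous_copy(strided)
print(strided.flags.c_contiguous, copied.flags.c_contiguous)  # False True
```

The copy is then converted in place of the original object, which is why the C++ version recurses into convert_to_homogen_impl with the copy.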

icfaust · 2024-11-25T21:28:15Z

/intelci: run

ahuber21

As this is touching shared code we should see at least one performance measurement that shows no impact on runtime or accuracy.

samir-nasibli · 2024-11-26T08:39:18Z

@icfaust please add proper description to the PR and share benchmark validation results

Alexsandruss

Please provide a detailed explanation of the changes next time.

icfaust · 2024-11-26T13:43:13Z

Sorry about that, updated the description.

icfaust · 2024-11-26T13:59:11Z

/intelci: run

david-cortes-intel · 2024-11-26T14:08:04Z

onedal/datatypes/data_conversion_sua_iface.cpp

+        // NOTE: this will make a C-contiguous deep copy of the data
+        // if possible, this is expected to be a special case
+        py::object copy;
+        if (py::hasattr(obj, "copy")){


Is it guaranteed that the copy will be C-contiguous? It will always be the case with numpy, but what about other packages?

As of now, only dpctl and dpnp use this standard:
https://github.com/IntelPython/dpnp/blob/master/dpnp/dpnp_container.py#L137 will specify "K" to dpctl.tensor's copy (https://github.com/IntelPython/dpctl/blob/master/dpctl/tensor/_copy_utils.py#L574), which, if the input is not F-contiguous (which is the case here), will default to C order. asarray in dpctl (https://github.com/IntelPython/dpctl/blob/master/dpctl/tensor/_ctors.py#L483) also has "K" as its default. We test this circumstance for dpnp and dpctl in the test that is modified in this PR. If a new sycl_usm_namespace array type comes out, we can come back to this.

What about other libraries that implement the array protocol? XArray also has "copy" for example:
https://docs.xarray.dev/en/latest/generated/xarray.DataArray.copy.html
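
For numpy specifically, the C-contiguity of the copy comes from the default order argument of the copy method, which differs from that of the np.copy function. A small sketch of that difference (variable names are illustrative); other libraries may choose other defaults, which would have to be checked per library.

```python
import numpy as np

f_ordered = np.asfortranarray(np.arange(12, dtype=np.float64).reshape(3, 4))

method_copy = f_ordered.copy()      # ndarray.copy defaults to order="C"
function_copy = np.copy(f_ordered)  # np.copy defaults to order="K"

print(method_copy.flags.c_contiguous)    # True
print(function_copy.flags.c_contiguous)  # False
print(function_copy.flags.f_contiguous)  # True
```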

ahuber21

LGTM, please wait for green CI

icfaust added 20 commits November 23, 2024 11:03

Update validation.py

1673a85

Update validation.py

830400b

Update data_conversion.cpp

c2c23f9

Update test_data.py

2d86d75

Update test_data.py

d34c2c2

Update data_conversion.cpp

585f7eb

Update data_conversion.cpp

8666c11

Update sua_iface_helpers.cpp

e303421

Update data_conversion_sua_iface.cpp

50f2a73

Update data_conversion_sua_iface.cpp

676a739

Update data_conversion_sua_iface.cpp

efce9e8

Update data_conversion_sua_iface.cpp

13c4cd0

Update data_conversion_sua_iface.cpp

0104ea5

Update data_conversion_sua_iface.cpp

816a374

Update data_conversion_sua_iface.cpp

485fb34

Update data_conversion_sua_iface.cpp

3d40ece

Update _data_conversion.py

6b98206

Update test_data.py

85e57fe

Update test_data.py

85931b9

Update test_data.py

0f98a3f

icfaust changed the title ~~[experiment] remove contiguous check~~ [enhancement] remove contiguous check Nov 24, 2024

icfaust changed the title ~~[enhancement] remove contiguous check~~ [enhancement] remove contiguous check from _check_array Nov 24, 2024

formatting

8d68574

icfaust marked this pull request as ready for review November 25, 2024 09:09

icfaust requested review from Alexsandruss and samir-nasibli as code owners November 25, 2024 09:09

icfaust requested a review from david-cortes-intel November 25, 2024 09:09

icfaust added the enhancement New feature or request label Nov 25, 2024

samir-nasibli reviewed Nov 25, 2024

icfaust added 2 commits November 25, 2024 21:39

Merge branch 'intel:main' into dev/contiguous_check

469a806

Update data_conversion_sua_iface.cpp

5f9fb6b

ahuber21 approved these changes Nov 26, 2024

ahuber21 requested changes Nov 26, 2024

Alexsandruss approved these changes Nov 26, 2024

Update data_conversion.cpp

912fa5b

icfaust requested review from ahuber21 and samir-nasibli November 26, 2024 13:57

david-cortes-intel reviewed Nov 26, 2024

ahuber21 approved these changes Nov 26, 2024

icfaust merged commit 7dd395b into uxlfoundation:main Nov 27, 2024
26 of 27 checks passed
