Implements accumulation functions in dpctl.tensor (#1602)
* Use `shT` instead of `std::vector<py::ssize_t>` in `repeat`

* Add missing host task to `host_tasks_list` in _reduction.py

* Implements `dpt.cumulative_logsumexp`, `dpt.cumulative_prod`, and `dpt.cumulative_sum`

The Python bindings for these functions are implemented in a new submodule `_tensor_accumulation_impl`
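
The semantics of the three accumulators can be sketched in pure Python; this is a reference-semantics illustration only, not dpctl's SYCL implementation (which scans in parallel on-device):

```python
import math

def cumulative_sum(x):
    """Running sum: out[i] = x[0] + ... + x[i]."""
    out, acc = [], 0
    for v in x:
        acc += v
        out.append(acc)
    return out

def cumulative_prod(x):
    """Running product: out[i] = x[0] * ... * x[i]."""
    out, acc = [], 1
    for v in x:
        acc *= v
        out.append(acc)
    return out

def cumulative_logsumexp(x):
    """Running log(sum(exp(x[:i+1]))), updated stably via logaddexp."""
    out, acc = [], -math.inf
    for v in x:
        # logaddexp(a, b) = log(exp(a) + exp(b)), computed without overflow
        hi, lo = max(acc, v), min(acc, v)
        acc = hi if lo == -math.inf else hi + math.log1p(math.exp(lo - hi))
        out.append(acc)
    return out
```

The running-`logaddexp` update is what makes `cumulative_logsumexp` numerically stable compared with exponentiating, summing, and taking the log.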

* Adds the first tests for `dpt.cumulative_sum`

* Pass host task vector to accumulator kernel calls

This resolves hangs in unique functions

* Implements `out` keyword for accumulators

* Fixes cumulative_logsumexp when both an intermediate input and result temporary are needed

* Only permute dims of allocated outputs if accumulated axis is not the trailing axis

Fixes a bug where in some cases output axes were not being permuted
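
The permutation bookkeeping being fixed can be sketched with NumPy (a stand-in for dpctl's kernels; `accumulate_along_axis` is a hypothetical helper, not dpctl API): the accumulated axis is moved to the trailing position so the kernel scans contiguous rows, and the result must be permuted back whenever that axis was not already trailing:

```python
import numpy as np

def accumulate_along_axis(x, axis, op=np.add):
    # Move the accumulated axis to the trailing position so a 1-d scan
    # kernel can process contiguous rows.
    moved = np.moveaxis(x, axis, -1)
    out = op.accumulate(moved, axis=-1)
    # Permute the result back only when the accumulated axis was not
    # already trailing -- skipping this step was the bug fixed here.
    if axis not in (-1, x.ndim - 1):
        out = np.moveaxis(out, -1, axis)
    return out
```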

* Enable scalar inputs to accumulation functions

* Adds test for scalar inputs to cumulative_sum

* Adds docstrings for `cumulative_sum`, `cumulative_prod`, and `cumulative_logsumexp`

* Removed redundant dtype kind check in _default_accumulation_dtype

* Reduce repetition of code allocating the `out` array in _accumulate_common

* Adds tests for accumulation function identities, `include_initial` keyword

* Adds more tests for cumulative_sum

* Correct typo in kernels/accumulators.hpp

Use `constexpr nwiT` variables rather than `nwiT constexpr` variables

* Increase work per work item in inclusive_scan_iter_1d update step

* Removes a dead branch from _accumulate_common

Since `out` and the input would have to have the same data type to overlap, the second branch is never reached when `out` is the same array as the input

* More accumulator tests

* Removes dead branch from _accumulators.py

A second `out` temporary does not need to be made in either branch when the combination of input and requested dtypes is not implemented, as temporaries are always made in that case

Also removes part of a test intended to reach this branch

* Adds tests for `cumulative_prod` and `cumulative_logsumexp`

Also fixes incorrect TypeError in _accumulation.py

* Widen acceptable results of test_logcumsumexp

* Use np.logaddexp.accumulate in hopes of better numerical accuracy of expected result for cumulative_logsumexp

* Attempt to improve cumulative_logsumexp testing by computing running logsumexp of test array

* Reduce size of array in test_logcumsumexp_basic

* Use const qualifiers to make compiler's job easier

Indexers are made const, and integral variables in kernels are made const too.

Two-offset instances are made const references to avoid copying.

Got rid of unused `get_src_const_ptr` methods in `stack_t` structs. Replaced `auto` with `size_t` as appropriate. Added `const` to make compiler analysis easier (and faster).

* Add test for cumulative_logsumexp for geometric series summation, testing against closed form
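
The closed-form identity behind that test can be sketched in pure Python (illustrative helpers, not the actual test code): for inputs `log(a) + k*log(r)`, the running sums are partial sums of a geometric series, `a*(1 - r**(k+1))/(1 - r)` for `r != 1`, so the running logsumexp has a closed form to compare against:

```python
import math

def running_logsumexp(xs):
    """Reference: log of the running sum of exp(x), via stable logaddexp."""
    acc, out = -math.inf, []
    for x in xs:
        hi, lo = max(acc, x), min(acc, x)
        acc = hi if lo == -math.inf else hi + math.log1p(math.exp(lo - hi))
        out.append(acc)
    return out

def geometric_logsumexp_closed_form(log_a, log_r, n):
    # k-th partial sum of a*r**k is a*(1 - r**(k+1))/(1 - r), so the
    # running logsumexp of [log_a + k*log_r] is its log (assumes r != 1).
    a, r = math.exp(log_a), math.exp(log_r)
    return [math.log(a * (1.0 - r ** (k + 1)) / (1.0 - r)) for k in range(n)]
```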

* Fix race condition in `custom_inclusive_scan_over_group`

Returning data from `local_mem_acc` after the group barrier meant that, if the memory was later overwritten, a race condition followed; this was especially obvious on CPU

Now the value is stored in a variable before the barrier and then returned
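
The store-before-barrier pattern can be illustrated with a toy Python analogue of a work-group scan (threads and a `threading.Barrier` standing in for SYCL work items and group barriers; this is not the actual kernel):

```python
import threading

def inclusive_scan(data):
    """Toy work-group inclusive scan. Each 'work item' copies its running
    sum out of shared 'local memory' into a private variable BEFORE the
    barrier, so later reuse of that memory cannot race with the read."""
    n = len(data)
    local_mem = data[:]            # stand-in for SYCL local memory
    results = [0] * n
    barrier = threading.Barrier(n)

    def work_item(tid):
        # read before the barrier, into a private variable
        value = sum(local_mem[: tid + 1])
        # barrier: all reads complete before local_mem may be overwritten
        barrier.wait()
        local_mem[tid] = 0         # simulate reuse of local memory
        results[tid] = value       # private copy is unaffected by the reuse

    threads = [threading.Thread(target=work_item, args=(t,)) for t in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Reading `local_mem` after the barrier instead would race with the overwrite, which is the bug described above.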

* Remove use of Numpy functions from test_tensor_accumulation and increase size of test_logcumsumexp_basic

* Need barrier after call to custom inclusive scan to avoid race condition (#1624)

Added comments explaining why barriers are needed

* Docstring edits

Add an empty line after list items to make Sphinx happy.

---------

Co-authored-by: Oleksandr Pavlyk <oleksandr.pavlyk@intel.com>
ndgrigorian and oleksandr-pavlyk authored Apr 1, 2024
1 parent 57495af commit 65bb9ef
Showing 19 changed files with 3,405 additions and 171 deletions.
17 changes: 17 additions & 0 deletions dpctl/tensor/CMakeLists.txt
@@ -158,6 +158,17 @@ set(_tensor_linalg_impl_sources
${CMAKE_CURRENT_SOURCE_DIR}/libtensor/source/simplify_iteration_space.cpp
${_linalg_sources}
)
set(_accumulator_sources
${CMAKE_CURRENT_SOURCE_DIR}/libtensor/source/accumulators/accumulators_common.cpp
${CMAKE_CURRENT_SOURCE_DIR}/libtensor/source/accumulators/cumulative_logsumexp.cpp
${CMAKE_CURRENT_SOURCE_DIR}/libtensor/source/accumulators/cumulative_prod.cpp
${CMAKE_CURRENT_SOURCE_DIR}/libtensor/source/accumulators/cumulative_sum.cpp
)
set(_tensor_accumulation_impl_sources
${CMAKE_CURRENT_SOURCE_DIR}/libtensor/source/tensor_accumulation.cpp
${CMAKE_CURRENT_SOURCE_DIR}/libtensor/source/simplify_iteration_space.cpp
${_accumulator_sources}
)

set(_py_trgts)

@@ -186,6 +197,11 @@ pybind11_add_module(${python_module_name} MODULE ${_tensor_linalg_impl_sources})
add_sycl_to_target(TARGET ${python_module_name} SOURCES ${_tensor_linalg_impl_sources})
list(APPEND _py_trgts ${python_module_name})

set(python_module_name _tensor_accumulation_impl)
pybind11_add_module(${python_module_name} MODULE ${_tensor_accumulation_impl_sources})
add_sycl_to_target(TARGET ${python_module_name} SOURCES ${_tensor_accumulation_impl_sources})
list(APPEND _py_trgts ${python_module_name})

set(_clang_prefix "")
if (WIN32)
set(_clang_prefix "/clang:")
@@ -203,6 +219,7 @@ list(APPEND _no_fast_math_sources
${_reduction_sources}
${_sorting_sources}
${_linalg_sources}
${_accumulator_sources}
)

foreach(_src_fn ${_no_fast_math_sources})
4 changes: 4 additions & 0 deletions dpctl/tensor/__init__.py
@@ -96,6 +96,7 @@
from dpctl.tensor._usmarray import usm_ndarray
from dpctl.tensor._utility_functions import all, any

from ._accumulation import cumulative_logsumexp, cumulative_prod, cumulative_sum
from ._array_api import __array_api_version__, __array_namespace_info__
from ._clip import clip
from ._constants import e, inf, nan, newaxis, pi
@@ -367,4 +368,7 @@
"tensordot",
"vecdot",
"searchsorted",
"cumulative_logsumexp",
"cumulative_prod",
"cumulative_sum",
]
