[FEA]: cccl.c and cuda.parallel should support indirect_iterator_t which can be advanced on both host and device to support streaming algorithms #4148

### Is this a duplicate?

- [x] I confirmed there appear to be no [duplicate issues](https://github.com/NVIDIA/cccl/issues) for this request and that I agree to the [Code of Conduct](CODE_OF_CONDUCT.md)

### Area

cuda.parallel (Python)

### Is your feature request related to a problem? Please describe.

To attain optimal performance, kernels for some algorithms must use 32-bit types to store problem-size arguments.

Supporting these algorithms for problem sizes in excess of `INT_MAX` can be done with a streaming approach, with the streaming logic encoded in the algorithm's dispatcher. The dispatcher then needs to increment iterators on the host.

This is presently not supported by `cccl.c.parallel`, since `indirect_arg_t` does not implement an increment operator.
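
For illustration, the streaming dispatch described above might look roughly like the sketch below. This is not the actual CUB or `cccl.c.parallel` dispatcher: `stream_dispatch`, `launch_chunk` and `max_chunk` are hypothetical names, and the callback stands in for a kernel launch that takes a 32-bit problem size.

```cpp
#include <algorithm>
#include <climits>
#include <cstdint>

// Hypothetical host-side streaming loop (a sketch, not the real dispatcher).
// `launch_chunk(first, n)` stands in for launching a kernel that takes a
// 32-bit problem size `n`. Assumes `max_chunk > 0`.
template <class Iterator, class LaunchChunk>
void stream_dispatch(Iterator first,
                     std::uint64_t num_items,
                     LaunchChunk launch_chunk,
                     std::int32_t max_chunk = INT_MAX)
{
  while (num_items > 0)
  {
    // Clamp the remaining work to what fits in a 32-bit problem size.
    const auto chunk = static_cast<std::int32_t>(
      std::min<std::uint64_t>(num_items, static_cast<std::uint64_t>(max_chunk)));

    launch_chunk(first, chunk);

    // This is the host-side increment that the iterator type must support.
    first += chunk;
    num_items -= static_cast<std::uint64_t>(chunk);
  }
}
```

The `first += chunk` line is the operation that has no equivalent for `indirect_arg_t` today.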

Since `indirect_arg_t` is used to represent `cccl_value_t`, `cccl_operation_t` and `cccl_iterator_t`, and incrementing only makes sense for iterators, a dedicated type `indirect_iterator_t` must be introduced, which can implement `operator+=`.

If the entirety of the iterator state is user-defined, `cuda.parallel` must provide a host function pointer that increments the iterator's state, obtained by compiling the `advance` function for the host.

If we instead define the state as a struct that contains a `size_t linear_id` in addition to the user-defined state, we could get rid of the user-defined `advance` function altogether, but would need to give the `dereference` function access to `linear_id`.

Both approaches need to be prototyped and compared.
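
A minimal sketch of this first option, assuming the host-compiled `advance` is exposed as a plain function pointer over opaque state. The names `host_advance_fn`, `indirect_iterator_sketch`, `counting_state` and `counting_advance` are made up for illustration and are not part of `cccl.c.parallel`.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical signature of an `advance` function compiled for the host:
// it mutates the opaque, user-defined iterator state in place.
using host_advance_fn = void (*)(void* state, std::uint64_t offset);

// Sketch of an iterator wrapper that owns a copy of the opaque state
// and forwards `operator+=` to the host `advance` function.
struct indirect_iterator_sketch
{
  std::vector<unsigned char> state;
  host_advance_fn advance;

  indirect_iterator_sketch(const void* src, std::size_t size, host_advance_fn fn)
      : state(static_cast<const unsigned char*>(src),
              static_cast<const unsigned char*>(src) + size)
      , advance(fn)
  {}

  indirect_iterator_sketch& operator+=(std::uint64_t offset)
  {
    advance(state.data(), offset);
    return *this;
  }
};

// Example user-defined state and its host `advance` (both hypothetical).
struct counting_state
{
  std::int64_t value;
};

void counting_advance(void* state, std::uint64_t offset)
{
  counting_state s;
  std::memcpy(&s, state, sizeof(s)); // copy out to avoid aliasing issues
  s.value += static_cast<std::int64_t>(offset);
  std::memcpy(state, &s, sizeof(s));
}

// Helper to read the opaque state back as a concrete type.
template <class T>
T read_state(const indirect_iterator_sketch& it)
{
  T out;
  std::memcpy(&out, it.state.data(), sizeof(T));
  return out;
}
```

In this variant the wrapper never interprets the state bytes itself, so all knowledge of the layout stays in the user-supplied `advance`.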
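
The second option could be sketched as follows, assuming `dereference` receives `linear_id` as an extra argument. Again, `dereference_fn`, `linear_id_iterator_sketch`, `scaled_state` and `scaled_dereference` are hypothetical names, and the `std::int64_t` value type is chosen only to keep the example short.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical `dereference` signature that is handed `linear_id`
// in addition to the user-defined state.
using dereference_fn = std::int64_t (*)(const void* user_state, std::size_t linear_id);

// Sketch of iterator state that carries `linear_id` next to the user state.
// Advancing only touches `linear_id`, so no user-defined `advance` is needed.
struct linear_id_iterator_sketch
{
  std::size_t linear_id;
  const void* user_state;
  dereference_fn dereference;

  linear_id_iterator_sketch& operator+=(std::size_t n)
  {
    linear_id += n;
    return *this;
  }

  std::int64_t operator*() const
  {
    return dereference(user_state, linear_id);
  }
};

// Example user-defined state and `dereference` (both hypothetical):
// element i is base + step * i.
struct scaled_state
{
  std::int64_t base;
  std::int64_t step;
};

std::int64_t scaled_dereference(const void* user_state, std::size_t linear_id)
{
  const auto* s = static_cast<const scaled_state*>(user_state);
  return s->base + s->step * static_cast<std::int64_t>(linear_id);
}
```

Here the increment is identical for every iterator, so it works unchanged on host and device, at the cost of a wider `dereference` interface.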

### Describe the solution you'd like

The solution should unblock https://github.com/NVIDIA/cccl/pull/3764

### Additional context

https://github.com/NVIDIA/cccl/pull/3764#issuecomment-2725298547
