@@ -6,33 +6,38 @@ Overview
66
77Data Parallel Extension for Numba* (`numba-dpex `_) is a free and open-source
88LLVM-based code generator for portable accelerator programming in Python. The
9- code generator implements a new pseudo- kernel programming domain-specific
10- language (DSL) called ` KAPI ` that is modeled after the C++ DSL ` SYCL* `_. The
11- SYCL language is an open standard developed under the Unified Acceleration
12- Foundation (`UXL `_) as a vendor-agnostic way of programming different types of
13- data-parallel hardware such as multi-core CPUs, GPUs, and FPGAs. Numba-dpex and
14- KAPI aim to bring the same vendor-agnostic and standard-compliant programming
15- model to Python.
9+ code generator implements a new kernel programming API (kapi) in pure Python
10+ that is modeled after the API of the C++ embedded domain-specific language
11+ (eDSL) ` SYCL* `_. The SYCL eDSL is an open standard developed under the Unified
12+ Acceleration Foundation (`UXL `_) as a vendor-agnostic way of programming
13+ different types of data-parallel hardware such as multi-core CPUs, GPUs, and
14+ FPGAs. Numba-dpex and kapi aim to bring the same vendor-agnostic and
15+ standard-compliant programming model to Python.
1616
1717Numba-dpex is built on top of the open-source `Numba* `_ JIT compiler that
1818implements a CPython bytecode parser and code generator to lower the bytecode to
19- LLVM IR. The Numba* compiler is able to compile a large sub-set of Python and
20- most of the NumPy library. Numba-dpex uses Numba*'s tooling to implement the
21- parsing and typing support for the data types and functions defined in the KAPI
22- DSL. A custom code generator is then used to lower KAPI to a form of LLVM IR
23- that includes special LLVM instructions that define a low-level data-parallel
24- kernel API. Thus, a function defined in KAPI is compiled to a data-parallel
25- kernel that can run on different types of hardware. Currently, compilation of
26- KAPI is possible for x86 CPU devices, Intel Gen9 integrated GPUs, Intel UHD
27- integrated GPUs, and Intel discrete GPUs.
28-
29-
30- The following example shows a pairwise distance matrix computation in KAPI.
19+ LLVM intermediate representation (IR). The Numba* compiler is able to compile a
20+ large subset of Python and most of the NumPy library. Numba-dpex uses Numba*'s
21+ tooling to implement the parsing and typing support for the data types and
22+ functions defined in kapi. A custom code generator is also introduced to lower
23+ kapi functions to a form of LLVM IR that defines a low-level data-parallel
24+ kernel. Thus, a function written in kapi, although purely sequential when
25+ executed in Python, can be compiled to an actual data-parallel kernel that can run on
26+ different types of hardware. Compilation of kapi is possible for x86
27+ CPU devices, Intel Gen9 integrated GPUs, Intel UHD integrated GPUs, and Intel
28+ discrete GPUs.
29+
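As a quick way to check which devices are visible on a particular system, the
`dpctl `_ library that is introduced later in this section can be used to
enumerate the available SYCL devices. The following snippet is only a minimal
sketch: it assumes dpctl is installed, and the exact list of devices depends on
the drivers present on the machine.

.. code-block:: python

    # A minimal sketch: list the SYCL devices dpctl can see on this system.
    # The devices that numba-dpex can target depend on the installed drivers.
    import dpctl

    for device in dpctl.get_devices():
        # Each entry is a dpctl.SyclDevice describing one device.
        print(device.name, "-", device.device_type)
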
30+ The following example presents a pairwise distance matrix computation written
31+ in kapi. A detailed description of the API and all relevant concepts is given
32+ elsewhere in the documentation; for now, the example introduces the core
33+ tenets of the programming model.
3134
3235.. code-block :: python
36+ :linenos:
3337
3438 from numba_dpex import kernel_api as kapi
3539 import math
40+ import dpnp
3641
3742
3843    def pairwise_distance_kernel(item: kapi.Item, data, distance):
@@ -49,41 +54,74 @@ The following example shows a pairwise distance matrix computation in KAPI.
4954 distance[j, i] = math.sqrt(d)
5055
5156
52- Skipping over much of the language details, at a high-level the
53- ``pairwise_distance_kernel `` can be viewed as a data-parallel function that gets
54- executed individually by a set of "work items". That is, each work item runs the
55- same function for a subset of the elements of the input ``data `` and
56- ``distance `` arrays. For programmers familiar with the CUDA or OpenCL languages,
57- it is the same programming model that is referred to as Single Program Multiple
58- Data (SPMD). As Python has no concept of a work item the KAPI function itself is
59- sequential and needs to be compiled to convert it into a parallel version. The
60- next example shows the changes to the original script to compile and run the
57+     data = dpnp.random.ranf((10000, 3), device="gpu")
58+     dist = dpnp.empty(shape=(data.shape[0], data.shape[0]), device="gpu")
59+     exec_range = kapi.Range(data.shape[0], data.shape[0])
60+     kapi.call_kernel(pairwise_distance_kernel, exec_range, data, dist)
61+
62+ The ``pairwise_distance_kernel `` function conceptually defines a data-parallel
63+ function to be executed individually by a set of "work items". That is, each
64+ work item runs the function for a subset of the elements of the input ``data ``
65+ and ``distance `` arrays. The ``item `` argument passed to the function identifies
66+ the work item that is executing a specific instance of the function. The set of
67+ work items is defined by the ``exec_range `` object, and the ``call_kernel `` call
68+ instructs every work item in ``exec_range `` to execute
69+ ``pairwise_distance_kernel `` for a specific subset of the data.
70+
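To make the work-item abstraction concrete, the launch above can be read,
purely conceptually, as the sequential loop nest sketched below: every
``(i, j) `` index pair in ``exec_range `` corresponds to one work item. This is
only an illustration of the semantics, not how numba-dpex actually executes the
kernel.

.. code-block:: python

    # Conceptual, sequential reading of the kernel launch above (illustration
    # only, continuing the snippet above): each (i, j) pair in exec_range
    # corresponds to one work item.
    for i in range(data.shape[0]):
        for j in range(data.shape[0]):
            d = data.dtype.type(0.0)
            for k in range(data.shape[1]):
                tmp = data[i, k] - data[j, k]
                d += tmp * tmp
            dist[j, i] = math.sqrt(d)
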
71+ The logical abstraction exposed by kapi is referred to as the Single Program
72+ Multiple Data (SPMD) programming model. CUDA or OpenCL programmers will
73+ recognize it as similar to the model used in those languages. However, as
74+ Python has no concept of a work item, a kapi function executes sequentially
75+ when invoked from Python. To convert it into a true data-parallel function,
76+ the function first has to be compiled using numba-dpex. The next example shows
77+ the changes to the original script needed to compile and run the
6178``pairwise_distance_kernel `` in parallel.
6279
6380.. code-block :: python
81+ :linenos:
82+    :emphasize-lines: 7, 25
83+
84+ import numba_dpex as dpex
6485
65- from numba_dpex import kernel, call_kernel
86+ from numba_dpex import kernel_api as kapi
87+ import math
6688 import dpnp
6789
90+
91+ @dpex.kernel
92+     def pairwise_distance_kernel(item: kapi.Item, data, distance):
93+         i = item.get_id(0)
94+         j = item.get_id(1)
95+
96+         data_dims = data.shape[1]
97+
98+         d = data.dtype.type(0.0)
99+         for k in range(data_dims):
100+ tmp = data[i, k] - data[j, k]
101+ d += tmp * tmp
102+
103+ distance[j, i] = math.sqrt(d)
104+
105+
68106    data = dpnp.random.ranf((10000, 3), device="gpu")
69- distance = dpnp.empty(shape = (data.shape[0 ], data.shape[0 ]), device = " gpu" )
107+    dist = dpnp.empty(shape=(data.shape[0], data.shape[0]), device="gpu")
70108    exec_range = kapi.Range(data.shape[0], data.shape[0])
71- call_kernel(kernel(pairwise_distance_kernel), exec_range, data, distance)
72109
73- To compile a KAPI function into a data-parallel kernel and run it on a device,
74- three things need to be done: allocate the arguments to the function on the
75- device where the function is to execute, compile the function by applying a
76- numba-dpex decorator, and `launch ` or execute the compiled kernel on the device.
110+ dpex.call_kernel(pairwise_distance_kernel, exec_range, data, dist)
77111
78- Allocating arrays or scalars to be passed to a compiled KAPI function is not
79- done directly in numba-dpex. Instead, numba-dpex supports passing in
112+ To compile a kapi function, the ``call_kernel `` function from kapi has to be
113+ replaced with the one provided by ``numba_dpex ``, and the ``kernel `` decorator
114+ has to be added to the kapi function. The actual device for which the function
115+ is compiled and on which it executes is controlled by the input arguments to
116+ ``call_kernel ``. Allocating the input arguments to be passed to a compiled kapi
117+ function is not done by numba-dpex. Instead, numba-dpex supports passing in
80118tensors/ndarrays created using either the `dpnp `_ NumPy drop-in replacement
81- library or the `dpctl `_ SYCl-based Python Array API library. To trigger
82- compilation, the `` numba_dpex.kernel `` decorator has to be used, and finally to
83- launch a compiled kernel the `` numba_dpex.call_kernel `` function should be
84- invoked .
85-
86- For a more detailed description about programming with numba-dpex, refer
87- the :doc: `programming_model `, :doc: `user_guide/index ` and the
88- :doc: ` autoapi/index ` sections of the documentation. To setup numba-dpex and try
89- it out refer the :doc: `getting_started ` section.
119+ library or the `dpctl `_ SYCL-based Python Array API library. The objects
120+ allocated by these libraries encode the device information for that allocation.
121+ Numba-dpex extracts that information and uses it to compile a kernel for that
122+ specific device and then executes the compiled kernel on it.
123+
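For example, assuming the system also exposes a SYCL CPU device, allocating the
same inputs with ``device="cpu" `` is, in the sketch below, all that is needed to
have the kernel compiled for and executed on the CPU; the kernel definition
itself does not change.

.. code-block:: python

    # A sketch of the compute-follows-data behavior described above, assuming
    # a SYCL CPU device is available and continuing the script above (dpex,
    # kapi, dpnp, and pairwise_distance_kernel are already defined). Only the
    # allocation device of the arguments changes; the kernel is reused as is.
    data_cpu = dpnp.random.ranf((10000, 3), device="cpu")
    dist_cpu = dpnp.empty(shape=(data_cpu.shape[0], data_cpu.shape[0]), device="cpu")

    dpex.call_kernel(
        pairwise_distance_kernel,
        kapi.Range(data_cpu.shape[0], data_cpu.shape[0]),
        data_cpu,
        dist_cpu,
    )
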
124+ For a more detailed description of programming with numba-dpex, refer to the
125+ :doc:`programming_model`, :doc:`user_guide/index`, and :doc:`autoapi/index`
126+ sections of the documentation. To set up numba-dpex and try it out, refer to
127+ the :doc:`getting_started` section.