diff --git a/docs/source/bibliography.bib b/docs/source/bibliography.bib new file mode 100644 index 0000000000..d0b84ff324 --- /dev/null +++ b/docs/source/bibliography.bib @@ -0,0 +1,26 @@ + + +@techreport{scott70, + author = {Dana Scott}, + institution = {OUCL}, + month = {November}, + number = {PRG02}, + pages = {30}, + title = {OUTLINE OF A MATHEMATICAL THEORY OF COMPUTATION}, + year = {1970} +} + +@article{PLOTKIN20043, + abstract = {We review the origins of structural operational semantics. The main publication `A Structural Approach to Operational Semantics,' also known as the `Aarhus Notes,' appeared in 1981 [G.D. Plotkin, A structural approach to operational semantics, DAIMI FN-19, Computer Science Department, Aarhus University, 1981]. The development of the ideas dates back to the early 1970s, involving many people and building on previous work on programming languages and logic. The former included abstract syntax, the SECD machine, and the abstract interpreting machines of the Vienna school; the latter included the λ-calculus and formal systems. The initial development of structural operational semantics was for simple functional languages, more or less variations of the λ-calculus; after that the ideas were gradually extended to include languages with parallel features, such as Milner's CCS. This experience set the ground for a more systematic exposition, the subject of an invited course of lectures at Aarhus University; some of these appeared in print as the 1981 Notes. We discuss the content of these lectures and some related considerations such as `small state' versus `grand state,' structural versus compositional semantics, the influence of the Scott–Strachey approach to denotational semantics, the treatment of recursion and jumps, and static semantics. We next discuss relations with other work and some immediate further development. We conclude with an account of an old, previously unpublished, idea: an alternative, perhaps more readable, graphical presentation of systems of rules for operational semantics.}, + author = {Gordon D Plotkin}, + doi = {https://doi.org/10.1016/j.jlap.2004.03.009}, + issn = {1567-8326}, + journal = {The Journal of Logic and Algebraic Programming}, + keywords = {Semantics of programming languages, (Structural) operational semantics, Structural induction, (Labelled) transition systems, -calculus, Concurrency, Big step semantics, Small-step semantics, Abstract machines, Static semantics}, + note = {Structural Operational Semantics}, + pages = {3-15}, + title = {The origins of structural operational semantics}, + url = {https://www.sciencedirect.com/science/article/pii/S1567832604000268}, + volume = {60-61}, + year = {2004} +} diff --git a/docs/source/conf.py b/docs/source/conf.py index 4276307360..90ebaa6f5f 100644 --- a/docs/source/conf.py +++ b/docs/source/conf.py @@ -31,8 +31,11 @@ "sphinxcontrib.googleanalytics", "myst_parser", "autoapi.extension", + "sphinxcontrib.bibtex", ] +bibtex_bibfiles = ["bibliography.bib"] + # Add any paths that contain templates here, relative to this directory. # templates_path = ['_templates'] templates_path = [] diff --git a/docs/source/ext_links.txt b/docs/source/ext_links.txt index ee01b1a83c..4c369fbd92 100644 --- a/docs/source/ext_links.txt +++ b/docs/source/ext_links.txt @@ -2,7 +2,7 @@ ********************************************************** THESE ARE EXTERNAL PROJECT LINKS USED IN THE DOCUMENTATION ********************************************************** - +.. 
_math: https://docs.python.org/3/library/math.html .. _NumPy*: https://numpy.org/ .. _Numba*: https://numba.pydata.org/ .. _numba-dpex: https://github.com/IntelPython/numba-dpex @@ -14,6 +14,7 @@ .. _SYCL*: https://www.khronos.org/sycl/ .. _dpctl: https://intelpython.github.io/dpctl/latest/index.html .. _Data Parallel Control: https://intelpython.github.io/dpctl/latest/index.html +.. _DLPack: https://dmlc.github.io/dlpack/latest/ .. _Dpnp: https://intelpython.github.io/dpnp/ .. _dpnp: https://intelpython.github.io/dpnp/ .. _Data Parallel Extension for Numpy*: https://intelpython.github.io/dpnp/ @@ -28,3 +29,8 @@ .. _oneDPL: https://www.intel.com/content/www/us/en/developer/tools/oneapi/dpc-library.html#gs.5izf63 .. _UXL: https://uxlfoundation.org/ .. _oneAPI GPU optimization guide: https://www.intel.com/content/www/us/en/docs/oneapi/optimization-guide-gpu/2024-0/general-purpose-computing-on-gpu.html +.. _dpctl.tensor.usm_ndarray: https://intelpython.github.io/dpctl/latest/docfiles/dpctl/usm_ndarray.html#dpctl.tensor.usm_ndarray +.. _dpnp.ndarray: https://intelpython.github.io/dpnp/reference/ndarray.html + +.. _Dispatcher: https://numba.readthedocs.io/en/stable/reference/jit-compilation.html#dispatcher-objects +.. _Unboxes: https://numba.readthedocs.io/en/stable/extending/interval-example.html#boxing-and-unboxing diff --git a/docs/source/overview.rst b/docs/source/overview.rst index 6db2ceabaa..4087fe5009 100644 --- a/docs/source/overview.rst +++ b/docs/source/overview.rst @@ -6,33 +6,38 @@ Overview Data Parallel Extension for Numba* (`numba-dpex`_) is a free and open-source LLVM-based code generator for portable accelerator programming in Python. The -code generator implements a new pseudo-kernel programming domain-specific -language (DSL) called `KAPI` that is modeled after the C++ DSL `SYCL*`_. The -SYCL language is an open standard developed under the Unified Acceleration -Foundation (`UXL`_) as a vendor-agnostic way of programming different types of -data-parallel hardware such as multi-core CPUs, GPUs, and FPGAs. Numba-dpex and -KAPI aim to bring the same vendor-agnostic and standard-compliant programming -model to Python. +code generator implements a new kernel programming API (kapi) in pure Python +that is modeled after the API of the C++ embedded domain-specific language +(eDSL) `SYCL*`_. The SYCL eDSL is an open standard developed under the Unified +Acceleration Foundation (`UXL`_) as a vendor-agnostic way of programming +different types of data-parallel hardware such as multi-core CPUs, GPUs, and +FPGAs. Numba-dpex and kapi aim to bring the same vendor-agnostic and +standard-compliant programming model to Python. Numba-dpex is built on top of the open-source `Numba*`_ JIT compiler that implements a CPython bytecode parser and code generator to lower the bytecode to -LLVM IR. The Numba* compiler is able to compile a large sub-set of Python and -most of the NumPy library. Numba-dpex uses Numba*'s tooling to implement the -parsing and typing support for the data types and functions defined in the KAPI -DSL. A custom code generator is then used to lower KAPI to a form of LLVM IR -that includes special LLVM instructions that define a low-level data-parallel -kernel API. Thus, a function defined in KAPI is compiled to a data-parallel -kernel that can run on different types of hardware. Currently, compilation of -KAPI is possible for x86 CPU devices, Intel Gen9 integrated GPUs, Intel UHD -integrated GPUs, and Intel discrete GPUs. 
- - -The following example shows a pairwise distance matrix computation in KAPI. +LLVM intermediate representation (IR). The Numba* compiler is able to compile a +large subset of Python and most of the NumPy library. Numba-dpex uses Numba*'s +tooling to implement the parsing and the typing support for the data types and +functions defined in kapi. A custom code generator is also introduced to lower +kapi functions to a form of LLVM IR that defines a low-level data-parallel +kernel. Thus, a function written in kapi, although purely sequential when executed +in Python, can be compiled to an actual data-parallel kernel that can run on +different types of hardware. Compilation of kapi is possible for x86 +CPU devices, Intel Gen9 integrated GPUs, Intel UHD integrated GPUs, and Intel +discrete GPUs. + +The following example presents a pairwise distance matrix computation as written +in kapi. A detailed description of the API and all relevant concepts is provided +elsewhere in the documentation; for now, the example introduces the core +tenet of the programming model. .. code-block:: python + :linenos: from numba_dpex import kernel_api as kapi import math + import dpnp def pairwise_distance_kernel(item: kapi.Item, data, distance): @@ -49,41 +54,74 @@ The following example shows a pairwise distance matrix computation in KAPI. distance[j, i] = math.sqrt(d) -Skipping over much of the language details, at a high-level the -``pairwise_distance_kernel`` can be viewed as a data-parallel function that gets -executed individually by a set of "work items". That is, each work item runs the -same function for a subset of the elements of the input ``data`` and -``distance`` arrays. For programmers familiar with the CUDA or OpenCL languages, -it is the same programming model that is referred to as Single Program Multiple -Data (SPMD). As Python has no concept of a work item the KAPI function itself is -sequential and needs to be compiled to convert it into a parallel version. The -next example shows the changes to the original script to compile and run the + data = dpnp.random.ranf((10000, 3), device="gpu") + dist = dpnp.empty(shape=(data.shape[0], data.shape[0]), device="gpu") + exec_range = kapi.Range(data.shape[0], data.shape[0]) + kapi.call_kernel(pairwise_distance_kernel, exec_range, data, dist) + +The ``pairwise_distance_kernel`` function conceptually defines a data-parallel +function to be executed individually by a set of "work items". That is, each +work item runs the function for a subset of the elements of the input ``data`` +and ``distance`` arrays. The ``item`` argument passed to the function identifies +the work item that is executing a specific instance of the function. The set of +work items is defined by the ``exec_range`` object and the ``call_kernel`` call +instructs every work item in ``exec_range`` to execute +``pairwise_distance_kernel`` for a specific subset of the data. + +The logical abstraction exposed by kapi is referred to as the Single Program +Multiple Data (SPMD) programming model. CUDA or OpenCL programmers will +recognize the programming model exposed by kapi as similar to the one in those +languages. However, as Python has no concept of a work item, a kapi function +executes sequentially when invoked from Python. To convert it into a true +data-parallel function, the function first has to be compiled using numba-dpex. +The next example shows the changes to the original script to compile and run the ``pairwise_distance_kernel`` in parallel. .. 
code-block:: python + :linenos: + :emphasize-lines: 7, 25 + + import numba_dpex as dpex - from numba_dpex import kernel, call_kernel + from numba_dpex import kernel_api as kapi + import math import dpnp + + @dpex.kernel + def pairwise_distance_kernel(item: kapi.Item, data, distance): + i = item.get_id(0) + j = item.get_id(1) + + data_dims = data.shape[1] + + d = data.dtype.type(0.0) + for k in range(data_dims): + tmp = data[i, k] - data[j, k] + d += tmp * tmp + + distance[j, i] = math.sqrt(d) + + data = dpnp.random.ranf((10000, 3), device="gpu") - distance = dpnp.empty(shape=(data.shape[0], data.shape[0]), device="gpu") + dist = dpnp.empty(shape=(data.shape[0], data.shape[0]), device="gpu") exec_range = kapi.Range(data.shape[0], data.shape[0]) - call_kernel(kernel(pairwise_distance_kernel), exec_range, data, distance) -To compile a KAPI function into a data-parallel kernel and run it on a device, -three things need to be done: allocate the arguments to the function on the -device where the function is to execute, compile the function by applying a -numba-dpex decorator, and `launch` or execute the compiled kernel on the device. + dpex.call_kernel(pairwise_distance_kernel, exec_range, data, dist) -Allocating arrays or scalars to be passed to a compiled KAPI function is not -done directly in numba-dpex. Instead, numba-dpex supports passing in +To compile a kapi function, the ``call_kernel`` function from kapi has to be +replaced by the one provided in ``numba_dpex`` and the ``kernel`` decorator +has to be added to the kapi function. The actual device for which the function +is compiled and on which it executes is controlled by the input arguments to +``call_kernel``. Allocating the input arguments to be passed to a compiled kapi +function is not done by numba-dpex. Instead, numba-dpex supports passing in tensors/ndarrays created using either the `dpnp`_ NumPy drop-in replacement -library or the `dpctl`_ SYCl-based Python Array API library. To trigger -compilation, the ``numba_dpex.kernel`` decorator has to be used, and finally to -launch a compiled kernel the ``numba_dpex.call_kernel`` function should be -invoked. - -For a more detailed description about programming with numba-dpex, refer -the :doc:`programming_model`, :doc:`user_guide/index` and the -:doc:`autoapi/index` sections of the documentation. To setup numba-dpex and try -it out refer the :doc:`getting_started` section. +library or the `dpctl`_ SYCL-based Python Array API library. The objects +allocated by these libraries encode the device information for that allocation. +Numba-dpex extracts the information and uses it to compile a kernel for that +specific device and then executes the compiled kernel on it. + +For a more detailed description of programming with numba-dpex, refer to the +:doc:`programming_model`, :doc:`user_guide/index` and the :doc:`autoapi/index` +sections of the documentation. To set up numba-dpex and try it out, refer to the +:doc:`getting_started` section. diff --git a/docs/source/programming_model.old.rst b/docs/source/programming_model.old.rst new file mode 100644 index 0000000000..9c98ead75c --- /dev/null +++ b/docs/source/programming_model.old.rst @@ -0,0 +1,277 @@ +.. _programming_model: +.. include:: ./ext_links.txt + +Programming Model +================= + +In a heterogeneous system there may be **multiple** devices a Python user may +want to engage. For example, it is common for a consumer-grade laptop to feature +an integrated or a discrete GPU alongside a CPU. 
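For instance, the following minimal sketch (assuming the ``dpctl`` package and at least
one device driver are installed) lists every device that the underlying DPC++ runtime
can see on such a machine; the rest of this section explains the concepts behind it:

.. code:: python

   import dpctl

   # Enumerate all devices known to the DPC++ runtime and print a short summary
   for dev in dpctl.get_devices():
       print(dev.device_type, dev.name)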
+ +To harness their power one needs to know how to answer the following 3 key +questions: + +1. How does a Python program recognize available computational devices? +2. How does a Python workload specify computations to be offloaded to selected + devices? +3. How does a Python application manage data sharing? + +Recognizing available devices +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Python package ``dpctl`` answers these questions. All the computational devices +known to the underlying DPC++ runtime can be accessed using +``dpctl.get_devices()``. A specific device of interest `can be selected +`__ +either using a helper function, e.g. ``dpctl.select_gpu_device()``, or by +passing a filter selector string to ``dpctl.SyclDevice`` constructor. + +.. code:: python + + import dpctl + + # select a GPU device. If multiple devices present, + # let the underlying runtime select from GPUs + dev_gpu = dpctl.SyclDevice("gpu") + # select a CPU device + dev_cpu = dpctl.SyclDevice("cpu") + + # stand-alone function, equivalent to C++ + # `auto dev = sycl::gpu_selector().select_device();` + dev_gpu_alt = dpctl.select_gpu_device() + # stand-alone function, equivalent to C++ + # `auto dev = sycl::cpu_selector().select_device();` + dev_cpu_alt = dpctl.select_cpu_device() + +A `device object +`__ +can be used to query properies of the device, such as its name, vendor, maximal +number of computational units, memory size, etc. + +Specifying offload target +~~~~~~~~~~~~~~~~~~~~~~~~~ + +To answer the second question on the list we need a digression to explain +offloading in oneAPI DPC++ first. + +.. note:: + In DPC++, a computation kernel can be specified using generic C++ + programming and then the kernel can be offloaded to any device that is + supported by an underlying SYCL runtime. The device to which the kernel + is offloaded is specified using an **execution queue** when *launching + the kernel*. + + The oneAPI unified programming model brings portability across heterogeneous + architectures. Another important aspect of the programming model is its + inherent flexibility that makes it possible to go beyond portability and even + strive for performance portability. An oneAPI library may be implemented + using C++ techniques such as template metaprogramming or dynamic polymorphism + to implement specializations for a generic kernel. If a kernel is implemented + polymorphically, the specialized implementation will be dispatched based on + the execution queue specified during kernel launch. The oneMKL library is an + example of a performance portable oneAPI library. + +A computational task is offloaded for execution on a device by submitting it to +DPC++ runtime which inserts the task in a computational graph. Once the device +becomes available the runtime selects a task whose dependencies are met for +execution. The computational graph as well as the device targeted by its tasks +are stored in a `SYCL queue +`__ +object. The task submission is therefore always associated with a queue. + +Queues can be constructed directly from a device object, or by using a filter +selector string to indicate the device to construct: + +.. code:: python + + # construct queue from device object + q1 = dpctl.SyclQueue(dev_gpu) + # construct queue using filter selector + q2 = dpctl.SyclQueue("gpu") + +The computational tasks can be stored in an oneAPI native extension in which +case their submission is orchestrated during Python API calls. 
Let’s consider a +function that offloads an evaluation of a polynomial for every point of a NumPy +array ``X``. Such a function needs to receive a queue object to indicate which +device the computation must be offloaded to: + +.. code:: python + + # allocate space for the result + Y = np.empty_like(X) + # evaluate polynomial on the device targeted by the queue, Y[i] = p(X[i]) + onapi_ext.offloaded_poly_evaluate(exec_q, X, Y) + +Python call to ``onapi_ext.offloaded_poly_evaluate`` applied to NumPy arrays of +double precision floating pointer numbers gets translated to the following +sample C++ code: + +.. code:: cpp + + void + cpp_offloaded_poly_evaluate( + sycl::queue q, const double *X, double *Y, size_t n) { + // create buffers from malloc allocations to make data accessible from device + sycl::buffer<1, double> buf_X(X, n); + sycl::buffer<1, double> buf_Y(Y, n); + + q.submit([&](sycl::handler &cgh) { + // create buffer accessors indicating kernel data-flow pattern + sycl::accessor acc_X(buf_X, cgh, sycl::read_only); + sycl::accessor acc_Y(buf_Y, cgh, sycl::write_only, sycl::no_init); + + cgh.parallel_for(n, + // lambda function that gets executed by different work-items with + // different arguments in parallel + [=](sycl::id<1> id) { + auto x = accX[id]; + accY[id] = 3.0 + x * (1.0 + x * (-0.5 + 0.3 * x)); + }); + }).wait(); + + return; + } + +We refer an interested reader to an excellent and freely available “`Data +Parallel C++ `__” +book for details of this data parallel C++. + +Our package ``numba_dpex`` allows one to write kernels directly in Python. + +.. code:: python + + import numba_dpex + + + @numba_dpex.kernel + def numba_dpex_poly(X, Y): + i = numba_dpex.get_global_id(0) + x = X[i] + Y[i] = 3.0 + x * (1.0 + x * (-0.5 + 0.3 * x)) + +Specifying the execution queue is done using Python context manager: + +.. code:: python + + import numpy as np + + X = np.random.randn(10**6) + Y = np.empty_like(X) + + with dpctl.device_context(q): + # apply the kernel to elements of X, writing value into Y, + # while executing using given queue + numba_dpex_poly[numba_dpex.Range(X.size)](X, Y) + +The argument to ``device_context`` can be a queue object, a device object for +which a temporary queue will be created, or a filter selector string. Thus we +could have equally used ``dpctl.device_context(gpu_dev)`` or +``dpctl.device_context("gpu")``. + +Note that in this examples data sharing was implicitly managed for us: in the +case of calling a function from a precompiled oneAPI native extension data +sharing was managed by DPC++ runtime, while in the case of using ``numba_dpex`` +kernel it was managed during execution of ``__call__`` method. + +Data sharing +~~~~~~~~~~~~ + +Implicit management of data is surely convenient, but its use in an interpreted +code comes at a performance cost. A runtime must implicitly copy data from host +to the device before the kernel execution commences and then copy some (or all) +of it back after the execution completes for every Python API call. + +``dpctl`` provides for allocating memory directly accessible to kernels +executing on a device using SYCL’s Unified Shared Memory (`USM +`__) +feature. It also implements USM-based ND-array object +``dpctl.tensor.usm_ndarray`` that conforms `array-API standard +`__. + +.. 
code:: python + + import dpctl.tensor as dpt + + # allocate array of doubles using USM-device allocation on GPU device + X = dpt.arange(0.0, end=1.0, step=1e-6, device="gpu", usm_type="device") + # allocate array for the output + Y = dpt.empty_like(X) + + # execution queue is inferred from allocation queues. + # Kernel is executed on the same device where arrays were allocated + numba_dpex_poly[X.size, numba_dpex.DEFAULT_LOCAL_SIZE](X, Y) + +The execution queue can be unambiguously determined in this case since both +arguments are USM arrays with the same allocation queues and ``X.sycl_queue == +Y.sycl_queue`` evaluates to ``True``. Should allocation queues be different, +such an inference becomes ambiguous and ``numba_dpex`` raises +``IndeterminateExecutionQueueError`` advising user to explicitly migrate the +data. + +Migration can be accomplished either by using ``dpctl.tensor.asarray(X, +device=target_device)`` to create a copy, or by using +``X.to_device(target_device)`` method. + +A USM array can be copied back into a NumPy array using ``dpt.asnumpy(Y)`` if +needed. + +Compute follows data +~~~~~~~~~~~~~~~~~~~~ + +Automatic deduction of the execution queue from allocation queues is consistent +with “`local control for data allocation target +`__” +in the array API standard. User has full control over memory allocation through +three keyword arguments present in all `array creation functions +`__. +For example, consider + +.. code:: python + + # TODO + +The keyword ``device`` is `mandated by the array API +`__. +In ``dpctl.tensor`` the allowed values of the keyword are + +- Filter selector string, e.g. ``device="gpu:0"`` +- Existing ``dpctl.SyclDevice`` object, e.g. ``device=dev_gpu`` +- Existing ``dpctl.SyclQueue`` object +- ``dpctl.tensor.Device`` object instance obtained from an existing USM array, + e.g. ``device=X.device`` + +In all cases, an allocation queue object will be constructed as described +`earlier <#specifying-offload-target>`__ and stored in the array instance, +accessible with ``X.sycl_queue``. Instead of using ``device`` keyword, one can +alternatively use ``sycl_queue`` keyword for readability to directly specify a +``dpctl.SyclQueue`` object to be used as the allocation queue. + +The rationale for storing the allocation queue in the array is that kernels +submitted to this queue are guaranteed to be able to correctly dereference (i.e. +access) the USM pointer. Array operations that only involve this single USM +array can thus execute on the allocation queue, and the output array can be +allocated on this same allocation queue with the same usm type as the input +array. + +.. note:: + Reusing the allocation queue of the input + array ensures the computational tasks behind the API call can access the + array without making implicit copies and the output array is allocated + on the same device as the input. + +Compute follows data is the rule prescribing deduction of the execution and the +allocation queue as well as the USM type for the result when multiple USM arrays +are combined. It stipulates that arrays can be combined if and only if their +allocation *queues are the same* as measured by ``==`` operator (i.e. +``X.sycl_queue == Y.sycl_queue`` must evaluate to ``True``). Same queues refer +to the same underlying task graphs and DPC++ schedulers. + +An attempt to combine USM arrays with unsame allocation queues raises an +exception advising the user to migrate the data. 
Migration can be accomplished +either by using ``dpctl.tensor.asarray(X, device=Y.device)`` to create a copy, +or by using the ``X.to_device(Y.device)`` method which can sometimes do the migration +more efficiently. + +.. warning:: + ``dpctl`` and ``numba_dpex`` are both under heavy development. Feel free to file an + issue on GitHub or reach out on Gitter should you encounter any issues. diff --git a/docs/source/programming_model.rst b/docs/source/programming_model.rst index 42e972c6b0..c4d2eea32a 100644 --- a/docs/source/programming_model.rst +++ b/docs/source/programming_model.rst @@ -4,274 +4,56 @@ Programming Model ================= -In a heterogeneous system there may be **multiple** devices a Python user may -want to engage. For example, it is common for a consumer-grade laptop to feature -an integrated or a discrete GPU alongside a CPU. +This section describes the multiple facets of the programming model that defines +how programmers can use numba-dpex to develop parallel applications. The goal of +the section is to provide users who are new to accelerator programming or to parallel +programming in general with an introduction to some of the core concepts and to map +those concepts to numba-dpex's interface. -To harness their power one needs to know how to answer the following 3 key -questions: -1. How does a Python program recognize available computational devices? -2. How does a Python workload specify computations to be offloaded to selected - devices? -3. How does a Python application manage data sharing? +Data-level parallelism +---------------------- -Recognizing available devices -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +A large part of the massive level of parallelism offered by accelerators such as +GPUs is the ability to exploit *data-level parallelism* or simply *data +parallelism*. The term refers to a common pattern that occurs in many types of +programs where multiple units of the data accessed by the program can be +operated on at the same time. All modern computing platforms offer +features to exploit data parallelism. Hardware features such as multiple nodes +of a cluster computer, multiple cores or execution units of a CPU or a GPU, +multiple threads inside a single execution unit, and even short-vector single +instruction multiple data (SIMD) registers on a core, all offer ways to exploit +data parallelism. Some of these hardware features such as SIMD registers are +exclusively designed for data parallelism, whereas others are more +general-purpose. -Python package ``dpctl`` answers these questions. All the computational devices -known to the underlying DPC++ runtime can be accessed using -``dpctl.get_devices()``. A specific device of interest `can be selected -`__ -either using a helper function, e.g. ``dpctl.select_gpu_device()``, or by -passing a filter selector string to ``dpctl.SyclDevice`` constructor. +The diversity of the hardware landscape coupled with the different APIs required +by each type of hardware leads to a conundrum for both programmers and programming +language designers: *How to define a common programming model that can express +data parallelism?* Defining a common programming model first and foremost +requires a common execution model backed by an operational semantics +:cite:p:`scott70` defining the computational steps of the execution model. -.. code:: python - import dpctl +SPMD +---- +logical abstraction - # select a GPU device. 
If multiple devices present, - # let the underlying runtime select from GPUs - dev_gpu = dpctl.SyclDevice("gpu") - # select a CPU device - dev_cpu = dpctl.SyclDevice("cpu") +SIMD/SIMT implementation model - # stand-alone function, equivalent to C++ - # `auto dev = sycl::gpu_selector().select_device();` - dev_gpu_alt = dpctl.select_gpu_device() - # stand-alone function, equivalent to C++ - # `auto dev = sycl::cpu_selector().select_device();` - dev_cpu_alt = dpctl.select_cpu_device() -A `device object -`__ -can be used to query properies of the device, such as its name, vendor, maximal -number of computational units, memory size, etc. +Execution Model +--------------- -Specifying offload target -~~~~~~~~~~~~~~~~~~~~~~~~~ +Memory Model +------------ -To answer the second question on the list we need a digression to explain -offloading in oneAPI DPC++ first. - -.. note:: - In DPC++, a computation kernel can be specified using generic C++ - programming and then the kernel can be offloaded to any device that is - supported by an underlying SYCL runtime. The device to which the kernel - is offloaded is specified using an **execution queue** when *launching - the kernel*. - - The oneAPI unified programming model brings portability across heterogeneous - architectures. Another important aspect of the programming model is its - inherent flexibility that makes it possible to go beyond portability and even - strive for performance portability. An oneAPI library may be implemented - using C++ techniques such as template metaprogramming or dynamic polymorphism - to implement specializations for a generic kernel. If a kernel is implemented - polymorphically, the specialized implementation will be dispatched based on - the execution queue specified during kernel launch. The oneMKL library is an - example of a performance portable oneAPI library. - -A computational task is offloaded for execution on a device by submitting it to -DPC++ runtime which inserts the task in a computational graph. Once the device -becomes available the runtime selects a task whose dependencies are met for -execution. The computational graph as well as the device targeted by its tasks -are stored in a `SYCL queue -`__ -object. The task submission is therefore always associated with a queue. - -Queues can be constructed directly from a device object, or by using a filter -selector string to indicate the device to construct: - -.. code:: python - - # construct queue from device object - q1 = dpctl.SyclQueue(dev_gpu) - # construct queue using filter selector - q2 = dpctl.SyclQueue("gpu") - -The computational tasks can be stored in an oneAPI native extension in which -case their submission is orchestrated during Python API calls. Let’s consider a -function that offloads an evaluation of a polynomial for every point of a NumPy -array ``X``. Such a function needs to receive a queue object to indicate which -device the computation must be offloaded to: - -.. code:: python - - # allocate space for the result - Y = np.empty_like(X) - # evaluate polynomial on the device targeted by the queue, Y[i] = p(X[i]) - onapi_ext.offloaded_poly_evaluate(exec_q, X, Y) - -Python call to ``onapi_ext.offloaded_poly_evaluate`` applied to NumPy arrays of -double precision floating pointer numbers gets translated to the following -sample C++ code: - -.. 
code:: cpp - - void - cpp_offloaded_poly_evaluate( - sycl::queue q, const double *X, double *Y, size_t n) { - // create buffers from malloc allocations to make data accessible from device - sycl::buffer<1, double> buf_X(X, n); - sycl::buffer<1, double> buf_Y(Y, n); - - q.submit([&](sycl::handler &cgh) { - // create buffer accessors indicating kernel data-flow pattern - sycl::accessor acc_X(buf_X, cgh, sycl::read_only); - sycl::accessor acc_Y(buf_Y, cgh, sycl::write_only, sycl::no_init); - - cgh.parallel_for(n, - // lambda function that gets executed by different work-items with - // different arguments in parallel - [=](sycl::id<1> id) { - auto x = accX[id]; - accY[id] = 3.0 + x * (1.0 + x * (-0.5 + 0.3 * x)); - }); - }).wait(); - - return; - } - -We refer an interested reader to an excellent and freely available “`Data -Parallel C++ `__” -book for details of this data parallel C++. - -Our package ``numba_dpex`` allows one to write kernels directly in Python. - -.. code:: python - - import numba_dpex - - - @numba_dpex.kernel - def numba_dpex_poly(X, Y): - i = numba_dpex.get_global_id(0) - x = X[i] - Y[i] = 3.0 + x * (1.0 + x * (-0.5 + 0.3 * x)) - -Specifying the execution queue is done using Python context manager: - -.. code:: python - - import numpy as np - - X = np.random.randn(10**6) - Y = np.empty_like(X) - - with dpctl.device_context(q): - # apply the kernel to elements of X, writing value into Y, - # while executing using given queue - numba_dpex_poly[numba_dpex.Range(X.size)](X, Y) - -The argument to ``device_context`` can be a queue object, a device object for -which a temporary queue will be created, or a filter selector string. Thus we -could have equally used ``dpctl.device_context(gpu_dev)`` or -``dpctl.device_context("gpu")``. - -Note that in this examples data sharing was implicitly managed for us: in the -case of calling a function from a precompiled oneAPI native extension data -sharing was managed by DPC++ runtime, while in the case of using ``numba_dpex`` -kernel it was managed during execution of ``__call__`` method. - -Data sharing -~~~~~~~~~~~~ - -Implicit management of data is surely convenient, but its use in an interpreted -code comes at a performance cost. A runtime must implicitly copy data from host -to the device before the kernel execution commences and then copy some (or all) -of it back after the execution completes for every Python API call. - -``dpctl`` provides for allocating memory directly accessible to kernels -executing on a device using SYCL’s Unified Shared Memory (`USM -`__) -feature. It also implements USM-based ND-array object -``dpctl.tensor.usm_ndarray`` that conforms `array-API standard -`__. - -.. code:: python - - import dpctl.tensor as dpt - - # allocate array of doubles using USM-device allocation on GPU device - X = dpt.arange(0.0, end=1.0, step=1e-6, device="gpu", usm_type="device") - # allocate array for the output - Y = dpt.empty_like(X) - - # execution queue is inferred from allocation queues. - # Kernel is executed on the same device where arrays were allocated - numba_dpex_poly[X.size, numba_dpex.DEFAULT_LOCAL_SIZE](X, Y) - -The execution queue can be unambiguously determined in this case since both -arguments are USM arrays with the same allocation queues and ``X.sycl_queue == -Y.sycl_queue`` evaluates to ``True``. Should allocation queues be different, -such an inference becomes ambiguous and ``numba_dpex`` raises -``IndeterminateExecutionQueueError`` advising user to explicitly migrate the -data. 
- -Migration can be accomplished either by using ``dpctl.tensor.asarray(X, -device=target_device)`` to create a copy, or by using -``X.to_device(target_device)`` method. - -A USM array can be copied back into a NumPy array using ``dpt.asnumpy(Y)`` if -needed. +Kernel Dependency Model +----------------------- Compute follows data -~~~~~~~~~~~~~~~~~~~~ - -Automatic deduction of the execution queue from allocation queues is consitent -with “`local control for data allocation target -`__” -in the array API standard. User has full control over memory allocation through -three keyword arguments present in all `array creation functions -`__. -For example, consider - -.. code:: python - - # TODO - -The keyword ``device`` is `mandated by the array API -`__. -In ``dpctl.tensor`` the allowed values of the keyword are - -- Filter selector string, e.g. ``device="gpu:0"`` -- Existing ``dpctl.SyclDevice`` object, e.g. ``device=dev_gpu`` -- Existing ``dpctl.SyclQueue`` object -- ``dpctl.tensor.Device`` object instance obtained from an existing USM array, - e.g. ``device=X.device`` - -In all cases, an allocation queue object will be constructed as described -`earlier <#specifying-offload-target>`__ and stored in the array instance, -accessible with ``X.sycl_queue``. Instead of using ``device`` keyword, one can -alternatively use ``sycl_queue`` keyword for readability to directly specify a -``dpctl.SyclQueue`` object to be used as the allocation queue. - -The rationale for storing the allocation queue in the array is that kernels -submitted to this queue are guaranteed to be able to correctly dereference (i.e. -access) the USM pointer. Array operations that only involve this single USM -array can thus execute on the allocation queue, and the output array can be -allocated on this same allocation queue with the same usm type as the input -array. - -.. note:: - Reusing the allocation queue of the input - array ensures the computational tasks behind the API call can access the - array without making implicit copies and the output array is allocated - on the same device as the input. - -Compute follows data is the rule prescribing deduction of the execution and the -allocation queue as well as the USM type for the result when multiple USM arrays -are combined. It stipulates that arrays can be combined if and only if their -allocation *queues are the same* as measured by ``==`` operator (i.e. -``X.sycl_queue == Y.sycl_queue`` must evaluate to ``True``). Same queues refer -to the same underlying task graphs and DPC++ schedulers. - -An attempt to combine USM arrays with unsame allocation queues raises an -exception advising the user to migrate the data. Migration can be accomplished -either by using ``dpctl.tensor.asarray(X, device=Y.device)`` to create a copy, -or by using ``X.to_device(Y.device)`` method which can sometime do the migration -more efficiently. +-------------------- -.. warning:: - ``dpctl`` and ``numba_dpex`` are both under heavy development. Feel free to file an - issue on GitHub or reach out on Gitter should you encounter any issues. +References +~~~~~~~~~~ +.. bibliography:: diff --git a/docs/source/user_guide/kernel_programming/call-kernel-async.rst b/docs/source/user_guide/kernel_programming/call-kernel-async.rst new file mode 100644 index 0000000000..b3a657890d --- /dev/null +++ b/docs/source/user_guide/kernel_programming/call-kernel-async.rst @@ -0,0 +1,4 @@ +.. 
_launching-an-async-kernel: + +Async kernel execution +====================== diff --git a/docs/source/user_guide/kernel_programming/call-kernel.rst b/docs/source/user_guide/kernel_programming/call-kernel.rst new file mode 100644 index 0000000000..2655027460 --- /dev/null +++ b/docs/source/user_guide/kernel_programming/call-kernel.rst @@ -0,0 +1,117 @@ +.. _launching-a-kernel: + +Launching a kernel +================== + +A ``kernel`` decorated kapi function produces a ``KernelDispatcher`` object that +is a type of a Numba* `Dispatcher`_ object. However, unlike regular Numba* +Dispatcher objects, a ``KernelDispatcher`` object cannot be directly invoked from +either CPython or another compiled Numba* ``jit`` function. To invoke a +``kernel`` decorated function, a programmer has to use the +:func:`numba_dpex.core.kernel_launcher.call_kernel` function. + +To invoke a ``KernelDispatcher``, the ``call_kernel`` function requires three +things: the ``KernelDispatcher`` object, the ``Range`` or ``NdRange`` object +over which the kernel is to be executed, and the list of arguments to be passed +to the compiled kernel. Once called with the necessary arguments, the +``call_kernel`` function does the following main things: + +- Compiles the ``KernelDispatcher`` object, specializing it for the provided + argument types. + +- `Unboxes`_ the kernel arguments by converting CPython objects into Numba* or + numba-dpex objects. + +- Infers the execution queue on which to submit the kernel from the provided + kernel arguments. (TODO: Refer compute follows data.) + +- Submits the kernel to the execution queue. + +- Waits for the execution to complete before returning control back to the + caller. + +.. important:: + Programmers should note the following two things when defining the global or + local range to launch a kernel. + + * Numba-dpex currently limits the maximum allowed global range size to + ``2^31-1``. The limit is due to the capabilities of current OpenCL GPU backends + that generally do not support more than 32-bit global range sizes. A + kernel requesting a larger global range than that will not execute and a + ``dpctl._sycl_queue.SyclKernelSubmitError`` will get raised. + + The Intel dpcpp SYCL compiler does handle greater than 32-bit global + ranges for GPU backends by wrapping the kernel in a new kernel that has + each work-item perform multiple invocations of the original kernel in a + 32-bit global range. Such a feature is not yet available in numba-dpex. + + * When launching an nd-range kernel, if the number of work-items for a + particular dimension of a work-group exceeds the maximum device + capability, it can result in undefined behavior. + + The maximum allowed work-items for a device can be queried programmatically + as shown in :ref:`ex_max_work_item`. + + .. 
code-block:: python + :linenos: + :caption: **Example:** Query maximum number of work-items for a device + :name: ex_max_work_item + + import dpctl + import math + + d = dpctl.SyclDevice("gpu") + d.print_device_info() + + max_num_work_items = ( + d.max_work_group_size + * d.max_work_item_sizes1d[0] + * d.max_work_item_sizes2d[0] + * d.max_work_item_sizes3d[0] + ) + print(max_num_work_items, f"(2^{int(math.log(max_num_work_items, 2))})") + + cpud = dpctl.SyclDevice("cpu") + cpud.print_device_info() + + max_num_work_items_cpu = ( + cpud.max_work_group_size + * cpud.max_work_item_sizes1d[0] + * cpud.max_work_item_sizes2d[0] + * cpud.max_work_item_sizes3d[0] + ) + print(max_num_work_items_cpu, f"(2^{int(math.log(max_num_work_items_cpu, 2))})") + + The output for :ref:`ex_max_work_item` on a system with an Intel Gen9 integrated + graphics processor and a 9th Generation Coffee Lake CPU is shown in + :ref:`ex_max_work_item_output`. + + .. code-block:: bash + :caption: **OUTPUT:** Query maximum number of work-items for a device + :name: ex_max_work_item_output + + Name Intel(R) UHD Graphics 630 [0x3e98] + Driver version 1.3.24595 + Vendor Intel(R) Corporation + Filter string level_zero:gpu:0 + + 4294967296 (2^32) + Name Intel(R) Core(TM) i7-9700 CPU @ 3.00GHz + Driver version 2023.16.12.0.12_195853.xmain-hotfix + Vendor Intel(R) Corporation + Filter string opencl:cpu:0 + + 4503599627370496 (2^52) + + +The ``call_kernel`` function can be invoked both from CPython and from another +Numba* compiled function. Note that the ``call_kernel`` function supports only +synchronous execution of kernel and the ``call_kernel_async`` function should be +used for asynchronous mode of kernel execution (refer +:ref:`launching-an-async-kernel`). + + +.. seealso:: + + Refer the API documentation for + :func:`numba_dpex.core.kernel_launcher.call_kernel` for more details. diff --git a/docs/source/user_guide/kernel_programming/device-functions.rst b/docs/source/user_guide/kernel_programming/device-functions.rst index 0ad5a10441..b9dd914a0a 100644 --- a/docs/source/user_guide/kernel_programming/device-functions.rst +++ b/docs/source/user_guide/kernel_programming/device-functions.rst @@ -1,27 +1,86 @@ -Writing Device Functions -======================== +Numba-dpex provides a decorator to express auxiliary device-only functions that +can be called from a kernel or another device function, but are not callable +from the host. This decorator :func:`numba_dpex.core.decorators.device_func` has +no direct analogue in SYCL and primarily is provided to help programmers make +their kapi applications modular. :ref:`ex_device_func1` shows a simple usage of +the ``device_func`` decorator. -The user-level API of SYCL does not have a notion for device-only functions, -*i.e.* functions that can be only invoked from a kernel and not from a host -function. However, numba-dpex provides a special decorator -``numba_dpex.func`` specifically to implement device functions. +.. code-block:: python + :linenos: + :caption: **Example:** Basic usage of device_func + :name: ex_device_func1 -.. literalinclude:: ./../../../../numba_dpex/examples/kernel/device_func.py - :pyobject: a_device_function + import dpnp -To use a device function from an another device function: + import numba_dpex as dpex + from numba_dpex import kernel_api as kapi -.. literalinclude:: ./../../../../numba_dpex/examples/kernel/device_func.py - :pyobject: another_device_function + # Array size + N = 10 -To use a device function from a kernel function ``numba_dpex.kernel``: -.. 
literalinclude:: ./../../../../numba_dpex/examples/kernel/device_func.py - :pyobject: a_kernel_function + @dpex.device_func + def a_device_function(a): + """A device callable function that can be invoked from a kernel or + another device function. + """ + return a + 1 -Unlike a kernel function, a device function can return a value like normal -functions. -.. todo:: + @dpex.kernel + def a_kernel_function(item: kapi.Item, a, b): + """Demonstrates calling a device function from a kernel.""" + i = item.get_id(0) + b[i] = a_device_function(a[i]) - Specific capabilities and limitations for device functions need to be added. + + N = 16 + a = dpnp.ones(N, dtype=dpnp.int32) + b = dpnp.zeros(N, dtype=dpnp.int32) + + dpex.call_kernel(a_kernel_function, dpex.Range(N), a, b) + + +.. code-block:: python + :linenos: + :caption: **Example:** Using kapi functionalities in a device_func + :name: ex_device_func2 + + import dpnp + + import numba_dpex as dpex + from numba_dpex import kernel_api as kapi + + + @dpex.device_func + def increment_value(nd_item: kapi.NdItem, a): + """Demonstrates the usage of group_barrier and NdItem usage in a + device_func. + """ + i = nd_item.get_global_id(0) + + a[i] += 1 + kapi.group_barrier(nd_item.get_group(), kapi.MemoryScope.DEVICE) + + if i == 0: + for idx in range(1, a.size): + a[0] += a[idx] + + + @dpex.kernel + def another_kernel(nd_item: kapi.NdItem, a): + """The kernel does everything by calling a device_func.""" + increment_value(nd_item, a) + + + N = 16 + b = dpnp.ones(N, dtype=dpnp.int32) + + dpex.call_kernel(another_kernel, dpex.NdRange((N,), (N,)), b) + + +A device function does not require the first argument to be an index space id +class, and unlike a kernel function a device function is allowed to return a +value. All kapi functionality can be used in a ``device_func`` decorated +function and at compilation stage numba-dpex will attempt to inline a +``device_func`` into the kernel where it is used. 
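A ``device_func`` can also call other device functions. The following sketch is an
illustrative variation of :ref:`ex_device_func1` (not an additional documented example)
that chains two device functions inside a kernel:

.. code-block:: python
    :linenos:

    import dpnp

    import numba_dpex as dpex
    from numba_dpex import kernel_api as kapi


    @dpex.device_func
    def a_device_function(a):
        return a + 1


    @dpex.device_func
    def another_device_function(a):
        # A device function may freely call other device functions
        return a_device_function(a * 2)


    @dpex.kernel
    def a_kernel_function(item: kapi.Item, a, b):
        i = item.get_id(0)
        b[i] = another_device_function(a[i])


    N = 16
    a = dpnp.ones(N, dtype=dpnp.int32)
    b = dpnp.zeros(N, dtype=dpnp.int32)

    dpex.call_kernel(a_kernel_function, dpex.Range(N), a, b)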
diff --git a/docs/source/user_guide/kernel_programming/dpnp-ufuncs.csv b/docs/source/user_guide/kernel_programming/dpnp-ufuncs.csv new file mode 100644 index 0000000000..d79d728a5f --- /dev/null +++ b/docs/source/user_guide/kernel_programming/dpnp-ufuncs.csv @@ -0,0 +1,80 @@ + +Name, Supported types, Notes +``dpnp.add``, ``types.float32`` ``types.float64`` ``types.int32`` ``types.int64``, +``dpnp.arctan2``, ``types.float32`` ``types.float64`` ``types.int32`` ``types.int64``, Not supported on devices that lack FP64 support +``dpnp.bitwise_and``, ``types.int32`` ``types.int64``, +``dpnp.bitwise_or``, ``types.int32`` ``types.int64``, +``dpnp.bitwise_xor``,``types.int32`` ``types.int64``, +``dpnp.copysign``, ``types.float32`` ``types.float64``, +``dpnp.divide``, ``types.float32`` ``types.float64``, +``dpnp.equal``, ``types.float32`` ``types.float64`` ``types.int32`` ``types.int64``, +``dpnp.floor_divide``, ``types.float32`` ``types.float64`` ``types.int32`` ``types.int64``, +``dpnp.fmax``, ``types.float32`` ``types.float64`` ``types.int32`` ``types.int64``, +``dpnp.fmin``, ``types.float32`` ``types.float64`` ``types.int32`` ``types.int64``, +``dpnp.fmod``, ``types.float32`` ``types.float64``, +``dpnp.greater``, ``types.float32`` ``types.float64`` ``types.int32`` ``types.int64``, +``dpnp.greater_equal``, ``types.float32`` ``types.float64`` ``types.int32`` ``types.int64``, +``dpnp.hypot``, ``types.float32`` ``types.float64``, +``dpnp.left_shift``, ``types.int32`` ``types.int64``, +``dpnp.less``, ``types.float32`` ``types.float64`` ``types.int32`` ``types.int64``, +``dpnp.less_equal``, ``types.float32`` ``types.float64`` ``types.int32`` ``types.int64``, +``dpnp.logical_and``, ``types.float32`` ``types.float64`` ``types.int32`` ``types.int64``, +``dpnp.logical_or``, ``types.float32`` ``types.float64`` ``types.int32`` ``types.int64``, +``dpnp.logical_xor``, ``types.float32`` ``types.float64`` ``types.int32`` ``types.int64``, +``dpnp.maximum``, ``types.float32`` ``types.float64`` ``types.int32`` ``types.int64``, +``dpnp.minimum``, ``types.float32`` ``types.float64`` ``types.int32`` ``types.int64``, +``dpnp.mod``, ``types.int32`` ``types.int64``, +``dpnp.multiply``,``types.float32`` ``types.float64`` ``types.int32`` ``types.int64``, +``dpnp.not_equal``, ``types.float32`` ``types.float64`` ``types.int32`` ``types.int64``, +``dpnp.power``, ``types.float32`` ``types.float64``, +``dpnp.remainder``, ``types.float32`` ``types.float64`` ``types.int32`` ``types.int64``, +``dpnp.right_shift``, ``types.float32`` ``types.float64``, +``dpnp.subtract``, ``types.float32`` ``types.float64`` ``types.int32`` ``types.int64``, +``dpnp.true_divide``, ``types.float32`` ``types.float64``, +``dpnp.abs``, ``types.float32`` ``types.float64`` ``types.int32`` ``types.int64``, +``dpnp.absolute``, ``types.float32`` ``types.float64`` ``types.int32`` ``types.int64``, +``dpnp.arccos``, ``types.float32`` ``types.float64``, +``dpnp.arccosh``, ``types.float32`` ``types.float64``, Not supported on Intel Xe (Gen12) GPUs +``dpnp.arcsin``, ``types.float32`` ``types.float64``, +``dpnp.arcsinh``, ``types.float32`` ``types.float64``, +``dpnp.arctan``, ``types.float32`` ``types.float64``, +``dpnp.arctanh``, ``types.float32`` ``types.float64``, +``dpnp.bitwise_not``, ``types.int32`` ``types.int64``, +``dpnp.cbrt``, N/A , Not supported, +``dpnp.ceil``, ``types.float32`` ``types.float64``, +``dpnp.conjugate``, ``types.float32`` ``types.float64`` ``types.int32`` ``types.int64``, +``dpnp.cos``, ``types.float32`` ``types.float64``, +``dpnp.cosh``, 
``types.float32`` ``types.float64``, +``dpnp.deg2rad``, ``types.float32`` ``types.float64``, +``dpnp.degrees``, ``types.float32`` ``types.float64``, +``dpnp.erf``, ``types.float32`` ``types.float64``, +``dpnp.exp``, ``types.float32`` ``types.float64``, +``dpnp.exp2``, ``types.float32`` ``types.float64``, +``dpnp.expm1``, ``types.float32`` ``types.float64``, Not supported on Intel Xe (Gen12) GPUs +``dpnp.fabs``, ``types.float32`` ``types.float64``, +``dpnp.floor``, ``types.float32`` ``types.float64``, +``dpnp.frexp``, N/A , Not supported +``dpnp.invert``, ``types.int32`` ``types.int64``, +``dpnp.isfinite``, ``types.float32`` ``types.float64`` ``types.int32`` ``types.int64``, +``dpnp.isinf``, ``types.float32`` ``types.float64`` ``types.int32`` ``types.int64``, +``dpnp.isnan``, ``types.float32`` ``types.float64`` ``types.int32`` ``types.int64``, +``dpnp.log``, ``types.float32`` ``types.float64``, Not supported on Intel Xe (Gen12) GPUs +``dpnp.log10``, ``types.float32`` ``types.float64``, Not supported on Intel Xe (Gen12) GPUs +``dpnp.log1p``, ``types.float32`` ``types.float64``, +``dpnp.log2``, ``types.float32`` ``types.float64``, Not supported on Intel Xe (Gen12) GPUs +``dpnp.log2``, N/A , Not supported +``dpnp.logical_not``, ``types.float32`` ``types.float64`` ``types.int32`` ``types.int64``, +``dpnp.logaddexp``, N/A , Not supported +``dpnp.logaddexp2``, N/A , Not supported +``dpnp.negative``, ``types.float32`` ``types.float64`` ``types.int32`` ``types.int64``, +``dpnp.rad2deg``, ``types.float32`` ``types.float64``, +``dpnp.radians``, ``types.float32`` ``types.float64``, +``dpnp.reciprocal``, ``types.float32`` ``types.float64``, +``dpnp.sign``, ``types.float32`` ``types.float64`` ``types.int32`` ``types.int64``, Not supported on Intel Xe (Gen12) GPUs +``dpnp.sin``, ``types.float32`` ``types.float64``, +``dpnp.sinh``, ``types.float32`` ``types.float64``, +``dpnp.sqrt``, ``types.float32`` ``types.float64``, +``dpnp.square``, ``types.float32`` ``types.float64`` ``types.int32`` ``types.int64``, +``dpnp.tan``, ``types.float32`` ``types.float64``, +``dpnp.tanh``, ``types.float32`` ``types.float64``, +``dpnp.trunc``, ``types.float32`` ``types.float64``, diff --git a/docs/source/user_guide/kernel_programming/index.rst b/docs/source/user_guide/kernel_programming/index.rst index b40d5845fe..a28564c813 100644 --- a/docs/source/user_guide/kernel_programming/index.rst +++ b/docs/source/user_guide/kernel_programming/index.rst @@ -2,105 +2,165 @@ .. include:: ./../../ext_links.txt Kernel Programming -================== +################## -The tutorial covers the most important features of the KAPI kernel programming -API and introduces the concepts needed to express data-parallel kernels in -numba-dpex. +The tutorial covers the numba-dpex kernel programming API (kapi) and introduces +the concepts needed to write data-parallel kernels in numba-dpex. -Preliminary concepts --------------------- +.. Preliminary concepts +.. -------------------- -Data parallelism -++++++++++++++++ +.. Data parallelism +.. ++++++++++++++++ -Single Program Multiple Data -++++++++++++++++++++++++++++ +.. Single Program Multiple Data +.. ++++++++++++++++++++++++++++ -Range v/s Nd-Range Kernels -++++++++++++++++++++++++++ +.. Range v/s Nd-Range Kernels +.. ++++++++++++++++++++++++++ -Work items and Work groups -++++++++++++++++++++++++++ - -Basic concepts --------------- +.. Work items and Work groups +.. 
++++++++++++++++++++++++++ +Core concepts +************* Writing a *range* kernel -++++++++++++++++++++++++ - -A *range* kernel represents the simplest form of parallelism that can be -expressed in KAPI. A range kernel represents a data-parallel execution of the -same function by a set of work items. In KAPI, an instance of the -:py:class:`numba_dpex.kernel_api.Range` class represents the set of work items -and each work item in the ``Range`` is represented by an instance of the -:py:class:`numba_dpex.kernel_api.Item` class. As such these two classes are -essential to writing a range kernel in KAPI. - -.. literalinclude:: ./../../../../numba_dpex/examples/kernel/vector_sum.py - :language: python - :lines: 8-9, 11-15 - :caption: **EXAMPLE:** A KAPI range kernel - :name: ex_kernel_declaration_vector_sum - -:ref:`ex_kernel_declaration_vector_sum` shows an example of a range kernel. -Every range kernel requires its first argument to be an ``Item`` and -needs to be launched via :py:func:`numba_dpex.experimental.launcher.call_kernel` -by passing an instance a ``Range`` object. - -Do note that a ``Range`` object only controls the creation of work items, the -distribution of work and data over a ``Range`` still needs to be defined by the -user-written function. In the example, each work item access a single element of -each of the three array and performs a single addition operation. It is possible -to write the kernel differently so that each work item accesses multiple data -elements or conditionally performs different amount of work. The data access -patterns in a work item can have performance implications and programmers should -refer a more topical material such as the `oneAPI GPU optimization guide`_ to -learn more. - -A range kernel is meant to express a basic `parallel-for` calculation that is -ideally suited for embarrassingly parallel kernels such as elementwise -computations over ndarrays. The API for expressing a range kernel does not -allow advanced features such as synchronization of work items and fine-grained -control over memory allocation on a device. +======================== + +.. include:: ./writing-range-kernel.rst Writing an *nd-range* kernel -++++++++++++++++++++++++++++ +============================ + +.. include:: ./writing-ndrange-kernel.rst + +.. Launching a kernel +.. ================== + +.. include:: ./call-kernel.rst The ``device_func`` decorator -+++++++++++++++++++++++++++++ +============================= + +.. include:: ./device-functions.rst + + +Supported types of kernel arguments +=================================== + +A kapi kernel function can have both array and scalar arguments. At least one of +the arguments to every kernel function has to be an array. The requirement is +enforced so that an execution queue can be inferred at the kernel launch stage. +An array type argument is passed by reference to the kernel, and all scalar +arguments are passed by value. + +Supported array types +--------------------- +- `dpctl.tensor.usm_ndarray`_ : A SYCL-based Python Array API compliant tensor. +- `dpnp.ndarray`_ : A ``numpy.ndarray``-like array container that supports SYCL USM memory allocation. + +Scalar types +------------ + +Scalar values can be passed to a kernel function either using the default Python +scalar type or as explicit NumPy or dpnp data type objects. +:ref:`ex_scalar_kernel_arg_ty` shows the two possible ways of defining a scalar +type. 
In both scenarios, numba-dpex depends on the default Numba* type inferring +algorithm to determine the LLVM IR type of a Python object that represents a +scalar value. At the kernel submission stage the LLVM IR type is reinterpreted +as a C++11 type to interoperate with the underlying SYCL runtime. + +.. code-block:: python + :caption: **Example:** Ways of defining a scalar kernel argument + :name: ex_scalar_kernel_arg_ty + + import dpnp + + a = 1 + b = dpnp.dtype("int32").type(1) -Supported mathematical operations -+++++++++++++++++++++++++++++++++ + print(type(a)) + print(type(b)) -Supported Python operators -++++++++++++++++++++++++++ +.. code-block:: bash + :caption: **Output:** Ways of defining a scalar kernel argument + :name: ex_scalar_kernel_arg_ty_output -Supported kernel arguments -++++++++++++++++++++++++++ + + -Launching a kernel -++++++++++++++++++ +The following scalar types are currently supported as arguments of a numba-dpex +kernel function: -Advanced topics ---------------- +- ``int`` +- ``float`` +- ``complex`` +- ``numpy.int32`` +- ``numpy.uint32`` +- ``numpy.int64`` +- ``numpy.uint32`` +- ``numpy.float32`` +- ``numpy.float64`` + +.. important:: + + The Numba* type inferring algorithm by default infers a native Python + scalar type to be a 64-bit value. The algorithm is defined that way to be + consistent with the default CPython behavior. The default inferred 64-bit + type can cause compilation failures on platforms that do not have native + 64-bit floating point support. Another potential fallout of the default + 64-bit type inference can be when a narrower width type is required by a + specific kernel. To avoid these issues, users are advised to always use a + dpnp/numpy type object to explicitly define the type of a scalar value. + +DLPack support +-------------- +At this time direct support for the `DLPack`_ protocol is has not been added to +numba-dpex. To interoperate numba_dpex with other SYCL USM based libraries, +users should first convert their input tensor or ndarray object into either of +the two supported array types, both of which support DLPack. + + +Supported Python features +************************* + +Mathematical operations +======================= + +.. include:: ./math-functions.rst + +Operators +========= + +.. include:: ./operators.rst + +General Python features +======================= + +.. include:: ./supported-python-features.rst + + +Advanced concepts +***************** Local memory allocation -+++++++++++++++++++++++ +======================= Private memory allocation -+++++++++++++++++++++++++ +========================= Group barrier synchronization -+++++++++++++++++++++++++++++ +============================= Atomic operations -+++++++++++++++++ +================= + +.. Async kernel execution +.. ====================== -Async kernel execution -++++++++++++++++++++++ +.. 
include:: ./call-kernel-async.rst Specializing a kernel or a device_func -++++++++++++++++++++++++++++++++++++++ +====================================== diff --git a/docs/source/user_guide/kernel_programming/math-functions.csv b/docs/source/user_guide/kernel_programming/math-functions.csv new file mode 100644 index 0000000000..f4532e8b79 --- /dev/null +++ b/docs/source/user_guide/kernel_programming/math-functions.csv @@ -0,0 +1,49 @@ +Name, Supported signature +``math.isnan``, ``types.float32(types.float32)``; ``types.float64(types.float64)`` +``math.isinf``,``types.float32(types.float32)``; ``types.float64(types.float64)`` +``math.ceil``, ``types.float32(types.float32)``; ``types.float64(types.float64)`` +``math.floor``, ``types.float32(types.float32)``; ``types.float64(types.float64)`` +``math.trunc``, ``types.float32(types.float32)``; ``types.float64(types.float64)`` +``math.fabs``, ``types.float32(types.float32)``; ``types.float64(types.float64)`` +``math.sqrt``, ``types.float32(types.float32)``; ``types.float64(types.float64)`` +``math.exp``, ``types.float32(types.float32)``; ``types.float64(types.float64)`` +``math.expm1``, ``types.float32(types.float32)``; ``types.float64(types.float64)`` +``math.log``, ``types.float32(types.float32)``; ``types.float64(types.float64)`` +``math.log10``, ``types.float32(types.float32)``; ``types.float64(types.float64)`` +``math.log1p``, ``types.float32(types.float32)``; ``types.float64(types.float64)`` +``math.sin``, ``types.float32(types.float32)``; ``types.float64(types.float64)`` +``math.cos``, ``types.float32(types.float32)``; ``types.float64(types.float64)`` +``math.tan``, ``types.float32(types.float32)``; ``types.float64(types.float64)`` +``math.asin``, ``types.float32(types.float32)``; ``types.float64(types.float64)`` +``math.acos``, ``types.float32(types.float32)``; ``types.float64(types.float64)`` +``math.atan``, ``types.float32(types.float32)``; ``types.float64(types.float64)`` +``math.sinh``, ``types.float32(types.float32)``; ``types.float64(types.float64)`` +``math.cosh``, ``types.float32(types.float32)``; ``types.float64(types.float64)`` +``math.tanh``, ``types.float32(types.float32)``; ``types.float64(types.float64)`` +``math.asinh``, ``types.float32(types.float32)``; ``types.float64(types.float64)`` +``math.acosh``, ``types.float32(types.float32)``; ``types.float64(types.float64)`` +``math.atanh``, ``types.float32(types.float32)``; ``types.float64(types.float64)`` +``math.exp2``, ``types.float32(types.float32)``; ``types.float64(types.float64)`` +``math.log2``, ``types.float32(types.float32)``; ``types.float64(types.float64)`` +``math.erf``, ``types.float32(types.float32)``; ``types.float64(types.float64)`` +``math.erfc``, ``types.float32(types.float32)``; ``types.float64(types.float64)`` +``math.gamma``, ``types.float32(types.float32)``; ``types.float64(types.float64)`` +``math.lgamma``, ``types.float32(types.float32)``; ``types.float64(types.float64)`` +``math.copysign``, "``types.float32(types.float32, types.float32)``; ``types.float64(types.float64, types.float64)``" +``math.atan2``, "``types.float32(types.float32, types.float32)``; ``types.float64(types.float64, types.float64)``" +``math.pow``, "``types.float32(types.float32, types.float32)``; ``types.float64(types.float64, types.float64)``" +``math.fmod``, "``types.float32(types.float32, types.float32)``; ``types.float64(types.float64, types.float64)``" +``math.ldexp``, "``types.float32(types.float32, types.int32)``; +``types.float32(types.float32, types.int64)``; 
+``types.float64(types.float64, types.int32)``;
+``types.float64(types.float64, types.int64)``"
+``math.hypot``, "``types.float32(types.float32, types.int32)``;
+``types.float32(types.float32, types.int64)``;
+``types.float64(types.float64, types.int32)``;
+``types.float64(types.float64, types.int64)``"
+``math.frexp``, Not supported
+``math.ldexp``, Not supported
+``math.trunc``, Not supported
+``math.modf``, Not supported
+``math.factorial``, Not supported
+``math.fsum``, Not supported
diff --git a/docs/source/user_guide/kernel_programming/math-functions.rst b/docs/source/user_guide/kernel_programming/math-functions.rst
new file mode 100644
index 0000000000..284f89e06e
--- /dev/null
+++ b/docs/source/user_guide/kernel_programming/math-functions.rst
@@ -0,0 +1,22 @@
+
+Scalar mathematical functions from the Python `math`_ module and the `dpnp`_
+library can be used inside a kernel function. During compilation, these
+mathematical functions are lowered to device-specific intrinsic instructions.
+
+
+.. csv-table:: Current support matrix of ``math`` module functions
+    :file: ./math-functions.csv
+    :widths: 30, 70
+    :header-rows: 1
+
+.. caution::
+
+    The supported signature for some of the ``math`` module functions in the
+    compiled mode differs from CPython. The divergence in behavior is a known
+    issue. Please refer to https://github.com/IntelPython/numba-dpex/issues/759
+    for updates.
+
+.. csv-table:: Current support matrix of ``dpnp`` functions
+    :file: ./dpnp-ufuncs.csv
+    :widths: auto
+    :header-rows: 1
diff --git a/docs/source/user_guide/kernel_programming/operators.csv b/docs/source/user_guide/kernel_programming/operators.csv
new file mode 100644
index 0000000000..71e855cedb
--- /dev/null
+++ b/docs/source/user_guide/kernel_programming/operators.csv
@@ -0,0 +1,35 @@
+Name, Operator, Note
+Addition, ``+``,
+Multiplication, ``*``,
+Subtraction, ``-``,
+Division, ``/``,
+Floor Division, ``//``,
+Modulo, ``%``,
+Exponent, ``**``,
+In-place Addition, ``+=``,
+In-place Subtraction, ``-=``,
+In-place Division, ``/=``,
+In-place Floor Division, ``//=``,
+In-place Modulo, ``%=``,
+In-place Exponent, ``**=``, Only supported on OpenCL CPU devices
+Bitwise And, ``&``,
+Bitwise Left Shift, ``<<``,
+Bitwise Right Shift, ``>>``,
+Bitwise Or, ``|``,
+Bitwise Exclusive Or, ``^``,
+In-place Bitwise And, ``&=``,
+In-place Bitwise Left Shift, ``<<=``,
+In-place Bitwise Right Shift, ``>>=``,
+In-place Bitwise Or, ``|=``,
+In-place Bitwise Exclusive Or, ``^=``,
+Negation, ``-``,
+Complement, ``~``,
+Pos, ``+``,
+Less Than, ``<``,
+Less Than Equal, ``<=``,
+Greater Than, ``>``,
+Greater Than Equal, ``>=``,
+Equal To, ``==``,
+Not Equal To, ``!=``,
+Matmul, ``@``, **Not supported**
+In-place Matmul, ``@=``, **Not supported**
diff --git a/docs/source/user_guide/kernel_programming/operators.rst b/docs/source/user_guide/kernel_programming/operators.rst
new file mode 100644
index 0000000000..f07ef4c986
--- /dev/null
+++ b/docs/source/user_guide/kernel_programming/operators.rst
@@ -0,0 +1,6 @@
+The following table lists the Python operators that can be used in a ``kernel``
+or ``device_func`` decorated function.
+
+.. csv-table:: Current support matrix of Python operators
+    :file: ./operators.csv
+    :header-rows: 1
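+
+As an illustration, the example below uses a few of the supported operators
+inside a kernel. It is a minimal sketch; the kernel name, array sizes, and
+dtypes are placeholders:
+
+.. code-block:: python
+
+    import dpnp
+    import numba_dpex as dpex
+    from numba_dpex import kernel_api as kapi
+
+
+    @dpex.kernel
+    def scale_and_clip(item: kapi.Item, a, b, out):
+        i = item.get_id(0)
+        # arithmetic, exponent, and comparison operators inside a kernel
+        out[i] = a[i] * 2.0 + b[i] ** 2
+        if out[i] > 10.0:
+            out[i] = 10.0
+
+
+    a = dpnp.ones(16, dtype=dpnp.float32)
+    b = dpnp.ones_like(a)
+    out = dpnp.zeros_like(a)
+    dpex.call_kernel(scale_and_clip, kapi.Range(16), a, b, out)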
diff --git a/docs/source/user_guide/kernel_programming/supported-python-features.rst b/docs/source/user_guide/kernel_programming/supported-python-features.rst
index fcfc41b904..56b9c91aaa 100644
--- a/docs/source/user_guide/kernel_programming/supported-python-features.rst
+++ b/docs/source/user_guide/kernel_programming/supported-python-features.rst
@@ -1,8 +1,15 @@
-Supported Python Features inside ``numba_dpex.kernel``
-======================================================
-This page lists the Python features supported inside a ``numba_dpex.kernel``
-function.
+A kapi function, when run in the purely interpreted mode by the CPython
+interpreter, is a regular Python function, and as such, in theory, any Python
+feature can be used in the body of the function. In practice, only a subset of
+the Python language features is supported in a kapi function so that it can be
+JIT compiled and executed on a device. The restriction stems both from
+limitations in the Numba compiler tooling and from the device-specific calling
+convention and other restrictions imposed by a device's ABI.
+
+This section provides a partial support matrix for Python features with respect
+to their usage in a kapi function.
+

 Built-in types
 --------------
@@ -31,65 +38,6 @@ The following built-in functions are supported:
 - ``range()``
 - ``round()``

-Standard library modules
-------------------------
-
-The following functions from the math module are supported:
-
-- ``math.acos()``
-- ``math.asin()``
-- ``math.atan()``
-- ``math.acosh()``
-- ``math.asinh()``
-- ``math.atanh()``
-- ``math.cos()``
-- ``math.sin()``
-- ``math.tan()``
-- ``math.cosh()``
-- ``math.sinh()``
-- ``math.tanh()``
-- ``math.erf()``
-- ``math.erfc()``
-- ``math.exp()``
-- ``math.expm1()``
-- ``math.fabs()``
-- ``math.gamma()``
-- ``math.lgamma()``
-- ``math.log()``
-- ``math.log10()``
-- ``math.log1p()``
-- ``math.sqrt()``
-- ``math.ceil()``
-- ``math.floor()``
-
-The following functions from the operator module are supported:
-
-- ``operator.add()``
-- ``operator.eq()``
-- ``operator.floordiv()``
-- ``operator.ge()``
-- ``operator.gt()``
-- ``operator.iadd()``
-- ``operator.ifloordiv()``
-- ``operator.imod()``
-- ``operator.imul()``
-- ``operator.ipow()``
-- ``operator.isub()``
-- ``operator.itruediv()``
-- ``operator.le()``
-- ``operator.lshift()``
-- ``operator.lt()``
-- ``operator.mod()``
-- ``operator.mul()``
-- ``operator.ne()``
-- ``operator.neg()``
-- ``operator.not_()``
-- ``operator.or_()``
-- ``operator.pos()``
-- ``operator.pow()``
-- ``operator.sub()``
-- ``operator.truediv()``
-
 Unsupported Constructs
 ----------------------
@@ -101,10 +49,3 @@ The following Python constructs are **not supported**:
 - Generator (any ``yield`` statements)
 - The ``raise`` statement
 - The ``assert`` statement
-
-
-NumPy support
--------------
-
-NumPy functions are whole array operations and are not supported within a
-``numba_dpex.kernel``.
diff --git a/docs/source/user_guide/kernel_programming/writing-ndrange-kernel.rst b/docs/source/user_guide/kernel_programming/writing-ndrange-kernel.rst
new file mode 100644
index 0000000000..43632176d3
--- /dev/null
+++ b/docs/source/user_guide/kernel_programming/writing-ndrange-kernel.rst
@@ -0,0 +1,129 @@
+
+In a range kernel, execution is scheduled over a set of work-items without
+any explicit grouping of the work-items.
The basic form of parallelism +that can be expressed using a range kernel does not allow expressing any notion +of locality within the kernel. To get around that limitation, kapi provides a +second form of expressing a parallel kernel that is called an *nd-range* kernel. +An nd-range kernel represents a data-parallel execution of the kernel by a set +of explicitly defined groups of work-items. An individual group of work-items is +called a *work-group*. :ref:`ex_matmul_kernel` demonstrates an nd-range kernel +and some of the advanced features programmers can use in this type of kernel. + +.. code-block:: python + :linenos: + :caption: **Example:** Sliding window matrix multiplication as an nd-range kernel + :name: ex_matmul_kernel + + from numba_dpex import kernel_api as kapi + import numba_dpex as dpex + import numpy as np + import dpctl.tensor as dpt + + square_block_side = 2 + work_group_size = (square_block_side, square_block_side) + dtype = np.float32 + + + @dpex.kernel + def matmul( + nditem: kapi.NdItem, + X, # IN READ-ONLY (X_n_rows, n_cols) + y, # IN READ-ONLY (n_cols, y_n_rows), + X_slm, # SLM to store a sliding window over X + Y_slm, # SLM to store a sliding window over Y + result, # OUT (X_n_rows, y_n_rows) + ): + X_n_rows = X.shape[0] + Y_n_cols = y.shape[1] + n_cols = X.shape[1] + + result_row_idx = nditem.get_global_id(0) + result_col_idx = nditem.get_global_id(1) + + local_row_idx = nditem.get_local_id(0) + local_col_idx = nditem.get_local_id(1) + + n_blocks_for_cols = n_cols // square_block_side + if (n_cols % square_block_side) > 0: + n_blocks_for_cols += 1 + + output = dtype(0) + + gr = nditem.get_group() + + for block_idx in range(n_blocks_for_cols): + X_slm[local_row_idx, local_col_idx] = dtype(0) + Y_slm[local_row_idx, local_col_idx] = dtype(0) + if (result_row_idx < X_n_rows) and ( + (local_col_idx + (square_block_side * block_idx)) < n_cols + ): + X_slm[local_row_idx, local_col_idx] = X[ + result_row_idx, local_col_idx + (square_block_side * block_idx) + ] + + if (result_col_idx < Y_n_cols) and ( + (local_row_idx + (square_block_side * block_idx)) < n_cols + ): + Y_slm[local_row_idx, local_col_idx] = y[ + local_row_idx + (square_block_side * block_idx), result_col_idx + ] + + kapi.group_barrier(gr) + + for idx in range(square_block_side): + output += X_slm[local_row_idx, idx] * Y_slm[idx, local_col_idx] + + kapi.group_barrier(gr) + + if (result_row_idx < X_n_rows) and (result_col_idx < Y_n_cols): + result[result_row_idx, result_col_idx] = output + + + def _arange_reshaped(shape, dtype): + n_items = shape[0] * shape[1] + return np.arange(n_items, dtype=dtype).reshape(shape) + + + X = _arange_reshaped((5, 5), dtype) + Y = _arange_reshaped((5, 5), dtype) + X = dpt.asarray(X) + Y = dpt.asarray(Y) + device = X.device.sycl_device + result = dpt.zeros((5, 5), dtype, device=device) + X_slm = kapi.LocalAccessor(shape=work_group_size, dtype=dtype) + Y_slm = kapi.LocalAccessor(shape=work_group_size, dtype=dtype) + + dpex.call_kernel(matmul, kapi.NdRange((6, 6), (2, 2)), X, Y, X_slm, Y_slm, result) + + +When writing an nd-range kernel, a programmer defines a set of groups of +work-items instead of a flat execution range.There are several semantic rules +associated both with a work-group and the work-items in a work-group: + +- Each work-group gets executed in an arbitrary order by the underlying + runtime and programmers should not assume any implicit ordering. 
+
+- Work-items in different work-groups cannot communicate with each other except
+  via atomic operations on global memory.
+
+- Work-items within a work-group share a common memory region called
+  "shared local memory" (SLM). Depending on the device, the SLM may be mapped
+  to a dedicated fast memory.
+
+- Work-items in a work-group can synchronize using a
+  :func:`numba_dpex.kernel_api.group_barrier` operation that can additionally
+  guarantee memory consistency using a *work-group memory fence*.
+
+.. note::
+
+    The SYCL language provides additional features for work-items in a
+    work-group such as *group functions* that specify communication routines
+    across work-items and also implement patterns such as reduction and scan.
+    These features are not yet available in numba-dpex.
+
+An nd-range kernel needs to be launched with an instance of the
+:py:class:`numba_dpex.kernel_api.NdRange` class and the first
+argument to an nd-range kernel has to be an instance of
+:py:class:`numba_dpex.kernel_api.NdItem`. Apart from the need to provide an
+``NdItem`` parameter, the rest of the semantic rules that apply to a range
+kernel also apply to an nd-range kernel.
diff --git a/docs/source/user_guide/kernel_programming/writing-range-kernel.rst b/docs/source/user_guide/kernel_programming/writing-range-kernel.rst
new file mode 100644
index 0000000000..70daf10975
--- /dev/null
+++ b/docs/source/user_guide/kernel_programming/writing-range-kernel.rst
@@ -0,0 +1,81 @@
+A *range* kernel represents the simplest form of parallelism that can be
+expressed in numba-dpex using kapi. Such a kernel represents a data-parallel
+execution over a set of work-items, with each work-item representing a logical
+thread of execution. :ref:`ex_vecadd_kernel` shows an example of a range kernel
+written in numba-dpex.
+
+.. code-block:: python
+    :linenos:
+    :caption: **Example:** Vector addition using a range kernel
+    :name: ex_vecadd_kernel
+    :emphasize-lines: 9,17
+
+    import dpnp
+    import numba_dpex as dpex
+    from numba_dpex import kernel_api as kapi
+
+
+    # Data parallel kernel implementing vector sum
+    @dpex.kernel
+    def vecadd(item: kapi.Item, a, b, c):
+        i = item.get_id(0)
+        c[i] = a[i] + b[i]
+
+
+    N = 1024
+    a = dpnp.ones(N)
+    b = dpnp.ones_like(a)
+    c = dpnp.zeros_like(a)
+    dpex.call_kernel(vecadd, kapi.Range(N), a, b, c)
+
+The highlighted lines in the example demonstrate the definition of the
+execution range on **line 17** and the extraction of each work-item's *id*,
+or index position, via the ``item.get_id`` call on **line 9**. An execution
+range comprising 1024 work-items is defined when calling the kernel, and each
+work-item then executes a single addition.
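+
+Once ``call_kernel`` returns, the computed values are available in ``c`` and
+can be inspected on the host. A minimal, illustrative check (assuming the
+arrays from the example above):
+
+.. code-block:: python
+
+    import numpy as np
+
+    # every work-item wrote a[i] + b[i] == 2.0
+    assert np.allclose(dpnp.asnumpy(c), 2.0)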
+
+There are a few semantic rules that have to be adhered to when writing a range
+kernel:
+
+* Analogous to the API of SYCL, a range kernel can execute only over a 1-, 2-,
+  or a 3-dimensional set of work-items.
+
+* Every range kernel requires its first argument to be an instance of the
+  :class:`numba_dpex.kernel_api.Item` class. The ``Item`` object is an
+  abstraction encapsulating the index position (id) of a single work-item in the
+  global execution range. The id will be a 1-, 2-, or a 3-tuple depending on
+  the dimensionality of the execution range.
+
+* A range kernel cannot return any value.
+
+  **Note** that the rule is enforced only in
+  the compiled mode and not in the pure Python execution of a kapi kernel.
+
+* A kernel can accept both array and scalar arguments. Array arguments currently
+  can either be a ``dpnp.ndarray`` or a ``dpctl.tensor.usm_ndarray``. Scalar
+  values can be of any Python numeric type. Array arguments are passed by
+  reference, *i.e.*, changes to an array in a kernel are visible outside the
+  kernel. Scalar values are always passed by value.
+
+* At least one argument of a kernel should be an array. This requirement exists
+  so that the kernel launcher (:func:`numba_dpex.core.kernel_launcher.call_kernel`)
+  can determine the execution queue on which to launch the kernel. Refer to the
+  :ref:`launching-a-kernel` section for more details.
+
+A range kernel has to be executed via the
+:py:func:`numba_dpex.core.kernel_launcher.call_kernel` function by passing in
+an instance of the :class:`numba_dpex.kernel_api.Range` class. Refer to the
+:ref:`launching-a-kernel` section for more details on how to launch a range
+kernel.
+
+A range kernel is meant to express a basic `parallel-for` calculation that is
+ideally suited for embarrassingly parallel kernels such as element-wise
+computations over n-dimensional arrays (ndarrays). The API for expressing a
+range kernel does not allow advanced features such as synchronization of
+work-items and fine-grained control over memory allocation on a device. For such
+advanced features, an nd-range kernel should be used.
+
+.. seealso::
+
+    Refer to the API documentation for :class:`numba_dpex.kernel_api.Range`,
+    :class:`numba_dpex.kernel_api.Item`, and
+    :func:`numba_dpex.core.kernel_launcher.call_kernel` for more details.
diff --git a/environment/docs.yml b/environment/docs.yml
index 056dd9fd06..3c005cd8d0 100644
--- a/environment/docs.yml
+++ b/environment/docs.yml
@@ -24,6 +24,7 @@ dependencies:
   - recommonmark
   - sphinx-rtd-theme
   - sphinxcontrib-apidoc
+  - sphinxcontrib-bibtex
   - sphinxcontrib-googleanalytics
   - sphinxcontrib.programoutput
   - pydata-sphinx-theme