From 1e25af72a6acb70516e8751d5363fb32bf5d1a45 Mon Sep 17 00:00:00 2001 From: "Ma, Guokai" Date: Mon, 29 Jul 2024 17:07:40 +0800 Subject: [PATCH 01/18] add new tutorial for accelerator setup guide --- docs/_tutorials/accelerator-setup-guide.md | 162 +++++++++++++++++++++ 1 file changed, 162 insertions(+) create mode 100644 docs/_tutorials/accelerator-setup-guide.md diff --git a/docs/_tutorials/accelerator-setup-guide.md b/docs/_tutorials/accelerator-setup-guide.md new file mode 100644 index 000000000000..e774bd434c5c --- /dev/null +++ b/docs/_tutorials/accelerator-setup-guide.md @@ -0,0 +1,162 @@ +--- +title: DeepSpeed Accelerator SetupGuides +tags: getting-started +--- + +# Contents +- [Contents](#contents) +- [Introduction](#introduction) +- [Intel Architecture (IA) CPU](#ia-cpu) + - [Port accelerator runtime calls](#port-accelerator-runtime-calls) + - [Port accelerator device name](#port-accelerator-device-name) + - [Tensor operations](#tensor-operations) + - [Communication backend](#communication-backend) +- [Run DeepSpeed model on different accelerators](#run-deepspeed-model-on-different-accelerators) +- [Run DeepSpeed model on CPU](#run-deepspeed-model-on-cpu) +- [Implement new accelerator extension](#implement-new-accelerator-extension) + +# Introduction +DeepSpeed supports different accelerators from different companies. Setup steps to run DeepSpeed on certain accelerators might be different. This guide allows user to lookup the accelerator family they are using and setup environment for the hardware they are using. + +Each section of this tutorial explains setup steps + + +The DeepSpeed Accelerator Abstraction allows user to run large language model seamlessly on various Deep Learning acceleration hardware with DeepSpeed. It offers a set of accelerator runtime and accelerator op builder interface which can be implemented for different hardware. This means user can write large language model code without hardware specific code. With DeepSpeed Accelerator Abstraction, the same large language model can run on different hardware platform, without the need to rewrite model code. This makes running large language model on different hardware easier. + +This document covers three topics related to DeepSpeed Accelerator Abstraction Interface: +1. Write accelerator agnostic models using DeepSpeed Accelerator Abstraction Interface. +2. Run DeepSpeed model on different accelerators. +3. Implement new accelerator extension for DeepSpeed Accelerator Abstraction Interface. + +# Write accelerator agnostic models +In this part, you will learn how to write a model that does not contain HW specific code, or how to port a model that run on a specific HW only to be accelerator agnostic. To do this, we first import `get_accelerator` from `deepspeed.accelerator` +``` +from deepspeed.accelerator import get_accelerator +``` +Note: `get_accelerator()` is the entrance to DeepSpeed Accelerator Abstraction Interface +## Port accelerator runtime calls +First we need to port accelerator runtime calls. On CUDA device, accelerator runtime call appears in the form of `torch.cuda.(...)`. With DeepSpeed Accelerator Abstract Interface, such accelerator runtime call can be written in the form of `get_accelerator().(...)` which will be accelerator agnostic. + +A typical conversion looks like the following example: + +``` +if torch.cuda.is_available(): + ... +``` +--> +``` +if get_accelerator().is_available(): + ... +``` + +For most `torch.cuda.(...)` call, we can literally replace `torch.cuda` with `get_accelerator()`. 
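A few more typical one-to-one replacements look like the following sketch (it assumes `device_count`, `set_device`, `synchronize` and `empty_cache` are available on the accelerator object and mirror their `torch.cuda` counterparts; check the interface definition for the full list):
```
from deepspeed.accelerator import get_accelerator

# torch.cuda.device_count()   -> get_accelerator().device_count()
num_devices = get_accelerator().device_count()

# torch.cuda.set_device(rank) -> get_accelerator().set_device(rank)
get_accelerator().set_device(0)

# torch.cuda.synchronize()    -> get_accelerator().synchronize()
get_accelerator().synchronize()

# torch.cuda.empty_cache()    -> get_accelerator().empty_cache()
get_accelerator().empty_cache()
```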
However, there are some exceptions that needs attention: +1. For `torch.cuda.current_device()`, we need to know whether calling this interface is to get device index, or supply the return value as a device. If we want to use the return value as a device string, we need to call `get_accelerator().current_device_name()`. For example: +``` +torch.empty(weight_shape, dtype=dtype, device=get_accelerator().current_device_name()) +``` +However, if we wish to get device index as a number, we should call `get_accelerator().current_device()` +``` +local_rank = get_accelerator().current_device() +``` +2. For `torch.cuda.default_generators[index]`, convert to `get_accelerator().default_generator(index)` + +## Port accelerator device name +For CUDA specific device name such as `'cuda'` or `'cuda:0'`, or `'cuda:1'`, we convert them to `get_accelerator().device_name()`, `get_accelerator().device_name(0)`, and `get_accelerator().device_name(1)`. + +A device name without index can be used if model need to do specific thing for certain accelerator. We suggest to make as less as such usage only for situations can not be resolve other way. + +## Tensor operations +CUDA specific tensor operations needs to be converted according to the following rules: +- When we convert a torch tensor to accelerator device such as `my_tensor.cuda()`, we use `my_tensor.to(get_accelerator().device_name())` + +- When we check whether a torch tensor is on accelerator device such as `my_tensor.is_cuda`, we use `get_accelerator().on_accelerator(my_tensor)` + +- When pin a tensor to GPU memory such as `my_tensor.pin_memory()`, we use `get_accelerator().pin_memory(my_tensor)` + +## Communication backend +When a communication backend string is used, the interface `get_accelerator().communication_backend_name()` is used get get communication backend name. So instead of: +``` +torch.distributed.init_process_group('nccl') +``` +, we use: +``` +torch.distributed.init_process_group(get_accelerator().communication_backend_name()) +``` + +# Run DeepSpeed model on different accelerators +Once a model is ported with DeepSpeed Accelerator Abstraction Interface, we can run this model on different accelerators using an extension to DeepSpeed. DeepSpeed checks whether a certain extension is installed in the environment to decide whether to use the Accelerator backend in that extension. For example, if we wish to run a model on Intel GPU, we can install _Intel Extension for DeepSpeed_ following the instructions in the following [link](https://github.com/intel/intel-extension-for-deepspeed/) + +After the extension is installed, install DeepSpeed and run the model. The model will be running on top of DeepSpeed. Because DeepSpeed installation is also accelerator related, it is recommended to install DeepSpeed accelerator extension before installing DeepSpeed. + +`CUDA_Accelerator` is the default accelerator in DeepSpeed. If no other DeepSpeed accelerator extension is installed, `CUDA_Accelerator` will be used. + +When running a model on different accelerators in a cloud environment, the recommended practice is to provision an environment for each accelerator in a different env with tools such as _anaconda/miniconda/virtualenv_. When running models on different Accelerator, load the env accordingly. + +Note that different accelerator may have different 'flavor' of float16 or bfloat16. 
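One way to handle this is to pick the half-precision dtype from the capabilities the accelerator reports at run time. Below is a minimal sketch; it assumes the `is_bf16_supported()` and `is_fp16_supported()` queries of the abstraction interface, and the fallback order is only an illustration:
```
import torch
from deepspeed.accelerator import get_accelerator

# Choose a half-precision dtype the current accelerator actually supports.
if get_accelerator().is_bf16_supported():
    dtype = torch.bfloat16
elif get_accelerator().is_fp16_supported():
    dtype = torch.float16
else:
    dtype = torch.float32  # fall back to full precision
```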
So it is recommended to make the model configurable for both float16 and bfloat16, in that way model code does not need to be changed when running on different accelerators. + +# Run DeepSpeed model on CPU +DeepSpeed support using CPU as accelerator. DeepSpeed model using DeepSpeed Accelerator Abstraction Interface could run on CPU without change to model code. DeepSpeed decide whether _Intel Extension for PyTorch_ is installed in the environment. If this packaged is installed, DeepSpeed will use CPU as accelerator. Otherwise CUDA device will be used as accelerator. + +To run DeepSpeed model on CPU, use the following steps to prepare environment: + +``` +python -m pip install intel_extension_for_pytorch +python -m pip install oneccl_bind_pt -f https://developer.intel.com/ipex-whl-stable-cpu +git clone https://github.com/oneapi-src/oneCCL +cd oneCCL +mkdir build +cd build +cmake .. +make +make install +``` + +Before run CPU workload, we need to source oneCCL environment variables +``` +source /build/_install/env/setvars.sh +``` + +After environment is prepared, we can launch DeepSpeed inference with the following command +``` +deepspeed --bind_cores_to_rank +``` + +This command would launch number of workers equal to number of CPU sockets on the system. Currently DeepSpeed support running inference model with AutoTP on top of CPU. The argument `--bind_cores_to_rank` distribute CPU cores on the system evenly among workers, to allow each worker running on a dedicated set of CPU cores. + +On CPU system, there might be daemon process that periodically activate which would increase variance of each worker. One practice is leave a couple of cores for daemon process using `--bind-core-list` argument: + +``` +deepspeed --bind_cores_to_rank --bind_core_list 0-51,56-107 +``` + +The command above leave 4 cores on each socket to daemon process (assume two sockets, each socket has 56 cores). + +We can also set an arbitrary number of workers. Unlike GPU, CPU cores on host can be further divided into subgroups. When this number is not set, DeepSpeed would detect number of NUMA nodes on the system and launch one worker for each NUMA node. + +``` +deepspeed --num_accelerators 4 --bind_cores_to_rank +``` + +Launching DeepSpeed model on multiple CPU nodes is similar to other accelerators. We need to specify `impi` as launcher and specify `--bind_cores_to_rank` for better core binding. Also specify `slots` number according to number of CPU sockets in host file. + +``` +# hostfile content should follow the format +# worker-1-hostname slots=<#sockets> +# worker-2-hostname slots=<#sockets> +# ... + +deepspeed --hostfile= --bind_cores_to_rank --launcher impi --master_addr +``` + +# Implement new accelerator extension +It is possible to implement a new DeepSpeed accelerator extension to support new accelerator in DeepSpeed. An example to follow is _[Intel Extension For DeepSpeed](https://github.com/intel/intel-extension-for-deepspeed/)_. An accelerator extension contains the following components: +1. XYZ_Accelerator(DeepSpeedAccelerator) class definition, where 'XYZ' is the accelerator name, such as 'XPU' or 'CPU'. +This class implements `class DeepSpeedAccelerator` and will be returned by `get_accelerator()` in DeepSpeed. +2. Op builders following https://github.com/intel/intel-extension-for-deepspeed/tree/main/intel_extension_for_deepspeed/op_builder. All op builders needs to inherit `deepspeed.ops.op_builder.builder.OpBuilder` directly or indirectly. 
A common practice is to implement a base op builder (SYCLOpBuilder in the case of Intel Extension for DeepSpeed) and inherit this base op builder instead. +3. Op kernels as in the following [link](https://github.com/intel/intel-extension-for-deepspeed/tree/main/intel_extension_for_deepspeed/op_builder/csrc). + +Note that an extension does not have to implement all op builders under https://github.com/microsoft/DeepSpeed/tree/master/op_builder all at a time. A missing op builder usually means certain DeepSpeed functionality cannot be used for that Accelerator, but models that does not use that functionality can still run. + +When implementing op builder for an accelerator extension, one thing needs to be noted is that the op builder native code is being built by DeepSpeed jit load mechanism. This mean the native source file being built needs to be in DeepSpeed installation directory. However these files are defined in accelerator extension installation directory, which cannot be built by DeepSpeed directly. To solve this, follow the example in https://github.com/intel/intel-extension-for-deepspeed/blob/main/intel_extension_for_deepspeed/op_builder/cpu_adam.py to use 'sycl_kernel_path' and 'sycl_kernel_include' (User can change 'sycl' to other prefix in their own accelerator extension) to allow native code be built during DeepSpeed jit load. + +When accelerator extension is installed in the environment, it can be used by either explicit call deepspeed.accelerator.set_accelerator(XYZ_Accelerator()) following the example in https://github.com/microsoft/DeepSpeed/blob/master/accelerator/real_accelerator.py, or add an implicit detection code in get_accelerator in the same file above. From 7f2f3e4ff7d371f093d35bdfa47febfa451a6102 Mon Sep 17 00:00:00 2001 From: "Ma, Guokai" Date: Wed, 31 Jul 2024 12:32:38 +0800 Subject: [PATCH 02/18] update CPU setup guide --- docs/_tutorials/accelerator-setup-guide.md | 175 ++++++--------------- 1 file changed, 52 insertions(+), 123 deletions(-) diff --git a/docs/_tutorials/accelerator-setup-guide.md b/docs/_tutorials/accelerator-setup-guide.md index e774bd434c5c..d3797886645f 100644 --- a/docs/_tutorials/accelerator-setup-guide.md +++ b/docs/_tutorials/accelerator-setup-guide.md @@ -7,156 +7,85 @@ tags: getting-started - [Contents](#contents) - [Introduction](#introduction) - [Intel Architecture (IA) CPU](#ia-cpu) - - [Port accelerator runtime calls](#port-accelerator-runtime-calls) - - [Port accelerator device name](#port-accelerator-device-name) - - [Tensor operations](#tensor-operations) - - [Communication backend](#communication-backend) -- [Run DeepSpeed model on different accelerators](#run-deepspeed-model-on-different-accelerators) -- [Run DeepSpeed model on CPU](#run-deepspeed-model-on-cpu) -- [Implement new accelerator extension](#implement-new-accelerator-extension) +- [Intel XPU](#intel-xpu) # Introduction DeepSpeed supports different accelerators from different companies. Setup steps to run DeepSpeed on certain accelerators might be different. This guide allows user to lookup the accelerator family they are using and setup environment for the hardware they are using. -Each section of this tutorial explains setup steps +# Intel Architecture (IA) CPU +DeepSpeed support CPU with Intel Architecture instruction set. It is recommended to have the CPU support at least AVX2 instruction set and preferrably AVX512 instruction set. 
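If you are not sure which instruction sets your CPU exposes, one quick check is to read the CPU flags reported by the kernel (a minimal Linux-only sketch; flag names such as `avx2` and `avx512f` come from `/proc/cpuinfo`):
```
# Report whether the CPU advertises AVX2 / AVX-512 support (Linux only).
with open("/proc/cpuinfo") as f:
    flags = set(f.read().split())
print("AVX2   supported:", "avx2" in flags)
print("AVX512 supported:", any(flag.startswith("avx512") for flag in flags))
```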
+DeepSpeed had been verified on the following CPU processors: +* Intel Gen 4th Xeon Processors +* Intel Gen 5th Xeon Processors -The DeepSpeed Accelerator Abstraction allows user to run large language model seamlessly on various Deep Learning acceleration hardware with DeepSpeed. It offers a set of accelerator runtime and accelerator op builder interface which can be implemented for different hardware. This means user can write large language model code without hardware specific code. With DeepSpeed Accelerator Abstraction, the same large language model can run on different hardware platform, without the need to rewrite model code. This makes running large language model on different hardware easier. +## Installation steps for Intel Architecture CPU +To install DeepSpeed on Intel Architecture CPU, use the following steps: +1. Install gcc compiler +DeepSpeed requires gcc-9 or above to build kernels on Intel Architecture CPU, install gcc-9 or above. +2. Install numactl +DeepSpeed use numactl for fine grain CPU core allocation for load-balancing, install numactl on your system. +3. Install PyTorch +`pip install torch` +4. Install DeepSpeed +`pip install deepspeed` -This document covers three topics related to DeepSpeed Accelerator Abstraction Interface: -1. Write accelerator agnostic models using DeepSpeed Accelerator Abstraction Interface. -2. Run DeepSpeed model on different accelerators. -3. Implement new accelerator extension for DeepSpeed Accelerator Abstraction Interface. - -# Write accelerator agnostic models -In this part, you will learn how to write a model that does not contain HW specific code, or how to port a model that run on a specific HW only to be accelerator agnostic. To do this, we first import `get_accelerator` from `deepspeed.accelerator` -``` -from deepspeed.accelerator import get_accelerator -``` -Note: `get_accelerator()` is the entrance to DeepSpeed Accelerator Abstraction Interface -## Port accelerator runtime calls -First we need to port accelerator runtime calls. On CUDA device, accelerator runtime call appears in the form of `torch.cuda.(...)`. With DeepSpeed Accelerator Abstract Interface, such accelerator runtime call can be written in the form of `get_accelerator().(...)` which will be accelerator agnostic. - -A typical conversion looks like the following example: - -``` -if torch.cuda.is_available(): - ... -``` ---> -``` -if get_accelerator().is_available(): - ... -``` - -For most `torch.cuda.(...)` call, we can literally replace `torch.cuda` with `get_accelerator()`. However, there are some exceptions that needs attention: -1. For `torch.cuda.current_device()`, we need to know whether calling this interface is to get device index, or supply the return value as a device. If we want to use the return value as a device string, we need to call `get_accelerator().current_device_name()`. For example: -``` -torch.empty(weight_shape, dtype=dtype, device=get_accelerator().current_device_name()) -``` -However, if we wish to get device index as a number, we should call `get_accelerator().current_device()` +## How to launch DeepSpeed on Intel Architecture CPU +DeepSpeed can launch on Intel Architecture CPU with default deepspeed command. However, for compute intensive workloads, Intel Architecture CPU works best when each worker process runs on different set of physical CPU cores, so worker process does not compete CPU cores with each other. To bind cores to each worker (rank), use the following command line switch for better performance. 
``` -local_rank = get_accelerator().current_device() +deepspeed --bind_cores_to_rank ``` -2. For `torch.cuda.default_generators[index]`, convert to `get_accelerator().default_generator(index)` - -## Port accelerator device name -For CUDA specific device name such as `'cuda'` or `'cuda:0'`, or `'cuda:1'`, we convert them to `get_accelerator().device_name()`, `get_accelerator().device_name(0)`, and `get_accelerator().device_name(1)`. - -A device name without index can be used if model need to do specific thing for certain accelerator. We suggest to make as less as such usage only for situations can not be resolve other way. - -## Tensor operations -CUDA specific tensor operations needs to be converted according to the following rules: -- When we convert a torch tensor to accelerator device such as `my_tensor.cuda()`, we use `my_tensor.to(get_accelerator().device_name())` - -- When we check whether a torch tensor is on accelerator device such as `my_tensor.is_cuda`, we use `get_accelerator().on_accelerator(my_tensor)` - -- When pin a tensor to GPU memory such as `my_tensor.pin_memory()`, we use `get_accelerator().pin_memory(my_tensor)` +This switch would automatically detect the number of CPU NUMA node on the host, then launch as many worker as number of NUMA nodes, and each worker to cores/memory of each NUMA node. This ensures workers does not interfere with each other and all memory allocation is local memory which improves performance. -## Communication backend -When a communication backend string is used, the interface `get_accelerator().communication_backend_name()` is used get get communication backend name. So instead of: +When user wish to get more control on the number of workers and which cores can be used by the workload, user can use the following command line switches. ``` -torch.distributed.init_process_group('nccl') +deepspeed --num_accelerators --bind_cores_to_rank --bind_core_list ``` -, we use: +For example: ``` -torch.distributed.init_process_group(get_accelerator().communication_backend_name()) +deepspeed --num_accelerators 4 --bind_cores_to_rank --bind_core_list <0-27,32-59> inference.py ``` +This would start 4 workers for the workload. The core list range will be divided evenly between 4 workers, with worker 0 take 0-13, worker 1, take 14-27, worker 2 take 32-45, and worker 3 take 46-59. Core 28-31,60-63 are left out because there might be some background process running on the system, leaving some idle cores will reduce performance jitting and straggler effect. -# Run DeepSpeed model on different accelerators -Once a model is ported with DeepSpeed Accelerator Abstraction Interface, we can run this model on different accelerators using an extension to DeepSpeed. DeepSpeed checks whether a certain extension is installed in the environment to decide whether to use the Accelerator backend in that extension. For example, if we wish to run a model on Intel GPU, we can install _Intel Extension for DeepSpeed_ following the instructions in the following [link](https://github.com/intel/intel-extension-for-deepspeed/) - -After the extension is installed, install DeepSpeed and run the model. The model will be running on top of DeepSpeed. Because DeepSpeed installation is also accelerator related, it is recommended to install DeepSpeed accelerator extension before installing DeepSpeed. - -`CUDA_Accelerator` is the default accelerator in DeepSpeed. If no other DeepSpeed accelerator extension is installed, `CUDA_Accelerator` will be used. 
- -When running a model on different accelerators in a cloud environment, the recommended practice is to provision an environment for each accelerator in a different env with tools such as _anaconda/miniconda/virtualenv_. When running models on different Accelerator, load the env accordingly. - -Note that different accelerator may have different 'flavor' of float16 or bfloat16. So it is recommended to make the model configurable for both float16 and bfloat16, in that way model code does not need to be changed when running on different accelerators. - -# Run DeepSpeed model on CPU -DeepSpeed support using CPU as accelerator. DeepSpeed model using DeepSpeed Accelerator Abstraction Interface could run on CPU without change to model code. DeepSpeed decide whether _Intel Extension for PyTorch_ is installed in the environment. If this packaged is installed, DeepSpeed will use CPU as accelerator. Otherwise CUDA device will be used as accelerator. - -To run DeepSpeed model on CPU, use the following steps to prepare environment: +Launching DeepSpeed model on multiple CPU nodes is similar to other accelerators. We need to specify `impi` as launcher and specify `--bind_cores_to_rank` for better core binding. Also specify `slots` number according to number of CPU sockets in host file. ``` -python -m pip install intel_extension_for_pytorch -python -m pip install oneccl_bind_pt -f https://developer.intel.com/ipex-whl-stable-cpu -git clone https://github.com/oneapi-src/oneCCL -cd oneCCL -mkdir build -cd build -cmake .. -make -make install -``` - -Before run CPU workload, we need to source oneCCL environment variables -``` -source /build/_install/env/setvars.sh -``` +# hostfile content should follow the format +# worker-1-hostname slots=<#sockets> +# worker-2-hostname slots=<#sockets> +# ... -After environment is prepared, we can launch DeepSpeed inference with the following command -``` -deepspeed --bind_cores_to_rank +deepspeed --hostfile= --bind_cores_to_rank --launcher impi --master_addr ``` -This command would launch number of workers equal to number of CPU sockets on the system. Currently DeepSpeed support running inference model with AutoTP on top of CPU. The argument `--bind_cores_to_rank` distribute CPU cores on the system evenly among workers, to allow each worker running on a dedicated set of CPU cores. - -On CPU system, there might be daemon process that periodically activate which would increase variance of each worker. One practice is leave a couple of cores for daemon process using `--bind-core-list` argument: +## Install with Intel Extension for PyTorch and oneCCL +Although not mandatory, Intel Extension for PyTorch and Intel oneCCL provide better optimizations for LLM models. Intel oneCCL also provide optimization when running LLM model on multi-node. To use DeepSpeed with Intel Extension for PyTorch and oneCCL, use the following steps: +1. Install Intel Extension for PyTorch. This is suggested if you want to get better LLM inference performance on CPU. +`pip install intel-extension-for-pytorch` +The following steps are to install oneCCL binding for PyTorch. This is suggested if you are running DeepSpeed on multiple CPU node, for better communication performance. On single node with multiple CPU socket, these steps are not needed. +2. Install oneCCL binding for PyTorch +`python -m pip install oneccl_bind_pt -f https://developer.intel.com/ipex-whl-stable-cpu` +3. 
Install Intel oneCCL, this will be used to build direct oneCCL kernels (CCLBackend kernels) ``` -deepspeed --bind_cores_to_rank --bind_core_list 0-51,56-107 +pip install oneccl-devel +pip install impi-devel ``` - -The command above leave 4 cores on each socket to daemon process (assume two sockets, each socket has 56 cores). - -We can also set an arbitrary number of workers. Unlike GPU, CPU cores on host can be further divided into subgroups. When this number is not set, DeepSpeed would detect number of NUMA nodes on the system and launch one worker for each NUMA node. - +Then set the environment variables for Intel oneCCL (assuming using conda environment). ``` -deepspeed --num_accelerators 4 --bind_cores_to_rank +export CPATH=${CONDA_PREFIX}/include:$CPATH +export CCL_ROOT=${CONDA_PREFIX} +export I_MPI_ROOT=${CONDA_PREFIX} +export LD_LIBRARY_PATH=${CONDA_PREFIX}/lib/ccl/cpu:${CONDA_PREFIX}/lib/libfabric:${CONDA_PREFIX}/lib ``` -Launching DeepSpeed model on multiple CPU nodes is similar to other accelerators. We need to specify `impi` as launcher and specify `--bind_cores_to_rank` for better core binding. Also specify `slots` number according to number of CPU sockets in host file. - +##Optimize LLM inference with Intel Extension for PyTorch +Intel Extension for PyTorch compatible w]th DeepSpeed AutoTP tensor parallel inference. It allows CPU inference benefit from both DeepSpeed Automatic Tensor Parallelism and LLM optimization from Intel Extension for PyTorch. To use Intel Extension for PyTorch, after call deepspeed.init_inference, call ``` -# hostfile content should follow the format -# worker-1-hostname slots=<#sockets> -# worker-2-hostname slots=<#sockets> -# ... - -deepspeed --hostfile= --bind_cores_to_rank --launcher impi --master_addr +ipex_model = ipex.llm.optimize(deepspeed_model) ``` +to get model optimzied by Intel Extension for PyTorch. -# Implement new accelerator extension -It is possible to implement a new DeepSpeed accelerator extension to support new accelerator in DeepSpeed. An example to follow is _[Intel Extension For DeepSpeed](https://github.com/intel/intel-extension-for-deepspeed/)_. An accelerator extension contains the following components: -1. XYZ_Accelerator(DeepSpeedAccelerator) class definition, where 'XYZ' is the accelerator name, such as 'XPU' or 'CPU'. -This class implements `class DeepSpeedAccelerator` and will be returned by `get_accelerator()` in DeepSpeed. -2. Op builders following https://github.com/intel/intel-extension-for-deepspeed/tree/main/intel_extension_for_deepspeed/op_builder. All op builders needs to inherit `deepspeed.ops.op_builder.builder.OpBuilder` directly or indirectly. A common practice is to implement a base op builder (SYCLOpBuilder in the case of Intel Extension for DeepSpeed) and inherit this base op builder instead. -3. Op kernels as in the following [link](https://github.com/intel/intel-extension-for-deepspeed/tree/main/intel_extension_for_deepspeed/op_builder/csrc). - -Note that an extension does not have to implement all op builders under https://github.com/microsoft/DeepSpeed/tree/master/op_builder all at a time. A missing op builder usually means certain DeepSpeed functionality cannot be used for that Accelerator, but models that does not use that functionality can still run. - -When implementing op builder for an accelerator extension, one thing needs to be noted is that the op builder native code is being built by DeepSpeed jit load mechanism. 
This mean the native source file being built needs to be in DeepSpeed installation directory. However these files are defined in accelerator extension installation directory, which cannot be built by DeepSpeed directly. To solve this, follow the example in https://github.com/intel/intel-extension-for-deepspeed/blob/main/intel_extension_for_deepspeed/op_builder/cpu_adam.py to use 'sycl_kernel_path' and 'sycl_kernel_include' (User can change 'sycl' to other prefix in their own accelerator extension) to allow native code be built during DeepSpeed jit load. +Refer to https://github.com/intel/intel-extension-for-pytorch/tree/main/examples/cpu/inference/python/llm for more extensive guide. -When accelerator extension is installed in the environment, it can be used by either explicit call deepspeed.accelerator.set_accelerator(XYZ_Accelerator()) following the example in https://github.com/microsoft/DeepSpeed/blob/master/accelerator/real_accelerator.py, or add an implicit detection code in get_accelerator in the same file above. +# Intel XPU From e3a851f6199033b2f57965fd441258a24d71e815 Mon Sep 17 00:00:00 2001 From: "Ma, Guokai" Date: Wed, 31 Jul 2024 12:36:28 +0800 Subject: [PATCH 03/18] finetune link --- docs/_tutorials/accelerator-setup-guide.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/_tutorials/accelerator-setup-guide.md b/docs/_tutorials/accelerator-setup-guide.md index d3797886645f..eeb95ad092c3 100644 --- a/docs/_tutorials/accelerator-setup-guide.md +++ b/docs/_tutorials/accelerator-setup-guide.md @@ -6,7 +6,7 @@ tags: getting-started # Contents - [Contents](#contents) - [Introduction](#introduction) -- [Intel Architecture (IA) CPU](#ia-cpu) +- [Intel Architecture (IA) CPU](#intel-architecture-ia-cpu) - [Intel XPU](#intel-xpu) # Introduction From 31034d53125f6364f2e661d0f72f6ad50d19f453 Mon Sep 17 00:00:00 2001 From: Liangliang Ma <1906710196@qq.com> Date: Thu, 1 Aug 2024 15:12:35 +0800 Subject: [PATCH 04/18] add intel xpu (#43) --- docs/_tutorials/accelerator-setup-guide.md | 40 ++++++++++++++++++++-- 1 file changed, 38 insertions(+), 2 deletions(-) diff --git a/docs/_tutorials/accelerator-setup-guide.md b/docs/_tutorials/accelerator-setup-guide.md index eeb95ad092c3..37bc6358669c 100644 --- a/docs/_tutorials/accelerator-setup-guide.md +++ b/docs/_tutorials/accelerator-setup-guide.md @@ -13,9 +13,9 @@ tags: getting-started DeepSpeed supports different accelerators from different companies. Setup steps to run DeepSpeed on certain accelerators might be different. This guide allows user to lookup the accelerator family they are using and setup environment for the hardware they are using. # Intel Architecture (IA) CPU -DeepSpeed support CPU with Intel Architecture instruction set. It is recommended to have the CPU support at least AVX2 instruction set and preferrably AVX512 instruction set. +DeepSpeed supports CPU with Intel Architecture instruction set. It is recommended to have the CPU support at least AVX2 instruction set and preferrably AVX512 instruction set. -DeepSpeed had been verified on the following CPU processors: +DeepSpeed has been verified on the following CPU processors: * Intel Gen 4th Xeon Processors * Intel Gen 5th Xeon Processors @@ -89,3 +89,39 @@ to get model optimzied by Intel Extension for PyTorch. Refer to https://github.com/intel/intel-extension-for-pytorch/tree/main/examples/cpu/inference/python/llm for more extensive guide. # Intel XPU +DeepSpeed XPU accelerator supports Intel® Data Center GPU Max Series. 
+ +DeepSpeed has been verified on the following GPU products: +* Intel® Data Center GPU Max 1100 +* Intel® Data Center GPU Max 1550 + +## Installation steps for Intel XPU +To install DeepSpeed on Intel XPU, use the following steps: +1. Install oneAPI base toolkit \ +The Intel® oneAPI Base Toolkit (Base Kit) is a core set of tools and libraries, including an DPC++/C++ Compiler for building Deepspeed XPU kernels like fusedAdam and CPUAdam, high performance computation libraries demanded by IPEX, etc. +For easy download, usage and more details, check [Intel oneAPI base-toolkit](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit.html). +2. Install PyTorch \ +`pip install torch` +3. Install Intel extension for pytorch, for torch functionality and performance on Intel platform \ +`pip install intel-extension-for-pytorch` +4. Install oneccl_bindings_for_pytorch, which is the default communication backend cross XPU devices \ +`pip install oneccl_bind_pt` +5. Install DeepSpeed +`pip install deepspeed` + +**_NOTE:_** Should keep the software stack latest for the kernels of XPU in DeepSpeed will always be compatible with the latest released oneAPI basekit and IPEX(Intel extension for pytorch). Also you can add `-f https://developer.intel.com/ipex-whl-stable-xpu` flag for better experience of pip install intel packages. + +## How to use DeepSpeed on Intel XPU +DeepSpeed can launch on Intel XPU with common deepspeed command. Before that, user needs activate the oneAPI environment by: \ +`source /setvars.sh` + +To validate the XPU availability and if the XPU accelerator is correctly chosen, here is an example: +``` +$ python +>>> import torch; print('torch:', torch.__version__) +torch: 2.3.0 +>>> import intel_extension_for_pytorch; print('XPU available:', torch.xpu.is_available()) +XPU available: True +>>> from deepspeed.accelerator import get_accelerator; print('accelerator:', get_accelerator()._name) +accelerator: xpu +``` From 4159f1fec70a62a01c49eb52b157b90472dbb4a3 Mon Sep 17 00:00:00 2001 From: "Ma, Guokai" Date: Fri, 2 Aug 2024 13:35:50 +0800 Subject: [PATCH 05/18] fix CPU `art --- docs/_tutorials/accelerator-setup-guide.md | 19 ++++++++++++++----- 1 file changed, 14 insertions(+), 5 deletions(-) diff --git a/docs/_tutorials/accelerator-setup-guide.md b/docs/_tutorials/accelerator-setup-guide.md index 37bc6358669c..be246246a655 100644 --- a/docs/_tutorials/accelerator-setup-guide.md +++ b/docs/_tutorials/accelerator-setup-guide.md @@ -13,20 +13,26 @@ tags: getting-started DeepSpeed supports different accelerators from different companies. Setup steps to run DeepSpeed on certain accelerators might be different. This guide allows user to lookup the accelerator family they are using and setup environment for the hardware they are using. # Intel Architecture (IA) CPU -DeepSpeed supports CPU with Intel Architecture instruction set. It is recommended to have the CPU support at least AVX2 instruction set and preferrably AVX512 instruction set. +DeepSpeed supports CPU with Intel Architecture instruction set. It is recommended to have the CPU support at least AVX2 instruction set and recommend AMX instruction set. 
DeepSpeed has been verified on the following CPU processors: -* Intel Gen 4th Xeon Processors -* Intel Gen 5th Xeon Processors +* 4th Gen Intel Xeon Scalarable Processors +* 5th Gen Intel Xeon Scalarable Processors +* 6th Gen Intel Xeon Scalarable Processors ## Installation steps for Intel Architecture CPU To install DeepSpeed on Intel Architecture CPU, use the following steps: 1. Install gcc compiler DeepSpeed requires gcc-9 or above to build kernels on Intel Architecture CPU, install gcc-9 or above. + 2. Install numactl -DeepSpeed use numactl for fine grain CPU core allocation for load-balancing, install numactl on your system. +DeepSpeed use `numactl` for fine grain CPU core allocation for load-balancing, install numactl on your system. +For example, on Ubuntu system, use the following command: +`sudo apt-get install numactl` + 3. Install PyTorch `pip install torch` + 4. Install DeepSpeed `pip install deepspeed` @@ -64,8 +70,10 @@ Although not mandatory, Intel Extension for PyTorch and Intel oneCCL provide bet `pip install intel-extension-for-pytorch` The following steps are to install oneCCL binding for PyTorch. This is suggested if you are running DeepSpeed on multiple CPU node, for better communication performance. On single node with multiple CPU socket, these steps are not needed. + 2. Install oneCCL binding for PyTorch `python -m pip install oneccl_bind_pt -f https://developer.intel.com/ipex-whl-stable-cpu` + 3. Install Intel oneCCL, this will be used to build direct oneCCL kernels (CCLBackend kernels) ``` pip install oneccl-devel @@ -79,13 +87,14 @@ export I_MPI_ROOT=${CONDA_PREFIX} export LD_LIBRARY_PATH=${CONDA_PREFIX}/lib/ccl/cpu:${CONDA_PREFIX}/lib/libfabric:${CONDA_PREFIX}/lib ``` -##Optimize LLM inference with Intel Extension for PyTorch +## Optimize LLM inference with Intel Extension for PyTorch Intel Extension for PyTorch compatible w]th DeepSpeed AutoTP tensor parallel inference. It allows CPU inference benefit from both DeepSpeed Automatic Tensor Parallelism and LLM optimization from Intel Extension for PyTorch. To use Intel Extension for PyTorch, after call deepspeed.init_inference, call ``` ipex_model = ipex.llm.optimize(deepspeed_model) ``` to get model optimzied by Intel Extension for PyTorch. +## More example for using DeepSpeed with Intel Extension for PyTorch on Intel Architecture CPU Refer to https://github.com/intel/intel-extension-for-pytorch/tree/main/examples/cpu/inference/python/llm for more extensive guide. # Intel XPU From ff7155b95602997e43940abeaaf0be961702ef88 Mon Sep 17 00:00:00 2001 From: Liangliang Ma <1906710196@qq.com> Date: Mon, 5 Aug 2024 14:10:51 +0800 Subject: [PATCH 06/18] change installation style for xpu (#44) * change installation style for xpu * add xpu example * fix typo --- docs/_tutorials/accelerator-setup-guide.md | 16 +++++++--------- 1 file changed, 7 insertions(+), 9 deletions(-) diff --git a/docs/_tutorials/accelerator-setup-guide.md b/docs/_tutorials/accelerator-setup-guide.md index be246246a655..e9eb6373ae5c 100644 --- a/docs/_tutorials/accelerator-setup-guide.md +++ b/docs/_tutorials/accelerator-setup-guide.md @@ -109,16 +109,11 @@ To install DeepSpeed on Intel XPU, use the following steps: 1. Install oneAPI base toolkit \ The Intel® oneAPI Base Toolkit (Base Kit) is a core set of tools and libraries, including an DPC++/C++ Compiler for building Deepspeed XPU kernels like fusedAdam and CPUAdam, high performance computation libraries demanded by IPEX, etc. 
For easy download, usage and more details, check [Intel oneAPI base-toolkit](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit.html). -2. Install PyTorch \ -`pip install torch` -3. Install Intel extension for pytorch, for torch functionality and performance on Intel platform \ -`pip install intel-extension-for-pytorch` -4. Install oneccl_bindings_for_pytorch, which is the default communication backend cross XPU devices \ -`pip install oneccl_bind_pt` -5. Install DeepSpeed -`pip install deepspeed` +2. Install PyTorch, Intel extension for pytorch, Intel oneCCL Bindings for PyTorch. These packages are required in `xpu_accelerator` for torch functionality and performance, also communication backend on Intel platform. The recommended installation reference: +https://intel.github.io/intel-extension-for-pytorch/index.html#installation?platform=gpu. -**_NOTE:_** Should keep the software stack latest for the kernels of XPU in DeepSpeed will always be compatible with the latest released oneAPI basekit and IPEX(Intel extension for pytorch). Also you can add `-f https://developer.intel.com/ipex-whl-stable-xpu` flag for better experience of pip install intel packages. +3. Install DeepSpeed \ +`pip install deepspeed` ## How to use DeepSpeed on Intel XPU DeepSpeed can launch on Intel XPU with common deepspeed command. Before that, user needs activate the oneAPI environment by: \ @@ -134,3 +129,6 @@ XPU available: True >>> from deepspeed.accelerator import get_accelerator; print('accelerator:', get_accelerator()._name) accelerator: xpu ``` + +## More example for using DeepSpeed on Intel XPU +Refer to https://github.com/intel/intel-extension-for-pytorch/tree/release/xpu/2.1.40/examples/gpu/inference/python/llm for more extensive guide. From 61bb8903daba81854b6a3927ae36604b3cb5cd16 Mon Sep 17 00:00:00 2001 From: "Ma, Guokai" Date: Mon, 5 Aug 2024 12:40:46 +0800 Subject: [PATCH 07/18] add registration mark(test) --- docs/_tutorials/accelerator-setup-guide.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/_tutorials/accelerator-setup-guide.md b/docs/_tutorials/accelerator-setup-guide.md index e9eb6373ae5c..e0b7cd5e7849 100644 --- a/docs/_tutorials/accelerator-setup-guide.md +++ b/docs/_tutorials/accelerator-setup-guide.md @@ -13,7 +13,7 @@ tags: getting-started DeepSpeed supports different accelerators from different companies. Setup steps to run DeepSpeed on certain accelerators might be different. This guide allows user to lookup the accelerator family they are using and setup environment for the hardware they are using. # Intel Architecture (IA) CPU -DeepSpeed supports CPU with Intel Architecture instruction set. It is recommended to have the CPU support at least AVX2 instruction set and recommend AMX instruction set. +DeepSpeed supports CPU with Intel® Architecture instruction set. It is recommended to have the CPU support at least AVX2 instruction set and recommend AMX instruction set. 
DeepSpeed has been verified on the following CPU processors: * 4th Gen Intel Xeon Scalarable Processors From f5bd5b22a623a9fba97d835ce83e6f2cec5b775e Mon Sep 17 00:00:00 2001 From: "Ma, Guokai" Date: Mon, 5 Aug 2024 12:42:06 +0800 Subject: [PATCH 08/18] fix reg mark --- docs/_tutorials/accelerator-setup-guide.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/_tutorials/accelerator-setup-guide.md b/docs/_tutorials/accelerator-setup-guide.md index e0b7cd5e7849..d302a5ae9557 100644 --- a/docs/_tutorials/accelerator-setup-guide.md +++ b/docs/_tutorials/accelerator-setup-guide.md @@ -13,7 +13,7 @@ tags: getting-started DeepSpeed supports different accelerators from different companies. Setup steps to run DeepSpeed on certain accelerators might be different. This guide allows user to lookup the accelerator family they are using and setup environment for the hardware they are using. # Intel Architecture (IA) CPU -DeepSpeed supports CPU with Intel® Architecture instruction set. It is recommended to have the CPU support at least AVX2 instruction set and recommend AMX instruction set. +DeepSpeed supports CPU with Intel ® Architecture instruction set. It is recommended to have the CPU support at least AVX2 instruction set and recommend AMX instruction set. DeepSpeed has been verified on the following CPU processors: * 4th Gen Intel Xeon Scalarable Processors From 286b8e8642123deb30305b9efb0eca129c272fd8 Mon Sep 17 00:00:00 2001 From: "Ma, Guokai" Date: Mon, 5 Aug 2024 12:57:59 +0800 Subject: [PATCH 09/18] fix all registration marks --- docs/_tutorials/accelerator-setup-guide.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/docs/_tutorials/accelerator-setup-guide.md b/docs/_tutorials/accelerator-setup-guide.md index d302a5ae9557..220a485ea16b 100644 --- a/docs/_tutorials/accelerator-setup-guide.md +++ b/docs/_tutorials/accelerator-setup-guide.md @@ -13,12 +13,12 @@ tags: getting-started DeepSpeed supports different accelerators from different companies. Setup steps to run DeepSpeed on certain accelerators might be different. This guide allows user to lookup the accelerator family they are using and setup environment for the hardware they are using. # Intel Architecture (IA) CPU -DeepSpeed supports CPU with Intel ® Architecture instruction set. It is recommended to have the CPU support at least AVX2 instruction set and recommend AMX instruction set. +DeepSpeed supports CPU with Intel Architecture instruction set. It is recommended to have the CPU support at least AVX2 instruction set and recommend AMX instruction set. DeepSpeed has been verified on the following CPU processors: -* 4th Gen Intel Xeon Scalarable Processors -* 5th Gen Intel Xeon Scalarable Processors -* 6th Gen Intel Xeon Scalarable Processors +* 4th Gen Intel® Xeon® Scalarable Processors +* 5th Gen Intel® Xeon® Scalarable Processors +* 6th Gen Intel® Xeon® Scalarable Processors ## Installation steps for Intel Architecture CPU To install DeepSpeed on Intel Architecture CPU, use the following steps: @@ -71,7 +71,7 @@ Although not mandatory, Intel Extension for PyTorch and Intel oneCCL provide bet The following steps are to install oneCCL binding for PyTorch. This is suggested if you are running DeepSpeed on multiple CPU node, for better communication performance. On single node with multiple CPU socket, these steps are not needed. -2. Install oneCCL binding for PyTorch +2. 
Install Intel oneCCL binding for PyTorch `python -m pip install oneccl_bind_pt -f https://developer.intel.com/ipex-whl-stable-cpu` 3. Install Intel oneCCL, this will be used to build direct oneCCL kernels (CCLBackend kernels) From f86b751277b2db3b742ab0c0104194e056512ce7 Mon Sep 17 00:00:00 2001 From: "Ma, Guokai" Date: Mon, 5 Aug 2024 14:02:40 +0800 Subject: [PATCH 10/18] revise run abstract-accelerator-interface document --- .../accelerator-abstraction-interface.md | 66 +------------------ 1 file changed, 3 insertions(+), 63 deletions(-) diff --git a/docs/_tutorials/accelerator-abstraction-interface.md b/docs/_tutorials/accelerator-abstraction-interface.md index 88a43236ce9d..0a0d52c60bac 100644 --- a/docs/_tutorials/accelerator-abstraction-interface.md +++ b/docs/_tutorials/accelerator-abstraction-interface.md @@ -79,69 +79,9 @@ torch.distributed.init_process_group(get_accelerator().communication_backend_nam ``` # Run DeepSpeed model on different accelerators -Once a model is ported with DeepSpeed Accelerator Abstraction Interface, we can run this model on different accelerators using an extension to DeepSpeed. DeepSpeed checks whether a certain extension is installed in the environment to decide whether to use the Accelerator backend in that extension. For example, if we wish to run a model on Intel GPU, we can install _Intel Extension for DeepSpeed_ following the instructions in the following [link](https://github.com/intel/intel-extension-for-deepspeed/) - -After the extension is installed, install DeepSpeed and run the model. The model will be running on top of DeepSpeed. Because DeepSpeed installation is also accelerator related, it is recommended to install DeepSpeed accelerator extension before installing DeepSpeed. - -`CUDA_Accelerator` is the default accelerator in DeepSpeed. If no other DeepSpeed accelerator extension is installed, `CUDA_Accelerator` will be used. - -When running a model on different accelerators in a cloud environment, the recommended practice is to provision an environment for each accelerator in a different env with tools such as _anaconda/miniconda/virtualenv_. When running models on different Accelerator, load the env accordingly. - -Note that different accelerator may have different 'flavor' of float16 or bfloat16. So it is recommended to make the model configurable for both float16 and bfloat16, in that way model code does not need to be changed when running on different accelerators. - -# Run DeepSpeed model on CPU -DeepSpeed support using CPU as accelerator. DeepSpeed model using DeepSpeed Accelerator Abstraction Interface could run on CPU without change to model code. DeepSpeed decide whether _Intel Extension for PyTorch_ is installed in the environment. If this packaged is installed, DeepSpeed will use CPU as accelerator. Otherwise CUDA device will be used as accelerator. - -To run DeepSpeed model on CPU, use the following steps to prepare environment: - -``` -python -m pip install intel_extension_for_pytorch -python -m pip install oneccl_bind_pt -f https://developer.intel.com/ipex-whl-stable-cpu -git clone https://github.com/oneapi-src/oneCCL -cd oneCCL -mkdir build -cd build -cmake .. 
-make -make install -``` - -Before run CPU workload, we need to source oneCCL environment variables -``` -source /build/_install/env/setvars.sh -``` - -After environment is prepared, we can launch DeepSpeed inference with the following command -``` -deepspeed --bind_cores_to_rank -``` - -This command would launch number of workers equal to number of CPU sockets on the system. Currently DeepSpeed support running inference model with AutoTP on top of CPU. The argument `--bind_cores_to_rank` distribute CPU cores on the system evenly among workers, to allow each worker running on a dedicated set of CPU cores. - -On CPU system, there might be daemon process that periodically activate which would increase variance of each worker. One practice is leave a couple of cores for daemon process using `--bind-core-list` argument: - -``` -deepspeed --bind_cores_to_rank --bind_core_list 0-51,56-107 -``` - -The command above leave 4 cores on each socket to daemon process (assume two sockets, each socket has 56 cores). - -We can also set an arbitrary number of workers. Unlike GPU, CPU cores on host can be further divided into subgroups. When this number is not set, DeepSpeed would detect number of NUMA nodes on the system and launch one worker for each NUMA node. - -``` -deepspeed --num_accelerators 4 --bind_cores_to_rank -``` - -Launching DeepSpeed model on multiple CPU nodes is similar to other accelerators. We need to specify `impi` as launcher and specify `--bind_cores_to_rank` for better core binding. Also specify `slots` number according to number of CPU sockets in host file. - -``` -# hostfile content should follow the format -# worker-1-hostname slots=<#sockets> -# worker-2-hostname slots=<#sockets> -# ... - -deepspeed --hostfile= --bind_cores_to_rank --launcher impi --master_addr -``` +[link](accelerator-setup-guide.md) provides a guide on how to setup different accelerators for DeepSpeed. It also comes with simple example how to run deepspeed for different accelerators. The following guides are provided: +1. Run DeepSpeed model on CPU +2. Run DeepSpeed model on XPU # Implement new accelerator extension It is possible to implement a new DeepSpeed accelerator extension to support new accelerator in DeepSpeed. An example to follow is _[Intel Extension For DeepSpeed](https://github.com/intel/intel-extension-for-deepspeed/)_. 
An accelerator extension contains the following components: From 1a751212b2fbf802528144b9cfb5f39ac7087bd1 Mon Sep 17 00:00:00 2001 From: "Ma, Guokai" Date: Mon, 5 Aug 2024 14:04:32 +0800 Subject: [PATCH 11/18] remove unnecessary link --- docs/_tutorials/accelerator-abstraction-interface.md | 1 - 1 file changed, 1 deletion(-) diff --git a/docs/_tutorials/accelerator-abstraction-interface.md b/docs/_tutorials/accelerator-abstraction-interface.md index 0a0d52c60bac..f836b5b93a69 100644 --- a/docs/_tutorials/accelerator-abstraction-interface.md +++ b/docs/_tutorials/accelerator-abstraction-interface.md @@ -12,7 +12,6 @@ tags: getting-started - [Tensor operations](#tensor-operations) - [Communication backend](#communication-backend) - [Run DeepSpeed model on different accelerators](#run-deepspeed-model-on-different-accelerators) -- [Run DeepSpeed model on CPU](#run-deepspeed-model-on-cpu) - [Implement new accelerator extension](#implement-new-accelerator-extension) # Introduction From 7eb98b9879eb6bf16531f43051524bc0df9e4258 Mon Sep 17 00:00:00 2001 From: "Ma, Guokai" Date: Mon, 5 Aug 2024 14:18:40 +0800 Subject: [PATCH 12/18] update link name --- docs/_tutorials/accelerator-abstraction-interface.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/_tutorials/accelerator-abstraction-interface.md b/docs/_tutorials/accelerator-abstraction-interface.md index f836b5b93a69..d7c153638c0d 100644 --- a/docs/_tutorials/accelerator-abstraction-interface.md +++ b/docs/_tutorials/accelerator-abstraction-interface.md @@ -78,7 +78,7 @@ torch.distributed.init_process_group(get_accelerator().communication_backend_nam ``` # Run DeepSpeed model on different accelerators -[link](accelerator-setup-guide.md) provides a guide on how to setup different accelerators for DeepSpeed. It also comes with simple example how to run deepspeed for different accelerators. The following guides are provided: +[Accelerator Setup Guide](accelerator-setup-guide.md) provides a guide on how to setup different accelerators for DeepSpeed. It also comes with simple example how to run deepspeed for different accelerators. The following guides are provided: 1. Run DeepSpeed model on CPU 2. 
Run DeepSpeed model on XPU From 68406fdeb21d0f5c0790c70a324321152e051954 Mon Sep 17 00:00:00 2001 From: "Ma, Guokai" Date: Mon, 5 Aug 2024 16:02:03 +0800 Subject: [PATCH 13/18] fix typo --- docs/_tutorials/accelerator-setup-guide.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/_tutorials/accelerator-setup-guide.md b/docs/_tutorials/accelerator-setup-guide.md index 220a485ea16b..802a70018368 100644 --- a/docs/_tutorials/accelerator-setup-guide.md +++ b/docs/_tutorials/accelerator-setup-guide.md @@ -1,5 +1,5 @@ --- -title: DeepSpeed Accelerator SetupGuides +title: DeepSpeed Accelerator Setup Guides tags: getting-started --- From 2e29f1076903c75495d45bdee46f7338430e43a3 Mon Sep 17 00:00:00 2001 From: "Ma, Guokai" Date: Sat, 10 Aug 2024 11:44:05 +0800 Subject: [PATCH 14/18] Update docs/_tutorials/accelerator-setup-guide.md Co-authored-by: Olatunji Ruwase --- docs/_tutorials/accelerator-setup-guide.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/_tutorials/accelerator-setup-guide.md b/docs/_tutorials/accelerator-setup-guide.md index 802a70018368..a67444126473 100644 --- a/docs/_tutorials/accelerator-setup-guide.md +++ b/docs/_tutorials/accelerator-setup-guide.md @@ -10,7 +10,7 @@ tags: getting-started - [Intel XPU](#intel-xpu) # Introduction -DeepSpeed supports different accelerators from different companies. Setup steps to run DeepSpeed on certain accelerators might be different. This guide allows user to lookup the accelerator family they are using and setup environment for the hardware they are using. +DeepSpeed supports different accelerators from different companies. Setup steps to run DeepSpeed on certain accelerators might be different. This guide allows user to lookup setup instructions for the accelerator family and hardware they are using. # Intel Architecture (IA) CPU DeepSpeed supports CPU with Intel Architecture instruction set. It is recommended to have the CPU support at least AVX2 instruction set and recommend AMX instruction set. From 4ad7636485c8fe122f642a0031e71c306d8d0ad7 Mon Sep 17 00:00:00 2001 From: "Ma, Guokai" Date: Sat, 10 Aug 2024 11:45:02 +0800 Subject: [PATCH 15/18] Update docs/_tutorials/accelerator-setup-guide.md Co-authored-by: Olatunji Ruwase --- docs/_tutorials/accelerator-setup-guide.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/_tutorials/accelerator-setup-guide.md b/docs/_tutorials/accelerator-setup-guide.md index a67444126473..ca3dd7ce5157 100644 --- a/docs/_tutorials/accelerator-setup-guide.md +++ b/docs/_tutorials/accelerator-setup-guide.md @@ -41,7 +41,7 @@ DeepSpeed can launch on Intel Architecture CPU with default deepspeed command. ``` deepspeed --bind_cores_to_rank ``` -This switch would automatically detect the number of CPU NUMA node on the host, then launch as many worker as number of NUMA nodes, and each worker to cores/memory of each NUMA node. This ensures workers does not interfere with each other and all memory allocation is local memory which improves performance. +This switch would automatically detect the number of CPU NUMA node on the host, launch the same number of workers, and bind each worker to cores/memory of a different NUMA node. This improves performance by ensuring workers do not interfere with each other, and that all memory allocation is from local memory. When user wish to get more control on the number of workers and which cores can be used by the workload, user can use the following command line switches. 
``` From 6ceef090fe948d919db913eb2ef540b5e66e4273 Mon Sep 17 00:00:00 2001 From: "Ma, Guokai" Date: Sat, 10 Aug 2024 11:45:30 +0800 Subject: [PATCH 16/18] Update docs/_tutorials/accelerator-setup-guide.md Co-authored-by: Olatunji Ruwase --- docs/_tutorials/accelerator-setup-guide.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/_tutorials/accelerator-setup-guide.md b/docs/_tutorials/accelerator-setup-guide.md index ca3dd7ce5157..343b305689a6 100644 --- a/docs/_tutorials/accelerator-setup-guide.md +++ b/docs/_tutorials/accelerator-setup-guide.md @@ -43,7 +43,7 @@ deepspeed --bind_cores_to_rank ``` This switch would automatically detect the number of CPU NUMA node on the host, launch the same number of workers, and bind each worker to cores/memory of a different NUMA node. This improves performance by ensuring workers do not interfere with each other, and that all memory allocation is from local memory. -When user wish to get more control on the number of workers and which cores can be used by the workload, user can use the following command line switches. +If a user wishes to have more control on the number of workers and specific cores that can be used by the workload, user can use the following command line switches. ``` deepspeed --num_accelerators --bind_cores_to_rank --bind_core_list ``` From 3e57b19f4db04d78edd74b6545a5e7e69cef24af Mon Sep 17 00:00:00 2001 From: "Ma, Guokai" Date: Sat, 10 Aug 2024 11:46:06 +0800 Subject: [PATCH 17/18] Update docs/_tutorials/accelerator-setup-guide.md Co-authored-by: Olatunji Ruwase --- docs/_tutorials/accelerator-setup-guide.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/_tutorials/accelerator-setup-guide.md b/docs/_tutorials/accelerator-setup-guide.md index 343b305689a6..5f3b1751c686 100644 --- a/docs/_tutorials/accelerator-setup-guide.md +++ b/docs/_tutorials/accelerator-setup-guide.md @@ -88,7 +88,7 @@ export LD_LIBRARY_PATH=${CONDA_PREFIX}/lib/ccl/cpu:${CONDA_PREFIX}/lib/libfabric ``` ## Optimize LLM inference with Intel Extension for PyTorch -Intel Extension for PyTorch compatible w]th DeepSpeed AutoTP tensor parallel inference. It allows CPU inference benefit from both DeepSpeed Automatic Tensor Parallelism and LLM optimization from Intel Extension for PyTorch. To use Intel Extension for PyTorch, after call deepspeed.init_inference, call +Intel Extension for PyTorch compatible with DeepSpeed AutoTP tensor parallel inference. It allows CPU inference to benefit from both DeepSpeed Automatic Tensor Parallelism, and LLM optimizations of Intel Extension for PyTorch. To use Intel Extension for PyTorch, after calling deepspeed.init_inference, call ``` ipex_model = ipex.llm.optimize(deepspeed_model) ``` From c060e3340479850f895d2ec8a26e48ef07907c6f Mon Sep 17 00:00:00 2001 From: Guokai Ma Date: Sat, 10 Aug 2024 11:50:54 +0800 Subject: [PATCH 18/18] fix xpu part description --- docs/_tutorials/accelerator-setup-guide.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/_tutorials/accelerator-setup-guide.md b/docs/_tutorials/accelerator-setup-guide.md index 5f3b1751c686..cf2d01d2b25c 100644 --- a/docs/_tutorials/accelerator-setup-guide.md +++ b/docs/_tutorials/accelerator-setup-guide.md @@ -116,7 +116,7 @@ https://intel.github.io/intel-extension-for-pytorch/index.html#installation?plat `pip install deepspeed` ## How to use DeepSpeed on Intel XPU -DeepSpeed can launch on Intel XPU with common deepspeed command. 
Before that, user needs activate the oneAPI environment by: \
+DeepSpeed can be launched on Intel XPU with the deepspeed launch command. Before that, the user needs to activate the oneAPI environment by: \
`source /setvars.sh`

To validate that an XPU is available and that the XPU accelerator is correctly chosen, here is an example:
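```
$ python
>>> import torch; print('torch:', torch.__version__)
torch: 2.3.0
>>> import intel_extension_for_pytorch; print('XPU available:', torch.xpu.is_available())
XPU available: True
>>> from deepspeed.accelerator import get_accelerator; print('accelerator:', get_accelerator()._name)
accelerator: xpu
```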