docs: Add workers doc #4628

Merged · 5 commits · Apr 7, 2024
4 changes: 2 additions & 2 deletions docs/source/guides/concurrency.rst
@@ -23,7 +23,7 @@ To specify concurrency for a BentoML Service, use the concurrency field in traffic

Key points about concurrency in BentoML:

- Concurrency represents the ideal number of requests a Service can simultaneously process. By default, BentoML does not impose a limit on concurrency to avoid bottlenecks.
- ``concurrency`` is a new field introduced in BentoML 1.2.8. It represents the ideal number of requests that a BentoML Service (namely, all :doc:`workers </guides/workers>` in the Service) can simultaneously process. By default, BentoML does not impose a limit on concurrency to avoid bottlenecks.
- If your Service supports :doc:`adaptive batching </guides/adaptive-batching>` or continuous batching, set ``concurrency`` to match the batch size. This aligns processing capacity with batch requirements, optimizing throughput.
- If a Service spawns multiple workers to leverage the parallelism of the underlying hardware accelerators (for example, multi-device GPUs), ``concurrency`` should be set to the degree of parallelism the devices can support.
- For Services designed to handle one request at a time, set ``concurrency`` to ``1``, ensuring that requests are processed sequentially without overlap.
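
As a minimal sketch, ``concurrency`` is set through the ``traffic`` field of the ``@bentoml.service`` decorator; the value ``32`` is only a placeholder to adjust for your workload:

.. code-block:: python

    import bentoml

    @bentoml.service(
        traffic={"concurrency": 32},  # ideal number of simultaneous requests for this Service
    )
    class MyService:
        # Service implementation
        ...
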
@@ -47,7 +47,7 @@ When using the ``traffic`` field in the ``@bentoml.service`` decorator, you can
Note that they serve different purposes:

- ``concurrency``: Indicates the ideal number of simultaneous requests that a Service is designed to handle efficiently. It's a guideline for optimizing performance, particularly in terms of how batching or parallel processing is implemented. This means that the number of simultaneous requests being processed by a Service instance can still exceed the configured ``concurrency``.
- ``max_concurrency``: Acts as a hard limit on the number of requests that can be processed simultaneously by a single instance of a Service. It's used to prevent a Service from being overwhelmed by too many requests at once, which could degrade performance or lead to resource exhaustion. Requests that exceed the ``max_concurrency`` limit will be rejected to maintain QoS and ensure that each request is handled within an acceptable time frame.
- ``max_concurrency``: Acts as a hard limit on the number of requests that can be processed simultaneously by a single instance of a Service. It's used to prevent a Service from being overwhelmed by too many requests at once, which could degrade performance or lead to resource exhaustion. Requests that exceed the ``max_concurrency`` limit will be rejected to maintain QoS and ensure that each request is handled within an acceptable time frame. Note that starting from BentoML 1.2.8, ``max_concurrency`` applies to the aggregate of all workers within a Service. For prior versions, it works on a per-worker basis.
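
As an illustration, both fields can be set together in the ``traffic`` field; the numbers below are placeholders only:

.. code-block:: python

    import bentoml

    @bentoml.service(
        traffic={
            "concurrency": 32,      # ideal number of simultaneous requests
            "max_concurrency": 64,  # hard limit; requests beyond this are rejected
        },
    )
    class MyService:
        # Service implementation
        ...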

Concurrency-based autoscaling
-----------------------------
7 changes: 7 additions & 0 deletions docs/source/guides/index.rst
@@ -33,6 +33,12 @@ This chapter introduces the key features of BentoML. We recommend you read :doc:

Create an OCI-compliant image for your BentoML project and deploy it anywhere.

.. grid-item-card:: :doc:`/guides/workers`
:link: /guides/workers
:link-type: doc

Understand BentoML workers and how to configure them.

.. grid-item-card:: :doc:`/guides/build-options`
:link: /guides/build-options
:link-type: doc
@@ -100,6 +106,7 @@ This chapter introduces the key features of BentoML. We recommend you read :doc:
iotypes
deployment
containerization
workers
build-options
model-store
distributed-services
81 changes: 81 additions & 0 deletions docs/source/guides/workers.rst
@@ -0,0 +1,81 @@
=======
Workers
=======

BentoML workers enhance the parallel processing capabilities of machine learning models. Under the hood, a BentoML :doc:`Service </guides/services>` contains one or more workers, which are the processes that actually run the code logic within the Service. This design leverages the parallelism of the underlying hardware, whether it's multi-core CPUs or multi-device GPUs.

This document explains how to configure and allocate workers for different use cases.

Configure workers
-----------------

When you define a BentoML Service, use the ``workers`` parameter to set the number of workers. For example, setting ``workers=4`` launches four worker instances of the Service, each running in its own process. Workers are homogeneous, which means they all perform the same tasks.

.. code-block:: python

    @bentoml.service(
        workers=4,
    )
    class MyService:
        # Service implementation

The number of workers isn't necessarily equivalent to the number of concurrent requests a BentoML Service can serve in parallel. With optimizations like :doc:`adaptive batching </guides/adaptive-batching>` and continuous batching, each worker can potentially handle many requests simultaneously to enhance the throughput of your Service. To specify the ideal number of concurrent requests for a Service (namely, all workers within the Service), you can configure :doc:`concurrency </guides/concurrency>`.
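
For example, here is a sketch that combines ``workers`` with ``concurrency``; the values are illustrative only:

.. code-block:: python

    import bentoml

    @bentoml.service(
        workers=4,
        traffic={"concurrency": 32},  # ideal total across all 4 workers, not per worker
    )
    class MyService:
        # Service implementation
        ...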

Use cases
---------

Workers allow a BentoML Service to make full use of the underlying hardware accelerators, such as CPUs and GPUs, ensuring optimal performance and resource utilization.

The default worker count in BentoML is set to ``1``. However, depending on your computational workload and hardware configuration, you might need to adjust this number.

CPU workloads
^^^^^^^^^^^^^

Python processes are subject to the Global Interpreter Lock (GIL), which limits the execution of multiple threads in a single process. To avoid this and fully leverage multi-core CPUs, you can start multiple workers. However, be mindful of the memory implications, as each worker will load a copy of the model into memory. Ensure that your machine's memory can support the cumulative memory requirements of all workers.

You can set the number of worker processes based on the available CPU cores by setting ``workers`` to ``cpu_count``.

.. code-block:: python

    @bentoml.service(workers="cpu_count")
    class MyService:
        # Service implementation

GPU workloads
^^^^^^^^^^^^^

In scenarios with multi-device GPUs, allocating specific GPUs to different workers allows each worker to process tasks independently. This can maximize parallel processing, increase throughput, and reduce overall inference time.

You can use ``worker_index`` to identify a worker instance. It is a unique identifier for each worker process within a BentoML Service, starting from ``1``. This index is used primarily to allocate GPUs among multiple workers. One common use case is to load one model per CUDA device to ensure that each GPU is utilized efficiently and to prevent resource contention between models.

Here is an example:

.. code-block:: python

    import bentoml
    import torch
    from torchvision import models  # provides the example resnet18 model

    @bentoml.service(
        resources={"gpu": 2},
        workers=2,
    )
    class MyService:

        def __init__(self):
            # worker_index starts from 1, so subtract 1 to get the 0-based CUDA device ID
            cuda = torch.device(f"cuda:{bentoml.server_context.worker_index - 1}")
            self.model = models.resnet18(pretrained=True)
            self.model.to(cuda)

This Service dynamically determines the GPU device to use for the model by creating a ``torch.device`` object. The device ID is set by ``bentoml.server_context.worker_index - 1`` to allocate a specific GPU to each worker process. Worker 1 (``worker_index = 1``) uses GPU 0 and worker 2 (``worker_index = 2``) uses GPU 1. See the figure below for details.

.. image:: ../../_static/img/guides/workers/workers-models-gpus.png
    :width: 400px
    :align: center

When determining which device ID to assign to each worker for tasks such as loading models onto GPUs, this 1-indexing approach means you need to subtract 1 from the ``worker_index`` to get the 0-based device ID. This is because hardware devices like GPUs are usually indexed starting from 0. For more information, see GPU inference.

If you want to use multiple GPUs for distributed operations (multiple GPUs for the same worker), PyTorch and TensorFlow offer different methods:

- PyTorch: `DataParallel <https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html>`_ and `DistributedDataParallel <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html>`_
- TensorFlow: `Distributed training <https://www.tensorflow.org/guide/distributed_training>`_
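
For instance, here is a minimal PyTorch sketch, assuming a single worker with two visible GPUs; the ``resnet18`` model is only a placeholder:

.. code-block:: python

    import bentoml
    import torch
    from torchvision import models

    @bentoml.service(
        resources={"gpu": 2},
        workers=1,
    )
    class MyService:

        def __init__(self):
            model = models.resnet18(pretrained=True)
            # Replicate the model across both GPUs; inputs are scattered across devices
            # and outputs are gathered back on cuda:0
            self.model = torch.nn.DataParallel(model, device_ids=[0, 1]).to("cuda:0")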