FEAT: add a docker-compose-distributed example with multiple workers #1064

bufferoverflow
Contributor

I was unable to get a distributed setup working with this docker compose file. I suspect a connection back from the supervisor to the worker is missing. It would be great if someone could guide me on how to make this work.

@XprobeBot XprobeBot added this to the v0.9.1 milestone Feb 29, 2024
@bufferoverflow
Contributor Author

Here's the log from xinference-worker-1:

xinference-worker-1-1    | 2024-02-29 14:15:42,714 xinference.core.worker 1 INFO     Starting metrics export server at 0.0.0.0:None
xinference-worker-1-1    | 2024-02-29 14:15:42,715 xinference.core.worker 1 INFO     Checking metrics export server...
xinference-worker-1-1    | 2024-02-29 14:15:43,639 xinference.core.worker 1 INFO     Metrics server is started at: http://0.0.0.0:44539
xinference-worker-1-1    | Traceback (most recent call last):
xinference-worker-1-1    |   File "/opt/conda/bin/xinference-worker", line 8, in <module>
xinference-worker-1-1    |     sys.exit(worker())
xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
xinference-worker-1-1    |     return self.main(*args, **kwargs)
xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1078, in main
xinference-worker-1-1    |     rv = self.invoke(ctx)
xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
xinference-worker-1-1    |     return ctx.invoke(self.callback, **ctx.params)
xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
xinference-worker-1-1    |     return __callback(*args, **kwargs)
xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/site-packages/xinference/deploy/cmdline.py", line 349, in worker
xinference-worker-1-1    |     main(
xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/site-packages/xinference/deploy/worker.py", line 94, in main
xinference-worker-1-1    |     loop.run_until_complete(task)
xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
xinference-worker-1-1    |     return future.result()
xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/site-packages/xinference/deploy/worker.py", line 65, in _start_worker
xinference-worker-1-1    |     await start_worker_components(
xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/site-packages/xinference/deploy/worker.py", line 43, in start_worker_components
xinference-worker-1-1    |     await xo.create_actor(
xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/site-packages/xoscar/api.py", line 78, in create_actor
xinference-worker-1-1    |     return await ctx.create_actor(actor_cls, *args, uid=uid, address=address, **kwargs)
xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/context.py", line 143, in create_actor
xinference-worker-1-1    |     return self._process_result_message(result)
xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
xinference-worker-1-1    |     raise message.as_instanceof_cause()
xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/pool.py", line 596, in create_actor
xinference-worker-1-1    |     await self._run_coro(message.message_id, actor.__post_create__())
xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/pool.py", line 368, in _run_coro
xinference-worker-1-1    |     return await coro
xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/site-packages/xinference/core/worker.py", line 163, in __post_create__
xinference-worker-1-1    |     ] = await xo.actor_ref(
xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/site-packages/xoscar/api.py", line 125, in actor_ref
xinference-worker-1-1    |     return await ctx.actor_ref(*args, **kwargs)
xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/context.py", line 196, in actor_ref
xinference-worker-1-1    |     future = await self._call(actor_ref.address, message, wait=False)
xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/context.py", line 77, in _call
xinference-worker-1-1    |     return await self._caller.call(
xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/core.py", line 180, in call
xinference-worker-1-1    |     client = await self.get_client(router, dest_address)
xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/core.py", line 68, in get_client
xinference-worker-1-1    |     client = await router.get_client(dest_address, from_who=self)
xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/router.py", line 143, in get_client
xinference-worker-1-1    |     client = await self._create_client(client_type, address, **kw)
xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/router.py", line 157, in _create_client
xinference-worker-1-1    |     return await client_type.connect(address, local_address=local_address, **kw)
xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/communication/socket.py", line 255, in connect
xinference-worker-1-1    |     (reader, writer) = await asyncio.open_connection(host=host, port=port, **kwargs)
xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/asyncio/streams.py", line 48, in open_connection
xinference-worker-1-1    |     transport, _ = await loop.create_connection(
xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1076, in create_connection
xinference-worker-1-1    |     raise exceptions[0]
xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1060, in create_connection
xinference-worker-1-1    |     sock = await self._connect_sock(
xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 969, in _connect_sock
xinference-worker-1-1    |     await self.sock_connect(sock, address)
xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/asyncio/selector_events.py", line 501, in sock_connect
xinference-worker-1-1    |     return await fut
xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/asyncio/selector_events.py", line 541, in _sock_connect_cb
xinference-worker-1-1    |     raise OSError(err, f'Connect call failed {address}')
xinference-worker-1-1    | ConnectionRefusedError: [address=0.0.0.0:30001, pid=1] [Errno 111] Connect call failed ('0.0.0.0', 38519)
xinference-worker-1-1 exited with code 1

The UI is accessible at http://0.0.0.0:9997/ui/, but connecting from another worker via xinference-worker -e 'http://0.0.0.0:9997' also fails.
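One thing that may be worth ruling out here, although the thread does not confirm it as the cause: 0.0.0.0 is the wildcard address a server binds to, and as a connect target inside a worker container it refers to that container itself, not to the supervisor. The sketch below flags endpoints of that kind. The function name `endpoint_host_problem` is hypothetical and not part of Xinference.

```python
import ipaddress
from typing import Optional
from urllib.parse import urlsplit


def endpoint_host_problem(endpoint: str) -> Optional[str]:
    """Return a short description if the endpoint host cannot be reached
    from another container, or None if nothing suspicious is found."""
    host = urlsplit(endpoint).hostname
    if host is None:
        return "endpoint has no host part"
    if host == "localhost":
        return "localhost points at the calling container itself"
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal, e.g. a compose service name resolved via DNS.
        return None
    if addr.is_unspecified:
        return "wildcard bind address, not a routable destination"
    if addr.is_loopback:
        return "loopback address points at the calling container itself"
    return None
```

With this check, `endpoint_host_problem("http://0.0.0.0:9997")` reports a problem, while an endpoint that uses a service name such as `http://xinference-supervisor:9997` passes.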

@ChengjieLi28
Contributor

ChengjieLi28 commented Mar 1, 2024

Hi @bufferoverflow. Thanks for contributing! Could you please modify your PR in these respects:

  1. Provide separate docker compose files for the distributed and local setups
  2. Consider how to start distributed xinference in a more generic way, e.g. consider a more generic number of workers and worker ip addresses
  3. Use an image with a version tag in the docker compose file instead of nightly-main

@bufferoverflow
Contributor Author

Thanks @ChengjieLi28, I'll change it accordingly, but the main problem is that it does not work on my end with multiple workers. Am I missing some parameters?

@ChengjieLi28
Contributor

Thanks @ChengjieLi28, I'll change it accordingly, but the main problem is that it does not work on my end with multiple workers. Am I missing some parameters?

I tried your docker-compose.yml on my machine. It seems the supervisor must already be running before the workers are started. From the log, the worker tries to connect to the supervisor's RESTful API endpoint before the supervisor has started.
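The ordering requirement described above can be approximated by having each worker wait until the supervisor's port accepts TCP connections before it starts. This is a minimal sketch using only the Python standard library. `wait_for_port` is a hypothetical helper, not an Xinference API, and an open port only shows that something is listening, not that the supervisor is fully initialised.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0,
                  interval: float = 1.0) -> bool:
    """Poll host:port until a TCP connection succeeds or the timeout expires.

    Returns True once a connection is accepted, False if the deadline passes.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means a listener is up on that port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Refused, unreachable, or not yet resolvable: retry until deadline.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

A worker entrypoint could call `wait_for_port("xinference-supervisor", 9997)` and launch `xinference-worker` only when it returns True. The host name and port are the ones that appear in the logs in this thread.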

@bufferoverflow bufferoverflow force-pushed the feat/extend-docker-compose-to-multiple-container branch from c58d8d5 to 2cb7d90 Compare March 1, 2024 07:33
@bufferoverflow bufferoverflow changed the title feat: extend docker-compose example to multiple workers feat: add a docker-compose-distributed example with multiple workers Mar 1, 2024
@bufferoverflow
Contributor Author

@ChengjieLi28 The issue was the missing --supervisor-port parameter, which sets the port the workers connect to. I have now created a separate docker compose file and switched to a tagged image.

  1. Consider how to start distributed xinference in a more generic way, e.g. consider a more generic number of workers and worker ip addresses

I'm not sure about this, as docker compose is quite static. What did you have in mind?

By the way, the supervisor requires a GPU as well, but from my perspective it should not.
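To make the fix concrete, here is a hedged sketch of the two command lines as Python argument lists. Only `--supervisor-port` and `-e` are taken from this thread. The `--host` and `--port` flags are assumptions about the CLI, and both function names are hypothetical. The host name and port numbers match the logs posted in this conversation (`xinference-supervisor`, 9997 for the REST API, 9999 for the supervisor).

```python
from typing import List


def supervisor_command(host: str, rest_port: int, supervisor_port: int) -> List[str]:
    """Build the supervisor command with an explicit, fixed supervisor port."""
    return [
        "xinference-supervisor",
        "--host", host,                            # assumed flag
        "--port", str(rest_port),                  # assumed flag (REST API / UI)
        "--supervisor-port", str(supervisor_port), # flag named in this thread
    ]


def worker_command(host: str, supervisor_host: str, rest_port: int) -> List[str]:
    """Build a worker command that points at the supervisor's REST endpoint."""
    return [
        "xinference-worker",
        "--host", host,                            # assumed flag
        "-e", f"http://{supervisor_host}:{rest_port}",
    ]
```

Pinning the supervisor port means it can be listed in the compose file's port mappings instead of being chosen at random on each start.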

@bufferoverflow bufferoverflow force-pushed the feat/extend-docker-compose-to-multiple-container branch from 2cb7d90 to c009b8a Compare March 1, 2024 07:48
@XprobeBot XprobeBot modified the milestones: v0.9.1, v0.9.2 Mar 1, 2024
@ChengjieLi28
Contributor

@ChengjieLi28 The issue was the missing --supervisor-port parameter, which sets the port the workers connect to. I have now created a separate docker compose file and switched to a tagged image.

  1. Consider how to start distributed xinference in a more generic way, e.g. consider a more generic number of workers and worker ip addresses

I'm not sure about this, as docker compose is quite static. What did you have in mind?

By the way, the supervisor requires a GPU as well, but from my perspective it should not.

  • Supervisor can be started without a GPU.
  • Keep your two workers as an example, and add comments indicating that more workers can be added following the same pattern.
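Since a compose file is static, one way to picture "adding workers following the same pattern" is to generate the repeated service entries. The sketch below builds them as plain dictionaries. `worker_services` is a hypothetical helper, the `--host` flag is an assumption about the CLI, and the image name is left to the caller so that no specific tag is implied.

```python
from typing import Dict


def worker_services(count: int, image: str, endpoint: str) -> Dict[str, dict]:
    """Return compose-style service definitions for `count` workers.

    Each worker differs only in its service name, which is also used as the
    host name it advertises.
    """
    services: Dict[str, dict] = {}
    for index in range(1, count + 1):
        name = f"xinference-worker-{index}"
        services[name] = {
            "image": image,
            "command": f"xinference-worker -e {endpoint} --host {name}",
            # depends_on only orders container start; it does not wait
            # for the supervisor to be ready.
            "depends_on": ["xinference-supervisor"],
        }
    return services
```

Calling `worker_services(2, image, "http://xinference-supervisor:9997")` yields entries named `xinference-worker-1` and `xinference-worker-2`, mirroring the two workers kept as the example.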

@ChengjieLi28
Contributor

Also, keep the volume-related comments in this new file. This lets users use the mounted directory without having to download the model repeatedly.

@bufferoverflow
Contributor Author

xinference-supervisor without GPU:

xinference-supervisor-1  | CUDA Version 12.1.1
xinference-supervisor-1  | 
xinference-supervisor-1  | Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
xinference-supervisor-1  | 
xinference-supervisor-1  | This container image and its contents are governed by the NVIDIA Deep Learning Container License.
xinference-supervisor-1  | By pulling and using the container, you accept the terms and conditions of this license:
xinference-supervisor-1  | https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
xinference-supervisor-1  | 
xinference-supervisor-1  | A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
xinference-supervisor-1  | 
xinference-supervisor-1  | WARNING: The NVIDIA Driver was not detected.  GPU functionality will not be available.
xinference-supervisor-1  |    Use the NVIDIA Container Toolkit to start this container with GPU support; see
xinference-supervisor-1  |    https://docs.nvidia.com/datacenter/cloud-native/ .
xinference-supervisor-1  | 
xinference-supervisor-1  | Traceback (most recent call last):
xinference-supervisor-1  |   File "/opt/conda/lib/python3.10/site-packages/llama_cpp/llama_cpp.py", line 59, in _load_shared_library
xinference-supervisor-1  |     return ctypes.CDLL(str(_lib_path), **cdll_args) # type: ignore
xinference-supervisor-1  |   File "/opt/conda/lib/python3.10/ctypes/__init__.py", line 374, in __init__
xinference-supervisor-1  |     self._handle = _dlopen(self._name, mode)
xinference-supervisor-1  | OSError: libcuda.so.1: cannot open shared object file: No such file or directory
xinference-supervisor-1  | 
xinference-supervisor-1  | During handling of the above exception, another exception occurred:
xinference-supervisor-1  | 
xinference-supervisor-1  | Traceback (most recent call last):
xinference-supervisor-1  |   File "/opt/conda/bin/xinference-supervisor", line 5, in <module>
xinference-supervisor-1  |     from xinference.deploy.cmdline import supervisor
xinference-supervisor-1  |   File "/opt/conda/lib/python3.10/site-packages/xinference/__init__.py", line 38, in <module>
xinference-supervisor-1  |     _install()
xinference-supervisor-1  |   File "/opt/conda/lib/python3.10/site-packages/xinference/__init__.py", line 35, in _install
xinference-supervisor-1  |     install_model()
xinference-supervisor-1  |   File "/opt/conda/lib/python3.10/site-packages/xinference/model/__init__.py", line 19, in _install
xinference-supervisor-1  |     llm_install()
xinference-supervisor-1  |   File "/opt/conda/lib/python3.10/site-packages/xinference/model/llm/__init__.py", line 50, in _install
xinference-supervisor-1  |     from .ggml.chatglm import ChatglmCppChatModel
xinference-supervisor-1  |   File "/opt/conda/lib/python3.10/site-packages/xinference/model/llm/ggml/chatglm.py", line 22, in <module>
xinference-supervisor-1  |     from ....types import (
xinference-supervisor-1  |   File "/opt/conda/lib/python3.10/site-packages/xinference/types.py", line 345, in <module>
xinference-supervisor-1  |     from llama_cpp import Llama
xinference-supervisor-1  |   File "/opt/conda/lib/python3.10/site-packages/llama_cpp/__init__.py", line 1, in <module>
xinference-supervisor-1  |     from .llama_cpp import *
xinference-supervisor-1  |   File "/opt/conda/lib/python3.10/site-packages/llama_cpp/llama_cpp.py", line 72, in <module>
xinference-supervisor-1  |     _lib = _load_shared_library(_lib_base_name)
xinference-supervisor-1  |   File "/opt/conda/lib/python3.10/site-packages/llama_cpp/llama_cpp.py", line 61, in _load_shared_library
xinference-supervisor-1  |     raise RuntimeError(f"Failed to load shared library '{_lib_path}': {e}")
xinference-supervisor-1  | RuntimeError: Failed to load shared library '/opt/conda/lib/python3.10/site-packages/llama_cpp/libllama.so': libcuda.so.1: cannot open shared object file: No such file or directory
xinference-supervisor-1 exited with code 0

I agree the supervisor itself should work without a GPU, but I could not find a parameter that lets it start without one. I guess this is because the docker image is built for GPU.
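The traceback above fails inside `ctypes.CDLL` because `libcuda.so.1` cannot be opened. A quick way to check whether a container has the NVIDIA driver library available, before starting anything that imports `llama_cpp`, is to probe it the same way. This is a diagnostic sketch only. `cuda_driver_available` is a hypothetical helper and does not change how Xinference imports its backends.

```python
import ctypes


def cuda_driver_available(library: str = "libcuda.so.1") -> bool:
    """Return True if the shared library can be loaded by the dynamic linker."""
    try:
        # Same mechanism llama_cpp uses in the traceback: dlopen via ctypes.
        ctypes.CDLL(library)
    except OSError:
        # Raised when the library (or one of its dependencies) is missing.
        return False
    return True
```

In a container started without GPU access, this would be expected to return False, which matches the `OSError` in the supervisor log.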

@bufferoverflow bufferoverflow force-pushed the feat/extend-docker-compose-to-multiple-container branch 2 times, most recently from 3ccca40 to ac14c9e Compare March 1, 2024 08:27
@bufferoverflow
Contributor Author

@ChengjieLi28 added the volume and worker comments. Please let me know if there is anything else I can do.

@bufferoverflow bufferoverflow force-pushed the feat/extend-docker-compose-to-multiple-container branch from ac14c9e to 2f1bb4e Compare March 1, 2024 08:53
@ChengjieLi28
Contributor

@ChengjieLi28 added the volume and worker comments. Please let me know if there is anything else I can do.

Great! Thank you. I will test this file on my machine. If everything works, I will approve this PR, and it will be included in the next release.

@bufferoverflow bufferoverflow force-pushed the feat/extend-docker-compose-to-multiple-container branch from 2f1bb4e to 1801e05 Compare March 1, 2024 09:09
@ChengjieLi28 ChengjieLi28 changed the title feat: add a docker-compose-distributed example with multiple workers FEAT: add a docker-compose-distributed example with multiple workers Mar 1, 2024
@ChengjieLi28
Contributor

Hi @bufferoverflow. When I tested this PR on my machine, some errors occurred and I couldn't open the web UI on port 9997.

(xinf) lichengjie@xprobe:~/TestDocker$ docker-compose -f compose2.yml up
[+] Running 4/0
 ✔ Container testdocker-xinference-1             Created                                                                                  0.0s
 ✔ Container testdocker-xinference-worker-2-1    Created                                                                                  0.0s
 ✔ Container testdocker-xinference-worker-1-1    Created                                                                                  0.0s
 ✔ Container testdocker-xinference-supervisor-1  Created                                                                                  0.0s
Attaching to testdocker-xinference-1, testdocker-xinference-supervisor-1, testdocker-xinference-worker-1-1, testdocker-xinference-worker-2-1
testdocker-xinference-worker-2-1    |
testdocker-xinference-worker-2-1    | ==========
testdocker-xinference-worker-2-1    | == CUDA ==
testdocker-xinference-worker-2-1    | ==========
testdocker-xinference-worker-2-1    |
testdocker-xinference-worker-2-1    | CUDA Version 12.1.1
testdocker-xinference-worker-2-1    |
testdocker-xinference-worker-2-1    |
testdocker-xinference-worker-2-1    | Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
testdocker-xinference-worker-2-1    |
testdocker-xinference-worker-2-1    |
testdocker-xinference-worker-2-1    | This container image and its contents are governed by the NVIDIA Deep Learning Container License.
testdocker-xinference-worker-2-1    | By pulling and using the container, you accept the terms and conditions of this license:
testdocker-xinference-worker-2-1    | https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
testdocker-xinference-worker-2-1    |
testdocker-xinference-worker-2-1    | A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
testdocker-xinference-worker-2-1    |
testdocker-xinference-worker-2-1    |
testdocker-xinference-worker-2-1    |
testdocker-xinference-1             |
testdocker-xinference-1             | ==========
testdocker-xinference-1             | == CUDA ==
testdocker-xinference-1             | ==========
testdocker-xinference-1             |
testdocker-xinference-1             | CUDA Version 12.1.1
testdocker-xinference-1             |
testdocker-xinference-1             |
testdocker-xinference-1             | Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
testdocker-xinference-1             |
testdocker-xinference-1             |
testdocker-xinference-1             | This container image and its contents are governed by the NVIDIA Deep Learning Container License.
testdocker-xinference-1             | By pulling and using the container, you accept the terms and conditions of this license:
testdocker-xinference-1             | https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
testdocker-xinference-1             |
testdocker-xinference-1             | A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
testdocker-xinference-1             |
testdocker-xinference-1             |
testdocker-xinference-1             |
testdocker-xinference-supervisor-1  |
testdocker-xinference-supervisor-1  | ==========
testdocker-xinference-supervisor-1  | == CUDA ==
testdocker-xinference-supervisor-1  | ==========
testdocker-xinference-supervisor-1  |
testdocker-xinference-supervisor-1  | CUDA Version 12.1.1
testdocker-xinference-supervisor-1  |
testdocker-xinference-supervisor-1  | Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
testdocker-xinference-supervisor-1  |
testdocker-xinference-supervisor-1  |
testdocker-xinference-supervisor-1  | This container image and its contents are governed by the NVIDIA Deep Learning Container License.
testdocker-xinference-supervisor-1  | By pulling and using the container, you accept the terms and conditions of this license:
testdocker-xinference-supervisor-1  | https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
testdocker-xinference-supervisor-1  |
testdocker-xinference-supervisor-1  | A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
testdocker-xinference-supervisor-1  |
testdocker-xinference-supervisor-1  |
testdocker-xinference-supervisor-1  |
testdocker-xinference-worker-1-1    |
testdocker-xinference-worker-1-1    | ==========
testdocker-xinference-worker-1-1    | == CUDA ==
testdocker-xinference-worker-1-1    | ==========
testdocker-xinference-worker-1-1    |
testdocker-xinference-worker-1-1    | CUDA Version 12.1.1
testdocker-xinference-worker-1-1    |
testdocker-xinference-worker-1-1    |
testdocker-xinference-worker-1-1    | Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
testdocker-xinference-worker-1-1    |
testdocker-xinference-worker-1-1    | This container image and its contents are governed by the NVIDIA Deep Learning Container License.
testdocker-xinference-worker-1-1    | By pulling and using the container, you accept the terms and conditions of this license:
testdocker-xinference-worker-1-1    | https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
testdocker-xinference-worker-1-1    |
testdocker-xinference-worker-1-1    | A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
testdocker-xinference-worker-1-1    |
testdocker-xinference-worker-1-1    |
testdocker-xinference-worker-1-1    |
testdocker-xinference-1 exited with code 0
testdocker-xinference-worker-2-1    | Traceback (most recent call last):
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/site-packages/urllib3/connection.py", line 174, in _new_conn
testdocker-xinference-worker-2-1    |     conn = connection.create_connection(
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/site-packages/urllib3/util/connection.py", line 95, in create_connection
testdocker-xinference-worker-2-1    |     raise err
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/site-packages/urllib3/util/connection.py", line 85, in create_connection
testdocker-xinference-worker-2-1    |     sock.connect(sa)
testdocker-xinference-worker-2-1    | ConnectionRefusedError: [Errno 111] Connection refused
testdocker-xinference-worker-2-1    |
testdocker-xinference-worker-2-1    | During handling of the above exception, another exception occurred:
testdocker-xinference-worker-2-1    |
testdocker-xinference-worker-2-1    | Traceback (most recent call last):
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/site-packages/urllib3/connectionpool.py", line 715, in urlopen
testdocker-xinference-worker-2-1    |     httplib_response = self._make_request(
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/site-packages/urllib3/connectionpool.py", line 416, in _make_request
testdocker-xinference-worker-2-1    |     conn.request(method, url, **httplib_request_kw)
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/site-packages/urllib3/connection.py", line 244, in request
testdocker-xinference-worker-2-1    |     super(HTTPConnection, self).request(method, url, body=body, headers=headers)
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/http/client.py", line 1283, in request
testdocker-xinference-worker-2-1    |     self._send_request(method, url, body, headers, encode_chunked)
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/http/client.py", line 1329, in _send_request
testdocker-xinference-worker-2-1    |     self.endheaders(body, encode_chunked=encode_chunked)
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/http/client.py", line 1278, in endheaders
testdocker-xinference-worker-2-1    |     self._send_output(message_body, encode_chunked=encode_chunked)
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/http/client.py", line 1038, in _send_output
testdocker-xinference-worker-2-1    |     self.send(msg)
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/http/client.py", line 976, in send
testdocker-xinference-worker-2-1    |     self.connect()
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/site-packages/urllib3/connection.py", line 205, in connect
testdocker-xinference-worker-2-1    |     conn = self._new_conn()
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/site-packages/urllib3/connection.py", line 186, in _new_conn
testdocker-xinference-worker-2-1    |     raise NewConnectionError(
testdocker-xinference-worker-2-1    | urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7ffa294f1570>: Failed to establish a new connection: [Errno 111] Connection refused
testdocker-xinference-worker-2-1    |
testdocker-xinference-worker-2-1    | During handling of the above exception, another exception occurred:
testdocker-xinference-worker-2-1    |
testdocker-xinference-worker-2-1    | Traceback (most recent call last):
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/site-packages/requests/adapters.py", line 486, in send
testdocker-xinference-worker-2-1    |     resp = conn.urlopen(
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/site-packages/urllib3/connectionpool.py", line 799, in urlopen
testdocker-xinference-worker-2-1    |     retries = retries.increment(
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/site-packages/urllib3/util/retry.py", line 592, in increment
testdocker-xinference-worker-2-1    |     raise MaxRetryError(_pool, url, error or ResponseError(cause))
testdocker-xinference-worker-2-1    | urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='xinference-supervisor', port=9997): Max retries exceeded with url: /v1/cluster/auth (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7ffa294f1570>: Failed to establish a new connection: [Errno 111] Connection refused'))
testdocker-xinference-worker-2-1    |
testdocker-xinference-worker-2-1    | During handling of the above exception, another exception occurred:
testdocker-xinference-worker-2-1    |
testdocker-xinference-worker-2-1    | Traceback (most recent call last):
testdocker-xinference-worker-2-1    |   File "/opt/conda/bin/xinference-worker", line 8, in <module>
testdocker-xinference-worker-2-1    |     sys.exit(worker())
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
testdocker-xinference-worker-2-1    |     return self.main(*args, **kwargs)
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1078, in main
testdocker-xinference-worker-2-1    |     rv = self.invoke(ctx)
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
testdocker-xinference-worker-2-1    |     return ctx.invoke(self.callback, **ctx.params)
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
testdocker-xinference-worker-2-1    |     return __callback(*args, **kwargs)
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/site-packages/xinference/deploy/cmdline.py", line 345, in worker
testdocker-xinference-worker-2-1    |     client = RESTfulClient(base_url=endpoint)
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/site-packages/xinference/client/restful/restful_client.py", line 651, in __init__
testdocker-xinference-worker-2-1    |     self._check_cluster_authenticated()
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/site-packages/xinference/client/restful/restful_client.py", line 667, in _check_cluster_authenticated
testdocker-xinference-worker-2-1    |     response = requests.get(url)
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/site-packages/requests/api.py", line 73, in get
testdocker-xinference-worker-2-1    |     return request("get", url, params=params, **kwargs)
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/site-packages/requests/api.py", line 59, in request
testdocker-xinference-worker-2-1    |     return session.request(method=method, url=url, **kwargs)
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/site-packages/requests/sessions.py", line 589, in request
testdocker-xinference-worker-2-1    |     resp = self.send(prep, **send_kwargs)
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/site-packages/requests/sessions.py", line 703, in send
testdocker-xinference-worker-2-1    |     r = adapter.send(request, **kwargs)
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/site-packages/requests/adapters.py", line 519, in send
testdocker-xinference-worker-2-1    |     raise ConnectionError(e, request=request)
testdocker-xinference-worker-2-1    | requests.exceptions.ConnectionError: HTTPConnectionPool(host='xinference-supervisor', port=9997): Max retries exceeded with url: /v1/cluster/auth (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7ffa294f1570>: Failed to establish a new connection: [Errno 111] Connection refused'))
testdocker-xinference-worker-2-1    |
testdocker-xinference-supervisor-1  | 2024-03-01 09:06:06,120 xinference.core.supervisor 61 INFO     Xinference supervisor xinference-supervisor:9999 started
testdocker-xinference-worker-1-1    | Traceback (most recent call last):
testdocker-xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/site-packages/urllib3/connection.py", line 174, in _new_conn
testdocker-xinference-worker-1-1    |     conn = connection.create_connection(
testdocker-xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/site-packages/urllib3/util/connection.py", line 95, in create_connection
testdocker-xinference-worker-1-1    |     raise err
testdocker-xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/site-packages/urllib3/util/connection.py", line 85, in create_connection
testdocker-xinference-worker-1-1    |     sock.connect(sa)
testdocker-xinference-worker-1-1    | ConnectionRefusedError: [Errno 111] Connection refused
testdocker-xinference-worker-1-1    |
testdocker-xinference-worker-1-1    | During handling of the above exception, another exception occurred:
testdocker-xinference-worker-1-1    |
testdocker-xinference-worker-1-1    | Traceback (most recent call last):
testdocker-xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/site-packages/urllib3/connectionpool.py", line 715, in urlopen
testdocker-xinference-worker-1-1    |     httplib_response = self._make_request(
testdocker-xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/site-packages/urllib3/connectionpool.py", line 416, in _make_request
testdocker-xinference-worker-1-1    |     conn.request(method, url, **httplib_request_kw)
testdocker-xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/site-packages/urllib3/connection.py", line 244, in request
testdocker-xinference-worker-1-1    |     super(HTTPConnection, self).request(method, url, body=body, headers=headers)
testdocker-xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/http/client.py", line 1283, in request
testdocker-xinference-worker-1-1    |     self._send_request(method, url, body, headers, encode_chunked)
testdocker-xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/http/client.py", line 1329, in _send_request
testdocker-xinference-worker-1-1    |     self.endheaders(body, encode_chunked=encode_chunked)
testdocker-xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/http/client.py", line 1278, in endheaders
testdocker-xinference-worker-1-1    |     self._send_output(message_body, encode_chunked=encode_chunked)
testdocker-xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/http/client.py", line 1038, in _send_output
testdocker-xinference-worker-1-1    |     self.send(msg)
testdocker-xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/http/client.py", line 976, in send
testdocker-xinference-worker-1-1    |     self.connect()
testdocker-xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/site-packages/urllib3/connection.py", line 205, in connect
testdocker-xinference-worker-1-1    |     conn = self._new_conn()
testdocker-xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/site-packages/urllib3/connection.py", line 186, in _new_conn
testdocker-xinference-worker-1-1    |     raise NewConnectionError(
testdocker-xinference-worker-1-1    | urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f4d991a5780>: Failed to establish a new connection: [Errno 111] Connection refused
testdocker-xinference-worker-1-1    |
testdocker-xinference-worker-1-1    | During handling of the above exception, another exception occurred:
testdocker-xinference-worker-1-1    |
testdocker-xinference-worker-1-1    | Traceback (most recent call last):
testdocker-xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/site-packages/requests/adapters.py", line 486, in send
testdocker-xinference-worker-1-1    |     resp = conn.urlopen(
testdocker-xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/site-packages/urllib3/connectionpool.py", line 799, in urlopen
testdocker-xinference-worker-1-1    |     retries = retries.increment(
testdocker-xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/site-packages/urllib3/util/retry.py", line 592, in increment
testdocker-xinference-worker-1-1    |     raise MaxRetryError(_pool, url, error or ResponseError(cause))
testdocker-xinference-worker-1-1    | urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='xinference-supervisor', port=9997): Max retries exceeded with url: /v1/cluster/auth (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f4d991a5780>: Failed to establish a new connection: [Errno 111] Connection refused'))
testdocker-xinference-worker-1-1    |
testdocker-xinference-worker-1-1    | During handling of the above exception, another exception occurred:
testdocker-xinference-worker-1-1    |
testdocker-xinference-worker-1-1    | Traceback (most recent call last):
testdocker-xinference-worker-1-1    |   File "/opt/conda/bin/xinference-worker", line 8, in <module>
testdocker-xinference-worker-1-1    |     sys.exit(worker())
testdocker-xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
testdocker-xinference-worker-1-1    |     return self.main(*args, **kwargs)
testdocker-xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1078, in main
testdocker-xinference-worker-1-1    |     rv = self.invoke(ctx)
testdocker-xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
testdocker-xinference-worker-1-1    |     return ctx.invoke(self.callback, **ctx.params)
testdocker-xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
testdocker-xinference-worker-1-1    |     return __callback(*args, **kwargs)
testdocker-xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/site-packages/xinference/deploy/cmdline.py", line 345, in worker
testdocker-xinference-worker-1-1    |     client = RESTfulClient(base_url=endpoint)
testdocker-xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/site-packages/xinference/client/restful/restful_client.py", line 651, in __init__
testdocker-xinference-worker-1-1    |     self._check_cluster_authenticated()
testdocker-xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/site-packages/xinference/client/restful/restful_client.py", line 667, in _check_cluster_authenticated
testdocker-xinference-worker-1-1    |     response = requests.get(url)
testdocker-xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/site-packages/requests/api.py", line 73, in get
testdocker-xinference-worker-1-1    |     return request("get", url, params=params, **kwargs)
testdocker-xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/site-packages/requests/api.py", line 59, in request
testdocker-xinference-worker-1-1    |     return session.request(method=method, url=url, **kwargs)
testdocker-xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/site-packages/requests/sessions.py", line 589, in request
testdocker-xinference-worker-1-1    |     resp = self.send(prep, **send_kwargs)
testdocker-xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/site-packages/requests/sessions.py", line 703, in send
testdocker-xinference-worker-1-1    |     r = adapter.send(request, **kwargs)
testdocker-xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/site-packages/requests/adapters.py", line 519, in send
testdocker-xinference-worker-1-1    |     raise ConnectionError(e, request=request)
testdocker-xinference-worker-1-1    | requests.exceptions.ConnectionError: HTTPConnectionPool(host='xinference-supervisor', port=9997): Max retries exceeded with url: /v1/cluster/auth (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f4d991a5780>: Failed to establish a new connection: [Errno 111] Connection refused'))
testdocker-xinference-worker-1-1    |
testdocker-xinference-worker-2-1 exited with code 0
testdocker-xinference-worker-1-1 exited with code 0
testdocker-xinference-worker-1-1    |
testdocker-xinference-worker-1-1    |
testdocker-xinference-worker-2-1    | Traceback (most recent call last):
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/site-packages/urllib3/connection.py", line 174, in _new_conn
testdocker-xinference-worker-2-1    |     conn = connection.create_connection(
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/site-packages/urllib3/util/connection.py", line 95, in create_connection
testdocker-xinference-worker-2-1    |     raise err
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/site-packages/urllib3/util/connection.py", line 85, in create_connection
testdocker-xinference-worker-2-1    |     sock.connect(sa)
testdocker-xinference-worker-2-1    | ConnectionRefusedError: [Errno 111] Connection refused
testdocker-xinference-worker-2-1    |
testdocker-xinference-worker-2-1    | During handling of the above exception, another exception occurred:
testdocker-xinference-worker-2-1    |
testdocker-xinference-worker-2-1    | Traceback (most recent call last):
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/site-packages/urllib3/connectionpool.py", line 715, in urlopen
testdocker-xinference-worker-2-1    |     httplib_response = self._make_request(
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/site-packages/urllib3/connectionpool.py", line 416, in _make_request
testdocker-xinference-worker-2-1    |     conn.request(method, url, **httplib_request_kw)
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/site-packages/urllib3/connection.py", line 244, in request
testdocker-xinference-worker-2-1    |     super(HTTPConnection, self).request(method, url, body=body, headers=headers)
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/http/client.py", line 1283, in request
testdocker-xinference-worker-2-1    |     self._send_request(method, url, body, headers, encode_chunked)
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/http/client.py", line 1329, in _send_request
testdocker-xinference-worker-2-1    |     self.endheaders(body, encode_chunked=encode_chunked)
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/http/client.py", line 1278, in endheaders
testdocker-xinference-worker-2-1    |     self._send_output(message_body, encode_chunked=encode_chunked)
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/http/client.py", line 1038, in _send_output
testdocker-xinference-worker-2-1    |     self.send(msg)
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/http/client.py", line 976, in send
testdocker-xinference-worker-2-1    |     self.connect()
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/site-packages/urllib3/connection.py", line 205, in connect
testdocker-xinference-worker-2-1    |     conn = self._new_conn()
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/site-packages/urllib3/connection.py", line 186, in _new_conn
testdocker-xinference-worker-2-1    |     raise NewConnectionError(
testdocker-xinference-worker-2-1    | urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f4a500c93c0>: Failed to establish a new connection: [Errno 111] Connection refused
testdocker-xinference-worker-2-1    |
testdocker-xinference-worker-2-1    | During handling of the above exception, another exception occurred:
testdocker-xinference-worker-2-1    |
testdocker-xinference-worker-2-1    | Traceback (most recent call last):
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/site-packages/requests/adapters.py", line 486, in send
testdocker-xinference-worker-2-1    |     resp = conn.urlopen(
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/site-packages/urllib3/connectionpool.py", line 799, in urlopen
testdocker-xinference-worker-2-1    |     retries = retries.increment(
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/site-packages/urllib3/util/retry.py", line 592, in increment
testdocker-xinference-worker-2-1    |     raise MaxRetryError(_pool, url, error or ResponseError(cause))
testdocker-xinference-worker-2-1    | urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='xinference-supervisor', port=9997): Max retries exceeded with url: /v1/cluster/auth (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f4a500c93c0>: Failed to establish a new connection: [Errno 111] Connection refused'))
testdocker-xinference-worker-2-1    |
testdocker-xinference-worker-2-1    | During handling of the above exception, another exception occurred:
testdocker-xinference-worker-2-1    |
testdocker-xinference-worker-2-1    | Traceback (most recent call last):
testdocker-xinference-worker-2-1    |   File "/opt/conda/bin/xinference-worker", line 8, in <module>
testdocker-xinference-worker-2-1    |     sys.exit(worker())
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
testdocker-xinference-worker-2-1    |     return self.main(*args, **kwargs)
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1078, in main
testdocker-xinference-worker-2-1    |     rv = self.invoke(ctx)
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
testdocker-xinference-worker-2-1    |     return ctx.invoke(self.callback, **ctx.params)
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
testdocker-xinference-worker-2-1    |     return __callback(*args, **kwargs)
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/site-packages/xinference/deploy/cmdline.py", line 345, in worker
testdocker-xinference-worker-2-1    |     client = RESTfulClient(base_url=endpoint)
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/site-packages/xinference/client/restful/restful_client.py", line 651, in __init__
testdocker-xinference-worker-2-1    |     self._check_cluster_authenticated()
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/site-packages/xinference/client/restful/restful_client.py", line 667, in _check_cluster_authenticated
testdocker-xinference-worker-2-1    |     response = requests.get(url)
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/site-packages/requests/api.py", line 73, in get
testdocker-xinference-worker-2-1    |     return request("get", url, params=params, **kwargs)
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/site-packages/requests/api.py", line 59, in request
testdocker-xinference-worker-2-1    |     return session.request(method=method, url=url, **kwargs)
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/site-packages/requests/sessions.py", line 589, in request
testdocker-xinference-worker-2-1    |     resp = self.send(prep, **send_kwargs)
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/site-packages/requests/sessions.py", line 703, in send
testdocker-xinference-worker-2-1    |     r = adapter.send(request, **kwargs)
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/site-packages/requests/adapters.py", line 519, in send
testdocker-xinference-worker-2-1    |     raise ConnectionError(e, request=request)
testdocker-xinference-worker-2-1    | requests.exceptions.ConnectionError: HTTPConnectionPool(host='xinference-supervisor', port=9997): Max retries exceeded with url: /v1/cluster/auth (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f4a500c93c0>: Failed to establish a new connection: [Errno 111] Connection refused'))
testdocker-xinference-worker-2-1    |
testdocker-xinference-worker-1-1    | Traceback (most recent call last):
testdocker-xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/site-packages/urllib3/connection.py", line 174, in _new_conn
testdocker-xinference-worker-1-1    |     conn = connection.create_connection(
testdocker-xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/site-packages/urllib3/util/connection.py", line 95, in create_connection
testdocker-xinference-worker-1-1    |     raise err
testdocker-xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/site-packages/urllib3/util/connection.py", line 85, in create_connection
testdocker-xinference-worker-1-1    |     sock.connect(sa)
testdocker-xinference-worker-1-1    | ConnectionRefusedError: [Errno 111] Connection refused
testdocker-xinference-worker-1-1    |
testdocker-xinference-worker-1-1    | During handling of the above exception, another exception occurred:
testdocker-xinference-worker-1-1    |
testdocker-xinference-worker-1-1    | Traceback (most recent call last):
testdocker-xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/site-packages/urllib3/connectionpool.py", line 715, in urlopen
testdocker-xinference-worker-1-1    |     httplib_response = self._make_request(
testdocker-xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/site-packages/urllib3/connectionpool.py", line 416, in _make_request
testdocker-xinference-worker-1-1    |     conn.request(method, url, **httplib_request_kw)
testdocker-xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/site-packages/urllib3/connection.py", line 244, in request
testdocker-xinference-worker-1-1    |     super(HTTPConnection, self).request(method, url, body=body, headers=headers)
testdocker-xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/http/client.py", line 1283, in request
testdocker-xinference-worker-1-1    |     self._send_request(method, url, body, headers, encode_chunked)
testdocker-xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/http/client.py", line 1329, in _send_request
testdocker-xinference-worker-1-1    |     self.endheaders(body, encode_chunked=encode_chunked)
testdocker-xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/http/client.py", line 1278, in endheaders
testdocker-xinference-worker-1-1    |     self._send_output(message_body, encode_chunked=encode_chunked)
testdocker-xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/http/client.py", line 1038, in _send_output
testdocker-xinference-worker-1-1    |     self.send(msg)
testdocker-xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/http/client.py", line 976, in send
testdocker-xinference-worker-1-1    |     self.connect()
testdocker-xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/site-packages/urllib3/connection.py", line 205, in connect
testdocker-xinference-worker-1-1    |     conn = self._new_conn()
testdocker-xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/site-packages/urllib3/connection.py", line 186, in _new_conn
testdocker-xinference-worker-1-1    |     raise NewConnectionError(
testdocker-xinference-worker-1-1    | urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f0d84089510>: Failed to establish a new connection: [Errno 111] Connection refused
testdocker-xinference-worker-1-1    |
testdocker-xinference-worker-1-1    | During handling of the above exception, another exception occurred:
testdocker-xinference-worker-1-1    |
testdocker-xinference-worker-1-1    | Traceback (most recent call last):
testdocker-xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/site-packages/requests/adapters.py", line 486, in send
testdocker-xinference-worker-1-1    |     resp = conn.urlopen(
testdocker-xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/site-packages/urllib3/connectionpool.py", line 799, in urlopen
testdocker-xinference-worker-1-1    |     retries = retries.increment(
testdocker-xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/site-packages/urllib3/util/retry.py", line 592, in increment
testdocker-xinference-worker-1-1    |     raise MaxRetryError(_pool, url, error or ResponseError(cause))
testdocker-xinference-worker-1-1    | urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='xinference-supervisor', port=9997): Max retries exceeded with url: /v1/cluster/auth (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f0d84089510>: Failed to establish a new connection: [Errno 111] Connection refused'))
testdocker-xinference-worker-1-1    |
testdocker-xinference-worker-1-1    | During handling of the above exception, another exception occurred:
testdocker-xinference-worker-1-1    |
testdocker-xinference-worker-1-1    | Traceback (most recent call last):
testdocker-xinference-worker-1-1    |   File "/opt/conda/bin/xinference-worker", line 8, in <module>
testdocker-xinference-worker-1-1    |     sys.exit(worker())
testdocker-xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
testdocker-xinference-worker-1-1    |     return self.main(*args, **kwargs)
testdocker-xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1078, in main
testdocker-xinference-worker-1-1    |     rv = self.invoke(ctx)
testdocker-xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
testdocker-xinference-worker-1-1    |     return ctx.invoke(self.callback, **ctx.params)
testdocker-xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
testdocker-xinference-worker-1-1    |     return __callback(*args, **kwargs)
testdocker-xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/site-packages/xinference/deploy/cmdline.py", line 345, in worker
testdocker-xinference-worker-1-1    |     client = RESTfulClient(base_url=endpoint)
testdocker-xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/site-packages/xinference/client/restful/restful_client.py", line 651, in __init__
testdocker-xinference-worker-1-1    |     self._check_cluster_authenticated()
testdocker-xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/site-packages/xinference/client/restful/restful_client.py", line 667, in _check_cluster_authenticated
testdocker-xinference-worker-1-1    |     response = requests.get(url)
testdocker-xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/site-packages/requests/api.py", line 73, in get
testdocker-xinference-worker-1-1    |     return request("get", url, params=params, **kwargs)
testdocker-xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/site-packages/requests/api.py", line 59, in request
testdocker-xinference-worker-1-1    |     return session.request(method=method, url=url, **kwargs)
testdocker-xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/site-packages/requests/sessions.py", line 589, in request
testdocker-xinference-worker-1-1    |     resp = self.send(prep, **send_kwargs)
testdocker-xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/site-packages/requests/sessions.py", line 703, in send
testdocker-xinference-worker-1-1    |     r = adapter.send(request, **kwargs)
testdocker-xinference-worker-1-1    |   File "/opt/conda/lib/python3.10/site-packages/requests/adapters.py", line 519, in send
testdocker-xinference-worker-1-1    |     raise ConnectionError(e, request=request)
testdocker-xinference-worker-1-1    | requests.exceptions.ConnectionError: HTTPConnectionPool(host='xinference-supervisor', port=9997): Max retries exceeded with url: /v1/cluster/auth (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f0d84089510>: Failed to establish a new connection: [Errno 111] Connection refused'))
testdocker-xinference-worker-1-1    |
testdocker-xinference-worker-2-1 exited with code 0
testdocker-xinference-worker-2-1    |
testdocker-xinference-worker-2-1    |
testdocker-xinference-worker-1-1 exited with code 0
testdocker-xinference-worker-1-1    |
testdocker-xinference-worker-1-1    | This container image and its contents are governed by the NVIDIA Deep Learning Container License.
testdocker-xinference-worker-1-1    | By pulling and using the container, you accept the terms and conditions of this license:
testdocker-xinference-worker-1-1    | https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
testdocker-xinference-worker-1-1    |
testdocker-xinference-worker-1-1    | A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
testdocker-xinference-worker-1-1    |
testdocker-xinference-worker-1-1    |
testdocker-xinference-worker-1-1    |
testdocker-xinference-supervisor-1  | 2024-03-01 09:06:12,361 xinference.api.restful_api 1 INFO     Starting Xinference at endpoint: http://xinference-supervisor:9997
testdocker-xinference-worker-2-1    | Traceback (most recent call last):
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/site-packages/urllib3/connection.py", line 174, in _new_conn
testdocker-xinference-worker-2-1    |     conn = connection.create_connection(
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/site-packages/urllib3/util/connection.py", line 95, in create_connection
testdocker-xinference-worker-2-1    |     raise err
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/site-packages/urllib3/util/connection.py", line 85, in create_connection
testdocker-xinference-worker-2-1    |     sock.connect(sa)
testdocker-xinference-worker-2-1    | ConnectionRefusedError: [Errno 111] Connection refused
testdocker-xinference-worker-2-1    |
testdocker-xinference-worker-2-1    | During handling of the above exception, another exception occurred:
testdocker-xinference-worker-2-1    |
testdocker-xinference-worker-2-1    | Traceback (most recent call last):
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/site-packages/urllib3/connectionpool.py", line 715, in urlopen
testdocker-xinference-worker-2-1    |     httplib_response = self._make_request(
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/site-packages/urllib3/connectionpool.py", line 416, in _make_request
testdocker-xinference-worker-2-1    |     conn.request(method, url, **httplib_request_kw)
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/site-packages/urllib3/connection.py", line 244, in request
testdocker-xinference-worker-2-1    |     super(HTTPConnection, self).request(method, url, body=body, headers=headers)
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/http/client.py", line 1283, in request
testdocker-xinference-worker-2-1    |     self._send_request(method, url, body, headers, encode_chunked)
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/http/client.py", line 1329, in _send_request
testdocker-xinference-worker-2-1    |     self.endheaders(body, encode_chunked=encode_chunked)
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/http/client.py", line 1278, in endheaders
testdocker-xinference-worker-2-1    |     self._send_output(message_body, encode_chunked=encode_chunked)
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/http/client.py", line 1038, in _send_output
testdocker-xinference-worker-2-1    |     self.send(msg)
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/http/client.py", line 976, in send
testdocker-xinference-worker-2-1    |     self.connect()
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/site-packages/urllib3/connection.py", line 205, in connect
testdocker-xinference-worker-2-1    |     conn = self._new_conn()
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/site-packages/urllib3/connection.py", line 186, in _new_conn
testdocker-xinference-worker-2-1    |     raise NewConnectionError(
testdocker-xinference-worker-2-1    | urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7fdcae2a1810>: Failed to establish a new connection: [Errno 111] Connection refused
testdocker-xinference-worker-2-1    |
testdocker-xinference-worker-2-1    | During handling of the above exception, another exception occurred:
testdocker-xinference-worker-2-1    |
testdocker-xinference-worker-2-1    | Traceback (most recent call last):
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/site-packages/requests/adapters.py", line 486, in send
testdocker-xinference-worker-2-1    |     resp = conn.urlopen(
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/site-packages/urllib3/connectionpool.py", line 799, in urlopen
testdocker-xinference-worker-2-1    |     retries = retries.increment(
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/site-packages/urllib3/util/retry.py", line 592, in increment
testdocker-xinference-worker-2-1    |     raise MaxRetryError(_pool, url, error or ResponseError(cause))
testdocker-xinference-worker-2-1    | urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='xinference-supervisor', port=9997): Max retries exceeded with url: /v1/cluster/auth (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fdcae2a1810>: Failed to establish a new connection: [Errno 111] Connection refused'))
testdocker-xinference-worker-2-1    |
testdocker-xinference-worker-2-1    | During handling of the above exception, another exception occurred:
testdocker-xinference-worker-2-1    |
testdocker-xinference-worker-2-1    | Traceback (most recent call last):
testdocker-xinference-worker-2-1    |   File "/opt/conda/bin/xinference-worker", line 8, in <module>
testdocker-xinference-worker-2-1    |     sys.exit(worker())
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
testdocker-xinference-worker-2-1    |     return self.main(*args, **kwargs)
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1078, in main
testdocker-xinference-worker-2-1    |     rv = self.invoke(ctx)
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
testdocker-xinference-worker-2-1    |     return ctx.invoke(self.callback, **ctx.params)
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
testdocker-xinference-worker-2-1    |     return __callback(*args, **kwargs)
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/site-packages/xinference/deploy/cmdline.py", line 345, in worker
testdocker-xinference-worker-2-1    |     client = RESTfulClient(base_url=endpoint)
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/site-packages/xinference/client/restful/restful_client.py", line 651, in __init__
testdocker-xinference-worker-2-1    |     self._check_cluster_authenticated()
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/site-packages/xinference/client/restful/restful_client.py", line 667, in _check_cluster_authenticated
testdocker-xinference-worker-2-1    |     response = requests.get(url)
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/site-packages/requests/api.py", line 73, in get
testdocker-xinference-worker-2-1    |     return request("get", url, params=params, **kwargs)
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/site-packages/requests/api.py", line 59, in request
testdocker-xinference-worker-2-1    |     return session.request(method=method, url=url, **kwargs)
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/site-packages/requests/sessions.py", line 589, in request
testdocker-xinference-worker-2-1    |     resp = self.send(prep, **send_kwargs)
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/site-packages/requests/sessions.py", line 703, in send
testdocker-xinference-worker-2-1    |     r = adapter.send(request, **kwargs)
testdocker-xinference-worker-2-1    |   File "/opt/conda/lib/python3.10/site-packages/requests/adapters.py", line 519, in send
testdocker-xinference-worker-2-1    |     raise ConnectionError(e, request=request)
testdocker-xinference-worker-2-1    | requests.exceptions.ConnectionError: HTTPConnectionPool(host='xinference-supervisor', port=9997): Max retries exceeded with url: /v1/cluster/auth (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fdcae2a1810>: Failed to establish a new connection: [Errno 111] Connection refused'))
testdocker-xinference-worker-2-1    |
testdocker-xinference-worker-1-1    | 2024-03-01 09:06:12,934 xinference.core.worker 1 INFO     Starting metrics export server at 0.0.0.0:None
testdocker-xinference-worker-1-1    | 2024-03-01 09:06:12,935 xinference.core.worker 1 INFO     Checking metrics export server...
testdocker-xinference-worker-2-1 exited with code 1
testdocker-xinference-worker-2-1    |
testdocker-xinference-worker-2-1    |
testdocker-xinference-worker-1-1    | 2024-03-01 09:06:15,448 xinference.core.worker 1 INFO     Metrics server is started at: http://0.0.0.0:33305
testdocker-xinference-worker-1-1    | 2024-03-01 09:06:15,456 xinference.core.worker 1 INFO     Xinference worker xinference-worker-1:30001 started
testdocker-xinference-worker-1-1    | 2024-03-01 09:06:15,456 xinference.core.worker 1 INFO     Purge cache directory: /root/.xinference/cache
testdocker-xinference-worker-2-1    | 2024-03-01 09:06:15,571 xinference.core.worker 1 INFO     Starting metrics export server at 0.0.0.0:None
testdocker-xinference-worker-2-1    | 2024-03-01 09:06:15,572 xinference.core.worker 1 INFO     Checking metrics export server...
testdocker-xinference-worker-2-1    | 2024-03-01 09:06:18,105 xinference.core.worker 1 INFO     Metrics server is started at: http://0.0.0.0:40331
testdocker-xinference-worker-2-1    | 2024-03-01 09:06:18,111 xinference.core.worker 1 INFO     Xinference worker xinference-worker-2:30002 started
testdocker-xinference-worker-2-1    | 2024-03-01 09:06:18,112 xinference.core.worker 1 INFO     Purge cache directory: /root/.xinference/cache

@ChengjieLi28
Contributor

It seems that the workers and the supervisor are started together and rely on the restart policy to eventually start successfully. This can confuse users. Also, port 9997 of the supervisor needs to be published to the host; otherwise it is not accessible externally.

@bufferoverflow
Contributor Author

I just added a healthcheck and depends_on condition with 4ccd3a0

@ChengjieLi28
Contributor

> I just added a healthcheck and depends_on condition with 4ccd3a0

Thanks. I have already tested your PR on my machine, and everything works fine. Please reduce the healthcheck interval and I will merge this PR.

@bufferoverflow
Contributor Author

@ChengjieLi28 thanks for the feedback. I just pushed 20221c0, which sets interval and start_period to 5s.
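
For readers following the thread, here is a minimal sketch of how a healthcheck plus a depends_on condition can gate worker startup on the supervisor. It is not the merged file. The image tag and the curl-based probe are assumptions (curl may not be present in the image), and the `command:` lines are left out because the exact CLI flags are not shown in this conversation. The service names, port 9997, and the `/v1/cluster/auth` path are taken from the logs above.

```yaml
# Sketch only: not the contents of the merged compose file.
services:
  xinference-supervisor:
    image: xprobe/xinference:latest   # assumed image tag
    # command: omitted here; use the supervisor start command from your setup
    ports:
      - "9997:9997"                   # publish the supervisor port to the host
    healthcheck:
      # Assumes curl exists in the image. The URL is the one the workers
      # request in the traceback above.
      test: ["CMD", "curl", "-f", "http://xinference-supervisor:9997/v1/cluster/auth"]
      interval: 5s
      start_period: 5s

  xinference-worker-1:
    image: xprobe/xinference:latest   # assumed image tag
    # command: omitted here; use the worker start command from your setup
    depends_on:
      xinference-supervisor:
        condition: service_healthy    # start only after the healthcheck passes
```

With `condition: service_healthy`, Compose delays the worker container until the supervisor's healthcheck succeeds, so the workers should no longer hit the `Connection refused` error and depend on the restart policy to recover.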

@ChengjieLi28 ChengjieLi28 merged commit 653e409 into xorbitsai:main Mar 6, 2024
2 checks passed