DOC:ADD FAQ IN troubleshooting.rst #911

Merged · 5 commits · Jan 19, 2024
97 changes: 97 additions & 0 deletions doc/source/getting_started/troubleshooting.rst

@@ -63,3 +63,100 @@ Say if your CUDA driver version is 11.8, then you can install PyTorch with the following command:
.. code-block:: bash

   pip install torch==2.0.1+cu118


Access works through ``http://localhost:9997``, but not through the local machine's IP and port 9997
=========================================================================================================

Add ``-H 0.0.0.0`` when starting, as in:

.. code:: bash

   xinference -H 0.0.0.0

If you are using Docker (the official image is recommended), add ``-p 9998:9997``
when starting the container; the service is then reachable through the local
machine's IP on port 9998.
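
For example, a minimal sketch of such a start-up command; the image tag, the
``--gpus all`` flag, and the ``xinference-local`` entry command are assumptions
that may need adjusting for your setup:

.. code:: bash

   # Map host port 9998 to the container's 9997 and listen on all interfaces.
   # Image tag, GPU flag and entry command are assumptions; adjust as needed.
   docker run --gpus all -p 9998:9997 \
       xprobe/xinference:latest \
       xinference-local -H 0.0.0.0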

Can multiple models be loaded together?
=======================================

A single GPU can host only one LLM at a time, but an embedding model and
a rerank model can be loaded alongside it. With multiple GPUs, you can
load multiple LLMs.
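
For illustration, a hedged sketch of loading an LLM together with an embedding
model through the ``xinference launch`` CLI; the model names, size, and format
below are placeholders for whatever models you actually use:

.. code:: bash

   # Launch one LLM (it occupies the GPU)
   xinference launch --model-name qwen-chat --size-in-billions 7 --model-format pytorch

   # Launch an embedding model alongside it
   xinference launch --model-name bge-base-en --model-type embedding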

Issues with loading or slow downloads of the built-in models in xinference
==========================================================================

Xinference uses Hugging Face as the default source for models. If your
machines are in Mainland China, there might be accessibility issues when
downloading the built-in models.

To address this, add ``XINFERENCE_MODEL_SRC=modelscope`` when starting
the service to switch the model source to ModelScope, which is optimized
for Mainland China.

If you start xinference with Docker, pass
``-e XINFERENCE_MODEL_SRC=modelscope`` to the ``docker run`` command. For
more environment variable configurations, please refer to the official
`Environment Variables <https://inference.readthedocs.io/zh-cn/latest/getting_started/environments.html>`__
documentation.
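
As a concrete illustration (the Docker image tag and entry command are
assumptions):

.. code:: bash

   # Local start with ModelScope as the model source
   XINFERENCE_MODEL_SRC=modelscope xinference -H 0.0.0.0

   # The same setting passed into a container via -e
   docker run -e XINFERENCE_MODEL_SRC=modelscope -p 9998:9997 \
       xprobe/xinference:latest xinference-local -H 0.0.0.0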

How to upgrade xinference
=========================

.. code:: bash

   pip install --upgrade xinference

Installation of xinference dependencies is slow
===============================================

We recommend using the official Docker image for installation. A
``nightly-main`` image built from the main branch is updated daily; for
stable versions, see the releases on GitHub.

.. code:: bash

   docker pull xprobe/xinference
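
To get the daily build mentioned above, a sketch assuming it is published
under the ``nightly-main`` tag:

.. code:: bash

   # Daily build from the main branch (tag name is an assumption)
   docker pull xprobe/xinference:nightly-main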

Does xinference support configuring LoRA?
=========================================

This is currently not supported; it requires manually merging the LoRA
weights into the base model.

Can’t find a custom registration entry point for rerank models in xinference
============================================================================

Upgrade xinference to the latest version; versions ``0.7.3`` and below
do not support it.

Does xinference support running on Huawei Ascend 310 or 910 hardware?
=====================================================================

Yes, it does.

Does xinference support an API that is compatible with OpenAI?
==============================================================

Yes, xinference not only supports an OpenAI-compatible API but also
provides its own client API. For more details, please refer to the
official `Client API <https://inference.readthedocs.io/zh-cn/latest/user_guide/client_api.html>`__
documentation.
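
For instance, once a model is launched, the OpenAI-style chat endpoint can be
called directly; ``my-model-uid`` below is a placeholder for the UID of the
model you launched:

.. code:: bash

   # Call the OpenAI-compatible chat completions endpoint
   curl http://localhost:9997/v1/chat/completions \
       -H "Content-Type: application/json" \
       -d '{
             "model": "my-model-uid",
             "messages": [{"role": "user", "content": "Hello!"}]
           }'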

When using xinference to load models, multi-GPU support is not functioning, and it only loads onto one card.
============================================================================================================

- If you are using Docker for vLLM multi-GPU inference, you need to
  specify ``--shm-size`` (see the sketch after this list).

- If the vLLM backend is in use, disable vLLM before running the
  inference.
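
A sketch of the Docker case above; the shared-memory size, image tag, GPU
flags, and entry command are assumptions to adapt to your machine:

.. code:: bash

   # vLLM's multi-GPU workers need a larger /dev/shm than Docker's default
   docker run --gpus all --shm-size=16g -p 9998:9997 \
       xprobe/xinference:latest \
       xinference-local -H 0.0.0.0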

Does Xinference support setting up a chat model for embeddings?
===============================================================

It used to, but because the embedding quality of LLMs was poor, the
feature has been removed to prevent misuse.