Create a directory to store the models:
mkdir -p ~/models
Upload internlm-chat-7b-turbomind.tgz, obtained in S1 (Quantize on server by W4A16), to the models directory.
Extract the model archive:
tar zxvf internlm-chat-7b-turbomind.tgz -C .
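If you would rather do the extraction from Python (for example inside a setup script), the following is a minimal sketch of an equivalent to the tar command above. The helper name `extract_model` is hypothetical and not part of LMDeploy; it only uses the standard library.

```python
import tarfile
from pathlib import Path, PurePosixPath


def extract_model(archive_path, dest_dir="."):
    """Extract a gzip-compressed tar archive and return its top-level entries."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(dest)
    # Collect the first path component of every member so the unpacked
    # model directory is easy to spot.
    top_level = set()
    for name in names:
        parts = PurePosixPath(name).parts
        if parts:
            top_level.add(parts[0])
    return sorted(top_level)
```

Calling `extract_model("internlm-chat-7b-turbomind.tgz", ".")` from the models directory should leave the same files on disk as the tar command. Only extract archives you produced yourself, since `extractall` trusts the paths stored in the archive.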
The PyTorch version on Jetson does not support distributed reduce operations, which may cause errors in the distributed parts of the MMEngine module.
The error looks like this:
AttributeError: module 'torch.distributed' has no attribute 'ReduceOp'
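To check in advance whether your environment is affected, a small diagnostic such as the one below may help. It is only a sketch: `module_has_attr` is a hypothetical helper, and it simply tests whether a module exposes a given attribute.

```python
import importlib


def module_has_attr(module_name, attr_name):
    """Return True/False for an importable module, or None if it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return hasattr(module, attr_name)


if __name__ == "__main__":
    # False here means the workaround described in this section is likely needed.
    print("torch.distributed.ReduceOp available:",
          module_has_attr("torch.distributed", "ReduceOp"))
```

A result of `None` means PyTorch itself could not be imported in the active environment, which is a different problem from the missing attribute.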
Activate conda environment:
conda activate lmdeploy
Run Python in interpreter mode:
Enter the following content:
import mmengine
print(mmengine.__file__)
This shows where the MMEngine module is installed. In the author's environment the package directory is /home/nvidia/miniconda3/envs/lmdeploy/lib/python3.8/site-packages/mmengine/. The rest of this section refers to this directory as <path/to/mmengine>.
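The same lookup can be done without an interactive session and without importing the package. The sketch below uses `importlib.util.find_spec` from the standard library; the function name `package_dir` is an assumption made for illustration.

```python
import importlib.util
from pathlib import Path


def package_dir(package_name):
    """Return the directory a package is installed in, or None if it is not found."""
    spec = importlib.util.find_spec(package_name)
    if spec is None or spec.origin is None:
        return None
    # For a regular package, spec.origin points at its __init__.py.
    return Path(spec.origin).parent


if __name__ == "__main__":
    print(package_dir("mmengine"))
```

Run it inside the lmdeploy conda environment so that it resolves the same installation the deployment uses.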
Modify line 208 of <path/to/mmengine>/logging/
- global_rank = _get_rank()
+ global_rank = 0
After this change, the error no longer occurs at runtime.
**Attention**: This workaround is crude and is only suitable for inference deployment on the Jetson platform. It will affect distributed functionality on the server side!
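A gentler variant of the same idea is to fall back to rank 0 only when the rank lookup actually fails, instead of hard-coding it. The sketch below illustrates the pattern in isolation. `rank_or_zero` is a hypothetical helper, not part of MMEngine, and it assumes the error is raised when the rank lookup is called; it has not been verified on Jetson.

```python
def rank_or_zero(get_rank):
    """Call get_rank(); fall back to rank 0 if distributed support is missing."""
    try:
        return get_rank()
    except (AttributeError, ImportError):
        # Single-device inference: treat the process as rank 0.
        return 0
```

With this pattern, an environment that has working distributed support still reports its real rank, so the change would be less likely to disturb server-side behaviour.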
Activate conda environment:
conda activate lmdeploy
Run the model:
lmdeploy chat turbomind ./internlm-chat-7b-turbomind
Write a running script
with the following content:
from lmdeploy import turbomind as tm

if __name__ == "__main__":
    model_path = "./internlm-chat-7b-turbomind"  # change this to your own path

    # Load the converted TurboMind model and create an inference instance.
    tm_model = tm.TurboMind.from_pretrained(model_path)
    generator = tm_model.create_instance()

    while True:
        inp = input("[User] >>> ")
        if inp == "exit":
            break  # leave the chat loop
        # Wrap the user input in the model's chat template and tokenize it.
        prompt = tm_model.model.get_prompt(inp)
        input_ids = tm_model.tokenizer.encode(prompt)
        # Stream the generation, keeping the most recent output.
        for outputs in generator.stream_infer(session_id=0, input_ids=[input_ids]):
            res = outputs[1]
        response = tm_model.tokenizer.decode(res)
        print("[Bot] <<< {}".format(response))
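The script loops over a streaming generator and keeps only the most recent item before decoding. The pattern can be sketched on its own, without LMDeploy, as follows. Both `last_output` and `fake_stream` are hypothetical names; `fake_stream` is a stand-in that yields growing prefixes and makes no claim about the actual output structure of `stream_infer`.

```python
def last_output(stream):
    """Consume a streaming generator and return its final item (None if empty)."""
    final = None
    for item in stream:
        final = item
    return final


def fake_stream(token_ids):
    """Hypothetical stand-in for a streaming backend: yields growing prefixes."""
    for end in range(1, len(token_ids) + 1):
        yield token_ids[:end]


if __name__ == "__main__":
    print(last_output(fake_stream([5, 6, 7])))
```

Initialising the result to `None` before the loop also covers the case where the stream yields nothing, which would otherwise leave the variable undefined.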
Activate conda environment:
conda activate lmdeploy
Run the script: