
0.4.3 error CUDA error: an illegal memory access was encountered #5376

Closed
maxin9966 opened this issue Jun 10, 2024 · 33 comments
Labels
bug Something isn't working

Comments

@maxin9966

Your current environment

vllm 0.4.3
CUDA Driver Version: 555.42.02
4060Ti Super * 2

VLLM_ATTENTION_BACKEND=FLASH_ATTN CUDA_VISIBLE_DEVICES=0 \
python -m vllm.entrypoints.openai.api_server \
  --gpu-memory-utilization 0.85 \
  --quantization gptq --host 0.0.0.0 --port 1234 -tp 1 \
  --max-model-len 32768 --served-model-name qwen2 \
  --trust-remote-code \
  --enable-prefix-caching

🐛 Describe the bug

**Description:**
I have tried running with both a single GPU and dual GPUs, but after running for a period of time the server starts to report errors; the issue occurs 100% of the time. The command used is as follows:

VLLM_ATTENTION_BACKEND=FLASH_ATTN CUDA_VISIBLE_DEVICES=0 \
python -m vllm.entrypoints.openai.api_server \
  --gpu-memory-utilization 0.85 \
  --quantization gptq --host 0.0.0.0 --port 1234 -tp 1 \
  --max-model-len 32768 --served-model-name qwen2 \
  --trust-remote-code \
  --enable-prefix-caching

**Error:**
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: RuntimeError: CUDA error: an illegal memory access was encountered
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.

**Logs:**
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: INFO: 192.168.1.161:38834 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: ERROR: Exception in ASGI application
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: Traceback (most recent call last):
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/uvicorn/protocols/http/httptools_impl.py", line 419, in run_asgi
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: result = await app( # type: ignore[func-returns-value]
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/uvicorn/middleware/proxy_headers.py", line 84, in call
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: return await self.app(scope, receive, send)
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/fastapi/applications.py", line 1054, in call
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: await super().call(scope, receive, send)
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/starlette/applications.py", line 123, in call
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: await self.middleware_stack(scope, receive, send)
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/starlette/middleware/errors.py", line 186, in call
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: raise exc
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/starlette/middleware/errors.py", line 164, in call
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: await self.app(scope, receive, _send)
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/starlette/middleware/cors.py", line 83, in call
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: await self.app(scope, receive, send)
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/starlette/middleware/exceptions.py", line 62, in call
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: raise exc
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: await app(scope, receive, sender)
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/starlette/routing.py", line 762, in call
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: await self.middleware_stack(scope, receive, send)
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/starlette/routing.py", line 782, in app
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: await route.handle(scope, receive, send)
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/starlette/routing.py", line 297, in handle
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: await self.app(scope, receive, send)
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/starlette/routing.py", line 77, in app
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: await wrap_app_handling_exceptions(app, request)(scope, receive, send)
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: raise exc
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: await app(scope, receive, sender)
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/starlette/routing.py", line 72, in app
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: response = await func(request)
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/fastapi/routing.py", line 299, in app
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: raise e
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/fastapi/routing.py", line 294, in app
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: raw_response = await run_endpoint_function(
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: return await dependant.call(**values)
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/vllm/entrypoints/openai/api_server.py", line 103, in create_chat_completion
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: generator = await openai_serving_chat.create_chat_completion(
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/vllm/entrypoints/openai/serving_chat.py", line 198, in create_chat_completion
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: return await self.chat_completion_full_generator(
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/vllm/entrypoints/openai/serving_chat.py", line 360, in chat_completion_full_generator
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: async for res in result_generator:
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 662, in generate
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: async for output in self._process_request(
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 769, in _process_request
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: raise e
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 765, in _process_request
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: async for request_output in stream:
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 80, in anext
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: raise result
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/uvicorn/protocols/http/httptools_impl.py", line 419, in run_asgi
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: result = await app( # type: ignore[func-returns-value]
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/uvicorn/middleware/proxy_headers.py", line 84, in call
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: return await self.app(scope, receive, send)
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/fastapi/applications.py", line 1054, in call
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: await super().call(scope, receive, send)
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/starlette/applications.py", line 123, in call
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: await self.middleware_stack(scope, receive, send)
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/starlette/middleware/errors.py", line 186, in call
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: raise exc
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/starlette/middleware/errors.py", line 164, in call
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: await self.app(scope, receive, _send)
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/starlette/middleware/cors.py", line 83, in call
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: await self.app(scope, receive, send)
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/starlette/middleware/exceptions.py", line 62, in call
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: raise exc
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: await app(scope, receive, sender)
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/starlette/routing.py", line 762, in call
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: await self.middleware_stack(scope, receive, send)
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/starlette/routing.py", line 782, in app
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: await route.handle(scope, receive, send)
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/starlette/routing.py", line 297, in handle
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: await self.app(scope, receive, send)
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/starlette/routing.py", line 77, in app
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: await wrap_app_handling_exceptions(app, request)(scope, receive, send)
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: raise exc
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: await app(scope, receive, sender)
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/starlette/routing.py", line 72, in app
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: response = await func(request)
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/fastapi/routing.py", line 299, in app
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: raise e
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/fastapi/routing.py", line 294, in app
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: raw_response = await run_endpoint_function(
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: return await dependant.call(**values)
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/vllm/entrypoints/openai/api_server.py", line 103, in create_chat_completion
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: generator = await openai_serving_chat.create_chat_completion(
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/vllm/entrypoints/openai/serving_chat.py", line 198, in create_chat_completion
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: return await self.chat_completion_full_generator(
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/vllm/entrypoints/openai/serving_chat.py", line 360, in chat_completion_full_generator
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: async for res in result_generator:
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 662, in generate
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: async for output in self._process_request(
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 769, in _process_request
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: raise e
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 765, in _process_request
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: async for request_output in stream:
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 80, in anext
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: raise result
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/uvicorn/protocols/http/httptools_impl.py", line 419, in run_asgi
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: result = await app( # type: ignore[func-returns-value]
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/uvicorn/middleware/proxy_headers.py", line 84, in call
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: return await self.app(scope, receive, send)
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/fastapi/applications.py", line 1054, in call
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: await super().call(scope, receive, send)
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/starlette/applications.py", line 123, in call
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: await self.middleware_stack(scope, receive, send)
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/starlette/middleware/errors.py", line 186, in call
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: raise exc
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/starlette/middleware/errors.py", line 164, in call
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: await self.app(scope, receive, _send)
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/starlette/middleware/cors.py", line 83, in call
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: await self.app(scope, receive, send)
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/starlette/middleware/exceptions.py", line 62, in call
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: raise exc
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: await app(scope, receive, sender)
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/starlette/routing.py", line 762, in call
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: await self.middleware_stack(scope, receive, send)
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/starlette/routing.py", line 782, in app
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: await route.handle(scope, receive, send)
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/starlette/routing.py", line 297, in handle
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: await self.app(scope, receive, send)
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/starlette/routing.py", line 77, in app
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: await wrap_app_handling_exceptions(app, request)(scope, receive, send)
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: raise exc
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: await app(scope, receive, sender)
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/starlette/routing.py", line 72, in app
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: response = await func(request)
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/fastapi/routing.py", line 299, in app
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: raise e
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/fastapi/routing.py", line 294, in app
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: raw_response = await run_endpoint_function(
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: return await dependant.call(**values)
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/vllm/entrypoints/openai/api_server.py", line 103, in create_chat_completion
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: generator = await openai_serving_chat.create_chat_completion(
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/vllm/entrypoints/openai/serving_chat.py", line 198, in create_chat_completion
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: return await self.chat_completion_full_generator(
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/vllm/entrypoints/openai/serving_chat.py", line 360, in chat_completion_full_generator
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: async for res in result_generator:
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 662, in generate
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: async for output in self._process_request(
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 769, in _process_request
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: raise e
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 765, in _process_request
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: async for request_output in stream:
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 80, in anext
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: raise result
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 40, in _raise_exception_on_finish
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: task.result()
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 521, in run_engine_loop
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: has_requests_in_progress = await asyncio.wait_for(
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/asyncio/tasks.py", line 479, in wait_for
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: return fut.result()
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 495, in engine_step
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: request_outputs = await self.engine.step_async()
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 226, in step_async
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: output = await self.model_executor.execute_model_async(
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/vllm/executor/gpu_executor.py", line 117, in execute_model_async
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: output = await make_async(self.driver_worker.execute_model
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/concurrent/futures/thread.py", line 58, in run
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: result = self.fn(*self.args, **self.kwargs)
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: return func(*args, **kwargs)
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/vllm/worker/worker.py", line 272, in execute_model
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: output = self.model_runner.execute_model(seq_group_metadata_list,
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: return func(*args, **kwargs)
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/vllm/worker/model_runner.py", line 738, in execute_model
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: output = self.model.sample(
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/vllm/model_executor/models/chatglm.py", line 379, in sample
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: next_tokens = self.sampler(logits, sampling_metadata)
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: return self._call_impl(*args, **kwargs)
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: return forward_call(*args, **kwargs)
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/vllm/model_executor/layers/sampler.py", line 96, in forward
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: sample_results, maybe_sampled_tokens_tensor = _sample(
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/vllm/model_executor/layers/sampler.py", line 655, in _sample
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: return _sample_with_torch(
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/vllm/model_executor/layers/sampler.py", line 544, in _sample_with_torch
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: sample_results = _random_sample(seq_groups,
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/vllm/model_executor/layers/sampler.py", line 324, in _random_sample
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: random_samples = random_samples.cpu()
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: RuntimeError: CUDA error: an illegal memory access was encountered
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: The above exception was the direct cause of the following exception:
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: Traceback (most recent call last):
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/uvicorn/protocols/http/httptools_impl.py", line 419, in run_asgi
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: result = await app( # type: ignore[func-returns-value]
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/uvicorn/middleware/proxy_headers.py", line 84, in call
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: return await self.app(scope, receive, send)
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/fastapi/applications.py", line 1054, in call
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: await super().call(scope, receive, send)
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/starlette/applications.py", line 123, in call
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: await self.middleware_stack(scope, receive, send)
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/starlette/middleware/errors.py", line 186, in call
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: raise exc
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/starlette/middleware/errors.py", line 164, in call
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: await self.app(scope, receive, _send)
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/starlette/middleware/cors.py", line 83, in call
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: await self.app(scope, receive, send)
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/starlette/middleware/exceptions.py", line 62, in call
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: raise exc
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: await app(scope, receive, sender)
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/starlette/routing.py", line 762, in call
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: await self.middleware_stack(scope, receive, send)
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/starlette/routing.py", line 782, in app
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: await route.handle(scope, receive, send)
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/starlette/routing.py", line 297, in handle
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: await self.app(scope, receive, send)
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/starlette/routing.py", line 77, in app
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: await wrap_app_handling_exceptions(app, request)(scope, receive, send)
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: raise exc
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: await app(scope, receive, sender)
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/starlette/routing.py", line 72, in app
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: response = await func(request)
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/fastapi/routing.py", line 299, in app
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: raise e
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/fastapi/routing.py", line 294, in app
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: raw_response = await run_endpoint_function(
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: return await dependant.call(**values)
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/vllm/entrypoints/openai/api_server.py", line 103, in create_chat_completion
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: generator = await openai_serving_chat.create_chat_completion(
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/vllm/entrypoints/openai/serving_chat.py", line 198, in create_chat_completion
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: return await self.chat_completion_full_generator(
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/vllm/entrypoints/openai/serving_chat.py", line 360, in chat_completion_full_generator
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: async for res in result_generator:
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 662, in generate
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: async for output in self._process_request(
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 756, in _process_request
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: stream = await self.add_request(
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 561, in add_request
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: self.start_background_loop()
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: File "/home/ma/miniconda3/envs/myenv/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 431, in start_background_loop
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: raise AsyncEngineDeadError(
6月 10 16:45:05 ma-MS-TZZ-Z690M bash[107468]: vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop has errored already.
6月 10 16:45:06 ma-MS-TZZ-Z690M bash[107468]: INFO 06-10 16:45:06 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 3 reqs, Swapped: 0 reqs, Pending: 0 reqs>
6月 10 16:45:09 ma-MS-TZZ-Z690M systemd[1]: Stopping Uvicorn server for my app...

maxin9966 added the bug (Something isn't working) label on Jun 10, 2024
@JulyFinal

JulyFinal commented Jun 12, 2024

I have a similar issue.

vllm 0.4.3
CUDA Driver Version: 535.129.03
RTX 4090

cmd: export CUDA_VISIBLE_DEVICES=0 && nohup python -m vllm.entrypoints.openai.api_server --model models/Qwen2-7B-Instruct-GPTQ-Int4 --host 192.168.168.242 --port 8001 --served-model-name "gpt-3.5-turbo" --tensor-parallel-size 1 --api-key "sk-FqMzXBwjitG7X3xMtVFuKEYDJ4dwQ9iD" --max-model-len 19000 --enable-prefix-caching > llm.log 2>&1 &

ERROR 06-12 08:59:48 async_llm_engine.py:45] Engine background task failed
ERROR 06-12 08:59:48 async_llm_engine.py:45] Traceback (most recent call last):
ERROR 06-12 08:59:48 async_llm_engine.py:45]   File "/home/tlserver/workspace/imitater/.venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 40, in _raise_exception_on_finish
ERROR 06-12 08:59:48 async_llm_engine.py:45]     task.result()
ERROR 06-12 08:59:48 async_llm_engine.py:45]   File "/home/tlserver/workspace/imitater/.venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 521, in run_engine_loop
ERROR 06-12 08:59:48 async_llm_engine.py:45]     has_requests_in_progress = await asyncio.wait_for(
ERROR 06-12 08:59:48 async_llm_engine.py:45]                                ^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-12 08:59:48 async_llm_engine.py:45]   File "/home/tlserver/.pyenv/versions/3.11.8/lib/python3.11/asyncio/tasks.py", line 489, in wait_for
ERROR 06-12 08:59:48 async_llm_engine.py:45]     return fut.result()
ERROR 06-12 08:59:48 async_llm_engine.py:45]            ^^^^^^^^^^^^
ERROR 06-12 08:59:48 async_llm_engine.py:45]   File "/home/tlserver/workspace/imitater/.venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 495, in engine_step
ERROR 06-12 08:59:48 async_llm_engine.py:45]     request_outputs = await self.engine.step_async()
ERROR 06-12 08:59:48 async_llm_engine.py:45]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-12 08:59:48 async_llm_engine.py:45]   File "/home/tlserver/workspace/imitater/.venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 226, in step_async
ERROR 06-12 08:59:48 async_llm_engine.py:45]     output = await self.model_executor.execute_model_async(
ERROR 06-12 08:59:48 async_llm_engine.py:45]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-12 08:59:48 async_llm_engine.py:45]   File "/home/tlserver/workspace/imitater/.venv/lib/python3.11/site-packages/vllm/executor/gpu_executor.py", line 117, in execute_model_async
ERROR 06-12 08:59:48 async_llm_engine.py:45]     output = await make_async(self.driver_worker.execute_model
ERROR 06-12 08:59:48 async_llm_engine.py:45]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-12 08:59:48 async_llm_engine.py:45]   File "/home/tlserver/.pyenv/versions/3.11.8/lib/python3.11/concurrent/futures/thread.py", line 58, in run
ERROR 06-12 08:59:48 async_llm_engine.py:45]     result = self.fn(*self.args, **self.kwargs)
ERROR 06-12 08:59:48 async_llm_engine.py:45]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-12 08:59:48 async_llm_engine.py:45]   File "/home/tlserver/workspace/imitater/.venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR 06-12 08:59:48 async_llm_engine.py:45]     return func(*args, **kwargs)
ERROR 06-12 08:59:48 async_llm_engine.py:45]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 06-12 08:59:48 async_llm_engine.py:45]   File "/home/tlserver/workspace/imitater/.venv/lib/python3.11/site-packages/vllm/worker/worker.py", line 272, in execute_model
ERROR 06-12 08:59:48 async_llm_engine.py:45]     output = self.model_runner.execute_model(seq_group_metadata_list,
ERROR 06-12 08:59:48 async_llm_engine.py:45]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-12 08:59:48 async_llm_engine.py:45]   File "/home/tlserver/workspace/imitater/.venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR 06-12 08:59:48 async_llm_engine.py:45]     return func(*args, **kwargs)
ERROR 06-12 08:59:48 async_llm_engine.py:45]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 06-12 08:59:48 async_llm_engine.py:45]   File "/home/tlserver/workspace/imitater/.venv/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 738, in execute_model
ERROR 06-12 08:59:48 async_llm_engine.py:45]     output = self.model.sample(
ERROR 06-12 08:59:48 async_llm_engine.py:45]              ^^^^^^^^^^^^^^^^^^
ERROR 06-12 08:59:48 async_llm_engine.py:45]   File "/home/tlserver/workspace/imitater/.venv/lib/python3.11/site-packages/vllm/model_executor/models/qwen2.py", line 345, in sample
ERROR 06-12 08:59:48 async_llm_engine.py:45]     next_tokens = self.sampler(logits, sampling_metadata)
ERROR 06-12 08:59:48 async_llm_engine.py:45]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-12 08:59:48 async_llm_engine.py:45]   File "/home/tlserver/workspace/imitater/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 06-12 08:59:48 async_llm_engine.py:45]     return self._call_impl(*args, **kwargs)
ERROR 06-12 08:59:48 async_llm_engine.py:45]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-12 08:59:48 async_llm_engine.py:45]   File "/home/tlserver/workspace/imitater/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 06-12 08:59:48 async_llm_engine.py:45]     return forward_call(*args, **kwargs)
ERROR 06-12 08:59:48 async_llm_engine.py:45]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-12 08:59:48 async_llm_engine.py:45]   File "/home/tlserver/workspace/imitater/.venv/lib/python3.11/site-packages/vllm/model_executor/layers/sampler.py", line 96, in forward
ERROR 06-12 08:59:48 async_llm_engine.py:45]     sample_results, maybe_sampled_tokens_tensor = _sample(
ERROR 06-12 08:59:48 async_llm_engine.py:45]                                                   ^^^^^^^^
ERROR 06-12 08:59:48 async_llm_engine.py:45]   File "/home/tlserver/workspace/imitater/.venv/lib/python3.11/site-packages/vllm/model_executor/layers/sampler.py", line 655, in _sample
ERROR 06-12 08:59:48 async_llm_engine.py:45]     return _sample_with_torch(
ERROR 06-12 08:59:48 async_llm_engine.py:45]            ^^^^^^^^^^^^^^^^^^^
ERROR 06-12 08:59:48 async_llm_engine.py:45]   File "/home/tlserver/workspace/imitater/.venv/lib/python3.11/site-packages/vllm/model_executor/layers/sampler.py", line 544, in _sample_with_torch
ERROR 06-12 08:59:48 async_llm_engine.py:45]     sample_results = _random_sample(seq_groups,
ERROR 06-12 08:59:48 async_llm_engine.py:45]                      ^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-12 08:59:48 async_llm_engine.py:45]   File "/home/tlserver/workspace/imitater/.venv/lib/python3.11/site-packages/vllm/model_executor/layers/sampler.py", line 324, in _random_sample
ERROR 06-12 08:59:48 async_llm_engine.py:45]     random_samples = random_samples.cpu()
ERROR 06-12 08:59:48 async_llm_engine.py:45]                      ^^^^^^^^^^^^^^^^^^^^
ERROR 06-12 08:59:48 async_llm_engine.py:45] RuntimeError: CUDA error: an illegal memory access was encountered
ERROR 06-12 08:59:48 async_llm_engine.py:45] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
ERROR 06-12 08:59:48 async_llm_engine.py:45] For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
ERROR 06-12 08:59:48 async_llm_engine.py:45] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
ERROR 06-12 08:59:48 async_llm_engine.py:45] 
Exception in callback functools.partial(<function _raise_exception_on_finish at 0x7fac33874c20>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7fac30499f50>>)
handle: <Handle functools.partial(<function _raise_exception_on_finish at 0x7fac33874c20>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7fac30499f50>>)>
Traceback (most recent call last):
  File "/home/tlserver/workspace/imitater/.venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 40, in _raise_exception_on_finish
    task.result()
  File "/home/tlserver/workspace/imitater/.venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 521, in run_engine_loop
    has_requests_in_progress = await asyncio.wait_for(
                               ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tlserver/.pyenv/versions/3.11.8/lib/python3.11/asyncio/tasks.py", line 489, in wait_for
    return fut.result()
           ^^^^^^^^^^^^
  File "/home/tlserver/workspace/imitater/.venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 495, in engine_step
    request_outputs = await self.engine.step_async()
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tlserver/workspace/imitater/.venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 226, in step_async
    output = await self.model_executor.execute_model_async(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tlserver/workspace/imitater/.venv/lib/python3.11/site-packages/vllm/executor/gpu_executor.py", line 117, in execute_model_async
    output = await make_async(self.driver_worker.execute_model
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tlserver/.pyenv/versions/3.11.8/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tlserver/workspace/imitater/.venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/tlserver/workspace/imitater/.venv/lib/python3.11/site-packages/vllm/worker/worker.py", line 272, in execute_model
    output = self.model_runner.execute_model(seq_group_metadata_list,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tlserver/workspace/imitater/.venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/tlserver/workspace/imitater/.venv/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 738, in execute_model
    output = self.model.sample(
             ^^^^^^^^^^^^^^^^^^
  File "/home/tlserver/workspace/imitater/.venv/lib/python3.11/site-packages/vllm/model_executor/models/qwen2.py", line 345, in sample
    next_tokens = self.sampler(logits, sampling_metadata)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tlserver/workspace/imitater/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tlserver/workspace/imitater/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tlserver/workspace/imitater/.venv/lib/python3.11/site-packages/vllm/model_executor/layers/sampler.py", line 96, in forward
    sample_results, maybe_sampled_tokens_tensor = _sample(
                                                  ^^^^^^^^
  File "/home/tlserver/workspace/imitater/.venv/lib/python3.11/site-packages/vllm/model_executor/layers/sampler.py", line 655, in _sample
    return _sample_with_torch(
           ^^^^^^^^^^^^^^^^^^^
  File "/home/tlserver/workspace/imitater/.venv/lib/python3.11/site-packages/vllm/model_executor/layers/sampler.py", line 544, in _sample_with_torch
    sample_results = _random_sample(seq_groups,
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tlserver/workspace/imitater/.venv/lib/python3.11/site-packages/vllm/model_executor/layers/sampler.py", line 324, in _random_sample
    random_samples = random_samples.cpu()
                     ^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  File "/home/tlserver/workspace/imitater/.venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 47, in _raise_exception_on_finish
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.

@maxin9966
Author

@JulyFinal Several members in our community have encountered similar issues. Version 0.4.2 appears to be functioning normally; we're unsure whether this is due to a problem with flash-attn2 or an issue specific to version 0.4.3.

@JulyFinal

@maxin9966 I also reproduced this problem in the latest version 0.5.0.

I have no choice but to revert to version 0.4.2.

@gaye746560359

I have this problem too.

@youkaichao
Member

Hi, can you follow the tips in the error message (for example, running with CUDA_LAUNCH_BLOCKING=1) so that we can get more information on which operation caused the problem?

In addition, does it work with --enforce-eager?
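
For reference, here is a minimal sketch of how those two suggestions can be combined in the offline API (the model name is only a placeholder and the flags are assumptions mirroring the server command above, not the reporter's actual setup):

```python
import os

# Report CUDA errors at the failing kernel launch instead of a later API call,
# as the error message above suggests. Must be set before CUDA is initialized.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

from vllm import LLM, SamplingParams

# enforce_eager=True is the offline equivalent of passing --enforce-eager
# to the API server (it disables CUDA graph capture).
llm = LLM(model="Qwen/Qwen2-7B-Instruct-GPTQ-Int4",  # placeholder model
          quantization="gptq",
          max_model_len=8192,
          enable_prefix_caching=True,
          enforce_eager=True)

out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```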

@gaye746560359

@maxin9966 I also reproduced this problem in the latest version 0.5.0.

I have no choice but to revert to version 0.4.2.

With version 0.4.2 I get the error 172.17.0.1:45344 - "OPTIONS /v1/chat/completions HTTP/1.1" 401 Unauthorized. What could be the cause?
docker run --gpus all -d -p 8888:8000 -v D:\docker\huggingface:/root/.cache/huggingface \
  --env "HUGGING_FACE_HUB_TOKEN=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" --ipc=host --name Qwen2-7B-Instruct-32k \
  vllm/vllm-openai:v0.4.2 --model Qwen/Qwen2-7B-Instruct \
  --served-model-name gpt-3.5-turbo --api-key sk-123456 \
  --swap-space 0

@maxin9966
Author

@youkaichao I did not use --enforce-eager. I tried it before and found that it significantly reduced throughput, so I haven't used it since. For a long time, the same parameters worked fine on versions prior to 0.4.3.

@gaye746560359

@maxin9966 I also reproduced this problem in the latest version 0.5.0.

I have no choice but to revert to version 0.4.2.

I have the same problem on 0.4.2 as well.
(screenshot attached)


@maxin9966
Author

@gaye746560359 Does the error only appear once the text length exceeds a certain limit? Stress testing is fine on my side, but the error shows up after the server has been running for a while. Could it be triggered by a long-context request?

@JulyFinal

JulyFinal commented Jun 12, 2024 via email

@gaye746560359

@gaye746560359 Does the error only appear once the text length exceeds a certain limit? Stress testing is fine on my side, but the error shows up after the server has been running for a while. Could it be triggered by a long-context request?

I don't know the cause. The strange part is that the problem hasn't broken out on a large scale, which makes it feel like a mystery; the maintainers haven't explained it either.

@JulyFinal

@gaye746560359 Does the error only appear once the text length exceeds a certain limit? Stress testing is fine on my side, but the error shows up after the server has been running for a while. Could it be triggered by a long-context request?

I don't know the cause. The strange part is that the problem hasn't broken out on a large scale, which makes it feel like a mystery; the maintainers haven't explained it either.

Are you also on an RTX 40-series card? If so, the problem may be concentrated on the 40 series. A-series cards don't seem to have this issue.

@maxin9966
Author

@JulyFinal So far every report has been on the 40 series. I'm using flash-attn2 here and am certain there is a problem; I'm just not sure where it lies.
I've also found that on 20-series cards (under 0.4.2) the same model supports much less context than on the 40 series (under 0.4.3 and later), yet the 20-series setup has never had a problem.

@maxin9966
Author

For now I'm giving up on the 40 series; I've migrated my entire workflow back to the old environment.

@Penglikai

Hi, can you follow the tips so that we can get more information on which operation caused the problem?

In addition, does it work with --enforce-eager ?

Hi Youkai, could you please take a look at the error message below to see if it helps?


a800:3651115:3772442 [0] misc/strongstream.cc:395 NCCL WARN Cuda failure 'an illegal memory access was encountered'
a800:3651115:3772442 [0] NCCL INFO init.cc:1899 -> 1

a800:3651115:3772442 [0] init.cc:2030 NCCL WARN commReclaim: comm 0x563f2b7e1450 (rank = 0) in abort, error 1
a800:3651115:3651247 [0] NCCL INFO [Service thread] Connection closed by localRank 0

a800:3651115:3651247 [0] include/alloc.h:248 NCCL WARN Cuda failure 'an illegal memory access was encountered'
a800:3651115:3651247 [0] NCCL INFO transport/net.cc:537 -> 1
a800:3651115:3651247 [0] NCCL INFO transport/net.cc:940 -> 1
a800:3651115:3651247 [0] NCCL INFO proxy.cc:980 -> 1
a800:3651115:3651247 [0] NCCL INFO proxy.cc:996 -> 1

a800:3651115:3772442 [0] include/alloc.h:248 NCCL WARN Cuda failure 'an illegal memory access was encountered'

a800:3651115:3772442 [0] include/alloc.h:248 NCCL WARN Cuda failure 'an illegal memory access was encountered'

a800:3651115:3772442 [0] include/alloc.h:248 NCCL WARN Cuda failure 'an illegal memory access was encountered'

a800:3651115:3772442 [0] include/alloc.h:248 NCCL WARN Cuda failure 'an illegal memory access was encountered'

a800:3651115:3772442 [0] include/alloc.h:248 NCCL WARN Cuda failure 'an illegal memory access was encountered'

a800:3651115:3772442 [0] include/alloc.h:248 NCCL WARN Cuda failure 'an illegal memory access was encountered'

a800:3651115:3772442 [0] include/alloc.h:248 NCCL WARN Cuda failure 'an illegal memory access was encountered'

a800:3651115:3772442 [0] include/alloc.h:248 NCCL WARN Cuda failure 'an illegal memory access was encountered'

a800:3651115:3772442 [0] include/alloc.h:248 NCCL WARN Cuda failure 'an illegal memory access was encountered'

a800:3651115:3772442 [0] include/alloc.h:248 NCCL WARN Cuda failure 'an illegal memory access was encountered'

a800:3651115:3772442 [0] include/alloc.h:248 NCCL WARN Cuda failure 'an illegal memory access was encountered'

a800:3651115:3772442 [0] include/alloc.h:248 NCCL WARN Cuda failure 'an illegal memory access was encountered'

a800:3651115:3772442 [0] include/alloc.h:248 NCCL WARN Cuda failure 'an illegal memory access was encountered'

a800:3651115:3772442 [0] include/alloc.h:248 NCCL WARN Cuda failure 'an illegal memory access was encountered'

a800:3651115:3772442 [0] include/alloc.h:248 NCCL WARN Cuda failure 'an illegal memory access was encountered'

a800:3651115:3772442 [0] include/alloc.h:248 NCCL WARN Cuda failure 'an illegal memory access was encountered'

a800:3651115:3772442 [0] include/alloc.h:248 NCCL WARN Cuda failure 'an illegal memory access was encountered'

a800:3651115:3772442 [0] include/alloc.h:248 NCCL WARN Cuda failure 'an illegal memory access was encountered'

a800:3651115:3772442 [0] include/alloc.h:248 NCCL WARN Cuda failure 'an illegal memory access was encountered'

a800:3651115:3772442 [0] include/alloc.h:248 NCCL WARN Cuda failure 'an illegal memory access was encountered'

a800:3651115:3772442 [0] include/alloc.h:248 NCCL WARN Cuda failure 'an illegal memory access was encountered'

a800:3651115:3772442 [0] include/alloc.h:248 NCCL WARN Cuda failure 'an illegal memory access was encountered'

a800:3651115:3772442 [0] include/alloc.h:248 NCCL WARN Cuda failure 'an illegal memory access was encountered'

a800:3651115:3772442 [0] include/alloc.h:248 NCCL WARN Cuda failure 'an illegal memory access was encountered'

a800:3651115:3772442 [0] include/alloc.h:248 NCCL WARN Cuda failure 'an illegal memory access was encountered'

a800:3651115:3772442 [0] include/alloc.h:248 NCCL WARN Cuda failure 'an illegal memory access was encountered'

a800:3651115:3772442 [0] include/alloc.h:248 NCCL WARN Cuda failure 'an illegal memory access was encountered'

a800:3651115:3772442 [0] include/alloc.h:248 NCCL WARN Cuda failure 'an illegal memory access was encountered'

a800:3651115:3772442 [0] include/alloc.h:248 NCCL WARN Cuda failure 'an illegal memory access was encountered'

a800:3651115:3772442 [0] include/alloc.h:248 NCCL WARN Cuda failure 'an illegal memory access was encountered'

a800:3651115:3772442 [0] include/alloc.h:248 NCCL WARN Cuda failure 'an illegal memory access was encountered'

a800:3651115:3772442 [0] include/alloc.h:248 NCCL WARN Cuda failure 'an illegal memory access was encountered'

a800:3651115:3772442 [0] misc/strongstream.cc:110 NCCL WARN Cuda failure 'an illegal memory access was encountered'
a800:3651115:3772442 [0] NCCL INFO init.cc:218 -> 1
a800:3651115:3772442 [0] NCCL INFO init.cc:1933 -> 1

a800:3651115:3772442 [0] init.cc:2065 NCCL WARN commReclaim: cleanup comm 0x563f2b7e1450 rank 0 failed in destroy/abort, error 1
a800:3651115:3772442 [0] NCCL INFO comm 0x563f2b7e1450 rank 0 nranks 1 cudaDev 0 busId 65000 - Abort COMPLETE

@WoosukKwon
Collaborator

Could you try setting the env variable VLLM_ATTENTION_BACKEND=XFORMERS? We'd like to know whether the FlashAttention kernel causes this bug.
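
In case it saves someone a step, here is a minimal sketch of switching the backend in the offline API (for the OpenAI server it is simply a matter of prefixing the launch command with VLLM_ATTENTION_BACKEND=XFORMERS; the model name below is a placeholder, not a confirmed reproduction):

```python
import os

# Select the xFormers attention backend instead of FlashAttention, as suggested above.
# The variable is read when the engine chooses its attention backend, so set it
# before constructing the LLM.
os.environ["VLLM_ATTENTION_BACKEND"] = "XFORMERS"

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2-7B-Instruct-GPTQ-Int4",  # placeholder model
          quantization="gptq",
          enable_prefix_caching=True,
          max_model_len=8192)

print(llm.generate(["ping"], SamplingParams(max_tokens=8))[0].outputs[0].text)
```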

@gaye746560359

@gaye746560359 Does the error only appear once the text length exceeds a certain limit? Stress testing is fine on my side, but the error shows up after the server has been running for a while. Could it be triggered by a long-context request?

I don't know the cause. The strange part is that the problem hasn't broken out on a large scale, which makes it feel like a mystery; the maintainers haven't explained it either.

Are you also on an RTX 40-series card? If so, the problem may be concentrated on the 40 series. A-series cards don't seem to have this issue.

I'm on an RTX 4090.

@Penglikai

@gaye746560359 Does the error only appear once the text length exceeds a certain limit? Stress testing is fine on my side, but the error shows up after the server has been running for a while. Could it be triggered by a long-context request?

I don't know the cause. The strange part is that the problem hasn't broken out on a large scale, which makes it feel like a mystery; the maintainers haven't explained it either.

Are you also on an RTX 40-series card? If so, the problem may be concentrated on the 40 series. A-series cards don't seem to have this issue.

Hit this error on A800 lol

@maxin9966
Author

@WoosukKwon I've been testing it out. The workflow for this test is relatively lightweight because we can't currently access the production environment. I'll run it for a while to see how things go.

@w013nad

w013nad commented Jun 19, 2024

I think my error might be the same as this one: #5687

It's a similar error and happens with a GPTQ version of Qwen2, but on 0.5.0.post1.

@rohanarora

Was running into the very same issue.
Woosuk (@WoosukKwon): setting VLLM_ATTENTION_BACKEND=XFORMERS seems to have resolved it for our runs.

@JulyFinal

JulyFinal commented Jun 24, 2024

@JulyFinal So far every report has been on the 40 series. I'm using flash-attn2 here and am certain there is a problem; I'm just not sure where it lies. I've also found that on 20-series cards (under 0.4.2) the same model supports much less context than on the 40 series (under 0.4.3 and later), yet the 20-series setup has never had a problem.

the same model supports much less context than on the 40 series (under 0.4.3 and later)

I've noticed this too =-= it feels strange that just upgrading a version gives this much more context.

After switching to XFORMERS as the backend there really haven't been any errors so far.

Also, I have a feeling that enabling --enable-prefix-caching makes the returned results somewhat unstable; that didn't seem to happen before.

@KrishnaM251
Contributor

Hi @maxin9966 , have you resolved the issue? If so, what steps did you take to do so? If not, can you please provide steps to reproduce it using the following template (replacing the italicized values with your values)?

  • GPUs
    • 8x A6000
  • Client code
    • python3 benchmarks/benchmark_prefix_caching.py --model meta-llama/Meta-Llama-3-70B-Instruct --dataset-path ShareGPT.json --enable-prefix-caching --num-prompts 20 --repeat-count 5 --input-length-range 128:256
  • Server code
    • python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-70B-Instruct --gpu-memory-utilization 0.40 --tensor-parallel-size 8 --max-model-len 2048 --trust-remote-code --enable-prefix-caching --max-num-seqs 128
  • Env Vars
    • export VLLM_ATTENTION_BACKEND=FLASH_ATTN
    • export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8
  • Version(s) Tested
    • v0.5.2

@JulyFinal

Hi @maxin9966 , have you resolved the issue? If so, what steps did you take to do so? If not, can you please provide steps to reproduce it using the following template (replacing the italicized values with your values)?

  • GPUs
    • 8x A6000
  • Client code
    • python3 benchmarks/benchmark_prefix_caching.py --model meta-llama/Meta-Llama-3-70B-Instruct --dataset-path ShareGPT.json --enable-prefix-caching --num-prompts 20 --repeat-count 5 --input-length-range 128:256
  • Server code
    • python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-70B-Instruct --gpu-memory-utilization 0.40 --tensor-parallel-size 8 --max-model-len 2048 --trust-remote-code --enable-prefix-caching --max-num-seqs 128
  • Env Vars
    • export VLLM_ATTENTION_BACKEND=FLASH_ATTN
    • export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8
  • Version(s) Tested
    • v0.5.2

Just change to export VLLM_ATTENTION_BACKEND=XFORMERS and try it.

@George-ao

George-ao commented Jul 28, 2024

@JulyFinal Several members in our community have encountered similar issues. Version 0.4.2 appears to be functioning normally; we're unsure whether this is due to a problem with flash-attn2 or an issue specific to version 0.4.3.

Yes, I agree. I was able to run a GPTQ Marlin model with enable-chunked-prefill on 0.4.2 but not after this version. From my error log, it is FlashAttention that causes "RuntimeError: CUDA error: an illegal memory access was encountered".

@robinren03

@maxin9966 Hi, has this problem been fixed in recent versions? I ran into it again on 0.5.3.

@TangJiakai

Still happening in version 0.6.1.post2!

@hiyforever

+1 in 0.6.0

@hunanjsd

+1 in 0.6.0

@JulyFinal

Has this bug been fixed?

@mkulariya

Getting the same issue on an L4 with version 0.6.2.

@firefighter-eric

Use --enforce-eager to avoid it.

@tenebrius

Still occurs with --enforce-eager.
Here are my settings:

    from vllm import LLM

    llm = LLM(model=model_name,  # model name defined elsewhere
              quantization="fp8",
              enable_chunked_prefill=True,
              enable_prefix_caching=True,
              gpu_memory_utilization=0.5,
              max_num_batched_tokens=16384,
              max_model_len=2048,
              enforce_eager=True,
              )
