
Conversation

@slippersss (Contributor) commented Nov 1, 2025

What this PR does / why we need it?

This PR aims to fix a memory ordering problem in shared memory by patching the message queue with an additional lock. The key point is to use the writer lock to enforce a memory fence before the ready flag `metadata_buffer[0] = 1` is set.
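A minimal sketch of the intended pattern follows, with hypothetical names (`publish`, `data_buffer`), assuming a `multiprocessing.Lock` shared between the writer and reader processes; the actual patch applies this to the message queue writer path shown in the review below.

```python
import multiprocessing

# Hypothetical illustration of the fix's intent: the lock's
# acquire/release acts as a memory fence, so the payload written
# inside the critical section is guaranteed visible to readers
# no later than the ready flag itself.
writer_lock = multiprocessing.Lock()  # shared with reader processes

def publish(metadata_buffer, data_buffer, payload):
    with writer_lock:
        metadata_buffer[0] = 0                 # block not written yet
        data_buffer[:len(payload)] = payload   # write the payload
        # set the ready flag last; the subsequent lock release
        # orders the payload write before the flag's visibility
        metadata_buffer[0] = 1
```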

This is a temporary solution, and its impact on performance is currently being tested. Looking forward to further discussion and advice, thank you!

Does this PR introduce any user-facing change?

N/A

How was this patch tested?

By CI.

Signed-off-by: Zetong Li <slippersss@126.com>
github-actions bot commented Nov 1, 2025

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description, to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

gemini-code-assist bot left a comment

Code Review

This pull request aims to fix a memory ordering problem in shared memory by introducing a lock. My review found a critical issue in the implementation where the lock is not used correctly, failing to prevent the race condition it was intended to fix. I've provided a code suggestion to correct the lock usage. Additionally, I've pointed out a minor issue with an unused import for general code hygiene.

Comment on lines +143 to +159

```python
with self.writer_lock:
    # mark the block as not written
    metadata_buffer[0] = 0
    # let caller write to the buffer
    with self.buffer.get_data(self.current_idx) as buf:
        yield buf

    # caller has written to the buffer
    # NOTE: order is important here
    # first set the read flags to 0
    # then set the written flag to 1
    # otherwise, the readers may think they already read the block
    for i in range(1, self.buffer.n_reader + 1):
        # set read flag to 0, meaning it is not read yet
        metadata_buffer[i] = 0
# mark the block as written
metadata_buffer[0] = 1
```

critical

There is a critical race condition here. The `writer_lock` is released before the written flag (`metadata_buffer[0]`) is set to 1. A lock's release operation acts as a memory barrier, ensuring that all memory writes within the critical section are visible to other processes. By setting the written flag outside the lock, you lose this guarantee: a reader process might see the written flag as 1 before the data in the buffer is actually visible, leading to it reading stale or corrupted data. To fix this, the line `metadata_buffer[0] = 1` must be moved inside the `with self.writer_lock:` block to ensure all writes are correctly ordered and synchronized.

Suggested change

```diff
 with self.writer_lock:
     # mark the block as not written
     metadata_buffer[0] = 0
     # let caller write to the buffer
     with self.buffer.get_data(self.current_idx) as buf:
         yield buf
     # caller has written to the buffer
     # NOTE: order is important here
     # first set the read flags to 0
     # then set the written flag to 1
     # otherwise, the readers may think they already read the block
     for i in range(1, self.buffer.n_reader + 1):
         # set read flag to 0, meaning it is not read yet
         metadata_buffer[i] = 0
-# mark the block as written
-metadata_buffer[0] = 1
+    # mark the block as written
+    metadata_buffer[0] = 1
```
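Note that a release fence only helps if the readers perform a matching acquire on the same lock. The PR does not show the reader path here; below is a hedged sketch, with a hypothetical `try_read` helper, of what that pairing could look like.

```python
def try_read(metadata_buffer, data_buffer, reader_id, writer_lock):
    # Acquiring the lock pairs with the writer's release: once we
    # observe metadata_buffer[0] == 1 inside the critical section,
    # every buffer write made before the flag was set is visible.
    with writer_lock:
        if metadata_buffer[0] == 1 and metadata_buffer[reader_id] == 0:
            payload = bytes(data_buffer)     # read the published data
            metadata_buffer[reader_id] = 1   # mark block as read by us
            return payload
    return None  # nothing new to read
```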

Comment on lines +18 to +19

```python
from vllm.distributed.device_communicators.shm_broadcast import (
    MessageQueue)
```

high

`MessageQueue` is imported but never used within the `run_engine_core` function. This unused import should be removed to improve code clarity and maintainability.

@slippersss slippersss changed the title [Bugfix][SHM] Fix memory ordering problem in share memory [Bugfix][SHM] Fix memory ordering problem in shm Nov 4, 2025
@slippersss slippersss closed this Nov 4, 2025
