-
Notifications
You must be signed in to change notification settings - Fork 543
[Bugfix][SHM] Fix memory ordering problem in shm #3942
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: Zetong Li <slippersss@126.com>
|
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according Contributing and Testing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request aims to fix a memory ordering problem in shared memory by introducing a lock. My review found a critical issue in the implementation where the lock is not used correctly, failing to prevent the race condition it was intended to fix. I've provided a code suggestion to correct the lock usage. Additionally, I've pointed out a minor issue with an unused import for general code hygiene.
| with self.writer_lock: | ||
| # mark the block as not written | ||
| metadata_buffer[0] = 0 | ||
| # let caller write to the buffer | ||
| with self.buffer.get_data(self.current_idx) as buf: | ||
| yield buf | ||
|
|
||
| # caller has written to the buffer | ||
| # NOTE: order is important here | ||
| # first set the read flags to 0 | ||
| # then set the written flag to 1 | ||
| # otherwise, the readers may think they already read the block | ||
| for i in range(1, self.buffer.n_reader + 1): | ||
| # set read flag to 0, meaning it is not read yet | ||
| metadata_buffer[i] = 0 | ||
| # mark the block as written | ||
| metadata_buffer[0] = 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a critical race condition here. The writer_lock is released before the written_flag (metadata_buffer[0]) is set to 1. A lock's release operation acts as a memory barrier, ensuring that all memory writes within the critical section are visible to other processes. By setting the written_flag outside the lock, you lose this guarantee. A reader process might see the written_flag as 1 before the data in the buffer is actually visible, leading to it reading stale or corrupted data. To fix this, the line metadata_buffer[0] = 1 must be moved inside the with self.writer_lock: block to ensure all writes are correctly ordered and synchronized.
| with self.writer_lock: | |
| # mark the block as not written | |
| metadata_buffer[0] = 0 | |
| # let caller write to the buffer | |
| with self.buffer.get_data(self.current_idx) as buf: | |
| yield buf | |
| # caller has written to the buffer | |
| # NOTE: order is important here | |
| # first set the read flags to 0 | |
| # then set the written flag to 1 | |
| # otherwise, the readers may think they already read the block | |
| for i in range(1, self.buffer.n_reader + 1): | |
| # set read flag to 0, meaning it is not read yet | |
| metadata_buffer[i] = 0 | |
| # mark the block as written | |
| metadata_buffer[0] = 1 | |
| with self.writer_lock: | |
| # mark the block as not written | |
| metadata_buffer[0] = 0 | |
| # let caller write to the buffer | |
| with self.buffer.get_data(self.current_idx) as buf: | |
| yield buf | |
| # caller has written to the buffer | |
| # NOTE: order is important here | |
| # first set the read flags to 0 | |
| # then set the written flag to 1 | |
| # otherwise, the readers may think they already read the block | |
| for i in range(1, self.buffer.n_reader + 1): | |
| # set read flag to 0, meaning it is not read yet | |
| metadata_buffer[i] = 0 | |
| # mark the block as written | |
| metadata_buffer[0] = 1 |
| from vllm.distributed.device_communicators.shm_broadcast import ( | ||
| MessageQueue) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What this PR does / why we need it?
This PR aims to fix memory ordering problem in share memory by patching message queue with an additional lock. The key point is to use the writer lock to enforce memory fence before the ready flag
metadata_buffer[0] = 1is set.This is a temporary solution, and its impact on performance is currently being tested. Looking forward to more discussions and advices, thank you!
Does this PR introduce any user-facing change?
N/A
How was this patch tested?
by ci