-
Notifications
You must be signed in to change notification settings - Fork 690
fix: handle error in port reserve and cleanup allocator #1536
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
WalkthroughThe changes remove all code related to GPU temperature tracking and reporting from the resource management modules. This includes deleting temperature attributes, methods, and log outputs associated with GPU temperature in both the GPU information and allocation logic. Only GPU utilization is now logged or reported in statistics. Additionally, the port reservation utility was simplified by replacing a complex function with a streamlined class-based context manager. Changes
Poem
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
Documentation and Community
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🔭 Outside diff range comments (1)
deploy/sdk/src/dynamo/sdk/lib/resource.py (1)
140-163: 💡 Verification agent🧩 Analysis chain
Any lingering consumers of the removed
temperaturefield will now break
GPUInfoandget_gpu_stats()no longer expose atemperaturekey/attribute.
To avoid runtimeKeyError/AttributeError, verify no other module still expects it.
🏁 Script executed:
#!/bin/bash # Find any remaining usages of the old temperature field/key. rg -n --no-heading -e '\.temperature' -e "['\"]temperature['\"]" | headLength of output: 1112
Consumers of removed
temperaturefield detected — update or remove referencesThe search uncovered multiple callsites still expecting a
temperatureattribute or key. These will fail at runtime withKeyError/AttributeError. Please update or remove these references, or reintroduce a default temperature field if intended.Affected locations:
- tests/serve/test_dynamo_serve.py:69, 76 (
"temperature": 0.1in test payloads)- launch/dynamo-run/src/subprocess/sglang_inc.py:57–58 (
request["sampling_options"]["temperature"])- examples/sglang/components/worker.py:104–105 (
request.sampling_options.temperature)- launch/dynamo-run/src/input/batch.rs:227 (
.temperature(template.as_ref().map_or(0.7, |t| t.temperature)))- launch/dynamo-run/src/input/text.rs:105 (
.temperature(template.as_ref().map_or(0.7, |t| t.temperature)))- examples/vllm_v0/components/worker.py:239–240 (
request.sampling_options.temperature)Recommended fixes:
- Remove or refactor any code/tests that refer to
temperature.- Update tests to align with the new
GPUInfoschema.- If sampling temperature is still needed, add it back into
GPUInfo(with a sensible default) or handle it upstream in the sampling logic.
🧹 Nitpick comments (2)
deploy/sdk/src/dynamo/sdk/lib/resource.py (1)
166-171: Docstring could be more explicit now that temperature is goneThe docstring still ends with the vague “etc.”.
Since this method now only updates memory, utilization and process info, tighten the wording to reflect that and avoid future confusion.-"""Update GPU statistics (utilization, memory etc.).""" +"""Refresh memory, utilization, and process statistics for all GPUs."""deploy/sdk/src/dynamo/sdk/cli/allocator.py (1)
233-237: Left-over trailing space & missing comma in log stringMinor formatting cleanup:
- There’s an extra space just before the closing quote in the
Utilizationfragment.- With the temperature column removed, keep the separating comma after the utilization block for consistency with earlier parts of the message.
- f"Utilization: {stat['gpu_utilization']}% " + f"Utilization: {stat['gpu_utilization']}%,"
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
deploy/sdk/src/dynamo/sdk/cli/allocator.py(1 hunks)deploy/sdk/src/dynamo/sdk/lib/resource.py(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Build and Test - vllm
50d3c02 to
933aa3a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Nitpick comments (2)
deploy/sdk/src/dynamo/sdk/cli/utils.py (2)
77-84:__exit__swallows all exceptions raised during socket close.Returning
Truefrom__exit__suppresses any exception that occurred inside thewithblock, not only the ones fromsocket.close().
Consider limiting suppression to the close-error case:- except socket.error as e: + except socket.error as e: logger.warning(f"Error while closing socket: {str(e)}") - # Don't re-raise the exception as this is cleanup code - return True + return True # Suppress only the close errorand leave the default
Nonepath untouched so real errors propagate.🧰 Tools
🪛 Pylint (3.3.7)
[refactor] 77-77: Either all return statements in a function should return an expression, or none of them should.
(R1710)
61-66: Minor typing / style nits
self.socket: socket.socket | Noneandself.port: int | Noneimprove readability for static checkers.- Consider setting
SO_REUSEADDRto avoid Linux “TIME_WAIT” binding hiccups in quick succession tests.No functional change required if resource lifetime is short.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
deploy/sdk/src/dynamo/sdk/cli/allocator.py(1 hunks)deploy/sdk/src/dynamo/sdk/cli/utils.py(2 hunks)deploy/sdk/src/dynamo/sdk/lib/resource.py(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- deploy/sdk/src/dynamo/sdk/cli/allocator.py
- deploy/sdk/src/dynamo/sdk/lib/resource.py
🧰 Additional context used
🪛 Pylint (3.3.7)
deploy/sdk/src/dynamo/sdk/cli/utils.py
[refactor] 77-77: Either all return statements in a function should return an expression, or none of them should.
(R1710)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Build and Test - vllm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (3)
deploy/sdk/src/dynamo/sdk/cli/utils.py (3)
61-76:PortReserverlooks solid but consider enablingSO_REUSEADDRto avoid TIME_WAIT clashesThe new class nicely addresses the API-break/high-FD-leak concerns raised earlier ✅.
For long-running CLIs that reserve/release the same port repeatedly, lingering sockets inTIME_WAITcan still causeEADDRINUSE. A one-liner beforebind()avoids that:self.socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM) + self.socket.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)No behaviour change for the happy path; only improves robustness on rapid restarts.
82-88: Inconsistent return paths – drop thereturn Truefromclose_socket()
close_socket()returnsNonein the normal path butTruein the error path, tripping pylint R1710 and conveying no useful information (the caller ignores the value).
Simplest fix:- except socket.error as e: - logger.warning(f"Error while closing socket: {str(e)}") - # Don't re-raise the exception as this is cleanup code - return True + except socket.error as e: + logger.warning(f"Error while closing socket: {str(e)}") + # Swallow cleanup errorsNow every exit path returns
None, matching expectations.
91-100: Return-type annotation ofreserve_free_portis misleadingWith
@contextlib.contextmanagerapplied, the function actually returns a generator-based context manager, not anAbstractContextManager[int].
Prefer one of:-) -> contextlib.AbstractContextManager[int]: +) -> Iterator[int]: # simplest # or +) -> Generator[int, None, None]: # most explicitThis keeps static checkers (mypy/pyright) quiet and clarifies what callers receive.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
deploy/sdk/src/dynamo/sdk/cli/utils.py(2 hunks)
🧰 Additional context used
🪛 Pylint (3.3.7)
deploy/sdk/src/dynamo/sdk/cli/utils.py
[refactor] 81-81: Either all return statements in a function should return an expression, or none of them should.
(R1710)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Build and Test - vllm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (1)
deploy/sdk/src/dynamo/sdk/cli/utils.py (1)
67-77: Optional: setSO_REUSEADDRto minimise TIME_WAIT collisionsBinding to an ephemeral port is usually fine, but on some platforms the port can remain in
TIME_WAITafterclose(), causing sporadic “address already in use” when the caller immediately re-binds. Adding:self.socket.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)just before
bind()makes reuse deterministic while still reserving the port.
Low-risk, purely optional.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
deploy/sdk/src/dynamo/sdk/cli/circus.py(1 hunks)deploy/sdk/src/dynamo/sdk/cli/utils.py(2 hunks)
✅ Files skipped from review due to trivial changes (1)
- deploy/sdk/src/dynamo/sdk/cli/circus.py
🧰 Additional context used
🪛 Pylint (3.3.7)
deploy/sdk/src/dynamo/sdk/cli/utils.py
[refactor] 81-81: Either all return statements in a function should return an expression, or none of them should.
(R1710)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Build and Test - vllm
🔇 Additional comments (1)
deploy/sdk/src/dynamo/sdk/cli/utils.py (1)
91-100: ```shell
#!/bin/bashList all invocations of reserve_free_port outside its definition file
rg -n 'reserve_free_port(' | grep -v 'deploy/sdk/src/dynamo/sdk/cli/utils.py'
</details> </blockquote></details> </details> <!-- This is an auto-generated comment by CodeRabbit for review status -->
Overview:
Fixes
Details:
Where should the reviewer start?
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
Summary by CodeRabbit