fix: increase shm default size and make it configurable #2616
Walkthrough

Adds a SharedMemory configuration to CRDs and API types, switches shared memory sizing in graph logic to being spec-driven (default 8Gi, optional override, disable supported), and updates tests to reflect 8Gi shared memory. CRD Helm/operator YAMLs and deepcopy methods are updated accordingly.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor User
    participant Reconciler
    participant Graph as Graph Generator
    participant SM as generateSharedMemoryVolumeAndMount
    participant Pod as PodSpec Builder
    User->>Reconciler: Apply CRD with spec.sharedMemory
    Reconciler->>Graph: GenerateBasePodSpec(component, sharedMemory)
    Graph->>SM: Build shared memory (SharedMemorySpec)
    alt sharedMemory.disabled == true
        SM-->>Graph: return nil, nil
    else size provided or default (8Gi)
        SM-->>Graph: return *Volume, *VolumeMount
    end
    alt Volume/Mount present
        Graph->>Pod: Append volume and mount
    else
        Graph-->>Pod: Skip shared memory
    end
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Warning: There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it is a critical failure.

🔧 golangci-lint (2.2.2): Error: can't load config: unsupported version of the configuration: "". See https://golangci-lint.run/product/migration-guide for migration instructions.
Actionable comments posted: 3
🧹 Nitpick comments (12)
deploy/cloud/operator/api/v1alpha1/common.go (1)
47-51: Add validation hints and docs to SharedMemorySpec (clarify default/override semantics).

The type looks good. Two small improvements:
- Document the default (8Gi) and precedence: when Disabled is true, Size is ignored.
- Consider adding a validation webhook in this package (or controller-level validation) to reject non-positive sizes when Disabled is false. CRD OpenAPI for resource.Quantity will allow “0” unless you gate it.
Minimal inline doc and future-proofing diff:
```diff
 type SharedMemorySpec struct {
-	Disabled bool              `json:"disabled,omitempty"`
-	Size     resource.Quantity `json:"size,omitempty"`
+	// Disabled disables mounting tmpfs at /dev/shm entirely. When true, Size is ignored.
+	Disabled bool              `json:"disabled,omitempty"`
+	// Size requests the tmpfs size for /dev/shm. If omitted or zero, the operator
+	// defaults to 8Gi. Has effect only when Disabled is false.
+	Size     resource.Quantity `json:"size,omitempty"`
 }
```

If you’d like stronger guarantees, I can sketch a lightweight admission validation to enforce “size > 0 when not disabled.”
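For illustration, such an admission-time check could look roughly like the sketch below. This uses simplified stand-in types (an `int64` byte count instead of `resource.Quantity`, and a hypothetical `validateSharedMemory` helper name); the real operator would wire this into a validating webhook.

```go
package main

import (
	"errors"
	"fmt"
)

// SharedMemorySpec is a simplified stand-in for the proposed API type;
// Size is an int64 byte count here instead of resource.Quantity.
type SharedMemorySpec struct {
	Disabled bool
	Size     int64 // bytes; 0 means "unset, use default"
}

// validateSharedMemory rejects negative sizes, and a size set alongside
// Disabled=true, matching the suggested CEL rule.
func validateSharedMemory(s *SharedMemorySpec) error {
	if s == nil {
		return nil // absent spec: operator applies defaults
	}
	if s.Disabled && s.Size != 0 {
		return errors.New("sharedMemory.size must be omitted when disabled is true")
	}
	if s.Size < 0 {
		return errors.New("sharedMemory.size must be non-negative")
	}
	return nil
}

func main() {
	fmt.Println(validateSharedMemory(&SharedMemorySpec{Disabled: true, Size: 1 << 30}))
}
```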
deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller_test.go (1)
796-797: Prefer a single 8Gi quantity helper to avoid test brittleness and magic numbers.

You repeat the literal `resource.NewQuantity(8*1024*1024*1024, resource.BinarySI)` across expected specs. Define it once per test and reuse it to keep expectations stable and readable. Example (outside the selected lines, near the top of the test function):
```go
eightGi := resource.NewQuantity(8<<30, resource.BinarySI)
```

Then update expectations:

```go
SizeLimit: eightGi,
```

Also, to reduce drift with production code, consider using the canonical volume name constant instead of the string:

```go
Name: commonconsts.KubeValueNameSharedMemory,
```

I can provide a full patch touching all occurrences if you want.
Also applies to: 895-897
deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go (1)
95-97: Good addition; add a usage note and keep API docs consistent.

Field wiring looks right. Add a brief note about the default and disabling behavior so it shows up in CRD docs, matching what tests assert (default 8Gi, omit volume when disabled).
```diff
-	// SharedMemory controls the tmpfs mounted at /dev/shm (enable/disable and size).
+	// SharedMemory controls the tmpfs mounted at /dev/shm (enable/disable and size).
+	// If omitted, the operator mounts /dev/shm with a default size of 8Gi.
+	// If disabled, no /dev/shm volume/mount is added.
 	SharedMemory *SharedMemorySpec `json:"sharedMemory,omitempty"`
```

Optional: if you ever decide to surface defaults via CRD, a mutating webhook is required because the whole object is optional; setting a default just on `.size` won’t trigger when `sharedMemory` itself is nil.

deploy/cloud/operator/internal/dynamo/graph_test.go (2)
1238-1238: Deduplicate 8Gi quantity and assert-by-intent (helper + constant names).

Great alignment to the new default. Two refinements to keep tests maintainable:
- Introduce a shared helper/var for 8Gi to avoid repeated magic numbers and reduce accidental drift if the default changes again.
- Use `commonconsts.KubeValueNameSharedMemory` for volume names wherever you currently use `"shared-memory"` in expectations, just like you already do for mounts.

Sketch (outside the selected lines, e.g., at file top or per test):

```go
var eightGi = resource.NewQuantity(8<<30, resource.BinarySI) // 8 GiB

// ... in expectations:
SizeLimit: eightGi,
Name:      commonconsts.KubeValueNameSharedMemory,
```

Coverage suggestion:
- Add two focused unit tests for generateSharedMemoryVolumeAndMount (or the nearest seam) to verify:
- spec.SharedMemory.Disabled=true → no volume/mount produced.
- spec.SharedMemory.Size=2Gi → SizeLimit=2Gi.
This complements the default-path assertions present here. I can draft these tests if helpful.

Also applies to: 1381-1381, 1736-1736, 1886-1886, 1992-1992, 2037-2037, 2513-2513, 2652-2652, 2759-2759, 2903-2903
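The two suggested cases (plus the default path) could be captured in a table-driven style like the sketch below. The types and `buildShm` helper are hypothetical stand-ins; real tests would exercise `generateSharedMemoryVolumeAndMount` with corev1 types and `resource.Quantity`.

```go
package main

import "fmt"

// Stand-in types for the sketch; not the operator's real API.
type SharedMemorySpec struct {
	Disabled  bool
	SizeBytes int64 // 0 = unset
}

type Volume struct{ SizeLimit int64 }

// buildShm mirrors the described semantics: disabled → nil,
// explicit size → that size, otherwise the 8Gi default.
func buildShm(spec *SharedMemorySpec) *Volume {
	if spec != nil && spec.Disabled {
		return nil
	}
	size := int64(8) << 30
	if spec != nil && spec.SizeBytes > 0 {
		size = spec.SizeBytes
	}
	return &Volume{SizeLimit: size}
}

// runCases evaluates the three behaviors the review asks to cover.
func runCases() map[string]bool {
	results := map[string]bool{}
	results["disabled_nil"] = buildShm(&SharedMemorySpec{Disabled: true}) == nil
	if v := buildShm(&SharedMemorySpec{SizeBytes: int64(2) << 30}); v != nil {
		results["override_2Gi"] = v.SizeLimit == int64(2)<<30
	}
	if v := buildShm(nil); v != nil {
		results["default_8Gi"] = v.SizeLimit == int64(8)<<30
	}
	return results
}

func main() {
	fmt.Println(runCases())
}
```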
1238-1238: Operational note: 8Gi tmpfs vs container memory limits (document/guard rails).

An EmptyDir with `Medium: Memory` counts toward the container’s memory cgroup. Bumping the default to 8Gi is fine, but pods with tighter memory limits may OOM well before hitting this SizeLimit. Consider:
- Documenting this behavior in user-facing docs/CRD descriptions.
- Optionally validating at reconcile time that requested /dev/shm size does not exceed the container memory limit (warn or clamp), when limits are specified.
If you want, I can propose a non-fatal Event warning emitted by the reconciler when Size > container memory limit.
Also applies to: 1381-1381, 1736-1736, 1886-1886, 1992-1992, 2037-2037, 2513-2513, 2652-2652, 2759-2759, 2903-2903
deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml (1)
10244-10255: Add CEL validation to prevent conflicting config (size set when disabled=true)

Right now, users can set both disabled: true and size: ...; the behavior would be ambiguous. Add a CEL rule to enforce that size must be omitted when disabled is true.
Apply this diff in-place to the sharedMemory schema:
```diff
         sharedMemory:
           description: SharedMemory controls the tmpfs mounted at /dev/shm (enable/disable and size).
           properties:
             disabled:
               type: boolean
             size:
               anyOf:
               - type: integer
               - type: string
               pattern: ^(\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))(([KMGTPE]i)|[numkMGTPE]|([eE](\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))))?$
               x-kubernetes-int-or-string: true
+          x-kubernetes-validations:
+          - rule: '!(has(self.disabled) && self.disabled) || !has(self.size)'
+            message: 'When disabled is true, size must be omitted.'
           type: object
```

deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml (2)
10344-10354: Document defaults and surface them in the schema (8Gi, disabled=false).

Given the code defaults to 8Gi when unspecified and treats the feature as enabled unless explicitly disabled, encode and document those defaults in the CRD to prevent drift and improve UX.
Apply this diff snippet within the sharedMemory object:
```diff
         sharedMemory:
-          description: SharedMemory controls the tmpfs mounted at /dev/shm (enable/disable and size).
+          description: SharedMemory controls the tmpfs mounted at /dev/shm (enable/disable and size). If omitted, shared memory is enabled with a default size of 8Gi.
           properties:
             disabled:
+              description: When true, disables mounting tmpfs at /dev/shm for this service.
+              default: false
               type: boolean
             size:
-              anyOf:
-              - type: integer
-              - type: string
-                pattern: ^(\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))(([KMGTPE]i)|[numkMGTPE]|([eE](\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))))?$
+              description: 'Size of the /dev/shm tmpfs. Examples: "8Gi", "512Mi", 1073741824.'
+              anyOf:
+              - type: integer
+                minimum: 0
+              - type: string
+                # disallow negative quantities; allow decimal or binary SI suffixes
+                pattern: ^(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))(([KMGTPE]i)|[numkMGTPE]|([eE](\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))))?$
               x-kubernetes-int-or-string: true
+              default: "8Gi"
           type: object
```
10344-10354: Disallow unknown fields under sharedMemory.

To catch typos like “sizes” or “disable”, set additionalProperties: false on the sharedMemory object.
Apply:
```diff
         sharedMemory:
           description: SharedMemory controls the tmpfs mounted at /dev/shm (enable/disable and size).
-          properties:
+          additionalProperties: false
+          properties:
             disabled:
               type: boolean
             size:
               anyOf:
               - type: integer
               - type: string
               pattern: ^(\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))(([KMGTPE]i)|[numkMGTPE]|([eE](\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))))?$
               x-kubernetes-int-or-string: true
           type: object
```

deploy/cloud/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yaml (1)
10244-10255: Enhance sharedMemory schema: add defaults and validation

Verified that the Go API (`api/v1alpha1/common.go`) and runtime graph logic (`internal/dynamo/graph.go` at lines 1184–1188) implement the defaults (enabled by default, size = 8Gi), but the CRD and Helm templates do not surface these defaults or validation rules. Adding them will improve UX in UIs and `kubectl explain`.

• In `deploy/cloud/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yaml` (lines 10244–10255):
  – Add `default: false` for `disabled`.
  – Add `default: "8Gi"` and a descriptive `description` for `size`.
  – Add an `x-kubernetes-validations` rule to forbid setting `size` when `disabled=true`.
• Mirror the same schema updates in the Helm template at `deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml` (lines 10244–10251).
• (Optional) For consistency with other fields (e.g. `ingress.enabled`), consider renaming `disabled` → `enabled` (a breaking change across API/types/graph logic) or aliasing/deprecating `disabled`.

Apply this diff to the CRD:
```diff
         sharedMemory:
           description: SharedMemory controls the tmpfs mounted at /dev/shm (enable/disable and size).
           properties:
             disabled:
               type: boolean
+              default: false
             size:
               anyOf:
               - type: integer
               - type: string
               pattern: ^(\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))(([KMGTPE]i)|[numkMGTPE]|([eE](\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))))?$
               x-kubernetes-int-or-string: true
+              description: 'Size of /dev/shm as a Kubernetes quantity. Examples: "8Gi", "1Gi", "512Mi".'
+              default: "8Gi"
           type: object
+          x-kubernetes-validations:
+          - rule: "!(has(self.disabled) && self.disabled && has(self.size))"
+            message: "sharedMemory.size must be omitted when sharedMemory.disabled=true"
```

deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml (2)
10343-10355: Nice addition; align schema docs with runtime defaults and override precedence

The new sharedMemory override is well placed under per-service overrides. To reduce ambiguity for users, please document:
- Default behavior (operator defaults to 8Gi when size is omitted).
- Precedence vs component-level sharedMemory (per-service should override component-level).
You can update descriptions in-place without changing behavior.
```diff
         sharedMemory:
-          description: SharedMemory controls the tmpfs mounted at /dev/shm (enable/disable and size).
+          description: SharedMemory controls the tmpfs mounted at /dev/shm for this service. When omitted, the operator mounts /dev/shm with a default size (currently 8Gi). Per-service settings here override the component-level sharedMemory.
           properties:
             disabled:
+              description: If true, do not mount /dev/shm for this service. Per-service value overrides component-level sharedMemory. Defaults to false.
               type: boolean
             size:
+              description: Optional size limit for /dev/shm as a Kubernetes quantity (e.g., 8Gi). If omitted, the operator uses its default (currently 8Gi). Ignored when disabled is true.
               anyOf:
               - type: integer
               - type: string
-                pattern: ^(\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))(([KMGTPE]i)|[numkMGTPE]|([eE](\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))))?$
+                pattern: ^(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))(([KMGTPE]i)|[numkMGTPE]|([eE](\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))))?$
               x-kubernetes-int-or-string: true
           type: object
```
10343-10355: Operational note: 8Gi default may interact with node memory limits

An 8Gi sizeLimit doesn’t pre-allocate memory, but it caps tmpfs growth. On smaller nodes or tight quotas, processes writing to /dev/shm can still OOM if container/pod memory requests/limits are lower than the cap. Consider documenting this in user-facing docs and recommending alignment between sharedMemory.size and container memory limits.
deploy/cloud/operator/internal/dynamo/graph.go (1)
1183-1208: Centralize SharedMemory default and ensure consistency

Verified that neither the CRD definitions nor the Helm templates specify an alternate “8Gi” default, so the code’s literal is the sole source of truth. To reduce future drift and keep tests, CRDs, and templates in sync, I recommend extracting the “8Gi” default into a single constant.
• In deploy/cloud/operator/internal/dynamo/graph.go, replace the hard-coded literal:
```diff
-	// default: enabled=true, size=8Gi
-	size := resource.MustParse("8Gi")
+	// default: enabled=true, size=8Gi (centralized)
+	size := resource.MustParse(commonconsts.DefaultSharedMemorySize)
```

• Add to `deploy/cloud/operator/internal/consts/consts.go`:
```go
const DefaultSharedMemorySize = "8Gi"
```

Optionally, if you’d like to guard against requesting a tmpfs larger than the container’s memory limit (which can lead to confusing OOMs), we can refactor this helper to accept the computed memory limit and clamp or log when spec.Size exceeds it. Let me know if you’d like a draft for that.
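A minimal sketch of that guard, with plain byte counts standing in for `resource.Quantity` (the helper name `clampShmSize` and the clamp-and-warn policy are assumptions, not the operator's actual behavior):

```go
package main

import "fmt"

const defaultSharedMemorySize = int64(8) << 30 // 8Gi, the centralized default

// clampShmSize returns the effective /dev/shm size given an optional request
// and an optional container memory limit (0 = no limit). It reports whether
// the request was clamped so the caller can emit a warning Event.
func clampShmSize(requested, memLimit int64) (size int64, clamped bool) {
	size = requested
	if size == 0 {
		size = defaultSharedMemorySize // unset: apply default
	}
	if memLimit > 0 && size > memLimit {
		return memLimit, true // tmpfs larger than the cgroup limit would OOM first
	}
	return size, false
}

func main() {
	s, c := clampShmSize(0, int64(4)<<30)
	fmt.Printf("size=%d clamped=%v\n", s, c)
}
```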
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
- deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml (1 hunks)
- deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml (1 hunks)
- deploy/cloud/operator/api/v1alpha1/common.go (1 hunks)
- deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go (1 hunks)
- deploy/cloud/operator/api/v1alpha1/zz_generated.deepcopy.go (2 hunks)
- deploy/cloud/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yaml (1 hunks)
- deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml (1 hunks)
- deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller_test.go (2 hunks)
- deploy/cloud/operator/internal/dynamo/graph.go (2 hunks)
- deploy/cloud/operator/internal/dynamo/graph_test.go (10 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
deploy/cloud/operator/internal/dynamo/graph.go (1)
deploy/cloud/operator/internal/consts/consts.go (1)
KubeValueNameSharedMemory (53-53)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Mirror Repository to GitLab
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (10)
deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml (2)
10244-10255: LGTM: new spec.sharedMemory field is well-shaped for K8s quantity and optional use

Good use of x-kubernetes-int-or-string with the canonical quantity regex; placing this under spec keeps the API surface consistent with adjacent resource knobs. No breaking-change risk since the field is optional.
10244-10255: Add descriptive metadata and document default 8 Gi behavior on sharedMemory
- Confirmed that the operator’s graph builder sets the EmptyDir size limit to 8 GiB by default (`resource.NewQuantity(8*1024*1024*1024, resource.BinarySI)`) in internal/dynamo/graph.go, and that this is exercised in multiple controller and graph tests
- CRD’s sharedMemory properties lack leaf-field descriptions and do not expose that omission of size (when disabled is false) yields an 8 Gi default
- To ensure users discover this, enrich both fields with descriptions and note the default; if helpful, add an example in the schema
Proposed tweak (in deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml):
```diff
         sharedMemory:
           description: SharedMemory controls the tmpfs mounted at /dev/shm (enable/disable and size).
           properties:
             disabled:
+              description: Set to true to prevent mounting a /dev/shm tmpfs for this component.
               type: boolean
             size:
+              description: >
+                Maximum size of /dev/shm as a Kubernetes resource.Quantity (e.g., "8Gi").
+                If unset (and disabled is not true), the operator defaults to 8 Gi.
               anyOf:
               - type: integer
               - type: string
               pattern: ^(\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))(([KMGTPE]i)|[numkMGTPE]|([eE](\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))))?$
               x-kubernetes-int-or-string: true
           type: object
```

Please verify in internal/dynamo/graph.go that:

- omitting `size` indeed yields an 8 Gi EmptyDir
- setting `disabled: true` returns no volume/mount, so `/dev/shm` is not created

deploy/cloud/operator/api/v1alpha1/zz_generated.deepcopy.go (1)
246-250: Correct deep-copy handling for new SharedMemory pointer field.

Nil-safe allocation and invoking DeepCopyInto on SharedMemorySpec is consistent with the surrounding pattern and prevents aliasing. Looks good.
deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml (3)
10343-10355: Good addition: per-service sharedMemory override looks well-scoped.

The field name, placement (under service overrides), and high-level description are sensible and align with the PR goal.
10343-10355: Placement confirmed: sharedMemory is nested under services.additionalProperties.properties

I’ve verified the indentation and hierarchy in the CRD schema:

- At line 183, `services` is defined under `spec.properties`.
- Under `services.additionalProperties.properties` (indent level 20–22 spaces), `sharedMemory` appears alongside other per-service fields (`annotations`, `autoscaling`, `serviceName`, etc.).
- After listing its `properties` and `type: object`, the block correctly closes back to the `additionalProperties` and then `services` scopes.

No relocation is needed: `sharedMemory` is already correctly placed under `spec.properties.services.additionalProperties.properties`.
10343-10355: sharedMemory schema is consistent between Operator and Helm CRDs

I’ve compared the `sharedMemory` block across both the operator-generated CRDs and the Helm template for both Graph and Component deployments and found them identical; no schema drift detected. Given that these files are autogenerated from the same upstream definitions, no manual alignment is required at this time.

• deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml vs. deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml
• deploy/cloud/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yaml vs. deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml

No further changes needed.
deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml (2)
10349-10354: Good use of int-or-string for Quantity compatibility

Using anyOf with x-kubernetes-int-or-string is the right approach for resource.Quantity-style fields and maintains kubebuilder compatibility.
10343-10355: Action required: verify sharedMemory graph wiring & default behavior

- CRD schema for `sharedMemory` is identical in both Helm templates (deploy/cloud/helm/crds/templates/nvidia.com_*deployments.yaml) and operator bases (deploy/cloud/operator/config/crd/bases/nvidia.com_*deployments.yaml): the `disabled`/`size` properties, `anyOf` constraints, `pattern`, and `x-kubernetes-int-or-string` settings all match.
- The Go API type in `deploy/cloud/operator/api/v1alpha1/common.go` defines

  ```go
  type SharedMemorySpec struct {
  	Disabled bool              `json:"disabled,omitempty"`
  	Size     resource.Quantity `json:"size,omitempty"`
  }
  ```

  and autogenerated `DeepCopyInto`/`DeepCopy` methods exist in `zz_generated.deepcopy.go`.
- We were unable to locate the reconciliation or graph-layer code that:
  - applies per-service override precedence over component-level `sharedMemory`
  - injects a default `8Gi` Size when none is specified

  Please review the render/reconcile packages (e.g. under `deploy/cloud/operator/internal/...`) to confirm these behaviors are implemented.
- No occurrences of the literal `8Gi` were found in Go code or YAML tests; ensure tests are updated to assert an `8Gi` default for `sharedMemory.Size`.

deploy/cloud/operator/internal/dynamo/graph.go (2)
1183-1208: Spec-driven /dev/shm generation looks correct

Default enablement, 8Gi default, optional override, and disable semantics are implemented cleanly. Pointer returns make the call site ergonomic for the “optional” behavior. Mounting to /dev/shm with EmptyDir{Medium: Memory} is the right approach for Linux workloads.
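For readers without the diff at hand, the described semantics can be sketched as follows. This is an illustrative stand-in, not the operator's actual code: `Volume`/`VolumeMount` are simplified structs in place of corev1 types, sizes are byte counts instead of `resource.Quantity`, and the volume name constant is assumed.

```go
package main

import "fmt"

// Simplified stand-ins for the spec and corev1 types.
type SharedMemorySpec struct {
	Disabled  bool
	SizeBytes int64 // 0 = unset
}

type Volume struct {
	Name      string
	SizeLimit int64 // stands in for EmptyDir{Medium: Memory, SizeLimit: ...}
}

type VolumeMount struct {
	Name      string
	MountPath string
}

const sharedMemoryVolumeName = "shared-memory" // assumed constant name

// generateSharedMemoryVolumeAndMount mirrors the described behavior:
// nil spec or zero size → 8Gi default; Disabled=true → no volume/mount.
func generateSharedMemoryVolumeAndMount(spec *SharedMemorySpec) (*Volume, *VolumeMount) {
	size := int64(8) << 30 // default: enabled, 8Gi
	if spec != nil {
		if spec.Disabled {
			return nil, nil // caller skips appending anything
		}
		if spec.SizeBytes > 0 {
			size = spec.SizeBytes // explicit override
		}
	}
	v := &Volume{Name: sharedMemoryVolumeName, SizeLimit: size}
	m := &VolumeMount{Name: sharedMemoryVolumeName, MountPath: "/dev/shm"}
	return v, m
}

func main() {
	v, m := generateSharedMemoryVolumeAndMount(nil)
	fmt.Println(v.SizeLimit, m.MountPath)
}
```

The pointer returns let the call site treat "no shared memory" and "shared memory with these settings" uniformly with a single nil check.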
781-784: Original duplicate-volume guard suggestion is not sufficient

The proposed `hasVolume(volumes, …)` check only inspects the local `volumes` slice, not the merged `podSpec.Volumes`, so it cannot prevent name collisions when users supply a “shared-memory” volume via `ExtraPodSpec.PodSpec`. In this code, `volumes` is appended to `podSpec.Volumes` after merging any user-provided volumes, so guarding only the local slice leaves the real duplication unaddressed.

• The guard as written never sees user-supplied volumes (they live in `podSpec.Volumes`, not `volumes`).
• To truly avoid collisions, dedupe after merging `ExtraPodSpec.PodSpec` by checking `podSpec.Volumes` for the “shared-memory” name.
• Similarly, mounts should be deduped against the final container’s `VolumeMounts`, not only the local container slice.

Because the original patch would not prevent the API-server-rejected duplicates in practice, please disregard it.
Likely an incorrect or invalid review comment.
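A correct guard would run against the merged volume list. A minimal sketch (with a stand-in `Volume` struct; `hasVolume` and `appendUnique` are hypothetical helper names, and the real code would operate on `podSpec.Volumes` after the `ExtraPodSpec` merge):

```go
package main

import "fmt"

// Volume stands in for corev1.Volume; only the name matters for dedup.
type Volume struct{ Name string }

// hasVolume reports whether a volume with the given name already exists.
// To be effective, it must run against the merged pod spec volumes
// (including user-supplied ExtraPodSpec volumes), not a local slice.
func hasVolume(volumes []Volume, name string) bool {
	for _, v := range volumes {
		if v.Name == name {
			return true
		}
	}
	return false
}

// appendUnique appends v only if no volume with the same name is present,
// avoiding API-server rejection for duplicate volume names.
func appendUnique(volumes []Volume, v Volume) []Volume {
	if hasVolume(volumes, v.Name) {
		return volumes
	}
	return append(volumes, v)
}

func main() {
	// e.g. a user already supplied "shared-memory" via ExtraPodSpec
	merged := []Volume{{Name: "shared-memory"}}
	merged = appendUnique(merged, Volume{Name: "shared-memory"})
	fmt.Println(len(merged))
}
```

The same pattern would apply to container `VolumeMounts`, checked against the final container rather than a local slice.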
biswapanda left a comment:
Thanks! Lgtm
Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
Signed-off-by: Jason Zhou <jasonzho@jasonzho-mlt.client.nvidia.com>
Signed-off-by: Krishnan Prashanth <kprashanth@nvidia.com>
Signed-off-by: nnshah1 <neelays@nvidia.com>
Overview:
increase shm default size and make it configurable