@nv-anants nv-anants commented Sep 10, 2025

Overview:

Avoid connection resets and other transient network failures in the curl download by adding retries. Example failure: https://github.com/ai-dynamo/dynamo/actions/runs/17615998578/job/50049141692?pr=2967#step:7:17474

#31 [runtime 13/25] RUN ARCH=$(dpkg --print-architecture) &&     case "$ARCH" in         amd64) PLATFORM=linux-amd64 ;;         arm64) PLATFORM=linux-arm64 ;;         *) echo "Unsupported architecture: $ARCH" && exit 1 ;;     esac &&     curl -fsSL "https://github.com/prometheus/prometheus/releases/download/v3.4.1/prometheus-3.4.1.${PLATFORM}.tar.gz"     | tar -xz -C /tmp &&     mv "/tmp/prometheus-3.4.1.${PLATFORM}/prometheus" /usr/local/bin/ &&     chmod +x /usr/local/bin/prometheus &&     rm -rf "/tmp/prometheus-3.4.1.${PLATFORM}"
#31 1.996 curl: (55) Send failure: Connection reset by peer
#31 2.006 
#31 2.006 gzip: stdin: unexpected end of file
#31 2.006 tar: Unexpected EOF in archive
#31 2.006 tar: Unexpected EOF in archive
#31 2.006 tar: Error is not recoverable: exiting now
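
The platform selection and URL construction packed into the failing step are easier to read as a standalone shell function. This is a minimal sketch: `prometheus_url` is a hypothetical helper name that does not appear in the Dockerfiles, and the version and architecture are passed as arguments instead of being read from `dpkg`.

```shell
# Hypothetical helper: build the Prometheus release URL for a Debian
# architecture name, mirroring the case statement in the failing step.
prometheus_url() {
    version=$1
    arch=$2
    case "$arch" in
        amd64) platform=linux-amd64 ;;
        arm64) platform=linux-arm64 ;;
        *) echo "Unsupported architecture: $arch" >&2; return 1 ;;
    esac
    echo "https://github.com/prometheus/prometheus/releases/download/v${version}/prometheus-${version}.${platform}.tar.gz"
}

prometheus_url 3.4.1 amd64
```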

Summary by CodeRabbit

  • Bug Fixes
    • Improved reliability of container image builds by adding automatic retry logic to the Prometheus download step, reducing failures from transient network issues.
    • Applies to multiple runtime variants, ensuring more consistent and successful builds without impacting runtime behavior or performance.

Signed-off-by: Anant Sharma <anants@nvidia.com>
coderabbitai bot commented Sep 10, 2025

Walkthrough

Added curl retry flags to Prometheus download steps in two Dockerfiles; other extraction and installation steps unchanged.

Changes

Cohort: Docker Prometheus download retry
File(s): container/Dockerfile.sglang, container/Dockerfile.vllm
Summary of Changes: Updated curl commands for the Prometheus tarball to include --retry 5 --retry-delay 5 (and -fsSL in sglang). No other steps modified.
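
To make the retry semantics concrete: `--retry 5` lets curl retry up to five times after the first attempt on errors it treats as transient, and `--retry-delay 5` makes it wait a fixed five seconds between tries instead of using its default backoff, so retries add at most about 25 seconds of waiting. The sketch below mimics that loop in plain shell. The `retry` and `flaky` functions are illustrative names only; unlike curl, this wrapper retries on any non-zero exit status.

```shell
# retry RETRIES DELAY COMMAND...
# Run COMMAND; on failure, retry up to RETRIES more times, sleeping DELAY
# seconds between tries. Returns the exit status of the last try.
retry() {
    retries=$1
    delay=$2
    shift 2
    tries=0
    while :; do
        "$@" && return 0
        status=$?
        if [ "$tries" -ge "$retries" ]; then
            return "$status"
        fi
        tries=$((tries + 1))
        sleep "$delay"
    done
}

# Stand-in for a flaky download: fails on the first two calls, then succeeds.
counter=$(mktemp)
echo 0 >"$counter"
flaky() {
    count=$(( $(cat "$counter") + 1 ))
    echo "$count" >"$counter"
    [ "$count" -ge 3 ]
}

retry 5 0 flaky && echo "succeeded after $(cat "$counter") tries"
rm -f "$counter"
```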

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Poem

I tap my paws and try, retry—
Five hops, five beats beneath the sky.
If clouds delay the tarball’s flight,
I wait, then nibble on the byte.
With steady whiskers, builds comply—
A bun that won’t give up the try. 🐇🛠️

Pre-merge checks (2 passed, 1 warning)

❌ Failed checks (1 warning)

Check name: Description Check
Status: ⚠️ Warning
Explanation: The current pull request description only provides a brief overview and omits the required template sections for Details, Where should the reviewer start, and Related Issues, making it difficult to understand the specific changes and their context. It does not follow the repository’s pull request description template, which expects a clear summary of changes, guidance on where to review, and links to related issues. Because these sections are missing, the description is incomplete and does not meet the project’s standards.
Resolution: Please update the PR description to include the missing sections from the template: add a Details section that summarizes the retry logic changes in both Dockerfiles, a Where should the reviewer start section pointing to container/Dockerfile.sglang and container/Dockerfile.vllm, and a Related Issues section using an action keyword (e.g., “Closes #xxx”) to reference the relevant issue. Completing these sections will ensure the description matches the required format and provides reviewers with the necessary context.

✅ Passed checks (2 passed)

Check name: Title Check
Status: ✅ Passed
Explanation: The title “fix: add retry for curl downloads” succinctly captures the primary change of introducing retry logic to curl commands in the Dockerfiles, aligns with the conventional commit prefix “fix,” and remains concise without unnecessary detail, making it clear for any teammate scanning the history.

Check name: Docstring Coverage
Status: ✅ Passed
Explanation: No functions found in the changes. Docstring coverage check skipped.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (3)
container/Dockerfile.vllm (2)

230-234: Harden download: timeouts, broader retry, no pipe-to-tar, optional checksum.

  • Add timeouts to avoid long hangs.
  • Broaden retries to cover connection resets: use --retry-all-errors when available (curl ≥ 7.71), otherwise fall back to --retry-connrefused.
  • Avoid piping curl into tar: without pipefail, the pipeline reports only tar's exit status. Download to a file, verify, then extract.
  • Optionally verify SHA256 from upstream to reduce supply-chain risk.
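
The concern about piping can be illustrated offline. In the sketch below, `fake_download` is a hypothetical stand-in for a curl that writes some bytes and then fails with exit code 55. It assumes default shell options, that is, pipefail not enabled. In the logged failure tar also failed on the truncated stream, so the step did fail; the edge case is a consumer that exits 0 despite an incomplete download.

```shell
# Stand-in for a curl that emits some bytes and then fails with exit code 55.
fake_download() {
    printf 'partial data'
    return 55
}

# Piped: the pipeline's status is that of its last command (cat), so the
# producer's failure is not reported unless pipefail is enabled.
fake_download | cat >/dev/null && piped_status=0 || piped_status=$?

# Download to a file first: the producer's own exit status is visible.
tmpfile=$(mktemp)
fake_download >"$tmpfile" && direct_status=0 || direct_status=$?
rm -f "$tmpfile"

echo "piped=$piped_status direct=$direct_status"
```

With default shell options this prints `piped=0 direct=55`, which is why the download-then-extract form is easier to reason about.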

Apply within this hunk:

-    curl -fsSL --retry 5 --retry-delay 5 "https://github.com/prometheus/prometheus/releases/download/v${PROM_VERSION}/prometheus-${PROM_VERSION}.${PLATFORM}.tar.gz" \
-    | tar -xz -C /tmp && \
+    curl -fsSL \
+      --retry ${CURL_RETRY:-5} \
+      --retry-delay ${CURL_RETRY_DELAY:-5} \
+      --retry-connrefused \
+      --connect-timeout ${CURL_CONNECT_TIMEOUT:-30} \
+      --max-time ${CURL_MAX_TIME:-600} \
+      -o "/tmp/prometheus-${PROM_VERSION}.${PLATFORM}.tar.gz" \
+      "https://github.com/prometheus/prometheus/releases/download/v${PROM_VERSION}/prometheus-${PROM_VERSION}.${PLATFORM}.tar.gz" && \
+    tar -xzf "/tmp/prometheus-${PROM_VERSION}.${PLATFORM}.tar.gz" -C /tmp && \
     mv "/tmp/prometheus-${PROM_VERSION}.${PLATFORM}/prometheus" /usr/local/bin/ && \
     chmod +x /usr/local/bin/prometheus && \
-    rm -rf "/tmp/prometheus-${PROM_VERSION}.${PLATFORM}"
+    rm -rf "/tmp/prometheus-${PROM_VERSION}.${PLATFORM}" "/tmp/prometheus-${PROM_VERSION}.${PLATFORM}.tar.gz"

Add build args once (outside this hunk) near the PROM_VERSION arg:

ARG CURL_RETRY=5
ARG CURL_RETRY_DELAY=5
ARG CURL_CONNECT_TIMEOUT=30
ARG CURL_MAX_TIME=600

Optional checksum verification (insert between curl and tar):

curl -fsSL "https://github.com/prometheus/prometheus/releases/download/v${PROM_VERSION}/sha256sums.txt" -o /tmp/prom.sha256 && \
(cd /tmp && grep -F "prometheus-${PROM_VERSION}.${PLATFORM}.tar.gz" prom.sha256 | sha256sum -c -) && \
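
The checksum step can be tried without network access. The following is a minimal sketch, assuming GNU `sha256sum` is installed; `verify_sha256` is a hypothetical helper, and the "tarball" and sums file are generated locally rather than downloaded. One detail it makes explicit is that `sha256sum -c` resolves file names relative to the current directory, so the check has to run from the directory that holds the download.

```shell
# verify_sha256 DIR SUMS_FILE NAME
# Check NAME (a file inside DIR) against its entry in SUMS_FILE.
# SUMS_FILE must be an absolute path because the check runs from DIR:
# sha256sum -c resolves the listed file names against the current directory.
verify_sha256() {
    dir=$1
    sums=$2
    name=$3
    ( cd "$dir" && grep -F -- "$name" "$sums" | sha256sum -c - >/dev/null 2>&1 )
}

# Offline demo with a stand-in "tarball" and a locally generated sums file.
workdir=$(mktemp -d)
printf 'pretend tarball\n' >"$workdir/demo.tar.gz"
( cd "$workdir" && sha256sum demo.tar.gz >sums.txt )

verify_sha256 "$workdir" "$workdir/sums.txt" demo.tar.gz && echo "checksum ok"

# Any change to the file after the sums were recorded is detected.
printf 'tampered\n' >>"$workdir/demo.tar.gz"
verify_sha256 "$workdir" "$workdir/sums.txt" demo.tar.gz || echo "checksum mismatch"

rm -rf "$workdir"
```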

230-230: Consider DRYing curl retry args across Dockerfiles.

Define the CURL_* ARGs once and reuse here and in Dockerfile.sglang to keep behavior consistent.

Happy to push a follow-up commit wiring these ARGs into both Dockerfiles.
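
One way to picture the shared configuration is a small helper that assembles the flags from `CURL_*` variables, falling back to the defaults proposed above. This is only a sketch; `curl_download_flags` is a hypothetical name and is not part of either Dockerfile.

```shell
# Hypothetical helper: print curl's retry and timeout flags, taking each
# value from a CURL_* variable and falling back to the proposed defaults.
curl_download_flags() {
    echo "--retry ${CURL_RETRY:-5}" \
         "--retry-delay ${CURL_RETRY_DELAY:-5}" \
         "--retry-connrefused" \
         "--connect-timeout ${CURL_CONNECT_TIMEOUT:-30}" \
         "--max-time ${CURL_MAX_TIME:-600}"
}

# Defaults, then an override scoped to a subshell.
curl_download_flags
( CURL_RETRY=3; curl_download_flags )
```

A caller would expand the result unquoted so the shell splits it into separate arguments, for example `curl -fsSL $(curl_download_flags) -o FILE URL`.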

container/Dockerfile.sglang (1)

167-171: Mirror hardening from vLLM Dockerfile: timeouts, broader retry, no pipe-to-tar, optional checksum.

Keep both images aligned for reliability and security.

Apply within this hunk:

-    curl -fsSL --retry 5 --retry-delay 5 "https://github.com/prometheus/prometheus/releases/download/v${PROM_VERSION}/prometheus-${PROM_VERSION}.${PLATFORM}.tar.gz" \
-    | tar -xz -C /tmp && \
+    curl -fsSL \
+      --retry ${CURL_RETRY:-5} \
+      --retry-delay ${CURL_RETRY_DELAY:-5} \
+      --retry-connrefused \
+      --connect-timeout ${CURL_CONNECT_TIMEOUT:-30} \
+      --max-time ${CURL_MAX_TIME:-600} \
+      -o "/tmp/prometheus-${PROM_VERSION}.${PLATFORM}.tar.gz" \
+      "https://github.com/prometheus/prometheus/releases/download/v${PROM_VERSION}/prometheus-${PROM_VERSION}.${PLATFORM}.tar.gz" && \
+    tar -xzf "/tmp/prometheus-${PROM_VERSION}.${PLATFORM}.tar.gz" -C /tmp && \
     mv "/tmp/prometheus-${PROM_VERSION}.${PLATFORM}/prometheus" /usr/local/bin/ && \
     chmod +x /usr/local/bin/prometheus && \
-    rm -rf "/tmp/prometheus-${PROM_VERSION}.${PLATFORM}"
+    rm -rf "/tmp/prometheus-${PROM_VERSION}.${PLATFORM}" "/tmp/prometheus-${PROM_VERSION}.${PLATFORM}.tar.gz"

Add (outside this hunk), near PROM_VERSION:

ARG CURL_RETRY=5
ARG CURL_RETRY_DELAY=5
ARG CURL_CONNECT_TIMEOUT=30
ARG CURL_MAX_TIME=600

Optional checksum verification (same as vLLM comment).

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 905c920 and 34a91cc.

📒 Files selected for processing (2)
  • container/Dockerfile.sglang (1 hunks)
  • container/Dockerfile.vllm (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (4)
container/Dockerfile.vllm (2)

230-230: LGTM: Added curl retries to Prometheus download.

Good step toward reducing flaky builds.


230-230: Verify and update retry flag based on curl version
Please confirm that the runtime image’s curl version (in container/Dockerfile.vllm:230) is ≥ 7.71; if so, replace --retry-connrefused with --retry-all-errors.
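
The version check could be scripted along these lines. This is a sketch under the assumption stated in the comment, namely that `--retry-all-errors` needs curl 7.71 or newer; `retry_flag_for_curl` is a hypothetical helper that takes the version string as an argument so it can be exercised without curl installed.

```shell
# Hypothetical helper: print the broadest retry flag that a given curl
# version supports, using the 7.71 threshold from the review comment.
retry_flag_for_curl() {
    version=$1
    major=${version%%.*}      # text before the first dot
    rest=${version#*.}        # text after the first dot
    minor=${rest%%.*}         # text up to the next dot
    if [ "$major" -gt 7 ] || { [ "$major" -eq 7 ] && [ "$minor" -ge 71 ]; }; then
        echo "--retry-all-errors"
    else
        echo "--retry-connrefused"
    fi
}

retry_flag_for_curl 7.68.0
retry_flag_for_curl 8.5.0
```

Inside the image, the installed version can be read from the first line of `curl --version`, for example with `curl --version | awk 'NR==1 {print $2}'`.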

container/Dockerfile.sglang (2)

167-167: LGTM: Retry flags added to Prometheus curl.

Matches vllm image; should reduce transient failures.


167-167: Add --retry-all-errors when supported.

Same note as vLLM image; fall back to --retry-connrefused if older curl.

Use the same script adjusted for container/Dockerfile.sglang.

@nv-anants nv-anants merged commit 6a089b1 into main Sep 10, 2025
13 of 15 checks passed
@nv-anants nv-anants deleted the anants/curl-retry branch September 10, 2025 15:27
ayushag-nv pushed a commit that referenced this pull request Sep 15, 2025
Signed-off-by: Anant Sharma <anants@nvidia.com>
Signed-off-by: ayushag <ayushag@nvidia.com>
zhongdaor-nv pushed a commit that referenced this pull request Sep 15, 2025
Signed-off-by: Anant Sharma <anants@nvidia.com>
Signed-off-by: zhongdaor <zhongdaor@nvidia.com>