Downloading images concurrently breaks in many ways #2722

@nirs

Description

We use limactl to create multiple clusters in parallel, all using the same k8s based template. If a developer has already used lima, the ubuntu server cloud image is already downloaded. But if this is the first time the image is needed, or the ~/Library/Caches/lima directory was deleted, every limactl process downloads the image at the same time.

Since all limactl processes download the same image to the same temporary file,
~/Library/Caches/lima/download/by-url-sha256/xxxyyy/data.tmp, and then try to rename that temporary file to the same target file, the download can fail in various ways:

  • Directory not empty, because another process already started the download
    2024-10-10 22:56:59,617 ERROR   [hub] failed to download "https://cloud-images.ubuntu.com/releases/24.04/release/ubuntu-24.04-server-cloudimg-arm64.img": unlinkat /Users/nir/Library/Caches/lima/download/by-url-sha256/002fbe468673695a2206b26723b1a077a71629001a5b94efd8ea1580e1c3dd06: directory not empty
    
  • Temporary file renamed by another process
    2024-10-10 22:58:28,695 ERROR   [dr1] failed to download "https://cloud-images.ubuntu.com/releases/24.04/release/ubuntu-24.04-server-cloudimg-arm64.img": rename /Users/nir/Library/Caches/lima/download/by-url-sha256/002fbe468673695a2206b26723b1a077a71629001a5b94efd8ea1580e1c3dd06/data.tmp /Users/nir/Library/Caches/lima/download/by-url-sha256/002fbe468673695a2206b26723b1a077a71629001a5b94efd8ea1580e1c3dd06/data: no such file or directory
    
  • Corrupted download (not sure it is possible with the current code)

An easy way to avoid the conflicts is to download to a per-process temporary file:

data.tmp.{pid}

When the download finishes, renaming to data is safe even with multiple processes, since POSIX rename is atomic. If we have N processes renaming N identical downloads:

data.tmp.123 -> data
data.tmp.345 -> data
data.tmp.678 -> data

All renames will succeed in an unknown order, but since the content is the same, the order does not matter.

A better but more complex way is to use a lockfile, so the first process does the download and the other processes wait on the lockfile. When the first process finishes, it unlocks the lockfile, and the next process can grab it, find the downloaded file, and continue.

Since this may happen at most once for every image, I would go with the simpler solution.
