
Publish the initrd and rootfs images separately for PXE booting #390

Closed
SerialVelocity opened this issue Feb 14, 2020 · 36 comments

@SerialVelocity

SerialVelocity commented Feb 14, 2020

When PXE booting hosts, it's useful to be able to push the initrd over the network without the rootfs image included. We do this for two reasons:

  • Downloading large images over PXE can be slow: it takes an extra 5 minutes compared to fetching the image with curl from inside the initrd.
  • You can write the rootfs to temporary storage to avoid the 550MB RAM penalty.

Is it possible to publish these images to https://builds.coreos.fedoraproject.org? The initrd used to be published as the installer-initramfs, and the rootfs had to be extracted from the live-initramfs.

@SerialVelocity SerialVelocity changed the title from "Publish the initrd and squashfs images separately for PXE booting" to "Publish the initrd and rootfs images separately for PXE booting" Feb 14, 2020
@dustymabe dustymabe added the "meeting topics for meetings" label Feb 14, 2020
@lucab
Contributor

lucab commented Feb 18, 2020

Thanks for the report. I'm having a bit of a hard time grasping your setup, so I'd be happy if you could expand a couple of details:

  • 5 minutes to TFTP the live initramfs (~600 MiB) from your PXE server over a local link sounds quite bad. Can you please share your current PXE config? Is there maybe something wrong at the network level which is badly throttling local TFTP?
  • when you are PXE-booting from scratch a new machine, how does the "write the rootfs to temporary storage" flow work?
  • assuming we publish an initrd without the rootfs, how do you stitch together the whole live system? That is, how does the curl step fit in the flow and how is that plus rootfs pivoting automated?

@jamescassell
Collaborator

5 minutes to TFTP the live initramfs (~600 MiB) from your PXE server over a local link sounds quite bad. Can you please share your current PXE config? Is there maybe something wrong at the network level which is badly throttling local TFTP?

I can confirm very slow TFTP transfers, at least with both PXE and destination servers on the same ESXi environment.

@cgwalters
Member

cgwalters commented Feb 19, 2020

This may be one reason Anaconda uses a separate "stage2" that's the rootfs rather than embedding it in the initramfs.

This also intersects heavily with #352: the fulliso build (which exists but is not shipped by FCOS) added code to find the rootfs, which we could extend to fetch over HTTP from a kernel-commandline URL in the initramfs, the same way as Anaconda's inst.stage2 argument.

@jlebon
Member

jlebon commented Feb 19, 2020

One strawman here:

  • For the live ISO: we keep baking in the rootfs
  • For the live PXE: we publish the rootfs and initrd as separate imgs. For users that want pure PXE/TFTP, they can just provide both images via initrd=file1,file2. Other users can specify a karg to the HTTP rootfs img to overlay.
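
To make the pure PXE/TFTP variant of this strawman concrete, a minimal PXELINUX config might look like the following. This is a sketch only: the filenames are illustrative, since no split artifacts were published at the time this was written.

```
DEFAULT fcos
LABEL fcos
    KERNEL fcos-live-kernel-x86_64
    # Both images are passed as initrds; the bootloader loads them
    # back to back and the kernel treats them as one initramfs.
    APPEND initrd=fcos-live-initramfs.img,fcos-live-rootfs.img ignition.firstboot ignition.platform.id=metal
```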

@cgwalters
Member

For the live PXE: we publish the rootfs and initrd as separate imgs. For users that want pure PXE/TFTP, they can just provide both images via initrd=file1,file2. Other users can specify a karg to the HTTP rootfs img to overlay.

Hum...you're proposing that the rootfs is a cpio that contains a single squashfs file? And in the HTTP case we just unwrap the cpio? That's a really clever hack if it works!

@jlebon
Member

jlebon commented Feb 19, 2020

Hum...you're proposing that the rootfs is a cpio that contains a single squashfs file? And in the HTTP case we just unwrap the cpio? That's a really clever hack if it works!

Yup, exactly! Implementation-wise, cosa would just spit out the two CPIOs separately anyway, and in the live ISO case we'd just append them in isolinux.cfg.
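
The appended-cpio.gz trick works because gzip streams concatenate cleanly, and the kernel unpacks each cpio archive in turn. A small Python sketch of the gzip half of that property (the payloads are stand-ins, not real cpio data):

```python
import gzip

# Stand-ins for the initramfs and rootfs cpio.gz archives.
initramfs_img = gzip.compress(b"initramfs-cpio-payload")
rootfs_img = gzip.compress(b"rootfs-cpio-payload")

# Appending the two files yields one multi-member gzip stream --
# the same shape the kernel sees for a concatenated initrd.
combined = initramfs_img + rootfs_img

# gzip decompression walks every member, so both payloads come back.
assert gzip.decompress(combined) == b"initramfs-cpio-payloadrootfs-cpio-payload"
```

The same property is what lets bootloaders pass multiple initrd files: the kernel sees one byte stream of concatenated compressed archives.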

@dustymabe
Member

We discussed this in the meeting today:

11:49:31      dustymabe | #proposed we need to consider our options here for providing
                        | installer artifacts and *live* artifacts. There is a slight
                        | intersection here with #352. We'd like to take some time to
                        | evaluate #390 and #352 together and come up with some changes
                        | we'd like to make to improve the situation for users.

@dustymabe dustymabe added the "jira for syncing to jira" label and removed the "meeting topics for meetings" label Feb 19, 2020
@SerialVelocity
Author

Sorry for the delay @lucab, here's the answers to your questions:

5 minutes to TFTP the live initramfs (~600 MiB) from your PXE server over a local link sounds quite bad. Can you please share your current PXE config? Is there maybe something wrong at the network level which is badly throttling local TFTP?

I was actually downloading the initramfs via HTTPS from my boot server using iPXE, and iPXE seems to take a long time to download large files.

TFTP is similar (though I haven't used it for CoreOS PXE booting), as a lot of machines cannot use a block size other than 512 bytes. I found that this incurred a large performance penalty. I'm going from memory here, though, as it was a couple of years ago when I last did this.

when you are PXE-booting from scratch a new machine, how does the "write the rootfs to temporary storage" flow work?
assuming we publish an initrd without the rootfs, how do you stitch together the whole live system? That is, how does the curl step fit in the flow and how is that plus rootfs pivoting automated?

I know this is unsupported by CoreOS, but I supply a secondary initramfs (similar to what @jlebon suggested) which sets up a systemd unit that curls the squashfs file, outputs it to a named partition, and symlinks /root.squashfs to the partition.

Even if I just write it directly to /root.squashfs (so it lives in RAM), it is still significantly faster than downloading via iPXE.
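
The flow described above could be sketched as a unit in the secondary initramfs along these lines. This is a hypothetical reconstruction, not the actual setup: the unit name, URL, and partition label are all illustrative.

```
# fetch-rootfs.service (hypothetical), shipped in a secondary initramfs
[Unit]
Description=Fetch root.squashfs and stage it on local storage
Requires=network-online.target
After=network-online.target

[Service]
Type=oneshot
RemainAfterExit=yes
# URL and partition label are illustrative
ExecStart=/usr/bin/curl -sSfL -o /dev/disk/by-partlabel/rootfs-cache https://boot.example.com/root.squashfs
ExecStart=/usr/bin/ln -sf /dev/disk/by-partlabel/rootfs-cache /root.squashfs

[Install]
WantedBy=initrd.target
```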

@cgwalters

Hum...you're proposing that the rootfs is a cpio that contains a single squashfs file? And in the HTTP case we just unwrap the cpio? That's a really clever hack if it works!

I actually did this when testing that I was splitting the cpios correctly, as they are just appended to each other. One thing to know is that there is already a cpio which only has a single squashfs file in it, but it is currently appended to all the other cpios.

Here's the code that I use to extract the root.squashfs:
https://github.com/SerialVelocity/coreos-pxe-images/blob/4c1e8461c7225cbf7a0cc39cb0e4ca88ddc22fd9/.circleci/config.yml#L82-L83

@jlebon
Member

jlebon commented Mar 19, 2020

This was discussed recently in the context of #352. We are considering supporting a stage 2 mechanism. This could also enable easier factory reset. At the artifact level, we're hesitant to offer the root squashfs separately.

Some ideas on how to offer this functionality (not necessarily mutually exclusive):

  • Have a tool (possibly part of coreos-installer) that splits the root initrd from the fat initrd we publish.
  • @dustymabe had the idea of instead of publishing a new artifact, we publish an offset at which to split the fat initrd to get the two initrds. This would allow for example HTTP clients to download them in two files using HTTP range requests.

Thinking more on this, the offset idea is neat, though it's still a kind of "artifact" that users will come to rely on. It would also be a bit awkward to present in the stream metadata. We might be OK with that, though. Having just a tool preserves the possibility of changing implementation details later on, so we could start with that at least.
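
The offset idea amounts to simple byte slicing; an HTTP client could fetch the same two halves with range requests instead of slicing locally. A minimal sketch, where the blobs and offset are illustrative stand-ins rather than real artifact data:

```python
# Stand-in for the published "fat" initrd: the initramfs cpio.gz
# followed by the rootfs cpio.gz. The split offset would be published
# as metadata alongside the artifact.
initramfs_blob = b"initramfs-archive-bytes"   # illustrative contents
rootfs_blob = b"rootfs-archive-bytes"         # illustrative contents
fat_initrd = initramfs_blob + rootfs_blob
offset = len(initramfs_blob)  # the hypothetical published split point

# An HTTP client could issue two range requests to the same effect:
#   Range: bytes=0-{offset-1}   -> initramfs half
#   Range: bytes={offset}-      -> rootfs half
initramfs_part = fat_initrd[:offset]
rootfs_part = fat_initrd[offset:]

assert initramfs_part == initramfs_blob
assert rootfs_part == rootfs_blob
```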

@SerialVelocity
Author

At the artifact level, we're hesitant to offer the root squashfs separately.

If a "root squashfs" artefact is produced, it makes it much easier to use standard caching middleware for PXE booting. If it isn't, a custom application/processor has to be built to split up the fat initrd.

@dustymabe
Member

You are correct, it does make that flow harder. Right now we feel there is value in not providing many different options that users have to choose from, so we're thinking of taking a middle-ground approach: still provide the "fat" initramfs as the only PXE artifact, but provide a tool (or publish data) that lets people who really need it trim out the root squashfs and deliver it a different way.

If we get more and more community members reporting that this is a problem and the current setup is not ideal for them then we'll definitely reconsider.

@bgilbert
Contributor

iPXE on Packet seems to take ~5 minutes to download the initrd from our builds bucket over HTTPS.

@dustymabe
Member

iPXE on Packet seems to take ~5 minutes to download the initrd from our builds bucket over HTTPS.

Sounds like an argument for delivering a thin initramfs?

@bgilbert bgilbert added the meeting topics for meetings label Apr 1, 2020
@dustymabe
Member

We discussed this in the meeting today.

13:18:01     dustymabe | #info we'll do more testing for #390 to see if it's the remote server causing the slowdown or actually a slowdown in iPXE

We'd like to know whether what @bgilbert was seeing is because the remote server is actually taking a long time to serve the artifact, or genuinely because of the performance of iPXE. If it turns out it is indeed iPXE, we'll have to re-evaluate our current stance, because the Packet/iPXE flow is not something we consider a corner case.

@dustymabe dustymabe removed the meeting topics for meetings label Apr 1, 2020
@SerialVelocity
Author

I'm not sure about packet, but in my original issue message I actually meant iPXE. It took 5 minutes to download the initrd through iPXE and less than 1 minute if I ran a curl. I don't have the exact numbers to hand anymore though.

@fclerg

fclerg commented Apr 4, 2020

I was having the same issue of iPXE taking at least 5 minutes to download the initrd from a local repo behind Nginx. The download progress went instantly from 0 to 99% and remained there until completion.

I noticed that it got much faster (<30 sec) after removing some old sub_filter rules that had been added years ago and left over since.

Using curl to download the file took around 30 seconds with or without those rules.

Looking further into it with traffic captures, I noticed that the Nginx HTTP response includes these two headers when not using sub_filter:

Content-Length: 10838728
Accept-Ranges: bytes

With sub_filter in the config, the Nginx response includes Transfer-Encoding: chunked but no Content-Length.

It seems to be a known issue with the iPXE downloader when no Content-Length header is included in the server response, as reported in this thread.

With the server including this header, downloading the initrd should be even faster than with TFTP.
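
For reference, the problematic shape of such a config is roughly the following (the paths and filter strings are illustrative). Any response that passes through sub_filter loses its Content-Length and is sent with Transfer-Encoding: chunked:

```
location /images/ {
    root /srv/boot;
    # Applying sub_filter to all content types means even binary
    # downloads lose Content-Length, which stalls iPXE.
    sub_filter_types *;
    sub_filter 'http://old-host' 'http://new-host';
    sub_filter_once off;
}
```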

@SerialVelocity
Author

Just to confirm, that was HTTP, not HTTPS, right?

In my case, the content length header is present but the artefact is being served over HTTPS.

@fclerg

fclerg commented Apr 5, 2020

HTTP, as my iPXE version doesn't support HTTPS.

@jlebon
Member

jlebon commented Apr 7, 2020

@SerialVelocity Are you able to try fetching over HTTP as a datapoint?

@SerialVelocity
Author

Unfortunately I am not able to try fetching over HTTP currently.

@jlebon
Member

jlebon commented May 11, 2020

Another thing related to this that we discussed in the community meeting was having the live ISO ship the stage2 CPIO (once we have it), so that the ISO contains everything needed to PXE boot (though I think @bgilbert was hesitant about that suggestion). This would also let it leverage the same stage2 mechanism as PXE, which is nice.

The downside, though, is that it would increase memory consumption (because we'd have to extract the CPIO into RAM), which is already an issue today (#407). It'd probably also make the live ISO boot slower because of the extraction step. So it might not be worth it in the end if we can't resolve those issues easily.

@bgilbert
Contributor

bgilbert commented Jul 17, 2020

Proposal

  • Leave the ISO image unchanged for now.
  • Ship three PXE artifacts: kernel, initramfs, and a new rootfs.
    • The initramfs image will be the actual initramfs, concatenated with a cpio.gz-wrapped set of hashes (see below).
    • The rootfs will be a single cpio.gz (not multiple concatenated archives) containing the root squashfs and the osmet data.
  • Add an optional karg that fetches the rootfs image separately over HTTP or HTTPS. Do not support TFTP (configure as a second initrd if you need that). To support custom TLS CAs and self-signed server certs, HTTPS fetches will bypass CA checks.
  • If users don't want to set up an HTTP server, they can specify the rootfs image as a second initrd.
  • The initramfs will include two sets of hashes: one of the rootfs image, and one of its extracted contents.
    • The former will be used for verifying a downloaded rootfs: download to tmpfs, hash, then extract. This makes it safe to use HTTP or bypass CA checks. It requires 2x RAM, but so does extraction of the current unified image. (Later, if we want to avoid the 2x overhead, we can ship hashes of chunks and a tool to perform streaming verification.)
    • The latter will be used for verifying a rootfs image appended as a second initramfs. This is not a security check, since the appended initramfs can overwrite the hashes. It's intended to catch accidental version skew between the initramfs and rootfs.
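
The download-to-tmpfs flow in the proposal boils down to: stage the image, verify its hash, and only then extract. A minimal sketch under stated assumptions: gzip stands in for the real cpio.gz unpacking, and the expected hash would be baked into the initramfs at build time rather than computed locally as here.

```python
import gzip
import hashlib
import tempfile
from pathlib import Path

# Stand-in for the rootfs cpio.gz image and its published hash.
rootfs_image = gzip.compress(b"root-squashfs-and-osmet-data")
expected_sha256 = hashlib.sha256(rootfs_image).hexdigest()

def fetch_verify_extract(image: bytes, expected: str) -> bytes:
    # 1. "Download" the image to temporary storage (tmpfs in the proposal).
    with tempfile.TemporaryDirectory() as tmp:
        staged = Path(tmp) / "rootfs.img"
        staged.write_bytes(image)
        data = staged.read_bytes()
    # 2. Verify the whole image before trusting it; this is what makes
    #    plain HTTP (or HTTPS with CA checks bypassed) safe to use.
    if hashlib.sha256(data).hexdigest() != expected:
        raise ValueError("rootfs image hash mismatch")
    # 3. Only then extract the contents.
    return gzip.decompress(data)

assert fetch_verify_extract(rootfs_image, expected_sha256) == b"root-squashfs-and-osmet-data"
```

Note the 2x RAM cost called out above: the image and its extracted contents coexist briefly, which is what the later streaming-verification idea avoids.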

Implementation plan

  1. Pipeline changes
    1. Change cosa buildextend-live to generate both initramfs and rootfs artifacts. If --legacy-pxe-artifacts is specified, continue putting the squashfs in the initramfs image, and create a rootfs image which only contains a flag file. Otherwise, generate split artifacts as described, including hashes. Enable the legacy option in the FCOS pipeline, but not in RHCOS. Add code to inject a warning MOTD if FCOS is booted without the rootfs image. Add the rootfs artifact to stream metadata and the FCOS download page. Update PXE documentation in FCOS docs and coreos-installer.
    2. Post to coreos-status, warning that PXE systems will need to add the rootfs as a second initrd. Provide a migration period, perhaps six weeks. Maybe switch next to the new model early.
    3. Drop --legacy-pxe-artifacts from the pipeline and from cosa. Drop the warning MOTD code.
  2. In parallel, add initramfs code to fetch the rootfs image over HTTP(S), check its hash, and unpack it. Update PXE documentation in FCOS docs.

Step 2 can be deferred if needed.

Test cases

  • Loading rootfs via {PXELINUX, iPXE, GRUB/UEFI}
  • Loading rootfs via {HTTP, HTTPS, HTTPS with self-signed cert}
  • Loading rootfs with bad checksum via {PXE, HTTP}

@cgwalters
Member

Awesome summary! This sounds great to me!

@jlebon
Member

jlebon commented Jul 17, 2020

Provide a migration period, perhaps one month.

Any reason for this short window? One month is just 2 releases. WDYT about 4 or 5 releases?

@dustymabe
Member

Any reason for this short window? One month is just 2 releases. WDYT about 4 or 5 releases?

2 might be too short, but maybe not all the way to 5. I think if people haven't changed things by 3 or 4 they probably missed or forgot about the pending change. It would be nice to get this in FCOS and battle tested a bit sooner than later. I also wonder if we could drop --legacy-pxe-artifacts from next soon and not have to wait.

@dustymabe
Member

* Loading `rootfs` via {PXELINUX, iPXE}

It would be nice if we started adding GRUB/UEFI testing to the mix:

  • Loading rootfs via {PXELINUX, iPXE, GRUB(UEFI)}

@dustymabe
Member

This looks really great @bgilbert!

@miabbott
Member

Love the detailed write-up! Thanks to all who contributed to the discussion.

Could we add something to the plan about documenting these changes and how users would use the multiple artifacts?

@bgilbert
Contributor

bgilbert commented Jul 21, 2020

Updates

  • For verifying a downloaded rootfs, I'm pursuing streaming verification after all; it wasn't much implementation effort. coreos-installer#305 ("rdcore: add stream-hash subcommand") is the verifier.
  • For verifying an appended rootfs, we don't actually need to check hashes at all; we can generate a random cookie at build time and embed it in both initramfs and rootfs.
  • s/--legacy-pxe-artifacts/--legacy-pxe/

Proposed deprecation schedule

  • August 11: ship stub rootfs in FCOS next, testing, and stable. Announce deprecation plan to coreos-status.
  • August 25: ship separate rootfs in next.
  • September 22: ship separate rootfs in testing.
  • October 6: ship separate rootfs in stable; drop legacy support code.

That gives 2 weeks' migration time for next and 6 for testing/stable. I agree with #390 (comment) that a longer deprecation period won't actually help people migrate.

bgilbert added a commit to coreos/fedora-coreos-config that referenced this issue Jul 27, 2020
- If the separate rootfs image was appended as a second initrd, make sure
  the initramfs and rootfs are from the same build,
- else if we were asked to fetch the rootfs over HTTP(S), do so,
- else if we're shipping the legacy initramfs image during the
  deprecation window, add a MOTD,
- else fail.

If we see the karg to fetch the rootfs, automatically enable network.

See coreos/fedora-coreos-tracker#390.
@bgilbert
Contributor

bgilbert commented Oct 5, 2020

This week's stable release will switch to the separate rootfs image, completing the migration. I'll close this out.

@bgilbert bgilbert closed this as completed Oct 5, 2020
@dustymabe
Member

The fix for this went into stable stream release 32.20200923.3.0. All future releases will have this new split rootfs/initramfs.

Thanks @bgilbert for the hard work on this one!

dghubble added a commit to poseidon/typhoon that referenced this issue Nov 25, 2020
* Fedora CoreOS stable (after Oct 6) ships separate initramfs
and rootfs images, used as initrd's
* Update profiles to match the Matchbox examples, which have
already switched to the new profile and to remove the unused
kernel args
* Requires Fedora CoreOS version which ships rootfs images
(e.g. stable 32.20200923.3.0 or later)

Rel:

* coreos/fedora-coreos-tracker#390 (comment)
* poseidon/matchbox@da0df01#diff-4541f7b7c174f6ae6270135942c1c65ed9e09ebe81239709f5a9fb34e858ddcf

Supersedes #888
dghubble added a commit to poseidon/typhoon that referenced this issue Nov 25, 2020
elemental-lf added a commit to elemental-lf/typhoon that referenced this issue Mar 8, 2021
The way that upstream has chosen doesn't work with the HP servers, even though it works fine on the master desktop nodes and the development cluster.

See https://docs.fedoraproject.org/en-US/fedora-coreos/live-booting-ipxe/#_pxe_images
and coreos/fedora-coreos-tracker#390.
Snaipe pushed a commit to aristanetworks/monsoon that referenced this issue Apr 13, 2023