-
Notifications
You must be signed in to change notification settings - Fork 79
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor setPXEFilename to be less complicated #149
Conversation
8b167e6
to
38c8d46
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved pending successful CI run
Codecov Report
@@ Coverage Diff @@
## main #149 +/- ##
==========================================
- Coverage 38.52% 37.28% -1.25%
==========================================
Files 41 41
Lines 2749 2744 -5
==========================================
- Hits 1059 1023 -36
- Misses 1603 1635 +32
+ Partials 87 86 -1
Continue to review full report at Codecov.
|
5637d52
46b10e1
to
0b0614c
Compare
if j.instance != nil && j.instance.State == "active" { | ||
// We set a filename because if a machine is actually trying to PXE and nothing is sent it may hang for | ||
// a while waiting for any possible ProxyDHCP packets and it would delay booting from disks. | ||
// This short cuts all that when we know we want to be booting from disk. | ||
return "/pxe-is-not-allowed" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this logic be in isPXEAllowed()
? if we're defining when a machine should pxe or not might be good to have it all in one place. that way when we remove it it's all together! 😁
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On an unrelated note, I've wondered if rather than returning this /pxe-is-not-allowed we should instead actually return a proper bootloader, that simply boots from HD. There is an example of this in the netboot.xyz. I wonder if that would be clearer to the end user.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
both good points. I'll fix @jacobweinstock's in this PR and open up an issue for your @splaspood
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh wait no @jacobweinstock no this doesn't belong in isPXEAllowed
as that doesn't return a filename and we want to differentiate between "boot to disk by returning a bogus name" vs "really should not be pxe'ing and we're better off causing the boot process to hang, so return nothing"
This comment was marked as duplicate.
This comment was marked as duplicate.
Sorry, something went wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry, one last thing. You said
"boot to disk by returning a bogus name" vs "really should not be pxe'ing and we're better off causing the boot process to hang, so return nothing"
this business logic feels very opaque. I really would like to see us rethink and reorganize this codebase to be clearer and if possible less opinionated about this type of logic. Anyway, just something to think about.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry, one last thing. You said
No worries, keep em coming!
"boot to disk by returning a bogus name" vs "really should not be pxe'ing and we're better off causing the boot process to hang, so return nothing"
this business logic feels very opaque. I really would like to see us rethink and reorganize this codebase to be clearer and if possible less opinionated about this type of logic. Anyway, just something to think about.
Agree, I hope by removing cacher stuff we can then make progress on removing packet/EM only things (phone-home, password callback, ...) and then have a much smaller code base where we can see all this opinionated business logic that we can hash out if they make sense or not and clean up if they do.
f2b3f97
to
107b64a
Compare
@jacobweinstock / @splaspood rebased on top of latest master, no other changes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would adding some tests be a possibility, by chance?
if j.instance != nil && j.instance.State == "active" { | ||
// We set a filename because if a machine is actually trying to PXE and nothing is sent it may hang for | ||
// a while waiting for any possible ProxyDHCP packets and it would delay booting from disks. | ||
// This short cuts all that when we know we want to be booting from disk. | ||
return "/pxe-is-not-allowed" |
This comment was marked as duplicate.
This comment was marked as duplicate.
Sorry, something went wrong.
@jacobweinstock I can't seem to reply to your minimized comment, so here goes. Many good points and I almost did some of that in this PR but backed off from doing so to avoid making it even larger or wider in scope. I wanted to try and keep the core of the PR to cleaning up the function and then later can come back with some other refactorings. |
How do we want to deal with the code coverage CI failures? I'm reviewing some stale PRs and I don't think we want to let this one die on the vine, it's really close, though it needs a rebase too. |
It's been long enough that this PR should be rebased. I'd argue we should make the code coverage check happy, or we should get rid of it. |
9cc01af
to
14ee5a7
Compare
Signed-off-by: Manuel Mendez <mmendez@equinix.com>
No longer needed as nixpkgs doesn't set GOPATH/GOROOT. Signed-off-by: Manuel Mendez <mmendez@equinix.com>
Signed-off-by: Manuel Mendez <mmendez@equinix.com>
This corrects some use of joblog when j.Logger is already available and should be preferred. Also gets rid of an unnecessary log line. Signed-off-by: Manuel Mendez <mmendez@equinix.com>
PXE booting is a hardware level thing, should be checked from there. Signed-off-by: Manuel Mendez <mmendez@equinix.com>
More context in name is just for human context. Added AllowPXE explicitly because the next commit is going to change some of the logic and I don't want to change the code and the tests at the same time. All of these cases that return a valid filename either don't care about allowPXE or wanted it set anyway. Signed-off-by: Manuel Mendez <mmendez@equinix.com>
There's no real point in tftp'ing our iPXE binary, execing it, waiting for re-DHCP only to *then* check `isPXEAllowed`. Might as well do that earlier and save some bytes and time. Signed-off-by: Manuel Mendez <mmendez@equinix.com>
14ee5a7
to
5652164
Compare
dhcp/pxe.go
Outdated
@@ -94,6 +94,13 @@ func SetupPXE(ctx context.Context, rep, req *dhcp4.Packet) bool { | |||
dhcplog.With("mac", req.GetCHAddr(), "xid", req.GetXID()).Info("no client GUID provided") | |||
} | |||
|
|||
// Wanted by PXE clients to learn of PXE servers, not required but responses with PXEClient ClassID are preferred. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in Figure 2-2 PXE Client Timeouts
and Figure 2-3 PXE Client Response to DHCP Server Containing a Proxy DHCP Service
and a few other places in the pxespec seem to indicate that a DHCP offer must have PXEClient
in option 60. that would make this comment misleading.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added more info in the commit message which has an excerpt from the spec:
Section 2.2.2 of the Intel PXE Spec states:
The purpose of the timeouts is to ensure the PXE client gives precedence to servers supplying
“PXEClient” specific configuration tags. The PXE boot ROM must function as a normal DHCP boot
ROM in the absence of a PXE specific response. However, the PXE boot ROM must wait for
specified times to see if a PXE response is available before using a non-PXE configuration.
Note precedence in the spec, this lines up with our real world experience with all our nics in production. We do not setting the "PXEClient" class id when the first dhcp message comes in where we respond with filename:$our-iPXE-build.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I totally understand that the real-world experience shows this option might not be needed but I believe the pxe spec shows that the server must send this option and I think we should implement the pxe spec regardless. All I'm suggesting is removing "not required" in the comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The spec does not require it though. Section 2.2.2 Timeout calls out:
However, the PXE boot ROM must wait for specified times to see if a PXE response is available before using a non-PXE configuration
This is a SHOULD, not MUST.
I'll reword the comment though to drop a mention of required or not though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've updated the comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, DHCP server/client interactions are terribly undocumented. I did happen to find this. Talks a little about the logic for option 60 with PXEClient
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its also from this time where the spec is written but failing smart/helpful was seen as good. Which leads to the spec and implementations to differ a lot :/ .
@@ -68,16 +68,16 @@ func Setup(rep *dhcp4.Packet) { | |||
rep.SetIP(dhcp4.OptionLogServer, conf.PublicSyslogIPv4) // Have iPXE send syslog to me. | |||
} | |||
|
|||
var packetVersion = []byte{1, 0, 255} | |||
var ouriPXEVersion = []byte{1, 0, 255} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for the update. Would it be possible to add a direct pointer to where in Boots we are overriding this version? This is helpful but still a bit opaque.
Just getting rid of a Packetism. We override the iPXE version that iPXE uses for itself and sends to the dhcp server. This is how we distinguish an iPXE burned into a nic by a vendor vs the one we build and send via tftp. There are better options here, but not as convenient to apply. We can define the IPXE_VERSION via an env var when building ipxe but other options would require a patch to one or more locations in the iPXE code. This version number is no longer incrementing in upstream iPXE so is low-risk to modify. Signed-off-by: Manuel Mendez <mmendez@equinix.com>
This patch changes setPXEFilename so that it mostly only cares about AllowPXE and the machine type, but not exclusively just those two bits of info. We still care about an instance being active or not because we don't want to set any filename if the machine has an instance and is not active. We don't want to set a filename so that the nic hangs on boot since this is scenario indicates a machine that has no business even being on, much less trying to PXE boot. In contrast when AllowPXE == false and instance.state == "active" and the machine is attempting to PXE boot indicates that the BIOS/FW is setup with netboot as a boot option prior to booting from disk. In this case we send a non existent file so that we can short circuit the nic boot attempt and go on to the next boot option. Signed-off-by: Manuel Mendez <mmendez@equinix.com>
5652164
to
8740415
Compare
Section 2.2.2 of the Intel PXE Spec states: > The purpose of the timeouts is to ensure the PXE client gives precedence to servers supplying > “PXEClient” specific configuration tags. The PXE boot ROM must function as a normal DHCP boot > ROM in the absence of a PXE specific response. However, the PXE boot ROM must wait for > specified times to see if a PXE response is available before using a non-PXE configuration. Which sort of explains why PXE clients that are not *our* iPXE build will still take the `filename` offered up even though the ClassID was not set for PXEClient when job/dhcp.go:setPXEFilename:`if !isOuriPXE` is the branch taken. SetupPXE seems like the best place for this setting any way as we setup the PXE response packet without need for job information. Signed-off-by: Manuel Mendez <mmendez@equinix.com>
This way its easier to test since it is now a pure fucntion. I was getting a panic when trying to rework the not-our-ipxe case for an unknown machine because it would try to call j.Error(). This is just better code. Signed-off-by: Manuel Mendez <mmendez@equinix.com>
…ings This way is more extensible since we don't have to add more and more bool args for newer cpus or firmware interfaces. Signed-off-by: Manuel Mendez <mmendez@equinix.com>
8740415
to
831bd82
Compare
// TODO: make this actually check for iPXE and use ipxe' build system's ability to set name. | ||
// This way we could set to something like "Packet iPXE" and then just look for that in the identifier sent in dhcp. | ||
// This way we could set to something like "Boots iPXE" and then just look for that in the identifier sent in dhcp. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can https://github.com/tinkerbell/boots/blob/4615ecf55f49f1666333a84b6009a0411bf4e2da/job/ipxe.go#L73 be modified to Boots iPXE
too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes good catch
## Description This is a revert of the revert that occurred in #217 + fix that caused the original revert. There's also removal of what is effectively useless/dead code (isPXEClient/pxeClient var/handling). I also removed the leading `/` from `/nonexistent` because a `/` is already prefixed and `//nonexistent` has bothered me for years. ## Why is this needed #216 was a great contribution from @rgl, unfortunately we ran into issues on some of our Ampere eMAG systems (#225). Turns out that the eMAG boot firmware just straight up ignores filename values if the response packet is marked as `PXEClient`. #216 changed the logic subtly by setting the response packet's class as `HTTPClient` if http was supported by the booting machine, and otherwise setting `PXEClient`. Before #216 we would set `PXEClient` only for the dhcp responses that were going out to our tftp'd build of iPXE, the initial filename: response was `PXEClient`-less. Once I put back the only `PXEClient` for our iPXE everything worked out fine again (tested using sandbox), but then I realized that now pxeClient was likely useless because I suspected iPXE did not care/require it. I tested this too and found that iPXE is happy to tftp a file provided by a `PXEClient`-less DHCP response. So why keep code around that is effectively useless? Less code is better than any code so that caused me to write up d1878bd. Fixes #210 Fixes #225 This debugging also opens up the way to bring back #149 as that had to be reverted due to the same "always set `PXEClient`" issue that I was not equipped to debug easily/quickly then. ## How Has This Been Tested? Lots of testing on EM machines using github.com/tinkerbell/sandbox's terraform support (I plan to clean up and PR https://github.com/mmlb/sandbox/tree/fix-terraform-and-add-arm64-support at some point). ## How are existing users impacted? What migration steps/scripts do we need? Back to faster/better iPXE binary downloads by utilizing TCP/HTTP instead of UDP/TFTP.
Description
Pretty large clean up of job.setPXEFilename to get rid of (some) Packet/EM-isms and clean up the logic around sending a file name or not.
Why is this needed
I'm working on removing cacher and this test was getting in the way, it was too complicated to keep.
How Has This Been Tested?
Unit tests pass. Still need to do a few boot/provision tests.
How are existing users impacted? What migration steps/scripts do we need?
NA
Checklist:
I have: