Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor setPXEFilename to be less complicated #149

Merged
merged 12 commits into from
Oct 4, 2021

Conversation

mmlb
Copy link
Contributor

@mmlb mmlb commented Apr 23, 2021

Description

Pretty large clean up of job.setPXEFilename to get rid of (some) Packet/EM-isms and clean up the logic around sending a file name or not.

Why is this needed

I'm working on removing cacher and this test was getting in the way, it was too complicated to keep.

How Has This Been Tested?

Unit tests pass. Still need to do a few boot/provision tests.

How are existing users impacted? What migration steps/scripts do we need?

NA

Checklist:

I have:

  • updated the documentation and/or roadmap (if required)
  • added unit or e2e tests
  • provided instructions on how to upgrade

@mmlb mmlb force-pushed the refactor-setPXEFilename branch from 8b167e6 to 38c8d46 Compare April 23, 2021 22:11
Copy link
Contributor

@markjacksonfishing markjacksonfishing left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved pending successful CI run

@mmlb mmlb changed the title Refactore setPXEFilename to make less complicated Refactor setPXEFilename to make less complicated Apr 23, 2021
@mmlb mmlb changed the title Refactor setPXEFilename to make less complicated Refactor setPXEFilename to be less complicated Apr 23, 2021
job/hardware.go Outdated Show resolved Hide resolved
@ScottGarman ScottGarman self-requested a review April 23, 2021 22:16
ScottGarman
ScottGarman previously approved these changes Apr 23, 2021
@codecov
Copy link

codecov bot commented Apr 23, 2021

Codecov Report

Merging #149 (99ec04d) into main (36f12f8) will decrease coverage by 1.24%.
The diff coverage is 37.50%.

❗ Current head 99ec04d differs from pull request most recent head 831bd82. Consider uploading reports for the commit 831bd82 to get more accurate results
Impacted file tree graph

@@            Coverage Diff             @@
##             main     #149      +/-   ##
==========================================
- Coverage   38.52%   37.28%   -1.25%     
==========================================
  Files          41       41              
  Lines        2749     2744       -5     
==========================================
- Hits         1059     1023      -36     
- Misses       1603     1635      +32     
+ Partials       87       86       -1     
Impacted Files Coverage Δ
job/hardware.go 0.00% <0.00%> (ø)
job/job.go 34.37% <ø> (-0.68%) ⬇️
job/dhcp.go 45.23% <42.85%> (-12.72%) ⬇️
job/helpers.go 9.91% <0.00%> (-15.71%) ⬇️
job/events.go 47.74% <0.00%> (-1.94%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 36f12f8...831bd82. Read the comment docs.

@mmlb mmlb added the do-not-merge Signal to Mergify to block merging of the PR. label Apr 23, 2021
@mmlb mmlb dismissed stale reviews from ScottGarman and markjacksonfishing via 5637d52 April 23, 2021 22:21
@mmlb mmlb force-pushed the refactor-setPXEFilename branch 2 times, most recently from 46b10e1 to 0b0614c Compare April 23, 2021 22:22
Comment on lines +116 to +142
if j.instance != nil && j.instance.State == "active" {
// We set a filename because if a machine is actually trying to PXE and nothing is sent it may hang for
// a while waiting for any possible ProxyDHCP packets and it would delay booting from disks.
// This short cuts all that when we know we want to be booting from disk.
return "/pxe-is-not-allowed"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this logic be in isPXEAllowed()? if we're defining when a machine should pxe or not might be good to have it all in one place. that way when we remove it it's all together! 😁

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On an unrelated note, I've wondered if rather than returning this /pxe-is-not-allowed we should instead actually return a proper bootloader, that simply boots from HD. There is an example of this in the netboot.xyz. I wonder if that would be clearer to the end user.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

both good points. I'll fix @jacobweinstock's in this PR and open up an issue for your @splaspood

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh wait no @jacobweinstock no this doesn't belong in isPXEAllowed as that doesn't return a filename and we want to differentiate between "boot to disk by returning a bogus name" vs "really should not be pxe'ing and we're better off causing the boot process to hang, so return nothing"

This comment was marked as duplicate.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, one last thing. You said

"boot to disk by returning a bogus name" vs "really should not be pxe'ing and we're better off causing the boot process to hang, so return nothing"

this business logic feels very opaque. I really would like to see us rethink and reorganize this codebase to be clearer and if possible less opinionated about this type of logic. Anyway, just something to think about.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, one last thing. You said

No worries, keep em coming!

"boot to disk by returning a bogus name" vs "really should not be pxe'ing and we're better off causing the boot process to hang, so return nothing"

this business logic feels very opaque. I really would like to see us rethink and reorganize this codebase to be clearer and if possible less opinionated about this type of logic. Anyway, just something to think about.

Agree, I hope by removing cacher stuff we can then make progress on removing packet/EM only things (phone-home, password callback, ...) and then have a much smaller code base where we can see all this opinionated business logic that we can hash out if they make sense or not and clean up if they do.

@mmlb mmlb force-pushed the refactor-setPXEFilename branch 2 times, most recently from f2b3f97 to 107b64a Compare April 30, 2021 18:33
@mmlb
Copy link
Contributor Author

mmlb commented Apr 30, 2021

@jacobweinstock / @splaspood rebased on top of latest master, no other changes.

@mmlb mmlb added ready-to-merge Signal to Mergify to merge the PR. and removed do-not-merge Signal to Mergify to block merging of the PR. labels Apr 30, 2021
Copy link
Member

@jacobweinstock jacobweinstock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would adding some tests be a possibility, by chance?

Comment on lines +116 to +142
if j.instance != nil && j.instance.State == "active" {
// We set a filename because if a machine is actually trying to PXE and nothing is sent it may hang for
// a while waiting for any possible ProxyDHCP packets and it would delay booting from disks.
// This short cuts all that when we know we want to be booting from disk.
return "/pxe-is-not-allowed"

This comment was marked as duplicate.

@mmlb
Copy link
Contributor Author

mmlb commented May 3, 2021

@jacobweinstock I can't seem to reply to your minimized comment, so here goes. Many good points and I almost did some of that in this PR but backed off from doing so to avoid making it even larger or wider in scope. I wanted to try and keep the core of the PR to cleaning up the function and then later can come back with some other refactorings.

jacobweinstock
jacobweinstock previously approved these changes May 3, 2021
@ScottGarman
Copy link
Contributor

How do we want to deal with the code coverage CI failures? I'm reviewing some stale PRs and I don't think we want to let this one die on the vine, it's really close, though it needs a rebase too.

@splaspood
Copy link
Contributor

It's been long enough that this PR should be rebased. I'd argue we should make the code coverage check happy, or we should get rid of it.

@jacobweinstock jacobweinstock removed the ready-to-merge Signal to Mergify to merge the PR. label Jun 30, 2021
@mmlb mmlb force-pushed the refactor-setPXEFilename branch 3 times, most recently from 9cc01af to 14ee5a7 Compare September 30, 2021 15:38
@mmlb mmlb requested a review from jacobweinstock September 30, 2021 15:38
Signed-off-by: Manuel Mendez <mmendez@equinix.com>
No longer needed as nixpkgs doesn't set GOPATH/GOROOT.

Signed-off-by: Manuel Mendez <mmendez@equinix.com>
Signed-off-by: Manuel Mendez <mmendez@equinix.com>
This corrects some use of joblog when j.Logger is already available
and should be preferred. Also gets rid of an unnecessary log line.

Signed-off-by: Manuel Mendez <mmendez@equinix.com>
PXE booting is a hardware level thing, should be checked from there.

Signed-off-by: Manuel Mendez <mmendez@equinix.com>
More context in name is just for human context. Added AllowPXE
explicitly because the next commit is going to change some of the logic
and I don't want to change the code and the tests at the same time.
All of these cases that return a valid filename either don't care
about allowPXE or wanted it set anyway.

Signed-off-by: Manuel Mendez <mmendez@equinix.com>
There's no real point in tftp'ing our iPXE binary, execing it,
waiting for re-DHCP only to *then* check `isPXEAllowed`. Might as well
do that earlier and save some bytes and time.

Signed-off-by: Manuel Mendez <mmendez@equinix.com>
@mmlb mmlb force-pushed the refactor-setPXEFilename branch from 14ee5a7 to 5652164 Compare September 30, 2021 17:58
dhcp/pxe.go Outdated
@@ -94,6 +94,13 @@ func SetupPXE(ctx context.Context, rep, req *dhcp4.Packet) bool {
dhcplog.With("mac", req.GetCHAddr(), "xid", req.GetXID()).Info("no client GUID provided")
}

// Wanted by PXE clients to learn of PXE servers, not required but responses with PXEClient ClassID are preferred.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in Figure 2-2 PXE Client Timeouts and Figure 2-3 PXE Client Response to DHCP Server Containing a Proxy DHCP Service and a few other places in the pxespec seem to indicate that a DHCP offer must have PXEClient in option 60. that would make this comment misleading.

Copy link
Contributor Author

@mmlb mmlb Oct 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added more info in the commit message which has an excerpt from the spec:

Section 2.2.2 of the Intel PXE Spec states:

The purpose of the timeouts is to ensure the PXE client gives precedence to servers supplying
“PXEClient” specific configuration tags. The PXE boot ROM must function as a normal DHCP boot
ROM in the absence of a PXE specific response. However, the PXE boot ROM must wait for
specified times to see if a PXE response is available before using a non-PXE configuration.

Note precedence in the spec, this lines up with our real world experience with all our nics in production. We do not setting the "PXEClient" class id when the first dhcp message comes in where we respond with filename:$our-iPXE-build.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I totally understand that the real-world experience shows this option might not be needed but I believe the pxe spec shows that the server must send this option and I think we should implement the pxe spec regardless. All I'm suggesting is removing "not required" in the comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The spec does not require it though. Section 2.2.2 Timeout calls out:

However, the PXE boot ROM must wait for specified times to see if a PXE response is available before using a non-PXE configuration

This is a SHOULD, not MUST.

I'll reword the comment though to drop a mention of required or not though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated the comment.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, DHCP server/client interactions are terribly undocumented. I did happen to find this. Talks a little about the logic for option 60 with PXEClient.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its also from this time where the spec is written but failing smart/helpful was seen as good. Which leads to the spec and implementations to differ a lot :/ .

@@ -68,16 +68,16 @@ func Setup(rep *dhcp4.Packet) {
rep.SetIP(dhcp4.OptionLogServer, conf.PublicSyslogIPv4) // Have iPXE send syslog to me.
}

var packetVersion = []byte{1, 0, 255}
var ouriPXEVersion = []byte{1, 0, 255}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the update. Would it be possible to add a direct pointer to where in Boots we are overriding this version? This is helpful but still a bit opaque.

mmlb added 2 commits October 1, 2021 19:54
Just getting rid of a Packetism.

We override the iPXE version that iPXE uses for itself and sends to the dhcp
server. This is how we distinguish an iPXE burned into a nic by a vendor vs the
one we build and send via tftp. There are better options here, but not as
convenient to apply. We can define the IPXE_VERSION via an env var when building
ipxe but other options would require a patch to one or more locations in the
iPXE code. This version number is no longer incrementing in upstream iPXE so is
low-risk to modify.

Signed-off-by: Manuel Mendez <mmendez@equinix.com>
This patch changes setPXEFilename so that it mostly only cares about AllowPXE
and the machine type, but not exclusively just those two bits of info.

We still care about an instance being active or not because we don't want to
set any filename if the machine has an instance and is not active. We don't
want to set a filename so that the nic hangs on boot since this is scenario
indicates a machine that has no business even being on, much less trying to
PXE boot. In contrast when AllowPXE == false and instance.state == "active"
and the machine is attempting to PXE boot indicates that the BIOS/FW is setup
with netboot as a boot option prior to booting from disk. In this case we
send a non existent file so that we can short circuit the nic boot attempt
and go on to the next boot option.

Signed-off-by: Manuel Mendez <mmendez@equinix.com>
@mmlb mmlb force-pushed the refactor-setPXEFilename branch from 5652164 to 8740415 Compare October 1, 2021 19:54
@mmlb mmlb requested a review from jacobweinstock October 1, 2021 19:56
mmlb added 3 commits October 1, 2021 20:10
Section 2.2.2 of the Intel PXE Spec states:

> The purpose of the timeouts is to ensure the PXE client gives precedence to servers supplying
> “PXEClient” specific configuration tags. The PXE boot ROM must function as a normal DHCP boot
> ROM in the absence of a PXE specific response. However, the PXE boot ROM must wait for
> specified times to see if a PXE response is available before using a non-PXE configuration.

Which sort of explains why PXE clients that are not *our* iPXE build will still
take the `filename` offered up even though the ClassID was not set for
PXEClient when job/dhcp.go:setPXEFilename:`if !isOuriPXE` is the branch taken.

SetupPXE seems like the best place for this setting any way as we setup the PXE
response packet without need for job information.

Signed-off-by: Manuel Mendez <mmendez@equinix.com>
This way its easier to test since it is now a pure fucntion. I was getting a
panic when trying to rework the not-our-ipxe case for an unknown machine
because it would try to call j.Error(). This is just better code.

Signed-off-by: Manuel Mendez <mmendez@equinix.com>
…ings

This way is more extensible since we don't have to add more and more bool args for
newer cpus or firmware interfaces.

Signed-off-by: Manuel Mendez <mmendez@equinix.com>
@mmlb mmlb force-pushed the refactor-setPXEFilename branch from 8740415 to 831bd82 Compare October 1, 2021 20:10
@mmlb mmlb removed the request for review from splaspood October 4, 2021 20:22
@mergify mergify bot merged commit 156ee90 into tinkerbell:main Oct 4, 2021
@mmlb mmlb deleted the refactor-setPXEFilename branch October 4, 2021 20:23
// TODO: make this actually check for iPXE and use ipxe' build system's ability to set name.
// This way we could set to something like "Packet iPXE" and then just look for that in the identifier sent in dhcp.
// This way we could set to something like "Boots iPXE" and then just look for that in the identifier sent in dhcp.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes good catch

mergify bot added a commit that referenced this pull request Dec 9, 2021
## Description

This is a revert of the revert that occurred in #217 + fix that caused the original revert.
There's also removal of what is effectively useless/dead code (isPXEClient/pxeClient var/handling).
I also removed the leading `/` from `/nonexistent` because a `/` is already prefixed and `//nonexistent` has bothered me for years.

## Why is this needed

#216 was a great contribution from @rgl, unfortunately we ran into issues on some of our Ampere eMAG systems (#225).
Turns out that the eMAG boot firmware just straight up ignores filename values if the response packet is marked as `PXEClient`.
#216 changed the logic subtly by setting the response packet's class as `HTTPClient` if http was supported by the booting machine, and otherwise setting `PXEClient`. Before #216 we would set `PXEClient` only for the dhcp responses that were going out to our tftp'd build of iPXE, the initial filename: response was `PXEClient`-less.

Once I put back the only `PXEClient` for our iPXE everything worked out fine again (tested using sandbox), but then I realized that now pxeClient was likely useless because I suspected iPXE did not care/require it. I tested this too and found that iPXE is happy to tftp a file provided by a `PXEClient`-less DHCP response. So why keep code around that is effectively useless? Less code is better than any code so that caused me to write up d1878bd.

Fixes #210
Fixes #225

This debugging also opens up the way to bring back #149 as that had to be reverted due to the same "always set `PXEClient`" issue that I was not equipped to debug easily/quickly then.

## How Has This Been Tested?

Lots of testing on EM machines using github.com/tinkerbell/sandbox's terraform support (I plan to clean up and PR https://github.com/mmlb/sandbox/tree/fix-terraform-and-add-arm64-support at some point).

## How are existing users impacted? What migration steps/scripts do we need?

Back to faster/better iPXE binary downloads by utilizing TCP/HTTP instead of UDP/TFTP.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready-to-merge Signal to Mergify to merge the PR.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants