Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

internal/exec/engine: write empty cache config when not provided #1002

Merged
merged 2 commits into from
Jul 8, 2020

Conversation

arithx
Copy link
Contributor

@arithx arithx commented Jun 18, 2020

When the user-provided config is empty write an empty cache config to
prevent re-fetching it in every stage.

Closes #950

@bgilbert
Copy link
Contributor

I'm not wild about special-casing this at such a high level. Would it make more sense for acquireConfig to special-case an ErrEmpty return from fetchProviderConfig, skip validation (if necessary), and write out the cache normally?

The other issue is that #950 calls for something a bit more general than what's done here: refactoring so that the non-fetch stages can only read from cache, and will fail outright if the cache isn't available.

@arithx arithx force-pushed the fetch branch 2 times, most recently from 9faafb6 to 20b19c0 Compare June 19, 2020 06:07
@arithx
Copy link
Contributor Author

arithx commented Jun 19, 2020

Updated.

Split the internals of acquireConfig into two functions acquireCachedConfig & acquireProviderConfig and made it so acquireProviderConfig would only occur in fetch stages.

Moved the writing of an empty config to after acquireProviderConfig returns ErrEmpty.

Fedora infra died while I was trying to build images for testing so latest set isn't tested yet.

Copy link
Member

@sohankunkerkar sohankunkerkar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

@jlebon jlebon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For #950, one thing I originally imagined was literally moving all the code related to fetching configs from engine.go into fetch/fetch.go. I guess this is a little complicated now with the new fetch-offline stage... Still probably worth doing some refactoring eventually, though definitely not a blocker!

func (e *Engine) acquireConfig(stageName string) (cfg types.Config, err error) {
cfg, err = e.acquireCachedConfig()
if err != nil {
if os.IsNotExist(err) && strings.HasPrefix(stageName, "fetch") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for fetch/fetch-offline, we shouldn't try to even use the cache. Not sure whether we want to take that further and also error out if it's present. But it feels misleading to not actually fetch?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fair, I've updated it to only attempt to fetch from the provider config in fetch* stages.

internal/exec/engine.go Show resolved Hide resolved
@arithx arithx force-pushed the fetch branch 2 times, most recently from 4748735 to b862614 Compare June 30, 2020 04:52
@arithx
Copy link
Contributor Author

arithx commented Jun 30, 2020

Still probably worth doing some refactoring eventually, though definitely not a blocker!

Yeah, I'd rather punt that out for now and keep this simple.

Change pushed addressing comment & fixing blackbox tests (we were previously relying on the old functionality and not actually running the fetch stage in blackbox tests).

@arithx
Copy link
Contributor Author

arithx commented Jun 30, 2020

Hmm, the tls.fetchfile.http.compressed variants seem to be failing. Will dig in more in the morning.

Jun 30 06:08:03 qemu0 kola-runext-blackbox.sh[1284]:             CRITICAL : failed to update timeouts and CAs for fetcher: gzip: invalid header
Jun 30 06:08:03 qemu0 kola-runext-blackbox.sh[1284]:             CRITICAL : failed to acquire config: gzip: invalid header
Jun 30 06:08:03 qemu0 kola-runext-blackbox.sh[1284]:             CRITICAL : Ignition failed: gzip: invalid header

@arithx
Copy link
Contributor Author

arithx commented Jul 1, 2020

Okay, I think I've tracked down the bug. I'm not sure how we're avoiding this at the moment though. Essentially what seems to be happening is that the compressed tests work for the initial fetch (and are able to add the sources).

However, we're re-writing the CA cert in place here which is resulting in the CA source now being a simple base64 encoded data URL but still is specifying the previous compression method (gzip in this case). We then write the config with the changed CA cert source to the cached config.

This causes subsequent reads of the config from the cache to fail when attempting to read a base64 string via a gzip reader. A simple solution is to just nuke the Compression for the CA at the time of rewrite but I'm wondering if we should be preserving the compression type specified to the cached config.

cc @bgilbert

@bgilbert
Copy link
Contributor

bgilbert commented Jul 1, 2020

Compression support for CAs was added as part of resource consolidation in spec 3.1.0 and this interaction was missed. I don't think it's important to preserve the original compression field in the cached config. We're already clearing the HTTP headers.

@arithx
Copy link
Contributor Author

arithx commented Jul 2, 2020

Tests now passing

Copy link
Member

@jlebon jlebon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two nits, LGTM otherwise!

if err == errors.ErrEmpty {
e.Logger.Info("%v: provider config was empty, writing empty cache config", err)
if err = renameio.WriteFile(e.ConfigCache, []byte("{}"), 0640); err != nil {
e.Logger.Crit("failed to empty write cached config: %v", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
e.Logger.Crit("failed to empty write cached config: %v", err)
e.Logger.Crit("failed to write empty cached config: %v", err)

func (e *Engine) acquireConfig() (cfg types.Config, err error) {
// acquireConfig will perform differently based on the stage it is being
// called from. In fetch stages it will attempt to fetch the provider
// config (writing an empty provider config if it is empty. In all other
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// config (writing an empty provider config if it is empty. In all other
// config (writing an empty provider config if it is empty). In all other

Copy link
Contributor

@bgilbert bgilbert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo in commit message: interal.

internal/exec/engine.go Outdated Show resolved Hide resolved
internal/exec/engine.go Outdated Show resolved Hide resolved
internal/resource/http.go Outdated Show resolved Hide resolved
@arithx
Copy link
Contributor Author

arithx commented Jul 7, 2020

Updated

internal/exec/engine.go Show resolved Hide resolved
internal/exec/engine.go Outdated Show resolved Hide resolved
arithx added 2 commits July 7, 2020 21:51
When the user-provided config is empty write an empty cache config.
Additionally restrict fetching configs to only `fetch` stages.
CAs are re-written into the fetched config as a base64 string to allow
cached reads to not require an additional fetch. When compression
support was added during the resource consolidation refactor for the
3.1.0 spec this interaction was missed.

The blackbox tests did not catch this as they were always clearing the
config cache on each run.
@arithx
Copy link
Contributor Author

arithx commented Jul 8, 2020

pushed

@arithx arithx merged commit b025750 into coreos:master Jul 8, 2020
jlebon added a commit to jlebon/fedora-coreos-config that referenced this pull request Jul 15, 2020
We shouldn't use `/run/ignition.json` to determine whether a user
config was provided since it's implementation details. Instead, use the
new official journal messages that Ignition emits.

This is complicated by the fact that we need to support RHCOS, where the
journal messages haven't been backported. Use the fact that we always
have a base config to key off of whether to use the old behaviour vs the
new one. (More accurately, we'd want to check for
coreos/ignition#1002, but there's no easy way to
do this from the outside. Alternatively we can check the Ignition
version, though that's deeply nested under `/usr/lib/dracut/...`).

Anyway, this should be temporary until RHCOS moves to spec v3.

Closes: coreos#514
jlebon added a commit to coreos/fedora-coreos-config that referenced this pull request Jul 15, 2020
We shouldn't use `/run/ignition.json` to determine whether a user
config was provided since it's implementation details. Instead, use the
new official journal messages that Ignition emits.

This is complicated by the fact that we need to support RHCOS, where the
journal messages haven't been backported. Use the fact that we always
have a base config to key off of whether to use the old behaviour vs the
new one. (More accurately, we'd want to check for
coreos/ignition#1002, but there's no easy way to
do this from the outside. Alternatively we can check the Ignition
version, though that's deeply nested under `/usr/lib/dracut/...`).

Anyway, this should be temporary until RHCOS moves to spec v3.

Closes: #514
jlebon added a commit to jlebon/ignition that referenced this pull request Jul 16, 2020
Regression from coreos#958. We switched the list of providers from an array to
a map. But iteration order through a map is undefined, so we lost the
precedence of providers.

I think this is the cause behind a lot of the FCOS installer test
timeouts, such as:

coreos/coreos-assembler#1597

There, we pass the Ignition config for the PXE boot via
`ignition.config.url`, but if the metal (no-op) fetcher appears earlier
than the `cmdline` fetcher, we get no config. And similarly for the
installed system when the no-op fetcher appears before the `system`
fetcher (which coreos-installer's `--ignition-file` leverages).

The likelihood of this happening increased in the v2.4.0 release due to
coreos#1002, which only gave us one try
to iterate over the correct provider first (at the `fetch` stage),
rather than every stage having a go at it.

Closes: coreos/coreos-assembler#1597
jlebon added a commit to jlebon/ignition that referenced this pull request Jul 16, 2020
Regression from coreos#958. We switched the list of providers from an array to
a map. But iteration order through a map is undefined, so we lost the
precedence of providers.

I think this is the cause behind a lot of the FCOS installer test
timeouts, such as:

coreos/coreos-assembler#1597

There, we pass the Ignition config for the PXE boot via
`ignition.config.url`, but if the metal (no-op) fetcher appears earlier
than the `cmdline` fetcher, we get no config. And similarly for the
installed system when the no-op fetcher appears before the `system`
fetcher (which coreos-installer's `--ignition-file` leverages).

The likelihood of this happening increased in the v2.4.0 release due to
coreos#1002, which only gave us one try
to iterate over the correct provider first (at the `fetch` stage),
rather than every stage having a go at it.

Closes: coreos/coreos-assembler#1597
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Canonicalize the fetch stage as the only stage during which Ignition fetches configs
4 participants