Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

jobs/build: don't start new build until previous build completes #823

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

dustymabe
Copy link
Member

Since 973bcf9 ("jobs/build: rerun build-arch if previous build is
incomplete"), there is a race possible where the build job rerun logic
could kick in before the release jobs initially triggered for that
build has finished. We don't want to queue builds in that case.

Let's just prevent starting up the build job at all if there are
remaining pieces from another build for this stream running.

We only actually run the jobs when `uploading` is set. So it seems wrong
to update the build description. In fact, let's not even waste time
looking for missing arches or reading build metadata.

Patch best viewed with whitespace ignored.
@dustymabe
Copy link
Member Author

did some trivial testing on this. It seems to work

Since 973bcf9 ("jobs/build: rerun `build-arch` if previous build is
incomplete"), there is a race possible where the `build` job rerun logic
could kick in before the `release` jobs initially triggered for that
build has finished. We don't want to queue builds in that case.

Let's just prevent starting up the build job at all if there are
remaining pieces from another build for this stream running.

Co-authored-by: Jonathan Lebon <jonathan@jlebon.com>
@jlebon
Copy link
Member

jlebon commented Feb 23, 2023

Hmm, I thought we wanted to keep the property of being able to start a new build while the previous build is still finishing? If that's not the case, a simpler approach is to always wait for the multi-arch and release jobs. And that would come with benefits like centralizing Slack notifications and not having to support an extra parameter and diverging between FCOS and RHCOS. (Edit: actually, another huge benefit would be that it'd be completely race-free. We'd be able to simplify our lock logic by a lot.)

@dustymabe
Copy link
Member Author

Hmm, I thought we wanted to keep the property of being able to start a new build while the previous build is still finishing?

ehh, I've never been too set on that as a requirement.

If that's not the case, a simpler approach is to always wait for the multi-arch and release jobs. And that would come with benefits like centralizing Slack notifications and not having to support an extra parameter and diverging between FCOS and RHCOS. (Edit: actually, another huge benefit would be that it'd be completely race-free. We'd be able to simplify our lock logic by a lot.)

I think the thing I wanted to keep more was the fidelity between the job and failure reporting. i.e. if the release job fails I don't really want the x86_64 build job to report failure. I'd also like notifications to come through when the x86_64 job completes (letting us know it succeeded) and this typically happens way before the other jobs.

Maybe what we need is another toplevel job that's the entry point and will fail if any of the parts fail. Not sure how easy this would be to do because there is a lot of conditional logic baked in, but could be something to consider.

@jlebon
Copy link
Member

jlebon commented Feb 27, 2023

ehh, I've never been too set on that as a requirement.

We should probably think on this a bit more and look at some numbers. It means we significantly lower the rate at which we can build a stream and also increase the latency between requesting a build and having it.

I think the thing I wanted to keep more was the fidelity between the job and failure reporting. i.e. if the release job fails I don't really want the x86_64 build job to report failure.

Note though that's exactly what 4c02574 does to help the ART case. :)

I'd also like notifications to come through when the x86_64 job completes (letting us know it succeeded) and this typically happens way before the other jobs.

Hmm yeah, I think we can get that by having the build job send its message when it's done and just waiting for the multi-arch and release jobs? We don't need to propagate failures from those if we don't want (that'll have to be a parameter that replaces WAIT_FOR_RELEASE_JOB). I think though there's more in general we could do to make notifications better.

Maybe what we need is another toplevel job that's the entry point and will fail if any of the parts fail. Not sure how easy this would be to do because there is a lot of conditional logic baked in, but could be something to consider.

Yeah, maybe. There's a natural tension it seems between RHCOS and FCOS wrt how we want the pipeline to behave. We might end up there (a separate job), though ideally we try to rework things so that they become more similar while still preserving what we value.

We could actually probably implement waiting for the build-arch and release jobs (race-free) without lots of locks and without giving up on pipelining. The milestone step is relevant here.

Anyway, I guess we need a fix soon for the race condition. I think #822 has a much smaller impact on pipeline behaviour; WDYT about merging that one for now while we discuss the above?

@dustymabe
Copy link
Member Author

dustymabe commented Feb 27, 2023

ehh, I've never been too set on that as a requirement.

We should probably think on this a bit more and look at some numbers. It means we significantly lower the rate at which we can build a stream and also increase the latency between requesting a build and having it.

I'm not so sure this is a significant difference than what we require today. I'd say it's not too common that we have multiple builds of the same stream in flight.

I think the thing I wanted to keep more was the fidelity between the job and failure reporting. i.e. if the release job fails I don't really want the x86_64 build job to report failure.

Note though that's exactly what 4c02574 does to help the ART case. :)

Indeed, which is why it's a parameter and not the default.

I'd also like notifications to come through when the x86_64 job completes (letting us know it succeeded) and this typically happens way before the other jobs.

Hmm yeah, I think we can get that by having the build job send its message when it's done and just waiting for the multi-arch and release jobs? We don't need to propagate failures from those if we don't want (that'll have to be a parameter that replaces WAIT_FOR_RELEASE_JOB). I think though there's more in general we could do to make notifications better.

👍

Maybe what we need is another toplevel job that's the entry point and will fail if any of the parts fail. Not sure how easy this would be to do because there is a lot of conditional logic baked in, but could be something to consider.

Yeah, maybe. There's a natural tension it seems between RHCOS and FCOS wrt how we want the pipeline to behave. We might end up there (a separate job), though ideally we try to rework things so that they become more similar while still preserving what we value.

We could actually probably implement waiting for the build-arch and release jobs (race-free) without lots of locks and without giving up on pipelining. The milestone step is relevant here.

yes, but that does imply that pipelining is something we heavily take advantage of today or want to in the future. I really don't think (in the current pipeline setup) it's something we really need.

Maybe if we had a slim version of the build that happened "all the time" and a heavy one that happened once a day or something. But right now we're running the heavy one all the time and that doesn't lend itself well to pipelining IMO. quickly invalidating all those images we just created and cloud uploads we just did is a bit of an anti-pattern.

Anyway, I guess we need a fix soon for the race condition. I think #822 has a much smaller impact on pipeline behaviour; WDYT about merging that one for now while we discuss the above?

I commented over there. I think I still prefer this (let's talk more!) but won't block that PR either.

@jlebon
Copy link
Member

jlebon commented Feb 27, 2023

I commented over there. I think I still prefer this (let's talk more!) but won't block that PR either.

Agree, let's keep talking! I've updated that PR with your feedback.

I think you're probably right we don't heavily rely on pipelining today. I just didn't want to make this change quickly to fix the race without thinking it through since it has a wider impact. I think there's an opportunity here to converge on something simpler than the status quo. Maybe we can get together and discuss what we want the "ideal" to look like.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants