Design and implementation ideas for job retries #509

Open

r-c-n opened this issue Mar 27, 2024 · 2 comments

Comments

@r-c-n
Contributor

r-c-n commented Mar 27, 2024

Opening a discussion here to put together ideas about the requirements, use cases and design of job retries.

I'll start by pointing out that there isn't a single solution to this: it will depend on the requirements we settle on, and even within a defined set of requirements there are many different ways and approaches to implement retries.

IMO, regardless of the solution we choose in the end, handling job retries from a high-level point of view (where a retry is an entity exposed by the API) offers more flexibility than handling them exclusively from a low-level point of view (where a retry is internal to the API, handled automatically by the core and not exposed outside). That said, the two aren't incompatible.

A mix of the two can be a sane option: a low-level retry stage is performed automatically by the runtimes when they detect a clear condition that warrants a retry (such as what LAVA does), with no API/core intervention, and a higher-level retry stage can then be performed by whatever logic interacts with the API results. Under this scheme, the low-level stage acts as a filter, weeding out runtime-specific retries that would be too noisy to handle at a higher level.
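Purely as an illustration of this mixed scheme (none of these names exist in KernelCI; this is a sketch of the idea, not a proposed implementation):

```python
from dataclasses import dataclass


@dataclass
class RuntimeResult:
    """Outcome of one job run as reported by a runtime (e.g. LAVA)."""
    status: str    # e.g. "pass", "fail", "infra-error"
    attempts: int  # runs already performed by the runtime itself


def should_retry_low_level(result, max_attempts=3):
    """Low-level stage: the runtime retries on its own when it sees a
    clear condition (here, an infrastructure error), with no API/core
    intervention. This filters out the noisiest retries."""
    return result.status == "infra-error" and result.attempts < max_attempts


def request_high_level_retry(api, job_node_id):
    """High-level stage: a retry as an explicit entity a client creates
    through the API once the low-level stage has given up.
    `api.create_retry` is hypothetical."""
    return api.create_retry(parent_job=job_node_id)
```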

Questions

How will retries be triggered?

  • Automatically (low-level, platform-based)?
    • What will trigger an automatic retry?
  • Will API clients (users / pipeline / tools) submit or request job retries?
    • What information will be needed to submit / request a retry?
    • Can they submit a retry of anything or only of certain job types?
    • Job ownership: which users can trigger retries for a certain job?

Is there a retry limit per run?

How to conceptualize a "retry"?

  • Does it need to be a separate entity?
  • Can a retry be simply a "job"?
  • If not, how to define the relationship between a job run and its retries?
@nuclearcat
Member

In situations of unreliable test results, kernel maintainers often need to rerun tests to make sure the results are correct. This can happen for a variety of reasons, such as flaky tests, network issues, or other transient failures.
Previous attempts to implement retry logic were unsuccessful because of the way the event logic is implemented: even if we spoof a pub/sub event and retry a build node, the event logic fetches the node id with its real data (different from the spoofed one) and fails to trigger the test event.
This time we will put more effort into implementing retry logic for lab tests.

This particular implementation will target a specific use case, so the answers to the questions above are tailored to it:

Retries will be triggered by a manual pub/sub event. This event might be initiated by a KernelCI system user through the web interface and/or through a dedicated API endpoint; we still have to decide on this. The event should contain at least the build node id, the fields to override, and a filter to select only a specific job/platform.
Retries will be limited to specific job types, for example lab tests.
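As a rough illustration, such an event payload could look something like this (the field names are assumptions, not a settled schema):

```python
retry_event = {
    "op": "retry",                                # marks a manual retry event
    "build_node_id": "65f1c0ffee000000000000ab",  # build node to re-test against
    "override": {                                 # fields to override on the new job
        "submitter": "user:admin@kernelci.org",
    },
    "filter": {                                   # select a specific job/platform
        "job": "baseline-x86",
        "platform": "qemu-x86_64",
    },
}
```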

There is no retry limit per run.
Each retry will create a new "job" node attached to the kernel build node. We should add a remark to this node indicating that it is a manually triggered retry, not an automatically scheduled job.
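For illustration, the resulting node could look roughly like this (the exact node schema and the retry marker field are assumptions):

```python
retry_job_node = {
    "kind": "job",
    "name": "baseline-x86",
    "parent": "65f1c0ffee000000000000ab",  # the kernel build node
    "data": {
        "retry": {                # hypothetical marker: manually triggered
            "manual": True,       # retry, not an automatically scheduled job
            "of": "65f1c0ffee000000000000cd",  # original job node (hypothetical)
        },
    },
}
```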

Further information will be posted as follow-up comments.

@r-c-n
Contributor Author

r-c-n commented Jun 28, 2024

Linking this proposal here as well, since I think it's relevant: kernelci/kernelci-pipeline#512. IMO the main complication is that job generation and running currently always involve the scheduler; decoupling the scheduler logic from job generation and running would make it possible to implement features like retries separately, as part of the pipeline.
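As a very rough sketch of that decoupling idea (all names here are illustrative, not current pipeline code):

```python
# Hypothetical sketch: job generation/submission as a standalone primitive
# that both the scheduler and a retry handler can call. The `api` and
# `runtime` objects and their methods are assumed interfaces, not the
# actual kernelci-pipeline API.

def generate_and_run_job(api, runtime, build_node_id, job_name):
    """Create a job node for a given build and submit it to a runtime,
    independently of what triggered it."""
    node = api.create_node(kind="job", name=job_name, parent=build_node_id)
    runtime.submit(node)
    return node


def on_scheduler_event(api, runtime, event):
    # Normal path: the scheduler reacts to a completed-build event.
    return generate_and_run_job(api, runtime, event["node_id"], event["job"])


def on_retry_event(api, runtime, event):
    # Retry path: the same primitive, driven by a manual retry event
    # (payload as sketched in the comment above).
    return generate_and_run_job(api, runtime, event["build_node_id"],
                                event["filter"]["job"])
```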
