Design and implementation ideas for job retries #509
Comments
When test results are unreliable, kernel maintainers often need to re-run the tests to make sure the results are correct. This can happen for a variety of reasons, such as flaky tests, network issues or other transient failures. This implementation will target a specific use case, so the answers to the questions above are tailored to it: retries will be triggered by a manual pub/sub event. The event might be initiated by a KernelCI system user through the web interface and/or through a dedicated API endpoint; we still have to decide on this. There is no retry limit per run. Further information will be posted in follow-up comments.
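To make this concrete, here is a minimal sketch of what triggering a retry over a dedicated endpoint could look like from a client's point of view. Everything here is an assumption for the sake of illustration: the `/events` path, the `retry` channel name and the payload fields are invented, and the real schema is exactly what we need to decide on.

```python
# Hypothetical sketch only: endpoint path, channel name and payload
# fields are illustrative assumptions, not the final KernelCI API.
import requests

API_URL = "https://api.example.kernelci.org/latest"  # placeholder instance URL
API_TOKEN = "..."  # token of the user requesting the retry

def request_retry(node_id: str, reason: str) -> None:
    """Ask the API to publish a manual retry event for a given job node."""
    event = {
        "channel": "retry",        # hypothetical pub/sub channel name
        "data": {
            "op": "retry",
            "node_id": node_id,    # the job (node) to re-run
            "reason": reason,      # e.g. "flaky test", "infra error"
        },
    }
    resp = requests.post(
        f"{API_URL}/events",       # hypothetical publish endpoint
        json=event,
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        timeout=30,
    )
    resp.raise_for_status()

request_retry("65a1b2c3d4e5f67890123456", "flaky test results")
```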
Linking this proposal here as well: kernelci/kernelci-pipeline#512, since I think it's relevant. IMO the main complication is that job generation and running currently always involve the scheduler, and decoupling the scheduler logic from job generation and running would make it possible to implement features like retries separately, as part of the pipeline. The sketch below illustrates the idea.
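As a purely illustrative sketch of that decoupling (this is not the current pipeline code, and all names here are invented): if job generation and job submission communicated through a queue, a retry service could re-enqueue an existing job definition without involving the scheduler at all.

```python
# Sketch of decoupled generation/submission, using an in-memory queue
# as a stand-in for the pub/sub layer. All names are illustrative.
import queue

job_queue: "queue.Queue[dict]" = queue.Queue()

def generate_job(checkout: str, test: str) -> dict:
    """Generator service: turn an event into a job definition."""
    return {"checkout": checkout, "test": test, "attempt": 1}

def submit(job: dict) -> None:
    """Runner service: consume a definition and submit it to a runtime."""
    print(f"submitting {job['test']} (attempt {job['attempt']})")

def retry(job: dict) -> None:
    """Retry service: re-enqueue the same definition with a bumped
    attempt counter. Note the scheduler is never involved here."""
    job_queue.put(dict(job, attempt=job["attempt"] + 1))

# Normal flow: generate, enqueue, run.
job_queue.put(generate_job("v6.8-rc1", "baseline"))
job = job_queue.get()
submit(job)
# Manual retry of the same job, bypassing generation entirely.
retry(job)
submit(job_queue.get())
```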
Opening a discussion here to put together ideas about the requirements, use cases and design of job retries.
I'll start by pointing out that there isn't a single solution to this: it will obviously depend on our requirements, and even within a defined set of requirements there are many different ways to implement retries and approaches to choose from.
IMO, regardless of the solution we eventually choose, handling job retries from a high-level point of view (where a retry is an entity exposed by the API) offers more flexibility than handling them exclusively from a low-level point of view (where a retry is something intra-API, handled automatically by the core and not exposed outside). That doesn't mean the two are incompatible, though. A mix of both can be a sane option: a low-level retry stage performed automatically by the runtimes when they detect a clear retry condition (such as what LAVA does), with no API/core intervention, combined with a higher-level retry stage performed by whatever logic interacts with the API results.

Under this scheme, the low-level retry stage would act as a filter, absorbing certain runtime-specific retries that would be too noisy to handle at a higher level.
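As a rough illustration of the high-level option, a retry could be modelled as its own API object pointing back at the job it re-runs. The field names below are invented for this sketch; in practice this would more likely map onto the existing node model than onto a new type.

```python
# Illustrative sketch only: field names and types are assumptions,
# not the actual KernelCI API schema.
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class JobRetry:
    """A retry exposed as a first-class API entity."""
    retry_of: str                  # id of the original job node
    requested_by: str              # user or service that asked for it
    reason: Optional[str] = None   # free-form justification
    attempt: int = 1               # 1-based attempt counter
    created: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))
    # Low-level runtime retries (e.g. LAVA auto-resubmits) would not
    # appear here at all: they are absorbed by the runtime layer, and
    # only the final outcome of that stage is reported to the API.
```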
Questions
- How will retries be triggered?
- Is there a retry limit per run?
- How to conceptualize a "retry"?