Design and implementation ideas for job retries #509

Open

r-c-n opened this issue Mar 27, 2024 · 2 comments

Comments

@r-c-n
Contributor

r-c-n commented Mar 27, 2024

Opening a discussion here to put together ideas about the requirements, use cases and design of job retries.

I'll start by pointing out that there isn't a single solution to this: it will depend on the requirements we settle on, and even within a defined set of requirements there are many different ways and approaches to implement retries.

IMO, regardless of the solution we choose in the end, handling job retries from a high-level point of view (where a retry is an entity exposed by the API) offers more flexibility than handling them exclusively from a low-level point of view (where a retry is internal to the API, handled automatically by the core and not exposed outside). That said, the two aren't incompatible.

A mix of the two can be a sane option: a low-level retry stage is performed automatically by the runtimes when they detect a clear condition that warrants a retry (such as what LAVA does), with no API/core intervention, and a higher-level retry stage can then be performed by whatever logic interacts with the API results. Under this scheme, the low-level stage acts as a filter, weeding out runtime-specific retries that would be too noisy to handle at a higher level.
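Purely as an illustration of this mixed scheme (none of these names exist in KernelCI; this is a sketch of the idea, not a proposed implementation):

```python
from dataclasses import dataclass


@dataclass
class RuntimeResult:
    """Outcome of one job run as reported by a runtime (e.g. LAVA)."""
    status: str    # e.g. "pass", "fail", "infra-error"
    attempts: int  # runs already performed by the runtime itself


def should_retry_low_level(result, max_attempts=3):
    """Low-level stage: the runtime retries on its own when it sees a
    clear condition (here, an infrastructure error), with no API/core
    intervention. This filters out the noisiest retries."""
    return result.status == "infra-error" and result.attempts < max_attempts


def request_high_level_retry(api, job_node_id):
    """High-level stage: a retry as an explicit entity a client creates
    through the API once the low-level stage has given up.
    `api.create_retry` is hypothetical."""
    return api.create_retry(parent_job=job_node_id)
```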

Questions

How will retries be triggered?

  • Automatically (low-level, platform-based)?
    • What will trigger an automatic retry?
  • Will API clients (users / pipeline / tools) submit or request job retries?
    • What information will be needed to submit / request a retry?
    • Can they submit a retry of anything or only of certain job types?
    • Job ownership: which users can trigger retries for a certain job?

Is there a retry limit per run?

How to conceptualize a "retry"?

  • Does it need to be a separate entity?
  • Can a retry be simply a "job"?
  • If not, how to define the relationship between a job run and its retries?
@nuclearcat
Member

In situations of unreliable test results, kernel maintainers often need to rerun tests to make sure the results are correct. This can happen for a variety of reasons, such as flaky tests, network issues, or other transient failures.
Previous attempts to implement retry logic were unsuccessful because of the way the event logic is implemented: even if we spoof a pub/sub event and retry a build node, the event logic fetches the node id with its real data (different from the spoofed one) and fails to trigger the test event.
This time we will put more effort into implementing retry logic for lab tests.

This particular implementation will target a specific use case, so the answers to the questions above are tailored to it:

Retries will be triggered by a manual pub/sub event. This event might be initiated by a KernelCI system user through the web interface and/or through a dedicated API endpoint; we still have to decide on this. The event should contain at least the build node id, the fields to override, and a filter to select only a specific job/platform.
Retries will be limited to specific job types, for example lab tests.
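As a rough illustration, such an event payload could look something like this (the field names are assumptions, not a settled schema):

```python
retry_event = {
    "op": "retry",                                # marks a manual retry event
    "build_node_id": "65f1c0ffee000000000000ab",  # build node to re-test against
    "override": {                                 # fields to override on the new job
        "submitter": "user:admin@kernelci.org",
    },
    "filter": {                                   # select a specific job/platform
        "job": "baseline-x86",
        "platform": "qemu-x86_64",
    },
}
```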

There is no retry limit per run.
Each retry will create a new "job" node attached to the kernel build node. We should add a remark to this node indicating that it is a manually triggered retry, not an automatically scheduled job.
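For illustration, the resulting node could look roughly like this (the exact node schema and the retry marker field are assumptions):

```python
retry_job_node = {
    "kind": "job",
    "name": "baseline-x86",
    "parent": "65f1c0ffee000000000000ab",  # the kernel build node
    "data": {
        "retry": {                # hypothetical marker: manually triggered
            "manual": True,       # retry, not an automatically scheduled job
            "of": "65f1c0ffee000000000000cd",  # original job node (hypothetical)
        },
    },
}
```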

Further information will be posted as follow-up comments.

@r-c-n
Contributor Author

r-c-n commented Jun 28, 2024

Linking this proposal here as well, since I think it's relevant: kernelci/kernelci-pipeline#512. IMO the main complication is that job generation and running currently always involve the scheduler; decoupling the scheduler logic from job generation and running would make it possible to implement features like retries separately, as part of the pipeline.
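As a very rough sketch of that decoupling idea (all names here are illustrative, not current pipeline code):

```python
# Hypothetical sketch: job generation/submission as a standalone primitive
# that both the scheduler and a retry handler can call. The `api` and
# `runtime` objects and their methods are assumed interfaces, not the
# actual kernelci-pipeline API.

def generate_and_run_job(api, runtime, build_node_id, job_name):
    """Create a job node for a given build and submit it to a runtime,
    independently of what triggered it."""
    node = api.create_node(kind="job", name=job_name, parent=build_node_id)
    runtime.submit(node)
    return node


def on_scheduler_event(api, runtime, event):
    # Normal path: the scheduler reacts to a completed-build event.
    return generate_and_run_job(api, runtime, event["node_id"], event["job"])


def on_retry_event(api, runtime, event):
    # Retry path: the same primitive, driven by a manual retry event
    # (payload as sketched in the comment above).
    return generate_and_run_job(api, runtime, event["build_node_id"],
                                event["filter"]["job"])
```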
