RFC - AR Cost Reduction Plan #79
It's a bit difficult to comment on this RFC, as there are many trailheads proposed to investigate but not many specific changes proposed. Data on where existing costs are spent may highlight where/how to optimize, as would estimates of how much each proposed change is expected to reduce costs. Potential downsides such as reduced execution speed are not discussed, likely because investigation has not yet yielded data.
### Improve our AR Configuration

SIG-Build has recently started the effort to reduce our build times across all platforms and is currently investigating where optimizations can be made.

We will address the stability of our AR pipeline to reduce the number of times a maintainer has to repeat the AR run for flaky tests or any other AR failure that is not directly related to the incoming code change.
While this section highlights areas of investigation, there do not appear to be any specific changes proposed. It is fairly unclear what will be done to improve configuration/build-times/stability. The merge queues item below is adequately specific.
We will also investigate using a merge queue in order to significantly reduce the total number of AR runs we need to execute. The purpose of the merge queue will be to batch together incoming PRs and run the AR on the combined changes. If the AR succeeds, the PRs are then merged in together. If the AR fails, the PR that caused the failure is removed from the queue and the AR is retried. SIG-Build is currently working on a [mechanism](https://github.com/o3de/sig-build/blob/main/rfcs/rfc-bld-20220623-1-build-failure-rca.md) to identify owners of AR build failures, which would enable this automation in a merge queue.
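As an illustration of the flow just described, here is a minimal Python sketch of that batching policy; the function names (`run_ar`, `identify_failing_pr`, `merge`) are hypothetical placeholders, not existing O3DE or GitHub tooling.

```python
from typing import Callable, List

def process_merge_queue(
    queue: List[str],
    batch_size: int,
    run_ar: Callable[[List[str]], bool],
    identify_failing_pr: Callable[[List[str]], str],
    merge: Callable[[List[str]], None],
) -> None:
    """Batch queued PRs, run the AR once per batch, and eject failures.

    run_ar              -- runs the AR on the combined changes, True on success
    identify_failing_pr -- returns the PR blamed for a failed batch run
    merge               -- merges a batch of PRs that passed together
    """
    while queue:
        batch = queue[:batch_size]
        if run_ar(batch):
            merge(batch)              # one AR run covers the whole batch
            del queue[:len(batch)]
        else:
            culprit = identify_failing_pr(batch)
            queue.remove(culprit)     # eject the failing PR and retry the rest
```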
There are likely to be implementation and UX costs to using a build queue. There may also be new inefficiencies introduced when a combined build of changes A+B+C+D+E contains new bugs and still requires a dozen build attempts for the five changes to get approval. However, it seems like this needs investigation before a specific change is proposed.
It is also unclear how ownership information relates to build queuing. While notification can prompt action after getting rejected by the queue, the same action could be prompted regardless of build cadence or queuing. This seems like extraneous information, since it does not clarify why using a build queue is appropriate or how a build queue is expected to improve the pipeline.
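As a rough illustration of that concern, the snippet below counts AR executions under the simple eject-and-retry policy described in the RFC; the failure counts are hypothetical, not measured O3DE data.

```python
def batch_ar_runs(batch_size: int, bad_prs: int) -> int:
    """AR executions to drain one batch when each failed run ejects one bad PR."""
    good_prs = batch_size - bad_prs
    final_run = 1 if good_prs > 0 else 0   # the last run with only good PRs succeeds
    return bad_prs + final_run

# Best case: all 5 changes are clean, so a single combined AR run suffices.
print(batch_ar_runs(5, 0))   # 1
# Worse case: several changes are bad; each ejection costs a full batch run,
# and every ejected PR still needs at least one more run once it is fixed,
# which is how a five-change batch can approach a dozen total attempts.
print(batch_ar_runs(5, 4))   # 5 (plus post-fix re-runs for the ejected PRs)
```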
### Utilize lower cost build resources

Addressing the runtimes and stability of the AR will allow us to migrate some of the jobs to lower cost build resources. This will include utilizing lower spec instances in our current infrastructure and also hosted platforms like GitHub Actions.
It is unclear how runtime and stability affect the ability to migrate to different hardware. Long builds and intermittent failures should cause inefficiencies at similar rates. The question seems to be about optimizing cost per compute-unit. Is this trying to highlight time constraints, which create tension with moving to the slowest, cheapest hardware available?
minor: It may be more specific to state "pipeline duration" instead of "runtime", which has multiple meanings.
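The cost-per-compute-unit question could be framed with a small comparison like the one below; the instance labels, rates, and durations are hypothetical placeholders, not actual EC2 pricing or measured pipeline times.

```python
# Hypothetical instances: (hourly_rate_usd, pipeline_duration_hours)
candidates = {
    "large-fast":  (1.20, 1.0),   # current-style high-spec host (made-up numbers)
    "small-cheap": (0.40, 2.5),   # lower-spec host that runs the same job slower
}

for name, (rate, hours) in candidates.items():
    cost = rate * hours
    print(f"{name}: ${cost:.2f} per run, {hours:.1f} h pipeline duration")

# With these made-up numbers:
#   large-fast:  $1.20 per run, 1.0 h pipeline duration
#   small-cheap: $1.00 per run, 2.5 h pipeline duration
# The cheaper host saves little per run but more than doubles the pipeline
# duration, which is the time-constraint tension raised above.
```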
### Revisit our testing approach in the PR workflow

A very importance aspect of reducing our costs is revisiting our approach to setting up our build and test targets in the AR. This investigation will be driven by SIG-Build in collaboration with the other SIGs to determine how to get the best value from the AR.
This section is not making a specific suggested change, nor is it clarifying important metrics. I recommend defining a specific budget cap in terms of total execution-duration per pipeline run, and then we can hold discussion with other SIGs about slicing up where that budget gets spent.
minor: important
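One way to express the suggested budget cap is as data each SIG's allocation is checked against; the cap, SIG list, and minute values below are hypothetical placeholders for discussion, not agreed numbers.

```python
# Hypothetical per-run execution-duration budget, in minutes of build/test time.
PIPELINE_BUDGET_MINUTES = 90

sig_allocations = {          # made-up split for illustration only
    "sig-build":    25,
    "sig-core":     25,
    "sig-graphics": 20,
    "sig-testing":  20,
}

total = sum(sig_allocations.values())
assert total <= PIPELINE_BUDGET_MINUTES, (
    f"Allocations ({total} min) exceed the per-run budget "
    f"({PIPELINE_BUDGET_MINUTES} min); some checks must move out of the AR."
)
```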
_Phase 3 Cost Improvements:_

* Migrate PR checks that are compatible with GitHub’s hosted runners.
* Separate longer running tasks that are not suitable to gate PRs or are incompatible with GitHub Actions to the merge queue checks, post-merge checks, or nightly/weekly checks.
Longer running, lower-priority tasks are already separated into the nightly checks. I don't see a necessary change unless there is a new heuristic of what "too long" means, such as a specific execution time budget.
### Example: 22.10 Release AR Metrics

For the 22.10 release we saw a high AR failure rate (48%) and a high number of total runs (2.2 avg per PR) for PRs targeting the stabilization branch. There was an [issue](https://github.com/o3de/o3de/pull/12346) in the stabilization branch that made AR runs intermittently fail. This went undetected for a few days while developers attempted to re-run their ARs to get a successful build. To prevent issues like this, we need to set up mechanisms that raise awareness about these failures and escalate them so developers address the issue.
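To put those numbers in cost terms, the short calculation below converts the quoted 2.2 average runs per PR into AR-execution overhead; the per-run dollar figure is a made-up placeholder, not actual O3DE spend.

```python
# Numbers quoted in the 22.10 stabilization-branch data above.
avg_runs_per_pr = 2.2        # average AR runs per PR (48% of runs failed)

extra_runs_per_pr = avg_runs_per_pr - 1.0        # 1.2 retries per PR on average
overhead_pct = extra_runs_per_pr * 100           # 120% more runs than one clean pass

hypothetical_cost_per_run = 50.0                 # placeholder USD figure, not real data
extra_cost_per_pr = extra_runs_per_pr * hypothetical_cost_per_run

print(f"{overhead_pct:.0f}% more AR runs than a single clean pass per PR")
print(f"~${extra_cost_per_pr:.2f} of extra spend per PR at ${hypothetical_cost_per_run}/run")
```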
While this data does include intermittent failures, pull requests also contain legitimate failures correctly caught by tests (or legitimate build failures, asset failures, etc.). And while there is currently no way for contributors to claim or indicate which failures were intermittent in this dataset, the data from branch-update runs should be restricted to the intermittent failures, since those changes initially passed AR.
SIG-Testing has a metrics-based proposal to improve detection of intermittent failures in branch update runs and the periodic runs: o3de/sig-testing#64
...but the onus will always be on SIGs to address instability they own in their product and its tests.
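A sketch of the filtering described above, assuming a hypothetical record schema with per-PR AR and branch-update results:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class PrRuns:
    pr_id: int
    ar_passed: bool             # final pre-merge AR result
    branch_update_passed: bool  # result of the automatic branch-update run

def intermittent_candidates(records: List[PrRuns]) -> List[int]:
    """PRs that passed AR but failed the branch-update run.

    Because the change already passed AR, a later failure on the same code is
    more likely intermittent (or caused by an interaction with other merges).
    """
    return [r.pr_id for r in records
            if r.ar_passed and not r.branch_update_passed]

# Example with made-up data:
print(intermittent_candidates([
    PrRuns(101, True, True),
    PrRuns(102, True, False),   # candidate intermittent failure
    PrRuns(103, True, True),
]))  # -> [102]
```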
Yes, the purpose of this doc is to provide the direction SIG-Build is taking to reduce costs for the AR at a high level: primarily our changes to the workflow, like integrating merge queues, allocating engineering resources to redesign our jobs to run with GitHub Actions, and moving some tests outside the AR. It's very likely this doc will spawn other, more technically detailed RFCs for the components discussed here, like integrating S3SIS into our pipeline to improve our caching. There would be a lot to cover if it was all included in this doc. And like you mentioned, there are still outstanding investigations that need to be performed to validate some of the assumptions in this doc. I can provide more data on our cost structure related to running the AR. I'll highlight that each job in the AR is executed on its own host, which incurs EC2/EBS costs, and provide details like the instance type, etc., and how the proposals in this doc plan to address it.
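The per-job cost structure described here could be summarized with a simple model like the following; the instance rate, volume size, and duration are hypothetical placeholders rather than the actual infrastructure figures to be shared.

```python
HOURS_PER_MONTH = 730  # approximate hours in a month, for prorating EBS cost

def ar_job_cost(instance_hourly_usd: float, job_hours: float,
                ebs_gb: float, ebs_usd_per_gb_month: float) -> float:
    """Rough per-job cost: EC2 time for the run plus the attached EBS volume,
    prorated for the job duration (assumes the volume only exists for the job)."""
    ec2 = instance_hourly_usd * job_hours
    ebs = ebs_gb * ebs_usd_per_gb_month * (job_hours / HOURS_PER_MONTH)
    return ec2 + ebs

# Made-up example: a 1.5 h build job on a $1.00/h host with a 500 GB volume.
print(f"${ar_job_cost(1.00, 1.5, 500, 0.08):.2f} per job")
```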
Reviewed the AR change plan and it looks good. Approved.