Skip to content

Conversation

@potiuk
Copy link
Member

@potiuk potiuk commented Jan 12, 2021

Changes the configuration of the cancel workflow action to skip schedule/push events from canceling. This has the effect that duplicates of all direct pushes (master merges or direct pushes to the spark repository are not cancelled)

What changes were proposed in this pull request?

Update to CI cancel policy to skip direct pushes. Duplicates will only be cancelled for Pull Requests.

Why are the changes needed?

Apparenlty the aggressive behavior of the cancel action which also cancels duplicate master builds is too agressive for spark community. This change spares merges to master and scheduled builds from duplicate checking (as a result all merges to master will be always build to completion).

The previous behavior of the action was that in case of subsequent merges to master, only the latest one was guaranteed to complete. Other changes could be cancelled before they complete to save job queue.

Does this PR introduce any user-facing change?

No, except if the master builds were somehow facing the users (but it's unlikely taking into account the ASF release policy).
There was a potential that some changes that could be detected by specific master merge failing could be detected later (in one of the subsequent builds) which could make investigation of the root cause of failure a bit more difficult, because it could have been introduced in one of the commits between two completed builds merges. But this is at most impacting the timeline of release close to release itself, not the release itself.

How was this patch tested?

This configuration parameter has been tested earlier in Airflow. We used it initially, but since our master builds are heavy and we have extensive tests in the PRs and investigation for failed builds is not as difficult we found that limiting the strain on Github Action and cancelling master builds was more important for the health of the whole ASF community and we removed that configuration.

Tested in https://github.com/potiuk/spark/runs/1688506527?check_suite_focus=true#step:2:46 where the action found other master builds in progress but did not add them as candidates to cancel.

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@HyukjinKwon
Copy link
Member

ok to test

@HyukjinKwon HyukjinKwon changed the title Disables canceling push/schedule workflows [SPARK-34053][INFRA][FOLLOW-UP] Disables canceling push/schedule workflows Jan 12, 2021
@potiuk
Copy link
Member Author

potiuk commented Jan 12, 2021

@HyukjinKwon
Copy link
Member

cc @dongjoon-hyun and @mik-laj FYI.

@HyukjinKwon
Copy link
Member

add to whitelist

Changes the configuration of the cancel workflow action to skip schedule/push events from canceling. This has the effect that duplicates of all direct pushes (master merges or direct pushes to the spark repository are not cancelled)
@potiuk potiuk force-pushed the skip-schedule-push-branches branch from dd54694 to 8cdd4b1 Compare January 12, 2021 14:02
@potiuk
Copy link
Member Author

potiuk commented Jan 12, 2021

Actually I also removed the "if pull_request". This is really useful because your "master" pushes will also trigger the cancel workflow that will keep on cancelling the PR duplicates. This helps in case the queue is full

@potiuk
Copy link
Member Author

potiuk commented Jan 12, 2021

You can see it in operation here: https://github.com/potiuk/spark/actions

And especially here: https://github.com/potiuk/spark/runs/1688506527?check_suite_focus=true#step:2:46

I had two builds running from master at the time when I pushed it and you will see that the action found them running but did not add them as candidates for cancelling because they were a 'push' type.

@HyukjinKwon
Copy link
Member

👌 sounds good!

@HyukjinKwon
Copy link
Member

I'm gonna merge and see how it goes. Other tests here are very unlikely related anyway.

@HyukjinKwon
Copy link
Member

Merged to master.

@HyukjinKwon
Copy link
Member

Let's monitor the builds and see how it goes ..

@potiuk
Copy link
Member Author

potiuk commented Jan 12, 2021

Let's monitor the builds and see how it goes ..

Let me know if you see any problems. The action has pretty comprehensive logs and tells exactly what it does, but if you see any problems I am happy to jump-in and help.

@HyukjinKwon
Copy link
Member

HyukjinKwon commented Jan 13, 2021

Looks like it cancels the current workflow running in each PR:

#31034

Screen Shot 2021-01-13 at 9 51 52 AM
Screen Shot 2021-01-13 at 9 51 49 AM

#31151

Screen Shot 2021-01-13 at 9 37 13 AM
Screen Shot 2021-01-13 at 9 37 09 AM

#31142

Screen Shot 2021-01-13 at 12 00 58 PM
Screen Shot 2021-01-13 at 12 00 50 PM

#29737

Screen Shot 2021-01-14 at 9 33 47 AM
Screen Shot 2021-01-14 at 9 33 51 AM

@HyukjinKwon
Copy link
Member

HyukjinKwon commented Jan 13, 2021

But it seems working fine in general though .. hm .. there seems a hole somewhere. let me monitor a bit more. cc @dongjoon-hyun, @maropu, @viirya, @Ngone51, @ulysses-you, @LuciferYang FYI.

@potiuk
Copy link
Member Author

potiuk commented Jan 13, 2021

I think it works as expected. Unfortunately GA have very poor filtering in the "actions" list but there is one trick - you can filter by "branch" and for PRs it works pretty well:

You will see here an example : SPARK-34092 branch was pushed twice and one of the two workflows managed to complete from the first push before it was re-pushed. The other workflow was still running when re-push happened and it was cancelled (as expected).

Screenshot from 2021-01-13 09-55-54

The second push failed however (but it was failing test not cancelling).

@potiuk
Copy link
Member Author

potiuk commented Jan 13, 2021

Also it's not easy to mtch the "canceling" workflow (unfortunately :( ) but you can scroll through the list of workflow and the cancelling workflow will be usually adjacent to the triggering workflow:

Screenshot from 2021-01-13 10-02-18

If you look at the logs of this worfkflow you will see what happened (hoever due the way cancel action works it might be another cancel job that was run slightly earlier from the queue).

In this particular case it is the previous 'cancel duplicates" job that triggered the cancel:

https://github.com/apache/spark/runs/1692933716?check_suite_focus=true

And you can see the logs telling exactly what happened:

Screenshot from 2021-01-13 10-05-20

Hope it helps :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants