-
-
Notifications
You must be signed in to change notification settings - Fork 56
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a automatic PR-revert utility for failing buildbots #535
Conversation
8677cac
to
ed4d20d
Compare
This will be a cronjob in the buildbot master server |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks fine to me on principle, but we need to work out the kinks by testing this out.
REPO_OWNER = "python" | ||
REPO_NAME = "cpython" | ||
FORK_OWNER = os.getenv("FORK_OWNER", REPO_OWNER) | ||
REPO_CLONE_DIR = Path("/tmp/cpython") # Replace with actual path |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what process do you plan to replace all the placeholder directories, emails, etc.?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I understand the question. What do you mean with “what process”?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you mean the values of the fork owner and the tokens and all of that I was planning to have a systemd launch script or a bash script with the environment variables and run that as the cron job.
if that is what you meant, do you have a better preference?
I tested it in python/cpython#124870 I will clean it up before we land now that I know you like the approach |
Signed-off-by: Pablo Galindo <pablogsal@gmail.com>
003f324
to
ed16034
Compare
Signed-off-by: Pablo Galindo <pablogsal@gmail.com>
ed16034
to
bc97a6f
Compare
I'd appreciate if you could write down the problem this solves. I wasn't the discussions for this, and it seems that there's a few steps of reasoning leading to this solution that aren't clear to me. A lot of the buildbot issues I see won't be caught by this script, since they test several commits: refleaks and other slow builders, high-activity events, or just an unrelated docs commit coming in at the same time. |
I have mixed feelings about this feature, create a revert PR when a buildbot fails. In my experience, we need some engineering to decide if a failure is random, a real regression, or something else. Also, when there is a regression, I prefer a discussion to decide what to do rather than just revert. Having a change revert is bad for the change author, it's not always needed. For me, that's more a technical solution for a human issue which should be solved by interacting, rather than having big "revert" hammer. |
This PR is the result of some discussion we had at the core developer sprint. The idea of the tool is that this PR opens (but doesn't merge) revert PRs when a builtbot builder has failed more than N times and it was green before and if the failure corresponds to a single commit. The idea is not to force the revert on people because a human still needs to merge the PR and the discussion needs to happen as well but to start the process in a way that's not someone opening a revert PR to someone else (it's a bot opening the revert PR). This won't caught everything but it will caught as many things we can be sure that are caused by a single commit and the PRs will be open by a bot and not a human, highlighting that this is an automated process to not have the CI broken. The problem that this solves is not having the CI red for a long time making fixes or reverts more challenging as time passes and the reverts are not clean anymore. We have discussed many times (not only in this last core dev sprint) that having the buildbot failing for a long time makes things much more difficult, specially when issues start to pile up and people are slow to act. |
For reference, this logic would not catch any of the 36 currently failing stable builders. For all of them, they either always failed or the first build of the most recent red streak has more/less than one associated Change. If the hammer does turn out to be reliable, I'd be worried about another problem: once people rely on it, its blind spots become CPython's blind spots. (That's a recent scar of course -- I caused this kind of issue with my daily reports!) All in all, I do agree we need to do something, but I'm skeptical about this script. Sorry for telling you after you wrote it! If we are to go ahead: the script won't hurt, and the best way to debug such three-way service integration is in production. Code-wise, ship it! :) The logic here looks good, and I'll take a look at integrating it in the “release” dashboard (which we'll need anyway to catch the multi-commit builds):
|
I'm a bit puzzled by the pushback here. This automation is implementing an existing policy that we haven't been enforcing very consistently:
This policy is even mentioned explicitly in Victor's unofficial devguide. Sure, the revert PRs this script makes will only solve a subset of the cases. But that's by design. It's a conservative approach to avoid producing PRs that revert changes we're not really sure about. And even in the case when a PR is created, we don't have to merge the revert. We can ping the original change author first, pointing to the PR, saying that "hey, look, there's an urgent thing for you to look at". |
I agree with that. The script does the right thing. The question is if it'll do the right thing often enough to be worth it. |
I have not had a chance to actually review the code, all thoughts here assume that it's perfect :) My biggest concern with this is just logistical; I'm not sure it should be part of the If it doesn't already do so, this should definitely trigger a build of the revert PR on the failing bot(s). |
Ok, we can place this on some new repo, no problem, but I worry about the discovery and reusability of the common bits. But I am open to that if this is the consensus.
I does not, I will modify it if we decide to go forward 👍 |
I'm neutral, I don't have a strong opinion on this PR :-) I'm afraid of two things. If the policy is applied blindly, contributors who worked weeks on a PR will see their PR reverted in 24h without having an opportunity to dig into the issue, whereas the fix can take 5 minutes. It can be disappointing and send the wrong signal to contributors: "your work is not worth it, it has been rejected". I'm also afraid of seeing tons of "revert" PRs being created automatically and someone will have to go through the list to decide which ones should be applied or not. Reviewing a revert PR requires to fully understand the issue which can mean: fixing the regression :-) Maybe things will be smooth and people will understand the need to revert automatically if no one is available to write a fix.
Right. But in practice, it's usually rare to revert, and easier to fix the regression. For me, a revert is like the last resort action. Well, I just wanted to share my concerns, I don't want to hold the work here. If you want to give it a try, go ahead, and we can adjust the process later depending on how it goes. |
Said differently: applying a revert in 24 hours is too short for most people availability. If you want a strict policy, maybe we should give a week to contributors to fix the issue. Usually, a red buildbot is not a major blocker issue and it can stay red a few days. |
Right, which is why I said nobody's first reaction will be to apply a proposed revert PR without talking to the person. In fact, the author is
This is false. Reverting = returning to a previously known state. If done quickly, there is minimal collateral that needs to be taken into account.
We can definitely refine the magic number of hours before which we apply the revert. The 24 hour period that you yourself implemented in the past was based on the following observations:
|
Nobody will merge the revert automatically, so there is no risk of things happening without human intervention. The PRs are filed and acting quickly is important to avoid piling up issues in the builedbot fleet that merge reverts more difficult and challenging for the people involved in the revert. We have being doing this by hand years. This just adds some help and makes the revert coming from a bot, which a lot of people in several core dev sprints and language summit talk thought it was a good idea. |
I won't block the PR, and I'll help deploying it if you make it my job -- just say so :) Since I think that this will need a human in the loop way more often than not, I'm instead investing time in a dashboard that shows the “Breaking build” and its associated change(s). Here's a very much work-in-progress static preview (as in, it won't automatically update). |
The signal is: your work was good because we merged it but unfortunately broke the CI and that has consequences to everyone in the project so we are opening this PR to potentially revert it to avoid further problems for everyone. If you are afraid about this we can tweak the revert message all that we want to make it clear why this is happening, the policy and the consequences but the reality of the situation is that not reverting has duplicated or triplicated the amount of work to find and fix CI problems for years and we have evidence of that (think of all the problems with ref leaks and crashes in sub interpreters in 3.11) |
This won't happen: the chances of all the conditions being true at the same time (so we are sure that the PR is not a false positive) its actually quite rare. I did a simulation and across this year it would have filed 4 revert PRs |
It's hard to say this when you have a working implementation ready, but: For 4 actions a year, I think that this is too much code & infrastructure to maintain. We do need to automate reverts, but the process needs a human in the loop in all but the simplest cases. I filed a plan here: python/core-workflow#559 |
Ok, I guess I will close the PR then. I do apologise if this sounds a bit negative but I don't have more energy to keep defending my views on this, specially since unfortunately I am not feeling that this is being a collaboration. |
No description provided.