Surface fetch/pull/push kill_after_timeout and reset default to None #1340
Conversation
Alright, the test suite runs locally and gives the following coverage:
Thanks a lot for the initiative. With the offending release yanked, I believe we win some time to wait for @Yobmod to chime in on this. My preference, if I understand what's happening correctly, is to set the default timeout to a larger value.
Hi, I think setting the timeout default to None works better, rather than a larger number. Or it could be removed altogether.
FWIW, the one concern I have about surfacing the argument here is that I think there might be confusion between `timeout` and `kill_after_timeout`.
@Yobmod, @dwhswenson: I have reverted the default to None, please let me know what you think. @Byron, I will not be able to push to this branch for the next ~60 hours, so feel free to push to it if you want this to be finished up before Monday.
@sroet Thanks for your help with this and the incredibly quick turnaround time, and to everyone for sharing their thoughts to make this better.
To me it seems that the actual process termination happens in `AutoInterrupt._terminate()`, and it looks like it will succeed on non-Windows but might still fail to kill stalled processes on Windows. To my mind, that's acceptable, and I'd rather wait for users to report issues due to this than to keep the argument hidden.
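A minimal sketch of why termination is platform-dependent (illustrative only; `terminate_gracefully` and the process-group setup are assumptions, not GitPython's actual code):

```python
import os
import signal
import subprocess

def terminate_gracefully(proc: subprocess.Popen) -> None:
    # Illustrative helper, not GitPython's implementation.
    if os.name == "posix":
        # On POSIX, signalling the whole process group (if the child was
        # started with start_new_session=True) also reaches git's own
        # child processes, so stalled helpers get cleaned up too.
        os.killpg(os.getpgid(proc.pid), signal.SIGTERM)
    else:
        # On Windows, Popen.terminate() calls TerminateProcess on the
        # direct child only; grandchild processes it spawned may keep
        # running, which is why stalled processes can survive there.
        proc.terminate()
```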
This took a bit more work than expected, but should now work with the keyword `kill_after_timeout`.
That did not really work out, as most user-facing functions (`remote.pull`/`push`/`fetch`) […]
The reading threads had no way of handling the closing of the streams, so I added a check to the exception. It should now raise like this:

```
>>> a = git.Repo('.')
>>> a.remotes.origin.fetch(kill_after_timeout=0.01)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/sander/github_files/GitPython/git/remote.py", line 868, in fetch
    kill_after_timeout=kill_after_timeout)
  File "/home/sander/github_files/GitPython/git/remote.py", line 732, in _get_fetch_info_from_stderr
    proc.wait(stderr=stderr_text)
  File "/home/sander/github_files/GitPython/git/cmd.py", line 490, in wait
    raise GitCommandError(remove_password_if_present(self.args), status, errstr)
git.exc.GitCommandError: Cmd('git') failed due to: exit code(-15)
  cmdline: git fetch -v origin
  stderr: 'error: process killed because it timed out. kill_after_timeout=0.01 seconds'
```

When writing the tests, I noticed a mismatch in the error handling code. @Byron, is this mismatch intentional/wanted? I will update the initial post to indicate the current status.
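For illustration, a minimal sketch of the idea, assuming the reader thread treats a ValueError from a stream closed during termination as normal shutdown (`pump_stream` here is illustrative, not the exact GitPython helper):

```python
import threading

def pump_stream(stream, handler):
    # Illustrative reader-thread body: forward lines until EOF, but
    # tolerate the stream being closed underneath us when the main
    # thread kills the process after kill_after_timeout.
    try:
        for line in stream:
            handler(line)
    except ValueError as ex:
        # Iterating a file object closed by another thread raises
        # "ValueError: I/O operation on closed file." - swallow only
        # that case and re-raise anything else.
        if "closed file" not in str(ex):
            raise

# Usage: threading.Thread(target=pump_stream, args=(proc.stderr, print))
```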
Thanks a million for all the effort you are putting in!
It's based on the observation of how these processes usually work and communicate errors. However, if it feels out-of-place or wrong, there is no reason not to improve it while you are here. It wouldn't be the first time that there are long-standing issues in the code-base that somehow never surfaced. While I encourage these fixes, I also warn that the test-suite might not catch all the subtle behaviour either, so it's probably better to leave as much unchanged as possible while fixing the issue at hand.

Note that CI is failing because of the linter and whitespace-related errors. It's strange that it asks me to approve running CI on this PR over and over again; I hope it will just run when the fix is pushed.
See https://github.blog/2021-04-22-github-actions-update-helping-maintainers-combat-bad-actors/, in particular:
IMO, that last is a bad choice by GitHub. I understand the desire to avoid abuse by crypto-miners, but there should be an option for the approval to be for the duration of the PR.
Thanks for the heads-up, I wasn't aware of this change.
Couldn't agree more. Here it seems they are over-steering towards abuse-avoidance and trading away the convenience of using their platform. Instead, if legitimate PRs turn out to become abusive later, it's something I'd happily report - after it happened.
@Byron, sorry, I should have checked mypy and flake8 locally. They now return:
Looking at the test failure for 9678226, it seems like the CI environment is too quick.
Hmm, it started failing locally with `timeout=0`, so I reset it to a small value. Hopefully this should be small enough to at least let CI fail out; otherwise I would like a reviewer to see if the test also fails on their local system...
A bit of info on why the error is not raised but still shows up in the log.

This is because there is a race condition: in the time between the main thread triggering the timeout (which produces the log message) and telling the process to terminate (which should trigger the error by setting the status to non-zero), the process can complete correctly. In this case, I decided to let the code recover instead of erroring out.
It's interesting that CI still fails even though the timeout is small. Can there be a 'too small' value due to the same race elsewhere? Maybe the CI is onto something here and indicates some fixable issue; for now I'd be inclined to believe that.
@Byron the issue lies in the fact that the CI Python thread is too slow compared to the git process. I can force the CI behavior locally by altering the lines:

```python
for t in threads:
    t.join(timeout=kill_after_timeout)
    if t.is_alive():
        if isinstance(process, Git.AutoInterrupt):
            process._terminate()
```

to

```python
for t in threads:
    t.join(timeout=kill_after_timeout)
    if t.is_alive():
        if isinstance(process, Git.AutoInterrupt):
            # TODO debug code
            import time
            time.sleep(1)
            process._terminate()
```

What happens is that inside `_terminate()` the first thing it does is:

```python
# did the process finish already so we have a return code ?
try:
    if proc.poll() is not None:
        self.status = proc.poll()
        return None
```

which allows the process status to be the clean return code if git finished in the meantime. Even later there is another check that also allows the process to terminate gracefully if the kill signal is not fast enough.

Now, I can force-set the status to a non-zero value instead. @Byron, is that (not allowing the process to recover) the preferred solution here?

[EDIT]: addition; the GitCommandError is only raised (inside `wait()`) if the status is non-zero.
A robust implementation is preferred any time, and it seems like a system that can recover is the better option. Ultimately CI should work, and ideally work for the next few years without spurious failures due to races.
Added a class variable (`_status_code_if_terminate`) to `AutoInterrupt`; setting it to a non-zero value in tests prevents the race described above.
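A rough sketch of the mechanism as described (names taken from this PR's description; the real implementation in `git/cmd.py` may differ):

```python
class AutoInterrupt:
    # If set to a non-zero value, this overrides whatever status the
    # process reports in _terminate(), so tests can force the "killed"
    # code path even when git finishes inside the race window.
    _status_code_if_terminate: int = 0

    def __init__(self, proc, args):
        self.proc = proc
        self.args = args
        self.status = None

    def _terminate(self) -> None:
        proc = self.proc
        # Did the process finish already, so we have a return code?
        status = proc.poll()
        if status is not None:
            # Normally record the real (possibly clean) exit code, but
            # let a configured override win to avoid the race in tests.
            self.status = self._status_code_if_terminate or status
            return
        proc.terminate()
        self.status = self._status_code_if_terminate or proc.wait()
```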
Fantastic, it looks like this PR is ready to be merged. I'd love to have another reviewer though, maybe @Yobmod could take a final look?
I'd be happy to try a new release when merged, too.
@Byron they already have been:
Can imagine, thanks for dealing with the CI on this one!
(hopefully) closes #1339

This PR does the following as of e058c4c:

- rename `timeout` to `kill_after_timeout` and set the default to `None` instead of 10 s
- surface `kill_after_timeout` to user-facing functions (`remote.pull`/`push`/`fetch`); see the usage sketch after this list
- expose `AutoInterrupt` so we can `._terminate()` the underlying process and grab status codes from terminated processes
- add `AutoInterrupt._status_code_if_terminate`; if non-zero it overrides any status code from the process in `_terminate`. This can be used to prevent race conditions in tests, as described in this comment

Some history of the commits (while trying a different strategy):

- aafb300 changes the default timeout to 60 s
- 1d26515 allows `push`, `pull` and `fetch` a keyword `timeout` that will be propagated down
- febd4fe alters some tests to use the old 10 s timeout for `fetch` and `pull`
- 4113d01 also alters a test call to `pull` to add a timeout of 10 s
- c55a8e3 fixes the test suite
- 7df33f3 sets the default timeout to `None`; if the current default is too loose, users can now give their own timeout
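A hypothetical usage sketch of the surfaced argument, based on the traceback shown earlier in the thread (values are illustrative):

```python
import git

repo = git.Repo(".")
try:
    # None (the new default) disables the watchdog entirely;
    # a number is the time in seconds before the process is killed.
    repo.remotes.origin.fetch(kill_after_timeout=10)
except git.GitCommandError as err:
    # Raised from wait() when the killed process reports a
    # non-zero exit status, e.g. exit code(-15) for SIGTERM.
    print("fetch failed or was killed:", err)
```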