Skip to content

GitPython repo.index.commit() spawns persistent git.exe instance, holds handles to repo #553

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
ghost opened this issue Dec 2, 2016 · 6 comments

Comments

@ghost
Copy link

ghost commented Dec 2, 2016

I am trying to use GitPython for some repo manipulation, but ran into issues with my app, with handles open where i wouldn't expect.

Bug-jarring the issue, it seems that calling repo.index.commit() results in a several (4, consistently) git.exe processes being spawned, each holding a handle to the repo's root directory newRepo. When the test below goes to delete this TempDir, the processes are still up, causing a failure on context-manager __exit__(). My app occasionally needs to do a similar cleanup, so hits the same issue.

On the one hand, it looks like there is no contect-manager capable repo-wrapper, meaning it makes sense if some resources is open, it will not typically be cleaned/GC'd before the tempdir __exit__(). On the other hand - if Repo is going to behave like that, it really should not persist resources that have such side-effects.

Here is a working unittest:

import unittest
import git
import tempfile
import os.path

class Test(unittest.TestCase):

    def testCreateRepo(self):
        with tempfile.TemporaryDirectory(prefix=(__loader__.name) + "_") as mydir:

            # MAKE NEW REPO 
            repo = git.Repo.init(path=os.path.join(mydir, "newRepo"), mkdir=True)
            self.assertTrue(os.path.isdir(os.path.join(repo.working_dir, ".git")), "Failed to make new repo?")
            
            # MAKE FILE, COMMIT REPO
            testFileName = "testFile.txt"
            open(os.path.join(repo.working_dir, testFileName) , "w").close()
            repo.index.add([testFileName])
            self.assertTrue(repo.is_dirty())
            
            #### 
            # COMMENTING THIS OUT --> TEST PASSES
            repo.index.commit("added initial test file") 
            self.assertFalse(repo.is_dirty())
            #### 
            
            # adding this does not affect the handle
            git.cmd.Git.clear_cache()
            
            
            print("done") # exception thrown right after this, on __exit__
            # i can also os.walk(topdown=False) and delete all files and dirs (including .git/)
            # it is just the  newRepo/  folder itself who's handle is held open
            
if __name__ == '__main__':
    unittest.main()

PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\Users\%USER%\AppData\Local\Temp\EXAMPLE_gitpython_v3kbrly_\newRepo'

digging a little deeper, it seems that gitPython spawns multiple instances of git.exe processes, and each of them holds a handle to the root folder of the repo newRepo.

  • set a breakpoint immediately before the error, use sysinternals/handle to see open handles to newRepo ... git.exe (4 separate PID's of git.exe to be precise)
  • using sysinternals/procexp i can see that that they are all spawned from the python instance
    -- I'm typically running this from PyDev, but i verified the issue reproduces under vanila command line invocation of python.exe as well
  • the exception indicates it's a handle to newRepo being held. Adding a little extra code to the above I think that is the only handle held. I am able to successfully os.remove/os.rmdir() every dir and file, including all of .git/; and i finally manually recreate the issue seen on exit() in my example when i os.rmdir(newRepo)

stepping through, it's the call to repo.index.commit() that actually leads to the the git.exe(s) being spawned.

@ankostis
Copy link
Contributor

ankostis commented Dec 2, 2016

That's a quagmire of bugs!

  1. Each git-repo instance indeed holds persistent git commands, that are supposed to be cleared by repo.clear_cache(). Otherwise, on Windows only, these commands prevent the repo-dir from being deleted.
    So you have to invoke repo.clear_cache().
    But that alone won't work unless you have garbage-collect first!

  2. Additionally, the temp-file still may not be deleted because it contains read-only files (the blobs in .git/objects), so you need extra code for that. This code exists in git.util.rmtree() but currently git.util modules is being masked by git.index.util module, due to a bug, so you cannot use it - you have to copy it :-(

  3. Finally, your code also had an error, you have to invoke clear_cache() on the repo instance, not directly on Git class.

So now the code becomes:

import unittest
import gc
import git
import tempfile
import os.path
import shutil
import stat


def rmtree(path):
    """Remove the given recursively.

    :note: we use shutil rmtree but adjust its behaviour to see whether files that
        couldn't be deleted are read-only. Windows will not remove them in that case"""

    def onerror(func, path, exc_info):
        # Is the error an access error ?
        os.chmod(path, stat.S_IWUSR)
        print('dfsffdss')
        try:
            func(path)  # Will scream if still not possible to delete.
        except Exception as ex:
            raise

    return shutil.rmtree(path, False, onerror)


class Test(unittest.TestCase):

    def testCreateRepo(self):
        with tempfile.TemporaryDirectory(prefix=(__loader__.name) + "_") as mydir:

            # MAKE NEW REPO
            repo = git.Repo.init(path=os.path.join(mydir, "newRepo"), mkdir=True)
            try:
                self.assertTrue(os.path.isdir(os.path.join(repo.working_dir, ".git")), "Failed to make new repo?")

                # MAKE FILE, COMMIT REPO
                testFileName = "testFile.txt"
                open(os.path.join(repo.working_dir, testFileName) , "w").close()
                repo.index.add([testFileName])
                self.assertTrue(repo.is_dirty())

                repo.index.commit("added initial test file")
                self.assertFalse(repo.is_dirty())

                print("done")

            finally:
                gc.collect()
                repo.git.clear_cache()
                rmtree(repo.git_dir)

if __name__ == '__main__':
    unittest.main()

Please report if everything has been solved.

@ghost
Copy link
Author

ghost commented Dec 2, 2016

@ankostis i do see the test passing now, thank you for your quick response. Definitely an error not to be calling clear_cache on my end, too - thanks for pointing that out.

What are your thoughts on if this merits a fix/improvement in gitpython?

IMHO, this is something that does merit a change. By least-surprise, i think the user rightly expects that IFF repo does not implement context-manager, then it doesn't have side effects like persistent ownership of the repo. "Caching" is this case is not something that should have visible effects (beyond performance) for the user. And exposing gc.collect() as part of the 'cleanup api' is definitely not least-surprise.

Many ways to solve this; it seems to me that adding a context manager (or at least a "release()" function) might be the most usable, and would would make the behavior discoverable.

As a minor side point, i wasn't able to find any documentation that would definitively indicate clear_cache was actually necessary here, i just guessed as it was the closest thing to 'cleanup()' i could find.

@ankostis
Copy link
Contributor

ankostis commented Dec 2, 2016

We would have fixed it yesterday, if we could :-)

Just search for tag.leaks to get an idea of the invested effort to reach to the point where git-python runs decently in Windows on PY34+.

The 1st and easiest fix is to retrofit git.Repo as a context-manager, so instead of try..finally you would use it as with Repo() as repo: ....

@ghost
Copy link
Author

ghost commented Dec 2, 2016

I think the fix @ankostis (turns out - previously) has developed adding context mgr to Repo will be sufficient to close this bug.

@ankostis I'm new to github, don't see an easy way to label / vote for your fix ankostis@47f6442 to be included in gitpython 2.1.1 in the main gitpython repo?

Thanks again for your quick response & resolution here.

@ankostis
Copy link
Contributor

ankostis commented Dec 8, 2016

Assuming #555 gets merged in the next release, in your case the code would become simply like that:

class Test(unittest.TestCase):

    def testCreateRepo(self):
        with tempfile.TemporaryDirectory(prefix=(__loader__.name) + "_") as mydir:
            with git.Repo.init(path=os.path.join(mydir, "newRepo"), mkdir=True) as repo:
                self.assertTrue(os.path.isdir(os.path.join(repo.working_dir, ".git")), "Failed to make new repo?")

                # MAKE FILE, COMMIT REPO
                testFileName = "testFile.txt"
                open(os.path.join(repo.working_dir, testFileName) , "w").close()
                repo.index.add([testFileName])
                self.assertTrue(repo.is_dirty())

                repo.index.commit("added initial test file")
                self.assertFalse(repo.is_dirty())

                print("done")

@Byron
Copy link
Member

Byron commented Dec 8, 2016

@mboard182 It looks like the context-manager addition is in-flight and might make it in the release planned for today.

emonty added a commit to emonty/GitPython that referenced this issue Mar 8, 2017
In issue gitpython-developers#553 it was originally mentioned that the gc.collect was
required before clearing the cache, but in the code the collect happens
after the cache clear. Move the first gc.collect to before the cache to
be consistent with the original description.
@Byron Byron closed this as completed in 454feda Nov 2, 2017
riley-martine pushed a commit to riley-martine/GitPython that referenced this issue Dec 7, 2023
Under Windows, tempfile objects are holding references to open files
until the garbage collector closes them and frees them.  Explicit
calls to gc.collect() were added to the finalizer for the Repo class
to force them to be closed synchronously.  However, this is expensive,
especially in large, long-running programs.  As a temporary measure
to alleviate the performance regression on other platforms, only
perform these calls when running under Windows.

Fixes gitpython-developers#553
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

2 participants