-
Notifications
You must be signed in to change notification settings - Fork 111
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
First batch of GitPython pruning #2902
Conversation
Codecov Report
@@ Coverage Diff @@
## master #2902 +/- ##
==========================================
+ Coverage 90.28% 90.29% +<.01%
==========================================
Files 246 246
Lines 31877 31887 +10
==========================================
+ Hits 28781 28791 +10
Misses 3096 3096
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if there is a performance impact/benefit. I expect that previous implementation used in memory db
This is needed in rerun.py.
The direct use of git.cmd.Git outside of GitRepo is discouraged.
rerun currently uses the GitPython api for this.
Using GitRepo.get_git_dir is a bit more awkward because it involves passing ds.repo as an argument to its own static method, but doing so avoids using git.cmd.Git outside of GitRepo.
Despite get_hexsha's docstring claiming to support "any type of Git object identifier", it doesn't actually support tags.
'git rev-parse' is arguably the idiomatic Git way to get a commit's hash, but using get_hexsha() is consistent with other parts of DataLad and eliminates a use of GitPython outside of GitRepo.
'git show --format=%H' will only print the hash of commit objects. For a tag, it will print the commit hash, but also tag information, so the downstream assert in get_hexsha() that expects one line will fail. For tree and blob IDs, --format is simply ignored. 'git rev-parse --verify' could be used for a more general "get this object's hash", but handling the "no commits yet" case would require an additional git call, so stay with 'git show', at least for now.
As described in the previous message, passing a tag fails because it outputs multiple lines, and downstream code expects just the single hash specified by "%H". Make a tag work as expected by casting it as a commit.
rerun.py uses GitPython to call 'git show'. Add a format_commit method to GitRepo so that we can use that instead.
Doing so will let us eliminate another use of GitPython in rerun.
9c000f6
to
703552c
Compare
Not sure. I'm not familiar with how gitpython uses the db internally and am not sure when it comes into play. (Though this seems like a more general concern for #2879.) The rev-parse call is probably the most relevant one for this pr. A crude test: $ python -m timeit -s "from datalad.support.gitrepo import GitRepo; gr = GitRepo('.')" "gr.get_hexsha()"
100 loops, best of 3: 8.79 msec per loop
$ python -m timeit -s "from datalad.support.gitrepo import GitRepo; gr = GitRepo('.')" "gr._git_custom_command('', ['git', 'rev-parse', 'HEAD'])"
100 loops, best of 3: 8.12 msec per loop
$ python -m timeit -s "from datalad.support.gitrepo import GitRepo; gr = GitRepo('.')" "gr.repo.git.rev_parse('HEAD')"
100 loops, best of 3: 7.71 msec per loop |
@kyleam Thanks for the test. Personally, I would be willing to pay a higher price for not having that DB. This is pretty much the only state information that we are carrying around (besides whether a repo has an annex or not). From my POV we are not benefitting from it much, and we actually had more then one issue with leakage/destructor calls in the past. My vote is on improving the statelessness of Dataset/Repo. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apart from the general attitude towards gitpython DB vs. straight git calls (see prev comment), this LGTM.
cmd = ['git', 'show', '--no-patch', "--format=%H"] | ||
if object: | ||
cmd.append(object) | ||
cmd = ['git', 'show', '-z', '--no-patch', '--format=' + fmt] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, another one without -z
. Thx for catching that!
Re generic "From my POV we are not benefitting from it much,", hopefully @bpoldrack could support my memory of an early analysis where we compared against straight git calls and concluded that performance benefit was notable. But that was in the past, and even git could have become faster, and the way we use it might have changed not requiring as much of git commits structure analysis |
Stepping back, my current understanding is that there is general agreement that GitPython should not be used outside of |
True
yeah, I just wondered if we would have any performance impact since GitPython's $> python -m timeit -s "from datalad.support.gitrepo import GitRepo; gr = GitRepo('.')" "gr._git_custom_command('', ['git', 'rev-parse', 'HEAD'])" 100 loops, best of 3: 16.1 msec per loop
$> python -m timeit -s "from datalad.support.gitrepo import GitRepo; gr = GitRepo('.')" "gr.repo.git.rev_parse('HEAD')"
100 loops, best of 3: 16.5 msec per loop so , it even gets better a tiny bit (should I finally get a new laptop? ;)), so seems to be a non-issue for this particular change, so we should go ahead with this one. |
FTR: First of all, I'm happy to see this change. Second I do remember some performance testing in the past, but the benefit from GitPython was notable only for a very small set of queries. I doubt it will notably affect our actual commands, since we do a lot more than querying for a reference. And most things just lead to direct git calls done by GitPython if you dig into it. Plus: Even where do use features that benefit from in memory DB, I think it's mostly just a single call that hardly compensates for the offset of creating that DB in the first place. Edit: |
Upgrade of [git-annex] to the most recent available to your release is advisable since a number of issues were resolved at that level. ### Major refactoring and deprecations - `datalad.consts.LOCAL_CENTRAL_PATH` constant was deprecated in favor of `datalad.locations.default-dataset` [configuration] variable ([#2835]) ### Minor refactoring - `"notneeded"` messages are no longer reported by default results renderer - [run] no longer shows commit instructions upon command failure when `explicit` is true and no outputs are specified ([#2922]) - `get_git_dir` moved into GitRepo ([#2886]) - `_gitpy_custom_call` removed from GitRepo ([#2894]) - `GitRepo.get_merge_base` argument is now called `commitishes` instead of `treeishes` ([#2903]) ### Fixes - [update] should not leave the dataset in non-clean state ([#2858]) and some other enhancements ([#2859]) - Fixed chunking of the long command lines to account for decorators and other arguments ([#2864]) - Progress bar should not crash the process on some missing progress information ([#2891]) - Default value for `jobs` set to be `"auto"` (not `None`) to take advantage of possible parallel get if in `-g` mode ([#2861]) - [wtf] must not crash if `git-annex` is not installed etc ([#2865]), ([#2865]), ([#2918]), ([#2917]) - Fixed paths (with spaces etc) handling while reporting annex error output ([#2892]), ([#2893]) - `__del__` should not access `.repo` but `._repo` to avoid attempts for reinstantiation etc ([#2901]) - Fix up submodule `.git` right in `GitRepo.add_submodule` to avoid added submodules being non git-annex friendly ([#2909]), ([#2904]) - [run-procedure] ([#2905]) - now will provide dataset into the procedure if called within dataset - will not crash if procedure is an executable without `.py` or `.sh` suffixes - Use centralized `.gitattributes` handling while setting annex backend ([#2912]) - `GlobbedPaths.expand(..., full=True)` incorrectly returned relative paths when called more than once ([#2921]) ### Enhancements and new features - Report progress on [clone] when installing from "smart" git servers ([#2876]) - Stale/unused `sth_like_file_has_content` was removed ([#2860]) - Enhancements to [search] to operate on "improved" metadata layouts ([#2878]) - Output of `git annex init` operation is now logged ([#2881]) - New - `GitRepo.cherry_pick` ([#2900]) - `GitRepo.format_commit` ([#2902]) - [run-procedure] ([#2905]) - procedures can now recursively be discovered in subdatasets as well. The uppermost has highest priority - Procedures in user and system locations now take precedence over those in datasets. * tag '0.11.0': Make it a 0.11.0 release since there were some API RFings REL: slight tune up to Changelog following the advices DOC: v0.10.4: Mention change in procedure precedence (a0cbcba) DOC: v0.10.4: Fix description of db715b7 DOC: v0.10.4: Improve description of 6f615a4 DOC: v0.10.4: Remove duplicate word
## 0.11.0 (Oct 23, 2018) -- Soon-to-be-perfect [git-annex] 6.20180913 (or later) is now required - provides a number of fixes for v6 mode operations etc. ### Major refactoring and deprecations - `datalad.consts.LOCAL_CENTRAL_PATH` constant was deprecated in favor of `datalad.locations.default-dataset` [configuration] variable ([#2835]) ### Minor refactoring - `"notneeded"` messages are no longer reported by default results renderer - [run] no longer shows commit instructions upon command failure when `explicit` is true and no outputs are specified ([#2922]) - `get_git_dir` moved into GitRepo ([#2886]) - `_gitpy_custom_call` removed from GitRepo ([#2894]) - `GitRepo.get_merge_base` argument is now called `commitishes` instead of `treeishes` ([#2903]) ### Fixes - [update] should not leave the dataset in non-clean state ([#2858]) and some other enhancements ([#2859]) - Fixed chunking of the long command lines to account for decorators and other arguments ([#2864]) - Progress bar should not crash the process on some missing progress information ([#2891]) - Default value for `jobs` set to be `"auto"` (not `None`) to take advantage of possible parallel get if in `-g` mode ([#2861]) - [wtf] must not crash if `git-annex` is not installed etc ([#2865]), ([#2865]), ([#2918]), ([#2917]) - Fixed paths (with spaces etc) handling while reporting annex error output ([#2892]), ([#2893]) - `__del__` should not access `.repo` but `._repo` to avoid attempts for reinstantiation etc ([#2901]) - Fix up submodule `.git` right in `GitRepo.add_submodule` to avoid added submodules being non git-annex friendly ([#2909]), ([#2904]) - [run-procedure] ([#2905]) - now will provide dataset into the procedure if called within dataset - will not crash if procedure is an executable without `.py` or `.sh` suffixes - Use centralized `.gitattributes` handling while setting annex backend ([#2912]) - `GlobbedPaths.expand(..., full=True)` incorrectly returned relative paths when called more than once ([#2921]) ### Enhancements and new features - Report progress on [clone] when installing from "smart" git servers ([#2876]) - Stale/unused `sth_like_file_has_content` was removed ([#2860]) - Enhancements to [search] to operate on "improved" metadata layouts ([#2878]) - Output of `git annex init` operation is now logged ([#2881]) - New - `GitRepo.cherry_pick` ([#2900]) - `GitRepo.format_commit` ([#2902]) - [run-procedure] ([#2905]) - procedures can now recursively be discovered in subdatasets as well. The uppermost has highest priority - Procedures in user and system locations now take precedence over those in datasets. * tag '0.11.0': CHANGELOG adjusted to reflect new minimal version of git-annex
This eliminates some of the GitPython API calls mentioned in #2879.
repo.git.rev_parse
calls through use and extension ofGitRepo.get_hexsha
and through newGitRepo.commit_exists
. (ece5229 explains the remaining two.)repo.git.show
calls with use of newGitRepo.format_commit
.repo.git_dir
uses.GitRepo.is_ancestor
and dropsrepo.git.merge_base
.