Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

First batch of GitPython pruning #2902

Merged
merged 15 commits into from
Oct 8, 2018

Conversation

kyleam
Copy link
Contributor

@kyleam kyleam commented Oct 6, 2018

This eliminates some of the GitPython API calls mentioned in #2879.

  • Eliminates nearly all repo.git.rev_parse calls through use and extension of GitRepo.get_hexsha and through new GitRepo.commit_exists. (ece5229 explains the remaining two.)
  • Removes all repo.git.show calls with use of new GitRepo.format_commit.
  • Removes all repo.git_dir uses.
  • Add GitRepo.is_ancestor and drops repo.git.merge_base.

  • This change is complete

@mih mih mentioned this pull request Oct 6, 2018
25 tasks
@codecov
Copy link

codecov bot commented Oct 6, 2018

Codecov Report

Merging #2902 into master will increase coverage by <.01%.
The diff coverage is 97.61%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2902      +/-   ##
==========================================
+ Coverage   90.28%   90.29%   +<.01%     
==========================================
  Files         246      246              
  Lines       31877    31887      +10     
==========================================
+ Hits        28781    28791      +10     
  Misses       3096     3096
Impacted Files Coverage Δ
datalad/support/tests/test_repodates.py 100% <ø> (ø) ⬆️
datalad/interface/run.py 100% <100%> (ø) ⬆️
datalad/interface/rerun.py 96.26% <100%> (-0.19%) ⬇️
datalad/interface/tests/test_save.py 100% <100%> (ø) ⬆️
datalad/interface/tests/test_run.py 99.83% <100%> (ø) ⬆️
datalad/support/tests/test_gitrepo.py 100% <100%> (ø) ⬆️
datalad/support/gitrepo.py 88.93% <95.83%> (+0.2%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a98a016...703552c. Read the comment docs.

Copy link
Member

@yarikoptic yarikoptic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if there is a performance impact/benefit. I expect that previous implementation used in memory db

kyleam added 15 commits October 7, 2018 10:53
This is needed in rerun.py.
The direct use of git.cmd.Git outside of GitRepo is discouraged.
rerun currently uses the GitPython api for this.
Using GitRepo.get_git_dir is a bit more awkward because it involves
passing ds.repo as an argument to its own static method, but doing so
avoids using git.cmd.Git outside of GitRepo.
Despite get_hexsha's docstring claiming to support "any type of Git
object identifier", it doesn't actually support tags.
'git rev-parse' is arguably the idiomatic Git way to get a commit's
hash, but using get_hexsha() is consistent with other parts of DataLad
and eliminates a use of GitPython outside of GitRepo.
'git show --format=%H' will only print the hash of commit objects.
For a tag, it will print the commit hash, but also tag information, so
the downstream assert in get_hexsha() that expects one line will fail.
For tree and blob IDs, --format is simply ignored.

'git rev-parse --verify' could be used for a more general "get this
object's hash", but handling the "no commits yet" case would require
an additional git call, so stay with 'git show', at least for now.
As described in the previous message, passing a tag fails because it
outputs multiple lines, and downstream code expects just the single
hash specified by "%H".  Make a tag work as expected by casting it as
a commit.
rerun.py uses GitPython to call 'git show'.  Add a format_commit
method to GitRepo so that we can use that instead.
Doing so will let us eliminate another use of GitPython in rerun.
@kyleam kyleam force-pushed the enh-degitpython-rerun branch from 9c000f6 to 703552c Compare October 7, 2018 14:55
@kyleam
Copy link
Contributor Author

kyleam commented Oct 7, 2018

I wonder if there is a performance impact/benefit. I expect that previous implementation used in memory db

Not sure. I'm not familiar with how gitpython uses the db internally and am not sure when it comes into play. (Though this seems like a more general concern for #2879.)

The rev-parse call is probably the most relevant one for this pr. A crude test:

$ python -m timeit -s "from datalad.support.gitrepo import GitRepo; gr = GitRepo('.')" "gr.get_hexsha()"
100 loops, best of 3: 8.79 msec per loop

$ python -m timeit -s "from datalad.support.gitrepo import GitRepo; gr = GitRepo('.')" "gr._git_custom_command('', ['git', 'rev-parse', 'HEAD'])"
100 loops, best of 3: 8.12 msec per loop

$ python -m timeit -s "from datalad.support.gitrepo import GitRepo; gr = GitRepo('.')" "gr.repo.git.rev_parse('HEAD')"
100 loops, best of 3: 7.71 msec per loop

@mih
Copy link
Member

mih commented Oct 7, 2018

@kyleam Thanks for the test. Personally, I would be willing to pay a higher price for not having that DB. This is pretty much the only state information that we are carrying around (besides whether a repo has an annex or not). From my POV we are not benefitting from it much, and we actually had more then one issue with leakage/destructor calls in the past.

My vote is on improving the statelessness of Dataset/Repo.

Copy link
Member

@mih mih left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apart from the general attitude towards gitpython DB vs. straight git calls (see prev comment), this LGTM.

cmd = ['git', 'show', '--no-patch', "--format=%H"]
if object:
cmd.append(object)
cmd = ['git', 'show', '-z', '--no-patch', '--format=' + fmt]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, another one without -z. Thx for catching that!

@yarikoptic
Copy link
Member

Re generic "From my POV we are not benefitting from it much,", hopefully @bpoldrack could support my memory of an early analysis where we compared against straight git calls and concluded that performance benefit was notable. But that was in the past, and even git could have become faster, and the way we use it might have changed not requiring as much of git commits structure analysis

@kyleam
Copy link
Contributor Author

kyleam commented Oct 8, 2018

Stepping back, my current understanding is that there is general agreement that GitPython should not be used outside of GitRepo, and whether we should remove GitPython altogether is a separate issue. If that is true, I fail to see how this peformance discussion pertains to this PR since converting repo.git calls to GitRepo method calls would seem the clear direction to go. Instead, the question would be whether the relevant GitRepo method (here GitRepo.get_hexsha) should use GitPython.

@yarikoptic
Copy link
Member

Stepping back, my current understanding is that there is general agreement that GitPython should not be used outside of GitRepo, and whether we should remove GitPython altogether is a separate issue.

True

If that is true, I fail to see how this peformance discussion pertains to this PR since converting repo.git calls to GitRepo method calls would seem the clear direction to go. Instead, the question would be whether the relevant GitRepo method (here GitRepo.get_hexsha) should use GitPython.

yeah, I just wondered if we would have any performance impact since GitPython's .git.rev_parse was replaced with get_hexsha which does git call... FWIW here is timing from my laptop on datalad code repo

$> python -m timeit -s "from datalad.support.gitrepo import GitRepo; gr = GitRepo('.')" "gr._git_custom_command('', ['git', 'rev-parse', 'HEAD'])"                                                                              100 loops, best of 3: 16.1 msec per loop
$> python -m timeit -s "from datalad.support.gitrepo import GitRepo; gr = GitRepo('.')" "gr.repo.git.rev_parse('HEAD')"
100 loops, best of 3: 16.5 msec per loop

so , it even gets better a tiny bit (should I finally get a new laptop? ;)), so seems to be a non-issue for this particular change, so we should go ahead with this one.

@yarikoptic yarikoptic merged commit e8f0e44 into datalad:master Oct 8, 2018
@kyleam kyleam deleted the enh-degitpython-rerun branch October 8, 2018 15:36
@bpoldrack
Copy link
Member

bpoldrack commented Oct 9, 2018

FTR: First of all, I'm happy to see this change. Second I do remember some performance testing in the past, but the benefit from GitPython was notable only for a very small set of queries. I doubt it will notably affect our actual commands, since we do a lot more than querying for a reference. And most things just lead to direct git calls done by GitPython if you dig into it. Plus: Even where do use features that benefit from in memory DB, I think it's mostly just a single call that hardly compensates for the offset of creating that DB in the first place.

Edit:
To underline the last point: Remember how we benefited from making the instantiation of GitPython Repo lazy?

@yarikoptic yarikoptic modified the milestone: Release 0.10.4 Oct 22, 2018
@yarikoptic yarikoptic modified the milestone: Release 0.10.4 Oct 22, 2018
yarikoptic added a commit that referenced this pull request Oct 24, 2018
 Upgrade of [git-annex] to the most recent available to your release is
 advisable since a number of issues were resolved at that level.

 ### Major refactoring and deprecations

 - `datalad.consts.LOCAL_CENTRAL_PATH` constant was deprecated in favor
   of `datalad.locations.default-dataset` [configuration] variable
   ([#2835])

 ### Minor refactoring

 - `"notneeded"` messages are no longer reported by default results
   renderer
 - [run] no longer shows commit instructions upon command failure when
   `explicit` is true and no outputs are specified ([#2922])
 - `get_git_dir` moved into GitRepo ([#2886])
 - `_gitpy_custom_call` removed from GitRepo ([#2894])
 - `GitRepo.get_merge_base` argument is now called `commitishes` instead
   of `treeishes` ([#2903])

 ### Fixes

 - [update] should not leave the dataset in non-clean state ([#2858])
   and some other enhancements ([#2859])
 - Fixed chunking of the long command lines to account for decorators
   and other arguments ([#2864])
 - Progress bar should not crash the process on some missing progress
   information ([#2891])
 - Default value for `jobs` set to be `"auto"` (not `None`) to take
   advantage of possible parallel get if in `-g` mode ([#2861])
 - [wtf] must not crash if `git-annex` is not installed etc ([#2865]),
   ([#2865]), ([#2918]), ([#2917])
 - Fixed paths (with spaces etc) handling while reporting annex error
   output ([#2892]), ([#2893])
 - `__del__` should not access `.repo` but `._repo` to avoid attempts
   for reinstantiation etc ([#2901])
 - Fix up submodule `.git` right in `GitRepo.add_submodule` to avoid
   added submodules being non git-annex friendly ([#2909]), ([#2904])
 - [run-procedure] ([#2905])
   - now will provide dataset into the procedure if called within dataset
   - will not crash if procedure is an executable without `.py` or `.sh`
     suffixes
 - Use centralized `.gitattributes` handling while setting annex backend
   ([#2912])
 - `GlobbedPaths.expand(..., full=True)` incorrectly returned relative
    paths when called more than once ([#2921])

 ### Enhancements and new features

 - Report progress on [clone] when installing from "smart" git servers
   ([#2876])
 - Stale/unused `sth_like_file_has_content` was removed ([#2860])
 - Enhancements to [search] to operate on "improved" metadata layouts
   ([#2878])
 - Output of `git annex init` operation is now logged ([#2881])
 - New
   - `GitRepo.cherry_pick` ([#2900])
   - `GitRepo.format_commit` ([#2902])
 - [run-procedure] ([#2905])
   - procedures can now recursively be discovered in subdatasets as well.
     The uppermost has highest priority
   - Procedures in user and system locations now take precedence over
     those in datasets.

* tag '0.11.0':
  Make it a 0.11.0 release since there were some API RFings
  REL: slight tune up to Changelog following the advices
  DOC: v0.10.4: Mention change in procedure precedence (a0cbcba)
  DOC: v0.10.4: Fix description of db715b7
  DOC: v0.10.4: Improve description of 6f615a4
  DOC: v0.10.4: Remove duplicate word
yarikoptic added a commit that referenced this pull request Oct 24, 2018
 ## 0.11.0 (Oct 23, 2018) -- Soon-to-be-perfect

 [git-annex] 6.20180913 (or later) is now required - provides a number of
 fixes for v6 mode operations etc.

 ### Major refactoring and deprecations

 - `datalad.consts.LOCAL_CENTRAL_PATH` constant was deprecated in favor
   of `datalad.locations.default-dataset` [configuration] variable
   ([#2835])

 ### Minor refactoring

 - `"notneeded"` messages are no longer reported by default results
   renderer
 - [run] no longer shows commit instructions upon command failure when
   `explicit` is true and no outputs are specified ([#2922])
 - `get_git_dir` moved into GitRepo ([#2886])
 - `_gitpy_custom_call` removed from GitRepo ([#2894])
 - `GitRepo.get_merge_base` argument is now called `commitishes` instead
   of `treeishes` ([#2903])

 ### Fixes

 - [update] should not leave the dataset in non-clean state ([#2858])
   and some other enhancements ([#2859])
 - Fixed chunking of the long command lines to account for decorators
   and other arguments ([#2864])
 - Progress bar should not crash the process on some missing progress
   information ([#2891])
 - Default value for `jobs` set to be `"auto"` (not `None`) to take
   advantage of possible parallel get if in `-g` mode ([#2861])
 - [wtf] must not crash if `git-annex` is not installed etc ([#2865]),
   ([#2865]), ([#2918]), ([#2917])
 - Fixed paths (with spaces etc) handling while reporting annex error
   output ([#2892]), ([#2893])
 - `__del__` should not access `.repo` but `._repo` to avoid attempts
   for reinstantiation etc ([#2901])
 - Fix up submodule `.git` right in `GitRepo.add_submodule` to avoid
   added submodules being non git-annex friendly ([#2909]), ([#2904])
 - [run-procedure] ([#2905])
   - now will provide dataset into the procedure if called within dataset
   - will not crash if procedure is an executable without `.py` or `.sh`
     suffixes
 - Use centralized `.gitattributes` handling while setting annex backend
   ([#2912])
 - `GlobbedPaths.expand(..., full=True)` incorrectly returned relative
    paths when called more than once ([#2921])

 ### Enhancements and new features

 - Report progress on [clone] when installing from "smart" git servers
   ([#2876])
 - Stale/unused `sth_like_file_has_content` was removed ([#2860])
 - Enhancements to [search] to operate on "improved" metadata layouts
   ([#2878])
 - Output of `git annex init` operation is now logged ([#2881])
 - New
   - `GitRepo.cherry_pick` ([#2900])
   - `GitRepo.format_commit` ([#2902])
 - [run-procedure] ([#2905])
   - procedures can now recursively be discovered in subdatasets as well.
     The uppermost has highest priority
   - Procedures in user and system locations now take precedence over
     those in datasets.

* tag '0.11.0':
  CHANGELOG adjusted to reflect new minimal version of git-annex
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants