Merging PRs takes very long time for large repository #601

Closed
2 of 6 tasks
typeless opened this issue Jan 6, 2017 · 19 comments · Fixed by #4921
Labels
type/bug, type/enhancement (An improvement of existing functionality)
Milestone

Comments

@typeless
Contributor

typeless commented Jan 6, 2017

  • Gitea version (or commit ref):

Gitea Version: 1.0.0+76-geb9ce39

  • Git version:
$ git --version
git version 2.7.4
  • Operating system:
$ cat /etc/os-release
NAME="Ubuntu"
VERSION="16.04.1 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.1 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
  • Database (use [x]):
    • PostgreSQL
    • MySQL
    • SQLite
  • Can you reproduce the bug at https://try.gitea.io:
    • Yes (provide example URL)
    • No, https://try.gitea.io will report out-of-memory error when pushing a large repository.
    • Not relevant

Description

When merging a PR, Gitea will clone the whole repository, which is unacceptably time-consuming when the repository is large.

See https://github.com/go-gitea/gitea/blob/master/models/pull.go#L285

--- Want to back this issue? **[Post a bounty on it!](https://www.bountysource.com/issues/40674560-merging-prs-takes-very-long-time-for-large-repository?utm_campaign=plugin&utm_content=tracker%2F47456670&utm_medium=issues&utm_source=github)** We accept bounties via [Bountysource](https://www.bountysource.com/?utm_campaign=plugin&utm_content=tracker%2F47456670&utm_medium=issues&utm_source=github).
@lunny added this to the 1.1.0 milestone Jan 6, 2017
@lunny added the type/bug and type/enhancement labels Jan 6, 2017
@bkcsoft
Member

bkcsoft commented Jan 9, 2017

Just changing the clone path to owner/repo.{{.Time}}.git would speed it up considerably, since a git clone on the same mount point just uses hardlinks; slapping on --local could speed it up further in some cases.

Further improvements include --single-branch or --depth 1 to fetch only the branch you're merging into (a sketch follows the quoted docs below).

--local, -l
When the repository to clone from is on a local machine, this flag bypasses the normal "Git aware" transport mechanism and clones the repository by making a copy of HEAD and everything under the objects and refs directories. The files under the .git/objects/ directory are hardlinked to save space when possible.

--no-hardlinks
Force the cloning process from a repository on a local filesystem to copy the files under the .git/objects directory instead of using hardlinks. This may be desirable if you are trying to make a back-up of your repository.
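
A hedged sketch combining those suggestions (repository paths and the branch name are assumed for illustration, not taken from Gitea's code):

```sh
# Hardlinked local clone of only the branch being merged into, placed at a
# unique temp path. Note --depth 1 is ignored for --local clones; a shallow
# clone would need a file:// URL instead.
git clone --local --single-branch --branch master \
  /repos/owner/repo.git "/tmp/owner/repo.$(date +%s).git"
```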

@typeless
Contributor Author

typeless commented Jan 10, 2017

http://stackoverflow.com/questions/10637584/how-to-do-a-non-fast-forward-git-merge-to-a-branch-that-isnt-checked-out might be useful as well.

Edit: I have a patch experimenting with that, will submit it later.
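
For context, the simplest variant from that discussion (a hedged sketch; branch names assumed) handles only the fast-forward case, updating a branch that isn't checked out without touching any working tree. A true non-fast-forward merge still needs the plumbing discussed further down.

```sh
# Fast-forward local branch "master" to "feature" without checking it out.
# git refuses a non-fast-forward update here unless the refspec is forced.
git fetch . feature:master
```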

@bkcsoft
Member

bkcsoft commented Jan 10, 2017

@typeless not sure if it works for bare repos though 🙁

@lunny modified the milestones: 1.2.0, 1.1.0 Mar 6, 2017
@lunny modified the milestones: 1.3.0, 1.2.0 May 21, 2017
@typeless
Contributor Author

For reference: https://gist.github.com/cool-RR/6575042

@lafriks modified the milestones: 1.3.0, 1.4.0 Oct 25, 2017
@lafriks modified the milestones: 1.4.0, 1.5.0 Jan 14, 2018
@lunny modified the milestones: 1.5.0, 1.6.0 May 11, 2018
@webjoel
Contributor

webjoel commented May 15, 2018

In my case it is a little slow too; the repo is very big. One solution would be to add a gauge/loading indicator as feedback to the user. Is that possible?

If so, it would also be good to add it when applying a merge to a pull request, which is very slow too and gives no feedback to the user, only the loading indicator in the browser title.

My repo size is 1.8 GB.

@lafriks
Member

lafriks commented May 15, 2018

If we could refactor so that there is no local clone operation on the server when merging, I think it would significantly decrease merging time. Or we could at least keep the clone permanently and just hard-reset it from origin when needed; in that case all operations on it must be protected by locking.
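
A rough sketch of that second idea (paths and branch are hypothetical; the locking itself would live in Gitea's code):

```sh
# Keep one permanent clone per repository and refresh it instead of
# recloning; everything here must run under a per-repository lock.
git -C /data/merge-cache/owner/repo fetch origin
git -C /data/merge-cache/owner/repo reset --hard origin/master
```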

@typeless
Contributor Author

I once attempted to fix the issue by combining some plumbing commands, but that approach had some critical flaws. It seems the better solution would be to fix it upstream in Git.

See also https://githubengineering.com/move-fast/

@jonasfranz jonasfranz modified the milestones: 1.6.0, 1.7.0 Sep 14, 2018
@jonasfranz jonasfranz modified the milestones: 1.7.0, 1.8.0 Dec 27, 2018
@zeripath
Contributor

zeripath commented Dec 27, 2018

I don't like that at present we're basically forced to create arbitrary paths on the server - it smells dangerous to me, even though we should have protected against climbing out of the root. I wouldn't be surprised if there were genuine issues on Windows regarding this, e.g. reserved filenames, filename length, reserved characters and the like. I could also imagine a filename encoding problem affecting Linux at some point. These problems similarly affect uploads.

I'm just gonna put some details about how git works down here so they don't disappear. (A rough shell sketch of these steps follows the list.)

  • How does git commit work?
    • Basically it uses the index created using git add and the like
    • exports GIT_AUTHOR_*
    • git write-tree with GIT_INDEX_FILE set appropriately writes the index as a tree object to the git object db; save the $tree_id it prints to stdout for the next step
    • git commit-tree $tree_id with the parent ids prepended with -p and commit message through stdin, outputs the $commit_id
    • git update-ref HEAD $commit_id $CURRENT_HEAD to ensure that when we change HEAD it hasn't actually been changed under our feet
    • if successful, git gc --auto runs, and post-commit hooks and some info are dealt with.
    • None of this actually depends on a working tree.
  • How does git add "$FILE" work?
    • The first step is to add the object to the db: git hash-object -w "$FILE" outputting the $BLOB_ID
    • This can be replicated by cat "$FILE_DATA" | git hash-object -w --path="$FILE_PATH" --stdin
    • You then need to add that object to the index, which is achieved by: git update-index --add --info-only "$FILE"
    • A functional equivalent is: git update-index --add --cacheinfo $MODE,$BLOB_ID,$FILE_PATH - meaning the file doesn't need to exist at the supposed path.
    • (You should first either git read-tree HEAD or git ls-tree -r $TREE | git update-index --index-info to set the index to match HEAD; otherwise the tree you commit will only contain what you've added.)
  • git rm is similar, and there's also --replace in the git update-index to force changes.
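
Pulling the steps above together, a minimal sketch of a commit made with no working tree at all (the file name, content source, and message are assumed for illustration):

```sh
# Work against a temporary index so no checkout is needed.
export GIT_INDEX_FILE=$(mktemp)
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
git read-tree HEAD                        # set the index to match HEAD

# "git add" via plumbing: store the blob, then register it in the index.
BLOB_ID=$(git hash-object -w --path="README.md" --stdin < /tmp/new-readme)
git update-index --add --cacheinfo "100644,$BLOB_ID,README.md"

# "git commit" via plumbing: tree -> commit -> ref, guarding against races.
TREE_ID=$(git write-tree)
CURRENT_HEAD=$(git rev-parse HEAD)
COMMIT_ID=$(echo "Update README" | git commit-tree "$TREE_ID" -p "$CURRENT_HEAD")
git update-ref HEAD "$COMMIT_ID" "$CURRENT_HEAD"   # fails if HEAD moved
rm -f "$GIT_INDEX_FILE"
```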

I need to look at how merge actually works - I think we should be able to generate the merged data directly.

Using plumbing it should be possible to actually not do a physical checkout at all and create an index in a bare repository that shares the object dbs of the repositories with the two HEADS we're merging from.

E.g. we create an index from HEAD_1, then merge each changed file in HEAD_2 into HEAD_1's version in turn - hashing each new object and updating the index as appropriate. We finally write the index as a tree, commit the tree, and update the refs before pushing. The files being merged don't need to be written to a working copy - just to a temporary file. Similarly, the shared repositories can live in a temp directory, and the index is a temporary file too. At no point do we actually try to check out a repo, and if a user wants to make their own repo difficult by sticking things in .git/ they can (you can't commit any path containing .git from the git CLI).
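As a hedged sketch of that idea (refs are assumed; only trivially mergeable trees resolve fully in the index, and real content conflicts would still need per-file merging into temporary files as described above):

```sh
export GIT_INDEX_FILE=$(mktemp)
BASE=$(git merge-base "$HEAD_1" "$HEAD_2")
# Three-way merge into the temporary index; -i avoids needing a work tree.
git read-tree -i -m --aggressive "$BASE" "$HEAD_1" "$HEAD_2"
TREE_ID=$(git write-tree)   # succeeds only if no conflicted entries remain
COMMIT_ID=$(echo "Merge" | git commit-tree "$TREE_ID" -p "$HEAD_1" -p "$HEAD_2")
```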


Finally got round to looking at the source code for git -- verify_path, which is run on adding to the index, disallows the committing of any .git directories (https://github.com/git/git/blob/7f4e64169352e03476b0ea64e7e2973669e491a2/read-cache.c#L951). Although I haven't checked completely, I suspect a malicious repo that contained a tree with a ".git" directory would be similarly protected by verify_path.

@typeless
Contributor Author

@zeripath Btw, I had a failed attempt following this line of thinking: #641.
Maybe it's actually feasible given sufficient knowledge of the git plumbing and internals.

@zeripath
Contributor

I think your current approach should reduce the load considerably. Certainly the use of shared object databases is extremely important and would be responsible for improvements in load. It would be worth a PR in itself. In fact, if you don't think #4921 is going to be ready in time for 1.7, I'd recommend doing exactly that. (It's worth noting that most repos have many more objects than checked-out files, and although hard links are relatively cheap, they still need to be created and still change inodes.)

Just thinking on: although you've shared the base repository's object database, when you add the remote, that doesn't get shared too. It is possible to set up a repo to share multiple object databases. Could it be that one of these repos has a much smaller object db, so when you fetch from the "remote" you actually copy a lot of objects unintentionally? This is likely to be an extreme edge case though.

Although libgit2 would likely provide an improvement, it still requires an in-depth understanding of the plumbing, and most of GitHub's improvements come from that deeper knowledge of plumbing. Yes, they saw benefits from de-scripting in the way that libgit2 allowed them to, but we still need to move off the porcelain and gain a deeper understanding of the plumbing before we can take that approach. Your current PR is a definite step towards deeper plumbing knowledge. I'm working on file upload at present and have a working - albeit not very pretty - solution for this.

It's also worth noting that the unique thing libgit2 provides is the in-memory index - the merging strategies come from this - but as the default index is a file, it may be that we don't need it.

@filipnavara
Contributor

> Certainly the use of shared object databases is extremely important and would be responsible for improvements in load.

We deployed that bit on our server a couple of months ago, and it makes the difference between totally unusable (i.e. the merge times out) and slow but usable.

Gitea's performance on Windows with our multi-gigabyte repositories is atrocious out of the box, but we run a forked version that makes it quite bearable. It uses go-git for accessing the repositories and often makes pages load more than 10x faster (in some cases more than 30x). I was not brave enough to use go-git for any write operations, especially since the merging algorithm is quite complex, but using the shared object database by itself saved ~30 seconds on each merge in 3+ GB repositories.

@lafriks
Member

lafriks commented Dec 28, 2018

@filipnavara why not share your changes back with us? I have also started to move parts of the read operations to go-git.

@filipnavara
Contributor

@lafriks I shared the links in other issues/PRs: https://github.com/filipnavara/gitea/commits/commitgraph (also the perf-read branch). I will rebase it when I get back after the new year.

@lafriks
Member

lafriks commented Dec 28, 2018

Nice, will take a look again. Did the perf improvements get merged into go-git?

@filipnavara
Contributor

filipnavara commented Dec 28, 2018

Yep, go-git has merged my performance fixes.

The branch with the serialized commit tree is stale at the moment, but it is not that important anyway. We use it on our server, but its performance improvements are relatively small compared to the rest of the changes.

@typeless
Contributor Author

> Just thinking on, although you've shared the base repository object database, when you add the remote that doesn't get shared too. It is possible to set up a repo to share multiple object databases. Could it be that one of these repos could have a much smaller object db, so when you fetch on the "remote" you actually copy a lot of objects unintentionally? This is likely to be an extreme edge case though.

@zeripath Adding the path of the head repository to .git/objects/info/alternates of the staging repository can probably address the problem of redundant git-fetch.
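
A hedged sketch of that (paths and refs assumed): once the head repo's object directory is listed in the staging repo's alternates file, a fetch from it transfers essentially no objects and mostly just creates the ref.

```sh
# Make the staging repo borrow objects from the head repository.
echo "/repos/head-owner/repo.git/objects" \
  >> /repos/staging/repo.git/objects/info/alternates
# This fetch now copies almost nothing.
git --git-dir=/repos/staging/repo.git fetch \
  /repos/head-owner/repo.git refs/heads/feature:refs/heads/feature
```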

@6543
Member

6543 commented Jan 11, 2020

@guillep2k
Member

> there is a bounty on this issue: https://www.bountysource.com/issues/40674560-merging-prs-takes-very-long-time-for-large-repository

It's closed. It's from 2017. 😄

@6543
Member

6543 commented Jan 11, 2020

it can be claimed ... that's what I'd like to remind people of

@go-gitea locked and limited conversation to collaborators Nov 24, 2020