
Homebrew has a ~350mb footprint #10557

Closed
darajava opened this issue Feb 7, 2021 · 18 comments
Labels: features (New features), outdated (PR was locked due to age)

Comments

darajava commented Feb 7, 2021

Provide a detailed description of the proposed feature

Homebrew installation takes up about 350 MB of space and bandwidth. This is due (from what I can see) to the git clone command cloning the entire repo, including its history.

git clone --depth 1 https://github.com/user/repo.git # (see https://stackoverflow.com/questions/1209999/using-git-to-get-just-the-latest-revision)

Updating to this command will only clone the most recent commit. Is there a reason we need this history?

I have tried to install brew this way (git clone https://github.com/Homebrew/brew ~/.linuxbrew/Homebrew --depth=1) and it seems to work without issue (and it only costs 2.73 MiB, about 128 times smaller).

What is the motivation for the feature?

For people on slow or limited connections, installing Homebrew is a pain and quite slow. I believe similar git cloning (i.e. including the history) is done for packages installed by Homebrew too, but I actually can't see the git command used for cloning.

It should also be a really easy and quick fix: simply adding --depth=1 to the git commands should do it.

How will the feature be relevant to at least 90% of Homebrew users?

This should be relevant for 100% of Homebrew users (even ones on a fast connection) because all packages will be downloaded (much) faster and take up (much) less space.

What alternatives to the feature have been considered?

None that I'm aware of. Maybe I'm wrong about this and there is a reason you include history (I tried going through related issues but couldn't find the reason for cloning history), but this is so painful for me, working on an 80 GB-per-month data allowance. Perhaps it could even be added as an option?

darajava added the "features" label Feb 7, 2021
scpeters (Member) commented Feb 7, 2021

The initial clone is smaller, but subsequent git fetch operations are more computationally expensive for the server, so GitHub has requested that we not use shallow clones (see the discussion in #9383).

Rylan12 (Member) commented Feb 7, 2021

GitHub has asked that we not use shallow clones, as they are extremely taxing for their servers. As a result, we will not be adding --depth=1 to the git clone commands.

We are investigating other ways to improve this, though. We're looking at using blobless or treeless clones. These solutions are promising but not quite ready yet. GitHub is working to improve these clone options which we could eventually use in Homebrew. We are also looking at restructuring our repos to reduce the size of tree objects which would help to reduce the repository size.
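
For anyone who wants to experiment, blobless and treeless clones are standard git features (the partial-clone --filter options), so they can be tried today; using the Homebrew repo as an example:

    # Blobless clone: fetch all commits and trees up front, but download
    # file contents (blobs) lazily, only when they're actually needed
    git clone --filter=blob:none https://github.com/Homebrew/brew

    # Treeless clone: fetch only the commit history up front; trees and
    # blobs are fetched on demand (smallest initial clone, slower later)
    git clone --filter=tree:0 https://github.com/Homebrew/brew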

Rylan12 closed this as completed Feb 7, 2021
darajava (Author) commented Feb 7, 2021

That's so interesting, thanks guys. I have no idea why a shallow clone should be so taxing, but hopefully they will fix it soon.

Rylan12 (Member) commented Feb 7, 2021

I'm no expert but I believe it has to do with the way "deltas" (i.e. the difference between files) are calculated.

In a fetch from a full clone, the server can assume that the local repo has all the necessary files (even from previous points in history). Thus, to reduce the amount of data transferred, the server can provide "deltas" (i.e. diffs) between existing files and the new files. The local client can then "apply" those deltas to the appropriate files in order to get the new data. The server just has to calculate these deltas (which may already be cached), and, because there's the assumption that the clone is complete and has all files in the history, there's no question of "can I base a delta off of this file, or does the client not have it?"

In a shallow clone, though, the user has no files from the history. Therefore, the server has to do more expensive calculations to determine which files the user currently has, making new fetches more expensive to perform for the server.
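
(To make the delta idea concrete: standard git plumbing can show how deeply objects in a local clone are deltified against one another. This only inspects a local packfile; it illustrates the mechanism, not GitHub's server internals.)

    # Summarize the local object store
    git count-objects -vH

    # List objects in a packfile: for deltified objects, the trailing
    # columns show the delta chain depth and the base object's SHA
    git verify-pack -v .git/objects/pack/pack-*.idx | head -20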

Disclaimer: I'm nowhere near an expert in this so it's quite possible my explanation is flawed.

I don't see GitHub "fixing this issue" soon as there doesn't seem to be a way to resolve this easily.

gromgit (Member) commented Feb 8, 2021

> I have no idea why a shallow clone should be so taxing

As I understand it, it's all in how Git does delta-chaining, which differed radically from other VCSes at the time. Linus Torvalds described the details here, but basically, most VCSes create deltas (diffs between adjacent versions) in a strict "chain" from a root object, while Git just selects whichever version of a file makes the most sense to diff from.

For instance, if you checked in a slew of changes (A1..27) as v2 of a.c, then decided to roll them all back and make a different change B instead to create v3, this is roughly what the VCS objects might look like:

Typical VCS:

v1 ---[add A1..27]---> v2 ---[roll back A1..27, add B]---> v3

Git:

v1 -+-[add A1..27]---> v2
    \-[add B]---> v3

This makes shallow clones of Git repos really expensive on the server side over time. Since you only have a limited set of objects on your local machine, updates force the server side to walk backwards through its (full) repo copy to figure out which dependent objects you don't already have, so that your local Git client can do the update properly. As time goes by, and more objects appear that are "linked" to other objects you also don't have, this burden (and the resulting slowdown in updates) just increases.

However, if you had a full copy, the server side would just send you all the objects committed after the latest object on your side, which obviously doesn't require as much computation.
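
To see that mechanism in a concrete (if contrived) scratch repo, mirroring the a.c example above; the file contents here are just stand-ins:

    # Recreate the v1/v2/v3 example
    git init /tmp/delta-demo && cd /tmp/delta-demo
    printf 'original\n' > a.c
    git add a.c && git commit -m 'v1'
    printf 'original\nA1..A27: a slew of changes\n' > a.c
    git commit -am 'v2: add A1..27'
    printf 'original\nB: a different change\n' > a.c
    git commit -am 'v3: roll back A1..27, add B'

    # After repacking, verify-pack shows which base object each delta was
    # built on; git is free to delta v3's blob against v1's rather than v2's
    git gc
    git verify-pack -v .git/objects/pack/pack-*.idx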

Rylan12 (Member) commented Feb 8, 2021

Thanks, @gromgit. That's a good explanation.

carlocab (Member) commented Feb 8, 2021

Does that mean that, in principle, if git could somehow guarantee that a repository only has a linear history (so a simple greedy algorithm suffices), then shallow clones should not impose as great a computational burden? (Since fetch no longer requires traversing the entire tree.)
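
(Checking that premise for a given repo is a one-liner; counting the merge commits reachable from a branch tells you whether its history is linear:)

    # 0 merge commits reachable from master => linear history
    git rev-list --count --merges origin/master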

jonchang (Contributor) commented Feb 8, 2021

It probably would be true. But I don't think that will ever be the case for our current workflow, since each pull request on GitHub creates a new leaf node in the commit graph, so it'll require a full tree walk on the server anyway, per the second diagram in gromgit's post.

carlocab (Member) commented Feb 8, 2021

Or maybe Git[Hub] should just be lazily fetching from remotes for commits that are far enough away from HEAD, but I guess they've already thought of that.

darajava (Author) commented Feb 8, 2021

@carlocab, I'd say very few repos have a perfectly linear history. I don't understand why GH don't just cache a shallow copy on each pushed commit. I suppose it would be a lot of work on a project at such scale, though.

carlocab (Member) commented Feb 8, 2021

True, but GitHub also asks very few repos to do full clones and fetches rather than shallow ones. Merge commits are not allowed on homebrew/core, so the master branch should have a linear history.

Caching a shallow copy won't help, since it doesn't solve the delta-computation problem described above. However, one thing I don't understand is why they don't just forgo the computation and send a complete shallow clone each time you fetch from a shallow local repository. That way the delta computation could be done locally rather than on GitHub's servers. But I'm guessing they've also already thought of that.
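
(A client can roughly approximate "a fresh shallow clone on every update" today, at the cost of throwing away local history; a sketch, assuming you're inside an existing shallow clone:)

    # Re-truncate history to the newest commit instead of asking the
    # server to compute incremental deltas against old shallow state
    git fetch --depth=1 origin master
    git reset --hard origin/master
    git gc --prune=now    # drop the now-unreferenced older objects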

darajava (Author) commented Feb 8, 2021

When I say caching a shallow copy I mean:

  • When the remote receives a git-pack, GH servers do git clone https://this-repo.git --depth=1
  • They save this as a bundle on their servers
  • If anyone (or Homebrew) requests a copy with --depth=1, GH returns the already-saved bundle.

This would be entirely transparent to the user. The problem GH have is not that they can't handle shallow cloning; it's that they can't handle it thousands or millions of times.

Basically what I'm saying is that depth=1 is a special case and could be cached at a minor storage expense for GH. Anyway, not sure why I'm describing this as it's unlikely to change anything.
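
(For concreteness, a minimal sketch of that caching idea; the path and the refresh-on-push trigger are hypothetical, and this isn't anything GitHub is known to do:)

    # Keep one shallow bare copy per repository, rebuilt on each push,
    # and serve it verbatim to any later --depth=1 request instead of
    # computing a fresh shallow pack every time
    git clone --bare --depth=1 https://github.com/Homebrew/brew /cache/brew.git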

carlocab (Member) commented Feb 8, 2021

I don't think there's a need to cache this, as I don't think there's a lot of computation involved for a fresh shallow clone. We can cut out the caching step and computation and just send a complete shallow clone each time, which is what I suggested above.

> Anyway, not sure why I'm describing this as it's unlikely to change anything.

Same, but I guess speculating about other people's work is sometimes more entertaining than doing our own 😅

MikeMcQuaid (Member) commented:

(GitHub employs several long-time Git maintainers and runs Git at a scale no one else does. Chances are anything that's immediately obvious to you isn't quite as simple as you think it is...)

carlocab (Member) commented Feb 8, 2021

Yup, hence my two comments above, saying:

> I guess they've already thought of that.

darajava (Author) commented Feb 8, 2021

Yep, let's leave it to the grown-ups.

tlk (Contributor) commented Feb 8, 2021

> (GitHub employs several long-time Git maintainers and runs Git at a scale no one else does. Chances are anything that's immediately obvious to you isn't quite as simple as you think it is...)

Nice. Do they have any suggestions on how to improve Homebrew for the benefit of users (smaller footprint) and GitHub?

MikeMcQuaid (Member) commented:

> Nice. Do they have any suggestions on how to improve Homebrew for the benefit of users (smaller footprint) and GitHub?

Yes. Disabling shallow clones was the first of these.

BrewTestBot added the "outdated" label Mar 12, 2021
Homebrew locked this issue as resolved and limited the conversation to collaborators Mar 12, 2021