
Homebrew has a ~350mb footprint #10557

Closed
darajava opened this issue Feb 7, 2021 · 18 comments
Labels: features (New features), outdated (PR was locked due to age)

Comments

darajava commented Feb 7, 2021

Provide a detailed description of the proposed feature

Homebrew installation takes up about 350 MB of space and bandwidth. This is due (from what I can see) to the git clone command cloning the entire repo, including its history.

git clone --depth 1 https://github.com/user/repo.git # (see https://stackoverflow.com/questions/1209999/using-git-to-get-just-the-latest-revision)

Updating to this command will only clone the most recent commit. Is there a reason we need this history?

I have tried to install brew this way (git clone https://github.com/Homebrew/brew ~/.linuxbrew/Homebrew --depth=1) and it seems to work without issue (and it only costs 2.73 MiB, about 128 times smaller).

What is the motivation for the feature?

For people on slow or limited connections, installing Homebrew is a pain and quite slow. I believe similar git cloning (i.e. including the history) is done for packages installed by Homebrew too, but I actually can't see the git command used for cloning.

It should also be a really easy and quick fix: simply adding --depth=1 to the git commands should do it.

How will the feature be relevant to at least 90% of Homebrew users?

This should be relevant for 100% of Homebrew users (even ones on a fast connection) because all packages will be downloaded (much) faster and take up (much) less space.

What alternatives to the feature have been considered?

None that I'm aware of. Maybe I'm wrong about this and there is a reason you include history (I tried going through related issues but couldn't find the reason for cloning history), but this is so painful for me, working on an 80 GB-per-month data allowance. Perhaps it could even be added as an option?

darajava added the "features" label Feb 7, 2021
scpeters (Member) commented Feb 7, 2021

The initial clone is smaller, but subsequent git fetch operations are more computationally expensive for the server, so GitHub has requested that we not use shallow clones (see the discussion in #9383).

Rylan12 (Member) commented Feb 7, 2021

GitHub has asked that we not use shallow clones, as they are extremely taxing for their servers. As a result, we will not be adding --depth=1 to the git clone commands.

We are investigating other ways to improve this, though. We're looking at using blobless or treeless clones. These solutions are promising but not quite ready yet. GitHub is working to improve these clone options which we could eventually use in Homebrew. We are also looking at restructuring our repos to reduce the size of tree objects which would help to reduce the repository size.
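
For anyone who wants to experiment, blobless and treeless clones are standard git features (the partial-clone --filter options), so they can be tried today; using the Homebrew repo as an example:

    # Blobless clone: fetch all commits and trees up front, but download
    # file contents (blobs) lazily, only when they're actually needed
    git clone --filter=blob:none https://github.com/Homebrew/brew

    # Treeless clone: fetch only the commit history up front; trees and
    # blobs are fetched on demand (smallest initial clone, slower later)
    git clone --filter=tree:0 https://github.com/Homebrew/brew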

Rylan12 closed this as completed Feb 7, 2021
darajava (Author) commented Feb 7, 2021

That's so interesting, thanks guys. I have no idea why a shallow clone should be so taxing, but hopefully they will fix it soon.

Rylan12 (Member) commented Feb 7, 2021

I'm no expert but I believe it has to do with the way "deltas" (i.e. the difference between files) are calculated.

In a fetch from a full clone, the server can assume that the local repo has all the necessary files (even from previous points in history). Thus, to reduce the amount of data transferred, the server can provide "deltas" (i.e. diffs) between existing files and the new files. The local client can then "apply" those deltas to the appropriate files in order to get the new data. The server just has to calculate these deltas (which may already be cached), and, because there's the assumption that the clone is complete and has all files in the history, there's no question of "can I base a delta off of this file, or does the client not have it?"

In a shallow clone, though, the user has no files from the history. Therefore, the server has to do more expensive calculations to determine which files the user currently has, making new fetches more expensive to perform for the server.
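
(To make the delta idea concrete: standard git plumbing can show how deeply objects in a local clone are deltified against one another. This only inspects a local packfile; it illustrates the mechanism, not GitHub's server internals.)

    # Summarize the local object store
    git count-objects -vH

    # List objects in a packfile: for deltified objects, the trailing
    # columns show the delta chain depth and the base object's SHA
    git verify-pack -v .git/objects/pack/pack-*.idx | head -20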

Disclaimer: I'm nowhere near an expert in this so it's quite possible my explanation is flawed.

I don't see GitHub "fixing this issue" soon as there doesn't seem to be a way to resolve this easily.

gromgit (Member) commented Feb 8, 2021

> I have no idea why a shallow clone should be so taxing

As I understand it, it's all in how Git does delta-chaining, which differed radically from other VCSes at the time. Linus Torvalds described the details here, but basically, most VCSes create deltas (diffs between adjacent versions) in a strict "chain" from a root object, while Git just selects whichever version of a file makes the most sense to diff from.

For instance, if you checked in a slew of changes (A1..27) as v2 of a.c, then decided to roll them all back and make a different change B instead to create v3, this is roughly what the VCS objects might look like:

Typical VCS:

v1 ---[add A1..27]---> v2 ---[roll back A1..27, add B]---> v3

Git:

v1 -+-[add A1..27]---> v2
    \-[add B]---> v3

This makes shallow clones of Git repos really expensive on the server side over time. Since you only have a limited set of objects on your local machine, updates force the server side to walk backwards through its (full) repo copy to figure out which dependent objects you don't already have, so that your local Git client can do the update properly. As time goes by, and more objects appear that are "linked" to other objects you also don't have, this burden (and the resulting slowdown in updates) just increases.

However, if you had a full copy, the server side would just send you all the objects committed after the latest object on your side, which obviously doesn't require as much computation.
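
To see that mechanism in a concrete (if contrived) scratch repo, mirroring the a.c example above; the file contents here are just stand-ins:

    # Recreate the v1/v2/v3 example
    git init /tmp/delta-demo && cd /tmp/delta-demo
    printf 'original\n' > a.c
    git add a.c && git commit -m 'v1'
    printf 'original\nA1..A27: a slew of changes\n' > a.c
    git commit -am 'v2: add A1..27'
    printf 'original\nB: a different change\n' > a.c
    git commit -am 'v3: roll back A1..27, add B'

    # After repacking, verify-pack shows which base object each delta was
    # built on; git is free to delta v3's blob against v1's rather than v2's
    git gc
    git verify-pack -v .git/objects/pack/pack-*.idx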

Rylan12 (Member) commented Feb 8, 2021

Thanks, @gromgit. That's a good explanation.

carlocab (Member) commented Feb 8, 2021

Does that mean that, in principle, if git could somehow guarantee that a repository only has a linear history (so a simple greedy algorithm suffices), then shallow clones should not impose as great a computational burden? (Since fetch no longer requires traversing the entire tree.)
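
(Checking that premise for a given repo is a one-liner; counting the merge commits reachable from a branch tells you whether its history is linear:)

    # 0 merge commits reachable from master => linear history
    git rev-list --count --merges origin/master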

jonchang (Contributor) commented Feb 8, 2021

It probably would be true. But I don't think that will ever be the case for our current workflow, since each pull request on GitHub creates a new leaf node in the commit graph, so it'll require a full tree walk on the server anyway, per the second diagram in gromgit's post.

carlocab (Member) commented Feb 8, 2021

Or maybe Git[Hub] should just be lazily fetching from remotes for commits that are far enough away from HEAD, but I guess they've already thought of that.

darajava (Author) commented Feb 8, 2021

@carlocab, I'd say very few repos have a perfectly linear history. I don't understand why GH don't just cache a shallow copy on each pushed commit. I suppose it would be a lot of work on a project at such scale, though.

carlocab (Member) commented Feb 8, 2021

True, but GitHub also asks very few repos to do full clones and fetches rather than shallow ones. Merge commits are not allowed on homebrew/core, so the master branch should have a linear history.

Caching a shallow copy won't help, since it doesn't solve the delta-computation problem described above. However, one thing I don't understand is why they don't just forgo the computation and send a complete shallow clone each time you fetch from a shallow local repository. That way the delta computation could be done locally rather than on GitHub's servers. But I'm guessing they've also already thought of that.
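
(A client can roughly approximate "a fresh shallow clone on every update" today, at the cost of throwing away local history; a sketch, assuming you're inside an existing shallow clone:)

    # Re-truncate history to the newest commit instead of asking the
    # server to compute incremental deltas against old shallow state
    git fetch --depth=1 origin master
    git reset --hard origin/master
    git gc --prune=now    # drop the now-unreferenced older objects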

darajava (Author) commented Feb 8, 2021

When I say caching a shallow copy I mean:

  • When the remote receives a git-pack, GH servers do git clone https://this-repo.git --depth=1
  • They save this as a bundle on their servers
  • If anyone (or Homebrew) requests a copy with --depth=1, GH returns the already-saved bundle.

This would be entirely transparent to the user. The problem GH have is not that they can't handle shallow cloning; it's that they can't handle it thousands or millions of times.

Basically what I'm saying is that depth=1 is a special case and could be cached at a minor storage expense for GH. Anyway, not sure why I'm describing this as it's unlikely to change anything.
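
(For concreteness, a minimal sketch of that caching idea; the path and the refresh-on-push trigger are hypothetical, and this isn't anything GitHub is known to do:)

    # Keep one shallow bare copy per repository, rebuilt on each push,
    # and serve it verbatim to any later --depth=1 request instead of
    # computing a fresh shallow pack every time
    git clone --bare --depth=1 https://github.com/Homebrew/brew /cache/brew.git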

carlocab (Member) commented Feb 8, 2021

I don't think there's a need to cache this, as I don't think there's a lot of computation involved for a fresh shallow clone. We can cut out the caching step and computation and just send a complete shallow clone each time, which is what I suggested above.

> Anyway, not sure why I'm describing this as it's unlikely to change anything.

Same, but I guess speculating about other people's work is sometimes more entertaining than doing our own 😅

MikeMcQuaid (Member) commented:

(GitHub employs several long-time Git maintainers and runs Git at a scale no one else does. Chances are anything that's immediately obvious to you isn't quite as simple as you think it is...)

carlocab (Member) commented Feb 8, 2021

Yup, hence my two comments above, saying:

> I guess they've already thought of that.

darajava (Author) commented Feb 8, 2021

Yep, let's leave it to the grown-ups.

tlk (Contributor) commented Feb 8, 2021

> (GitHub employs several long-time Git maintainers and runs Git at a scale no one else does. Chances are anything that's immediately obvious to you isn't quite as simple as you think it is...)

Nice. Do they have any suggestions on how to improve Homebrew for the benefit of users (smaller footprint) and GitHub?

MikeMcQuaid (Member) commented:

> Nice. Do they have any suggestions on how to improve Homebrew for the benefit of users (smaller footprint) and GitHub?

Yes. Disabling shallow clones was the first of these.

BrewTestBot added the "outdated" label Mar 12, 2021
Homebrew locked this issue as resolved and limited the conversation to collaborators Mar 12, 2021