Bloated git repository #2458

Closed
billsacks opened this issue Apr 12, 2018 · 8 comments

@billsacks
Member

I noticed that recent clones of cime are much bigger than they used to be. I tracked the problem back to 92ccb83 - Merge pull request #2312 from ESMCI/jgfouca/branch-for-acme-split-2018-02-22 - which added 5,296 commits to history. #2357 also added a lot of commits, and there was a brief discussion in that PR about that issue. It looks like the two most recent e3sm splits (#2406 and #2433) only added a small number of commits. Maybe the changes with #2367 fixed this?

There probably isn't much we can do about this at this point, but I wanted to make sure that the e3sm split process will involve relatively few added commits moving forward so we don't experience runaway repository bloat. @jgfouca ?
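For anyone who wants to check how many commits a given merge brought in, one possible approach is to count the commits reachable from the merge but not from its first parent. The sketch below is an assumption about method, not necessarily how the 5,296 figure above was obtained; the function names are hypothetical, and it relies only on the standard `git rev-list --count` command. Note that the count includes the merge commit itself.

```python
import subprocess


def merge_range(merge_commit):
    """Revision range: reachable from the merge but not from its first parent."""
    return f"{merge_commit}^1..{merge_commit}"


def rev_list_count_cmd(merge_commit):
    """Build the git command that counts the commits in that range."""
    return ["git", "rev-list", "--count", merge_range(merge_commit)]


def commits_added_by_merge(merge_commit, repo_dir="."):
    """Run the command inside repo_dir (assumed to be a clone of the repository).

    The returned count includes the merge commit itself.
    """
    result = subprocess.run(
        rev_list_count_cmd(merge_commit),
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())
```

Inside a local clone, `commits_added_by_merge("92ccb83")` would report the number for the merge mentioned above; comparing that number across split PRs is one way to spot an unusually large import early.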

@billsacks
Member Author

In case it's of interest, here's what I found in terms of repository growth over the last couple of years. Each figure is the amount of data pulled down by git clone -b TAGNAME --single-branch git@github.com:ESMCI/cime.git:

  • 4.6.0 was 13.55 MiB
  • 5.0.0 was 17.79 MiB
  • 5.4.0-alpha01 was 26.98 MiB
  • 5.4.0-alpha16 was 29.35 MiB
  • 5.4.0-alpha22 was 30.33 MiB
  • 5.4.0-alpha24 was 30.50 MiB
  • 5.4.0-alpha25 was 69.55 MiB
  • 5.4.0-alpha26 was 70.47 MiB
  • master is 70.71 MiB
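
To make the jump easier to see, here is a small sketch that computes the growth between consecutive entries in the list above and picks out the largest step. The sizes are copied from this comment; the function names are only illustrative.

```python
# Clone sizes in MiB, copied from the list above (tag, size).
SIZES_MIB = [
    ("4.6.0", 13.55),
    ("5.0.0", 17.79),
    ("5.4.0-alpha01", 26.98),
    ("5.4.0-alpha16", 29.35),
    ("5.4.0-alpha22", 30.33),
    ("5.4.0-alpha24", 30.50),
    ("5.4.0-alpha25", 69.55),
    ("5.4.0-alpha26", 70.47),
    ("master", 70.71),
]


def growth_steps(sizes):
    """Return (older, newer, delta_mib) for each consecutive pair of entries."""
    return [
        (old_tag, new_tag, round(new_size - old_size, 2))
        for (old_tag, old_size), (new_tag, new_size) in zip(sizes, sizes[1:])
    ]


def largest_step(sizes):
    """Return the consecutive pair with the biggest size increase."""
    return max(growth_steps(sizes), key=lambda step: step[2])


if __name__ == "__main__":
    for old_tag, new_tag, delta in growth_steps(SIZES_MIB):
        print(f"{old_tag} -> {new_tag}: +{delta:.2f} MiB")
    print("largest:", largest_step(SIZES_MIB))
```

The largest step is between 5.4.0-alpha24 and 5.4.0-alpha25, where the clone size more than doubles.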

@jgfouca
Contributor

jgfouca commented Apr 12, 2018

@billsacks, thanks for this investigation. I will keep an eye on the acme splits to make sure they aren't adding too many commits. Maybe we should go back to squashing to reduce history bloat?

@rljacob
Member

rljacob commented Apr 12, 2018

Yes, do squash commits each way.

@jgfouca
Contributor

jgfouca commented Apr 12, 2018

I think @billsacks asked for the full history at one point.

@billsacks
Member Author

To clarify, I said that squashing didn't seem ideal, but I was okay with it (#2177 (comment) and #2177 (comment)).

Based on a few spot-checks from the last two e3sm split PRs, it looks like the only commits being added to history are ones that actually touched cime – which seems like the right behavior. I'd be concerned if somehow all of the e3sm commits were being added to history, and/or if the number of commits coming from the e3sm splits were, say, an order of magnitude larger than the number of commits being added directly to cime.

To summarize: I'm fine with the status quo as long as you keep an eye on this, as you suggest, @jgfouca. But I'm also fine with having you squash them if you prefer that.

@billsacks
Member Author

I'll go ahead and close this because I think we've resolved it enough; feel free to reopen if you want to discuss further.

jgfouca added a commit that referenced this issue Aug 8, 2018
Currently, MPI task-to-compute-node mapping information is
output in two locations: once in CAM, where it is
truncated after the first 256 MPI tasks, and once in CLM,
where it is truncated after the first 100 MPI tasks.
It is output only for these two components. This is not useful in current
production runs. The use of environment variables, such as
MPICH_CPUMASK_DISPLAY on Cray systems, generates data that are
unnecessarily verbose for our needs. Here a share routine is
introduced that writes out one line per compute node. Each line
contains the compute node name and the list of MPI tasks assigned
to that node for a given communicator. This is then called
in the driver and writes out the task-to-node mapping for the
entire coupled model. Separate branches will then introduce
this into the individual components, replacing the current logic
in both CAM and CLM, for example.

The share routine also optionally returns the number of compute
nodes and the task-to-node mapping, which is needed in the
internal CAM load balancing.

With the call to the shr_taskmap_write routine in the
driver, the mapping data generated by the system when setting
the corresponding environment variable is redundant. This
is removed for the systems currently setting the variable.

Fixes #2457

BFB

* origin/worleyph/cime/taskmap:
  Avoid empty env blocks
  Remove unnecessary white space in task-to-node map output
  Modify driver output format
  Uncomment MV2_CPU_MAPPING definition for Anvil
  Modify task map output format
  Unset environment variables to output task-to-node mapping
  Output MPI task to compute node mapping
@billsacks
Member Author

This issue of repository bloat came up again on today's CIME call.

Current repository size:

  • Full repository: 85.59 MiB
  • --single-branch master: 80.16 MiB (growth of a little less than 10 MiB since April 2018).

@jgfouca raised the idea of eventually cutting off history older than some chosen date and force-pushing the result to master. (We could still keep the old history in an archived repo somewhere.) We didn't decide whether this is worth doing; we can revisit it later.

@jgfouca
Contributor

jgfouca commented Oct 30, 2019

Just a little additional info: as of today, it takes 12 s to clone the repo on a local file system with a fast network.
