Bloated git repository #2458

billsacks · 2018-04-12T13:48:21Z

I noticed that recent clones of cime are much bigger than they used to be. I tracked the problem back to 92ccb83 - Merge pull request #2312 from ESMCI/jgfouca/branch-for-acme-split-2018-02-22 - which added 5,296 commits to history. #2357 also added a lot of commits, and there was a brief discussion in that PR about that issue. It looks like the two most recent e3sm splits (#2406 and #2433) only added a small number of commits. Maybe the changes with #2367 fixed this?

There probably isn't much we can do about this at this point, but I wanted to make sure that the e3sm split process will involve relatively few added commits moving forward so we don't experience runaway repository bloat. @jgfouca ?

billsacks · 2018-04-12T13:50:09Z

In case it's of interest, here's what I found in terms of repository growth over the last couple of years. These are the data sizes pulled down with git clone -b TAGNAME --single-branch git@github.com:ESMCI/cime.git

4.6.0 was 13.55 MiB

5.0.0 was 17.79 MiB

5.4.0-alpha01 was 26.98 MiB

5.4.0-alpha16 was 29.35 MiB

5.4.0-alpha22 was 30.33 MiB

5.4.0-alpha24 was 30.50 MiB

5.4.0-alpha25 was 69.55 MiB

5.4.0-alpha26 was 70.47 MiB

master is 70.71 MiB
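To make the jump easier to see, the growth between consecutive tags can be computed from the sizes listed above. This is only a minimal sketch: the numbers are copied from the list, and the names `sizes` and `growth_steps` are made up for illustration.

```python
# Clone sizes in MiB, copied from the list above (tag, size).
sizes = [
    ("4.6.0", 13.55),
    ("5.0.0", 17.79),
    ("5.4.0-alpha01", 26.98),
    ("5.4.0-alpha16", 29.35),
    ("5.4.0-alpha22", 30.33),
    ("5.4.0-alpha24", 30.50),
    ("5.4.0-alpha25", 69.55),
    ("5.4.0-alpha26", 70.47),
    ("master", 70.71),
]


def growth_steps(sizes):
    """Return (previous_tag, tag, growth_in_MiB) for each consecutive pair."""
    return [
        (prev_tag, tag, round(size - prev_size, 2))
        for (prev_tag, prev_size), (tag, size) in zip(sizes, sizes[1:])
    ]


steps = growth_steps(sizes)
# The step with the largest growth marks where the bloat entered history.
largest = max(steps, key=lambda step: step[2])

for prev_tag, tag, delta in steps:
    print(f"{prev_tag} -> {tag}: +{delta:.2f} MiB")
print(f"largest jump: {largest[0]} -> {largest[1]} (+{largest[2]:.2f} MiB)")
```

With these numbers, the largest step is the one between 5.4.0-alpha24 and 5.4.0-alpha25.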

jgfouca · 2018-04-12T17:19:37Z

@billsacks , thanks for this investigation. I will keep an eye on the acme splits to make sure they aren't adding too many commits. Maybe we should go back to squashing to reduce history bloat?

rljacob · 2018-04-12T17:31:31Z

Yes, do squash commits each way.

jgfouca · 2018-04-12T17:40:09Z

I think @billsacks asked for the full history at one point.

billsacks · 2018-04-12T18:54:08Z

To clarify, I said that squashing didn't seem ideal, but I was okay with it (#2177 (comment) and #2177 (comment)).

Based on a few spot-checks from the last two e3sm split PRs, it looks like the only commits being added to history are ones that actually touched cime – which seems like the right behavior. I'd be concerned if somehow all of the e3sm commits were being added to history, and/or if the number of commits coming from the e3sm splits were, say, an order of magnitude larger than the number of commits being added directly to cime.

To summarize: I'm fine with the status quo as long as you keep an eye on this, like you suggest @jgfouca . But I'm also fine with having you squash them if you prefer that.
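One possible way to keep an eye on this is to count how many commits a given merge brought into history. The sketch below assumes a local clone and uses `git rev-list --count` with the `<sha>^1..<sha>` range; the function names and the `repo_dir` parameter are hypothetical, and this is not an existing CIME tool.

```python
import subprocess


def merge_commit_count_cmd(merge_sha):
    """Build the git command that counts commits a merge brought into history.

    The range `<sha>^1..<sha>` selects commits reachable from the merge
    commit but not from its first parent, so the raw count includes the
    merge commit itself.
    """
    return ["git", "rev-list", "--count", f"{merge_sha}^1..{merge_sha}"]


def commits_added_by_merge(merge_sha, repo_dir="."):
    """Run the command in repo_dir (assumed to be a git clone).

    Returns the number of commits added by the merge, excluding the
    merge commit itself.
    """
    result = subprocess.run(
        merge_commit_count_cmd(merge_sha),
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip()) - 1
```

For example, `commits_added_by_merge("92ccb83", repo_dir="path/to/cime")` would report the count for the merge mentioned at the top of this issue, where `path/to/cime` stands in for wherever the clone lives.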

billsacks · 2018-04-12T18:54:29Z

I'll go ahead and close this because I think we've resolved it enough; feel free to reopen if you want to discuss further.

Currently, MPI task to compute node mapping information is output in two locations: once in CAM, where it is truncated after the first 256 MPI tasks, and once in CLM, where it is truncated after the first 100 MPI tasks. In both cases the output covers only that one component. This is not useful in current production runs. Environment variables such as MPICH_CPUMASK_DISPLAY on Cray systems generate data that are unnecessarily verbose for our needs.

Here a share routine is introduced that writes out one line per compute node. Each line contains the compute node name and the list of MPI tasks assigned to that node for a given communicator. This routine is then called in the driver and writes out the task-to-node mapping for the entire coupled model. Separate branches will then introduce this into the individual components, replacing the current logic in both CAM and CLM, for example. The share routine also optionally returns the number of compute nodes and the task-to-node mapping, which is needed in the internal CAM load balancing.

With the call to the shr_taskmap_write routine in the driver, the mapping data generated by the system when setting the corresponding environment variable is redundant. Setting the variable is therefore removed for the systems that currently set it.

Fixes #2457

BFB

* origin/worleyph/cime/taskmap:
  Avoid empty env blocks
  Remove unnecessary white space in task-to-node map output
  Modify driver output format
  Uncomment MV2_CPU_MAPPING definition for Anvil
  Modify task map output format
  Unset environment variables to output task-to-node mapping
  Output MPI task to compute node mapping
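The one-line-per-node output described in the commit message above can be sketched as follows. This is not the shr_taskmap_write routine itself. It is a small Python illustration that assumes the node name reported by each rank has already been gathered into a list. The function names, the node names, and the exact line format are assumptions.

```python
from collections import defaultdict


def task_to_node_map(node_of_rank):
    """Group MPI ranks by compute node.

    node_of_rank[i] is the node name reported by rank i (hypothetical
    input; in an MPI program it could be gathered from each task).
    """
    ranks_on_node = defaultdict(list)
    for rank, node in enumerate(node_of_rank):
        ranks_on_node[node].append(rank)
    # Plain dict; nodes stay in order of their lowest rank.
    return dict(ranks_on_node)


def format_task_map(mapping):
    """Return one line per node: the node name, then its ranks."""
    return [
        f"{node}: {' '.join(str(rank) for rank in ranks)}"
        for node, ranks in mapping.items()
    ]


# Made-up node names for a five-task communicator.
mapping = task_to_node_map(
    ["nid00012", "nid00012", "nid00013", "nid00013", "nid00012"]
)
num_nodes = len(mapping)
for line in format_task_map(mapping):
    print(line)
```

Returning the mapping and node count alongside the formatted lines mirrors the optional return values mentioned in the commit message, which a caller could use for load balancing.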

billsacks · 2019-10-30T20:03:01Z

This issue of repository bloat came up again on today's CIME call.

Current repository size:

Full repository: 85.59 MiB
--single-branch master: 80.16 MiB (growth of a little less than 10 MiB since April 2018).

@jgfouca raised the idea of, at some point, cutting off history older than some point and force pushing to master. (We could still keep the old history in an archived repo somewhere.) We didn't decide if this is worth doing; we can revisit this later.

jgfouca · 2019-10-30T21:24:26Z

Just a little additional info: as of today, it takes 12 s to clone the repo onto a local file system over a fast network.

billsacks added the ty: Discussion label Apr 12, 2018

billsacks closed this as completed Apr 12, 2018

billsacks mentioned this issue Nov 19, 2019

Pushed old gh-pages documentation history as a tag #3310

Closed
