Skip to content

History cache per partes #3589

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 69 commits into from
May 19, 2021
Merged

Conversation

vladak
Copy link
Member

@vladak vladak commented May 18, 2021

This change addresses long standing [1] problem of the indexer of storing complete History in memory when updating history cache. This is done by splitting the history to chunks (per partes comes from the latin expresion for the math technique of integration in parts). This is basically as if the indexing was done by repeated syncing part of the history and indexing incrementally.

Out of the repositories that are based on changesets this only supports Git, although the other VCS types such as Mercurial can be easily adopted I think.

With this change it is finally possible to create history cache for repositories such as Linux from scratch with limited heap. This is partial capture (maybe one fifth) of history cache creation (renamed file handling on, merge changesets on) of the linux-master repository when running the indexer with 8 GB of heap:

opengrok - per partes

This change also removes top level directory history cache since it requires for the whole history to be held in memory when merging old and new history. For individual files the history is not so long/big, at least that is the arguably big assumption of this change in overall.

[1] I hit that problem in production last fall after Git started supporting merge changesets

@vladak
Copy link
Member Author

vladak commented May 19, 2021

Just thinking out loud, with your solution we iterate over the commits twice. Once to generate the chunks and then the second time to actually process the chunks. Could not we split it into chunks directly while iterating over the RevWalk? Or that would be negligible?

Do you mean something like this: use getHistory() as a vehicle for actually storing the cache, i.e. introduce some sort of callback to store the history once the traversal accumulates sufficient amount of changesets.

@ahornace
Copy link
Contributor

Do you mean something like this: use getHistory() as a vehicle for actually storing the cache, i.e. introduce some sort of callback to store the history once the traversal accumulates sufficient amount of changesets.

Yep, something like that.

Now we are basically reopening the jgit repository for every chunk, right? How does the lookupCommit() perform in such a case, is it O(1)?

Either way, I'd assume that we need to do some unnecessary IO that was already done in the first traversal.

Copy link
Contributor

@ahornace ahornace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the actual impact on the indexing time? Is it much slower now or is it comparable?

@vladak
Copy link
Member Author

vladak commented May 19, 2021

What's the actual impact on the indexing time? Is it much slower now or is it comparable?

Creating history cache from scratch for https://github.com/openssl/openssl/ (master branch, currently has some 28k changesets when listing with plain git log) with renamed files on, merge changesets on and 8 GB heap finishes in some 11 and half minutes on average (tried 3 times). Indexing the same with the per partes changes (using the chunks with 512 changesets max) finishes in 10 minutes on average (tried also 3 times) so this seems to be actually faster.

@vladak
Copy link
Member Author

vladak commented May 19, 2021

Do you mean something like this: use getHistory() as a vehicle for actually storing the cache, i.e. introduce some sort of callback to store the history once the traversal accumulates sufficient amount of changesets.

Yep, something like that.

Now we are basically reopening the jgit repository for every chunk, right? How does the lookupCommit() perform in such a case, is it O(1)?

Either way, I'd assume that we need to do some unnecessary IO that was already done in the first traversal.

I am not too worried about I/O or complexity of the initial traversal, at least for modern VCS. Even for the Linux Git repo with 998k changesets getting the boundary changesets takes some 13 seconds (ext4 on SSD) which is negligible compared to the overall history cache creation time.

It's more question of architecture. The current solution requires repositories to implement accept() and getHistory(file, sinceRevision, tillRevision). With the alternative solution the repositories would have to supply getHistory() variant with a callback. Also, there would have to be some changes to how the list of renamed files is passed - currently they are passed via History object. I'd need to think about this more. The advantage of the current implementation is that it allows to observe the progress.

@ahornace
Copy link
Contributor

Creating history cache from scratch for https://github.com/openssl/openssl/ (master branch, currently has some 28k changesets when listing with plain git log) with renamed files on, merge changesets on and 8 GB heap finishes in some 11 and half minutes on average (tried 3 times). Indexing the same with the per partes changes (using the chunks with 512 changesets max) finishes in 10 minutes on average (tried also 3 times) so this seems to be actually faster.

That's really cool! I don't know why I was expecting it to be a little slower.

@vladak
Copy link
Member Author

vladak commented May 19, 2021

That's really cool! I don't know why I was expecting it to be a little slower.

Next to lower memory requirements, this might be caused by some changes I did along the way, e.g. GitRepository#getHistory() no longer retrieves lists of files when getting the history for individual files (i.e. renamed files).

@vladak vladak merged commit b29f860 into oracle:master May 19, 2021
@vladak
Copy link
Member Author

vladak commented May 19, 2021

thanks @ahornace !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants