History cache per partes #3589
also renamed the abstract class
it is not needed after the recent JGit changes
the date parsing is no longer needed because the commit date is not parsed from git log output anymore
it does not work with lists containing null elements
Do you mean something like this: use |
Yep, something like that. Now we are basically reopening the jgit repository for every chunk, right? How does the Either way, I'd assume that we need to do some unnecessary IO that was already done in the first traversal. |
What's the actual impact on the indexing time? Is it much slower now or is it comparable?
Creating history cache from scratch for https://github.com/openssl/openssl/ ( |
I am not too worried about the I/O or the complexity of the initial traversal, at least for modern VCS. Even for the Linux Git repo with 998k changesets, getting the boundary changesets takes some 13 seconds (ext4 on SSD), which is negligible compared to the overall history cache creation time. It's more a question of architecture. The current solution requires repositories to implement |
That's really cool! I don't know why I was expecting it to be a little slower. |
Besides the lower memory requirements, this might be caused by some changes I made along the way, e.g. |
thanks @ahornace ! |
This change addresses a long-standing [1] problem of the indexer: it stored the complete History in memory when updating the history cache. The fix is to split the history into chunks ("per partes" comes from the Latin expression for the math technique of integration by parts). In effect, it is as if the indexer repeatedly synced a part of the history and indexed it incrementally.
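To make the chunking idea concrete, here is a minimal, self-contained sketch. It is not the code from this PR: `PerPartesSketch`, `Changeset`, `HistorySource`, `boundaries` and `buildCache` are hypothetical names, the history is assumed to be traversed oldest first, and an in-memory map stands in for the on-disk per-file history cache. The first pass keeps only the boundary changeset IDs; the second pass fetches and processes one chunk at a time.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class PerPartesSketch {

    /** Hypothetical changeset: an ID plus the files it touched. */
    static final class Changeset {
        final String id;
        final List<String> files;

        Changeset(String id, List<String> files) {
            this.id = id;
            this.files = files;
        }
    }

    /** Hypothetical view of a repository; a real one would ask the VCS. */
    interface HistorySource {
        Iterable<Changeset> all();

        List<Changeset> between(String sinceExclusive, String untilInclusive);
    }

    /** First pass: remember only the last changeset ID of each chunk. */
    static List<String> boundaries(Iterable<Changeset> history, int chunkSize) {
        if (chunkSize < 1) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<String> result = new ArrayList<>();
        int count = 0;
        String last = null;
        for (Changeset cs : history) {
            last = cs.id;
            if (++count % chunkSize == 0) {
                result.add(last);
            }
        }
        // Close the trailing partial chunk, if there is one.
        if (last != null && count % chunkSize != 0) {
            result.add(last);
        }
        return result;
    }

    /** Second pass: load one chunk at a time and append to per-file history. */
    static Map<String, List<String>> buildCache(HistorySource source, int chunkSize) {
        Map<String, List<String>> cache = new HashMap<>(); // stand-in for files on disk
        String previous = null;
        for (String boundary : boundaries(source.all(), chunkSize)) {
            // Only this chunk is held in memory at any given time.
            for (Changeset cs : source.between(previous, boundary)) {
                for (String file : cs.files) {
                    cache.computeIfAbsent(file, k -> new ArrayList<>()).add(cs.id);
                }
            }
            previous = boundary;
        }
        return cache;
    }

    /** In-memory source used only for the demonstration below. */
    static HistorySource inMemory(List<Changeset> changesets) {
        return new HistorySource() {
            @Override
            public Iterable<Changeset> all() {
                return changesets;
            }

            @Override
            public List<Changeset> between(String sinceExclusive, String untilInclusive) {
                int from = 0;
                int to = changesets.size();
                for (int i = 0; i < changesets.size(); i++) {
                    String id = changesets.get(i).id;
                    if (id.equals(sinceExclusive)) {
                        from = i + 1;
                    }
                    if (id.equals(untilInclusive)) {
                        to = i + 1;
                    }
                }
                return changesets.subList(from, to);
            }
        };
    }

    public static void main(String[] args) {
        List<Changeset> history = List.of(
                new Changeset("c1", List.of("a.c")),
                new Changeset("c2", List.of("a.c", "b.c")),
                new Changeset("c3", List.of("b.c")));
        System.out.println(buildCache(inMemory(history), 2));
    }
}
```

The chunk size is the knob that trades heap usage against the number of repository queries; the sketch leaves out renamed-file handling and merge changesets entirely.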
Of the changeset-based repository types, only Git is supported so far, although I think other VCS types such as Mercurial could easily be adapted.
With this change it is finally possible to create the history cache from scratch for repositories such as Linux with a limited heap. This is a partial capture (maybe one fifth) of the history cache creation (renamed file handling on, merge changesets on) for the linux-master repository when running the indexer with 8 GB of heap:
This change also removes the history cache for the top-level directory, since it requires the whole history to be held in memory when merging old and new history. For individual files the history is not that long or big; at least that is the (arguably big) assumption underlying this change overall.
[1] I hit that problem in production last fall after Git started supporting merge changesets