History cache per partes #3589

vladak · 2021-05-18T14:28:45Z

This change addresses a long-standing problem [1] of the indexer: it stored the complete History in memory when updating the history cache. The fix splits the history into chunks ("per partes" comes from the Latin expression for the math technique of integration by parts). It is basically as if the indexing were done by repeatedly syncing a part of the history and indexing incrementally.
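A minimal sketch of the chunking idea described above, not the actual OpenGrok code. The class `PerPartesSketch`, its nested `Changeset` type and the methods `storeChunk()` and `buildCache()` are hypothetical names. An in-memory map stands in for the on-disk history cache and a list stands in for the repository. It only illustrates that feeding the cache consecutive chunks yields the same per-file history as processing everything at once.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of chunked cache creation; not OpenGrok code. */
final class PerPartesSketch {

    /** A changeset reduced to its ID and the files it touched. */
    static final class Changeset {
        final String id;
        final List<String> files;

        Changeset(String id, List<String> files) {
            this.id = id;
            this.files = files;
        }
    }

    /** Appends one chunk of changesets to the per-file cache (file -> changeset IDs). */
    static void storeChunk(Map<String, List<String>> cache, List<Changeset> chunk) {
        for (Changeset changeset : chunk) {
            for (String file : changeset.files) {
                cache.computeIfAbsent(file, f -> new ArrayList<>()).add(changeset.id);
            }
        }
    }

    /** Builds the cache by feeding it consecutive chunks of at most maxCount changesets. */
    static Map<String, List<String>> buildCache(List<Changeset> history, int maxCount) {
        if (maxCount <= 0) {
            throw new IllegalArgumentException("maxCount must be positive");
        }
        Map<String, List<String>> cache = new LinkedHashMap<>();
        for (int from = 0; from < history.size(); from += maxCount) {
            int to = Math.min(from + maxCount, history.size());
            // Only this chunk is handed to the cache at a time.
            storeChunk(cache, history.subList(from, to));
        }
        return cache;
    }

    public static void main(String[] args) {
        List<Changeset> history = List.of(
                new Changeset("c1", List.of("a.c")),
                new Changeset("c2", List.of("b.c")),
                new Changeset("c3", List.of("a.c", "b.c")));
        // Chunked and one-shot creation yield the same per-file history.
        System.out.println(buildCache(history, 2));
        System.out.println(buildCache(history, history.size()));
    }
}
```

In the real change each chunk would be fetched from the repository and written to disk before the next one is loaded, which is what keeps the heap usage bounded.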

Of the repositories that are based on changesets, this only supports Git for now, although I think other VCS types such as Mercurial can be adapted easily.

With this change it is finally possible to create the history cache from scratch for repositories such as Linux with a limited heap. This is a partial capture (maybe one fifth) of history cache creation (renamed file handling on, merge changesets on) of the linux-master repository when running the indexer with 8 GB of heap:

[screen capture not preserved]

This change also removes the top-level directory history cache, since it requires the whole history to be held in memory when merging old and new history. For individual files the history is not so long, or at least that is the (arguably big) assumption behind this change overall.

[1] I hit that problem in production last fall after Git started supporting merge changesets

fixes oracle#3243


vladak · 2021-05-19T10:59:46Z

> Just thinking out loud: with your solution we iterate over the commits twice, once to generate the chunks and a second time to actually process them. Couldn't we split it into chunks directly while iterating over the RevWalk? Or would that be negligible?

Do you mean something like this: use getHistory() as a vehicle for actually storing the cache, i.e. introduce some sort of callback that stores the history once the traversal has accumulated a sufficient number of changesets?

ahornace · 2021-05-19T12:12:20Z

> Do you mean something like this: use getHistory() as a vehicle for actually storing the cache, i.e. introduce some sort of callback that stores the history once the traversal has accumulated a sufficient number of changesets?

Yep, something like that.

Now we are basically reopening the JGit repository for every chunk, right? How does lookupCommit() perform in such a case? Is it O(1)?

Either way, I'd assume that we do some unnecessary I/O that was already done in the first traversal.

ahornace

What's the actual impact on the indexing time? Is it much slower now or is it comparable?

vladak · 2021-05-19T15:03:32Z

> What's the actual impact on the indexing time? Is it much slower now or is it comparable?

Creating the history cache from scratch for https://github.com/openssl/openssl/ (master branch, currently some 28k changesets when listed with plain git log) with renamed files on, merge changesets on and 8 GB of heap finishes in some 11 and a half minutes on average (3 runs). Indexing the same with the per partes changes (using chunks of at most 512 changesets) finishes in 10 minutes on average (also 3 runs), so this actually seems to be faster.

vladak · 2021-05-19T15:19:38Z

> > Do you mean something like this: use getHistory() as a vehicle for actually storing the cache, i.e. introduce some sort of callback that stores the history once the traversal has accumulated a sufficient number of changesets?
>
> Yep, something like that.
>
> Now we are basically reopening the JGit repository for every chunk, right? How does lookupCommit() perform in such a case? Is it O(1)?
>
> Either way, I'd assume that we do some unnecessary I/O that was already done in the first traversal.

I am not too worried about the I/O or the complexity of the initial traversal, at least for a modern VCS. Even for the Linux Git repo with 998k changesets, getting the boundary changesets takes some 13 seconds (ext4 on SSD), which is negligible compared to the overall history cache creation time.
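To illustrate why the initial traversal is cheap in terms of memory, here is a minimal sketch of collecting boundary changesets. It is not the OpenGrok implementation: `BoundarySketch`, `getBoundaryChangesets()` and `toRanges()` are hypothetical names, changeset IDs are plain strings ordered oldest first, and using `null` for an open range end is an assumption made for this example.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical illustration of boundary changesets; not OpenGrok code. */
final class BoundarySketch {

    /**
     * Walks the changeset IDs once and remembers only every maxCount-th ID,
     * so memory use grows with the number of chunks, not with the number of changesets.
     */
    static List<String> getBoundaryChangesets(Iterator<String> ids, int maxCount) {
        if (maxCount <= 0) {
            throw new IllegalArgumentException("maxCount must be positive");
        }
        List<String> boundaries = new ArrayList<>();
        int count = 0;
        while (ids.hasNext()) {
            String id = ids.next();
            count++;
            // The very last changeset is never a boundary: the final range is open-ended.
            if (count % maxCount == 0 && ids.hasNext()) {
                boundaries.add(id);
            }
        }
        return boundaries;
    }

    /** Turns boundaries into (since, till) pairs; null marks an open end (assumption). */
    static List<String[]> toRanges(List<String> boundaries) {
        List<String[]> ranges = new ArrayList<>();
        String since = null;
        for (String till : boundaries) {
            ranges.add(new String[] {since, till});
            since = till;
        }
        ranges.add(new String[] {since, null});
        return ranges;
    }

    public static void main(String[] args) {
        List<String> ids = List.of("c1", "c2", "c3", "c4", "c5");
        for (String[] range : toRanges(getBoundaryChangesets(ids.iterator(), 2))) {
            // Each pair could then be passed to something like getHistory(file, since, till).
            System.out.println("since=" + range[0] + " till=" + range[1]);
        }
    }
}
```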

It's more a question of architecture. The current solution requires repositories to implement accept() and getHistory(file, sinceRevision, tillRevision). With the alternative solution the repositories would have to supply a getHistory() variant with a callback. Also, there would have to be some changes to how the list of renamed files is passed; currently it is passed via the History object. I'd need to think about this more. The advantage of the current implementation is that it makes the progress observable.
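For comparison, the callback alternative discussed above could look roughly like the following. This is only a sketch under assumptions: `CallbackSketch` and `traverse()` are hypothetical names, changesets are reduced to ID strings, and the `Consumer` stands in for whatever would store a batch into the cache. It does not address how renamed files would be passed.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical illustration of a single traversal with a store callback; not OpenGrok code. */
final class CallbackSketch {

    /**
     * Traverses the changesets once and hands them to the callback
     * whenever maxCount of them have accumulated.
     */
    static void traverse(Iterator<String> changesets, int maxCount,
                         Consumer<List<String>> store) {
        if (maxCount <= 0) {
            throw new IllegalArgumentException("maxCount must be positive");
        }
        List<String> pending = new ArrayList<>();
        while (changesets.hasNext()) {
            pending.add(changesets.next());
            if (pending.size() == maxCount) {
                // Pass a copy so that the callback may keep the list.
                store.accept(new ArrayList<>(pending));
                pending.clear();
            }
        }
        if (!pending.isEmpty()) {
            // Flush the last, possibly smaller, batch.
            store.accept(new ArrayList<>(pending));
        }
    }

    public static void main(String[] args) {
        List<String> ids = List.of("c1", "c2", "c3", "c4", "c5");
        traverse(ids.iterator(), 2, batch -> System.out.println("storing " + batch));
    }
}
```

The trade-off named in the comment still applies: with a callback the total number of chunks is not known up front, so progress is harder to report than with precomputed boundary changesets.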

ahornace · 2021-05-19T15:53:36Z

> Creating the history cache from scratch for https://github.com/openssl/openssl/ (master branch, currently some 28k changesets when listed with plain git log) with renamed files on, merge changesets on and 8 GB of heap finishes in some 11 and a half minutes on average (3 runs). Indexing the same with the per partes changes (using chunks of at most 512 changesets) finishes in 10 minutes on average (also 3 runs), so this actually seems to be faster.

That's really cool! I don't know why I was expecting it to be a little slower.

vladak · 2021-05-19T16:38:03Z

> That's really cool! I don't know why I was expecting it to be a little slower.

Besides the lower memory requirements, this might be caused by some changes I made along the way, e.g. GitRepository#getHistory() no longer retrieves the lists of files when getting the history for individual files (i.e. renamed files).

vladak · 2021-05-19T18:15:05Z

Thanks, @ahornace!

Vladimir Kotal added 30 commits April 16, 2021 21:53

proof of concept: split history cache generation into chunks

61a50bc

fixes oracle#3243

cleanup: refreshing latest version in the cache should not be necessary

2efe930

also renamed the abstract class

add repository to the log message

f6142a0

introduce getPerPartesCount()

5f5fe33

update comment

fcb55a5

finish removal of per directory cache

fb94b74

introduce boundary changesets

8edaf1e

restore renamed file handling

f445546

restore renamed file handling

33ec202

cleanup history remnants

159aa8c

remove unused import

29b43c8

address Windows

10c1cdf

add basic test for boundary changesets

e010100

also test non null sinceRevision

50a1f9b

add from,to revision to the log entry

6152897

cleanup unused

e7cc3e2

logging changes

7e03023

add getHistory() test w.r.t. boundary changesets

564c243

add TODO

c6d71d5

history of renamed files has to be handled specially for per partes

7722e30

avoid warning

36275ac

add test for renamed file handling with per partes

7eaaf0d

remove unused import

e914fb9

make sure the renamed was detected

0a28130

do not produce list of files for history of single file

3bd931f

add tests for GitRepository.getHistory(file, sinceRev, tillRev)

dd391e5

update Javadoc

1f89bfb

cleanup renamed file handling remnants

ce2e23b

remove unused import

a4318dd

cleanup

91d3199

Vladimir Kotal added 18 commits May 19, 2021 10:15

restore final for renamedFiles

d3aaf8b

refactor getFilesForCommit(), add javadoc

28e0c85

use Arrays.asList(), fix javadoc

380f2dd

rename interface name

16a9285

fix javadoc

2c423cb

use Consumer instead of the visitor interface

2e64b55

remove unused import

ed04e63

cleanup git version check

f4ccdec

it is not needed after the recent JGit changes

more JGit related cleanup

a891ef4

the date parsing is no longer needed because the commit date is not parsed from git log output anymore

remove unused import

7672dc8

add missing @throws to javadoc

e90f4dc

fix javadoc

8f7ac7f

remove stale reference from javadoc

345832e

use List.copyOf() to return results

1b97072

revert List.copyOf()

8e67531

it does not work with lists containing null elements

refactor cache creation to avoid instanceOf

692b14e

remove unused import

e102e45

List.copyOf() redux

0d4e166
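The note on the reverted List.copyOf() commit above ("it does not work with lists containing null elements") refers to documented JDK behavior: List.copyOf() throws NullPointerException if the source contains a null element. The sketch below demonstrates this. `CopyOfNullSketch` and its methods are hypothetical names, and the null-tolerant alternative shown is just one option, not necessarily what the later "redux" commit did.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical illustration of List.copyOf() and null elements; not OpenGrok code. */
final class CopyOfNullSketch {

    /** Returns true if List.copyOf() accepts the list, false if it throws on a null element. */
    static boolean copyOfAccepts(List<String> source) {
        try {
            List.copyOf(source);
            return true;
        } catch (NullPointerException e) {
            return false;
        }
    }

    /** An unmodifiable copy that tolerates null elements. */
    static List<String> nullTolerantCopy(List<String> source) {
        return Collections.unmodifiableList(new ArrayList<>(source));
    }

    public static void main(String[] args) {
        List<String> withNull = Arrays.asList("a", null);
        System.out.println("List.copyOf accepts nulls: " + copyOfAccepts(withNull));
        System.out.println("null-tolerant copy size: " + nullTolerantCopy(withNull).size());
    }
}
```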

ahornace approved these changes May 19, 2021


vladak merged commit b29f860 into oracle:master May 19, 2021

vladak mentioned this pull request May 21, 2021

Mercurial history per partes #3601

Merged

vladak mentioned this pull request Jun 23, 2021

indexer CPU usage increased by factor 5 after 1.5 to 1.7 upgrade #3585

Closed

vladak mentioned this pull request Nov 3, 2022

different serialization scheme for history #3539

Closed
