Skip to content

Maven plugin unnecessarily slow when using ratcheting #701

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
eriklumme opened this issue Sep 21, 2020 · 5 comments
Closed

Maven plugin unnecessarily slow when using ratcheting #701

eriklumme opened this issue Sep 21, 2020 · 5 comments

Comments

@eriklumme
Copy link
Contributor

Running e.g. spotless:check with ratcheting enabled takes almost 20 seconds on one of my projects, even when no files have been modified.

As it runs other projects much quicker, it leads me to believe that the number of files is the issue.

I believe it currently gathers all files matching the include patterns first, and afterwards matches against what has actually been modified. I'd like being able to configure it to do the opposite, i.e. start with the files that have been modified, and filter them based on the include patterns.

Perhaps this would be an issue when a large number of files that do not match the include patterns have been modified.

@nedtwigg
Copy link
Member

You are exactly correct about how it works. We get all the files, and then we filter them using git.

List<File> files = collectFiles(formatterFactory);
FormatterConfig config = getFormatterConfig();
Optional<String> ratchetFrom = formatterFactory.ratchetFrom(config);
Iterable<File> toFormat;
if (!ratchetFrom.isPresent()) {
toFormat = files;
} else {
toFormat = Iterables.filter(files, GitRatchetMaven.instance().isGitDirty(baseDir, ratchetFrom.get()));
}

I would be surprised but not shocked if it were faster to swap the order. git status has to look at the contents of every file, but enumerating a file pattern src/** does not. So it should be that by doing the pattern first, you are speeding up git status, but the internal workings of org.codehaus.plexus.util.FileUtils is a complete mystery to me, and I would not be shocked if it is wildly slow.

Some stats which would be interesting:

  • how long does git status take
  • how long does git diff --name-status origin/main take
  • how long does spotless:check take with ratcheting (sounds like 20 seconds)
  • how long does spotless:check take without ratcheting (I would expect much longer than 20 seconds)
    • you might have to do spotless:apply first, since it changes the baseline to not do ratcheting, but we only care about performance for this

One thing that git status has going for it is native filesystem access. Gradle has used native code to do up-to-date and incremental checks for the last release or two, and it was a noticeable improvement. To get this benefit in spotless maven, it might be beneficial to shell out to git diff --name-status <RATCHET>, parse the result, and then filter it against the includes/excludes.

Regardless, PRs welcome :)

@eriklumme
Copy link
Contributor Author

Thanks for the reply, took a look at the code after posting (better late than never...) and realized it was indeed the case.

If this was to be changed, the files would be retrieved from the GitRatchet and compared to the includes/excludes using SelectorUtils#matchPath, which is what FileUtils eventually delegates to, at least in the version I happen to have on my classpath.

I'll get back to this with some stats when I have the time, I've currently been running it in a remote VM, but I'd like to try it locally too, maybe even do a bit of profiling if I can get such a setup to work.

@nedtwigg
Copy link
Member

the files would be retrieved from the GitRatchet

I agree conceptually. In terms of implementation, I suspect it would be better performance and easier to build if you dropped GitRatchet entirely, and instead used git diff --name-status <RATCHET_REF>. GitRatchet was written for the gradle plugin, and then refactored for use by the maven plugin. Since gradle provides fast up-to-date checking out of the box, but maven doesn't, it might be hard to get one hunk of code to serve different models. But I have no strong opinion, and I'd be happy to merge any PR which made the maven plugin faster.

@nedtwigg
Copy link
Member

Something which hadn't occurred to me: it's not actually git diff --name-status <RATCHET_REF>, it's actually <MERGE_BASE>, which gets calculated here:

RevCommit ratchetFrom = revWalk.parseCommit(commitSha);
RevCommit head = revWalk.parseCommit(repo.resolve(Constants.HEAD));
revWalk.setRevFilter(RevFilter.MERGE_BASE);
revWalk.markStart(ratchetFrom);
revWalk.markStart(head);
RevCommit mergeBase = revWalk.next();
treeSha = Optional.ofNullable(mergeBase).orElse(ratchetFrom).getTree();

If you don't do the diff relative to the merge base, you will bring back an old bug, #627. Luckily, you could call GitRatchet. rootTreeShaOf, and pass the result to git diff. So part of GitRatchet is still helpful for the design discussed above.

@nedtwigg
Copy link
Member

nedtwigg commented Oct 5, 2020

Thanks to @eriklumme for a great PR! Released in plugin-maven 2.4.2.

@nedtwigg nedtwigg closed this as completed Oct 5, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants