
HBASE-28836 Parallelize the file archival to improve the split times #6243

Open · wants to merge 4 commits into base: master (showing changes from 1 commit)
@@ -439,7 +439,7 @@ private static List<File> resolveAndArchive(FileSystem fs, Path baseArchiveDir,

   List<File> failures = new ArrayList<>();
   String startTime = Long.toString(start);
-  for (File file : toArchive) {
+  toArchive.parallelStream().forEach(file -> {
Contributor:
Good to check the difference between `parallelStream()` and `stream().parallel()`. The Javadoc for `parallelStream()` says: "It is allowable for this method to return a sequential stream."
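For reference, the distinction the comment points at can be checked directly. A minimal sketch (standalone, not part of the patch): the standard JDK collections do hand back parallel streams from `parallelStream()`, but only `stream().parallel()` explicitly requests parallel mode; the `parallelStream()` contract merely permits it.

```java
import java.util.List;

public class ParallelStreamCheck {
    public static void main(String[] args) {
        List<Integer> xs = List.of(1, 2, 3, 4);

        // Collection.parallelStream() is documented as *allowed* to return a
        // sequential stream; the reference JDK's collections return a parallel
        // one, but code should not rely on that.
        System.out.println(xs.parallelStream().isParallel());

        // stream().parallel() explicitly switches the stream to parallel mode,
        // which is the more defensive spelling.
        System.out.println(xs.stream().parallel().isParallel());
    }
}
```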

Contributor:

Also, if toArchive is small, there is a chance that running it in parallel takes more time than doing the same work sequentially.

Contributor Author (@mnpoonia), Sep 13, 2024:

That shouldn't matter much here, since archiving is an I/O operation (any parallelism overhead wouldn't be noticeable), so parallelizing it should still improve on what we already have. In principle, though, I agree with what you are saying.

Contributor:

I thought `parallelStream()` (and its underlying ForkJoinPool) is not really intended for blocking I/O operations.

Contributor:

Yes, that's another point to consider.

Contributor:

Is the access to the ArrayList `failures` thread-safe?

Contributor:

There are I/O operations in this code block, so it is usually not safe to use parallelStream...

AFAIK parallelStream uses a common thread pool (ForkJoinPool.commonPool, IIRC) that is shared by lots of other operations in the JVM, so it is easy to introduce strange deadlocks.

If we want to do this in parallel, we'd better introduce our own thread pool.
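A minimal sketch of that suggestion (hypothetical names, not the actual HBase patch): a dedicated, bounded `ExecutorService` for the blocking archive I/O instead of `ForkJoinPool.commonPool`, with failures collected into a synchronized list so the concurrent `add()` calls are safe.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelArchiveSketch {

    // Stand-in for the real archive operation, which does blocking I/O.
    static boolean archive(String file) {
        return !file.startsWith("bad");
    }

    public static List<String> archiveAll(List<String> toArchive)
            throws InterruptedException {
        // Synchronized wrapper makes concurrent add() calls safe.
        List<String> failures = Collections.synchronizedList(new ArrayList<>());
        // Own pool, bounded for I/O work, instead of the shared commonPool.
        ExecutorService pool = Executors.newFixedThreadPool(
            Math.max(1, Math.min(toArchive.size(), 8)));
        try {
            for (String file : toArchive) {
                pool.execute(() -> {
                    if (!archive(file)) {
                        failures.add(file);
                    }
                });
            }
        } finally {
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.MINUTES);
        }
        return failures;
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> failures = archiveAll(List.of("a", "bad-b", "c"));
        System.out.println(failures); // prints [bad-b]
    }
}
```

Sizing the pool explicitly also gives operators a knob for how much archive I/O runs concurrently, which commonPool does not offer.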

     // if its a file archive it
     try {
       LOG.trace("Archiving {}", file);
@@ -463,7 +463,7 @@ private static List<File> resolveAndArchive(FileSystem fs, Path baseArchiveDir,
       LOG.warn("Failed to archive {}", file, e);
       failures.add(file);
     }
-  }
+  });
   return failures;
 }
