HBASE-28836 Parallelize the file archival to improve the split times #6243
Conversation
… object store (s3)
@apurtell FYI
LGTM
@@ -439,7 +439,7 @@ private static List<File> resolveAndArchive(FileSystem fs, Path baseArchiveDir,

     List<File> failures = new ArrayList<>();
     String startTime = Long.toString(start);
-    for (File file : toArchive) {
+    toArchive.parallelStream().forEach(file -> {
Good to check the difference between parallelStream() and stream().parallel(). The Javadoc for parallelStream() says: "It is allowable for this method to return a sequential stream."
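To illustrate the point above: both forms request a parallel stream and, for the standard JDK collections, behave the same in practice, but only parallelStream() carries the weaker "may be sequential" contract. A minimal sketch (the class name and sample data are mine, not from the PR):

```java
import java.util.List;

public class StreamComparison {
    // parallelStream() is effectively stream().parallel() for the standard
    // collections, but its contract is weaker: the Javadoc allows it to
    // return a sequential stream. isParallel() reports what we actually got.
    static boolean viaParallelStream(List<Integer> xs) {
        return xs.parallelStream().isParallel();
    }

    static boolean viaStreamParallel(List<Integer> xs) {
        return xs.stream().parallel().isParallel();
    }

    public static void main(String[] args) {
        List<Integer> xs = List.of(1, 2, 3, 4);
        System.out.println("parallelStream():    " + viaParallelStream(xs));
        System.out.println("stream().parallel(): " + viaStreamParallel(xs));
    }
}
```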
Also, if toArchive is small, there is a chance that running it in parallel takes more time than doing the same work sequentially.
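One way to address this concern would be a size guard that only goes parallel past a cutoff. A hedged sketch, not code from the PR; the threshold name and value are hypothetical:

```java
import java.util.List;
import java.util.function.Consumer;

public class SizeGuard {
    // Hypothetical threshold: below it, the setup/scheduling overhead of a
    // parallel stream is likely to outweigh any gain, so fall back to a
    // plain sequential loop.
    static final int PARALLEL_THRESHOLD = 16;

    static <T> void forEachMaybeParallel(List<T> items, Consumer<T> action) {
        if (items.size() < PARALLEL_THRESHOLD) {
            items.forEach(action);                   // sequential: cheap for small lists
        } else {
            items.parallelStream().forEach(action);  // parallel: worth it for large lists
        }
    }

    public static void main(String[] args) {
        forEachMaybeParallel(List.of("a", "b"), System.out::println);
    }
}
```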
This shouldn't be the case here, since these are I/O operations (any scheduling overhead would not be noticeable), so parallelizing should still improve on what we already have. In principle I agree with what you are saying.
I thought parallelStream (and its underlying ForkJoinPool) is not really meant for I/O-blocking operations.
Yes, that's another point to consider
🎊 +1 overall
This message was automatically generated.
💔 -1 overall
This message was automatically generated.
Unit test failures seem to be unrelated.
While it is good to have a parallel stream, it still relies on the CPU core count to size the underlying fork-join pool, right? Why not create our own thread pool to execute the tasks, so we get deterministic behavior independent of CPU cores?
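The suggestion above could look roughly like this: a dedicated fixed-size pool whose width comes from configuration rather than from Runtime.availableProcessors(). This is a hypothetical sketch, not the PR's code; archiveOne() stands in for the real FileSystem rename/archival call:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ArchivePool {
    // Archive every file on a dedicated fixed-size pool. The thread count is
    // an explicit parameter (in HBase it would come from a config key),
    // independent of the machine's CPU count, since the work is I/O-bound.
    static List<String> archiveAll(List<String> files, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        // synchronizedList so concurrent add() calls are safe.
        List<String> failures = Collections.synchronizedList(new ArrayList<>());
        for (String f : files) {
            pool.submit(() -> {
                if (!archiveOne(f)) {
                    failures.add(f);
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES); // wait for all tasks
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return failures;
    }

    // Hypothetical stand-in for the real archival call; "fails" on empty names.
    static boolean archiveOne(String file) {
        return !file.isEmpty();
    }

    public static void main(String[] args) {
        System.out.println("failures: " + archiveAll(List.of("a", "", "c"), 4));
    }
}
```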
Is the access to the ArrayList failures thread-safe?
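It is not: mutating a plain ArrayList from parallelStream().forEach() is a data race. One safe alternative is to collect the failures through the stream itself, since collect() merges per-thread intermediate lists without shared mutable state. A sketch under my own names (tryArchive() is a hypothetical stand-in for the archival attempt):

```java
import java.util.List;
import java.util.stream.Collectors;

public class SafeFailures {
    // Hypothetical stand-in for the real archival attempt;
    // "fails" on names ending in ".bad".
    static boolean tryArchive(String file) {
        return !file.endsWith(".bad");
    }

    // Instead of failures.add(...) inside forEach() (a race on ArrayList),
    // let the stream accumulate failures: collect() builds thread-local
    // containers and merges them, so no shared list is ever mutated.
    static List<String> archiveAndCollectFailures(List<String> files) {
        return files.parallelStream()
            .filter(f -> !tryArchive(f))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(archiveAndCollectFailures(List.of("a.hfile", "b.bad")));
    }
}
```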
Is HFileArchiver in the path for split? It should be background.
Since both region close and the CompactedHFilesDischarger chore perform archival, we can keep the Jira/PR title generic, i.e. "Parallelize the file archival for performance improvement" or simply "Parallelize the file archival" (removing both "split times" and "object stores", since the change is generic and not specific to object stores; the Jira description is a good place to explain why we need this specifically for S3).
…times in object store (s3)" This reverts commit be6e570.
There are I/O operations in the code block, so usually it is not safe to use parallelStream.
AFAIK parallelStream makes use of a common thread pool (ForkJoinPool.commonPool, IIRC) which can be used by lots of other operations in Java, so it is easy to introduce strange deadlocks.
If we want to do this in parallel, we'd better introduce our own thread pool.
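For completeness, there is also a widely used (though not spec-guaranteed) middle ground: running the parallel stream's terminal operation from inside a task submitted to a private ForkJoinPool, which keeps the blocking I/O off commonPool without rewriting the stream as explicit tasks. A hedged sketch with hypothetical names:

```java
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.stream.Collectors;

public class IsolatedParallelStream {
    // Hypothetical stand-in for the blocking archival I/O.
    static boolean tryArchive(String file) {
        return !file.endsWith(".bad");
    }

    // A parallel stream whose terminal op runs inside pool.submit(...) forks
    // its subtasks into that pool rather than ForkJoinPool.commonPool(), so
    // blocking I/O here cannot starve unrelated commonPool users. This relies
    // on observed ForkJoinPool behavior, not on a documented Stream guarantee.
    static List<String> failuresOn(ForkJoinPool pool, List<String> files) {
        return pool.submit(() ->
            files.parallelStream()
                 .filter(f -> !tryArchive(f))
                 .collect(Collectors.toList())
        ).join(); // join() throws unchecked exceptions only
    }

    public static void main(String[] args) {
        ForkJoinPool ioPool = new ForkJoinPool(8); // sized for I/O, not CPU count
        try {
            System.out.println(failuresOn(ioPool, List.of("a.hfile", "b.bad")));
        } finally {
            ioPool.shutdown();
        }
    }
}
```

That said, a plain ExecutorService as suggested above is the more explicit and robust choice for I/O-bound work.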
💔 -1 overall
This message was automatically generated.
🎊 +1 overall
This message was automatically generated.
nit: Parallize -> Parallelize
I have added the logic for an executor service doing the cleanup. I am still looking at the executor service creation and config part, basically trying to figure out the right config. Currently I am thinking of using …
🎊 +1 overall
This message was automatically generated.
🎊 +1 overall
This message was automatically generated.
Created a new PR because of a messed-up local git state.
No description provided.