Offers new ways of computing bulk load plans #4898

keith-turner · 2024-09-17T22:22:53Z

Two new ways of computing bulk import load plans are offered in these changes. First, the RFile API was modified to support computing a LoadPlan as the RFile is written. Second, a new LoadPlan.compute() method was added that creates a LoadPlan from an existing RFile. In addition, methods were added to LoadPlan that support serializing and deserializing load plans to and from JSON.

Together, these changes support computing load plans in a distributed manner. For example, given a bulk import directory with N files, the following use case is now supported.

For each file, a task is spun up on a remote server that calls the new LoadPlan.compute() API to determine which tablets the file overlaps. The new LoadPlan.toJson() method is then called to serialize the load plan, which is sent to a central place.
All the load plans from the remote servers are deserialized by calling the new LoadPlan.fromJson() method and merged into a single load plan that is used to do the bulk import.
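The per-file overlap step and the central merge step can be illustrated with a small stand-alone sketch. This is not the Accumulo API: `LoadPlanSketch`, `Destination`, `compute`, and `merge` are hypothetical names, rows are plain strings, and each file is assumed to be summarized by its first and last row. It only shows the idea of mapping a file's row range onto tablets defined by split points, where a tablet covers the rows in (prevRow, endRow].

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical stand-in for illustration only; not the Accumulo LoadPlan class.
class LoadPlanSketch {

  /** One file-to-tablet mapping. A tablet covers (prevRow, endRow]; null means unbounded. */
  record Destination(String file, String prevRow, String endRow) {}

  /** Per-file step: list every tablet that overlaps the file's range [firstRow, lastRow]. */
  static List<Destination> compute(String file, String firstRow, String lastRow,
      SortedSet<String> splits) {
    List<Destination> plan = new ArrayList<>();
    // The tablet containing firstRow starts after the greatest split strictly below firstRow.
    SortedSet<String> below = splits.headSet(firstRow);
    String prev = below.isEmpty() ? null : below.last();
    // Each split >= firstRow is the end row of one candidate tablet.
    for (String end : splits.tailSet(firstRow)) {
      plan.add(new Destination(file, prev, end));
      if (end.compareTo(lastRow) >= 0) {
        return plan; // this tablet contains lastRow, so no later tablet overlaps
      }
      prev = end;
    }
    // lastRow is past the final split, so the file also reaches the last tablet.
    plan.add(new Destination(file, prev, null));
    return plan;
  }

  /** Central step: combine the per-file plans into one plan for the whole import. */
  static List<Destination> merge(List<List<Destination>> plans) {
    List<Destination> merged = new ArrayList<>();
    plans.forEach(merged::addAll);
    return merged;
  }

  public static void main(String[] args) {
    SortedSet<String> splits = new TreeSet<>(List.of("g", "m", "t"));
    // In the distributed use case each compute call would run in a separate remote task.
    List<Destination> p1 = compute("f1.rf", "a", "h", splits);
    List<Destination> p2 = compute("f2.rf", "n", "z", splits);
    merge(List.of(p1, p2)).forEach(System.out::println);
  }
}
```

In the real workflow, the per-file result would be a LoadPlan serialized with LoadPlan.toJson() on the remote server and restored with LoadPlan.fromJson() before merging; the sketch skips serialization to stay dependency-free.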

Another use case these new APIs could support is running this new code in the map reduce job that generates bulk import data.

In each reducer, as it writes to an RFile it could also be building a LoadPlan. The load plan can be obtained from the RFile after closing it, serialized using LoadPlan.toJson(), and the result saved to a file. After the map reduce job completes, each RFile would have a corresponding file containing the load plan for that RFile.
Another process that runs after the map reduce job can load all the load plans from those files using the new LoadPlan.fromJson() method and merge them. The merged LoadPlan can then be used to do the bulk import.
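The file handling in this second use case might look like the following sketch. It covers only the sidecar-file bookkeeping with the Java standard library; the plan contents are treated as opaque strings standing in for the output of LoadPlan.toJson(). The class name `SidecarPlanSketch`, the methods `savePlan` and `loadPlans`, and the `.plan.json` suffix are all assumptions made for illustration, not part of Accumulo.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for illustration only; not part of Accumulo.
class SidecarPlanSketch {

  /** Reducer side: after the data file is closed, save its serialized plan next to it. */
  static Path savePlan(Path dataFile, String serializedPlan) {
    try {
      Files.createDirectories(dataFile.toAbsolutePath().getParent());
      // Assumed naming convention: <data file name>.plan.json in the same directory.
      Path sidecar = dataFile.resolveSibling(dataFile.getFileName() + ".plan.json");
      return Files.writeString(sidecar, serializedPlan);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Follow-up process: read every sidecar plan in the bulk import directory. */
  static List<String> loadPlans(Path bulkDir) {
    List<Path> sidecars = new ArrayList<>();
    try (DirectoryStream<Path> stream = Files.newDirectoryStream(bulkDir, "*.plan.json")) {
      stream.forEach(sidecars::add);
      Collections.sort(sidecars); // directory order is unspecified, so sort for determinism
      List<String> plans = new ArrayList<>();
      for (Path sidecar : sidecars) {
        plans.add(Files.readString(sidecar));
      }
      return plans;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("bulk");
    // Placeholder strings; a real job would store the output of LoadPlan.toJson().
    savePlan(dir.resolve("part-0.rf"), "plan for part-0.rf");
    savePlan(dir.resolve("part-1.rf"), "plan for part-1.rf");
    System.out.println(loadPlans(dir).size() + " plans found");
  }
}
```

Each string returned by `loadPlans` would then be passed to LoadPlan.fromJson() and the resulting plans merged before starting the bulk import.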

Both of these use cases avoid doing all of the file analysis on the single machine that performs the bulk import. Bulk import v1 had distributed analysis: it would ask random tservers to do the file analysis, which could cause unexpected load on those tservers. Bulk v1 would also interleave analyzing files and adding them to tablets. This could lead to odd situations where a file is imported to some tablets and then analysis fails, leaving the file partially imported. Bulk v2 does all analysis before any files are added to tablets; however, it lacks this distributed analysis capability. These changes provide the building blocks for doing in bulk v2 the distributed analysis that bulk v1 did.

core/src/main/java/org/apache/accumulo/core/data/LoadPlan.java

core/src/main/java/org/apache/accumulo/core/client/rfile/RFileWriter.java

ddanielr

Needs some minor formatting changes to pass build checks

core/src/main/java/org/apache/accumulo/core/data/LoadPlan.java

core/src/test/java/org/apache/accumulo/core/data/LoadPlanTest.java

Co-authored-by: Daniel Roberts <ddanielr@gmail.com>

keith-turner · 2025-01-16T21:56:24Z

I squashed all of the commits that were on this branch and pulled in a few changes from #4933 that were not present in this branch. I also copied the description from #4933 to use as the description for this PR, took this PR out of draft, and closed #4933.

ctubbsii

Overall, looks good. I have minor quality suggestions. The main thing I'm concerned about is the semver breakage. I think it's probably okay in this case, but definitely should be called out as a semver violation in the release notes, in case anybody starts using this and wants to downgrade back to 2.1.3 for any reason.

core/src/main/java/org/apache/accumulo/core/client/rfile/RFile.java

core/src/main/java/org/apache/accumulo/core/client/rfile/RFileWriter.java

core/src/main/java/org/apache/accumulo/core/client/rfile/RFileWriterBuilder.java

core/src/main/java/org/apache/accumulo/core/client/rfile/RFileWriter.java

core/src/main/java/org/apache/accumulo/core/data/LoadPlan.java

core/src/main/java/org/apache/accumulo/core/file/FileOperations.java

…WriterBuilder.java Co-authored-by: Christopher Tubbs <ctubbsii@apache.org>

…Writer.java Co-authored-by: Christopher Tubbs <ctubbsii@apache.org>

Co-authored-by: Christopher Tubbs <ctubbsii@apache.org>

…s.java Co-authored-by: Christopher Tubbs <ctubbsii@apache.org>

Lowers the log level on a message from debug to trace to reduce logging spam. Adds a test case for LoadPlan.compute to ensure that the correct filesystem is being chosen for a given file URI.

ddanielr

Tested this code and the functionality works.
Added a test case for specific URI checks.

I think all the feedback is addressed and this PR is good to merge.

keith-turner commented Sep 18, 2024

core/src/main/java/org/apache/accumulo/core/data/LoadPlan.java

keith-turner mentioned this pull request Sep 30, 2024

Offers new ways to compute bulk import load plans. #4933

Closed

ddanielr reviewed Sep 30, 2024

core/src/main/java/org/apache/accumulo/core/client/rfile/RFileWriter.java

ddanielr reviewed Sep 30, 2024

dlmarion added the blocker label Jan 16, 2025

dlmarion added this to the 2.1.4 milestone Jan 16, 2025

keith-turner force-pushed the bulk-load-improvement branch from faf86a4 to e1bcc9d Compare January 16, 2025 21:18

keith-turner added 2 commits January 16, 2025 21:33

pulled some changes from apache#4933

7d75095

pulled some changes from apache#4933

7db9d81

keith-turner changed the title ~~prototype of bulk import v2 distributed file examination~~ Offers new ways of computing bulk load plans Jan 16, 2025

keith-turner marked this pull request as ready for review January 16, 2025 21:54

ctubbsii requested changes Jan 16, 2025

keith-turner and others added 8 commits January 17, 2025 10:17

Update core/src/main/java/org/apache/accumulo/core/client/rfile/RFile…

a2650aa

…WriterBuilder.java Co-authored-by: Christopher Tubbs <ctubbsii@apache.org>

Update core/src/main/java/org/apache/accumulo/core/client/rfile/RFile…

18dbb8d

…WriterBuilder.java Co-authored-by: Christopher Tubbs <ctubbsii@apache.org>

Update core/src/main/java/org/apache/accumulo/core/client/rfile/RFile…

83467b6

…Writer.java Co-authored-by: Christopher Tubbs <ctubbsii@apache.org>

Update core/src/main/java/org/apache/accumulo/core/data/LoadPlan.java

fadbd88

Co-authored-by: Christopher Tubbs <ctubbsii@apache.org>

Update core/src/main/java/org/apache/accumulo/core/file/FileOperation…

f566427

…s.java Co-authored-by: Christopher Tubbs <ctubbsii@apache.org>

fix build

5249719

code review update

84142cd

code review update

720baee

keith-turner mentioned this pull request Jan 17, 2025

Explore optimizing RFile LoadPlan computation #5272

Open

keith-turner and others added 4 commits January 17, 2025 17:00

remove TODO

15bd969

Merge branch '2.1' into bulk-load-improvement

2260afe

Merge branch '2.1' into bulk-load-improvement

0641b17

Lowers log level and adds load plan compute test

65f721c

Lowers the log level on a message from debug to trace to reduce logging spam. Adds a test case for LoadPlan.compute to ensure that the correct filesystem is being chosen for a given file URI.

ddanielr approved these changes Jan 31, 2025

Use URL instead of File.toURI

7e3e230

keith-turner merged commit 7a27d40 into apache:2.1 Feb 1, 2025
8 checks passed

keith-turner deleted the bulk-load-improvement branch February 1, 2025 20:52

ddanielr mentioned this pull request Feb 3, 2025

WIP adapt DW PR#2568 to use accumulo PR#4898 NationalSecurityAgency/datawave#2582

Closed
