Offers new ways of computing bulk load plans #4898

Merged
merged 16 commits into from
Feb 1, 2025

Conversation

keith-turner
Contributor

@keith-turner keith-turner commented Sep 17, 2024

These changes offer two new ways of computing bulk import load plans. First, the RFile API was modified to support computing a LoadPlan as the RFile is written. Second, a new LoadPlan.compute() method was added that creates a LoadPlan from an existing RFile. In addition, methods were added to LoadPlan that support serializing and deserializing load plans to and from JSON.

Together, these changes support computing load plans in a distributed manner. For example, given a bulk import directory with N files, the following workflow is now supported.

  1. For each file, a task is spun up on a remote server that calls the new LoadPlan.compute() API to determine which tablets the file overlaps. The task then calls the new LoadPlan.toJson() method to serialize the load plan and sends it to a central place.
  2. All the load plans from the remote servers are deserialized by calling the new LoadPlan.fromJson() method and merged into a single load plan that is used to do the bulk import.
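
The two steps above can be modeled with a small stdlib-only sketch. It is a conceptual illustration, not Accumulo code: the class `DistributedPlanSketch`, its methods, and the string representation of tablets are all hypothetical stand-ins, and it assumes only each file's first and last row are known. In the real workflow each task would call LoadPlan.compute() and LoadPlan.toJson(), and the central process would call LoadPlan.fromJson() before merging.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;

// Hypothetical stand-in for illustration only; none of this is Accumulo code.
final class DistributedPlanSketch {

  // Formats a tablet as the row range (prevEndRow, endRow]; null means unbounded.
  static String tablet(String prevEndRow, String endRow) {
    return "(" + (prevEndRow == null ? "-inf" : prevEndRow) + ", "
        + (endRow == null ? "+inf)" : endRow + "]");
  }

  // Step 1 (per file, could run on any machine): list the tablets whose row
  // ranges intersect [firstRow, lastRow], given the table's sorted split points.
  static Map<String, List<String>> planForFile(SortedSet<String> splits, String fileName,
      String firstRow, String lastRow) {
    List<String> tablets = new ArrayList<>();
    // The greatest split strictly below firstRow is the previous end row
    // of the first overlapped tablet.
    SortedSet<String> below = splits.headSet(firstRow);
    String prev = below.isEmpty() ? null : below.last();
    boolean reachedLastRow = false;
    for (String split : splits.tailSet(firstRow)) {
      tablets.add(tablet(prev, split));
      if (split.compareTo(lastRow) >= 0) {
        reachedLastRow = true;
        break;
      }
      prev = split;
    }
    if (!reachedLastRow) {
      // lastRow is beyond every split, so the file also reaches the final tablet.
      tablets.add(tablet(prev, null));
    }
    Map<String, List<String>> plan = new TreeMap<>();
    plan.put(fileName, tablets);
    return plan;
  }

  // Step 2 (central): merge the per-file plans into one plan for the whole import.
  static Map<String, List<String>> merge(List<Map<String, List<String>>> perFilePlans) {
    Map<String, List<String>> merged = new TreeMap<>();
    for (Map<String, List<String>> plan : perFilePlans) {
      merged.putAll(plan);
    }
    return merged;
  }
}
```

With splits g, m, and t, a file spanning rows h through n maps to tablets (g, m] and (m, t], while a file spanning u through z maps only to (t, +inf).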

Another use case these new APIs could support is running this new code in the map reduce job that generates bulk import data.

  1. As each reducer writes to an RFile, it could also build a LoadPlan. The load plan can be obtained from the RFile writer after closing it, serialized using LoadPlan.toJson(), and saved to a file. After the map reduce job completes, each RFile would have a corresponding file containing its load plan.
  2. Another process that runs after the map reduce job can read the load plan files, deserialize them using the new LoadPlan.fromJson() method, and merge them. The merged LoadPlan can then be used to do the bulk import.
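
The idea of building a plan while writing can be sketched as follows. This is a conceptual, stdlib-only illustration and makes no claim about how the RFile writer does it internally: `PlanTrackingWriter` and its methods are hypothetical names, rows are plain strings, and the sketch assumes rows are appended in sorted order. A new tablet is looked up only when an appended row falls past the end row of the tablet currently being tracked.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical stand-in for illustration only; this is NOT Accumulo's RFile writer.
final class PlanTrackingWriter {
  private final SortedSet<String> splits;
  private final List<String> tablets = new ArrayList<>();
  private String currentEndRow; // end row of the tablet being written to; null = last tablet
  private boolean started = false;
  private String lastRow;

  PlanTrackingWriter(SortedSet<String> splits) {
    this.splits = new TreeSet<>(splits);
  }

  // Formats a tablet as the row range (prevEndRow, endRow]; null means unbounded.
  private static String tablet(String prevEndRow, String endRow) {
    return "(" + (prevEndRow == null ? "-inf" : prevEndRow) + ", "
        + (endRow == null ? "+inf)" : endRow + "]");
  }

  // Records the row's tablet; a real writer would also write the key and value.
  void append(String row) {
    if (lastRow != null && row.compareTo(lastRow) < 0) {
      throw new IllegalArgumentException("rows must be appended in sorted order");
    }
    lastRow = row;
    // Still inside the tablet already recorded, so there is nothing new to add.
    if (started && (currentEndRow == null || row.compareTo(currentEndRow) <= 0)) {
      return;
    }
    // Resolve the tablet (prev, end] that contains this row.
    SortedSet<String> below = splits.headSet(row);
    String prev = below.isEmpty() ? null : below.last();
    SortedSet<String> atOrAbove = splits.tailSet(row);
    currentEndRow = atOrAbove.isEmpty() ? null : atOrAbove.first();
    tablets.add(tablet(prev, currentEndRow));
    started = true;
  }

  // Returns the tablets that received at least one row, in row order.
  List<String> loadPlanTablets() {
    return new ArrayList<>(tablets);
  }
}
```

Because tablets are recorded only when a row actually lands in them, a tablet that lies between the first and last row but receives no data is left out of this sketch's plan.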

Both of these use cases avoid analyzing all of the files on the single machine that performs the bulk import. Bulk import v1 had this functionality and would ask random tservers to do the file analysis, which could cause unexpected load on those tservers. Bulk import v1 would also interleave analyzing files and adding them to tablets. This could lead to odd situations where a file had been added to some tablets and then analysis failed, leaving the file partially imported. Bulk import v2 does all analysis before any files are added to tablets, but it lacks this distributed analysis capability. These changes provide the building blocks for doing, in bulk import v2, the distributed analysis that bulk import v1 did.

Contributor

@ddanielr ddanielr left a comment

Needs some minor formatting changes to pass build checks

@dlmarion dlmarion added the blocker This issue blocks any release version labeled on it. label Jan 16, 2025
@dlmarion dlmarion added this to the 2.1.4 milestone Jan 16, 2025

Co-authored-by: Daniel Roberts <ddanielr@gmail.com>
@keith-turner keith-turner force-pushed the bulk-load-improvement branch from faf86a4 to e1bcc9d Compare January 16, 2025 21:18
@keith-turner keith-turner changed the title prototype of bulk import v2 distributed file examination Offers new ways of computing bulk load plans Jan 16, 2025
@keith-turner keith-turner marked this pull request as ready for review January 16, 2025 21:54
@keith-turner
Contributor Author

keith-turner commented Jan 16, 2025

I squashed all of the commits on this branch and pulled in a few changes from #4933 that were not present here. I also copied the description from #4933 as the description for this PR, took this PR out of draft, and closed #4933.

Member

@ctubbsii ctubbsii left a comment

Overall, looks good. I have minor quality suggestions. The main thing I'm concerned about is the semver breakage. I think it's probably okay in this case, but definitely should be called out as a semver violation in the release notes, in case anybody starts using this and wants to downgrade back to 2.1.3 for any reason.

keith-turner and others added 8 commits January 17, 2025 10:17
…WriterBuilder.java

Co-authored-by: Christopher Tubbs <ctubbsii@apache.org>
…WriterBuilder.java

Co-authored-by: Christopher Tubbs <ctubbsii@apache.org>
…Writer.java

Co-authored-by: Christopher Tubbs <ctubbsii@apache.org>
Co-authored-by: Christopher Tubbs <ctubbsii@apache.org>
…s.java

Co-authored-by: Christopher Tubbs <ctubbsii@apache.org>
keith-turner and others added 4 commits January 17, 2025 17:00
Lowers the log level on a message from debug to trace to reduce logging
spam.

Adds a test case for LoadPlan.compute to ensure that the correct
filesystem is being chosen for a given file URI.
Contributor

@ddanielr ddanielr left a comment

Tested this code and the functionality works.
Added a test case for specific URI checks.

I think all the feedback is addressed and this PR is good to merge.

@keith-turner keith-turner merged commit 7a27d40 into apache:2.1 Feb 1, 2025
8 checks passed