Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Offers new ways of computing bulk load plans (#4898)
Two new ways of computing bulk import load plans are offered in this change. First, the RFile API was modified to support computing a LoadPlan as the RFile is written. Second, a new LoadPlan.compute() method was added that creates a LoadPlan from an existing RFile. In addition, methods were added to LoadPlan that support serializing and deserializing load plans to and from JSON.

Together, these changes support the use case of computing load plans in a distributed manner. For example, given a bulk import directory with N files, the following use case is now supported.

1. For each file, a task is spun up on a remote server that calls the new LoadPlan.compute() API to determine which tablets the file overlaps. The new LoadPlan.toJson() method is then called to serialize the load plan and send it to a central place.
2. All the load plans from the remote servers are deserialized by calling the new LoadPlan.fromJson() method and merged into a single load plan that is used to do the bulk import.

Another use case these new APIs could support is running this new code in the map reduce job that generates bulk import data.

1. As each reducer writes to an RFile, it could also build a LoadPlan. The load plan can be obtained from the RFile after closing it, serialized using LoadPlan.toJson(), and saved to a file. After the map reduce job completes, each RFile would have a corresponding file containing its load plan.
2. Another process that runs after the map reduce job can read all the load plans from those files, deserialize them using the new LoadPlan.fromJson() method, and merge them. The merged LoadPlan can then be used to do the bulk import.

Both of these use cases avoid analyzing every file on the single machine doing the bulk import. Bulk import V1 had this functionality and would ask random tservers to do the file analysis, which could cause unexpected load on those tservers. Bulk V1 would also interleave analyzing files and adding them to tablets. This could lead to odd situations where a file is added to some tablets and then analysis fails, leaving the file partially imported. Bulk V2 does all analysis before any files are added to tablets; however, it lacks this distributed analysis capability. These changes provide the building blocks to do for bulk V2 the distributed analysis that bulk V1 did.

Co-authored-by: Daniel Roberts <ddanielr@gmail.com>
Co-authored-by: Christopher Tubbs <ctubbsii@apache.org>
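The distributed workflow described in the commit message can be modeled roughly as follows. This is a minimal stand-alone sketch, not the Accumulo implementation: it uses no Accumulo classes, rows are plain strings, tablets are rendered as `(prevRow, endRow]` strings, and every name in it (`DistributedPlanSketch`, `tabletFor`, `planForFile`, `merge`) is hypothetical. In real code the per-file step would presumably be LoadPlan.compute() and the hand-off between tasks would go through LoadPlan.toJson() and LoadPlan.fromJson(), as the message states.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.NavigableSet;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Stand-alone model of distributed load plan computation.
// Every name here is hypothetical; no Accumulo classes are used.
final class DistributedPlanSketch {

  // Returns the tablet (prevRow, endRow] that contains the row, given the table's split points.
  static String tabletFor(NavigableSet<String> splits, String row) {
    String prevRow = splits.lower(row); // greatest split strictly less than the row
    String endRow = splits.ceiling(row); // least split greater than or equal to the row
    return "(" + (prevRow == null ? "-inf" : prevRow) + ", "
        + (endRow == null ? "+inf" : endRow) + "]";
  }

  // Work done by one remote task: find every tablet that one file's rows fall into.
  static Set<String> planForFile(NavigableSet<String> splits, List<String> rows) {
    Set<String> tablets = new LinkedHashSet<>();
    for (String row : rows) {
      tablets.add(tabletFor(splits, row));
    }
    return tablets;
  }

  // Work done centrally: merge the per-file plans into a single plan.
  static Map<String, Set<String>> merge(List<Map<String, Set<String>>> plans) {
    Map<String, Set<String>> merged = new TreeMap<>();
    for (Map<String, Set<String>> plan : plans) {
      plan.forEach((file, tablets) ->
          merged.computeIfAbsent(file, f -> new LinkedHashSet<>()).addAll(tablets));
    }
    return merged;
  }

  public static void main(String[] args) {
    NavigableSet<String> splits = new TreeSet<>(List.of("g", "p"));
    List<Map<String, Set<String>>> plans = new ArrayList<>();
    plans.add(Map.of("f1.rf", planForFile(splits, List.of("a", "c", "h"))));
    plans.add(Map.of("f2.rf", planForFile(splits, List.of("q", "z"))));
    System.out.println(merge(plans));
  }
}
```

With splits at `g` and `p`, the hypothetical file `f1.rf` overlaps two tablets and `f2.rf` overlaps one. Only the small per-file results need to reach the machine that performs the bulk import, which is the point of distributing the analysis.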
1 parent 51e152a, commit 7a27d40
Showing 12 changed files with 792 additions and 27 deletions.
131 changes: 131 additions & 0 deletions
core/src/main/java/org/apache/accumulo/core/client/rfile/LoadPlanCollector.java
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,131 @@ | ||
/* | ||
* Licensed to the Apache Software Foundation (ASF) under one | ||
* or more contributor license agreements. See the NOTICE file | ||
* distributed with this work for additional information | ||
* regarding copyright ownership. The ASF licenses this file | ||
* to you under the Apache License, Version 2.0 (the | ||
* "License"); you may not use this file except in compliance | ||
* with the License. You may obtain a copy of the License at | ||
* | ||
* https://www.apache.org/licenses/LICENSE-2.0 | ||
* | ||
* Unless required by applicable law or agreed to in writing, | ||
* software distributed under the License is distributed on an | ||
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
* KIND, either express or implied. See the License for the | ||
* specific language governing permissions and limitations | ||
* under the License. | ||
*/ | ||
package org.apache.accumulo.core.client.rfile; | ||
|
||
import java.util.HashSet; | ||
import java.util.Set; | ||
|
||
import org.apache.accumulo.core.data.Key; | ||
import org.apache.accumulo.core.data.LoadPlan; | ||
import org.apache.accumulo.core.data.TableId; | ||
import org.apache.accumulo.core.dataImpl.KeyExtent; | ||
import org.apache.hadoop.io.Text; | ||
|
||
import com.google.common.base.Preconditions; | ||
|
||
class LoadPlanCollector { | ||
|
||
private final LoadPlan.SplitResolver splitResolver; | ||
private boolean finished = false; | ||
private Text lgFirstRow; | ||
private Text lgLastRow; | ||
private Text firstRow; | ||
private Text lastRow; | ||
private Set<KeyExtent> overlappingExtents; | ||
private KeyExtent currentExtent; | ||
private long appended = 0; | ||
|
||
LoadPlanCollector(LoadPlan.SplitResolver splitResolver) { | ||
this.splitResolver = splitResolver; | ||
this.overlappingExtents = new HashSet<>(); | ||
} | ||
|
||
LoadPlanCollector() { | ||
this.splitResolver = null; | ||
this.overlappingExtents = null; | ||
} | ||
|
||
private void appendNoSplits(Key key) { | ||
if (lgFirstRow == null) { | ||
lgFirstRow = key.getRow(); | ||
lgLastRow = lgFirstRow; | ||
} else { | ||
var row = key.getRow(); | ||
lgLastRow = row; | ||
} | ||
} | ||
|
||
// Placeholder table id: only the row range of each KeyExtent is used in the load plan. | ||
private static final TableId FAKE_ID = TableId.of("123"); | ||
|
||
private void appendSplits(Key key) { | ||
var row = key.getRow(); | ||
if (currentExtent == null || !currentExtent.contains(row)) { | ||
var tableSplits = splitResolver.apply(row); | ||
var extent = new KeyExtent(FAKE_ID, tableSplits.getEndRow(), tableSplits.getPrevRow()); | ||
Preconditions.checkState(extent.contains(row), "%s does not contain %s", tableSplits, row); | ||
if (currentExtent != null) { | ||
overlappingExtents.add(currentExtent); | ||
} | ||
currentExtent = extent; | ||
} | ||
} | ||
|
||
public void append(Key key) { | ||
if (splitResolver == null) { | ||
appendNoSplits(key); | ||
} else { | ||
appendSplits(key); | ||
} | ||
appended++; | ||
} | ||
|
||
public void startLocalityGroup() { | ||
if (lgFirstRow != null) { | ||
if (firstRow == null) { | ||
firstRow = lgFirstRow; | ||
lastRow = lgLastRow; | ||
} else { | ||
// take the minimum | ||
firstRow = firstRow.compareTo(lgFirstRow) < 0 ? firstRow : lgFirstRow; | ||
// take the maximum | ||
lastRow = lastRow.compareTo(lgLastRow) > 0 ? lastRow : lgLastRow; | ||
} | ||
lgFirstRow = null; | ||
lgLastRow = null; | ||
} | ||
} | ||
|
||
public LoadPlan getLoadPlan(String filename) { | ||
Preconditions.checkState(finished, "Attempted to get load plan before closing"); | ||
|
||
if (appended == 0) { | ||
return LoadPlan.builder().build(); | ||
} | ||
|
||
if (splitResolver == null) { | ||
return LoadPlan.builder().loadFileTo(filename, LoadPlan.RangeType.FILE, firstRow, lastRow) | ||
.build(); | ||
} else { | ||
var builder = LoadPlan.builder(); | ||
overlappingExtents.add(currentExtent); | ||
for (var extent : overlappingExtents) { | ||
builder.loadFileTo(filename, LoadPlan.RangeType.TABLE, extent.prevEndRow(), | ||
extent.endRow()); | ||
} | ||
return builder.build(); | ||
} | ||
} | ||
|
||
public void close() { | ||
finished = true; | ||
// compute the overall min and max rows | ||
startLocalityGroup(); | ||
} | ||
} |
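When no split resolver is supplied, the collector above records only the first and last rows of the file. The `startLocalityGroup()` method takes a minimum and maximum because rows are sorted within a locality group but each new group can start over at an earlier row. The following stand-alone sketch shows that folding step with plain strings in place of `Text` and `Key`; the class name `RowRangeTracker` and its accessor names are hypothetical and it is a simplified model, not the code from this commit.

```java
// Simplified model of the no-splits path: track the overall first and last row
// of a file whose rows arrive sorted within each locality group.
// All names are hypothetical; String stands in for Text and Key.
final class RowRangeTracker {
  private String lgFirstRow; // first row of the current locality group
  private String lgLastRow;  // last row seen so far in the current locality group
  private String firstRow;   // smallest row across all finished groups
  private String lastRow;    // largest row across all finished groups

  void append(String row) {
    if (lgFirstRow == null) {
      lgFirstRow = row;
    }
    // Rows are sorted within a group, so the latest row is the group's last row.
    lgLastRow = row;
  }

  // Folds the finished group's range into the overall range.
  void startLocalityGroup() {
    if (lgFirstRow == null) {
      return; // nothing was appended to the previous group
    }
    if (firstRow == null) {
      firstRow = lgFirstRow;
      lastRow = lgLastRow;
    } else {
      firstRow = firstRow.compareTo(lgFirstRow) < 0 ? firstRow : lgFirstRow;
      lastRow = lastRow.compareTo(lgLastRow) > 0 ? lastRow : lgLastRow;
    }
    lgFirstRow = null;
    lgLastRow = null;
  }

  // Folds in the final group, mirroring what close() does in the collector.
  void close() {
    startLocalityGroup();
  }

  String firstRow() {
    return firstRow;
  }

  String lastRow() {
    return lastRow;
  }

  public static void main(String[] args) {
    RowRangeTracker tracker = new RowRangeTracker();
    tracker.append("c");
    tracker.append("m");
    tracker.startLocalityGroup(); // the second group restarts at an earlier row
    tracker.append("a");
    tracker.append("f");
    tracker.close();
    System.out.println(tracker.firstRow() + " .. " + tracker.lastRow());
  }
}
```

In this usage the second group begins before and ends before the first one, so taking only the first and last appended rows would give the wrong range; the fold yields `a` as the first row and `m` as the last.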