-
Notifications
You must be signed in to change notification settings - Fork 980
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DRILL-7191 / DRILL-7026]: RM state blob persistence in Zookeeper and Integration of Distributed queue configuration with Planner #1762
base: master
Are you sure you want to change the base?
Conversation
55e5e15
to
130252a
Compare
…tion with Simple Parallelizer. Refactor existing ZK based queue to accommodate new Distributed queue for RM. Refactor and rename the existing memory allocation utilities to ZKQueueMemoryAllocationUtilities and DefaultMemoryAllocationUtilities. Parallelizer code is changed to accommodate the memory adjustment for the operators during parallelization phase. With this change, there are 3 different implementation of SimpleParallelizer; they are ZKQueueParallelizer, DistributedQueueParallelizer and DefaultParallelizer which will be used by ZK based RM, Distributed RM and Non RM configuration.
UUID support for DrillbitEndpoint RMState Blobs definition, serialization and deserialization, Zookeeper client support for transactions ZookeeperPersistentTransactional Store and RMStateBlobManager to do updates under lock Protect running and waiting queries map in WorkerBee
…tion with Simple Parallelizer. Integration changes with new DistributedRM queue configuration. a) Remove the redundant NodeResource and merge the additional member functions with the NodeResources class. b) Added new UUID logic and selection of a queue based on the memory requirement during parallelization phase. c) Changed proto definitions to set the UUID of a drillbit. d) Implementation of new DrillNode Wrapper over DrillbitEndpoint to fix the equality comparisions between DrillbitEndpoints.
130252a
to
02402ed
Compare
de9e5f7
to
e9b4fa5
Compare
Added stubs for QueryResourceManager exit and wait/cleanup thread Update MemoryCalculator to use DrillNode instead of DrillbitEndpoint Changes to support localbit resource registration to cluster state blob using DrillbitStatusListener Support ThrottledResourceManager via ResourceManagerBuilder Add some E2E tests and RMStateBlobs tests along with some bug fixes Fix TestRMConfigLoad tests to handle case where ZKQueues are explicitly enabled
…tion with Simple Parallelizer. Changes to set the memory allocation per operator in query profile. Addressing an memory minimization logic was not considering non-buffered operators. Handling error cases when memory requirements for buffered or non-buffered cannot be reduced.
e9b4fa5
to
1517a87
Compare
|
||
public OpProfileDef(int operatorId, int operatorType, int incomingCount) { | ||
public OpProfileDef(int operatorId, int operatorType, int incomingCount, long optimalMemoryAllocation) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will all the creator of OpProfileDef
always pass MaxAllocation
for optimalMemoryAllocation
?
@@ -88,7 +88,7 @@ public OperatorStats(OperatorStats original, boolean isClean) { | |||
} | |||
|
|||
@VisibleForTesting | |||
public OperatorStats(int operatorId, int operatorType, int inputCount, BufferAllocator allocator) { | |||
public OperatorStats(int operatorId, int operatorType, int inputCount, BufferAllocator allocator, long initialAllocation) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggest to rename initialAllocation
to optimalMemoryAllocation
private final QueryContext queryContext; | ||
private final long MINIMUM_MEMORY_FOR_BUFFER_OPERS; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
upper case variable name should be used only for constants, please change it to lower case.
endpoint.getUserPort() == otherEndpoint.getUserPort() && | ||
endpoint.getControlPort() == otherEndpoint.getControlPort() && | ||
endpoint.getDataPort() == otherEndpoint.getDataPort() && | ||
endpoint.getVersion().equals(otherEndpoint.getVersion()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like all the fields in DrillbitEndpoint
are optional, so we should check if the field is present or not before calling equals on it. Just like done for hashCode()
below or refer equals
in generated file for DrillbitEndpoint.
.append(endpoint.getAddress()) | ||
.append("endpoint user port: ") | ||
.append(endpoint.getUserPort()).toString(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
check if field is present or not before accessing it.
queryContext.getSession(), queryContext.getQueryContextInfo()); | ||
planner.visitPhysicalPlan(queryWorkUnit); | ||
// planner.visitPhysicalPlan(queryWorkUnit); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove this commented line
@@ -79,6 +79,7 @@ message PlanFragment { | |||
optional string options_json = 15; | |||
optional QueryContextInformation context = 16; | |||
repeated Collector collector = 17; | |||
optional string endpointUUID = 18; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggest to rename this to assignedEndpointUUID
since there are 2 endpoints as part of PlanFragment: assignedEndpoint
and foremanEndpoint
* fragment is based on the cluster state and provided queue configuration. | ||
*/ | ||
public class DistributedQueueParallelizer extends SimpleParallelizer { | ||
static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(DistributedQueueParallelizer.class); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
private
else { | ||
return operator.getMaxAllocation(); | ||
} | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about below ? Also would be good to add a comment why memory for buffered operator is not retrieved from the PreCostEstimates
.
return (endpoint, operator) -> {
long operatorMemory = operator.getMaxAllocation();
if (!planHasMemory) {
final DrillNode drillEndpointNode = DrillNode.create(endpoint);
if (operator.isBufferedOperator(queryContext)) {
operatorMemory = operators.get(drillEndpointNode).get(operator);
} else {
operatorMemory = (long)operator.getCost().getMemoryCost();
}
}
logger.debug(" Memory requirement for the operator {} in endpoint {} is {}", operator, endpoint, operatorMemory);
return operatorMemory;
};
|
||
return nodeMap; | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couple of things:
- Why not construct
Map<String, Collection<PhysicalOperator>>
to start with instead ofMultimap<DrillbitEndpoint, PhysicalOperator>
and then converting it. bufferedOperators.asMap()
is not guaranteed to work in the same way as intended since DrillbitEndpointequals
andhashCode
methods are not reliable which is why we createdDrillNode
.
38e9a73
to
51cf5a4
Compare
51cf5a4
to
a3f8b36
Compare
} | ||
|
||
@Override | ||
public Map<DrillbitEndpoint, String> getOnlineEndpointsUUID() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this is duplicate code similar to that of the LocalClusterCoordinator. Can you move this function into a common place and use it in both the places?
getCache().rebuildNode(target); | ||
} catch (final Exception e) { | ||
throw new DrillRuntimeException("unable to put ", e); | ||
} | ||
} | ||
|
||
public void createAsTransaction(List<String> paths) { | ||
Preconditions.checkNotNull(paths, "no paths provided to create"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it also better to check for empty list of paths? Also it might be good to add a comment for this function.
* @param pathsWithData - map of blob paths to update and the final data | ||
* @param version - version holder | ||
*/ | ||
public void putAsTransaction(Map<String, byte[]> pathsWithData, DataChangeVersion version) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we currently use non null version. If not then can you please mention it in the comment that this is needed for future use.
String queryId, String foremanNode) throws Exception { | ||
// Looks like leader hasn't changed yet so let's try to reserve the resources | ||
// See if the call is to reserve or free up resources | ||
Map<String, NodeResources> resourcesMap = queryResourceAssignment; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be changed to the following code.
Map<String, NodeResources> resourcesMap = queryResourceAssignment.entrySet().stream()
.collect(Collectors.toMap(Map.Entry::getKey,
(x) -> new NodeResources(x.getValue().getVersion(),
-x.getValue().getMemoryInBytes(),
-x.getValue().getNumVirtualCpu())));
public class RMConsistentBlobStoreManager implements RMBlobStoreManager { | ||
private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(RMConsistentBlobStoreManager.class); | ||
|
||
private static final String RM_BLOBS_ROOT = "rm/blobs"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need to start this path with "/" ?
totalDataBytes += entry.getValue().length; | ||
} | ||
|
||
// If total set operator payload is greater than 1MB then curator set operation will fail |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
greater than or equal to?
throw ex; | ||
} finally { | ||
// Check if the caller has acquired the mutex | ||
if (globalBlobMutex.isAcquiredInThisProcess()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this check required. Shouldn't be the case that we should acquire the lock if we are here?
This PR contains changes for the support of RM Framework both on execution and planning side, tracked by JIRA's DRILL-7191 and DRILL-7026.
Refactoring existing ZK based queue to accommodate new Distributed queue for RM. Moved QueryResourceAllocators memory allocation code to utility classes like ZKQueueMemoryAllocationUtilities and DefaultMemoryAllocationUtilities. Refactored the Parallelizer code to accommodate the memory adjustment for the operators during parallelization phase. There are 3 different implementation of SimpleParallelizer such as ZKQueueParallelizer, DistributedQueueParallelizer and DefaultParallelizer which will be used by ZK based RM, Distributed RM and Non RM configuration.
Planner integration with RM to select queue and reduce query level memory to be within queue limits. Changes to handle scenarios where buffered operator are at least getting minimum required memory allocation. Based on the calculated memory for each operator within each fragment it’s initial and maximum memory allocation is set which is later consumed by execution layer to enforce memory limits.
Introduced new DrillNode class to deal with issues when DrillbitEndpoint is searched in a map using some of it’s field.
Changes to support storing UUID for each Drillbit Service Instance locally to be used by planner and execution layer. This UUID is used to uniquely identify a Drillbit and register Drillbit information in the RM StateBlobs. Introduced a PersistentStore named ZookeeperTransactionalPersistenceStore with Transactional capabilities using Zookeeper Transactional API’s. This is used for updating RM State blobs as all the updates need to happen in transactional manner. Added RMStateBlobs definition and support for serde to Zookeeper. Implementation for DistributedRM and its corresponding QueryRM apis.
Updated the state management of Query in Foreman so that same Foreman object can be submitted multiple times. Also introduced concept of 2 maps keeping track of waiting and running queries. These were done to support for async admit protocol which will be needed with Distributed RM.
Support for serde of optimalMemoryAllocation for each operator in each minor fragment in QueryProfile. This is needed to verify the optimalMemory calculated by planner is correct.