Operator Creation
Operators are transform nodes within the execution graph that perform a single operation on an incoming tuple set to produce an outgoing tuple set. Each operator is described by an operator definition, a Java object materialized from a JSON description of the operator. The following describes how a fragment creates an operator from an operator definition:
- The operator definition is deserialized from a JSON payload sent from (the foreman?).
- The class of the operator definition is a key into a map of definition classes to operator creator classes.
- The operator creator takes a pair of (definition, incoming operator) and produces the operator class.
- The operator class is created and initialized.
- Upon receipt of the first record schema, the operator generates the code that performs the actual operator functionality.
The operator definition is a POJO deserialized from JSON. There is no "operator definition" base class.
Operators are not created directly. Instead, each operator has an associated factory class (termed a Creator in Drill). For example, the FilterRecordBatch operator has a corresponding FilterBatchCreator. Drill must somehow find the correct factory for each operator. Because Drill starts with a definition, it falls to the OperatorCreatorRegistry to map from definition to factory class.
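The definition-to-factory lookup can be pictured as a map keyed by the definition's class. The following is a minimal sketch of that idea, not Drill's code: FilterDef, SortDef, Creator, FilterCreator, and CreatorRegistry are all hypothetical stand-ins, and the "operator" produced is just a string label.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for operator definitions (plain POJOs, no common base class).
class FilterDef { }
class SortDef { }

// Hypothetical factory interface; a real creator would build an operator, not a label.
interface Creator<T> {
    String create(T definition);
}

class FilterCreator implements Creator<FilterDef> {
    @Override
    public String create(FilterDef definition) {
        return "filter operator";
    }
}

// Hypothetical registry: maps a definition class to the creator that handles it.
class CreatorRegistry {
    private final Map<Class<?>, Creator<?>> creators = new HashMap<>();

    <T> void register(Class<T> definitionClass, Creator<T> creator) {
        creators.put(definitionClass, creator);
    }

    // The runtime class of the definition is the key into the map.
    @SuppressWarnings("unchecked")
    <T> Creator<T> lookup(T definition) {
        Creator<T> creator = (Creator<T>) creators.get(definition.getClass());
        if (creator == null) {
            throw new IllegalStateException(
                "No creator registered for " + definition.getClass().getName());
        }
        return creator;
    }

    public static void main(String[] args) {
        CreatorRegistry registry = new CreatorRegistry();
        registry.register(FilterDef.class, new FilterCreator());
        FilterDef definition = new FilterDef();
        System.out.println(registry.lookup(definition).create(definition));
    }
}
```

Keying on the definition class means the definitions themselves need no knowledge of their factories, which fits the observation that definitions are plain POJOs.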
The registry is initialized using a class scan created elsewhere. The scan enumerates all classes visible to Drill (that is, all classes in all jars on the Drill class path). The registry looks for all classes that implement BatchCreator. To identify the associated definition, the registry uses Java reflection to look at the type of the second argument to the getBatch method of the factory class. For example:
    public class FilterBatchCreator implements BatchCreator<Filter> {
      public FilterRecordBatch getBatch(FragmentContext context, Filter config, List<RecordBatch> children)
The Filter config argument above gives the definition class type.
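To make the reflection step concrete, here is a small sketch of reading the second parameter type of a getBatch method. It assumes nothing about Drill's actual registry code: DemoContext, DemoFilterDef, DemoBatch, DemoCreator, DemoFilterCreator, and DefinitionTypeFinder are hypothetical names that only mirror the shape of the example above.

```java
import java.lang.reflect.Method;
import java.util.List;

// Hypothetical stand-ins mirroring FragmentContext, Filter, and RecordBatch.
class DemoContext { }
class DemoFilterDef { }
class DemoBatch { }

// Hypothetical factory interface shaped like the getBatch example.
interface DemoCreator<T> {
    DemoBatch getBatch(DemoContext context, T config, List<DemoBatch> children);
}

class DemoFilterCreator implements DemoCreator<DemoFilterDef> {
    @Override
    public DemoBatch getBatch(DemoContext context, DemoFilterDef config, List<DemoBatch> children) {
        return new DemoBatch();
    }
}

class DefinitionTypeFinder {
    // Returns the type of the second getBatch parameter, i.e. the definition class.
    static Class<?> definitionTypeOf(Class<?> creatorClass) {
        for (Method m : creatorClass.getDeclaredMethods()) {
            // Generic erasure makes the compiler emit a synthetic bridge method
            // whose second parameter is Object; skip it to get the real type.
            if (m.getName().equals("getBatch") && !m.isBridge() && m.getParameterCount() == 3) {
                return m.getParameterTypes()[1];
            }
        }
        throw new IllegalArgumentException(
            "No getBatch method found on " + creatorClass.getName());
    }

    public static void main(String[] args) {
        System.out.println(definitionTypeOf(DemoFilterCreator.class).getSimpleName());
    }
}
```

The bridge-method check matters in this sketch: without it, the lookup could return Object rather than the definition class, depending on the order in which the methods are returned.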
Recall that a "record batch" acts both as the runtime instance for an operator, and the tuple set produced by that operator. As an operator, the record batch must be given the incoming (upstream, child) operator that produces its input. As a tuple set, the operator requires a schema. The operator determines the schema when it receives the schema from the incoming operator.
The operator also needs an implementation of the actual operation. Recall that value vectors are strongly typed, leading to over 100 different classes. The operator uses code generation to implement the actual operation. (Code generation avoids the overhead of interpreting the operation.) The generated code is based on a complex system (described elsewhere). The code implements an interface specific to the operator. The operator then delegates to the generated code for each tuple set.
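The deferred-setup-then-delegate pattern described here can be sketched as follows. This is an illustration under loud assumptions: no code is actually generated, and a hand-written lambda chosen by column type stands in for the generated class. RowFilter and SketchFilterOperator are hypothetical names, and a batch is reduced to a list of values from a single column.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical operator-specific interface that the "generated" code implements.
interface RowFilter {
    List<Object> filter(List<Object> batch);
}

class SketchFilterOperator {
    private RowFilter worker;      // built lazily, once the column type is known
    private int setupCount = 0;    // counts how often the implementation was built

    // Stand-in for code generation: pick a type-specific implementation.
    private RowFilter generate(Class<?> columnType) {
        setupCount++;
        if (columnType == Integer.class) {
            // Keep positive integers.
            return batch -> batch.stream()
                    .filter(v -> ((Integer) v) > 0)
                    .collect(Collectors.toList());
        }
        if (columnType == String.class) {
            // Keep non-empty strings.
            return batch -> batch.stream()
                    .filter(v -> !((String) v).isEmpty())
                    .collect(Collectors.toList());
        }
        throw new UnsupportedOperationException("No implementation for " + columnType);
    }

    // Called once per incoming batch; the first call triggers setup.
    List<Object> next(Class<?> columnType, List<Object> batch) {
        if (worker == null) {
            worker = generate(columnType);
        }
        return worker.filter(batch);   // delegate the actual work
    }

    int getSetupCount() {
        return setupCount;
    }

    public static void main(String[] args) {
        SketchFilterOperator op = new SketchFilterOperator();
        System.out.println(op.next(Integer.class, Arrays.<Object>asList(1, -2, 3)));
        System.out.println(op.next(Integer.class, Arrays.<Object>asList(-5, 7)));
        System.out.println(op.getSetupCount());
    }
}
```

The sketch builds the implementation only once and reuses it for later batches. It does not handle a schema that changes between batches.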