-
Notifications
You must be signed in to change notification settings - Fork 980
UDF Theory
Drill is designed for blinding speed. The original architects chose very sophisticated and complex mechanisms that, they believed, would provide every extra ounce of performance. UDFs, because they use the same function framework as built-in functions, are no exception.
Here we examine how Drill generates code. This is important because, as we'll see, Drill uses the source code our our UDF to generate code; Drill does not actually call our compiled Java class.
It well known in the database community that optimal query performance is attained when generating the code needed to run a query. Some tools generate the entire query. (Hive is one.) Others generate significant parts of the code for each relational operator (several commercial engines do this.)
Drill follows this pattern and divides query execution into two parts:
- The boilerplate code common to all queries.
- The expression and other code unique to each query.
Suppose we want execute this query:
SELECT a, b + c AS d FROM `foo.json`
Drill will create a scan operator to fetch the data, a project operator to perform the calculation, and a screen operator to send the results to the client. (There may be others, such as exchanges, but we'll skip over those here.) In fact, you can easily see the structure in the query profile portion of the Drill web console.
The scan operator is generally interpreted (it uses only boilerplate code without any generated code.) In the above query, Drill must compute b + c
, which is just one of millions of possible expressions. Interpretation is too slow. So, Drill generates Java code to perform the computation. Specifically, Drill generates the code in the project operator. (That is, we project a new column d
to hold the value of b + c
.)
Let's now focus in on the b + c AS d
expression. Suppose we were to write code by hand to do this work. What would we have to do? Let's start by remembering that Drill stores values as columns in a structure called a value vector. If we simplify quite a bit, value vectors permit two operations:
- Read values from an existing vector.
- Write value to a new vector.
In our project operator code, we must read values from the existing b
and c
vectors, then write their sum to the d
vector which the project operator has created for us. Vectors are indexed by the row index within a batch. So, we might think we can simply write:
d.set(rowIndex, b.get(rowIndex) + c.get(rowIndex))
If only it were that simple! Drill must deal with a large amount of complexity and quite a few special cases. Let's list some:
- Values may be nullable, so we have to consider how to handle nulls.
- The proposed code works for simple Java primitives, but not so well for things like strings.
- Drill has 30+ data types (not all of which are used or even supported).
- Calculations can be much more involved than the simple example above.
Let's tackle these one-by-one.
Drill is a schema-free query engine, which means that each data file may have a different schema. In our query, one file might have columns a
and b
, but not c
(perhaps c
was added later, so older files don't contain it.) However, internally, Drill is strongly typed: Drill imposes a strict schema on data once it is in value vectors. Since Drill cannot predict the future, it does not know if new columns will appear (in, say, a JSON file, or in later files in the same scan.) So, Drill tends to prefer to use nullable types for most values. (In Drill's internal terminology, a nullable type is called the "OPTIONAL
data mode".)
This means we must add logic to our simple example to handle nulls. A very common rule is: "if any input value is null, then the output is also null." This makes sense for addition: 10 + NULL = NULL
. So, let's add the required null handling:
if (b.isNull(rowIndex) || c.isNull(rowIndex)) {
d.setNull(rowIndex);
} else {
d.set(rowIndex, b.get(rowIndex) + c.get(rowIndex));
}
Drill data types divide into two broad categories: fixed-width types (which often correspond to Java primitive types) and variable-width types (which are just a run of bytes of some offset and length.) The code above assumed that columns can be represented as simple Java primitives, such as an int
. (The mapping is presented in detail in the UDF Semantics section.)
Some data types do not have an equivalent Java primitive. Drill's (mostly unsupported) decimal types are a good example: either the code must work with Drill's low-level representation, or must convert the data to the Java BigDecimal
type. While BigDecimal
is convenient, the cost of conversion goes against the "speed at any price" design philosophy, so functions often want to work with the internal representation.
Similarly, variable-width types, such as VARCHAR
, are represented as a byte range stored in direct memory; so there is no Java (heap-based) representation available.
Given all of this, we can't simply assume that every value vector can return or accept its value as a Java primitive. Instead, we need a place to store the (possibly complex) value. Drill calls this place a "holder." So, our code now must:
- Copy data from a vector to a holder in a type-specific manner.
- Perform calculations using holders.
- Copy data from the result holder into the destination value vector.
Some key things to know about holders:
- There is one holder type for each combination of Drill type and cardinality.
- Each holder has a
value
field (for primitives) or individual fields (for structured types). - Nullable holders have an
isSet
field that is set to 0 (NULL
) or 1 (NOT NULL
).
Here, "cardinality" means (number of items.) Drill defines three kinds of cardinality for columns:
- Required: Cardinality of exactly 1. Also called "non-nullable."
- Optional: Cardinality of (0, 1). Also called "nullable."
- Repeated: Cardinality of (0, n). Also called "repeated" or "array."
Since we are querying JSON, our generated code will use Drill's NullableBigIntHolder
. Here is a simplified view of that class:
public final class NullableBigIntHolder implements ValueHolder{
public int isSet;
public long value;
So, we don't just get a Java long
when we ask for the value of the b
and c
vectors, we get a holder. Then we retieve the actual long
value from the value
field. Simple enough. Given this, we can now write a simplified version of the code we must generate:
NullableBitIntVector bVector = ...;
NullableBitIntVector cVector = ...;
NullableBitIntVector dVector = ...;
NullableBigIntHolder bHolder = new NullableBigIntHolder();
NullableBigIntHolder cHolder = new NullableBigIntHolder();
NullableBigIntHolder dHolder = new NullableBigIntHolder();
bVector.get(rowIndex, bHolder);
cVector.get(rowIndex, cHolder);
if (bHolder.isSet == 0 || cHolder.isSet == 0) {
dHolder.isSet = 0;
else {
dHolder.isSet = 1;
dHolder.value = bHolder.value + cHolder.value;
}
dVector.set(rowIndex, dHolder);
(While methods exist in Drill to get/set vector values using holders, they are not used in the generated code for reasons explained in the Understanding UDF Holder Scalar Replacement section.)
The above is beginning to look like the actual generated code. We just need to understand a few more details of how Drill works.
- Vectors don't actually provide
get()
methods as shown above, instead they provide anAccessor
class that provides the methods. - Similarly, vectors don't provide a
set()
method. Instead they provide aMutator
that provides the method. - The setup of the vectors is a one-time event. So, we can divide the work into a one-time
setup()
method and a per-roweval()
method. - We've shown the addition as using the Java
+
operator. But, in general, Drill uses its own functions so that it can handle not just Java primitives, but also decimal types, date/time types and so on.
So, let's take yet another pass, including these details.
public class ProjectorGen {
NullableBitIntVector bVector;
NullableBitIntVector cVector;
NullableBitIntVector dVector;
public void setup() {
bVector = ...;
cVector = ...;
dVector = ...;
}
public void eval(int rowIndex) {
NullableBigIntHolder bHolder = new NullableBigIntHolder();
NullableBigIntHolder cHolder = new NullableBigIntHolder();
NullableBigIntHolder dHolder = new NullableBigIntHolder();
bVector.getAccessor().get(rowIndex, bHolder);
cVector.getAccessor().get(rowIndex, cHolder);
if (bHolder.isSet == 0 || cHolder.isSet == 0) {
dHolder.isSet = 0;
else {
dHolder.isSet = 1;
dHolder.value = bigIntBigIntAdd(bHolder.value, cHolder.value);
}
dVector.getMutator().set(rowIndex, dHolder);
}
By now you might want to see the actual generated code to compare it with our simplified version. Drill provides a number of ways for you to capture and examine the generated code. The easiest is simply to turn on DEBUG
level logging as explained in View the Generated Code for a UDF. The code will be written to the Drill log file. For the project operator, look for class ProjectorGen
. For your convenience, the code for our test case is presented on that page as well.
Recall that Drill's primary goal is performance. In our latest example, we introduce a function that does the addition: bigIntBigIntAdd
. The function, basically, looks like this:
public long BigIntBigIntAdd(long x, long y) {
return x + y;
}
The Drill architects were concerned about the cost of an additional Java function call in the per-row execution path. (This concern is likely ill-founded; modern JVM's are quite good at simple method calls and inlining. But, the point is, the architects believed they could generate code which is faster than what the JVM could produce.)
To avoid the cost of the function call, we have to inline the function. But, how can we do that? The answer turned out to be: by writing our own version of the JVM optimizer, but one that works at a level above byte codes.
First, to avoid the need to parse the function signature, we need a way to identify the input parameters and the output return value. Here is where Drill defined a DSL based on selected bits of Java. Drill represents the function as a Java class, represents the arguments and return values as fields, and uses annotations to identify the purpose of each:
public static class BigIntBigIntAdd implements DrillSimpleFunc {
@Param BigIntHolder in1;
@Param BigIntHolder in2;
@Output BigIntHolder out;
public void eval() {
out.value = (long) (in1.value + in2.value);
}
}
(The above is a simplified version of the actual Drill function class to add two BIGINT
types.)
We currently have this:
dHolder.value = bigIntBigIntAdd(bHolder.value, cHolder.value);
We need a way to extract the source code from the eval()
method. Drill uses a third-party package for that. Now we have the parameters, the return value, and the body of the eval()
method. We just need to plug them into our generated code in place of the bigIntBigIntAdd()
function:
BigIntHolder in1 = bHolder;
BigIntHolder in2 = cHolder;
BigIntHolder out = dHolder;
{ // Inlined BigIntBigIntAdd
out.value = (long) (in1.value + in2.value);
}
(Since we are interested only in the concepts, the above glosses over the fact that the in1
and bHolder
objects have different cardinality.)
The above is nearly a complete picture, but one loose end remains: how did Drill know to use the BigIntBigIntAdd
class for our expression? The name provides a hint, but surely that is too subtle for Drill?
Also, how did Drill know to handle nulls itself, passing only non-null values to the eval()
method?
Indeed, Drill uses another annotation for the class itself, let's show that:
@FunctionTemplate(name = "add",
scope = FunctionScope.SIMPLE,
nulls = NullHandling.NULL_IF_NULL)
public static class BigIntBigIntAdd implements DrillSimpleFunc {
The key thing to note is that the method name is add
. As we'll discuss elsewhere ((need link)), Drill uses type matching to figure out which add
to use: the one that takes two BIGINT
holders. Similarly, we'll discuss the other annotation properties later as well.
The alert reader who is experienced in Java will have wondered about the use of the holder objects. Isn't it inefficient to create a holder for each parameter and output value for each function for each row? Indeed it is. To address this issue, Drill implements a very sophisticated scalar replacement mechanism. This mechanism takes the compiled generated code as input, then rewrites the Java byte codes to replace the holder objects with simple Java primitives. The details are explained in Understanding UDF Holder Scalar Replacement.
For our purposes here, to ensure that scalar replacement works as designed, follow these two simple rules:
- Do not call methods on the holder objects.
- Do not pass holder objects to other methods.
That is, all references to holders must be in your UDF (only) and must only reference holder member variables.
Suppose you write a query that involves only constants using a UDF (such as this one we'll use as an example later):
SELECT log2(4) FROM (VALUES (1))
Drill will generate code for your function as we've discussed. But, in addition, the planner will also evaluate your function, presumably to determine if it can be replaced with a constant. In this case, the planner will call your actual Java code; not the runtime code generated from your template.
The interpreted environment has slightly different semantics that we'll see later when we [[discuss VARCHAR
arguments|Working with VARCHAR Data in UDFs]].
(The interpreted environment may provide an opportunity to debug your code by stepping through it; something that you cannot do with runtime code generation.)
Drill will detect (via the mechanism just described) expressions which consume only constants and are, therefore, themselves only constants. If Drill detects that your function is used in a constant expression (such as the one shown above), Drill will generate the code for your function in the setup portion of the generated function rather than the eval portion. The semantics of the two are the same. This is just something to keep in mind if you examine the generated code.
Voila! There we have it: the full structure needed to get values from vectors, inline the target function, and write the result to the output vector. It should now be clear why we say that Drill UDFs are not Java, but instead is a Java-based DSL.
If you are familiar with Drill's UDFs, the above should go a long way to explain the magic incantations you've been using. In fact, the above shows that when you write UDFs, you use exactly the same mechanism that Drill uses internally. So, our job is to figure out how to work with that mechanism, rather than fight against it.