EVF Memory
The Introduction section explained the structure of value vectors. As it turns out, vectors introduce interesting problems when considering memory management.
Drill is a concurrent, distributed SQL query engine. This means that each query runs multiple fragments across many nodes, and that multiple queries run concurrently on the same node. When you run Drill with just a few concurrent queries, against known data, you can manage memory simply by tweaking various memory parameters such as the amount of off-heap memory given to each node.
But, as load and query variety increases, it becomes increasingly important for Drill to manage memory so that each query works within its memory "budget." Resource management is a complex topic; here we simply assume that Drill has assigned a fragment a particular amount of memory that it must stay within. To be concrete, assume that you are writing a scan operator and it must use no more than M MB of memory (where M is, say, 20 or 40.)
The EVF (specifically, the ResultSetLoader mechanism) assists with memory management in three ways:
- Limit overall record batch size to some specified maximum.
- Avoid memory fragmentation due to vectors larger than 16 MB.
- Minimize internal fragmentation of the largest vector.
Suppose you read records that contain just fixed-size fields, such as integers. It is very easy to predict how much memory you should need. Suppose you want to read N rows, and you know that every integer takes 4 bytes, and you have 10 columns. Then you need 4 * 10 * N bytes of memory. Simple enough.
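To make the arithmetic concrete, here is that calculation as a small Java sketch (the variable names are illustrative only, not Drill APIs):

// Hypothetical sizing arithmetic for a batch of fixed-width columns.
int rowCount = 4096;     // N rows
int columnCount = 10;    // 10 INT columns
int bytesPerValue = 4;   // an INT is 4 bytes
long batchBytes = (long) bytesPerValue * columnCount * rowCount;
// 4 * 10 * 4096 = 163,840 bytes: easy to predict up front.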
Suppose, however, that you have a variable-width column (a VarChar). How much memory do you need per value? Typically you just don't know: you haven't yet looked at the data. If you've gathered statistics (stats) on the column, then perhaps you can use the average column width. But suppose you are running the very command that gathers stats, so you don't yet have that number: what do you do?
Drill typically just assumes each column is 50 bytes. This rough-and-ready estimate works, again, when running just a few queries with sufficient memory: there is plenty of "fudge factor" for wider columns. But, in truth, you just don't know.
Since Drill does not know ahead of time the width of (most) columns, and generally does not consider the number of columns, most readers use a rough-and-ready estimate of the number of rows to read: some choose 1000 rows, others 1024 (1K), 4000, or 4096 (4K). The maximum number of records per batch is 64K. Again, all of these values work on a lightly-loaded system.
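Here is what that estimate looks like in practice, again as an illustrative Java sketch (none of these names are Drill APIs):

// Hypothetical "rough-and-ready" sizing: a fixed row count and a guessed width.
int rowCount = 4096;      // chosen blindly, before seeing the data
int columnCount = 10;
int assumedWidth = 50;    // Drill's typical per-column guess, in bytes
long estimate = (long) rowCount * columnCount * assumedWidth;
// About 2 MB: looks harmless, but the guess, not the data, drove the number.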
Suppose that you are reading records that represent blog posts. Columns could be very wide: perhaps 1 MB in size. Suppose we read 4K records per batch. We'll need 4 GB of memory for the data. This seems rather excessive, which suggests that using a fixed record count, independent of data shape, may not be the optimal solution.
Instead, what we really want is to read data until we hit our memory limit M. That is, let memory determine row count, not the other way around. This, then, is a core goal of the ResultSetLoader class within the extended vector framework (EVF). The code tracks the space taken by an input batch and tells the reader to stop adding new records once the memory limit is reached. Because of this limit, "downstream" operators have a much easier task when managing their own memory: they know that incoming batches will be no larger than our limit M.
Drill uses the Netty memory allocator, which is based on a binary-buddy system with 16 MB "slabs." If code requests, say, 3000 bytes, Netty rounds the request up to the next power of 2 (here, 4096 or 4K) and allocates a buffer of that size, which we then use for our value vector. As we write to the vector, the vector code detects when we fill the 4K buffer and moves to the next available size: the next power of 2, or 8K. This continues as we add more data.
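The rounding rule itself is simple. The following Java sketch shows the power-of-2 rounding described above (an illustration, not Netty's actual code):

// Round an allocation request up to the next power of 2.
static long roundToPowerOfTwo(long request) {
  long floor = Long.highestOneBit(request);
  return (floor == request) ? floor : floor << 1;
}
// roundToPowerOfTwo(3000) == 4096; fill that buffer, and the vector doubles to 8192.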
At some point, we have a 16 MB buffer, and want to write even more data. The next size is 32 MB. But, this is larger than Netty's slab size. So, Netty simply bypasses its own free memory list (which is a chain of 16 MB slabs), and requests the 32 MB directly from the OS.
After Drill runs for a while, under heavy load, all of the available system memory may end up on the Netty free list. The next time we request a 32 MB buffer, the request will fail with an Out-of-Memory (OOM) error, even though Netty has lots of free memory (in the form of 16 MB slabs). The technical term for this is "fragmentation."
Many solutions are possible. The solution adopted by the EVF is simply to never let any vector exceed 16 MB in size. (This does mean that the largest string Drill can handle is 16 MB in size when using the EVF.)
We've discussed how Drill allocates memory in power-of-2 buffers. Suppose our reader, like many existing readers, simply reads some fixed number of records, say 1000. The result can be "internal fragmentation": wasted space within a vector. Why?
Vectors double in size as they grow. Because of this, we know that the first half of the vector must be full (or it would not have doubled to its current size). The second half, if we simply fill an arbitrary number of rows, will, on average, be half filled. (Technically, the vector data "payload" size across batches will be uniformly distributed between 1/2x and 1x of the vector size.) This means that, on average, 25% of the vector will be unused space. For example, a vector that has doubled to 16 MB holds between 8 MB and 16 MB of data, 12 MB on average, wasting 4 MB. If we limit our memory use to M MB, then data will account for only 75% of that space.
The EVF attempts to minimize internal fragmentation, at least for the largest vector. The ResultSetLoader will fill the largest vector up to the point where doubling it would exceed either our total memory limit M or the 16 MB per-vector limit. (The other vectors may still average only 75% full, but at least the largest vector is near 100% full.)
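In rough terms, the decision looks like the following Java sketch. This illustrates the rule just described; it is not the EVF's actual implementation, and all of the names are invented for the example:

// Illustration only: may the largest vector double again?
static final long VECTOR_LIMIT = 16 * 1024 * 1024;  // 16 MB per-vector cap

static boolean canDouble(long vectorBytes, long batchBytes, long batchLimit) {
  // Doubling the largest vector allocates vectorBytes more bytes.
  return 2 * vectorBytes <= VECTOR_LIMIT
      && batchBytes + vectorBytes <= batchLimit;
}
// When this returns false, the EVF declares the batch full instead of doubling.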
The EVF enforces all these limits automatically. Without the EVF, you'd have to write code that checks vector allocations, tracks memory usage, and deals with variable-width vectors, repeated vectors and so on. With EVF, you simply write rows into the batch until the EVF says it can't take any more. (In case you were wondering, the EVF takes care of the "overflow row": the one that causes the vector to become overly full. The EVF simply saves that row for the next batch. It is all transparent to your code.)
ResultSetLoader rsl = ...
rsl.startBatch();
RowSetLoader writer = rsl.writer();
while (!writer.isFull()) {
  writer.start();   // begin the next row
  // Write data to the row's column writers
  writer.save();    // commit the row
}
VectorContainer batch = rsl.harvest();  // the completed, size-limited batch
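The limits themselves are set when the loader is created. The sketch below assumes an option builder exposing batchSizeLimit and rowCountLimit methods (named ResultSetOptionBuilder in recent Drill sources); verify the exact class and method names against your Drill version:

// Sketch: configuring the loader's limits at creation time.
// The builder name and its methods are assumptions; check your Drill version.
ResultSetOptionBuilder options = new ResultSetOptionBuilder()
    .batchSizeLimit(8 * 1024 * 1024)   // cap each batch at 8 MB
    .rowCountLimit(4096);              // and at 4096 rows
// The framework then builds the ResultSetLoader from these options.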