DRILL-8478: HashPartition memory leak when memory allocation fails with OutOfMemoryException (#2874) #2875
Conversation
Very nice work! It is not at all easy to track down these issues.
It would be hard to add a unit test for this case. In your "Testing" section, please explain the test you ran. You have a query that failed with a memory leak. If you run that same query with this fix, it will still fail. But, after the fix, did the memory leak disappear?
} catch (OutOfMemoryException oom) {
  close();
  throw UserException.memoryError(oom)
      .message("OutOfMemory while allocate memory for hash partition.")
Suggested: "Failed to allocate hash partition." The memoryError() already indicates that it is an OOM error.
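With that suggestion applied, the catch block would read roughly as follows (a sketch of the suggested wording, not the merged code):

} catch (OutOfMemoryException oom) {
  close();
  throw UserException.memoryError(oom)
      .message("Failed to allocate hash partition.")
      .build(logger);
}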
I resubmitted the PR and added the test steps.
@@ -1312,7 +1313,9 @@ private void cleanup() {
   }
   // clean (and deallocate) each partition, and delete its spill file
   for (HashPartition partn : partitions) {
-    partn.close();
+    if (Objects.nonNull(partn)) {
Simpler: if (partn != null) {
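With that simplification, the cleanup loop would look like this (a sketch combining the diff above with the reviewer's suggestion):

// clean (and deallocate) each partition, and delete its spill file
for (HashPartition partn : partitions) {
  if (partn != null) {
    partn.close();
  }
}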
Force-pushed from 31ca454 to 7b7ba97
Thanks for the fix and for making the changes.
…on with OutOfMemoryException (apache#2874)
Force-pushed from 7b7ba97 to 122cd35
An unused import crept in, could you remove it please?
Removed it.
Thanks for the fixes. Looks good.
I looked more closely at the surrounding code, and have two questions. We don't have to answer them, but it would be good to better understand the source of the error this PR fixes.
I see this has already been merged, perhaps a bit too quickly. The questions above should be researched at some future point if anyone messes with this operator again.
if (numPartitions > 1) {
  allocateNewCurrentBatchAndHV();
} catch (OutOfMemoryException oom) {
  close();
This call is probably fine. However, the design is that if any operator fails, the entire operator stack is closed. So, close() should be called by the fragment executor. There is probably no harm in calling close() here, as long as the close() method is safe to call twice.
If the fragment executor does not call close() when the failure occurs during setup, then there is a bug, since failing to call close() results in just this kind of error.
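For reference, a minimal sketch of a close() that is safe to call twice; the closed flag and releaseResources() are illustrative names, not Drill's actual fields or methods:

// idempotent close(): second and later calls are no-ops
private boolean closed = false;

@Override
public void close() {
  if (closed) {
    return;               // already closed; safe to call again
  }
  closed = true;
  releaseResources();     // free buffers, delete spill files, etc. (assumed helper)
}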
1. Fix idea
The design is that if any operator fails, the entire operator stack is closed. But partitions is an array whose entries are initialized to null. If a HashPartition object is not created successfully, its constructor throws an exception, so the entries after that index stay null.

for (int part = 0; part < numPartitions; part++) {
  partitions[part] = new HashPartition(context, allocator, baseHashTable,
      buildBatch, probeBatch, semiJoin, RECORDS_PER_BATCH, spillSet, part,
      spilledState.getCycle(), numPartitions);
}

For example: the partitions array length is 32 and numPartitions = 32. When the constructor throws at part = 10, partitions[11-31] will be null. The object at index 10 failed during construction but had already allocated memory; when close() is called, that HashPartition cannot be closed, because partitions[10] is null as well. A self-contained illustration follows.
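The sketch below reproduces this failure mode with a stand-in class instead of Drill's HashPartition; all names in it are hypothetical:

class Partition {
  Partition(int index) {
    // simulate the OOM thrown while partition 10 is being constructed
    if (index == 10) {
      throw new IllegalStateException("allocation failed");
    }
  }
  void close() {}
}

public class NullSlotDemo {
  public static void main(String[] args) {
    Partition[] partitions = new Partition[32];
    try {
      for (int part = 0; part < 32; part++) {
        // the assignment is skipped when the constructor throws,
        // so partitions[10..31] stay null
        partitions[part] = new Partition(part);
      }
    } catch (RuntimeException e) {
      for (Partition p : partitions) {
        if (p != null) {   // without this check the cleanup loop would NPE
          p.close();
        }
      }
    }
  }
}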
2. Another fix idea
Do not throw the exception and do not call close() inside the constructor; instead, catch it so the HashPartition object is still created. Then, when close() is called later, the memory can be released.
// 1. add an isException parameter when constructing the HashPartition object
HashPartition(FragmentContext context, BufferAllocator allocator, ChainedHashTable baseHashTable,
    RecordBatch buildBatch, RecordBatch probeBatch, boolean semiJoin,
    int recordsPerBatch, SpillSet spillSet, int partNum, int cycleNum,
    int numPartitions, boolean isException)

// 2. catch the exception in the constructor so the HashPartition object is still created
} catch (OutOfMemoryException oom) {
  // do not call close(), do not rethrow
  isException = true;
}

// 3. deal with the failure in AbstractHashBinaryRecordBatch#initializeBuild
boolean isException = false;
try {
  for (int part = 0; part < numPartitions; part++) {
    if (isException) {
      break;
    }
    partitions[part] = new HashPartition(context, allocator, baseHashTable,
        buildBatch, probeBatch, semiJoin, RECORDS_PER_BATCH, spillSet, part,
        spilledState.getCycle(), numPartitions, isException);
  }
} catch (Exception e) {
  isException = true;
}
if (isException) {
  throw UserException.memoryError(exceptions[0])
      .message("Failed to allocate hash partition.")
      .build(logger);
}
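One caveat with this sketch (an editorial note, not part of the PR): Java passes a boolean by value, so the constructor cannot set the caller's isException flag through a plain boolean parameter; a mutable holder, such as the one-element array that the exceptions[0] reference above hints at, would be needed. A self-contained illustration of that holder idea, with stand-in names throughout:

class Partition {
  Partition(int index, RuntimeException[] exceptions) {
    try {
      if (index == 10) {                 // simulate the OOM
        throw new IllegalStateException("allocation failed");
      }
    } catch (RuntimeException e) {
      exceptions[0] = e;                 // record instead of rethrowing, so
    }                                    // the object is still fully created
  }
  void close() { /* release any buffers already allocated */ }
}

public class HolderDemo {
  public static void main(String[] args) {
    int numPartitions = 32;
    RuntimeException[] exceptions = new RuntimeException[1];
    Partition[] partitions = new Partition[numPartitions];
    for (int part = 0; part < numPartitions; part++) {
      partitions[part] = new Partition(part, exceptions);
      if (exceptions[0] != null) {
        break;                           // stop creating further partitions
      }
    }
    if (exceptions[0] != null) {
      for (Partition p : partitions) {
        if (p != null) {
          p.close();                     // every created partition is closable
        }
      }
      throw exceptions[0];               // surface the original failure
    }
  }
}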
- partn.close();
+ if (partn != null) {
+   partn.close();
+ }
The above is OK as a work-around. I wonder, however, where the code added a null pointer to the partition list. That should never happen. If it does, it should be fixed at the point where the null pointer is added to the list. Fixing it here is incomplete: there are other places where we loop through the list, and those will also fail.
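One hedged way to fix it at the source, as the reviewer suggests, would be to never let a null enter the collection at all, for example by gathering only successfully created partitions (purely illustrative; Drill actually uses a pre-sized array, and Partition is the same stand-in class as in the sketches above, with numPartitions assumed in scope):

List<Partition> partitions = new ArrayList<>();
try {
  for (int part = 0; part < numPartitions; part++) {
    partitions.add(new Partition(part));
  }
} catch (RuntimeException e) {
  for (Partition p : partitions) {
    p.close();   // list holds only live partitions; no null checks needed anywhere
  }
  throw e;
}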
The (partn != null) check is necessary; see the comment above on fix idea 1.
DRILL-8478: HashPartition memory leak when OutOfMemoryException is encountered
Description
https://issues.apache.org/jira/browse/DRILL-8478
When allocating memory for a HashPartition fails with an OutOfMemoryException, it causes a memory leak: because the HashPartition object cannot be created successfully, it cannot be cleaned up in the closing phase.
Documentation
(Please describe user-visible changes similar to what should appear in the Drill documentation.)
Testing
(Please describe how this PR has been tested.)
1. TPCH test condition
(1) Script
I ran sql8 (SQL detail in (3) below) with 20 concurrent executions.
tpch test script
(2) parameter
(3) sql8
select
  o_year,
  sum(case when nation = 'CHINA' then volume else 0 end) / sum(volume) as mkt_share
from (
  select
    extract(year from o_orderdate) as o_year,
    l_extendedprice * 1.0 as volume,
    n2.n_name as nation
  from
    hive.tpch1s.part,
    hive.tpch1s.supplier,
    hive.tpch1s.lineitem,
    hive.tpch1s.orders,
    hive.tpch1s.customer,
    hive.tpch1s.nation n1,
    hive.tpch1s.nation n2,
    hive.tpch1s.region
  where
    p_partkey = l_partkey
    and s_suppkey = l_suppkey
    and l_orderkey = o_orderkey
    and o_custkey = c_custkey
    and c_nationkey = n1.n_nationkey
    and n1.n_regionkey = r_regionkey
    and r_name = 'ASIA'
    and s_nationkey = n2.n_nationkey
    and o_orderdate between date '1995-01-01' and date '1996-12-31'
    and p_type = 'LARGE BRUSHED BRASS'
) as all_nations
group by o_year
order by o_year
2. Test steps
(1) Run the script.
(2) When the log contains the following exception information, stop the script:
Caused by: org.apache.drill.exec.exception.OutOfMemoryException: (op:5:1:1:HashJoinPOP) Unable to allocate buffer of size 4096 due to memory limit (41943040). Current allocation: 6166528
(3) With no SQL running, the allocated memory is still not 0.
(4) Leak info from the log.
3. Fixed