
DRILL-8478. HashPartition memory leak when it allocate memory exception with OutOfMemoryException (#2874) #2875

Merged
1 commit merged into apache:master on Jan 23, 2024

Conversation

@shfshihuafeng (Contributor) commented on Jan 23, 2024

DRILL-8478: HashPartition memory leak when OutOfMemoryException is encountered

Description

https://issues.apache.org/jira/browse/DRILL-8478
When allocating memory for a HashPartition throws an OutOfMemoryException, memory is leaked: because the HashPartition object is never created successfully, it cannot be cleaned up in the closing phase.
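
In short, the fix has two parts. The condensed sketch below summarizes them; the exact hunks appear in the review discussion further down, and the error-message wording was refined during review.

// In the HashPartition constructor: if allocating the initial batch fails,
// release whatever was already allocated and rethrow as a UserException.
  try {
    allocateNewCurrentBatchAndHV();
  } catch (OutOfMemoryException oom) {
    close();
    throw UserException.memoryError(oom)
        .message("Failed to allocate hash partition.")
        .build(logger);
  }

// In the hash join's cleanup(): skip partition slots that were never assigned
// because the constructor above threw before the assignment completed.
  for (HashPartition partn : partitions) {
    if (partn != null) {
      partn.close();
    }
  }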

Documentation

(Please describe user-visible changes similar to what should appear in the Drill documentation.)

Testing

(Please describe how this PR has been tested.)

1. TPC-H test conditions

(1) Script
I ran sql8 (full SQL below) with 20 concurrent sessions, using the following TPC-H test script:

# Stop file: create this file to make the loop exit.
fileName=/data/drill/tpch_sql/1s/shf.txt

random_sql(){
  while true
  do
    num=$((RANDOM%22+1))
    if [ -f "$fileName" ]; then
      echo "$fileName exists"
      exit 0
    else
      # Run the query via sqlline; output is appended to a per-query log.
      $jupiter_home/sqlline -u "jdbc:drill:zk=ip:2181/drill/jupiterbits1_performance_test_shf" -f tpch_sql8.sql >> log_wubq/tpch1s_sql${num}.log 2>&1
    fi
  done
}

(2) Parameters

  • ${DRILL_MAX_DIRECT_MEMORY:-"5G"} (max direct memory: 5 GB)
  • Enable allocator debugging: drill.memory.debug.allocator = true (to check for memory leaks)

(3) sql8 (TPC-H query 8)

select
  o_year,
  sum(case when nation = 'CHINA' then volume else 0 end) / sum(volume) as mkt_share
from (
  select
    extract(year from o_orderdate) as o_year,
    l_extendedprice * 1.0 as volume,
    n2.n_name as nation
  from
    hive.tpch1s.part, hive.tpch1s.supplier, hive.tpch1s.lineitem,
    hive.tpch1s.orders, hive.tpch1s.customer, hive.tpch1s.nation n1,
    hive.tpch1s.nation n2, hive.tpch1s.region
  where p_partkey = l_partkey
    and s_suppkey = l_suppkey
    and l_orderkey = o_orderkey
    and o_custkey = c_custkey
    and c_nationkey = n1.n_nationkey
    and n1.n_regionkey = r_regionkey
    and r_name = 'ASIA'
    and s_nationkey = n2.n_nationkey
    and o_orderdate between date '1995-01-01' and date '1996-12-31'
    and p_type = 'LARGE BRUSHED BRASS') as all_nations
group by o_year
order by o_year

2. Test steps

(1) Run the script.
(2) When the log contains the following exception, stop the script:

Caused by: org.apache.drill.exec.exception.OutOfMemoryException: (op:5:1:1:HashJoinPOP) Unable to allocate buffer of size 4096 due to memory limit (41943040). Current allocation: 6166528

(3) Although no SQL is running, direct memory usage is not 0.

[Screenshot: direct memory usage remains non-zero although no queries are running]

(4) Leak information from the log:

2024-01-16 17:28:18,760 [1a59b3e9-6936-1560-e8e5-3ad6a66925b1:frag:5:1] WARN  o.a.drill.exec.memory.BaseAllocator - Closed child allocator[op:5:1:1:HashJoinPOP] on parent allocator[frag:5:1]'s child list.
Allocator(frag:5:1) 5000000/1000000/31035392/40041943040 (res/actual/peak/limit)
  child allocators: 1
    Allocator(op:5:1:1:HashJoinPOP) 1000000/98304/22822912/41943040 (res/actual/peak/limit)
      child allocators: 0
      ledgers: 11
        ledger[193578] allocator: op:5:1:1:HashJoinPOP), isOwning: true, size: 8192, references: 2, life: 4048793488977884..0, allocatorManager: [168316, life: 4048793488974969..0] holds 4 buffers.
            DrillBuf[198176], udle: [168315 959..8192]

3. Fixed

I ran the same test with the fixed code. With no SQL running, direct memory is 0 and there are no leak entries like those above in the log.

@paul-rogers (Contributor) left a comment

Very nice work! It is not at all easy to track down these issues.

It would be hard to add a unit test for this case. In your "Testing" section, please explain the test you ran. You have a query that failed with a memory leak. If you run that same query with this fix, it will still fail. But, after the fix, did the memory leak disappear?

  } catch (OutOfMemoryException oom) {
    close();
    throw UserException.memoryError(oom)
        .message("OutOfMemory while allocate memory for hash partition.")

@paul-rogers (Contributor) commented on Jan 23, 2024

Suggested: "Failed to allocate hash partition."

The memoryError() already indicates that it is an OOM error.

@shfshihuafeng (Contributor, Author) replied

I resubmitted the PR and added the test steps.

@@ -1312,7 +1313,9 @@ private void cleanup() {
     }
     // clean (and deallocate) each partition, and delete its spill file
     for (HashPartition partn : partitions) {
-      partn.close();
+      if (Objects.nonNull(partn)) {

A reviewer (Contributor) commented:

Simpler: if (partn != null) {

@jnturton added the bug and backport-to-stable (This bug fix is applicable to the latest stable release and should be considered for inclusion there) labels on Jan 23, 2024

@jnturton (Contributor) left a comment

Thanks for the fix and for making the changes.

@jnturton (Contributor) commented:

An unused import crept in, could you remove it please?

@shfshihuafeng (Contributor, Author) replied:

> An unused import crept in, could you remove it please?

Removed it.

@jnturton jnturton merged commit 051d059 into apache:master Jan 23, 2024
7 of 8 checks passed
jnturton pushed a commit to jnturton/drill that referenced this pull request Jan 23, 2024

@paul-rogers (Contributor) left a comment

Thanks for the fixes. Looks good.

I looked more closely at the surrounding code, and have two questions. We don't have to answer them, but it would be good to better understand the source of the error this PR fixes.

I see this has already been merged, perhaps a bit too quickly. The questions above should be researched at some future point if anyone messes with this operator again.

if (numPartitions > 1) {
allocateNewCurrentBatchAndHV();
} catch (OutOfMemoryException oom) {
close();

A reviewer (Contributor) commented:

This call is probably fine. However, the design is that if any operator fails, the entire operator stack is closed. So, close() should be called by the fragment executor. There is probably no harm in calling close() here, as long as the close() method is safe to call twice.

If the fragment executor does not call close when the failure occurs during setup, then there is a bug since failing to call close() results in just this kind of error.
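
A minimal sketch of the "safe to call twice" behavior described here (illustrative only; not the actual HashPartition code):

class IdempotentCloseExample implements AutoCloseable {
  private boolean closed;

  @Override
  public void close() {
    if (closed) {
      return;              // a second call, e.g. from the fragment executor, is a no-op
    }
    closed = true;
    // ... release buffers, delete spill files ...
  }
}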

@shfshihuafeng (Contributor, Author) commented on Jan 24, 2024

1. Fix idea

The design is that when any operator fails, the entire operator stack is closed. However, partitions is an array whose slots are initially null. If a HashPartition object cannot be created successfully, its constructor throws an exception, and every slot from that index onward is left null:

for (int part = 0; part < numPartitions; part++) {
  partitions[part] = new HashPartition(context, allocator, baseHashTable,
      buildBatch, probeBatch, semiJoin, RECORDS_PER_BATCH, spillSet, part,
      spilledState.getCycle(), numPartitions);
}

For example, with a partitions array of length 32 (numPartitions = 32): if the constructor throws at part = 10, the slot at index 10 is never assigned and indexes 11-31 are never reached, so all of them stay null. The object being built at index 10 failed to be created, yet it had already allocated memory.

When close() is called, that partition cannot be closed through the array, because its slot is null.

2. Another fix idea

Instead of throwing the exception and calling close() inside the constructor, we could catch the exception there and still finish constructing the HashPartition object. Then, when close() is called later, the allocated memory can be released through the normal cleanup path.

// 1. Add an isException parameter to the HashPartition constructor
HashPartition(FragmentContext context, BufferAllocator allocator, ChainedHashTable baseHashTable,
              RecordBatch buildBatch, RecordBatch probeBatch, boolean semiJoin,
              int recordsPerBatch, SpillSet spillSet, int partNum, int cycleNum,
              int numPartitions, boolean isException)

// 2. Catch the exception so that the HashPartition object is still created
    } catch (OutOfMemoryException oom) {
      // do not call close() and do not rethrow
      isException = true;
    }

// 3. Deal with the exception in AbstractHashBinaryRecordBatch#initializeBuild
    boolean isException = false;
    try {
      for (int part = 0; part < numPartitions; part++) {
        if (isException) {
          break;
        }
        partitions[part] = new HashPartition(context, allocator, baseHashTable,
            buildBatch, probeBatch, semiJoin, RECORDS_PER_BATCH, spillSet, part,
            spilledState.getCycle(), numPartitions, isException);
      }
    } catch (Exception e) {
      isException = true;
    }
    if (isException) {
      throw UserException.memoryError(exceptions[0])
          .message("Failed to allocate hash partition.")
          .build(logger);
    }

-      partn.close();
+      if (partn != null) {
+        partn.close();
+      }

A reviewer (Contributor) commented:

The above is OK as a work-around. I wonder, however, where the code added a null pointer to the partition list. That should never happen. If it does, it should be fixed at the point where the null pointer is added to the list. Fixing it here is incomplete: there are other places where we loop through the list, and those will also fail.
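
One hypothetical way to follow that suggestion (not what this PR does) would be to collect only fully constructed partitions, so that downstream loops never see a null entry. The variable names below are assumed from the initializeBuild() snippet quoted earlier; java.util.List and java.util.ArrayList imports are assumed.

  // Hypothetical alternative: the collection only ever holds fully constructed partitions.
  List<HashPartition> partitions = new ArrayList<>(numPartitions);
  try {
    for (int part = 0; part < numPartitions; part++) {
      partitions.add(new HashPartition(context, allocator, baseHashTable,
          buildBatch, probeBatch, semiJoin, RECORDS_PER_BATCH, spillSet, part,
          spilledState.getCycle(), numPartitions));
    }
  } catch (OutOfMemoryException oom) {
    // Everything in the list is fully constructed, so it can be closed
    // without null checks before the exception propagates.
    for (HashPartition p : partitions) {
      p.close();
    }
    throw oom;
  }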

@shfshihuafeng (Contributor, Author) replied:

The (partn != null) check is necessary; see the comment above under "1. Fix idea".
