[query] prevent sudden unceremonious death of driver JVM (hail-is#14066)
CHANGELOG: Since 0.2.110, `hailctl dataproc` set the heap size of the
driver JVM dangerously high. It is now set to an appropriate level. This
issue manifests in a variety of inscrutable ways, including
`RemoteDisconnectedError` and "socket closed" errors. See issue hail-is#13960 for details.

In Dataproc versions 1.5.74, 2.0.48, and 2.1.0, Dataproc introduced
["memory
protection"](https://cloud.google.com/dataproc/docs/support/troubleshoot-oom-errors#memory_protection)
which is a euphemism for a newly aggressive OOMKiller. When the
OOMKiller kills the JVM driver process, there is no hs_err_pid...log
file, no exceptional log statements, and no clean shutdown of any
sockets. The process is simply SIGTERM'ed and then SIGKILL'ed.

From Hail 0.2.83 through Hail 0.2.109 (released February 2023), Hail was
pinned to Dataproc 2.0.44. From Hail 0.2.15 onwards, `hailctl dataproc`,
by default, reserves 80% of the advertised memory of the driver node for
the use of the Hail Query Driver JVM process. For example, Google
advertises that an n1-highmem-8 has 52 GiB of RAM, so Hail sets the
`spark:spark.driver.memory` property to 41g (we always round down).
Before aggressive memory protection, this setting was sufficient to
protect the driver from starving itself of memory.
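
In the old scheme, the heap size was just a rounded-down fraction of the advertised RAM. A minimal sketch of that computation (the line removed in the `start.py` diff below does exactly this; the `MACHINE_MEM` literal here is inlined for illustration, using GCE's advertised GiB figures):

```python
# Sketch of the pre-0.2.110 behaviour: take a fraction of the *advertised*
# memory and round down. In hailctl, MACHINE_MEM is a larger table mapping
# machine types to advertised GiB; two entries suffice here.
MACHINE_MEM = {'n1-highmem-8': 52, 'n1-highmem-16': 104}

def old_jvm_heap_size_gib(machine_type: str, memory_fraction: float = 0.8) -> int:
    return int(MACHINE_MEM[machine_type] * memory_fraction)

print(old_jvm_heap_size_gib('n1-highmem-8'))  # 41, i.e. spark.driver.memory=41g
```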

Unfortunately, Hail 0.2.110 upgraded to Dataproc 2.1.2 which enabled
"memory protection". Moreover, in the years since Hail 0.2.15, the
memory in use by system processes on Dataproc driver nodes appears to
have increased. Due to these two circumstances, the driver VM's memory
usage can grow high enough to trigger the OOMKiller before the JVM
triggers a GC. Consider, for example, these slices of the syslog of the
n1-highmem-8 driver VM of a Dataproc cluster:

```
Nov 22 14:26:51 vds-cluster-91f3f4c1-b737-m earlyoom[4115]: earlyoom v1.6.2
Nov 22 14:26:51 vds-cluster-91f3f4c1-b737-m earlyoom[4115]: mem total: 52223 MiB, swap total:    0 MiB
Nov 22 14:26:51 vds-cluster-91f3f4c1-b737-m earlyoom[4115]: sending SIGTERM when mem <=  0.12% and swap <=  1.00%,
Nov 22 14:26:51 vds-cluster-91f3f4c1-b737-m earlyoom[4115]:         SIGKILL when mem <=  0.06% and swap <=  0.50%
...
Nov 22 14:30:05 vds-cluster-91f3f4c1-b737-m post-hdfs-startup-script[7747]: + echo 'All done'
Nov 22 14:30:05 vds-cluster-91f3f4c1-b737-m post-hdfs-startup-script[7747]: All done
Nov 22 14:30:06 vds-cluster-91f3f4c1-b737-m earlyoom[4115]: mem avail: 42760 of 52223 MiB (81.88%), swap free:    0 of    0 MiB ( 0.00%)
```

Notice:

1. The total memory available on the machine is less than 52 GiB (=
53,248 MiB); indeed, it is a full 1,025 MiB below the advertised amount.

2. Once all the components of the Dataproc cluster have started (but
before any Hail Query jobs are submitted), the total memory available is
already depleted to 42,760 MiB. Recall that Hail allocates 41 GiB (=
41,984 MiB) to its JVM. This leaves the Python process and all other
daemons on the system only 776 MiB of excess RAM (see the short check
after this list). For reference, `python3 -c 'import hail'` alone needs 206 MiB.
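
A quick back-of-the-envelope check of those syslog numbers (not code from this PR, just the arithmetic):

```python
# Recompute the margins quoted above for an n1-highmem-8 driver VM.
advertised_mib = 52 * 1024           # 53,248 MiB advertised by Google
total_mib = 52223                    # "mem total" reported by earlyoom
avail_after_startup_mib = 42760      # "mem avail" once the cluster is up
jvm_heap_mib = 41 * 1024             # 41,984 MiB handed to the driver JVM

print(advertised_mib - total_mib)               # 1025 MiB that never materializes
print(avail_after_startup_mib - jvm_heap_mib)   # 776 MiB left for Python and daemons
```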

This PR modifies `hailctl dataproc start` and the meaning of
`--master-memory-fraction`. Now, `--master-memory-fraction` is the
fraction of the memory actually available to the master node, that is,
after accounting for the missing 1 GiB and the memory consumed by system
daemons. We also increase the default memory fraction from 80% to 90%.

For an n1-highmem-8, the driver has 36 GiB instead of 41 GiB. An
n1-highmem-16 is unchanged at 83 GiB.
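
Concretely, the new `jvm_heap_size_gib` helper added in `start.py` (see the diff below) subtracts the missing gibibyte and the daemons' share before applying the fraction. A self-contained sketch of the resulting numbers:

```python
MACHINE_MEM = {'n1-highmem-8': 52, 'n1-highmem-16': 104}  # advertised GiB (GCE figures)

def jvm_heap_size_gib(machine_type: str, memory_fraction: float) -> int:
    # Drop the ~1 GiB GCE never provides plus ~10 GiB for system daemons,
    # then take the configured fraction of what actually remains.
    actual_available_memory_gib = MACHINE_MEM[machine_type] - 11
    return int(actual_available_memory_gib * memory_fraction)

print(jvm_heap_size_gib('n1-highmem-8', 0.9))   # 36, was 41
print(jvm_heap_size_gib('n1-highmem-16', 0.9))  # 83, unchanged
```
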
danking authored Dec 4, 2023
1 parent 6f5c4fd commit 42930ec
Showing 2 changed files with 10 additions and 4 deletions.

hail/python/hailtop/hailctl/dataproc/cli.py (1 addition & 1 deletion)

```diff
@@ -84,7 +84,7 @@ def start(
         Opt(
             help='Fraction of master memory allocated to the JVM. Use a smaller value to reserve more memory for Python.'
         ),
-    ] = 0.8,
+    ] = 0.9,
     master_boot_disk_size: Ann[int, Opt(help='Disk size of master machine, in GB')] = 100,
     num_master_local_ssds: Ann[int, Opt(help='Number of local SSDs to attach to the master machine.')] = 0,
     num_secondary_workers: NumSecondaryWorkersOption = 0,
```

hail/python/hailtop/hailctl/dataproc/start.py (9 additions & 3 deletions)

```diff
@@ -307,12 +307,18 @@ def disk_size(size):
         size = max(size, 200)
         return str(size) + 'GB'
 
+    def jvm_heap_size_gib(machine_type: str, memory_fraction: float) -> int:
+        advertised_memory_gib = MACHINE_MEM[machine_type]
+        # 1. GCE only provides 51 GiB for an n1-highmem-8 (advertised as 52 GiB)
+        # 2. System daemons use ~10 GiB based on syslog "earlyoom" log statements during VM startup
+        actual_available_memory_gib = advertised_memory_gib - 11
+        jvm_heap_size = actual_available_memory_gib * memory_fraction
+        return int(jvm_heap_size)
+
     conf.extend_flag(
         'properties',
         {
-            "spark:spark.driver.memory": "{driver_memory}g".format(
-                driver_memory=str(int(MACHINE_MEM[master_machine_type] * master_memory_fraction))
-            )
+            "spark:spark.driver.memory": f"{jvm_heap_size_gib(master_machine_type, master_memory_fraction)}g"
         },
     )
     conf.flags['master-machine-type'] = master_machine_type
```
