[query] prevent sudden unceremonious death of driver JVM (hail-is#14066)
CHANGELOG: Since 0.2.110, `hailctl dataproc` set the heap size of the
driver JVM dangerously high. It is now set to an appropriate level. This
issue manifests in a variety of inscrutable ways, including
`RemoteDisconnectedError` and "socket closed" errors. See issue hail-is#13960 for details.

In Dataproc versions 1.5.74, 2.0.48, and 2.1.0, Dataproc introduced
["memory
protection"](https://cloud.google.com/dataproc/docs/support/troubleshoot-oom-errors#memory_protection)
which is a euphemism for a newly aggressive OOMKiller. When the
OOMKiller kills the JVM driver process, there is no hs_err_pid...log
file, no exceptional log statements, and no clean shutdown of any
sockets. The process is simply SIGTERM'ed and then SIGKILL'ed.

From Hail 0.2.83 through Hail 0.2.109 (released February 2023), Hail was
pinned to Dataproc 2.0.44. From Hail 0.2.15 onwards, `hailctl dataproc`,
by default, reserves 80% of the advertised memory of the driver node for
the use of the Hail Query Driver JVM process. For example, Google
advertises that an n1-highmem-8 has 52 GiB of RAM, so Hail sets the
`spark:spark.driver.memory` property to 41g (we always round down).
Before aggressive memory protection, this setting was sufficient to
protect the driver from starving itself of memory.
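
In the old scheme, the heap size was just a rounded-down fraction of the advertised RAM. A minimal sketch of that computation (the line removed in the `start.py` diff below does exactly this; the `MACHINE_MEM` literal here is inlined for illustration, using GCE's advertised GiB figures):

```python
# Sketch of the pre-0.2.110 behaviour: take a fraction of the *advertised*
# memory and round down. In hailctl, MACHINE_MEM is a larger table mapping
# machine types to advertised GiB; two entries suffice here.
MACHINE_MEM = {'n1-highmem-8': 52, 'n1-highmem-16': 104}

def old_jvm_heap_size_gib(machine_type: str, memory_fraction: float = 0.8) -> int:
    return int(MACHINE_MEM[machine_type] * memory_fraction)

print(old_jvm_heap_size_gib('n1-highmem-8'))  # 41, i.e. spark.driver.memory=41g
```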

Unfortunately, Hail 0.2.110 upgraded to Dataproc 2.1.2 which enabled
"memory protection". Moreover, in the years since Hail 0.2.15, the
memory in use by system processes on Dataproc driver nodes appears to
have increased. Due to these two circumstances, the driver VM's memory
usage can grow high enough to trigger the OOMKiller before the JVM
triggers a GC. Consider, for example, these slices of the syslog of the
n1-highmem-8 driver VM of a Dataproc cluster:

```
Nov 22 14:26:51 vds-cluster-91f3f4c1-b737-m earlyoom[4115]: earlyoom v1.6.2
Nov 22 14:26:51 vds-cluster-91f3f4c1-b737-m earlyoom[4115]: mem total: 52223 MiB, swap total:    0 MiB
Nov 22 14:26:51 vds-cluster-91f3f4c1-b737-m earlyoom[4115]: sending SIGTERM when mem <=  0.12% and swap <=  1.00%,
Nov 22 14:26:51 vds-cluster-91f3f4c1-b737-m earlyoom[4115]:         SIGKILL when mem <=  0.06% and swap <=  0.50%
...
Nov 22 14:30:05 vds-cluster-91f3f4c1-b737-m post-hdfs-startup-script[7747]: + echo 'All done'
Nov 22 14:30:05 vds-cluster-91f3f4c1-b737-m post-hdfs-startup-script[7747]: All done
Nov 22 14:30:06 vds-cluster-91f3f4c1-b737-m earlyoom[4115]: mem avail: 42760 of 52223 MiB (81.88%), swap free:    0 of    0 MiB ( 0.00%)
```

Notice:

1. The total memory available on the machine is less than 52 GiB (=
53,248 MiB); indeed, it is a full 1,025 MiB below the advertised amount.

2. Once all the components of the Dataproc cluster have started (but
before any Hail Query jobs are submitted), the total memory available is
already depleted to 42,760 MiB. Recall that Hail allocates 41 GiB (=
41,984 MiB) to its JVM. This leaves the Python process and all other
daemons on the system only 776 MiB of excess RAM (see the short check
after this list). For reference, `python3 -c 'import hail'` alone needs 206 MiB.
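
A quick back-of-the-envelope check of those syslog numbers (not code from this PR, just the arithmetic):

```python
# Recompute the margins quoted above for an n1-highmem-8 driver VM.
advertised_mib = 52 * 1024           # 53,248 MiB advertised by Google
total_mib = 52223                    # "mem total" reported by earlyoom
avail_after_startup_mib = 42760      # "mem avail" once the cluster is up
jvm_heap_mib = 41 * 1024             # 41,984 MiB handed to the driver JVM

print(advertised_mib - total_mib)               # 1025 MiB that never materializes
print(avail_after_startup_mib - jvm_heap_mib)   # 776 MiB left for Python and daemons
```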

This PR modifies `hailctl dataproc start` and the meaning of
`--master-memory-fraction`. Now, `--master-memory-fraction` is the
fraction of the memory actually available to the master node, that is,
after accounting for the missing 1 GiB and the memory consumed by system
daemons. We also increase the default memory fraction from 80% to 90%.

For an n1-highmem-8, the driver has 36 GiB instead of 41 GiB. An
n1-highmem-16 is unchanged at 83 GiB.
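
Concretely, the new `jvm_heap_size_gib` helper added in `start.py` (see the diff below) subtracts the missing gibibyte and the daemons' share before applying the fraction. A self-contained sketch of the resulting numbers:

```python
MACHINE_MEM = {'n1-highmem-8': 52, 'n1-highmem-16': 104}  # advertised GiB (GCE figures)

def jvm_heap_size_gib(machine_type: str, memory_fraction: float) -> int:
    # Drop the ~1 GiB GCE never provides plus ~10 GiB for system daemons,
    # then take the configured fraction of what actually remains.
    actual_available_memory_gib = MACHINE_MEM[machine_type] - 11
    return int(actual_available_memory_gib * memory_fraction)

print(jvm_heap_size_gib('n1-highmem-8', 0.9))   # 36, was 41
print(jvm_heap_size_gib('n1-highmem-16', 0.9))  # 83, unchanged
```
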
danking authored Dec 4, 2023
1 parent 6f5c4fd commit 42930ec
Showing 2 changed files with 10 additions and 4 deletions.

hail/python/hailtop/hailctl/dataproc/cli.py (1 addition & 1 deletion)

```diff
@@ -84,7 +84,7 @@ def start(
         Opt(
             help='Fraction of master memory allocated to the JVM. Use a smaller value to reserve more memory for Python.'
         ),
-    ] = 0.8,
+    ] = 0.9,
     master_boot_disk_size: Ann[int, Opt(help='Disk size of master machine, in GB')] = 100,
     num_master_local_ssds: Ann[int, Opt(help='Number of local SSDs to attach to the master machine.')] = 0,
     num_secondary_workers: NumSecondaryWorkersOption = 0,
```

hail/python/hailtop/hailctl/dataproc/start.py (9 additions & 3 deletions)

```diff
@@ -307,12 +307,18 @@ def disk_size(size):
         size = max(size, 200)
         return str(size) + 'GB'
 
+    def jvm_heap_size_gib(machine_type: str, memory_fraction: float) -> int:
+        advertised_memory_gib = MACHINE_MEM[machine_type]
+        # 1. GCE only provides 51 GiB for an n1-highmem-8 (advertised as 52 GiB)
+        # 2. System daemons use ~10 GiB based on syslog "earlyoom" log statements during VM startup
+        actual_available_memory_gib = advertised_memory_gib - 11
+        jvm_heap_size = actual_available_memory_gib * memory_fraction
+        return int(jvm_heap_size)
+
     conf.extend_flag(
         'properties',
         {
-            "spark:spark.driver.memory": "{driver_memory}g".format(
-                driver_memory=str(int(MACHINE_MEM[master_machine_type] * master_memory_fraction))
-            )
+            "spark:spark.driver.memory": f"{jvm_heap_size_gib(master_machine_type, master_memory_fraction)}g"
         },
     )
     conf.flags['master-machine-type'] = master_machine_type
```
