[query] eliminate optimization that can blow RAM #13619

Merged

danking
Contributor

@danking danking commented Sep 13, 2023

CHANGELOG: On some pipelines, since at least 0.2.58 (commit 23813af), Hail could use essentially unbounded amounts of memory. This change removes "optimization" rules that accidentally caused that.

Closes #13606

Collaborator

@patrick-schultz patrick-schultz left a comment

These optimizations are too important to just disable. I think a better quickfix is to do

    case ToStream(ToArray(s), false) if s.typ.isInstanceOf[TStream] => s

    case ToStream(Let(name, value, ToArray(x)), false) if x.typ.isInstanceOf[TStream] =>
      Let(name, value, x)

The right fix is probably to add a requiresMemoryManagementPerElement flag to StreamMap and then use it to make these rules smarter. Such a flag could be generally useful when the producer doesn't care about memory management but the map body allocates a lot and wants to free after each row:

    case ToStream(ToArray(s), false) if s.typ.isInstanceOf[TStream] => s
    case ToStream(ToArray(s), true) if s.typ.isInstanceOf[TStream] =>
      StreamMap(s, uid, Ref(uid), requiresMemoryManagementPerElement = true)

    case ToStream(Let(name, value, ToArray(x)), false) if x.typ.isInstanceOf[TStream] =>
      Let(name, value, x)
    case ToStream(Let(name, value, ToArray(x)), true) if x.typ.isInstanceOf[TStream] =>
      Let(name, value, StreamMap(x, uid, Ref(uid), requiresMemoryManagementPerElement = true))
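
For readers who don't have the Scala simplifier in their head, the guarded quickfix rules can be modelled on a toy IR. This is only a sketch under stated assumptions: the `Stream`, `ToArray`, `ToStream`, and `Let` classes and the `simplify` function below are hypothetical stand-ins written for illustration, not Hail's actual IR or `Simplify` pass. The point it shows is that the rewrite fires only when the memory-management flag is false and leaves the node alone otherwise.

```python
from dataclasses import dataclass


# Toy IR nodes. The names mirror the Scala snippets, but these classes are
# hypothetical and exist only for this illustration.
@dataclass(frozen=True)
class Stream:
    """Stands in for any IR node whose type is a stream."""
    name: str


@dataclass(frozen=True)
class ToArray:
    child: object


@dataclass(frozen=True)
class ToStream:
    child: object
    requires_memory_management: bool = False


@dataclass(frozen=True)
class Let:
    name: str
    value: object
    body: object


def simplify(ir):
    """Apply the two guarded rules once, at the root of `ir`."""
    if isinstance(ir, ToStream) and not ir.requires_memory_management:
        child = ir.child
        # ToStream(ToArray(s), false) => s
        if isinstance(child, ToArray) and isinstance(child.child, Stream):
            return child.child
        # ToStream(Let(name, value, ToArray(x)), false) => Let(name, value, x)
        if (isinstance(child, Let)
                and isinstance(child.body, ToArray)
                and isinstance(child.body.child, Stream)):
            return Let(child.name, child.value, child.body.child)
    # Either the flag is set or no rule matched: leave the node unchanged.
    return ir


s = Stream("s")
unmanaged = simplify(ToStream(ToArray(s), False))            # rewritten to s
managed = simplify(ToStream(ToArray(s), True))               # left as-is
let_case = simplify(ToStream(Let("x", 1, ToArray(s)), False))
```

With the flag set to true the `ToStream` survives, so whatever per-element memory management it requested is not silently dropped.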

I still don't fully understand why these rules were deoptimizing the gnomAD pipeline.

@danking
Contributor Author

danking commented Sep 14, 2023

FWIW, I finally found a simpler reproducer. It really takes some doing to convince the simplifier to apply this rule.

This operation should use a constant amount of RAM (nominally ~1GiB; in practice a non-broken pipeline uses closer to 8GiB, but still a constant amount). Instead, memory use grows with each row processed:

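    import hail as hl

    ht = hl.utils.range_table(1)
    ht = ht.key_by()  # drop the key so the simplifier can apply the rule
    ht = ht.select(rows=hl.range(10))
    ht = ht.explode('rows')
    # Each of the 10 exploded rows builds a 2**30-element array.
    ht = ht.annotate(garbage=hl.range(1024 ** 3))
    ht.write('/tmp/foo.ht', overwrite=True)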

The simplifier cannot apply the rule while the key is still present, so this pipeline, which omits key_by(), is sufficient to restore normal memory usage:

    import hail as hl

    ht = hl.utils.range_table(1)
    # The key is left in place here, so the rule is not applied.
    ht = ht.select(rows=hl.range(10))
    ht = ht.explode('rows')
    ht = ht.annotate(garbage=hl.range(1024 ** 3))
    ht.write('/tmp/foo.ht', overwrite=True)

The "bad" WritePartition body IR looks like this:

(StreamFlatMap __iruid_447
  (StreamRange -1 True
    (GetField start (Ref __iruid_446))
    (GetField end (Ref __iruid_446))
    (I32 1))
  (StreamMap __iruid_448
    (StreamRange 1 False (I32 0) (I32 10) (I32 1))
    (InsertFields
      (Literal Struct{} <literal value>)
      ("rows" "garbage")
      (rows (Ref __iruid_448))
      (garbage
        (ToArray
          (StreamRange 2 False
            (I32 0)
            (I32 1073741824)
            (I32 1)))))))

The "good" IR looks like this:

(StreamFlatMap __iruid_480
  (StreamRange -1 True
    (GetField start (Ref __iruid_479))
    (GetField end (Ref __iruid_479))
    (I32 1))
  (Let __iruid_481
    (MakeStruct
      (idx (Ref __iruid_480))
      (rows
        (ToArray
          (StreamRange 1 False (I32 0) (I32 10) (I32 1)))))
    (StreamMap __iruid_482
      (ToStream True (GetField rows (Ref __iruid_481)))
      (InsertFields
        (Ref __iruid_481)
        ("idx" "rows" "garbage")
        (rows (Ref __iruid_482))
        (garbage
          (ToArray
            (StreamRange 2 False
              (I32 0)
              (I32 1073741824)
              (I32 1))))))))

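Notice, in particular, that in the "good" IR the StreamMap inside the StreamFlatMap uses memory management because the originating ToStream uses memory management (ToStream True). In the "bad" IR, the StreamMap instead reads directly from a StreamRange marked False, so nothing asks for each row's garbage array to be released before the next row is produced.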
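
A toy model of why that flag matters, assuming region-style allocation in which memory is released only when a region is cleared. `Region` and `peak_bytes` are hypothetical names made up for this sketch; they are not Hail's runtime classes, and the byte counts are arbitrary.

```python
class Region:
    """Toy arena: bytes stay allocated until the region is cleared."""

    def __init__(self):
        self.used = 0

    def allocate(self, nbytes):
        self.used += nbytes

    def clear(self):
        self.used = 0


def peak_bytes(n_rows, bytes_per_row, per_element_management):
    """Map over n_rows rows and return the peak number of live bytes."""
    partition_region = Region()
    peak = 0
    for _ in range(n_rows):
        # With per-element management each row gets its own region;
        # without it, every row allocates into the shared partition region.
        if per_element_management:
            row_region = Region()
        else:
            row_region = partition_region
        row_region.allocate(bytes_per_row)

        live = partition_region.used
        if row_region is not partition_region:
            live += row_region.used
        peak = max(peak, live)

        # Only a per-row region can be released once the row is consumed.
        if per_element_management:
            row_region.clear()
    return peak


print(peak_bytes(10, 1024, True))   # prints 1024: constant in the row count
print(peak_bytes(10, 1024, False))  # prints 10240: grows with each row
```

In this model the per-row cost is the same in both cases; the only difference is whether anything is allowed to free a row's allocations before the partition finishes, which matches the constant-versus-growing behaviour described for the two pipelines.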

@danking
Contributor Author

danking commented Sep 14, 2023

Issue for StreamMap memory management: #13623

@danking
Contributor Author

danking commented Sep 14, 2023

@patrick-schultz bump for tomorrow

Collaborator

@patrick-schultz patrick-schultz left a comment

That's a very clear example, thanks!

@danking danking merged commit 43c7597 into hail-is:main Sep 15, 2023
5 checks passed
Successfully merging this pull request may close these issues.

[query] gnomAD pipeline blows RAM if using split_multi