[query] Call-Cache `CollectDistributedArray` (rfc-0000) #12954

ehigham · 2023-05-01T17:26:27Z

Adds the ability to rerun/retry queries from the nearest CollectDistributedArray (CDA) IR site.

Computes a "Semantic Hash" of the top-level IR, which is used to generate a key for each of the constituent CDA calls in a query. The implementation of CDA, `BackendUtils.collectDArray`, uses that key to look up the results of each partition for that call in the execution cache, reuses any it finds, and updates the cache with successful partition computations.
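
A minimal sketch of the lookup/update behaviour described above. The `ExecutionCache` class here is a hypothetical in-memory stand-in (the real cache sits on a filesystem), and `collect` is an illustrative name, not the signature of `BackendUtils.collectDArray`:

```scala
import scala.collection.mutable

object CallCacheSketch {
  // Hypothetical stand-in for the execution cache:
  // call key -> results of the partitions that already succeeded.
  final class ExecutionCache {
    private val entries = mutable.Map.empty[Int, Map[Int, Array[Byte]]]

    def lookup(key: Int): Map[Int, Array[Byte]] =
      entries.getOrElse(key, Map.empty)

    def put(key: Int, results: Map[Int, Array[Byte]]): Unit =
      entries(key) = lookup(key) ++ results
  }

  // Computes only the partitions missing from the cache and records each
  // success immediately, so a later failure does not discard earlier work.
  def collect(cache: ExecutionCache, key: Int, nPartitions: Int)(
    compute: Int => Array[Byte]
  ): IndexedSeq[Array[Byte]] = {
    val cached = cache.lookup(key)
    val missing = (0 until nPartitions).filterNot(cached.contains)
    val fresh = missing.map { i =>
      val bytes = compute(i)
      cache.put(key, Map(i -> bytes))
      i -> bytes
    }.toMap
    val all = cached ++ fresh
    (0 until nPartitions).map(all)
  }
}
```

If `compute` throws on some partition, the partitions that finished before it stay in the cache, so a retry with the same key only recomputes the remainder.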

The staged lower-and-execute model means we don't know ahead of time how many CDA calls will be generated. We therefore treat the "Semantic Hash" much like an RNG state variable and generate a new key from it every time we encounter a CDA.
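
To illustrate the "RNG state" analogy, here is a rough sketch that threads a hash through successive CDA encounters. `CallKeys`, `next`, `keys` and the `Step` constant are hypothetical names chosen for this example; only the use of MurmurHash3 is taken from the commit history, and the actual mixing scheme in the PR may differ:

```scala
import scala.util.hashing.MurmurHash3

object CallKeys {
  // Arbitrary constant mixed in at every step; purely illustrative.
  private val Step = 1

  // Given the current state, returns (key for this CDA call, next state).
  def next(state: Int): (Int, Int) = {
    val advanced = MurmurHash3.mix(state, Step)
    (MurmurHash3.finalizeHash(advanced, 1), advanced)
  }

  // The first `n` keys derived from a query's semantic hash.
  def keys(semanticHash: Int, n: Int): List[Int] =
    Iterator
      .iterate(next(semanticHash))(p => next(p._2))
      .map(_._1)
      .take(n)
      .toList
}
```

Because the sequence depends only on the starting hash, rerunning the same query yields the same key for the first CDA, the second CDA, and so on, without knowing the total number of calls up front.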

The execution cache is implemented on top of a local or remote filesystem (configurable via the `HAIL_CACHE_DIR` environment variable). It defaults to `{tmpdir}/hail/{pip-version}`.
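
As a sketch of that resolution rule (the `CacheDir.resolve` helper and its parameters are assumptions made for illustration, not code from this PR):

```scala
object CacheDir {
  // Prefer HAIL_CACHE_DIR when set; otherwise fall back to
  // {tmpdir}/hail/{pip-version}. The environment is passed in explicitly
  // so the rule is easy to exercise without touching the real process env.
  def resolve(env: Map[String, String], tmpdir: String, pipVersion: String): String =
    env.getOrElse("HAIL_CACHE_DIR", s"$tmpdir/hail/$pipVersion")
}
```

In a real process one would presumably pass `sys.env` as the first argument.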

Return `(Option[Throwable], IndexedSeq[(Int, Array[Byte])])`, where:
- `Option[Throwable]`: exception that was raised while computing partitions
- `IndexedSeq[(Int, Array[Byte])]`: partition index -> result
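
A small sketch of a function with that return shape, assuming partitions are run one after another in-process. `PartialResults.run` is a hypothetical name and is not the refactored `parallelizeAndComputeWithIndex`:

```scala
import scala.util.{Failure, Success, Try}

object PartialResults {
  // Runs `compute` on each partition index, keeping every success
  // together with the first failure (if any).
  def run(partitions: Seq[Int])(
    compute: Int => Array[Byte]
  ): (Option[Throwable], IndexedSeq[(Int, Array[Byte])]) = {
    val attempts = partitions.map(i => i -> Try(compute(i)))
    val failure = attempts.collectFirst { case (_, Failure(e)) => e }
    val successes =
      attempts.collect { case (i, Success(bytes)) => (i, bytes) }.toIndexedSeq
    (failure, successes)
  }
}
```

Returning the successes alongside the exception lets the caller cache what did complete before rethrowing.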

Read type of execution cache from flags


ehigham · 2023-07-31T19:02:01Z

hail/src/main/scala/is/hail/expr/ir/analyses/SemanticHash.scala

+  def getFileHash(fs: FS)(path: String): Array[Byte] =
+    fs.eTag(path) match {
+      case Some(etag) =>
+        etag.getBytes
+      case None =>
+        path.getBytes ++ Bytes.fromLong(fs.fileStatus(path).getModificationTime)
+    }


Something to call out here:
I'm using only the file's etag when the filesystem supports etags, and NOT the path.
I think the etag is unique on Azure (their docs are quite hard to navigate, so I can't say for sure; help welcome!).
GCS has the nice property that copying a file preserves its etag, which only changes after modification.

patrick-schultz

Overall, this is really great work! I have some small requests, some of which we discussed last week. I haven't finished looking at all the hash cases yet, or the tests. With the hash cases, in general I think by default anything that we print, we should include in the hash. In some cases it might not be necessary, but I'd rather be conservative.

hail/src/main/scala/is/hail/backend/BackendUtils.scala

patrick-schultz · 2023-08-02T19:04:25Z

hail/src/main/scala/is/hail/backend/service/ServiceBackend.scala

      }
-      jobs(i) = JObject(
+
+      JObject(
        "always_run" -> JBool(false),
        "job_id" -> JInt(i + 1),


@danking This is using job_ids from the original job, while n is the number of partitions currently being retried. So it's possible to have job_id >= n. Could that cause any issues?

hail/src/main/scala/is/hail/expr/ir/analyses/SemanticHash.scala

hail/src/main/scala/is/hail/backend/BackendUtils.scala

hail/src/main/scala/is/hail/backend/ExecuteContext.scala

hail/src/main/scala/is/hail/expr/ir/analyses/SemanticHash.scala

hail/src/main/scala/is/hail/utils/package.scala


ehigham · 2023-08-11T21:22:50Z

hail/src/main/scala/is/hail/expr/ir/analyses/SemanticHash.scala

+  object CodeGenSupport {
+    def lift(hash: SemanticHash.Type): Option[SemanticHash.Type] =
+      Some(hash)
+  }


I was having a horrible time trying to generate code to lift an Int into an Option and kept getting NoSuchMethodErrors when calling the constructor of Some or Option.apply, presumably because Option is parameterised and thus needs a reference type in its JVM implementation.
I rage quit and wrote this instead.


patrick-schultz

I'm feeling good about merging this as an experimental flag, disabled by default. Just one last comment.

hail/src/main/scala/is/hail/utils/richUtils/RichVal.scala


removed richval


ehigham force-pushed the ehigham/call-cache-collect-distributed-array branch 2 times, most recently from 6affd38 to b0aee67 Compare May 3, 2023 18:34

ehigham added 6 commits May 3, 2023 16:04

[query] Call-Cache CollectDistributedArray (rfc-0000)

8cea3f1

Refactor parallelizeAndComputeWithIndex to support caching successes

ca706de

Return `(Option[Throwable], IndexedSeq[(Int, Array[Byte])])`, where `Option[Throwable]`: exception that was raised while computing partitions; `IndexedSeq[(Int, Array[Byte])]`: partition index -> result

Lookup and Put collectDArray results in the ExecutionCache

ebc1f2e

staticComponent <> Hash(dynamicId)

de0da5d

compute semhash in emitcontext

76509e3

base64 encode FSCache entries. Set backend ctx

e332df7

ehigham force-pushed the ehigham/call-cache-collect-distributed-array branch from b0aee67 to e332df7 Compare May 3, 2023 21:02

ehigham added 21 commits May 3, 2023 17:34

renumber successful results

74a7ea3

more semhash definitions

8eab6ee

Expose use_fast_restarts flag to python

15b4321

Read type of execution cache from flags

restore check for if backend can execute tasks

eab7ee1

more semhash definitions

d76f5de

Use MurmurHash3 as semantic hash algorithm

b2100d1

optimize imports

8cc7ef4

remove randomness from partition writers

fbd68dc

get pc-relate to work

4544553

Merge remote-tracking branch 'upstream/main' into ehigham/call-cache-…

b94dac8

…collect-distributed-array

Get or compute file checksums

68b4728

semhash ignores struct field names

e2810cc

Merge remote-tracking branch 'upstream/main' into ehigham/call-cache-…

ea006d5

…collect-distributed-array

private[this] for fields

d284e2e

thread semhash through lowering pipelines

4c3e549

semhash top ir node only

116edeb

concordance benchmark runs in 5 seconds with call caching

22ee6ca

compute semhash in constant space

c41b187

moar hashes

3a6d3ac

annotate ExprSemanticHash in lowering pass (unused)

ddb930e

add semhash to ExecuteContext

f7f164a

ehigham added 9 commits July 27, 2023 16:36

encode ir as byte string then hash

8827f22

no need to use streams

ba614c1

detect ir handedness

939b9b1

tidy up

092268f

add dev-doc, thinned version of rfc

607c431

fix build failures

3b44da7

lets not use products of primes

8edca7e

fix emit fail

fb7273b

use Int, not BigInt

7c3b934

ehigham commented Jul 31, 2023


patrick-schultz reviewed Aug 7, 2023


ehigham added 6 commits August 7, 2023 17:51

first round of suggestions from @patrick-schultz

d1102cb

im an idiot

43b9f5b

wip

491558e

Merge remote-tracking branch 'upstream/main' into ehigham/call-cache-…

1733dae

…collect-distributed-array

simplify + add tests related to let-bound names

0681437

test tree structure

de607de

ehigham requested a review from patrick-schultz August 11, 2023 17:50

emit is horrid

81613dc

ehigham commented Aug 11, 2023


Merge remote-tracking branch 'upstream/main' into ehigham/call-cache-…

c2a53e4

…collect-distributed-array

patrick-schultz previously requested changes Sep 8, 2023


hail/src/main/scala/is/hail/utils/richUtils/RichVal.scala

ehigham added 2 commits September 13, 2023 10:37

Merge remote-tracking branch 'upstream/main' into ehigham/call-cache-…

db464c4

…collect-distributed-array

remove richval

854f4fa

ehigham requested a review from patrick-schultz September 13, 2023 14:57

patrick-schultz approved these changes Sep 15, 2023


Merge remote-tracking branch 'upstream/main' into ehigham/call-cache-…

254f81f

…collect-distributed-array

danking merged commit 7a7bf28 into hail-is:main Sep 15, 2023

ehigham deleted the ehigham/call-cache-collect-distributed-array branch September 15, 2023 21:31
