Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[query] Call-Cache CollectDistributedArray (rfc-0000) #12954

Merged

Conversation

ehigham
Copy link
Member

@ehigham ehigham commented May 1, 2023

Adds the ability to rerun/retry queries from the nearest CollectDistributedArray (CDA) IR site.

Computes a "Semantic Hash" of the top-level IR which is used to generate a key for the various constituent CDA calls in a query. The implementation for CDA, BackendUtils.collectDArray, uses that key to look into an the execution cache for the results of each partition for that call and uses/updates the cache with successful partition computations.

The nature of the staged- lower and execute model means we don't know how many CDA calls that will be generated ahead of time. Thus we treat the "Semantic Hash" in a similar way to an RNG state variable and generate a key from the Semantic Hash every time every time we encounter a CDA.

The execution cache is implemented on-top of a local or remote filesystem (configurable via the HAIL_CACHE_DIR environment variable). This defaults to {tmpdir}/hail/{pip-version}.

@ehigham ehigham force-pushed the ehigham/call-cache-collect-distributed-array branch 2 times, most recently from 6affd38 to b0aee67 Compare May 3, 2023 18:34
@ehigham ehigham force-pushed the ehigham/call-cache-collect-distributed-array branch from b0aee67 to e332df7 Compare May 3, 2023 21:02
Comment on lines 346 to 352
def getFileHash(fs: FS)(path: String): Array[Byte] =
fs.eTag(path) match {
case Some(etag) =>
etag.getBytes
case None =>
path.getBytes ++ Bytes.fromLong(fs.fileStatus(path).getModificationTime)
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something to call out here:
I'm just using the file's etag if the filesystem supports them and NOT the path.
I think that the etag is unique on azure (their doc is quite hard to navigate so I'm finding it hard to know for sure - welcome help!).
GCS has the nice property that copying the file preserves the etag and it only changes after modification.

Copy link
Collaborator

@patrick-schultz patrick-schultz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, this is really great work! I have some small requests, some of which we discussed last week. I haven't finished looking at all the hash cases yet, or the tests. With the hash cases, in general I think by default anything that we print, we should include in the hash. In some cases it might not be necessary, but I'd rather be conservative.

hail/src/main/scala/is/hail/backend/BackendUtils.scala Outdated Show resolved Hide resolved
}
jobs(i) = JObject(

JObject(
"always_run" -> JBool(false),
"job_id" -> JInt(i + 1),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@danking This is using job_ids from the original job, while n is the number of partitions currently being retried. So it's possible to have job_id >= n. Could that cause any issues?

hail/src/main/scala/is/hail/backend/BackendUtils.scala Outdated Show resolved Hide resolved
hail/src/main/scala/is/hail/backend/ExecuteContext.scala Outdated Show resolved Hide resolved
hail/src/main/scala/is/hail/utils/package.scala Outdated Show resolved Hide resolved
Comment on lines +401 to +404
object CodeGenSupport {
def lift(hash: SemanticHash.Type): Option[SemanticHash.Type] =
Some(hash)
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was having a horrible time trying to generate code to lift an Int into an Option and kept getting NoSuchMethodErrors when trying to call the constructor of Some or call Option.apply, presumably because Option is parameterised and thus needs a reference type in its jvm implementation.
I rage quit and wrote this instead.

Copy link
Collaborator

@patrick-schultz patrick-schultz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm feeling good about merging this as an experimental flag, disabled by default. Just one last comment.

hail/src/main/scala/is/hail/utils/richUtils/RichVal.scala Outdated Show resolved Hide resolved
@danking danking merged commit 7a7bf28 into hail-is:main Sep 15, 2023
@ehigham ehigham deleted the ehigham/call-cache-collect-distributed-array branch September 15, 2023 21:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants