Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[query] Avoid Class A Operations. #13885

Merged
merged 31 commits into from
Jan 9, 2024
Merged

Conversation

danking
Copy link
Contributor

@danking danking commented Oct 23, 2023

CHANGELOG: Hail Query-on-Batch previously used Class A Operations for all interaction with blobs. This change ensures that QoB only uses Class A Operations when necessary.

Inspired by @jigold 's file system improvement campaign, I pursued the avoidance of "list" operations. I anticipate this reduces flakiness in Azure (which is tracked in #13351) and cost in Azure.

I enforced aiotools.fs terminology on hail.fs and Scala:

  1. FileStatus. Metadata about a blob or file. It does not know if a directory exists at this path.

  2. FileListEntry. Metadata from a list operation. It knows if a directory exists at this path.

Variable names were updated to reflect this distinction:

  1. fileStatus / fileStatuses

  2. fle/ fles / fileListEntry / fileListEntries, respectively.

listStatus renamed to listDirectory for clarity.

In both Azure and Google, fileStatus does not use a list operation.

fileListEntry can be used when we must know if a directory exists. I just rewrote this from first principles because:

  1. In neither Google nor Azure did it check if the path was a directory and a file.
  2. In Google, if the directory entry wasn't in the first page, it would fail (NB: there are fifteen non-control characters in ASCII before /, if the page size is 15 or fewer, we'd miss the first entry with a / at the end).
  3. In Azure, we issued both a get and a list.

There are now unit tests for this method.


  1. copyMerge and concatenateFiles previously used O(N_FILES) list operations, they now use O(N_FILES) get operations.
  2. Writers that used exists to check for a _SUCCESS file now use a get operation.
  3. Index readers, import BGEN, and import plink all now check file size with a get operation.

That said, overall, the bulk of our Class A Operations are probably writes.

CHANGELOG: Hail Query-on-Batch previously used Class A Operations for all interaction with blobs. This change ensures that QoB only uses Class A Operations when necessary.

Inspired by @jigold 's file system improvement campaign, I pursued the avoidance of "list"
operations. I anticipate this reduces flakiness in Azure (which is tracked in hail-is#13351) and cost in
Azure.

I enforced aiotools.fs terminology on hail.fs and Scala:

1. `FileStatus`. Metadata about a blob or file. It does not know if a directory exists at this path.

2. `FileListEntry`. Metadata from a list operation. It knows if a directory exists at this path.

Variable names were updated to reflect this distinction:

1. `fileStatus` / `fileStatuses`

2. `fle`/ `fles` / `fileListEntry` / `fileListEntries`, respectively.

`listStatus` renamed to `listDirectory` for clarity.

In both Azure and Google, `fileStatus` does not use a list operation.

`fileListEntry` can be used when we must know if a directory exists. I just rewrote this from
first principles because:
1. In neither Google nor Azure did it check if the path was a directory and a file.
2. In Google, if the directory entry wasn't in the first page, it would fail (NB: there are fifteen
   non-control characters in ASCII before `/`, if the page size is 15 or fewer, we'd miss the first
   entry with a `/` at the end).
3. In Azure, we issued both a get and a list.

There are now unit tests for this method.

---

1. `copyMerge` and `concatenateFiles` previously used `O(N_FILES)` list operations, they now use
   `O(N_FILES)` get operations.
2. Writers that used `exists` to check for a _SUCCESS file now use a get operation.
3. Index readers, import BGEN, and import plink all now check file size with a get operation.

That said, overall, the bulk of our Class A Operations are probably writes.
danking pushed a commit to danking/hail that referenced this pull request Oct 23, 2023
CHANGELOG: Introduce `hl.fs.fast_stat` and `hl.hadoop_fast_stat` which use cheaper Class B Operations in Google Cloud Storage rather than Class A Operations. Users of `hl.hadoop_stat` and `hl.fs.stat` should consider switching.

This PR extends hail-is#13885 into the public API.
danking pushed a commit to danking/hail that referenced this pull request Oct 23, 2023
CHANGELOG: Introduce `hl.fs.fast_stat` and `hl.hadoop_fast_stat` which use cheaper Class B Operations in Google Cloud Storage rather than Class A Operations. Users of `hl.hadoop_stat` and `hl.fs.stat` should consider switching.

This PR extends hail-is#13885 into the public API.
danking pushed a commit to danking/hail that referenced this pull request Oct 23, 2023
CHANGELOG: Introduce `hl.fs.fast_stat` and `hl.hadoop_fast_stat` which use cheaper Class B Operations in Google Cloud Storage rather than Class A Operations. Users of `hl.hadoop_stat` and `hl.fs.stat` should consider switching.

This PR extends hail-is#13885 into the public API.
@danking
Copy link
Contributor Author

danking commented Dec 5, 2023

OK, I think this is finally ready for review. It can wait until January if necessary.

jigold
jigold previously requested changes Dec 6, 2023
Copy link
Contributor

@jigold jigold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an awesome change!

@@ -252,7 +252,7 @@ object GenericLines {

def read(
fs: FS,
fileListEntries0: IndexedSeq[FileListEntry],
fileListEntries0: IndexedSeq[_ <: FileStatus],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change the variable name here to fileStatuses0?

hail/src/main/scala/is/hail/io/bgen/LoadBgen.scala Outdated Show resolved Hide resolved
hail/python/test/hail/utils/test_utils.py Show resolved Hide resolved
dirFle = fle
}

continue = it.hasNext && (fle.getActualUrl <= withSlash)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you put a comment here on why you can check for fle.getActualUrl <= withSlash?

new BlobStorageFileListEntry(
s"gs://${ blob.getBucket }/$name",
s"gs://${ blob.getBucket }/${blob.getName}",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
s"gs://${ blob.getBucket }/${blob.getName}",
s"gs://${ blob.getBucket }/${ blob.getName }",

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made this change.

We do use ${url} pervasively. Should that be ${ url }?

if (url.path == "")
return new BlobStorageFileListEntry(s"gs://${url.bucket}", null, 0, true)
return GoogleStorageFileListEntry.dir(url)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does the return type here match up between GoogleStorageFileListEntry.dir and FileStatus?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A FileListEntry is a subtype of a FileStatus.

A FileStatus is a thing about which we know the name, the size, the creation time, etc. A FileListEntry is something we know all of that and also we know for sure if there is a directory with this name.

@@ -171,6 +174,14 @@ class HadoopFS(private[this] var conf: SerializableHadoopConfiguration) extends
files.map(fileListEntry => new HadoopFileListEntry(fileListEntry))
}

override def fileStatus(url: URL): FileStatus = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return types don't match?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above, FileListEntry is a sub-type of FileStatus. The compiler wouldn't compile this code otherwise.

@danking
Copy link
Contributor Author

danking commented Dec 7, 2023

Thanks for the review! I think I've addressed everything in this commit: 1bac0f1

jigold
jigold previously requested changes Jan 5, 2024
nPartitions: Option[Int],
blockSizeInMB: Option[Int],
minPartitions: Option[Int],
gzAsBGZ: Boolean,
allowSerialRead: Boolean,
filePerPartition: Boolean = false
): GenericLines = {
val fileListEntries = fileListEntries0.zipWithIndex.filter(_._1.getLen > 0)
val fileListEntries = fileStatuses0.zipWithIndex.filter(_._1.getLen > 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this variable be named fileStatuses now instead of fileListEntries?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

using(fs.open(metadataFile)) { in => parse(in) }
} catch {
case exc: FileNotFoundException =>
if (fs.isFile(path)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems a bit weird to me that we're catching a FileNotFoundException and then testing whether it's a file or a directory. Are we catching the wrong exception here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Notice that we're catching FileNotFoundException for the metadataFile, but then checking if the path (i.e. the matrix table path) is a file or a directory.

@Test def testFileListEntryOnFile(): Unit = {
// file
val f = r("/a")
val s = fs.fileListEntry(f)
assert(s.getPath == f)
assert(s.isFile)
assert(!s.isDirectory)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you get rid of these tests here? They seem valuable to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mistake, restored.

@Test def testFileStatusOnDirIsFailure(): Unit = {
val f = r("/dir")
TestUtils.interceptException[FileNotFoundException](r("/dir"))(
fs.fileStatus(r("/dir"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought you could create a fileStatus object for a directory. It just doesn't have the directory information as a FLE would have.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can only create a fileStatus for a file/object not a directory/prefix. In particular, no single Class B Operation can reliably determine if a path is a directory. For example, if a GET for the object gs://bucket/dir/a returns 200, I know that gs://bucket/dir is a directory but if it returns 404, I don't know if that directory does or does not exist.

@Test def testFileListEntryOnDir(): Unit = {
// file
val f = r("/dir")
val s = fs.fileListEntry(f)
assert(s.getPath == f)
assert(!s.isFile)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment about removing these tests and whether these are still valuable. Same below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

restored and restored in the next test.

jigold
jigold previously requested changes Jan 8, 2024
@@ -260,8 +260,8 @@ object GenericLines {
allowSerialRead: Boolean,
filePerPartition: Boolean = false
): GenericLines = {
val fileListEntries = fileStatuses0.zipWithIndex.filter(_._1.getLen > 0)
val totalSize = fileListEntries.map(_._1.getLen).sum
val fileSTatuses = fileStatuses0.zipWithIndex.filter(_._1.getLen > 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
val fileSTatuses = fileStatuses0.zipWithIndex.filter(_._1.getLen > 0)
val fileStatuses = fileStatuses0.zipWithIndex.filter(_._1.getLen > 0)

val fileListEntries = fileStatuses0.zipWithIndex.filter(_._1.getLen > 0)
val totalSize = fileListEntries.map(_._1.getLen).sum
val fileSTatuses = fileStatuses0.zipWithIndex.filter(_._1.getLen > 0)
val totalSize = fileSTatuses.map(_._1.getLen).sum
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
val totalSize = fileSTatuses.map(_._1.getLen).sum
val totalSize = fileStatuses.map(_._1.getLen).sum

@@ -276,7 +276,7 @@ object GenericLines {
case None =>
}

val contexts = fileListEntries.flatMap { case (fileListEntry, fileNum) =>
val contexts = fileSTatuses.flatMap { case (fileListEntry, fileNum) =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
val contexts = fileSTatuses.flatMap { case (fileListEntry, fileNum) =>
val contexts = fileStatuses.flatMap { case (fileListEntry, fileNum) =>

@danking danking dismissed jigold’s stale review January 9, 2024 16:16

casing fixed

@danking danking merged commit 3ddbf5f into hail-is:main Jan 9, 2024
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants