-
Notifications
You must be signed in to change notification settings - Fork 247
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[query] Avoid Class A Operations. #13885
Conversation
CHANGELOG: Hail Query-on-Batch previously used Class A Operations for all interaction with blobs. This change ensures that QoB only uses Class A Operations when necessary. Inspired by @jigold 's file system improvement campaign, I pursued the avoidance of "list" operations. I anticipate this reduces flakiness in Azure (which is tracked in hail-is#13351) and cost in Azure. I enforced aiotools.fs terminology on hail.fs and Scala: 1. `FileStatus`. Metadata about a blob or file. It does not know if a directory exists at this path. 2. `FileListEntry`. Metadata from a list operation. It knows if a directory exists at this path. Variable names were updated to reflect this distinction: 1. `fileStatus` / `fileStatuses` 2. `fle`/ `fles` / `fileListEntry` / `fileListEntries`, respectively. `listStatus` renamed to `listDirectory` for clarity. In both Azure and Google, `fileStatus` does not use a list operation. `fileListEntry` can be used when we must know if a directory exists. I just rewrote this from first principles because: 1. In neither Google nor Azure did it check if the path was a directory and a file. 2. In Google, if the directory entry wasn't in the first page, it would fail (NB: there are fifteen non-control characters in ASCII before `/`, if the page size is 15 or fewer, we'd miss the first entry with a `/` at the end). 3. In Azure, we issued both a get and a list. There are now unit tests for this method. --- 1. `copyMerge` and `concatenateFiles` previously used `O(N_FILES)` list operations, they now use `O(N_FILES)` get operations. 2. Writers that used `exists` to check for a _SUCCESS file now use a get operation. 3. Index readers, import BGEN, and import plink all now check file size with a get operation. That said, overall, the bulk of our Class A Operations are probably writes.
CHANGELOG: Introduce `hl.fs.fast_stat` and `hl.hadoop_fast_stat` which use cheaper Class B Operations in Google Cloud Storage rather than Class A Operations. Users of `hl.hadoop_stat` and `hl.fs.stat` should consider switching. This PR extends hail-is#13885 into the public API.
CHANGELOG: Introduce `hl.fs.fast_stat` and `hl.hadoop_fast_stat` which use cheaper Class B Operations in Google Cloud Storage rather than Class A Operations. Users of `hl.hadoop_stat` and `hl.fs.stat` should consider switching. This PR extends hail-is#13885 into the public API.
CHANGELOG: Introduce `hl.fs.fast_stat` and `hl.hadoop_fast_stat` which use cheaper Class B Operations in Google Cloud Storage rather than Class A Operations. Users of `hl.hadoop_stat` and `hl.fs.stat` should consider switching. This PR extends hail-is#13885 into the public API.
This reverts commit ab91750.
OK, I think this is finally ready for review. It can wait until January if necessary. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an awesome change!
@@ -252,7 +252,7 @@ object GenericLines { | |||
|
|||
def read( | |||
fs: FS, | |||
fileListEntries0: IndexedSeq[FileListEntry], | |||
fileListEntries0: IndexedSeq[_ <: FileStatus], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change the variable name here to fileStatuses0
?
dirFle = fle | ||
} | ||
|
||
continue = it.hasNext && (fle.getActualUrl <= withSlash) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you put a comment here on why you can check for fle.getActualUrl <= withSlash
?
new BlobStorageFileListEntry( | ||
s"gs://${ blob.getBucket }/$name", | ||
s"gs://${ blob.getBucket }/${blob.getName}", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s"gs://${ blob.getBucket }/${blob.getName}", | |
s"gs://${ blob.getBucket }/${ blob.getName }", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've made this change.
We do use ${url}
pervasively. Should that be ${ url }
?
if (url.path == "") | ||
return new BlobStorageFileListEntry(s"gs://${url.bucket}", null, 0, true) | ||
return GoogleStorageFileListEntry.dir(url) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does the return type here match up between GoogleStorageFileListEntry.dir
and FileStatus
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A FileListEntry is a subtype of a FileStatus.
A FileStatus is a thing about which we know the name, the size, the creation time, etc. A FileListEntry is something we know all of that and also we know for sure if there is a directory with this name.
@@ -171,6 +174,14 @@ class HadoopFS(private[this] var conf: SerializableHadoopConfiguration) extends | |||
files.map(fileListEntry => new HadoopFileListEntry(fileListEntry)) | |||
} | |||
|
|||
override def fileStatus(url: URL): FileStatus = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return types don't match?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above, FileListEntry is a sub-type of FileStatus. The compiler wouldn't compile this code otherwise.
Thanks for the review! I think I've addressed everything in this commit: 1bac0f1 |
nPartitions: Option[Int], | ||
blockSizeInMB: Option[Int], | ||
minPartitions: Option[Int], | ||
gzAsBGZ: Boolean, | ||
allowSerialRead: Boolean, | ||
filePerPartition: Boolean = false | ||
): GenericLines = { | ||
val fileListEntries = fileListEntries0.zipWithIndex.filter(_._1.getLen > 0) | ||
val fileListEntries = fileStatuses0.zipWithIndex.filter(_._1.getLen > 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this variable be named fileStatuses
now instead of fileListEntries
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
using(fs.open(metadataFile)) { in => parse(in) } | ||
} catch { | ||
case exc: FileNotFoundException => | ||
if (fs.isFile(path)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems a bit weird to me that we're catching a FileNotFoundException
and then testing whether it's a file or a directory. Are we catching the wrong exception here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Notice that we're catching FileNotFoundException for the metadataFile, but then checking if the path (i.e. the matrix table path) is a file or a directory.
@Test def testFileListEntryOnFile(): Unit = { | ||
// file | ||
val f = r("/a") | ||
val s = fs.fileListEntry(f) | ||
assert(s.getPath == f) | ||
assert(s.isFile) | ||
assert(!s.isDirectory) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did you get rid of these tests here? They seem valuable to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mistake, restored.
@Test def testFileStatusOnDirIsFailure(): Unit = { | ||
val f = r("/dir") | ||
TestUtils.interceptException[FileNotFoundException](r("/dir"))( | ||
fs.fileStatus(r("/dir")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought you could create a fileStatus object for a directory. It just doesn't have the directory information as a FLE would have.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can only create a fileStatus for a file/object not a directory/prefix. In particular, no single Class B Operation can reliably determine if a path is a directory. For example, if a GET for the object gs://bucket/dir/a
returns 200, I know that gs://bucket/dir
is a directory but if it returns 404, I don't know if that directory does or does not exist.
@Test def testFileListEntryOnDir(): Unit = { | ||
// file | ||
val f = r("/dir") | ||
val s = fs.fileListEntry(f) | ||
assert(s.getPath == f) | ||
assert(!s.isFile) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment about removing these tests and whether these are still valuable. Same below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
restored and restored in the next test.
@@ -260,8 +260,8 @@ object GenericLines { | |||
allowSerialRead: Boolean, | |||
filePerPartition: Boolean = false | |||
): GenericLines = { | |||
val fileListEntries = fileStatuses0.zipWithIndex.filter(_._1.getLen > 0) | |||
val totalSize = fileListEntries.map(_._1.getLen).sum | |||
val fileSTatuses = fileStatuses0.zipWithIndex.filter(_._1.getLen > 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
val fileSTatuses = fileStatuses0.zipWithIndex.filter(_._1.getLen > 0) | |
val fileStatuses = fileStatuses0.zipWithIndex.filter(_._1.getLen > 0) |
val fileListEntries = fileStatuses0.zipWithIndex.filter(_._1.getLen > 0) | ||
val totalSize = fileListEntries.map(_._1.getLen).sum | ||
val fileSTatuses = fileStatuses0.zipWithIndex.filter(_._1.getLen > 0) | ||
val totalSize = fileSTatuses.map(_._1.getLen).sum |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
val totalSize = fileSTatuses.map(_._1.getLen).sum | |
val totalSize = fileStatuses.map(_._1.getLen).sum |
@@ -276,7 +276,7 @@ object GenericLines { | |||
case None => | |||
} | |||
|
|||
val contexts = fileListEntries.flatMap { case (fileListEntry, fileNum) => | |||
val contexts = fileSTatuses.flatMap { case (fileListEntry, fileNum) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
val contexts = fileSTatuses.flatMap { case (fileListEntry, fileNum) => | |
val contexts = fileStatuses.flatMap { case (fileListEntry, fileNum) => |
1303d8e
to
8be3389
Compare
CHANGELOG: Hail Query-on-Batch previously used Class A Operations for all interaction with blobs. This change ensures that QoB only uses Class A Operations when necessary.
Inspired by @jigold 's file system improvement campaign, I pursued the avoidance of "list" operations. I anticipate this reduces flakiness in Azure (which is tracked in #13351) and cost in Azure.
I enforced aiotools.fs terminology on hail.fs and Scala:
FileStatus
. Metadata about a blob or file. It does not know if a directory exists at this path.FileListEntry
. Metadata from a list operation. It knows if a directory exists at this path.Variable names were updated to reflect this distinction:
fileStatus
/fileStatuses
fle
/fles
/fileListEntry
/fileListEntries
, respectively.listStatus
renamed tolistDirectory
for clarity.In both Azure and Google,
fileStatus
does not use a list operation.fileListEntry
can be used when we must know if a directory exists. I just rewrote this from first principles because:/
, if the page size is 15 or fewer, we'd miss the first entry with a/
at the end).There are now unit tests for this method.
copyMerge
andconcatenateFiles
previously usedO(N_FILES)
list operations, they now useO(N_FILES)
get operations.exists
to check for a _SUCCESS file now use a get operation.That said, overall, the bulk of our Class A Operations are probably writes.