Strange behavior in ParallelCopyGCSDirectoryIntoHDFSSpark with NIO ver > 0.66 #5935
Comments
For some reason it looks like the google cloud nio API (version 81) is returning As an additional data point, it works correctly on the directories: |
That is possible. But I'm not sure
|
Looks like some NIO version after 66 broke this API for some subset of paths? I don't know how to figure out what the difference is either. @cmnbroad or @lbergelson would you have any ideas about this? (See my comment on this issue above, where I narrow this bug down to an NIO API call.) |
To update: Chris and I just tested copying the bam and indices (3 files) from Also an interesting behavior we noticed, and a suspicion that is hard to test (due to lack of access to a time machine): this might be related to when the "directory" is created. Any directory freshly created after October 2018 might be susceptible to this, which is also the month when newer (>66) releases of NIO became available. |
Nagging @droazen as well. |
@jean-philippe-martin Any thoughts on what might be going on here? |
My suggestion to @SHuang-Broad was to try leaving out the "directory creation" step, since under the hood directories don't actually exist on GCS. I.e., instead of explicitly creating I suspect that you are accidentally creating 0-byte files with the same name as the directory you want. |
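To illustrate why the creation step is unnecessary, here is a minimal sketch in Python over a plain in-memory list of object names. No real GCS client is involved, and the object names and the helper `list_dir` are hypothetical, chosen only for illustration. In a flat object namespace, a "directory" is simply the prefix that object names happen to share:

```python
def list_dir(object_names, prefix):
    """Return the immediate children of `prefix` in a flat object namespace.

    Hypothetical helper: it only inspects names, it does not talk to GCS.
    """
    if prefix and not prefix.endswith("/"):
        prefix += "/"
    children = set()
    for name in object_names:
        # Skip unrelated objects and any object named exactly like the prefix.
        if not name.startswith(prefix) or name == prefix:
            continue
        rest = name[len(prefix):]
        head, sep, _ = rest.partition("/")
        # Keep the trailing slash to mark a nested "directory".
        children.add(head + sep)
    return sorted(children)


# Hypothetical object names; nothing was ever "created" as a directory.
objects = ["data/sample.bam", "data/sample.bam.bai", "data/ref/genome.fasta"]
print(list_dir(objects, "data"))
```

Writing `data/sample.bam` is enough for `data` to appear as a directory in this model, which is why an explicit creation step adds nothing.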
The Google Cloud web UI allows you to create "directories". When you press that button, it creates a file with a trailing slash, and the web UI interprets it as a directory even though it's a file. This causes no end of trouble because now every other program in the world that accesses cloud storage must be updated to adopt this "convention" or they'll see files where the user doesn't expect them. My guess would be that's what's going on here. That said, NIO should understand what these things are and ignore them. The current workaround is to consider 0-byte files as potentially being fake directories. @cwhelan I cannot access the file |
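As a rough sketch of the kind of special-casing described above (not NIO's actual code), the following Python snippet filters a listing for entries that look like fake directories. The `ObjectEntry` type, the function names, and the sample listing are all hypothetical, and the heuristic (0 bytes plus a trailing slash) is an assumption based on the description in this comment; the convention used to create such placeholders may differ.

```python
from typing import NamedTuple


class ObjectEntry(NamedTuple):
    """Hypothetical stand-in for one entry of a bucket listing."""
    name: str
    size: int


def is_probable_fake_directory(entry):
    # Assumed heuristic: a 0-byte object whose name ends with a slash.
    return entry.size == 0 and entry.name.endswith("/")


def real_files(entries):
    """Drop entries that look like directory placeholders."""
    return [e for e in entries if not is_probable_fake_directory(e)]


listing = [
    ObjectEntry("data/", 0),              # placeholder made by a "create folder" button
    ObjectEntry("data/sample.bam", 1024),
    ObjectEntry("data/empty.txt", 0),     # a genuine empty file, kept
]
print([e.name for e in real_files(listing)])
```

Checking the trailing slash as well as the size keeps genuinely empty files from being discarded, at the cost of missing placeholders that follow a different naming convention.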
@jean-philippe-martin I've checked both
The line the
|
Thank you @SHuang-Broad, this confirms the problem: the "convention" has been changed so NIO's special casing needs to be adjusted to the new reality of what sort of files are created to make fake directories. In the meantime, the workaround is clear: if a tool offers to create directories in Google cloud, gently decline. It's not necessary and in some cases (like here), creates problems. |
Thanks @jean-philippe-martin ! Do you have a rough estimate about when the workaround will no longer be needed? |
@SHuang-Broad hard to say. It's a small change once I get to it, but then it needs to go through review and wait for a release. To give a very rough estimate I'd say perhaps a month, with wide error bars. |
@jean-philippe-martin Thanks for the estimate! |
@SHuang-Broad The bug was fixed upstream, so the next NIO release after 2019-5-28 will include it, and then it's just a matter of GATK updating to the latest version of NIO. |
Thanks for the update @jean-philippe-martin ! |
* Updating google-cloud-nio 0.81.0-alpha:shaded -> 0.100.0-alpha:shaded * Fixes #5935
* Updating google-cloud-nio 0.81.0-alpha:shaded -> 0.107.0-alpha:shaded
* Update picard version 2.20.5 -> 2.20.7
* Fixes #5935
* Exclude transitive dependencies from shaded dependencies. The shaded poms seem to include copies of the dependencies that were shaded which means we get multiple confusing copies of the same class.
* Added a workaround on travis https://github.com/googleapis/google-cloud-java/issues/5884, an issue introduced in gcloud 0.90.0-alpha. This required redundantly specifying the google project through an environment variable in addition to logging in.
Bug Report
Affected tool(s) or class(es)
ParallelCopyGCSDirectoryIntoHDFSSpark
Affected version(s)
Description
ParallelCopyGCSDirectoryIntoHDFSSpark behaves in the following strange way:
build.gradle, it successfully copies GCS "directories" containing reference or BAMs
Steps to reproduce
Both scripts referred to below need to be updated accordingly, but trivially.
* From the master branch, run the attached test.nio.ver.81.sh.
* Branch out from master, change the literal 81 to 66 on line 69 in build.gradle, then run the attached test.nio.ver.66.sh.
Expected behavior
Files in the "directories" given in the gs path copied successfully.
Actual behavior
Fail. See logs attached.
test.nio.paraCopyHDFSSpark.zip
UPDATE: reuploaded attachment