Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Strange behavior in ParallelCopyGCSDirectoryIntoHDFSSpark with NIO ver > 0.66 #5935

Closed
2 tasks done
SHuang-Broad opened this issue May 13, 2019 · 15 comments · Fixed by #6042
Closed
2 tasks done

Strange behavior in ParallelCopyGCSDirectoryIntoHDFSSpark with NIO ver > 0.66 #5935

SHuang-Broad opened this issue May 13, 2019 · 15 comments · Fixed by #6042

Comments

@SHuang-Broad
Copy link
Contributor

SHuang-Broad commented May 13, 2019

Bug Report

Affected tool(s) or class(es)

ParallelCopyGCSDirectoryIntoHDFSSpark

Affected version(s)

  • Latest public release version [version?]
  • Latest master branch as of [2019-05-13]

Description

ParallelCopyGCSDirectoryIntoHDFSSpark behaves in the following strange way:

  • under master/latest release, it fails to copy a GCS "directory" containing BAMs
  • under master/latest release, it successfully copies a GCS "directory" containing reference
  • changing the nio lib version from 81 to 66 in build.gradle, it successfully copies GCS "directories" containing reference or BAMs
  • see attached logs

Steps to reproduce

Both scripts referred to below need to be updated accordingly, but trivially

  • from the master branch, run the attached test.nio.ver.81.sh.

  • branch out from master, change the literal 81 to 66 on line 69 in build.gradle, run the attached test.nio.ver.66.sh.

Expected behavior

Files in the "directories" given in the gs path copied successfully.

Actual behavior

Fail. See logs attached.


test.nio.paraCopyHDFSSpark.zip

UPDATE:
reuploaded attachment

@cwhelan
Copy link
Member

cwhelan commented May 13, 2019

For some reason it looks like the google cloud nio API (version 81) is returning false from the method Files.isDirectory() on your directory gs://broad-dsde-methods-shuang/pb/bams/NA12892/ but not on the directory gs://broad-dsde-methods-sv/reference/GRCh38/ at line 206 of ParallelCopyGCSDirectoryIntoHDFSSpark. I'm not sure what's different between these two directories.

As an additional data point, it works correctly on the directories: gs://broad-dsde-methods-sv/samples/G94797_CHM_MIX/WGS1 and gs://broad-dsde-methods-sv/samples/G94797_CHM_MIX/WGS2. So there must be something specific to the bucket / directory that contains your pb data?

@SHuang-Broad
Copy link
Contributor Author

That is possible. But I'm not sure

  • why that is true only after NIO version 66.
  • how to track down the "difference" between the two buckets.

@cwhelan
Copy link
Member

cwhelan commented May 13, 2019

Looks like some NIO version after 66 broke this API for some subset of paths?

I don't know how to figure out what the difference is either.

@cmnbroad or @lbergelson would you have any ideas about this? (see my comment on this issue above where I narrow this bug down to an nio API call.

@SHuang-Broad
Copy link
Contributor Author

To update:

Chris and I just tested copying the bam and indices (3 files) from gs://broad-dsde-methods-sv/samples/G94797_CHM_MIX/WGS1 (note that Chris reports it works on this bucket) to a just-created "directory" gs://broad-dsde-methods-sv/samples/G94797_CHM_MIX/WGS1/tmp, and it fails.

Also an interested behavior we noticed, and a suspicion that is hard to test (due to lack of access to time machine), that this might be related when the "directory" is created: any directory freshly created after October 2018 might be susceptible to this, which is also the month when newer (>66) release of NIO became available.
In the mean time, if one does
gsutil ls -L gs://broad-dsde-methods-sv/samples/G94797_CHM_MIX/WGS1
you'd get, at the last line, TOTAL: 3 objects, which is expected, whereas if one does
gsutil ls -L gs://broad-dsde-methods-sv/samples/G94797_CHM_MIX/WGS1/tmp
guess what: TOTAL: 4 objects!
It seems to list the "directory" itself as an object as well.

@SHuang-Broad
Copy link
Contributor Author

Nagging @droazen as well.

@droazen
Copy link
Contributor

droazen commented May 14, 2019

@jean-philippe-martin Any thoughts on what might be going here?

@droazen
Copy link
Contributor

droazen commented May 14, 2019

My suggestion to @SHuang-Broad was to try leaving out the "directory creation" step, since under the hood directories don't actually exist on GCS. Ie., instead of explicitly creating gs://broad-dsde-methods-sv/samples/G94797_CHM_MIX/WGS1/tmp, just start copying files to gs://broad-dsde-methods-sv/samples/G94797_CHM_MIX/WGS1/tmp/file1, gs://broad-dsde-methods-sv/samples/G94797_CHM_MIX/WGS1/tmp/file2, etc.

I suspect that you are accidentally creating 0-byte files with the same name as the directory you want.

@jean-philippe-martin
Copy link
Contributor

jean-philippe-martin commented May 14, 2019

The Google Cloud web UI allows you to create "directories". When you press that button, it creates a file with a trailing slash, and the web UI interprets it as a directory even though it's a file. This causes no end of trouble because now every other program in the world that accesses cloud storage must be updated to adopt this "convention" or they'll see files where the user doesn't expect them. My guess would be that's what's going on here.

That said, NIO should understand what these things are and ignore them. The current workaround is to consider 0-byte files as potentially being fake directories.

@cwhelan I cannot access the file gs://broad-dsde-methods-shuang/pb/bams/NA12892/ but if you can, could you please check its size for me? I mean the file specifically. If it's nonzero that would explain the problem. If it's zero or nonexistent then we'll need to dig some more to understand what's going on.

@SHuang-Broad
Copy link
Contributor Author

@jean-philippe-martin
Thanks for the input.

I've checked both gs://broad-dsde-methods-shuang/pb/bams/NA12892/ and gs://broad-dsde-methods-sv/samples/G94797_CHM_MIX/WGS1/tmp, they return something like this

gs://broad-dsde-methods-shuang/pb/bams/NA12892/:
    Creation time:          Mon, 22 Apr 2019 16:14:50 GMT
    Update time:            Mon, 22 Apr 2019 16:14:50 GMT
    Storage class:          STANDARD
    Content-Length:         11
    Content-Type:           text/plain
    Hash (crc32c):          XkI+Dw==
    Hash (md5):             apnFdauH+MfR7R5S5+NJzg==
    ETag:                   CJekwKSM5OECEAE=
    Generation:             1555949690032663
    Metageneration:         1
    ACL:                    [
  {
    "entity": "project-owners-222581509023",
    "projectTeam": {
      "projectNumber": "222581509023",
      "team": "owners"
    },
    "role": "OWNER"
  },
  {
    "entity": "project-editors-222581509023",
    "projectTeam": {
      "projectNumber": "222581509023",
      "team": "editors"
    },
    "role": "OWNER"
  },
  {
    "entity": "project-viewers-222581509023",
    "projectTeam": {
      "projectNumber": "222581509023",
      "team": "viewers"
    },
    "role": "READER"
  },
  {
    "email": "shuang@broadinstitute.org",
    "entity": "user-shuang@broadinstitute.org",
    "role": "OWNER"
  }
]
......
......

The line the Content-Length: 11 seems to suggest you are right.
And if I run gsutil ls -lh gs://broad-dsde-methods-shuang/pb/bams/NA12892/, I get

      11 B  2019-04-22T16:14:50Z  gs://broad-dsde-methods-shuang/pb/bams/NA12892/
......
......

@jean-philippe-martin
Copy link
Contributor

jean-philippe-martin commented May 15, 2019

Thank you @SHuang-Broad, this confirms the problem: the "convention" has been changed so NIO's special casing needs to be adjusted to the new reality of what sort of files are created to make fake directories.

In the meantime, the workaround is clear: if a tool offers to create directories in Google cloud, gently decline. It's not necessary and in some cases (like here), creates problems.

@SHuang-Broad
Copy link
Contributor Author

Thanks @jean-philippe-martin !

Do you have a rough estimate about when the workaround will no longer be needed?

@jean-philippe-martin
Copy link
Contributor

@SHuang-Broad hard to say. It's a small change once I get to it, but then it needs to go through review and wait for a release. To give a very rough estimate I'd say perhaps a month, with wide error bars.

@SHuang-Broad
Copy link
Contributor Author

@jean-philippe-martin Thanks for the estimate!

@jean-philippe-martin
Copy link
Contributor

@SHuang-Broad The bug was fixed upstream, so the next NIO release after 2019-5-28 will include it, and then it's just a matter of GATK updating to the latest version of NIO.

@SHuang-Broad
Copy link
Contributor Author

Thanks for the update @jean-philippe-martin !

lbergelson added a commit that referenced this issue Jul 16, 2019
* Updating google-cloud-nio   0.81.0-alpha:shaded -> 0.100.0-alpha:shaded
* Fixes #5935
lbergelson added a commit that referenced this issue Aug 28, 2019
* Updating google-cloud-nio   0.81.0-alpha:shaded -> 0.100.0-alpha:shaded
* Fixes #5935
lbergelson added a commit that referenced this issue Sep 3, 2019
* Updating google-cloud-nio   0.81.0-alpha:shaded -> 0.100.0-alpha:shaded
* Fixes #5935
lbergelson added a commit that referenced this issue Sep 11, 2019
* Updating google-cloud-nio   0.81.0-alpha:shaded -> 0.100.0-alpha:shaded
* Fixes #5935
lbergelson added a commit that referenced this issue Sep 11, 2019
* Updating google-cloud-nio  0.81.0-alpha:shaded -> 0.107.0-alpha:shaded
* Update picard version 2.20.5 -> 2.20.7
* Fixes #5935 
* Exclude transitive dependencies from shaded dependencies.  The shaded poms seem to include copies of the dependencies that were shaded which means we get multiple confusing copies of the same class.  
* Added a workaround on travis https://github.com/googleapis/google-cloud-java/issues/5884, an issue introduced in gcloud 0.90.0-alpha.  This required redundantly specifying the google project through an environment variable in addition to logging in.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants