File _rgdict.avro does not exist #1150

ooliynyk · 2016-09-03T20:27:22Z

I converted a VCF file to ADAM format using the command:

# adam-submit vcf2adam file:///tmp/A7VAGPU.vcf.gz file:///tmp/a7.adam

When I then tried to run the following command, I got an error:

# adam-submit count_kmers file:///tmp/a7.adam/ file:///tmp/kmers.adam 10

Command body threw exception:
java.io.FileNotFoundException: File file:/tmp/a7.adam/_rgdict.avro does not exist
Exception in thread "main" java.io.FileNotFoundException: File file:/tmp/a7.adam/_rgdict.avro does not exist
    at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:534)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:409)
    at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:140)
    at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:341)
    at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:766)
    at org.bdgenomics.adam.rdd.ADAMContext.loadAvro(ADAMContext.scala:442)
    at org.bdgenomics.adam.rdd.ADAMContext.loadAvroReadGroupMetadata(ADAMContext.scala:160)
    at org.bdgenomics.adam.rdd.ADAMContext.loadParquetAlignments(ADAMContext.scala:520)
    at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadAlignments$1.apply(ADAMContext.scala:1023)
    at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadAlignments$1.apply(ADAMContext.scala:1002)
    at scala.Option.fold(Option.scala:157)
    at org.apache.spark.rdd.Timer.time(Timer.scala:48)
    at org.bdgenomics.adam.rdd.ADAMContext.loadAlignments(ADAMContext.scala:1000)
    at org.bdgenomics.adam.cli.CountReadKmers.run(CountReadKmers.scala:63)
    at org.bdgenomics.utils.cli.BDGSparkCommand$class.run(BDGCommand.scala:55)
    at org.bdgenomics.adam.cli.CountReadKmers.run(CountReadKmers.scala:54)
    at org.bdgenomics.adam.cli.ADAMMain.apply(ADAMMain.scala:132)
    at org.bdgenomics.adam.cli.ADAMMain$.main(ADAMMain.scala:72)
    at org.bdgenomics.adam.cli.ADAMMain.main(ADAMMain.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:674)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Files in .adam directory:

root@2ef6c96e995a:/tmp# ls -la /tmp/a7.adam/
total 2044
drwxr-xr-x  2 root root    4096 Sep  3 20:03 .
drwxrwxrwt 75 root root    4096 Sep  3 20:04 ..
-rw-r--r--  1 root root       8 Sep  3 20:03 ._SUCCESS.crc
-rw-r--r--  1 root root     108 Sep  3 20:03 ._common_metadata.crc
-rw-r--r--  1 root root     140 Sep  3 20:03 ._metadata.crc
-rw-r--r--  1 root root      20 Sep  3 20:03 ._samples.avro.crc
-rw-r--r--  1 root root      20 Sep  3 20:03 ._seqdict.avro.crc
-rw-r--r--  1 root root   15628 Sep  3 20:03 .part-r-00000.gz.parquet.crc
-rw-r--r--  1 root root       0 Sep  3 20:03 _SUCCESS
-rw-r--r--  1 root root   12467 Sep  3 20:03 _common_metadata
-rw-r--r--  1 root root   16770 Sep  3 20:03 _metadata
-rw-r--r--  1 root root    1301 Sep  3 20:03 _samples.avro
-rw-r--r--  1 root root    1402 Sep  3 20:03 _seqdict.avro
-rw-r--r--  1 root root 1999220 Sep  3 20:03 part-r-00000.gz.parquet

Attachment: A7VAGPU.vcf.gz

fnothaft · 2016-09-06T16:21:27Z

Hi @ooliynyk! The count_kmers command only works on read data, not on variant data. That said, I'd like to better understand your use case here. Are you trying to create consensus sequences from the genotype calls, which you then count k-mers from?

BrandonColbyMD · 2016-09-07T21:42:06Z

Hi @fnothaft - Thank you for your reply. I'm working with @ooliynyk on this project. We have installed ADAM as part of Sequencing.com's Altruist Database, a free, open-data initiative. The database contains human genome VCF files as well as those files converted to ADAM format.

Our goal is to let users of the Altruist Database apply ADAM to any type of analysis they want to perform on one or more genotypic files in the database. For example, they can choose to analyze all Altruist records that are female, or all Altruist records that are carriers of a Cystic Fibrosis variant in the CFTR gene.

Using the Altruist UI (which is still in development), users will be able to analyze data within the Altruist Database by uploading or entering their own commands, by programming ADAM to their own specs, or by using the commands/programs created and shared by other users.

We were testing the count_kmers command to make sure it worked on our dataset in case a user entered that command.

Hope this info was helpful. Any advice and guidance you can provide will be much appreciated so we can make sure that we enable ADAM to be used to its full potential.

heuermh · 2016-09-20T14:30:15Z

Very interesting use case, @ooliynyk @BrandonColbyMD!

As @fnothaft mentioned, count_kmers doesn't make sense to run on VCF files or on ADAM Parquet directories of Variants or Genotypes. We could help by wrapping the thrown exception in a more user-friendly error message, and perhaps by documenting which ADAM bdg-formats schema records each ADAM CLI command supports.

What do you think?
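
To illustrate the kind of friendlier check suggested above, here is a minimal Python sketch. It is not part of ADAM; the function names are hypothetical, and the only thing it relies on is the `_rgdict.avro` file name that appears in the stack trace in this issue. Other ADAM versions may store read group metadata differently, so treat the file name as an assumption.

```python
from pathlib import Path

# File name taken from the stack trace in this issue. Assumption: other
# ADAM versions may store read group metadata under a different name.
READ_GROUP_DICT = "_rgdict.avro"


def has_read_group_dictionary(adam_dir):
    """Return True if adam_dir contains the read group dictionary file."""
    return (Path(adam_dir) / READ_GROUP_DICT).is_file()


def explain_missing_read_metadata(adam_dir):
    """Return a user-friendly message, or None if the file is present.

    Hypothetical helper for illustration only; ADAM itself raised a
    FileNotFoundException in the report above.
    """
    if has_read_group_dictionary(adam_dir):
        return None
    return (
        f"{adam_dir} has no {READ_GROUP_DICT}, so it was probably not written "
        "as read (alignment) data. count_kmers only works on read data, "
        "not on variant data such as the output of vcf2adam."
    )


if __name__ == "__main__":
    # Path from the original report; prints the message, or None if the
    # read group dictionary exists.
    print(explain_missing_read_metadata("/tmp/a7.adam"))
```

A wrapper script could run a check like this before calling `adam-submit count_kmers` and show the message instead of the raw stack trace.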

plexteq · 2016-10-17T15:48:29Z

Hi @heuermh, @fnothaft.

So we converted VCF to ADAM using the vcf2adam command. Which operations can be applied to the resulting ADAM file? What kinds of analysis can be performed on it?

Regards,
Alex

BrandonColbyMD · 2016-10-26T23:20:03Z

Hi @heuermh and @fnothaft - wanted to follow up on @plexteq's question above. We are in the process of implementing operations that users can run on a user-defined subset of ADAM files in the Altruist Database. All ADAM files are being converted from gVCF files using vcf2adam, so they aren't from Avro files.

Please let us know what operations you recommend to allow for analyzing one or more ADAM files within the Altruist Database.

Thank you!

heuermh · 2016-10-26T23:30:52Z

@BrandonColbyMD ADAM is kind of like the Swiss Army knife for getting data in traditional bioinformatics file formats ready for analysis on Spark; most of the interesting bits can be found in downstream repositories or in workbooks. I'll let others chime in with specific examples.

fnothaft · 2017-03-03T23:31:30Z

Closing as this was a version change issue.

fnothaft closed this as completed Mar 3, 2017
