Files that are not in hdfs or ADAM format bypass Spark #494

akmorrow13 · 2019-05-09T22:44:04Z

No description provided.

coveralls · 2019-05-09T22:53:02Z

Coverage increased (+7.9%) to 80.273% when pulling 39aa701 on akmorrow13:optimize_http into b9efbdb on bigdatagenomics:master.

AmplabJenkins · 2019-05-09T23:13:36Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/mango-prb/832/
Test FAILed.

AmplabJenkins · 2019-05-10T18:39:18Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/mango-prb/833/
Test PASSed.

akmorrow13 · 2019-05-10T18:42:03Z

mango-core/src/main/scala/org/bdgenomics/mango/converters/SAMRecordConverter.scala

@@ -0,0 +1,223 @@
+/**
+ * Licensed to Big Data Genomics (BDG) under one


@heuermh this entire file is copied from ADAM

Note this isn't fully tested yet, but should be somewhat faster than what is in ADAM
bigdatagenomics/convert#71

And if your use case projects away the attributes column, it might be useful to add a flag to skip converting them, since they are lazily parsed in htsjdk.

Awesome, thanks @heuermh! Would the flag be added in the convert library?

Yes, I could do so. I'm also considering adding a convert adapter layer to the actual converters in ADAM. That way the implementation classes can remain private to ADAM while the convert adapter layer becomes part of the public API.

Sorry, been busy with other things. I can implement the attributes column flag tomorrow. While thinking about the adapter layer in ADAM I found bigdatagenomics/adam#2156, which needs review and real-life testing to make sure there is no performance regression.
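
For illustration only, here is a minimal sketch of how a "skip attributes" flag could work. It does not reflect the actual convert or ADAM API: `RawRead`, `ConvertedRead`, `ReadConverter` and the `convertAttributes` parameter are hypothetical names, and `RawRead` is a stand-in that only mimics lazy tag parsing rather than the real htsjdk `SAMRecord`.

```scala
// Hypothetical stand-in for a read whose tags are parsed only on first access,
// mimicking the lazy attribute parsing mentioned in the thread.
final class RawRead(val name: String, val sequence: String, rawTags: String) {
  // Counts how many times the tag string was actually parsed.
  var attributeParses = 0

  lazy val attributes: Map[String, String] = {
    attributeParses += 1
    rawTags.split('\t').filter(_.nonEmpty).map { tag =>
      // A tag such as "NM:i:2" becomes key "NM" with value "2".
      val parts = tag.split(":", 3)
      parts(0) -> parts(2)
    }.toMap
  }
}

// Hypothetical converted record; attributes are optional so they can be skipped.
final case class ConvertedRead(
  name: String,
  sequence: String,
  attributes: Option[Map[String, String]])

object ReadConverter {
  // When convertAttributes is false, the lazy tags are never touched,
  // so the parsing cost is not paid for a column that is projected away.
  def convert(read: RawRead, convertAttributes: Boolean = true): ConvertedRead =
    ConvertedRead(
      read.name,
      read.sequence,
      if (convertAttributes) Some(read.attributes) else None)
}
```

With `convertAttributes = false` the converter never forces the lazy field, which is where the saving would come from if the attributes column is dropped downstream anyway.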

akmorrow13 · 2019-05-10T18:42:38Z

mango-core/src/main/scala/org/bdgenomics/mango/io/VcfReader.scala

+  }
+
+  // TODO already defined in ADAM in VariantContextConverter line 266
+  def getHeaderLines(header: VCFHeader): Seq[VCFHeaderLine] = {


@heuermh this is also copied from ADAM, although it is pretty small so I don't think copying is the worst thing here

AmplabJenkins · 2019-05-10T22:06:24Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/mango-prb/834/
Test FAILed.

mango-core/src/main/scala/org/bdgenomics/mango/io/BedReader.scala

heuermh · 2019-05-13T16:37:07Z

mango-core/src/main/scala/org/bdgenomics/mango/io/VcfReader.scala

+            }
+          }
+
+        if (isGzipped)


Does a distinction between GZIP and block-compressed GZIP (BGZF) need to be made here?

Not sure. Do you have a BGZF reference I can play with?

ADAM adam-core/src/test/resources has

test.compressed.bcf test.uncompressed.bcf test.vcf test.vcf.bgz test.vcf.bgzf.gz test.vcf.gz

Disq src/test/resources has

HiSeq.10000.vcf.bgz HiSeq.10000.vcf.bgz.tbi HiSeq.10000.vcf.bgzf.gz HiSeq.10000.vcf.bgzf.gz.tbi test.vcf test.vcf.bgz test.vcf.bgzf.gz test.vcf.gz

Awesome, thanks @heuermh !

Thanks for the thoughts, @heuermh. I pushed some tests and a fix. It now works with bgz and bgzf.gz.
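
For context on the GZIP versus BGZF question, here is a minimal sketch of telling the two apart from the first bytes of a file rather than from its extension. A name such as test.vcf.bgzf.gz ends in .gz, so a suffix check alone presumably cannot distinguish it from plain gzip. This is not the fix that was pushed in this PR. `GzipKind` and `detect` are hypothetical names, and the check assumes the BGZF "BC" subfield comes first in the gzip extra field, which is the usual layout.

```scala
object GzipKind {
  sealed trait Compression
  case object Plain extends Compression
  case object Gzip extends Compression
  case object Bgzf extends Compression

  // Classifies a file from its leading bytes (at least 14 are needed to see BGZF).
  def detect(header: Array[Byte]): Compression = {
    // Read a byte as an unsigned value.
    def u(i: Int): Int = header(i) & 0xff

    // gzip magic bytes 0x1f 0x8b followed by compression method 8 (deflate).
    val isGzip = header.length >= 3 && u(0) == 0x1f && u(1) == 0x8b && u(2) == 8

    if (!isGzip) Plain
    else {
      // BGZF sets the FEXTRA flag (0x04) and stores a subfield tagged 'B','C'
      // at offsets 12 and 13, right after the two-byte XLEN field.
      val hasExtra = header.length >= 14 && (u(3) & 0x04) != 0
      val isBgzf = hasExtra && u(12) == 'B'.toInt && u(13) == 'C'.toInt
      if (isBgzf) Bgzf else Gzip
    }
  }
}
```

Sniffing the header this way keeps working when a block-compressed file is named with a plain .gz suffix. A production reader would more likely rely on the BGZF support that ships with htsjdk.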

heuermh · 2019-05-13T16:37:36Z

mango-core/src/main/scala/org/bdgenomics/mango/io/VcfReader.scala

+  private def createIndex(fp: String, codec: VCFCodec): String = {
+
+    val file = new java.io.File(fp)
+    val isGzipped = fp.endsWith(".gz")


...same here

mango-core/src/main/scala/org/bdgenomics/mango/models/AlignmentRecordMaterialization.scala

AmplabJenkins · 2019-05-14T00:54:08Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/mango-prb/835/
Test FAILed.

AmplabJenkins · 2019-05-14T01:35:03Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/mango-prb/836/
Test PASSed.

heuermh · 2019-05-23T15:55:25Z

@akmorrow13 Go ahead and resolve conversations above that you feel have been resolved. After I wrap up all the ADAM post-release stuff I'll spend some time on convert stuff as described above.

akmorrow13 · 2019-05-23T20:53:22Z

Thanks @heuermh! Just resolved them.

AmplabJenkins · 2019-06-28T00:38:15Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/mango-prb/854/
Test PASSed.

akmorrow13 added 6 commits April 5, 2019 15:57

started http conversion, fails in frontend

68f054d

functions for local bams in AlignmentRecordMaterialization

0991943

implemented bam reader to work with local/http files

d580c0a

works on vcf.gz, bed and narrowPeak files

e062ee0

sam files now work

b11e0d9

loading local/http files works

4a27435

akmorrow13 changed the title ~~Optimize http~~ Files that are not in hdfs or ADAM format bypass Spark May 10, 2019

clean up

9ea49cf

akmorrow13 requested a review from heuermh May 10, 2019 18:41

akmorrow13 commented May 10, 2019


fixes for cluster

2a6dfba

heuermh reviewed May 13, 2019


works with vcf.bgz and vcf.bgzf.gz files

88fa02d

added test vcf files

87ab87b

akmorrow13 mentioned this pull request Jun 5, 2019

Add SAM record converters, VCF record converter API classes bigdatagenomics/convert#71

Merged

scalate version merge conflict

39aa701

akmorrow13 merged commit a989d74 into bigdatagenomics:master Jun 28, 2019
