Trying to get Spark pipeline working with slightly out of date code. #1313
We have a new Pipe API (#1114) to support this use case; with it you wouldn't need any of the SAM file parsing or conversion stuff at all, but rather something like:
Bowtie/bowtie2 is something worth having a proper Pipe API wrapper for; let me take a closer look. I don't remember if it can accept SAM as piped input, for example.
Sorry to bother again; I just wanted to ask on a more concrete level this time and make sure I understand, as this is all new to me. The example starts by taking some FASTQ file, and the goal is to align it with bowtie and write it out in SAM format. The way this code was doing it was with some perhaps overly complicated text parsing/formatting within Spark, then saving the file to SAM format. What you're suggesting is to read in the FASTQ file (what is the reason for using …?). So something like this:
This would eliminate a lot of the original code, especially all of the mapping and filtering.
Yes, that looks about right. Any of the FASTQ-related load methods would work; they all return …
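For what it's worth, here is a rough sketch of what that skeleton could look like against the ADAM API of this era. The `pipe` type parameters and the formatter names follow the cannoli sources; treat the exact signatures, and the assumption that `sc` is a `SparkContext` enriched with ADAM's implicits, as illustrative rather than authoritative, since these APIs changed between releases:

```scala
import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.adam.rdd.fragment.InterleavedFASTQInFormatter
import org.bdgenomics.adam.rdd.read.{ AlignmentRecordRDD, AnySAMOutFormatter }
import org.bdgenomics.formats.avro.AlignmentRecord

// Load reads as fragments; interleaved FASTQ keeps read pairs together,
// which is what the pipe's InFormatter will write to bowtie's stdin.
val fragments = sc.loadFragments("reads.ifq")

// Tell the pipe how to parse what the command writes to stdout (SAM).
implicit val uFormatter = new AnySAMOutFormatter

// "--12 -" makes bowtie read interleaved FASTQ from stdin; "-S" emits SAM.
val alignments = fragments.pipe[AlignmentRecord, AlignmentRecordRDD, InterleavedFASTQInFormatter](
  "bowtie -S ref_index --12 -")

alignments.saveAsSam("bowtie.sam")
```

The mapping and filtering from the original demo disappear because the in/out formatters handle the FASTQ-to-stdin and SAM-from-stdout conversion on each partition.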
First, thank you for the response. I'm able to run the code using the skeleton above. Can you comment on the Pipe API or point me in the direction of some documentation? If we were to run …
Hi @TomNash! Glad to hear that you were able to get the skeleton working! The pipe API has some inline docs, but they are limited. I've opened #1368 to address this more fully. In your example, that's pretty close to how it'd work! The pipe API will run that command in a distributed fashion, using ADAM/Spark to parallelize the work across all of the machines and to format the data that is going into the pipe. There are a few small differences from your example:
You can see an example of a similar command in our unit tests. In your case, I think your command would look like:
This would assume that bowtie is installed (and on the PATH) on every node in the cluster. Let me know if this helps, or if you've got any questions!
As mentioned elsewhere, we would like to build up a repository of commonly used bioinformatics tools wrapped in the Pipe API; a proposal for this is here: https://github.com/heuermh/cannoli

The example above is not quite right. First, the Scala compiler isn't able to deduce the types from the implicits, so the … This seems to work for me:

```
$ brew install homebrew/science/bowtie
$ bowtie-build ref.fa ref
$ brew install homebrew/science/adam
$ git clone https://github.com/heuermh/cannoli
$ cd cannoli
$ ./scripts/move_to_scala_2.11.sh
$ ./scripts/move_to_spark_2.sh
$ mvn install
$ ADAM_MAIN=org.bdgenomics.cannoli.Cannoli \
    adam-submit \
    --jars target/cannoli-spark2_2.11-0.1-SNAPSHOT.jar \
    -- \
    bowtie -single -bowtie_index ref reads.ifq bowtie.sam
Using ADAM_MAIN=org.bdgenomics.cannoli.Cannoli
Using SPARK_SUBMIT=/usr/local/bin/spark-submit
# reads processed: 6
# reads with at least one reported alignment: 0 (0.00%)
# reads that failed to align: 6 (100.00%)
No alignments
$ head bowtie.sam
@HD	VN:1.5	SO:unsorted
H06HDADXX130110:2:2116:3345:91806	4	*	0	0	*	...
```
Nice! That's awesome, @heuermh! I will take more of a gander at the code later this week.
Tried using the format above from @heuermh, getting the following output:

Some Googling has led me to believe this has to do with Maven dependency version issues, but I'm not sure where that problem would arise here.
Hi @TomNash! What versions of Java and Spark are you running?
Java:
Scala:
@TomNash that is what I see when running a build for Spark 1.x and Scala 2.10 on Spark 2.x with Scala 2.11. Did you use the `move_to_scala_2.11.sh` and `move_to_spark_2.sh` scripts?
Ran both scripts. Here is the full output from the install. I've tried a …
Yeah, …
Hmm, I'm sorry, I don't see what the problem could be. I just tried on a new Mac …
So the above example works after installing the binary ADAM release. Does cannoli work only with interleaved FASTQ? |
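On the interleaved-FASTQ question: the cannoli example above consumes a `.ifq` file, so if your reads are in two paired FASTQ files, one local workaround is to interleave them yourself first. A minimal plain-Scala sketch (the `interleave` helper is hypothetical, and it assumes strict 4-line records with both files in matching read order):

```scala
import scala.io.Source
import java.io.PrintWriter

// Interleave two paired FASTQ files into one: record 1 from R1, record 1
// from R2, record 2 from R1, and so on. Assumes 4-line FASTQ records and
// identical read ordering in both inputs.
def interleave(r1Path: String, r2Path: String, outPath: String): Unit = {
  val r1 = Source.fromFile(r1Path).getLines().grouped(4)
  val r2 = Source.fromFile(r2Path).getLines().grouped(4)
  val out = new PrintWriter(outPath)
  try {
    r1.zip(r2).foreach { case (recA, recB) =>
      (recA ++ recB).foreach(out.println)
    }
  } finally out.close()
}

// Usage: interleave("reads_1.fq", "reads_2.fq", "reads.ifq")
```

For anything beyond small inputs you would want a distributed equivalent (e.g. loading both files through ADAM and writing interleaved output), but this is enough to produce a `.ifq` to feed the example above.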
Hello, thanks for the great tool. I get the following error while running bowtie using ADAM 0.22 and Spark 2.1:
I've created a new issue in the cannoli repository to track this: https://github.com/heuermh/cannoli/issues/18
Closing as this has moved downstream to bigdatagenomics/cannoli#18. |
I'm going off the code posted here:
https://github.com/allenday/spark-genome-alignment-demo/blob/master/bin/bowtie_pipe_single.scala

I've made some changes successfully to accommodate changes in the ADAM code, but I'm getting hung up on trying to use the `SAMRecordConverter`. Currently the code is using the line below, but this fails, saying the type is not found:

val samRecordConverter = new SAMRecordConverter

I've got no Scala experience, but from what I understand the visibility of `private[adam] class SAMRecordConverter` is pretty restrictive. Is there a way I can access and use it?
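For background on why that fails: `private[adam]` is Scala's package-qualified visibility, so the class is accessible anywhere under the enclosing `adam` package (including subpackages) but nowhere outside it. A self-contained sketch with hypothetical package names showing the effect:

```scala
// Compile as one file. `Converter` is visible to code under org.example.adam
// (including its subpackages) but not from unrelated packages.
package org.example.adam {
  private[adam] class Converter {
    def convert(s: String): String = s.toUpperCase
  }
}

package org.example.adam.tools {
  object Demo {
    // Legal: org.example.adam.tools sits inside org.example.adam.
    def run(): String = new org.example.adam.Converter().convert("sam")
  }
}

package org.example.app {
  object Main {
    def main(args: Array[String]): Unit = {
      // `new org.example.adam.Converter()` here would NOT compile:
      // "class Converter in package adam cannot be accessed ..."
      println(org.example.adam.tools.Demo.run())
    }
  }
}
```

In practice this means calling code either has to live under the `org.bdgenomics.adam` package, or go through a public API instead, which is what the Pipe API suggestions earlier in this thread amount to.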