converting fasta to adam eats a huge amount of time and memory #1891
Hi @antonkulaga! Is this a standard reference genome that you could share? By any chance, is it gzipped? Our FASTA converter is not particularly zippy, but the runtimes you are describing seem extremely slow.
@fnothaft yes, super-slow and often-freezing conversions are a big pain for me and the reason why I do not use ADAM as often as I would want to. I have no idea what is wrong there :( Here is the latest example of a frozen conversion that I had with another file: I converted this 500MB non-gzipped FASTA file http://agingkills.westeurope.cloudapp.azure.com/pipelines/indexes/BOWHEAD_WHALE/Alaska/Trinity.fasta to ADAM format.
where convert.sh is just:
My bigdata containers are at https://github.com/antonkulaga/bigdata-docker/tree/master/containers, including the ADAM one: https://github.com/antonkulaga/bigdata-docker/blob/master/containers/adam/Dockerfile . I converted a FASTA file that was already uploaded to HDFS. As the file was a bit more than 500MB, I gave ADAM only 8GB of RAM to do the job. The job was running for 2.4 hours and I had to stop it. The log is here:
@fnothaft, I also tried to run smaller files (like this one: http://agingkills.westeurope.cloudapp.azure.com/pipelines/indexes/BOWHEAD_WHALE/Alaska/Trinity.fasta.transdecoder.pep ) and with local Spark (I also tried to load from the local filesystem to exclude any HDFS issues).
Thanks for the detail, @antonkulaga! Is it ok to pull down those files to test? I've been developing benchmarks for some other things, and will look into this as well.
@heuermh yes, it is ok to take them. The first one is a de novo transcriptome assembly and is public (from http://www.bowhead-whale.org/downloads/ , the Alaska transcriptome); the second one is TransDecoder protein predictions for this assembly.
Running on our cluster results in an OutOfMemoryError:
That's an awfully large broadcast_2 value!
@heuermh how many GBs of RAM did you give to the job, and how many cores? I had the same error before I gave more memory to both the executor and the driver.
I'm on YARN with dynamicAllocation enabled, so all of it, and all of them ;) I ran with:
@heuermh, in that case I do not understand. Maybe there is something wrong with the way it splits the file. I have spark.files.maxPartitionBytes 536870912.
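(For reference, a minimal sketch of how that setting can be expressed programmatically; the value is just the 512 MB figure quoted above, not a tuning recommendation.)

```scala
import org.apache.spark.SparkConf

// illustrative only: the 536870912-byte (512 MB) value mentioned above,
// set in code; the same property can also live in spark-defaults.conf
val conf = new SparkConf()
  .set("spark.files.maxPartitionBytes", "536870912")
```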
The part where it splits the sequences up into fragments contributes to the issue. An alternative is to read each whole sequence using, say, biojava, and then split and convert, but then everything is put into RAM on the driver, at least in how I've currently written it.
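(A rough illustration of the trade-off described here, using plain string parsing instead of biojava; the function name is hypothetical, and the whole file must fit in driver memory.)

```scala
import org.apache.spark.SparkContext
import scala.io.Source

// sketch of the driver-side approach: parse the whole FASTA on the
// driver, then parallelize the records. Shuffle-free, but the entire
// file is held in driver RAM (and this sketch reads local files only).
def loadFastaOnDriver(sc: SparkContext, path: String) = {
  val records = Source.fromFile(path).mkString
    .split('>')                      // each record starts with a '>' header
    .filter(_.nonEmpty)
    .map { rec =>
      val lines = rec.split('\n')
      (lines.head.trim, lines.tail.mkString.trim) // (description, sequence)
    }
  sc.parallelize(records)
}
```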
As I have an urgent deadline, I do not have time to figure out what is wrong in FastaConverter now. I suspect that the groupByKey there may slow things down (have you considered substituting groupByKey with something else to reduce data shuffling?) and that the number of contigs is really large in my case because of the de novo assembly. @heuermh regarding your idea with biojava, do you mean this code:
where I should map the results with this converter (https://github.com/heuermh/biojava-adam/blob/master/src/main/java/org/biojava/nbio/adam/convert/ProteinSequenceToSequence.java) and then save to Parquet?
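(As an aside on the shuffle question above, a minimal sketch of the kind of substitution being asked about; names are hypothetical, and it assumes sequence lines can be concatenated associatively per contig key.)

```scala
import org.apache.spark.rdd.RDD

// hypothetical input: sequence lines keyed by contig index
def assemble(lines: RDD[(Long, String)]): RDD[(Long, String)] = {
  // groupByKey ships every raw line to the reducer before concatenating:
  //   lines.groupByKey().mapValues(_.mkString)
  // reduceByKey concatenates map-side first, shrinking the shuffle;
  // this naive version assumes lines arrive in order within each key
  lines.reduceByKey(_ + _)
}
```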
Yes, I also suspect that is true.
That complicates things some. The biojava stuff depends on #1505, which won't be merged until after ADAM version 0.24.0, currently due Feb 2nd, 2018. The method you quoted is for protein sequences, e.g.
A similar method exists for DNA sequences, loadFastaDna. All the biojava sequences are collected to a single node, the driver. For your use case, with small transcripts or short assembled sequences, most if not all of which would be less than the fragment size, it might help to add another loadFasta method that isn't clever about fragments and could avoid the shuffles.
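(A sketch of what calling these methods might look like; BiojavaAdamContext and loadFastaProtein are assumed names here, since only loadFastaDna and the converter class are confirmed above.)

```scala
import org.apache.spark.SparkContext
import org.biojava.nbio.adam.BiojavaAdamContext

// assumed API shape, following the package path of the converter linked
// above; names other than loadFastaDna are guesses, not confirmed API
val sc: SparkContext = ??? // an existing Spark context
val biojavaContext = new BiojavaAdamContext(sc)

// read protein FASTA via biojava, convert, and write Parquet
// (saveAsParquet assumed to exist on the returned dataset)
val proteins = biojavaContext.loadFastaProtein("Trinity.fasta.transdecoder.pep")
proteins.saveAsParquet("Trinity.fasta.transdecoder.pep.adam")
```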
@heuermh maybe it would be useful for ADAM as well? Users would then have an option to choose. It is quite common to have files with a lot of short sequences, where speed and memory consumption matter more than making sure that all fragments are <= some maximum length. Let users have the option of a fast loading method.
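(For illustration, one possible shape for such a method, using Hadoop's configurable record delimiter so that each '>' header starts a new input record; all names here are hypothetical, and each record is assumed small enough to hold in memory.)

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// hypothetical "not clever about fragments" loader: no splitting into
// fixed-size fragments and no shuffle, because each FASTA record is
// delivered whole by the input format
def loadFastaSimple(sc: SparkContext, path: String): RDD[(String, String)] = {
  val conf = new Configuration(sc.hadoopConfiguration)
  conf.set("textinputformat.record.delimiter", ">")
  sc.newAPIHadoopFile(path, classOf[TextInputFormat],
      classOf[LongWritable], classOf[Text], conf)
    .map { case (_, text) => text.toString }
    .filter(_.nonEmpty)
    .map { record =>
      val lines = record.split('\n')
      (lines.head.trim, lines.tail.mkString.trim) // (description, sequence)
    }
}
```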
Yep, that is what I'm suggesting. Personally I'd rather that wait until after the 0.24.0 release, though. I'll try to get #1505 up-to-date and rebased this weekend.
What I'm seeing now is that the time is spent in this section of FastaConverter:

```scala
// trim whitespace from FASTA lines
val filtered = rdd.map(kv => (kv._1, kv._2.trim()))
  // and drop comment lines that start with ;
  .filter((kv: (Long, String)) => !kv._2.startsWith(";"))

// create a map of line number --> FastaDescriptionLine objects
val descriptionLines: Map[Long, FastaDescriptionLine] = getDescriptionLines(filtered)

// broadcast this map
val indexToContigDescription = rdd.context.broadcast(descriptionLines)

// filter to non-description lines
val sequenceLines = filtered.filter(kv => !isDescriptionLine(kv._2))

val keyedSequences =
  if (indexToContigDescription.value.isEmpty) {
    sequenceLines.keyBy(kv => -1L)
  } else {
    // key by highest line number with a description line below
    // the line number of our row
    sequenceLines.keyBy(row => findContigIndex(row._1, indexToContigDescription.value.keys.toList))
  }
```

Is this what you see, @antonkulaga?
@heuermh something like that. In my case, it is the keyBy and then the first flatMap inside of it.
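(A note on why that keyBy can hurt: as written above, `indexToContigDescription.value.keys.toList` is rebuilt for every row, and findContigIndex presumably scans it, so the cost grows with lines × contigs, which is painful for a de novo assembly with many contigs. A hedged sketch of a cheaper lookup, assuming a pre-sorted array of description-line numbers built once and broadcast instead of the per-row list.)

```scala
import java.util.Arrays

// hypothetical replacement: binary search over the sorted description
// line numbers, O(log contigs) per row instead of a linear scan.
// Assumes at least one description line precedes rowIdx.
def findContigIndexFast(rowIdx: Long, sortedIndices: Array[Long]): Long = {
  val pos = Arrays.binarySearch(sortedIndices, rowIdx)
  if (pos >= 0) sortedIndices(pos)   // the row is itself a description line
  else sortedIndices(-pos - 2)       // nearest description line above it
}
```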
Still going at 44 hours, at:
I wonder why I need a lot of RAM and more than an hour with 8 cores just to convert an 800MB FASTA file to ADAM format.
The crazy memory and CPU consumption of ADAM's FastaConverter makes ADAM unusable in many use cases. What is also annoying is that when it lacks memory it often grinds on for ages instead of just crashing.
P.S. It looks like it also eats a lot of Spark driver RAM. However, I am not sure how I should change the RAM limits based on FASTA file size. For instance, what RAM parameters are optimal for a 1GB FASTA file?