no-SGE-dependent pipeline please #1
It should be fairly straightforward to run without SGE, although it may take quite some time to complete. I must admit this was on the to-do list, but I never got around to implementing it. Basically, something like this
should be changed to execute the content of the command file directly. The 'module' command can be removed as long as the appropriate binaries are in your $PATH.
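A minimal no-SGE sketch along those lines, assuming the pipeline writes one shell command file per task (the `*.cmd` glob is a guess at the naming, not something the repo guarantees):

```shell
#!/usr/bin/env bash
# Run each generated command file locally instead of submitting it with
# qsub. The '*.cmd' naming is an assumption about the pipeline's output.
set -euo pipefail
shopt -s nullglob   # an empty glob makes the loop simply do nothing

for cmd in *.cmd; do
    echo "Running ${cmd} ..."
    bash "${cmd}"   # replaces: qsub [options] ${cmd}
done
```

This runs the tasks one after another on a single core; it trades SGE's parallelism for simplicity.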
Thanks, I will try...
Ok, I have been struggling along.
Then java, when trying to run ps2pp, threw this error:
Changing the command gave me lines like these in the fa2pp.log:
followed by a lot of 0s and other output; the log file grew to unreadable proportions. Is that necessary?
There is a precompiled ps2pp.jar in the java/ folder. I realise the following command hadn't been updated, so try changing from
to
Or something similar. This will output a half-matrix of pairing probabilities, with a final line of unpaired probabilities, cf.: https://github.com/noncodo/BigRedButton/blob/master/dotaligner/example/snornal140_0001.pp I might have some spare time in the next few days to create a 'PC' branch, without all the SGE bling. May I also ask how many sequences you are querying? And what kind of hardware specs are you planning to run this on?
Hi, thanks for the quick feedback. Yes, I figured the java -jar bit out and recompiled dotaligner so that the java stuff ends up together with DotAligner in /opt/dotaligner/bin, to keep as close as possible to the original scripts, but somehow no pp files are generated. And only ps files are made for the first of the split file blocks (is that because, in an SGE context, each of these blocks gets outsourced to another computer together with the command file?). As mentioned above, I can get ps output and .pp output written into the log file (when keeping the command in its earlier form).

I query ~250 sequences found as targets for the RNAi machinery in our organism, which are collected in the input fasta file; basically to see whether some secondary structures could be recognition factors. I understood from the paper that your program could address this. But the absence of a description of all the steps in the pipeline makes it quite hard for a wet-lab scientist with a little programming knowledge to follow what is supposed to be happening.

I am doing this on a 64-bit computer with an i7-7700HQ CPU @ 2.80GHz, 32 GB RAM and ~250 GB of hard disk space available.
Hi, I managed to get the pp files created (this needed a 'for' loop). For simpler stop/restart points, the attached script tests for the presence of already-made files, so that it skips completed stages and can start directly with the next section that is still incomplete. The script does not use separate command files, and all the progress stuff has been removed; it made things too complicated and did not help prevent starting over from the beginning when some stages (say, the RNAfold stage) had completed OK. Please rename the attached txt to sh and chmod a+x. There is an exit 1 command so that the script runs up to the dotaligner section; it would be a great help if that gets 'de-SGE-ed'.

At the moment only one of the 8 available CPUs gets used. Would there be a way to get more CPUs to work on the job? Could this be a kind of alternative to the SGE approach? I know that in my files there will be sequences that are fairly similar. Would it help if these were clustered beforehand (but how would one do this, and how would it fit into the current pipeline)?
Yes, I was thinking of just splitting the input into N files and running N jobs in parallel via quick-and-dirty bash. But for fewer than 250 sequences, 1 CPU should be reasonable.
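For completeness, a quick-and-dirty sketch of that idea. It splits on FASTA headers so no sequence record is cut in half; the input file name and the per-chunk worker `process_chunk.sh` are hypothetical stand-ins for whatever per-file work the pipeline does:

```shell
#!/usr/bin/env bash
# Hypothetical sketch: split input.fasta into N chunks and process them
# in parallel with xargs -P. 'process_chunk.sh' is a placeholder.
set -euo pipefail

N=8   # number of parallel jobs, e.g. one per CPU core

[ -s input.fasta ] || { echo "no input.fasta found"; exit 0; }

# Split on FASTA headers ('>') so no sequence record is cut in half.
awk -v n="${N}" '/^>/ { f = sprintf("chunk_%02d.fa", i++ % n) } { print > f }' input.fasta

# Run up to N chunks at once; each worker handles one chunk file.
printf '%s\n' chunk_*.fa | xargs -P "${N}" -I {} bash process_chunk.sh {}
```

`xargs -P` is a lightweight alternative to GNU parallel and is available in both GNU and BSD findutils.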
Hi, as said, could you have a look at the dotaligner section that uses the second external script, 'worker.sge'? I do not understand how it works together with the main script, and I find it difficult to untangle (i.e. to bring that section over into launcher.sh and still have it do what it is supposed to do). It would be a great help if that part gets 'de-SGE-ed'. I was hoping to use the findings in a presentation in a couple of weeks.
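Without having untangled worker.sge's internals, a hedged sketch of the usual way to de-SGE an array job: SGE runs the script N times (`qsub -t 1-N worker.sge`) with `$SGE_TASK_ID` set to 1..N, which a plain loop can emulate (N_TASKS is an assumption; use however many task indices the pipeline would have submitted):

```shell
#!/usr/bin/env bash
# Emulate an SGE array job locally: export SGE_TASK_ID ourselves so that
# worker.sge can pick its work unit as it would under qsub -t 1-N.
# N_TASKS and the script name are assumptions about this pipeline.
set -euo pipefail

N_TASKS=10
if [ -f worker.sge ]; then
    for t in $(seq 1 "${N_TASKS}"); do
        SGE_TASK_ID="${t}" bash worker.sge
    done
fi
```

If worker.sge also reads other SGE variables (e.g. `$JOB_ID`), those would need the same treatment.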
Well, I managed to get through to the R stage (see the produced script) and then it went pear-shaped again:
I added some more feedback in the R script, and this is the section where it stops (~line 30 in the R script):
I noticed that the file "ids.txt", used in the section after that, has not been created during any of the preceding steps. I also cannot find in your code where that could have happened (the first mention of 'ids.txt' is in the R script). This absent file is required by the next R function, around line 36:
Where in your scripts should this file have been generated? What is it supposed to look like? What now? PS: below are some bits of what was generated before the pipeline crashed. scores_normalized.tsv:
dist.tsv:
The same error as above happens when trying the "RFAM_testing_sample_plusShuffled.fasta.gz" as input, which generated the following dist.tsv and scores_normalized.tsv:
What could make it stall on R 3.5.3? The '/'s in the tsv files?
After editing 'dist.tsv' to look like:
The error is still thrown. I am getting curious now, given the silence and the lack of any further feedback... Have you ever tested the published pipeline as such?
Well, it now stops after
on
As mentioned above... where does this come from?
Hi Brobr. Sorry for not actively supporting this; I am very busy managing several other projects. As mentioned above, I'll get around to this soon enough. You can also get in touch with @goranivanisevic, who also developed this project.
The R code I pushed to github is quite sloppy; we were in a hurry to submit this manuscript before the special issue deadline and, unfortunately, code review took the brunt of it. Luckily there are dedicated souls like yourself to whip it back into shape :) The "ids.txt" file should be created before launching the R script; it is clearly missing in this version. It should contain all unique sequence names, i.e. "RNA_RF00002_AF158725_5_74".
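Based on that description, a one-liner sketch to create the missing ids.txt (the input FASTA name is an assumption, and the names are taken as the first whitespace-delimited token of each header):

```shell
# Build ids.txt from the FASTA headers: strip the leading '>', keep only
# the first whitespace-delimited token, and de-duplicate.
grep '^>' input.fasta | sed 's/^>//; s/[[:space:]].*//' | sort -u > ids.txt
```

Running this between the FASTA preparation step and the R stage should give the R script what it expects, assuming the IDs in the tsv files match the FASTA headers.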
In addition to the '$' issue, there may be a typo here:
I think
Ok, thanks for the heads up. I'll try this today. |
Ok, I got my (10) clusters (and 47 for the RFAM set; is that correct?)! And I learned a bit of R along the way :-)
See the attached resulting script. The section "Process clusters from output" still needs correction/testing; as it is, it won't do much; the first step has to take the above into account.
The attached script runs to the end and creates output that looks like what is being aimed for; it is up to the maintainers to check this.
Hope this helps anyone trying this pipeline on a powerful computer; the longer the sequences you run the pipeline on, the longer it takes to complete (if it completes at all; locarna gave up on my test set).
I asked Sebastian Will for some advice on how to use LocARNA with larger fragments, which resulted in this possible adaptation of the script when one installs the latest version of LocARNA (2.0.0RC7 at the time of writing):
If there is still a memory problem, he suggests dropping the option. BTW, he says:
PS: I won't close the issue, as this whole page suddenly was no longer publicly visible, which would make any sharing efforts futile.
Hi, what do I need to do to adapt this pipeline to run on a Linux box, bypassing SGE (I have no access to a server running such a facility)? Would that restrict the number of RNA sequences to check?
Apart from the SGE problem, some errors appeared on stdout:
/launcher.sh: line 8: module: command not found
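A possible workaround for the 'module: command not found' error, as a sketch: comment the `module` lines out of launcher.sh and rely on $PATH instead (the needed binaries then have to be findable there).

```shell
#!/usr/bin/env bash
# Neutralise 'module ...' lines so launcher.sh runs outside an
# environment-modules setup. '&' in the replacement re-inserts the
# matched text, so the line is kept but commented out.
if [ -f launcher.sh ]; then
    sed -i.bak 's/^[[:space:]]*module /# &/' launcher.sh
fi
```

The `-i.bak` form works with both GNU and BSD sed and leaves a backup of the original script.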
The list of required libraries inside the script mentions 'RcolorBrewer', which should be 'RColorBrewer'.