This is a comparison of streaming and nonstreaming versions of the Eel Pond mRNAseq Protocol. This page contains workflows for the streaming and nonstreaming pipelines on both a subset and the full set of the Nematostella data (from Tulin et al.).
Analysis commands:
- Streaming, full dataset
- Nonstreaming, full dataset: one, two, three
- Streaming, subset
- Nonstreaming, subset: one, two, three
For data subset:
Start by firing up an Amazon EC2 instance (m3.xlarge for the data subset). Instructions on setting up an EC2 instance are here.
For full data set:
Boot up an m4.4xlarge Amazon EC2 instance. Under "Add Storage", add 600 GB on the root volume. (An m4.large can be used for a smaller machine.)
Start tmux and mount data:
tmux
lsblk # lists all possible volumes, identify which is right
mkdir data/
sudo mount /dev/xvdf data/ # fill in the correct device name (the four characters after /dev/, from lsblk). Note: mounting hides a directory's existing contents, so mount onto an empty directory
df
ls
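If it is unclear which device to mount, the lsblk output can be filtered for devices that have no mountpoint yet. A minimal sketch on a fabricated sample (real device names and sizes will vary); on the instance, pipe the actual lsblk output through the same awk filter:

```shell
# Sketch: pick out block devices with an empty MOUNTPOINT column.
# The sample below mimics lsblk's NAME/SIZE/MOUNTPOINT layout.
printf 'NAME    SIZE MOUNTPOINT\nxvda      8G /\nxvdf    600G\n' > lsblk-sample.txt
awk 'NR > 1 && $3 == "" {print $1}' lsblk-sample.txt   # prints: xvdf
rm lsblk-sample.txt
```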
Continuing on as root, start at the top of the protocols and continue with the streaming- or nonstreaming-specific commands for whichever pipeline is being run.
For both datasets:
Install git-core for literate-resting text extraction of khmer-protocols.
sudo bash
apt-get update
chmod a+rwxt /mnt
apt-get -y install git-core python-dev python-pip python-virtualenv sysstat
Extract commands from the protocols. Note: the ctb branch is the nonstreaming version.
For streaming (full or subset) or the nonstreaming full data set:
cd /home/ubuntu
rm -fr literate-resting khmer-protocols
git clone https://github.com/dib-lab/literate-resting.git
git clone https://github.com/dib-lab/khmer-protocols.git -b jem-streaming
cd khmer-protocols/mrnaseq
Note: for the nonstreaming data subset, use -b ctb instead.
For all methods: Extract commands from protocols.
for i in [1-9]-*.rst
do
/home/ubuntu/literate-resting/scan.py $i || break
done
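The [1-9]-*.rst glob in the loop above matches files whose names start with a single digit 1-9 followed by a hyphen, so a 0-prefixed intro file or unnumbered notes are skipped. A small demonstration with made-up file names (the real protocol files differ):

```shell
# Demonstrate the [1-9]-*.rst glob used by the extraction loop.
mkdir -p glob-demo
touch glob-demo/1-quality.rst glob-demo/2-diginorm.rst glob-demo/0-intro.rst glob-demo/notes.rst
( cd glob-demo && ls [1-9]-*.rst )   # lists 1-quality.rst and 2-diginorm.rst only
rm -r glob-demo
```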
In another ssh session, run sar to monitor resources. Use tmux to do so in the same window. Note: ctrl+b means press the control key and b at the same time; this won't work via copy-paste. tmux lets you have multiple windows within the same ssh session.
Now create a new window so sar can run in it while you run commands elsewhere:
ctrl+b c
Start running sar:
sar -u -r -d -o times.dat 1 # record CPU (-u), memory (-r), and disk (-d) activity every 1 s to the binary file times.dat
Change windows to be able to run commands:
ctrl+b n
When you return to the instance, type:
tmux attach
FULL DATASET (mounted manually)
Streaming:
bash 2-streaming-full.rst.sh
Nonstreaming:
for i in [3-5]-*.rst.sh
do
bash $i
done
DATA SUBSET
Streaming:
bash 1-streaming-subset.rst.sh
Nonstreaming (from the ctb branch instead of jem-streaming):
for i in [1-3]-*.rst.sh
do
bash $i
done
After the pipeline finishes, use the following commands in the corresponding tmux window to extract disk, CPU, and RAM information from sar:
sar -d -p -f times.dat > disk.txt
sar -u -f times.dat > cpu.txt
sar -r -f times.dat > ram.txt
gzip *.txt
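For a quick first look at the CPU log, the %idle column (the last field in sar -u output) can be averaged with awk. A sketch on fabricated sample lines that mimic sar's layout; on a real cpu.txt, skip sar's header and "Average:" lines and check the column positions against the header first:

```shell
# Sketch: average the last column (%idle in sar -u output) of a CPU log.
# The two sample lines below are made up.
printf '12:00:01 AM all 25.0 0.0 5.0 0.0 0.0 70.0\n12:00:02 AM all 35.0 0.0 5.0 0.0 0.0 60.0\n' > cpu-sample.txt
awk '{sum += $NF; n++} END {printf "avg idle: %.1f%%\n", sum/n}' cpu-sample.txt   # prints: avg idle: 65.0%
rm cpu-sample.txt
```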
Use scp to transfer files to your local computer (Cyberduck also works, but this is quicker). Fill in the correct paths, replacing the < > placeholders.
Run this command on your local computer, from the directory where you want the assembly:
scp -i ~/Downloads/amazon-streaming-current.pem ubuntu@<Public DNS>:/mnt/work/trinity_out_dir/Trinity.fasta .
Also copy times.dat and the disk, CPU, and RAM files to your local computer, again running these commands locally:
scp -i ~/Downloads/amazon-streaming-current.pem ubuntu@<Public DNS>:/home/ubuntu/khmer-protocols/mrnaseq/times.dat .
scp -i ~/Downloads/amazon-streaming-current.pem ubuntu@<Public DNS>:/home/ubuntu/khmer-protocols/mrnaseq/*.txt.gz .
scp -i ~/Downloads/amazon-streaming-current.pem ubuntu@<Public DNS>:/home/ubuntu/times.out .
To parse the sar data, run "./extract.py xvdf" (the argument specifies the disk of interest) to get a log.out file. The parsing script can be found in this repository: https://github.com/jessicamizzi/sartre-jessica. Note: the sartre-jessica extract script will only work for full data sets that have been mounted; for data subsets, clone and use https://github.com/ctb/sartre instead.
If readstats.py was run:
scp -i ~/Downloads/amazon-streaming-current.pem ubuntu@<Public DNS>:/mnt/work/readstats.out .
Install Transrate:
cd
curl -O -L https://bintray.com/artifact/download/blahah/generic/transrate-1.0.1-linux-x86_64.tar.gz
tar xzf transrate-1.0.1-linux-x86_64.tar.gz
export PATH=$PATH:$HOME/transrate-1.0.1-linux-x86_64
echo 'export PATH=$PATH:$HOME/transrate-1.0.1-linux-x86_64' >> ~/.bashrc
export PATH=$PATH:$HOME/transrate-1.0.1-linux-x86_64/bin
echo 'export PATH=$PATH:$HOME/transrate-1.0.1-linux-x86_64/bin' >> ~/.bashrc
transrate --install-deps ref
Make a working directory:
mkdir /mnt/transrate
cd /mnt/transrate
Copy the assembly over, rename it, run sed to fix formatting problems, and run transrate:
cp /mnt/work/trinity_out_dir/Trinity.fasta .
sed 's_|_-_g' Trinity.fasta > Trinity.fixed.fa
transrate --assembly Trinity.fixed.fa
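The sed call above uses underscores as the substitution delimiter, so the literal | character being replaced needs no escaping: every pipe in a FASTA header becomes a hyphen. A quick check on a fabricated header line:

```shell
# Pipes in FASTA headers can confuse downstream tools; the sed expression
# rewrites them as hyphens. The header below is made up for illustration.
echo '>comp0_c0_seq1|len=731' | sed 's_|_-_g'   # prints: >comp0_c0_seq1-len=731
```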
Then, download your beautiful assemblies.csv stats file!
scp -i ~/Downloads/amazon-streaming-current.pem ubuntu@<Public DNS>:/mnt/transrate/transrate_results/assemblies.csv .
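Once assemblies.csv is local, individual metrics can be pulled out with standard tools. A sketch on a fabricated two-line CSV; the column name n_seqs and the values are assumptions, so check the real transrate header first:

```shell
# Sketch: extract the second column from a transrate-style CSV.
printf 'assembly,n_seqs,mean_len\nTrinity.fixed.fa,25000,731.2\n' > assemblies-sample.csv
cut -d, -f2 assemblies-sample.csv   # prints n_seqs, then 25000
rm assemblies-sample.csv
```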
tmux commands cheat sheet:
tmux
ctrl+b c # create new window
ctrl+b n # move to next window
ctrl+b p # move to previous window
tmux attach # attach back to tmux after logging back on