Before doing anything, make sure you open the terminal window and log in, using the credentials you received from uppmax into your Solander accounts.
ssh username@solander.ibg.uu.se
- You might be asked to type in yes/no - type
yes
. - Then type in your password (you won't be able to see what you're typing in in the password field but it is working)
P.S. Make sure you're typing in the correct letters (confusion can occur as lower case 'l'/upper case 'I', the letter 'O' and zero etc may look the same)
Awesome. Once that's done we can start with the exercizes.
Unix is an operating system that was originally developed at Bell Labs in the 1970s. It is based around a "modular design" where tools do very distinct and narrow tasks. To complete more complex tasks multiple modules are then combined through the use of "pipes" - more about those later. If you are interested in learning more about UNIX then you can have a look at the wikipedia article for UNIX, it's quite thorough and well written.
While there is no straightforward answer to this question there are some things that are often brought up. UNIX (and Unix-like systems) in its design is quite simple and is nowadays very portable. This has lead to it being used to run anything from massive high-performance computer clusters to tiny single-board computers such as Arduino & raspberry pis. This ubiquity and popularity is probably one reason why it is still so popular. Since the year 2000 Mac computers are also running on Apple's own Unix system, another popular Unix-like system is the Android mobile operating system. While the original UNIX operating system is a commercial system there are many Unix-like operating systems such as Linux which are free and open-source (these are generally based on the Linux kernel). The ecosystem of open source and free distribution suits the academic world very well. It is not science if you aren't sharing your findings and how you came to your conclusions - that generally includes your code.
Interaction with UNIX-style systems is typically done through a command-line interface (CLI) - a terminal
of some sort. There is generally no GUI (graphical user interface), though there exist protocols to display graphics through the terminal such as X11
.
To communicate with the system there needs something to interpret the user's command. In a UNIX-like system, this is called a shell
, one of the most common ones - and the one found on Uppmax
is called Bourn-Again shell or simply bash
.
It's an interpreter and it's own (basic) programming language.
Enough exposition, let's get going. Open up the terminal and proceed with the exercise.
Terminal display for version 7 of Unix
(By Huihermit - Own work, CC0, https://commons.wikimedia.org/w/index.php?curid=30560188)
OBS! Do steps A/B only if you are using your own computer and not the local computers in the Labs. Any downstream issues you may encounter are solely your responsibility if you choose to do so. We strongly recommend using the computers in the labs but it is still a matter of choice.
For everyone using the local computers, jump directly to the 'Baby steps in Terminal' section below.
The macOS operating system is a Unix-system so modern Macs can natively interact with other Unix systems. macOS comes with an inbuilt terminal application called simply Terminal
.
It is functional and gets the job down, we, however, recommend that you instead install iTerm2
, which comes with several quality of life improvements.
Download it from https://iterm2.com/
MacOS comes with a very bare-bones version of Unix which lacks many tools that most Unix installations come with. Thus we recommend that you install Homebrew which is a package manager for Mac which makes it super easy to install missing programs and tools.
To install Homebrew
; open the terminal and run:
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
Note that this will take around 5 min to complete and you will be prompted for your user password. The installation requires around 10GB of space
Then you can install some useful tools:
brew install wget
brew install w3m
brew install tree
If you are on a Windows machine then the easiest option for you is to download MobaXterm
.
https://mobaxterm.mobatek.net/
Another option is to use the Ubuntu app from the Windows app store or to set up a dedicated Linux partition. If you know how to do that then go for it, if not stick to MobaXterm.
Open a local terminal and run the following command to install w3m
which we will use in today's lab:
apt-get install -y w3m-img
When you connect to a system you usually end up in your home directory. To see the path to where you stand use the pwd
command. Try it now:
pwd
it should return something like:
/home/user_name
As you move around in the systems directories this will of course change.
To make a directory (called folders on other systems) use the command mkdir
, try the following out:
mkdir directory1
To figure out what happened use the ls
command which lists the content of the current working directory:
ls
As you can see mkdir
makes directories, all (or at least most) UNIX commands are named after their function.
To go to your new directory use cd
(change directory);
cd directory1
Now try to create another directory called directory 2
mkdir directory 2
Take a look at what you did using the ls
command.
Looks like you created two directories, one called 2
and another called directory
. Since the basic syntax for UNIX commands is:
command option1 option2 option3
Each extra option to the command is separated by a space, which means that spaces and other white characters, such as tabs, are not allowed in file or directory names!
To fix this issue use rm
for, you guessed it - remove.
We can remove both at the same time:
rm directory 2
Ouch, rm
by default only removes files, not directories. You need to tell rm
that you want it to work recursively, using what's typically called a flag, in this case -r
.
rm -r directory 2
Check that is worked using ls
.
Ok, let's try and make a second directory again:
mkdir directory2
change into it using cd
cd directory2
So now your current directory should be home/your_usename/directory1/directory2
, to find out use pwd
. If you want to go one directory up, in our case from directory2 to directory1 you can use ..
notation:
cd ..
Check that it took you to the right place. Here are two neat things to know about cd
cd -
takes you back to the last place you were standing.
Just cd
with no options takes you to your home directory.
Now create a directory with your name. This is your profile directory and this is where you work from now on. Use the above instructions to move into the directory.
w3m -dump https://en.wikipedia.org/wiki/Principal_component_analysis > PCA.txt
The above command reads the Wikipedia page for Principal Component Analysis and extracts the body text and saves it to the file PCA.txt
. The >
is used to redirect output to a file.
Now that we have some text to work here are some tools for inspecting files, try them out on PCA.txt
.
cat - concatenates the file contents to standard out (the screen)
less - a nice and easy file viewer, press q to quit!
more - more than less
head - look at the head of a file, by default the first 10 lines.
tail - looks at the tail of a file, by default the 10 last lines.
It turns out that the PCA article is quite big, how big?
Counting is hard and slow for humans but easy for machines. Use the word count command wc
on the file to figure out how many lines, words and characters it contains.
Question 1: What did you get? How many lines and words?
Hmm, it's not that clear, is it? Have a look at the manual for wc
to try and figure it out.
All core UNIX commands have an inbuilt manual you can access it through the man
command:
man wc
Now that you figured that out, use the manual for head
to figure out how to save the first 100 lines of PCA.txt
and save them into short_pca.txt
.
You can use wc
to figure out if you did it correctly.
grep
is a useful tool that prints lines matching a provided pattern.
In our example we can use it to figure out how many lines contain the word PCA
:
grep PCA short_pca.txt
One of the fundamental concepts behind UNIX from the beginning was an emphasis on small task-specific programs. These programs could then be chained together into pipelines to perform more complex tasks.
Imagine that you have a machine that cuts down trees, de-barks them, and cuts them into planks. If you have trees and want planks then this is fine, but what happens if you want to cut down the tree and keep the whole log? Your big fancy machine only makes planks. In the UNIX world, your one machine would consist of smaller machines chained together. One that cuts down the tree, one that removed the bark, and one that cuts into planks. If you get a pile of logs from your neighbor then you can use one of your machines to make planks, etc.
This type of chaining together is called piping in UNIX and is done by the pipe character |
. e.g.
command 1 | command 2 | command3 > output_file
By using grep
, pipe
, and wc
we can now easily figure out how many lines of the Wikipedia article about Principle Component Analysis contains the word PCA
:
grep PCA short_pca.txt | wc -l
Try it for the full article as well!
Question 2: Write down how many times the term"PCA"
appears in both the full PCA.txt
and the short_pca.txt
files.
Remember that there can be more than one hit per line so you need to account for that! Using grep alone might not lead to the correct answer (Hint! Try combining it with flags!)
--
^
^P[0-9]+
will match at the beginning of a line(^
) the letter P literally, any number ([0-9]
) repeating (+
)
it will thus catch the top line but not the second:
P674353
454646464 P
These types of expression can be very useful and powerful
2.2 sed
is a powerful tool for editing streams of files. It is a common way of using rexes in Unix. It's often used to replace one thing with another:
if My_file.txt
contains:
dog dog dog
Then:
sed 's/dog/cat/' My_file.txt. # s is for substitute
cat dog dog
the first instance of dog
is replaced with cat
. We can also replace all instances using the g
global flag:
sed 's/dog/cat/g' My_file.txt. # sg - substitute globally
cat cat cat
You can use sed
on piped output from another program or straight on a single file. For a summary of some things that you can do with it have a look at this link.
Question 3:
Go to this file orange.csv and click on the Copy raw file option.
In your directory: Type in:
nano
And use Ctrl
+ V
(or Command + V on macOS) to paste the raw contents of the file.
Press Ctrl
+ X
(or Command + X on macOS/Linux) to save the file. When prompted, press Y
(yes) to confirm saving changes. When asked for the file name, type orange.csv
and press Enter
What you have just done is similar to saving a file in Notepad, but we used a program called Nano. Type ls
to see if the file is indeed created and saved successfully.
Try looking at the contents of the file by typing cat orange.csv
You have received an Excel-exported file in a strange format—a common scenario in bioinformatics. Now, the task is to clean and reformat this orange.csv
file into a proper .csv by ensuring:
- Decimal points are '.'
- Delimiters (the separators between columns) are commas ','
- Any stray letters mixed in with numeric data are removed
OBS!: You can use regular expressions (regex) alongside the sed command to accomplish these fixes.
Consult the manual with man sed
or info sed
, or search online for examples of sed
and regex
usage.
*** Bonus points: Attempt to do this in one single line if you feel adventurous! Double-check that each column is correctly separated by a comma and that decimals and numbers are displayed properly in the final file. Submit the exact command(s) you used to transform orange.csv into a properly formatted file. Be sure you understand what your command does.
4. Hidden file exercize
Okay, let's refresh our memory for a sec. We should have been doing the things up until now on the server. Let's take a moment and see where we are standing by typing pwd
in the terminal. What does it say? Are you in the right place where you want to be? If not, you can use cd
followed by the directory you want to move into. If you took a wrong turn, use cd ..
to go back.
The file needed for this exercise is hosted on GitHub, here https://github.com/UU-Human-Evolution/1MB335/blob/master/hidden_word_exercise.zip
.
To get it, stand in the directory you created, try using the command wget
:
wget https://github.com/UU-Human-Evolution/1MB335/raw/master/hidden_word_exercise.zip -O fullpathtoyourdirectory/hidden_word_exercise.zip
This command downloads hidden_word_exercise.zip
into your specified directory.
Once the file is downloaded, you can see that you can't read it as it is because it has been zipped (compressed). To unzip it:
unzip hidden_word_exercise.zip
This extracts the contents of hidden_word_exercise.zip
into the current directory.
When the unzipping is done, type ll
and take a look at the files in your directory. Now we have the hidden_word_exercise.zip
file and a directory called unix_tutorial_modified
which was the output of the unzip command.
OBS! In the future you may encounter a .gz
file instead. In such cases use:
gunzip hidden_word_exercise.gz
We will be using these commands a lot so it's really important that you grasp the logic behind transferring files from one place to another, moving around and some of the basic commands. It is always important to know if you are standing in the right place or you are moving a file to the right place.
Question 4: Extracting the Hidden Word
Now that you have some basic UNIX tools at your disposal, go into unix_tutorial_modified
and complete the hidden word exercise.
OBS! Just a heads up - when you run into files that end in .pl
(Perl files), run them by:
perl someprogramname.pl
and then see what it outputs.
For step-by-step instructions visit https://github.com/UU-Human-Evolution/1MB335/blob/master/hidden_word_exercise_instructions.md
Submit the hidden word.
Bash is a programming language in itself so it is possible to set up quite advanced workflows with it. The most simple bash script is just a normal command you would type on the command line saved to a file. Or more realistically you might want to run a couple of things that take a few minutes or hours after each other. This is something that you definitely will do in your future bioinformatics career.
An example:
echo “Wait for 5 seconds”
sleep 5
echo “Completed”
In your directory, add the above text to a file called sleep.sh
and execute it with:
bash sleep.sh
You can see that the code is executed sequentially, it does not progress to the next line until the previous one has finished.
You can chain any type of program/script that you can run on the commandline like this, even those you have written yourself like the perl
programs you ran in the hidden word excercise or the scripts you will write in Lab 2:
python my_python_script.py inputfile.txt
Just to show you how it's done, try doing the following set of instructions.
In the server window: Create a dicrectory with your name followed by _SERVER
and move into it.
Open a local terminal window, without logging into the server.
Create a directory with your name followed by _LOCAL
Create a text file called Bamse.txt
and write something in it.
Type pwd
to see the path to the directory and copy it.
Now run the following and press Enter:
scp -r thefullpathtoyour_LOCALdirectory/Bamse.txt your-username@solander.ibg.uu.se:/fullpathto_SERVERdirectory/
Replace <your-username>
with your actual username on the server.
You will be asked for a password. After you type it in and hit Enter, you should be able to see Bamse.txt
in the _SERVER
directory.
This is the end of Lab1, please make sure that you did and wrote down the answers to all of the questions. Also, make sure to delete any files that you no longer need - you can copy them somewhere else if you want to keep them.