It is important to learn basic Bash (mainly the use of grep, awk, and sed) in order to use regular expressions efficiently.
task 1 : Quickly watch this video at 1.5x speed and understand how shell scripting works and how to write functions and loops
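As a small taste of the functions and loops covered in the video, here is a minimal sketch. The function name `gc_count` and the three short sequences are made up for illustration; they do not come from the video.

```shell
# gc_count: hypothetical helper that counts the G and C characters in a string.
gc_count() {
    # tr -cd deletes every character that is NOT in the set GCgc,
    # wc -c counts what is left, and the last tr strips padding spaces.
    printf '%s' "$1" | tr -cd 'GCgc' | wc -c | tr -d ' '
}

# Loop over a few made-up sequences and call the function on each one.
for seq in ATGC GGCC ATAT; do
    echo "$seq $(gc_count "$seq")"
done
```

The function receives its argument as `$1`, and `$(...)` captures its output inside the loop.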
To learn Bash, go through this video (3 hrs), which starts from the basics: Bash scripting Video link
While watching the video mainly focus on
To learn regular expressions with grep, awk, and sed: grep, awk and sed video link
To learn specialised regex pattern examples: Video
From the above videos you learnt
- BASH scripting
- Regex patterns
Now we will apply this knowledge to the life sciences
task 3 : Understanding the importance of grep with regex using bioinformatics examples link
task 4 : Understanding the importance of sed with regex using bioinformatics examples link
task 5 : Understanding the importance of awk with regex using bioinformatics examples link
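To give a flavour of what such examples look like, here is a minimal sketch of one typical use for each tool on a FASTA file. The two records are made up for illustration and the commands are not taken from the linked pages.

```shell
# Build a tiny made-up FASTA file so the commands below can be run as-is.
demo=$(mktemp)
cat > "$demo" <<'EOF'
>seq1 [gene=geneA]
ATGCCGTA
>seq2 [gene=geneB]
GGATCC
EOF

# grep: count the records by counting header lines (lines starting with ">").
n_records=$(grep -c '^>' "$demo")

# sed: print only the sequence IDs by stripping the leading ">" and
# everything from the first space onward.
short_headers=$(sed -n 's/^>\([^ ]*\).*/\1/p' "$demo")

# awk: print the length of every non-header line.
lengths=$(awk '!/^>/ { print length($0) }' "$demo")

echo "records: $n_records"
echo "$short_headers"
echo "$lengths"
rm -f "$demo"
```

Note that the awk line reports the length of each line, so it gives the full sequence length only when each sequence sits on a single line.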
After understanding the grep, sed, and awk examples above, visit this cheat sheet: Regex cheat sheet. Check whether you are familiar with all of the following:
- Anchors
- Character classes
- Quantifiers
- Escape characters
- String replacements
- Groups and ranges
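A quick way to check yourself is to try one small pattern per category. The sketch below does this on four made-up strings; the strings and variable names are assumptions for illustration only.

```shell
# Made-up strings, one per line, used only for illustration.
seqs=$(printf '%s\n' ATGAAATAA CCATGGG ATGNNTAG 'id.1')

# Anchor: count lines that start with ATG.
anchored=$(printf '%s\n' "$seqs" | grep -c '^ATG')

# Character class + quantifier: lines with a run of four or more A/T bases.
runs=$(printf '%s\n' "$seqs" | grep -cE '[AT]{4,}')

# Escape character: match a literal dot rather than "any character".
dots=$(printf '%s\n' "$seqs" | grep -c '\.')

# Group with alternation + end anchor: lines ending in a stop codon.
stops=$(printf '%s\n' "$seqs" | grep -cE '(TAA|TAG|TGA)$')

# String replacement: turn every T into U on the first line only.
rna=$(printf '%s\n' "$seqs" | sed -n '1s/T/U/gp')

echo "$anchored $runs $dots $stops $rna"
```

If any of the five results surprises you, that is the cheat sheet section to revisit.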
- Search for a restriction enzyme, pick its recognition site sequence, and find how many times that site occurs in the E. coli genome. Example: EcoRI, GAATTC
- Download the E. coli genome in FASTA format here
- Search using grep
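One possible way to do the count is sketched below. The helper name `count_sites` and the tiny demo file are assumptions for illustration; with the real download you would pass your own file name instead. Two things matter here: FASTA sequence lines are wrapped, so a site can straddle a line break, and `grep -c` counts matching lines rather than matches, so the sketch joins the lines first and counts with `grep -o`.

```shell
# count_sites: hypothetical helper. $1 = FASTA file, $2 = recognition site.
count_sites() {
    grep -v '^>' "$1" |   # drop header lines
        tr -d '\n' |      # join wrapped sequence lines into one long line
        grep -o "$2" |    # print every match on its own line
        wc -l | tr -d ' ' # count the matches
}

# Made-up demo file in which one GAATTC site is split across two lines.
demo=$(mktemp)
printf '%s\n' '>demo' 'AAGAAT' 'TCGGGAATTCAA' > "$demo"

sites=$(count_sites "$demo" GAATTC)
per_line=$(grep -v '^>' "$demo" | grep -o GAATTC | wc -l | tr -d ' ')

echo "joined: $sites, line by line: $per_line"   # prints "joined: 2, line by line: 1"
rm -f "$demo"
```

GAATTC reads the same on the reverse strand, so searching one strand is enough for EcoRI. The sketch assumes uppercase bases, and if the file holds more than one record, joining everything can create a false site at the boundary between records.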
Assume a biologist comes to you with a file of more than 1000 coding sequences from a prokaryote and asks you to pick the ORF region for each gene. How do you pick the ORFs?
The file consists of sequences like:
>lcl|LR794089.1_cds_CAB3563250.1_1 [gene=mutS] [protein=Methyl-directed mismatch repair] [frame=2] [protein_id=CAB3563250.1] [location=<1..>498] [gbkey=CDS]
CGCCATCCGGTGGTTGAACAGGTACTGAACGAGCCATTTATCGCCAACCCGCTGAACCTGTCGCCGCAGC
GTCGCATGTTGATCATTACCGGTCCGAATATGGGCGGTAAAAGTACCTATATGCGCCAGACCGCACTGAT
TTGTTTGCTACCCATTATTTCGAGCTGACCCAGTTACCGGAGAAAATGGAAGGCGTGGCTAACGTGCATC
TCGATGC
>lcl|LR794088.1_cds_CAB3563248.1_1 [gene=mutS] [protein=Methyl-directed mismatch repair] [frame=2] [protein_id=CAB3563248.1] [location=<1..>498] [gbkey=CDS]
CGCCATCCGGTAGTTGAACAAGTACTGAATGAGCCATTTATCGCTAACCCGCTGAATCTGTCGCCGCAGC
GCCGTATGTTGATCATCACCGGTCCGAACATGGGCGGTAAAAGTACCTATATGCGCCAGACCGCGTTGAT
CTGTTTGCCACCCACTATTTCGAGCTGACACAGTTACCGGAGAAAATGGAAGGCGTCGCCAACGTGCATC
TCGATGC
- First linearize the FASTA, which means printing each header on one line, followed by its whole sequence on one line
>lcl|LR794089........
CGCCATCCGGTGGTTGAA......
>lcl|LR794088.1_.......
CGCCATCCGGTAGTT.........
- Use the logic that a coding sequence starts with ATG and ends with a stop codon (TAA, TAG, or TGA). Try to pick the region between the start codon and the stop codon using grep
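The two steps above can be sketched as follows. This is one possible approach, not the only one: the function names, the `nonstop` pattern, and the two demo records are assumptions for illustration. The pattern spells out every codon that is not a stop codon, so the match runs from the first ATG to the first in-frame stop without needing a lazy quantifier, which plain `grep -E` does not support.

```shell
# One codon that is NOT a stop codon (stops are TAA, TAG, TGA).
nonstop='([ACG][ACGT]{2}|T[CT][ACGT]|TA[CT]|TG[CGT])'
# ATG, then any number of non-stop codons, then one stop codon.
orf_re="ATG${nonstop}*(TAA|TAG|TGA)"

# linearize_fasta: print each header, then its sequence joined on one line.
linearize_fasta() {
    awk '/^>/ { if (seq != "") print seq; print; seq = ""; next }
         { seq = seq $0 }
         END { if (seq != "") print seq }' "$1"
}

# first_orfs: print each header followed by the first ORF found by grep.
first_orfs() {
    linearize_fasta "$1" | while IFS= read -r header && IFS= read -r seq; do
        orf=$(printf '%s\n' "$seq" | grep -oE "$orf_re" | head -n 1)
        printf '%s\n%s\n' "$header" "$orf"
    done
}

# Made-up demo records; the first one is wrapped over two lines.
demo=$(mktemp)
printf '%s\n' '>geneA' 'CCATGAAA' 'TGGTAACC' '>geneB' 'GGGATGCCCTGATT' > "$demo"

linear_out=$(linearize_fasta "$demo")
orf_out=$(first_orfs "$demo")
echo "$orf_out"
rm -f "$demo"
```

The sketch assumes uppercase A, C, G, T only and at least one sequence line per record. For partial sequences, such as the example records marked `location=<1..>498`, the first ATG found this way is not necessarily the true start codon, and a record with no complete ORF gets an empty line.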