Lab session 2

This lab focuses mainly on regular expressions and MEME.

Basic Bash skills (mainly grep, awk, and sed) are important for using regular expressions efficiently.

task 1 : Quickly watch this video at 1.5x speed to understand how shell scripting works and how to write functions and loops

To learn Bash, go through this video (3 hrs), which starts from the basics: Bash scripting Video link

While watching the video mainly focus on

task 2 : Examples for regex patterns

To learn regular expressions with grep, awk, and sed: grep, awk and sed video link

To learn specialised regex pattern examples: Video

From the above videos you learnt

  • BASH scripting
  • Regex patterns

Now we will apply this knowledge to life science.

task 3 : Understanding the importance of grep with regex using bioinformatics examples link

task 4 : Understanding the importance of sed with regex using bioinformatics examples link

task 5 : Understanding the importance of awk with regex using bioinformatics examples link

After working through the grep, sed, and awk examples above, visit this cheat sheet: Regex cheat sheet. Check whether you are familiar with all of the following:

  • Anchors
  • Character class
  • Quantifiers
  • Escape characters
  • String replacements
  • Groups and ranges
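As a quick self-check, the categories above can be tried on a few toy inputs. This is only a sketch: the four sequences are made up for illustration, and the header is a shortened copy of the FASTA header format used later in this lab. It assumes a grep and sed that accept `-E` (extended regular expressions).

```shell
# Toy sequences, one per line (made up for illustration).
seqs=$(printf '%s\n' 'ATGAAATAG' 'GGGATGCCC' 'ATGNNNTGA' 'TTTT')

# Anchors: count lines that start with ATG.
starts_with_atg=$(printf '%s\n' "$seqs" | grep -c '^ATG')

# Character class (negated): count lines with any symbol other than A, C, G, T.
non_acgt=$(printf '%s\n' "$seqs" | grep -c '[^ACGT]')

# Quantifier: count lines with a run of three or more T.
t_runs=$(printf '%s\n' "$seqs" | grep -cE 'T{3,}')

# Group with alternation plus end anchor: count lines ending in a stop codon.
ends_with_stop=$(printf '%s\n' "$seqs" | grep -cE '(TAA|TAG|TGA)$')

# Escape characters and string replacement: \[ and \] match literal brackets,
# and \1 puts back whatever the group ([^]]+) captured.
header='>lcl|LR794089.1_cds_CAB3563250.1_1 [gene=mutS] [frame=2]'
gene=$(printf '%s\n' "$header" | sed -E 's/.*\[gene=([^]]+)\].*/\1/')

echo "start with ATG:     $starts_with_atg"
echo "contain non-ACGT:   $non_acgt"
echo "contain TTT or more: $t_runs"
echo "end with stop codon: $ends_with_stop"
echo "gene name:           $gene"
```

If any of these patterns is unclear, revisit the matching section of the cheat sheet before moving on to the practice questions.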

Practice questions

  1. Choose a restriction enzyme, look up its recognition site sequence, and count how many times that site occurs in the E. coli genome. Example: EcoRI, GAATTC

Hint

  • Download the E. coli genome in FASTA format here
  • Search using grep
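One possible way to approach the count is sketched here. Two details are easy to miss: `grep -c` counts matching lines rather than matches, and FASTA sequences are wrapped across lines, so a site that spans a line break is invisible to a line-by-line search. The function name `count_site` and the toy file are hypothetical, used only so the example runs without the real genome.

```shell
# count_site FASTA MOTIF
# Drop header lines, join the sequence into one line, then count every
# non-overlapping occurrence of MOTIF (grep -o prints one match per line).
count_site() {
    grep -v '^>' "$1" | tr -d '\n\r' | grep -o "$2" | wc -l | tr -d ' '
}

# Toy FASTA (made up): the second GAATTC is split across the line break.
toy=$(mktemp)
printf '%s\n' '>toy' 'AAGAATTCGGGA' 'ATTCTTGAATTC' > "$toy"

# Line-by-line search misses the site that spans the break.
naive=$(grep -o 'GAATTC' "$toy" | wc -l | tr -d ' ')

# Joining the lines first finds all sites.
joined=$(count_site "$toy" 'GAATTC')

echo "line-by-line: $naive"
echo "joined:       $joined"
rm -f "$toy"
```

For the real exercise, the same function would be called on the downloaded genome file, for example `count_site ecoli.fasta GAATTC`, where `ecoli.fasta` is a placeholder file name. Because GAATTC reads the same on both strands, searching one strand is enough for EcoRI; a non-palindromic site would also need a search for its reverse complement.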
  1. Regular expressions for biologists

  2. Assume a biologist comes to you with a file of more than 1000 coding sequences from a prokaryote and asks you to pick the ORF region for each gene. How do you pick the ORF regions?

The file consists of sequences like:

>lcl|LR794089.1_cds_CAB3563250.1_1 [gene=mutS] [protein=Methyl-directed mismatch repair] [frame=2] [protein_id=CAB3563250.1] [location=<1..>498] [gbkey=CDS]
CGCCATCCGGTGGTTGAACAGGTACTGAACGAGCCATTTATCGCCAACCCGCTGAACCTGTCGCCGCAGC
GTCGCATGTTGATCATTACCGGTCCGAATATGGGCGGTAAAAGTACCTATATGCGCCAGACCGCACTGAT
TTGTTTGCTACCCATTATTTCGAGCTGACCCAGTTACCGGAGAAAATGGAAGGCGTGGCTAACGTGCATC
TCGATGC

>lcl|LR794088.1_cds_CAB3563248.1_1 [gene=mutS] [protein=Methyl-directed mismatch repair] [frame=2] [protein_id=CAB3563248.1] [location=<1..>498] [gbkey=CDS]
CGCCATCCGGTAGTTGAACAAGTACTGAATGAGCCATTTATCGCTAACCCGCTGAATCTGTCGCCGCAGC
GCCGTATGTTGATCATCACCGGTCCGAACATGGGCGGTAAAAGTACCTATATGCGCCAGACCGCGTTGAT
CTGTTTGCCACCCACTATTTCGAGCTGACACAGTTACCGGAGAAAATGGAAGGCGTCGCCAACGTGCATC
TCGATGC

Hint

  • First linearize the FASTA file, which means printing each header on one line, followed by its whole sequence on one line
>lcl|LR794089........
CGCCATCCGGTGGTTGAA......
>lcl|LR794088.1_.......
CGCCATCCGGTAGTT.........
  • Use the logic that a coding sequence starts with ATG and ends with a stop codon (TAA, TAG, or TGA). Try to pick the region between the start codon and the stop codon using grep
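The two hints can be sketched as follows. This is one possible approach, not the only one: the function names `linearize_fasta` and `find_orfs` and the toy records are hypothetical, and the sequences are assumed to be upper-case A, C, G, T. Since grep's extended regular expressions have no lazy quantifier, the pattern spells out "any codon that is not a stop codon", so each match ends at the first in-frame stop codon after an ATG.

```shell
# Print each header on one line, then its whole sequence on one line.
linearize_fasta() {
    awk '/^>/ { if (seq != "") print seq; print; seq = ""; next }
         { gsub(/[ \t\r]/, ""); seq = seq $0 }
         END { if (seq != "") print seq }' "$1"
}

# A codon that is not TAA, TAG or TGA.
non_stop='([ACG][ACGT]{2}|T[CT][ACGT]|TA[CT]|TG[CGT])'
# ATG, then any number of non-stop codons, then a stop codon.
orf_re="ATG${non_stop}*(TAA|TAG|TGA)"

# Print every header unchanged, and under it each ORF found in its sequence.
find_orfs() {
    linearize_fasta "$1" | grep -oE "^>.*|$orf_re"
}

# Toy input (made up): two short records, the first wrapped over two lines.
toy=$(mktemp)
printf '%s\n' '>gene1 toy' 'CCATGAAA' 'TTTTAGGG' '' '>gene2 toy' 'GGATGTGA' > "$toy"
orfs=$(find_orfs "$toy")
printf '%s\n' "$orfs"
rm -f "$toy"
```

A record with no complete ORF prints only its header, and a sequence with several non-overlapping ORFs prints one per line, so a real run may need a further step to choose one ORF per gene (for example, the longest).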