A small exercise to get familiar with how sequences and alignments are stored / represented
- Align sequences by passing the fasta file to be aligned as the first argument
- The program will ask the user to choose between fasta and clustal format
- The aligned sequences are written to a file in the chosen format
- The output file has the same base name as the input file
  - _aln.fasta is appended if fasta format was requested
  - .aln is appended if clustal format was requested
- Three sequence files appear in the repository:
  - FOXP2.fasta: a file of sequences downloaded from the NCBI
  - FOXP2_aln.fasta: the output of running "align.seq.sh FOXP2" and choosing fasta format
  - FOXP2.aln: the output of running "align.seq.sh FOXP2" and choosing clustal format
- Fasta
  - Each record starts with a header line, which always begins with > followed by an identifier
  - The sequence follows on the next line or lines
- Clustal
  - A header line describes the alignment
  - Each line has an identifier at the beginning and the aligned sequence on the right
  - Each column represents the same position in all of the sequences
  - Dashes (---) represent missing data in those positions of a sequence
- The goal is to identify conserved regions and variable regions
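
The format notes above can be sketched in a few lines of Python. This is a minimal illustration, not the project's code: `parse_fasta` and `conserved_columns` are hypothetical helper names, and the three short aligned sequences are made up for the example. The sketch reads fasta text into a dictionary, then marks a column as conserved when every sequence has the same non-gap character there.

```python
def parse_fasta(text):
    """Parse fasta text into {identifier: sequence} (hypothetical helper)."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: the identifier is the first word after '>'
            parts = line[1:].split()
            name = parts[0] if parts else ""
            records[name] = ""
        elif name is not None:
            # Sequence data may span several lines; join them
            records[name] += line
    return records


def conserved_columns(records):
    """Return a mask with '*' under conserved columns and ' ' elsewhere."""
    sequences = [seq.upper() for seq in records.values()]
    mask = []
    for column in zip(*sequences):
        # Conserved: one distinct character in the column, and it is not a gap
        is_conserved = len(set(column)) == 1 and "-" not in column
        mask.append("*" if is_conserved else " ")
    return "".join(mask)


# Made-up aligned sequences, for illustration only
example = """>a
ACG-TA
>b
ACGTTA
>c
ACG-TT
"""

records = parse_fasta(example)
mask = conserved_columns(records)
for name, seq in records.items():
    print(f"{name} {seq}")
print(f"  {mask}")
```

Columns marked with `*` are conserved across all three sequences; the unmarked columns are variable or contain a gap.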
To keep dependencies isolated, create a virtual environment:
# Create a virtual environment in the .venv folder
python -m venv .venv
# Activate the virtual environment
source .venv/bin/activate
# Install the project's dependencies while the virtual environment is active
pip install -r requirements.txt
# Deactivate the virtual environment when you are done
deactivate

MAFFT is required to use the sequence alignment tools in this project. You can install MAFFT on macOS via Homebrew:
# Install Homebrew if not already installed
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
# Install MAFFT
brew install mafft

The following example demonstrates how MAFFT fails to produce the best alignment of circularly permuted sequences:
> seq1
ACGTAAATTAAA
> seq2
AAACGTAAATTA
seq1 --acgtaaattaaa
seq2 aaacgtaaatta--
       **********
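
To see why this alignment is not the best one, note that seq1 is a rotation of seq2. The sketch below is an illustration only, not part of the project or of MAFFT: `rotation_offset` and `identical_columns` are hypothetical helper names. It finds the rotation offset with the standard doubled-string check, then compares the number of identical columns in the linear alignment shown above with the number obtained after rotating seq2.

```python
def rotation_offset(a, b):
    """Return k such that b[k:] + b[:k] == a, or None if a is not a rotation of b."""
    if len(a) != len(b):
        return None
    # Every rotation of b appears as a substring of b + b
    k = (b + b).find(a)
    return k if k != -1 else None


def identical_columns(x, y):
    """Count columns where both aligned sequences share the same non-gap character."""
    return sum(1 for p, q in zip(x, y) if p == q and p != "-")


seq1 = "ACGTAAATTAAA"
seq2 = "AAACGTAAATTA"

k = rotation_offset(seq1, seq2)
rotated = seq2[k:] + seq2[:k]

# Linear alignment copied from the example above
linear = identical_columns("--ACGTAAATTAAA", "AAACGTAAATTA--")
# Alignment after rotating seq2 by k positions
circular = identical_columns(seq1, rotated)

print(k, linear, circular)  # prints: 2 10 12
```

Rotating seq2 by two positions matches all 12 bases, while the linear alignment matches only 10 and needs gaps at both ends.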