This project implements and compares several algorithms for DNA sequence analysis, focusing on exact and approximate pattern matching. The algorithms include:
- Naive String Matching (with alignment and comparison counts)
- Boyer-Moore String Matching (with alignment and comparison counts)
- Approximate Matching using k-mer Index
- Approximate Matching using Subsequence Index
The code is designed to analyze DNA sequences, such as those found in the human genome, and answer questions about the efficiency and accuracy of different matching strategies.
home-work2.ipynb
: Main Jupyter notebook containing code, explanations, and results.bm_preproc.py
: Boyer-Moore preprocessing and helper functions.kmer_index.py
: Implementation of a simple k-mer index for fast substring search.kmer_subseq_index.py
: Implementation of a subsequence index for approximate matching.chr1.GRCh38.excerpt.fasta
: Example FASTA file containing a DNA sequence excerpt.README.md
: Project documentation (this file).
- Python 3.x
- Jupyter Notebook (for running
.ipynb
files)
No external packages are required beyond the Python standard library.
-
Clone the repository (if not already done):
git clone https://github.com/Code-with-Bismillah/Algorithms-for-DNA-Sequencing-Module-2-Programming-Homework-2.git cd Algorithms-for-DNA-Sequencing-Module-2-Programming-Homework-2
-
Ensure all files are present:
home-work2.ipynb
bm_preproc.py
kmer_index.py
kmer_subseq_index.py
chr1.GRCh38.excerpt.fasta
-
Launch Jupyter Notebook:
jupyter notebook
-
Open and run
home-work2.ipynb
to see code, explanations, and results.
- Compares the pattern to every possible position in the text.
- Counts the number of alignments tried and character comparisons made.
- Simple but inefficient for large texts.
- Uses preprocessing (bad character and good suffix rules) to skip sections of the text.
- More efficient than naive matching, especially for long patterns and large alphabets.
- Counts alignments and character comparisons for performance analysis.
- Splits the pattern into segments and uses a k-mer index to find candidate matches.
- Allows for a specified number of mismatches (Hamming distance).
- Efficient for finding approximate matches in large genomes.
- Uses a subsequence index to find matches allowing for mismatches and gaps.
- Useful for more flexible approximate matching scenarios.
- Modify the pattern or text in the notebook as needed.
- Run each cell to execute the algorithms and view the results.
- The notebook prints the number of alignments, character comparisons, and matches found for each method.
- Bioinformatics Algorithms: An Active Learning Approach
- Gusfield, D. (1997). Algorithms on Strings, Trees, and Sequences.
This project is for educational purposes.