SF Project

Installation

pyvenv env
source env/bin/activate
pip install --upgrade pip
pip install -r requirements.txt

Running

(env) ~/project/sf$ python analyze_sequence.py -h
usage: analyze_sequence.py [-h] --db DB --id ID --outfile OUTFILE --regex REGEX

optional arguments:
  -h, --help         show this help message and exit
  --db DB            database name such as nucleotide
  --id ID            database id such as 224589800
  --outfile OUTFILE  name of output file
  --regex REGEX      regex pattern such as "(A|C|G|T)"

Sample output

(env) ~/project/sf$ python analyze_sequence.py --db nucleotide --id 224589800 --outfile out.txt --regex "(A|C|G|T)"
T	65668756
A	65570891
C	47024412
G	47016562

Sample of csv file

(env) ~/project/sf$ head out.txt
T,10001,10001
A,10002,10002
A,10003,10003
C,10004,10004
C,10005,10005
C,10006,10006
T,10007,10007
A,10008,10008
A,10009,10009
C,10010,10010

Commentary

One of the challenges of this project was dealing with an XML file containing a very large element, such as the human chromosome 1 sequence. I employed the following strategies to avoid pulling the entire element into memory:

When retrieving the XML from NCBI, streamed the data to a file.
When extracting the sequence from XML, used the expat parser, which doesn't read each element into memory as other Python XML parsers do.
Wrote the sequence to a temp file and created a memory map to that file to allow for searching directly on the file.

For the human chromosome 1 (id=224589800), the process is a bit slow when searching the regex "(A|C|G|T)". The bottleneck here is the regular expression search, which is looking at each character and checking whether it matches one of four characters.

Design wise, the main components are:

analyze_sequence.py: script run from command line
efetch.py: contains a function that retrieves XML from NCBI and streams it to a file
sequence_extractor.py: class that extracts the sequence from the XML and writes it to a temp file.

Obviously for production code, I would add things such as error handling, logging, and unit tests, but this gives you some idea of my coding style.

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
.gitignore		.gitignore
README.md		README.md
analyze_sequence.py		analyze_sequence.py
constant.py		constant.py
efetch.py		efetch.py
requirements.txt		requirements.txt
sequence_extractor.py		sequence_extractor.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

SF Project

Installation

Running

Sample output

Sample of csv file

Commentary

About

Releases

Packages

Languages

societymartingale/sf

Folders and files

Latest commit

History

Repository files navigation

SF Project

Installation

Running

Sample output

Sample of csv file

Commentary

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages