Munge GenBank files into FASTA sequences and tab-separated metadata.
This little C program will extract the following information from a GenBank file:
- name
- accession
- length
- submission date
- host
- country
- collection date
In addition to extracting this information, dates are reformatted e.g. 31-DEC-2001
becomes 2001-12-31
, which makes them more digestible to downstream software like BEAST, and country names are cleaned and matched to ISO3 codes.
gbmunge [-h] -i <Genbank_file> -f <sequence_output> -o <metadata_output> [-t] [-s]
Genbank_file
: filename of GenBank-formatted sequence file (normally downloaded assequence.gb
)sequence_output
: filename of FASTA outputmetadata_output
: filename of tab-separated metadata-t
: flag to- only output sequences with collection dates (of any precision)
- to name sequences as {accession}_{collection_date}
-s
: flag to include sequences in tab-delimited file
git clone https://github.com/sdwfrost/gbmunge
cd gbmunge
make
This will build gbmunge
in the src/
directory. Add the directory to the path, or move the executable somewhere.
A Genbank file of MERS Coronavirus sequences is provided in the test/
directory.
cd test
../src/gbmunge -i sequence.gb -f sequence.fas -o sequence.txt -t
Here are the first few lines of output in sequence.txt
:
name | accession | length | submission_date | host | country_original | country | countrycode | collection_date |
---|---|---|---|---|---|---|---|---|
JX869059_2012-06-13 | JX869059 | 30119 | 2012-12-04 | Homo sapiens | NA | NA | NA | 2012-06-13 |
KC164505_2012-09-11 | KC164505 | 30111 | 2013-07-12 | Homo sapiens | United Kingdom | United Kingdom | GBR | 2012-09-11 |
KC667074_2012-09-19 | KC667074 | 30112 | 2013-04-30 | Homo sapiens | United Kingdom: England | United Kingdom | GBR | 2012-09-19 |
KC776174_2012-04 | KC776174 | 30030 | 2013-03-25 | Homo sapiens | Jordan | Jordan | JOR | 2012-04 |
This code uses a slightly modified version of the GBParsy parser downloaded from the Google Code Archive. I found that the parsing of the LOCUS field wasn't working properly.