Skip to content

Latest commit

 

History

History
7 lines (4 loc) · 842 Bytes

README.md

File metadata and controls

7 lines (4 loc) · 842 Bytes

Metazoa Assemblies

This repository houses the scripts for the manuscript, "Towards a genome sequence for every animal: where are we now?"

dedup_list_tab.py takes as input genome assembly metadata harvested from NCBI through the NCBI datasets tool in the form of a csv. The CSV must be sorted first by taxid and second by contig N50. Then it will choose the assembly for each taxon with the longest contig N50. The script will also discover whether an annotation exists for that species on NCBI (for any assembly).

scrape_assembly_info.py is a web scraper based on Beautiful Soup that will scrape metadata that is not included in the standard NCBI datasets metadata. All that the script needs as input is an assembly accession number.