GitHub - boalang/NR: Detecting and correcting misclassified sequences in the large-scale public databases

Detecting and correcting misclassified sequences in the large-scale public databases

Dataset: Non-Redundant (NR) and CD-HIT clustering information

The Protocol Buffers schema and the step-by-step data generation process are shown here.
A JSON version of NR for MongoDB is also provided.

Detected taxonomically misclassified sequences

Detected misclassification in the clusters
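As a rough illustration of what flagging a misclassification inside a cluster can look like, the sketch below applies a simple majority vote to the taxonomic labels of one cluster. This is an assumption made for illustration only, not the detection method used in this repository. The function name `flag_minority_members`, the `min_share` threshold, and the dictionary layout are all hypothetical.

```python
from collections import Counter


def flag_minority_members(cluster, min_share=0.8):
    """Flag sequences whose label disagrees with the cluster's dominant label.

    cluster:   dict mapping sequence ID -> taxonomic label (hypothetical layout).
    min_share: fraction of the cluster the dominant label must reach before
               any member is flagged (hypothetical threshold).

    Returns (consensus_label, sorted list of disagreeing IDs), or (None, [])
    when the cluster is empty or no label reaches min_share.
    """
    if not cluster:
        return None, []
    counts = Counter(cluster.values())
    label, n = counts.most_common(1)[0]
    if n / len(cluster) < min_share:
        # No label is dominant enough, so nothing is flagged.
        return None, []
    outliers = sorted(sid for sid, tax in cluster.items() if tax != label)
    return label, outliers


# Toy cluster with made-up IDs and labels.
example = {
    "seq1": "Bacteria",
    "seq2": "Bacteria",
    "seq3": "Bacteria",
    "seq4": "Bacteria",
    "seq5": "Eukaryota",
}
consensus, suspects = flag_minority_members(example)
print(consensus, suspects)
```

A vote like this only produces candidates for review. It ignores the taxonomic tree and treats all labels as unrelated, so a real pipeline would need more than this.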

Correcting misclassified sequences

Experiments

Boa_g: Boa for genomics

Boa_g is a domain-specific language and infrastructure on top of Hadoop for genomics data. Website: https://boalang.github.io/bio/

Boa_g example on the infrastructure: http://boa.cs.iastate.edu/examples/boag/index.php

Prerequisites

Boa_g requires Java, because the Boa_g compiler is written in Java. Java can be downloaded here.
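To confirm the prerequisite before going further, a small check like the one below can help. It is a minimal sketch that only looks for a `java` executable on the PATH and prints its version banner. The helper name `java_version_banner` is made up for this example, and the README does not state which Java version Boa_g needs, so no version is checked.

```python
import shutil
import subprocess


def java_version_banner():
    """Return the first line printed by `java -version`, or None if Java is not on PATH."""
    java = shutil.which("java")
    if java is None:
        return None
    # `java -version` writes its banner to stderr rather than stdout.
    result = subprocess.run([java, "-version"], capture_output=True, text=True)
    lines = (result.stderr or result.stdout).splitlines()
    return lines[0] if lines else None


banner = java_version_banner()
print(banner if banner else "Java not found on PATH; install it before running Boa_g.")
```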

Run Boa_g

Boa_g can be run from the command line, from a Jupyter notebook, in a Docker container, or on Hadoop. You can also set up a programming environment in Eclipse.

Boa_g Compiler source code

The Boa_g compiler is written in Java. See the source code.
A video with step-by-step instructions for setting up a programming environment in Eclipse for the Boa compiler is available: link

Boa_g Query Script examples:

Query over NR database

Download dataset and VirtualBox

Google Drive Link
A web interface is also implemented in the Ubuntu Linux image and can be viewed inside the VirtualBox virtual machine.

Boa queries
Command_Line
Docker
Top Functions
VirtualBox
compiler
experiments
jupyter_notebooks
misclassification
supplemental
.gitignore
README.md
