GitHub - raalesir/HTSeq-Hadoop

Here we describe HTSeq-Hadoop, which extends the HTSeq package with Hadoop implementations.

HTSeq provides an Application Programming Interface (API) for manipulating raw and processed Next Generation Sequencing (NGS) data in the Python programming language. A limitation of HTSeq is that it generally runs in a single thread, although in some cases it can scale up to a whole multicore node.

We modified two HTSeq tools that are widely used in RNA-seq analysis: htseq-count, which counts how many reads are mapped to the genes, and htseq-qa, which assesses the quality of raw or mapped reads. Both were adapted to run in the Hadoop framework in order to significantly increase scalability.

HTSeq-Hadoop currently provides two utilities:

HTSeqCount -- mimics the functionality of htseq-count
HTSeqQA -- mimics the functionality of htseq-qa

The runtime performance of HTSeqCount under Hadoop was compared with that of a Pig Latin script on the Apache Pig platform. Because HTSeq-Hadoop is built on the Hadoop Streaming library, it can also be run with the GNU Parallel utility in multiple threads on a multicore Linux workstation or on a cluster node.
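
To illustrate why a Hadoop Streaming design also works outside Hadoop, here is a minimal sketch of a streaming-style mapper and reducer for per-gene read counting. It is not code from this repository. The input format (tab-separated `read_id<TAB>gene_id` lines) and the function names `map_reads`, `reduce_counts` and `run_local` are assumptions made for the example; htseq-count itself assigns reads to genes from alignments and an annotation file.

```python
from itertools import groupby


def map_reads(lines):
    """Mapper: emit (gene_id, 1) for each usable input record.

    Assumes hypothetical 'read_id<TAB>gene_id' lines; this stands in
    for the real read-to-gene assignment step.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2 or not fields[1]:
            continue  # skip malformed or unassigned records
        yield fields[1], 1


def reduce_counts(pairs):
    """Reducer: sum the values of consecutive pairs that share a key.

    Hadoop Streaming hands the reducer its input sorted by key,
    so grouping consecutive records is sufficient.
    """
    for gene, group in groupby(pairs, key=lambda kv: kv[0]):
        yield gene, sum(count for _, count in group)


def run_local(lines):
    """Simulate map -> sort -> reduce in a single process."""
    return list(reduce_counts(sorted(map_reads(lines))))
```

In a streaming job the mapper and reducer would typically be separate scripts that read text lines from standard input and write tab-separated key/value lines to standard output. Because each script only depends on that line-based contract, the same scripts can be chained in a shell pipeline with a sort step between them, or fanned out over input chunks by a tool such as GNU Parallel.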

The documentation for HTSeq-Hadoop is available here.
