raalesir/HTSeq-Hadoop

HTSeq-Hadoop extends the HTSeq package with Hadoop implementations.

HTSeq provides an Application Programming Interface (API) for manipulating raw and processed Next Generation Sequencing (NGS) data in the Python programming language. A limitation of HTSeq is that it generally runs in a single thread, although in some cases it can scale up to a whole multicore node.

We modified two HTSeq tools that are widely used in RNA-seq analysis: htseq-count, which counts how many reads are mapped to the genes, and htseq-qa, which assesses the quality of raw or mapped reads. Both were adapted to run in the Hadoop framework to significantly increase scalability.

HTSeq-Hadoop currently provides two utilities:

  • HTSeqCount -- mimicking the functionality of htseq-count
  • HTSeqQA -- mimicking the functionality of htseq-qa
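
To make the map/reduce split behind a read counter concrete, here is a minimal sketch in the Hadoop-streaming style. It is not the HTSeq-Hadoop code: the `FEATURES` table, the function names, and the rule of assigning a read by its start position alone are all assumptions made for illustration. The real htseq-count considers the full alignment and reports further special counters.

```python
from itertools import groupby

# Hypothetical toy annotation: gene_id -> (chromosome, start, end), 1-based inclusive.
FEATURES = {
    "geneA": ("chr1", 100, 200),
    "geneB": ("chr1", 300, 400),
}


def assign_gene(chrom, pos):
    """Return the one gene whose interval contains pos, else a special label."""
    hits = [g for g, (c, s, e) in FEATURES.items() if c == chrom and s <= pos <= e]
    if not hits:
        return "__no_feature"
    if len(hits) > 1:
        return "__ambiguous"
    return hits[0]


def mapper(lines):
    """Emit 'label<TAB>1' for every SAM-like alignment line."""
    for line in lines:
        if line.startswith("@") or not line.strip():
            continue  # skip SAM header lines and blank lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos = fields[2], int(fields[3])  # SAM columns RNAME and POS
        yield f"{assign_gene(chrom, pos)}\t1"


def reducer(lines):
    """Sum the values per key; the input must already be sorted by key."""
    parsed = (line.rstrip("\n").split("\t") for line in lines)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


def run_local(lines):
    """Simulate map -> shuffle/sort -> reduce in a single process."""
    return reducer(sorted(mapper(lines)))
```

Under Hadoop streaming, the mapper and the reducer would normally be separate scripts that read from standard input and write to standard output, with the framework doing the sort between them. `run_local` only imitates that flow for small tests.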

The runtime performance of HTSeqCount under Hadoop was compared with that of a Pig Latin script on the Apache Pig platform. Choosing the Hadoop-streaming library also made it possible to use the GNU-parallel utility to run HTSeq-Hadoop in multiple threads on multicore Linux workstations or on a cluster node.
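
The idea of running streaming-style scripts over input chunks in parallel and then merging the partial results can be sketched as follows. This is a Python stand-in rather than GNU parallel itself, and the function names and the `key<TAB>value` record layout are assumptions for illustration. GNU parallel launches separate processes, whereas this sketch uses a thread pool only to show the split and merge structure.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_chunk(lines):
    """Sum the values per key within one chunk of 'key<TAB>value' lines."""
    counts = Counter()
    for line in lines:
        key, value = line.rstrip("\n").split("\t")
        counts[key] += int(value)
    return counts


def split_into_chunks(lines, n_chunks):
    """Deal the lines round-robin into n_chunks lists."""
    return [lines[i::n_chunks] for i in range(n_chunks)]


def parallel_count(lines, workers=4):
    """Count each chunk in its own worker, then merge the partial counts."""
    chunks = split_into_chunks(list(lines), workers)
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(count_chunk, chunks):
            total.update(partial)  # Counter.update adds counts per key
    return dict(total)
```

Because per-key sums are associative, the merged result does not depend on how the lines are divided among workers. That property is what lets the same counting logic run on one core, on several cores, or on a cluster.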

The documentation for the HTSeq-Hadoop is available here.
