

JARVIS2

JARVIS2: a data compressor for large genome sequences

Installation

git clone https://github.com/cobilab/jarvis2.git
cd jarvis2/src/
make
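
After compilation, a quick sanity check (a minimal sketch, assuming make leaves the JARVIS2 binary in the current src/ directory, as the rest of this guide does) is to print the program version:

./JARVIS2 -a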

Execution

Run JARVIS2

Run JARVIS2 using level 9:

./JARVIS2 -v -l 9 File.seq
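
Compression writes a file with the ".jc" extension. To restore the original sequence (a sketch of the round trip, assuming the output of the command above is named File.seq.jc, following the help sample):

./JARVIS2 -v -d File.seq.jc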

Parameters

To see the possible options, type:

./JARVIS2 -h

This will print the following options:


SYNOPSIS                                                           
      ./JARVIS2 [OPTION]... [FILE]                                 
                                                                   
SAMPLE                                                             
      Run Compression   -> ./JARVIS2 -v -l 30 sequence.txt          
      Run Decompression -> ./JARVIS2 -v -d sequence.txt.jc         
                                                                   
DESCRIPTION                                                        
      Lossless compression and decompression of genomic            
      sequences for efficient storage and analysis purposes.       
      Measure an upper bound of the sequence complexity.           
                                                                   
      -h,  --help                                                  
           Usage guide (help menu).                                
                                                                   
      -a,  --version                                               
           Display program and version information.                
                                                                   
      -x,  --explanation                                           
           Explanation of the context and repeat models.           
                                                                   
      -f,  --force                                                 
           Force mode. Overwrites old files.                       
                                                                   
      -v,  --verbose                                               
           Verbose mode (more information).                        
                                                                   
      -d,  --decompress                                            
           Decompression mode.                                     
                                                                   
      -e,  --estimate                                              
           It creates a file with the extension ".iae" containing the
           respective information content. If the file is FASTA or
           FASTQ, only the "ACGT" (genomic) sequence is used.
                                                                   
      -s,  --show-levels                                           
           Show pre-computed compression levels (configured).      
                                                                   
      -l [NUMBER],  --level [NUMBER]                               
           Compression level (integer).                            
           Default level: 30.                                      
           It defines compressibility in balance with computational
           resources (RAM & time). Use -s to see the available levels.
                                                                   
      -hs [NUMBER],  --hidden-size [NUMBER]                        
           Hidden size of the neural network (integer).            
           Default value: 40.                                      
                                                                   
      -lr [DOUBLE],  --learning-rate [DOUBLE]                      
           Neural network learning rate (double).
           Default value: 0.030.                                   
                                                                   
      [FILE]                                                       
           Input sequence filename (to compress) -- MANDATORY.     
           File to compress is the last argument.                  
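
For example, the estimate option can be combined with a compression level (a sketch, assuming -e accepts the same [FILE] argument and writes the ".iae" output next to the input, as described above):

./JARVIS2 -v -e -l 9 File.seq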
                                                                   

To see the possible levels (automatically chosen compression parameters), type:

./JARVIS2 -s

This will output the following pre-set models for each of the 33 levels:

Level 1: -rm 20:12:0.1:0.9:6:0.10:0:0.8:200000 
Level 2: -rm 200:12:0.1:0.9:6:0.10:0:0.8:200000 -cm 3:1:0:0.7/0:0:0:0 
Level 3: -rm 500:12:0.1:0.9:6:0.10:0:0.8:200000 -cm 3:1:0:0.7/0:0:0:0 
Level 4: -rm 500:12:0.1:0.9:6:0.10:1:0.8:200000 -cm 3:1:0:0.7/0:0:0:0 
Level 5: -rm 500:12:0.1:0.9:6:0.10:1:0.8:2000000 -cm 3:1:0:0.7/0:0:0:0 
Level 6: -rm 500:11:0.1:0.9:6:0.10:1:0.8:5000000 -cm 3:1:0:0.7/0:0:0:0 
Level 7: -rm 1000:12:0.1:0.9:6:0.10:1:0.8:200000 -cm 2:1:0:0.9/0:0:0:0 -cm 7:1:1:0.7/0:0:0:0 -cm 13:20:1:0.95/0:0:0:0 
Level 8: -rm 1500:12:0.1:0.9:6:0.10:1:0.8:200000 -cm 2:1:0:0.9/0:0:0:0 -cm 7:1:1:0.7/0:0:0:0 -cm 13:20:1:0.95/0:0:0:0 
Level 9: -rm 2000:12:0.1:0.9:6:0.10:1:0.8:250000 -cm 2:1:0:0.9/0:0:0:0 -cm 7:1:1:0.7/0:0:0:0 -cm 13:20:1:0.95/0:0:0:0 
Level 10: -rm 4000:12:0.1:0.9:6:0.10:1:0.8:300000 -cm 2:1:0:0.9/0:0:0:0 -cm 7:1:1:0.7/0:0:0:0 -cm 13:20:1:0.95/0:0:0:0 
Level 11: -rm 5000:12:0.1:0.9:6:0.10:1:0.8:400000 -cm 2:1:0:0.9/0:0:0:0 -cm 7:1:1:0.7/0:0:0:0 -cm 11:10:0:0.95/0:0:0:0 -cm 13:20:1:0.95/0:0:0:0 
Level 12: -rm 1000:13:0.1:0.9:6:0.15:1:0.85:400000 -cm 2:1:0:0.9/0:0:0:0 -cm 7:1:1:0.7/0:0:0:0 -cm 11:10:0:0.95/0:0:0:0 -cm 13:20:1:0.95/0:0:0:0 
Level 13: -rm 2000:14:0.1:0.9:6:0.15:1:0.95:500000 -cm 2:1:0:0.9/0:0:0:0 -cm 7:1:1:0.7/0:0:0:0 -cm 11:10:0:0.95/0:0:0:0 -cm 13:20:1:0.95/0:0:0:0 
Level 14: -rm 2000:15:0.1:0.9:6:0.15:1:0.99:1000000 -cm 2:1:0:0.9/0:0:0:0 -cm 7:1:1:0.7/0:0:0:0 -cm 11:10:0:0.95/0:0:0:0 -cm 13:20:1:0.95/0:0:0:0 
Level 15: -rm 2000:15:0.1:0.9:6:0.15:1:0.999:5000000 -cm 2:1:0:0.9/0:0:0:0 -cm 7:1:1:0.7/0:0:0:0 -cm 11:10:0:0.95/0:0:0:0 -cm 13:20:1:0.95/0:0:0:0 
Level 16: -rm 1000:15:0.1:0.9:6:0.10:1:0.999:0 -cm 1:1:0:0.8/0:0:0:0 -cm 3:1:0:0.93/0:0:0:0 -cm 6:1:1:0.7/0:0:0:0 -cm 12:10:1:0.95/0:0:0:0 
Level 17: -rm 1000:12:0.1:0.9:6:0.10:0:0.8:200000 -rm 1000:12:0.1:0.9:6:0.10:2:0.8:200000 -cm 1:1:0:0.8/0:0:0:0 -cm 3:1:0:0.93/0:0:0:0 -cm 6:1:1:0.7/0:0:0:0 -cm 12:10:1:0.95/0:0:0:0 
Level 18: -rm 1000:15:0.1:0.9:6:0.10:1:0.999:0 -cm 1:1:0:0.8/0:0:0:0 -cm 3:1:0:0.93/0:0:0:0 -cm 6:1:1:0.7/0:0:0:0 -cm 12:10:1:0.95/0:0:0:0 
Level 19: -rm 50:12:0.1:0.9:6:0.10:1:0.85:200000 -cm 3:1:0:0.93/0:0:0:0 -cm 7:10:1:0.7/0:0:0:0 -cm 13:50:1:0.95/0:0:0:0 
Level 20: -rm 50:12:0.1:0.9:6:0.10:1:0.85:200000 -cm 1:1:0:0.8/0:0:0:0 -cm 3:1:0:0.93/0:0:0:0 -cm 6:1:1:0.7/0:0:0:0 -cm 13:200:1:0.95/0:0:0:0 
Level 21: -rm 100:12:0.1:0.9:6:0.10:1:0.85:200000 -cm 1:1:0:0.8/0:0:0:0 -cm 3:1:0:0.93/0:0:0:0 -cm 6:1:1:0.7/0:0:0:0 -cm 13:200:1:0.95/0:0:0:0 
Level 22: -rm 200:12:0.1:0.9:6:0.10:1:0.85:200000 -cm 1:1:0:0.8/0:0:0:0 -cm 3:1:0:0.93/0:0:0:0 -cm 6:1:1:0.7/0:0:0:0 -cm 13:200:1:0.95/0:0:0:0 
Level 23: -rm 500:12:0.2:0.9:7:0.1:1:0.01:200000 -cm 1:1:0:0.7/0:0:0:0 -cm 3:1:0:0.9/0:0:0:0 -cm 7:10:1:0.92/0:0:0:0 -cm 12:10:1:0.94/0:0:0:0 
Level 24: -rm 1000:13:0.2:0.9:7:0.1:1:0.1:200000 -cm 1:1:0:0.7/0:0:0:0 -cm 3:1:0:0.9/0:0:0:0 -cm 7:10:1:0.90/0:0:0:0 -cm 12:20:1:0.95/0:0:0:0 
Level 25: -lr 0.01 -hs 42 -rm 1000:12:0.1:0.9:7:0.4:1:0.2:220000 -cm 1:1:0:0.7/0:0:0:0 -cm 7:10:1:0.7/0:0:0:0 -cm 12:1:1:0.85/0:0:0:0 
Level 26: -lr 0.01 -hs 42 -rm 100:12:0.01:0.9:7:0.8:1:0.2:240000 -cm 1:1:0:0.9/0:0:0:0 -cm 7:10:1:0.9/0:0:0:0 -cm 12:10:1:0.9/0:0:0:0 
Level 27: -lr 0.05 -hs 42 -rm 100:12:1:0.9:7:0.8:0:0.01:250000 -rm 100:12:1:0.9:7:0.8:2:0.01:240000 -cm 1:1:0:0.9/0:0:0:0 -cm 4:1:0:0.9/0:0:0:0 -cm 8:1:1:0.89/0:0:0:0 -cm 12:20:1:0.97/0:0:0:0 
Level 28: -lr 0.05 -hs 42 -rm 100:12:1:0.9:7:0.8:0:0.01:250000 -rm 100:12:1:0.9:7:0.8:2:0.01:240000 -cm 1:1:0:0.9/0:0:0:0 -cm 4:1:0:0.9/0:0:0:0 -cm 8:1:1:0.89/0:0:0:0 -cm 12:20:1:0.97/0:0:0:0 
Level 29: -lr 0.05 -hs 42 -rm 200:12:1:0.9:7:0.8:0:0.01:250000 -rm 200:12:1:0.9:7:0.8:2:0.01:240000 -cm 1:1:0:0.9/0:0:0:0 -cm 4:1:0:0.9/0:0:0:0 -cm 8:1:1:0.89/0:0:0:0 -cm 12:20:1:0.97/0:0:0:0 
Level 30: -lr 0.05 -hs 42 -rm 200:12:1:0.9:7:0.8:1:0.01:250000 -cm 4:1:0:0.9/0:0:0:0 
Level 31: -lr 0.05 -hs 42 -rm 100:12:1:0.9:7:0.8:1:0.01:250000 -cm 3:1:0:0.9/0:0:0:0 
Level 32: -lr 0.03 -hs 42 -rm 500:12:1:0.9:7:0.8:1:0.01:250000 -cm 3:1:0:0.9/0:0:0:0 
Level 33: -lr 0.03 -hs 42 -rm 200:12:1:0.9:7:0.8:1:0.01:250000 -cm 7:1:0:0.9/0:0:0:0 
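
Each level is shorthand for the set of -rm and -cm parameters shown above, so a level can in principle be reproduced by passing the models explicitly (a sketch, assuming the -rm and -cm flags described by -x below can be supplied directly instead of -l; shown here for level 2):

./JARVIS2 -v -rm 200:12:0.1:0.9:6:0.10:0:0.8:200000 -cm 3:1:0:0.7/0:0:0:0 File.seq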

To see the meaning of the model parameters, type:

./JARVIS2 -x

This will output the following content:

      -cm [NB_C]:[NB_D]:[NB_I]:[NB_G]/[NB_S]:[NB_E]:[NB_I]:[NB_A]  
      Template of a context model.                                 
      Parameters:                                                  
      [NB_C]: (integer [1;20]) order size of the regular context   
              model. Higher values use more RAM but, usually, are  
              related to a better compression score.               
      [NB_D]: (integer [1;5000]) denominator to build alpha, which
              is a parameter estimator. Alpha is given by 1/[NB_D].
              Higher values are usually used with higher [NB_C] and
              are related to more confident bets. When [NB_D] is one,
              the probabilities assume a Laplacian distribution.
      [NB_I]: (integer {0,1,2}) defines whether a sub-program that
              addresses a specific property of DNA sequences
              (inverted repeats) is used. The number 1 turns the
              sub-program ON together with the regular context model.
              The number 2 uses only the inversions (NO regular
              model). The number 0 turns inverted repeats OFF. Using
              this sub-program increases the time needed to compress
              but does not affect the RAM.
      [NB_G]: (real [0;1)) real number to define gamma. This value
              represents the decaying (forgetting) factor of the
              regular context model being defined.
      [NB_S]: (integer [0;20]) maximum number of edits allowed in a
              substitution-tolerant model sharing the memory model of
              the regular context model with order size equal to
              [NB_C]. The value 0 turns the tolerant context model
              off. When the model is on, it pauses when the number of
              edits is higher than [NB_C], and it is turned on again
              when a complete match of size [NB_C] is seen. This
              probabilistic-algorithmic model is very useful to
              handle the highly substitutional nature of genomic
              sequences. When [NB_S] > 0, the compressor uses more
              processing time, but the same RAM, and usually achieves
              a substantially higher compression ratio. The impact of
              this model is usually only noticed for [NB_C] >= 14.
      [NB_R]: (integer {0,1}) defines whether the sub-program that
              addresses inverted repeats of DNA sequences is used.
              It is similar to [NB_I] but for tolerant models.
      [NB_E]: (integer [1;5000]) denominator to build alpha for the
              substitution-tolerant context model. It is analogous to
              [NB_D], but is only used in the probabilistic model
              that computes the statistics of the substitution-tolerant
              context model.
      [NB_A]: (real [0;1)) real number to define gamma. This value
              represents the decaying (forgetting) factor of the
              substitution-tolerant context model being defined. Its
              definition and use are analogous to [NB_G].
                                                                   
      ... (you may use several context models)                     
                                                                   
                                                                   
      -rm [NB_R]:[NB_C]:[NB_A]:[NB_B]:[NB_L]:[NB_G]:[NB_I]:[NB_W]:[NB_Y]
      Template of a repeat model.                                  
      Parameters:                                                  
      [NB_R]: (integer [1;10000]) maximum number of repeat models
              for the class. On very repetitive sequences the RAM
              increases with this value, but it also improves the
              compression capability.
      [NB_C]: (integer [1;20]) order size of the repeat context    
              model. Higher values use more RAM but, usually, are  
              related to a better compression score.               
      [NB_A]: (real (0;1]) alpha is a real value, which is a
              parameter estimator. Higher values are usually used
              with lower [NB_C]. When [NB_A] is one, the
              probabilities assume a Laplacian distribution.
      [NB_B]: (real (0;1]) beta is a real value, which is a        
              parameter for discarding or maintaining a certain    
              repeat model.                                        
      [NB_L]: (integer (1;20]) a limit threshold used together with
              [NB_B] to accept or discard a certain repeat model.
      [NB_G]: (real [0;1)) real number to define gamma. This value
              represents the decaying (forgetting) factor of the
              regular context model being defined.
      [NB_I]: (integer {0,1,2}) defines whether a sub-program that
              addresses a specific property of DNA sequences
              (inverted repeats) is used. The number 1 turns the
              sub-program ON together with the regular context model.
              The number 0 turns inverted repeats OFF. The number 2
              uses exclusively inverted repeats. Using this
              sub-program increases the time needed to compress but
              does not affect the RAM.
      [NB_W]: (real (0;1)) initial weight for the repeat class.    
      [NB_Y]: (integer {0}, [50;*]) repeat cache size. This will   
              use a cache of entries while hashing. Value '0' will 
              use the whole sequence length. 
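
Combining the two templates, a custom configuration mixing one repeat model with two context models and explicit neural-network settings could look like the following (a hedged sketch assembled from values that appear in the pre-set levels above; the parameters are illustrative, not a tuned recommendation):

./JARVIS2 -v -lr 0.03 -hs 42 -rm 200:12:1:0.9:7:0.8:1:0.01:250000 -cm 3:1:0:0.9/0:0:0:0 -cm 12:20:1:0.95/0:0:0:0 File.seq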

Compression and decompression of FASTA and FASTQ data

First, make sure to give execution permission to the script by typing the following in the src/ folder:

chmod +x JARVIS2.sh

The extension for compressing FASTA and FASTQ data provides a menu exposing its parameters, which can be accessed using:

./JARVIS2.sh --help

Preparing JARVIS2 for FASTA and FASTQ:

./JARVIS2.sh --install

Compression of FASTA data:

./JARVIS2.sh --threads 8 --fasta --block 10MB --input sample.fa

Decompression of FASTA data:

./JARVIS2.sh --decompress --fasta --threads 4 --input sample.fa.tar

Compression of FASTQ data:

./JARVIS2.sh --threads 8 --fastq --block 40MB --input sample.fq

Decompression of FASTQ data:

./JARVIS2.sh --decompress --fastq --threads 4 --input sample.fq.tar
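
To verify a full FASTA round trip (a sketch, assuming decompression restores the sequence under its original name; sample_copy.fa is a hypothetical backup made only for the comparison):

cp sample.fa sample_copy.fa    # hypothetical backup kept for verification
./JARVIS2.sh --threads 8 --fasta --block 10MB --input sample.fa
./JARVIS2.sh --decompress --fasta --threads 8 --input sample.fa.tar
cmp sample.fa sample_copy.fa   # no output means the files are byte-identical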

Benchmark

JARVIS2 has been tested on two large benchmarks, namely:

  1. Human genome (T2T CHM13 version 2.0 [article,sequence]):
     https://github.com/cobilab/HumanGenome
  2. Cassava genome (TME 204 [article,sequence]):
     https://github.com/cobilab/CassavaGenome

Performance

Currently, as far as we know, JARVIS2 holds the record for the highest compression of both genomes.

Citation

JARVIS2: a data compressor for large genome sequences. 
D Pratas, AJ Pinho. Data Compression Conference (DCC), 2023.

PDF Access here

Issues

For any issue, let us know at the issues link.

License

License: GPL v3

For more information:

http://www.gnu.org/licenses/gpl-3.0.html