-
Pairix is a tool for indexing and querying on a block-compressed text file containing pairs of genomic coordinates.
-
Pairix is a stand-alone C program that was written on top of tabix (https://github.com/samtools/tabix) as a tool for the 4DN-standard pairs file format describing Hi-C data: pairs_format_specification.md
-
However, Pairix can be used as a generic tool for indexing and querying any bgzipped text file containing genomic coordinates, for either 2D- or 1D- indexing and querying.
-
For example: given the custom text file below, you want to extract specific lines from the Pairs file further below. An
awk
command would read the Pairs file from beginning to end. Pairix creates an index and uses it to access the file from a relevant position by taking advantage of bgzf compression, allowing for a fast query on large files.Some custom text file
chr1 10000 20000 chr2 30000 50000 3.5 chr1 30000 40000 chr3 10000 70000 4.6
Pairs format
## pairs format v1.0 #sorted: chr1-chr2-pos1-pos2 #shape: upper triangle #chromsize: chr1 249250621 #chromsize: chr2 243199373 #chromsize: chr3 198022430 ... #genome_assembly: hg38 #columns: readID chr1 pos1 chr2 pos2 strand1 strand2 EAS139:136:FC706VJ:2:2104:23462:197393 chr1 10000 chr1 20000 + + EAS139:136:FC706VJ:2:8762:23765:128766 chr1 50000 chr1 70000 + + EAS139:136:FC706VJ:2:2342:15343:9863 chr1 60000 chr2 10000 + + EAS139:136:FC706VJ:2:1286:25:275154 chr1 30000 chr3 40000 + -
-
Bgzip can be found either in this repo or https://github.com/samtools/tabix (original).
- Pairix is available as a stand-alone command-line program, a python library (pypairix), and an R package (Rpairix https://github.com/4dn-dcic/Rpairix)
- Various utils including
bam2pairs
,merged_nodups2pairs.pl
,pairs_merger
etc. are available within this repo. - The
bgzip
program that is provided as part of the repo is identical to the original program in https://github.com/samtools/tabix.
- For 2D indexing, the input file of paired coordinates must first be sorted by the two chromosome columns and then by the first genomic position column. For 1D indexing, the file must be sorted by a chromosome column and then by a position column.
- The input file must be compressed using bgzip and is either tab-delimited or space-delimited.
- The resulting index file will be given the extension
.px2
.
git clone https://github.com/4dn-dcic/pairix
cd pairix
make
# Add the bin path to PATH for pairix, bgzip, pairs_merger and stream_1d
# In order to use utils, add util path to PATH
# In order to use bam2pairs, add util/bam2pairs to PATH
# eg: PATH=~/git/pairix/bin/:~/git/pairix/util:~/git/pairix/util/bam2pairs:$PATH
Alternatively, conda install pairix
can be used to install both Pairix and Pypairix together. This requires Anaconda or Miniconda.
If you get an error message saying zlib cannot be found, try installing zlib first as below before make
.
# ubuntu
sudo apt-get install zlib1g-dev
# centos
sudo yum install zlib-devel
bgzip textfile
pairix textfile.gz # for recognized file extension
pairix -p <preset> textfile.gz
pairix -s<chr1_column> [-d<chr2_column>] -b<pos1_start_column> -e<pos1_end_column> [-u<pos2_start_column> -v<pos2_end_column>] [-T] textfile.gz # u, v is required for full 2d query.
- column indices are 1-based.
- use
-T
option for a space-delimited file. - use
-f
option to overwrite an existing index file. - presets can be used for indexing :
pairs
,merged_nodups
,old_merged_nodups
for 2D indexing,gff
,vcf
,bed
,sam
for 1D indexing. Default ispairs
. - For the recognized file extensions, the
-p
option can be dropped:.pairs.gz
,.vcf.gz
,gff.gz
,bed.gz
,sam.gz
- Custom column specification (-s, -d, -b, -e, -u, -v) overrides file extension recognition. The custom specification must always have at least chr1_column (-s).
- use
-w <character>
option to change region split character (default '|', see below the Querying section for details). This is useful when your chromosome names contain the '|' character. (e.g.-w '^'
)
pairix textfile.gz region1 [region2 [...]] ## region is in the following format.
# for 1D indexed file
pairix textfile.gz '<chr>:<start>-<end>' '<chr>:<start>-<end>' ...
# for 2D indexed file
pairix [-a] textfile.gz '<chr1>:<start1>-<end1>|<chr2>:<start2>-<end2>' ... # make sure to quote so '|' is not interpreted as a pipe.
pairix [-a] textfile.gz '*|<chr2>:<start2>-<end2>' # wild card is accepted for 1D query on 2D indexed file
pairix [-a] textfile.gz '<chr1>:<start1>-<end1>|*' # wild card is accepted for 1D query on 2D indexed file
- The -a option (auto-flip) flips query when a given chromosome pair doesn't exist.
pairix -a samples/test_4dn.pairs.gz 'chrY|chrX' |head
SRR1658581.13808070 chrX 359030 chrY 308759 - +
SRR1658581.1237993 chrX 711481 chrY 3338402 + -
SRR1658581.38694206 chrX 849049 chrY 2511913 - -
SRR1658581.6691868 chrX 1017548 chrY 967955 + -
SRR1658581.2398986 chrX 1215519 chrY 569356 - +
SRR1658581.21090183 chrX 1406586 chrY 2621557 + -
SRR1658581.35447261 chrX 1501769 chrY 1458068 + -
SRR1658581.26384827 chrX 1857703 chrY 1807309 - +
SRR1658581.13824346 chrX 2129016 chrY 2411576 - -
SRR1658581.6160690 chrX 2194708 chrY 2485859 - -
pairix -L textfile.gz regionfile1 [regionfile2 [...]] # region file contains one region string per line
- The default region split character is '|', which can be changed by using the
-w
option when building an index.
This command prints out all chromosome pairs in the file.
pairix -l textfile.gz
This is equivalent to but much faster than gunzip -c | wc -l
.
pairix -n textfile.gz
By default '|' is used to split the two genomic regions, but in some cases, a different character is used and it is stored in the index. This command prints out the character used for a specific pairs file.
pairix -W textfile.gz
This command prints out the number of bgzk blocks for all chromosome pairs.
pairix -B textfile.gz
(column 2 and 4 are chromosomes (chr1 and chr2), column 3 is position of the first coordinate (pos1)).
# sorting & bgzipping
sort -k2,2 -k4,4 -k3,3n -k5,5n samples/4dn.bsorted.chr21_22_only.pairs |bgzip -c > samples/4dn.bsorted.chr21_22_only.pairs.gz
# indexing
pairix -f samples/4dn.bsorted.chr21_22_only.pairs.gz
# The above command is equivalent to: pairix -f -s2 -b3 -e3 -d4 -u5 samples/4dn.bsorted.chr21_22_only.pairs.gz
# The above command is also equivalent to: pairix -f -p pairs samples/4dn.bsorted.chr21_22_only.pairs.gz
# Pairs extension .pairs.gz is automatically recognized.
(columns 2 and 6 are chromosomes (chr1 and chr2), and column 3 is position of the first coordinate (pos1)).
# sorting & bgzipping
sort -t' ' -k2,2 -k6,6 -k3,3n -k7,7n merged_nodups.txt |bgzip -c > samples/merged_nodups.space.chrblock_sorted.subsample3.txt
#indexing
pairix -f -p merged_nodups samples/merged_nodups.space.chrblock_sorted.subsample3.txt.gz
# The above command is equivalent to : pairix -f -s2 -d6 -b3 -e3 -u7 -T samples/merged_nodups.space.chrblock_sorted.subsample3.txt.gz
semi 2D query with two chromosomes
pairix samples/test_4dn.pairs.gz 'chr21|chr22'
SRR1658581.33025893 chr21 9712946 chr22 21680462 - +
SRR1658581.9428560 chr21 10774937 chr22 37645396 - +
SRR1658581.8816993 chr21 11171003 chr22 33169971 + +
SRR1658581.10673815 chr21 16085548 chr22 35451128 + -
SRR1658581.2504661 chr21 25672432 chr22 21407301 - -
SRR1658581.40524826 chr21 28876237 chr22 42449178 + +
SRR1658581.8969171 chr21 33439464 chr22 43912252 - -
SRR1658581.6842680 chr21 35467614 chr22 33478115 + +
SRR1658581.15363628 chr21 37956917 chr22 21286436 - -
SRR1658581.3572823 chr21 40651454 chr22 41358228 - -
SRR1658581.50137399 chr21 42446807 chr22 49868647 - +
SRR1658581.11358652 chr21 43768599 chr22 40759935 - -
SRR1658581.4127782 chr21 45142744 chr22 36929446 + +
SRR1658581.38401094 chr21 46989766 chr22 45627553 - +
SRR1658581.34261420 chr21 48113817 chr22 51138644 + -
semi 2D query with a chromosome and a range
pairix samples/test_4dn.pairs.gz 'chr21:10000000-20000000|chr22'
SRR1658581.9428560 chr21 10774937 chr22 37645396 - +
SRR1658581.8816993 chr21 11171003 chr22 33169971 + +
SRR1658581.10673815 chr21 16085548 chr22 35451128 + -
full 2D query with two ranges
pairix samples/test_4dn.pairs.gz 'chr21:10000000-20000000|chr22:30000000-35000000'
SRR1658581.8816993 chr21 11171003 chr22 33169971 + +
full 2D multi-query
pairix samples/test_4dn.pairs.gz 'chr21:10000000-20000000|chr22:30000000-35000000' 'chrX:100000000-110000000|chrX:150000000-170000000'
SRR1658581.8816993 chr21 11171003 chr22 33169971 + +
SRR1658581.39700722 chrX 100748075 chrX 154920234 + +
SRR1658581.36337371 chrX 104718152 chrX 151646254 + -
SRR1658581.49591338 chrX 104951264 chrX 154363440 + +
SRR1658581.46205223 chrX 105732382 chrX 155162659 + -
SRR1658581.32048997 chrX 107326643 chrX 151899433 - +
Wild-card 2D query
pairix samples/test_4dn.pairs.gz 'chr21:9000000-9700000|*'
SRR1658581.18102003 chr21 9582382 chr21 9733996 + +
SRR1658581.10121427 chr21 9665774 chr4 49203518 + -
SRR1658581.1019708 chr21 9496682 chr4_gl000193_random 48672 + +
SRR1658581.44516250 chr21 9662891 chr6 7280832 - +
SRR1658581.15515341 chr21 9549471 chr9 68384076 + +
SRR1658581.51399686 chr21 9687495 chrUn_gl000221 87886 + +
SRR1658581.25532108 chr21 9519859 chrUn_gl000226 6821 + -
SRR1658581.22081000 chr21 9659013 chrUn_gl000232 19626 - -
SRR1658581.34308344 chr21 9532618 chrX 61793091 - +
pairix samples/test_4dn.pairs.gz '*|chr21:9000000-9700000'
SRR1658581.21313395 chr1 25612365 chr21 9679403 + -
SRR1658581.46040617 chr1 143255816 chr21 9663103 + +
SRR1658581.54790470 chr14 101961336 chr21 9481250 + +
SRR1658581.38248307 chr18 18518988 chr21 9452846 - +
SRR1658581.9143926 chr2 90452598 chr21 9486716 + -
Symmetric query - 1D query on a 2D-index file is interpreted as a symmetric 2D query. The two commands below are equivalent.
pairix samples/test_4dn.pairs.gz 'chr22:50000000-60000000'
SRR1658581.11011611 chr22 50224888 chr22 50225362 + -
SRR1658581.37423580 chr22 50528835 chr22 50529355 + -
SRR1658581.20673732 chr22 50638372 chr22 51062837 + -
SRR1658581.38906907 chr22 50661701 chr22 50813018 + -
SRR1658581.7631402 chr22 50767962 chr22 50773437 - +
SRR1658581.31517355 chr22 50910780 chr22 50911083 + -
SRR1658581.31324262 chr22 50991542 chr22 50991895 + -
SRR1658581.46124457 chr22 51143411 chr22 51143793 + -
SRR1658581.23040702 chr22 51229529 chr22 51229809 + -
pairix samples/test_4dn.pairs.gz 'chr22:50000000-60000000|chr22:50000000-60000000'
SRR1658581.11011611 chr22 50224888 chr22 50225362 + -
SRR1658581.37423580 chr22 50528835 chr22 50529355 + -
SRR1658581.20673732 chr22 50638372 chr22 51062837 + -
SRR1658581.38906907 chr22 50661701 chr22 50813018 + -
SRR1658581.7631402 chr22 50767962 chr22 50773437 - +
SRR1658581.31517355 chr22 50910780 chr22 50911083 + -
SRR1658581.31324262 chr22 50991542 chr22 50991895 + -
SRR1658581.46124457 chr22 51143411 chr22 51143793 + -
SRR1658581.23040702 chr22 51229529 chr22 51229809 + -
Query using a region file
cat samples/test.regions
chr1:1-50000|*
*|chr1:1-50000
chr2:1-20000|*
*|chr2:1-20000
cat samples/test.regions2
chrX:100000000-110000000|chrY
chr19:1-300000|chr19
bin/pairix -L samples/test_4dn.pairs.gz samples/test.regions samples/test.regions2
SRR1658581.49364897 chr1 36379 chr20 62713042 + +
SRR1658581.31672330 chr1 12627 chr9 23963238 + -
SRR1658581.22713561 chr1 14377 chrX 107423076 - +
SRR1658581.31992022 chrX 108223782 chrY 5017118 - -
SRR1658581.55524746 chr19 105058 chr19 105558 + -
1D indexing
pairix -f samples/SRR1171591.variants.snp.vqsr.p.vcf.gz
# The above command is equivalent to : pairix -f -s1 -b2 -e2 samples/SRR1171591.variants.snp.vqsr.p.vcf.gz
# The above command is also equivalent to : pairix -f -p vcf samples/SRR1171591.variants.snp.vqsr.p.vcf.gz
# The extension `.vcf.gz` is automatically recognized.
1D query
pairix samples/SRR1171591.variants.snp.vqsr.p.vcf.gz chr10:1-4000000
chr10 3463966 . C T 51.74 PASS AC=2;AF=1.00;AN=2;DB;DP=2;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=50.00;MQ0=0;POSITIVE_TRAIN_SITE;QD=25.87;VQSLOD=7.88;culprit=FS GT:AD:DP:GQ:PL 1/1:0,2:2:6:79,6,0
chr10 3978708 rs29320259 T C 1916.77 PASS AC=2;AF=1.00;AN=2;BaseQRankSum=1.016;DB;DP=67;Dels=0.00;FS=0.000;HaplotypeScore=629.1968;MLEAC=2;MLEAF=1.00;MQ=50.00;MQ0=0;MQRankSum=0.773;POSITIVE_TRAIN_SITE;QD=28.61;ReadPosRankSum=0.500;VQSLOD=3.29;culprit=FS GT:AD:DP:GQ:PL 1/1:3,64:67:70:1945,70,0
chr10 3978709 . G A 1901.77 PASS AC=2;AF=1.00;AN=2;BaseQRankSum=0.677;DB;DP=66;Dels=0.00;FS=0.000;HaplotypeScore=579.9049;MLEAC=2;MLEAF=1.00;MQ=50.00;MQ0=0;MQRankSum=0.308;POSITIVE_TRAIN_SITE;QD=28.81;ReadPosRankSum=0.585;VQSLOD=3.24;culprit=FS GT:AD:DP:GQ:PL 1/1:3,63:66:73:1930,73,0
# to install the python module pypairix,
pip install pypairix
# you may need to install python-dev for some ubuntu releases.
# or
cd pairix
python setup.py install
# testing the python module
python test/test.py
Alternatively, conda install pairix
can be used to install both Pairix and Pypairix together. This requires Anaconda or Miniconda.
# to import and use python module pypairix, add the following in your python script.
import pypairix
# 2D query usage example 1 with `query2D(chrom, start, end, chrom2, start2, end2)`
tb=pypairix.open("textfile.gz")
it = tb.query2D(chrom, start, end, chrom2, start2, end2)
for x in it:
print(x)
# 2D query usage example 1 with *autoflip* with `query2D(chrom, start, end, chrom2, start2, end2, 1)`
# Autoflip: if the queried chromosome pair does not exist in the pairs file, query the flipped pair.
tb=pypairix.open("textfile.gz")
it = tb.query2D(chrom2, start2, end2, chrom1, start1, end1, 1)
for x in it:
print(x)
# 2D query usage example 2 with `querys2D(querystr)`
tb=pypairix.open("textfile.gz")
querystr='{}:{}-{}|{}:{}-{}'.format(chrom, start, end, chrom2, start2, end2)
it = tb.querys2D(querystr)
for x in it:
print(x)
# 2D query usage example with wild card
tb=pypairix.open("textfile.gz")
querystr='{}:{}-{}|*'.format(chrom, start, end)
it = tb.querys2D(querystr)
for x in it:
print(x)
# 2D query usage example 3, with *autoflip* with `querys2D(querystr, 1)`
# Autoflip: if the queried chromosome pair does not exist in the pairs file, query the flipped pair.
tb=pypairix.open("textfile.gz")
querystr='{}:{}-{}|{}:{}-{}'.format(chrom2, start2, end2, chrom, start, end)
it = tb.querys2D(querystr, 1)
for x in it:
print(x)
# 1D query on 2D indexed file
tb=pypairix.open("textfile.gz")
querystr='{}:{}-{}'.format(chrom, start, end)
it = tb.querys2D(querystr)
# The above two lines are equivalent to the following:
# querystr='{}:{}-{}|{}:{}-{}'.format(chrom, start, end, chrom, start, end)
# it = tb.querys2D(querystr)
for x in it:
print(x)
# 1D query on 1D indexed file, example 1
tb=pypairix.open("textfile.gz")
it = tb.query(chrom, start, end)
for x in it:
print(x)
# 1D query on 1D indexed file, example 2
tb=pypairix.open("textfile.gz")
querystr='{}:{}-{}'.format(chrom, start, end)
it = tb.querys(querystr)
for x in it:
print(x)
# get the list of (chr-pair) blocks
tb=pypairix.open("textfile.gz")
chrplist = tb.get_blocknames()
print str(chrplist)
# get the column index (0-based)
tb=pypairix.open("textfile.gz")
print( tb.get_chr1_col() )
print( tb.get_chr2_col() )
print( tb.get_startpos1_col() )
print( tb.get_startpos2_col() )
print( tb.get_endpos1_col() )
print( tb.get_endpos2_col() )
# check if key exists (key is a chromosome pair for a 2D-indexed file, or a chromosome for a 1D-indexed file)
tb=pypairix.open("textfile.gz")
print( tb.exists("chr1|chr2") ) # 1 if exists, 0 if not.
print( tb.exists2("chr1","chr2") ) # 1 if exists, 0 if not.
# get header
tb=pypairix.open("textfile.gz")
print (tb.get_header())
# get chromsize
tb=pypairix.open("textfile.gz")
print (tb.get_chromsize())
# get the number of bgzf blocks that span a given chromosome pair
tb=pypairix.open("textfile.gz")
print (tb.bgzf_block_count("chr1", "chr2"))
# check if an indexed file is a triangle
tb=pypairix.open("textfile.gz")
print (tb.check_triangle())
- Rpairix is an R package for reading pairix-indexed pairs files. It has its own repo: https://github.com/4dn-dcic/Rpairix
- This script converts a bam file to a 4dn style pairs file, sorted and indexed.
- See util/bam2pairs/README.md for more details.
- This script sorts, bgzips and indexes a newer version of
merged_nodups.txt
file with strand1 as the first column.
Usage: process_merged_nodup.sh <merged_nodups.txt>
- This script sorts, bgzips and indexes an old version of
merged_nodups.txt
file with readID as the first column.
Usage: process_old_merged_nodup.sh <merged_nodups.txt>
- This script converts Juicer's
merged_nodups.txt
format to 4dn-style pairs format. It requires pairix and bgzip binaries in PATH.
Usage: merged_nodup2pairs.pl <input_merged_nodups.txt> <chromsize_file> <output_prefix>
- An example output file (bgzipped and indexed) looks as below.
## pairs format v1.0
#sorted: chr1-chr2-pos1-pos2
#shape: upper triangle
#chromsize: 1 249250621
#chromsize: 2 243199373
...
#columns: readID chr1 pos1 chr2 pos2 strand1 strand2 frag1 frag2
SRR1658650.8850202.2/2 1 16944943 1 151864549 - + 45178 333257
SRR1658650.8794979.1/1 1 21969282 1 50573348 - - 59146 140641
SRR1658650.6209714.1/1 1 31761397 1 32681095 - + 88139 90865
SRR1658650.6348458.2/2 1 40697468 1 40698014 + - 113763 113763
SRR1658650.12316544.1/1 1 41607001 1 41608253 + + 116392 116398
- This script converts Juicer's old
merged_nodups.txt
format to 4dn-style pairs format. It requires pairix and bgzip binaries in PATH.
Usage: old_merged_nodup2pairs.pl <input_merged_nodups.txt> <output_prefix>
- An example output file (bgzipped and indexed) looks as below.
## pairs format v1.0
#sorted: chr1-chr2-pos1-pos2
#shape: upper triangle
#chromsize: 1 249250621
#chromsize: 2 243199373
...
#columns: readID chr1 pos1 chr2 pos2 strand1 strand2 frag1 frag2
SRR1658650.8850202.2/2 1 16944943 1 151864549 - + 45178 333257
SRR1658650.8794979.1/1 1 21969282 1 50573348 - - 59146 140641
SRR1658650.6209714.1/1 1 31761397 1 32681095 - + 88139 90865
SRR1658650.6348458.2/2 1 40697468 1 40698014 + - 113763 113763
SRR1658650.12316544.1/1 1 41607001 1 41608253 + + 116392 116398
- This script adds juicer-style fragment information to 4DN-DCIC style pairs file.
Usage: gunzip -c <input.pairs.gz> | fragment_4dnpairs.pl [--allow-replacement] - <out.pairs> <juicer-style-restriction-site-file>
- This script removes duplicate headers from a pairs file (either ungzipped or streamed). This is useful when you accidentally created a wrong pairs file with duplicate headers. The order of the headers doesn't change. Duplicates don't necessarily have to be in consecutive lines.
Usage: gunzip -c <input.pairs.gz> | duplicate_header_remover.pl - | bgzip -c > <out.pairs.gz>
- This script removes columns from a pairs file (either ungzipped or streamed).
# The following removes multiple columns from both header and content. The columns to be removed should be referred to by column names.
Usage: gunzip -c <input.pairs.gz> | column_remover.pl - <colname1> [<colname2> ...] | bgzip -c > <out.pairs.gz>
# The following removes a single column from only the content. The column to be removed should be referred to by column index (0-based).
Usage: gunzip -c <input.pairs.gz> | column_remover.pl --do-not-fix-header - <colindex> | bgzip -c > <out.pairs.gz>
Pairs_merger is a tool that merges indexed pairs files that are already sorted, creating a sorted output pairs file. Pairs_merger uses a k-way merge sort algorithm starting with k file streams. Specifically, it loops over a merged iterator composed of a dynamically sorted array of k iterators. It neither requires additional memory nor produces any temporary files.
pairs_merger <in1.gz> <in2.gz> <in3.gz> ... > <out.txt>
# Each of the input files must have a .px2 index file.
bgzip out.txt
## or pipe to bgzip
pairs_merger <in1.gz> <in2.gz> <in3.gz> ... | bgzip -c > <out.gz>
# To index the output file as well
# use the appropriate options according to the output file format.
pairix [options] out.gz
bin/pairs_merger samples/merged_nodups.space.chrblock_sorted.subsample2.txt.gz samples/merged_nodups.space.chrblock_sorted.subsample3.txt.gz | bin/bgzip -c > out.gz
bin/pairix -f -p merged_nodups out.gz
# The above command is equivalent to : bin/pairix -f -s2 -d6 -b3 -e3 -u7 -T out.gz
Merge-pairs.sh is a merger specifically for the 4DN pairs file. This merger is header-friendly. The input pairs files do not need to be indexed, but need to be sorted properly.
# The following command will create outprefix.pairs.gz and outprefix.pairs.gz.px2, given in1.pairs.gz, in2.pairs.gz, ....
merge-pairs.sh <outprefix> <in1.pairs.gz> <in2.pairs.gz> ...
util/merge-pairs.sh output sample1.pairs.gz sample2.pairs.gz
Streamer_1d is a tool that converts a 2d-sorted pairs file to a 1d-sorted stream (sorted by chr1-chr2-pos1-pos2 -> sorted by chr1-pos1). This tool uses a k-way merge sort on k file pointers on the same input file, operates linearly without producing any temporary files. Currently, the speed is actually slower than unix sort and is therefore not recommended.
streamer_1d <in.2d.gz> > out.1d.pairs
streamer_1d <in.2d.gz> | bgzip -c > out.1d.pairs.gz
bin/streamer_1d samples/merged_nodups.space.chrblock_sorted.subsample2.txt.gz | bgzip -c > out.1d.pairs.gz
The tool creates many file pointers for the input file, which is equivalent to opening many files simultaneously. Your OS may have a limit on the number of files that can be open at a time. For example, for Mac El Captain and Sierra, it is by default set to 256. This is usually enough, but in case the number of chromosomes in your pairs file happen to be larger than or close to this limit, the tool may produce an error message saying file limit is exceeded. You can increase this limit outside the program. For example, for Mac El Captain and Sierra, the following command raises the limit to 2000.
# view the limits
ulimit -a
# raise the limit to 2000
ulimit -n 2000
- Note that if the chromosome pair block are ordered in a way that the first coordinate is always smaller than the second ('upper-triangle'), a lower-triangle query will return an empty result. For example, if there is a block with chr1='6' and chr2='X', but not with chr1='X' and chr2='6', then the query for X|6 will not return any result. The search is not symmetric. However, using the
-a
option forpairix
orflip
option forpypairix
turns on autoflip, which searches for '6|X' if 'X|6' doesn't exist. - Tabix and pairix indices are not cross-compatible.
- The issue with integer overflow with get_linecount of pypairix is now fixed. (no need to re-index).
- Fixed issue where autoflip causes segmentation fault or returns an empty result on some systems. This affects
pairix -Y
,pairix -a
,pypairix.check_triangle()
andpypairix.querys2D()
. It does not affect the results on 4DN Data Portal.
- Line count (
pairix -n
) integer overflow issue has been fixed. The index structure has changed. The index generated by the previous versions (0.2.5 ~ 0.3.3, 0.3.4 ~ 0.3.5) can be auto-detected and used as well (backward compatible).
- Backward compatibility is added - The index generated by the previous version (0.2.5 ~ 0.3.3) can now be auto-detected and used by Pairix.
- The maximum chromosome size allowed is now 2^30 instead of 2^29 with new index. Index structure changed.
- The problem of
pypairix
get_blocknames
crashing python when called twice now fixed.
pairix -Y
option is now available to check whether a pairix-indexed file is a triangle (i.e. a chromosome pair occurs in one direction. e.g. if chr1|chr2 exists, chr2|chr1 doesn't)pypairix
check_triangle
function is also now available to check whether a pairix-indexed file is a triangle.pairix -B
option is now listed as part of the usage.
pairix -B
option is now available to print out the number of bgzf blocks for each chromosome (pair).- The same function is available for pypairix (
bgzf_block_count
).
- The problem with
fragment_4dnpairs.pl
of adding an extra column is now fixed. - 1D querying on 2D data now works with
pypairix
(functionquerys2D
).
pairix
can now take 1D query for 2D data. e.g.)pairix file.gz 'chr22:50000-60000'
is equivalent topairix file.gz 'chr22:50000-60000|chr22:50000-60000'
if file.gz is 2D indexed.
pairix
now has option-a
(autoflip) that flips the query in case the matching chromosome pair doesn't exist in the file.pairix
now has option-W
that prints out region split character use for indexing a specific file.merge-pairs.sh
is now included inutil
.This one currently does not work.pairix
can now take 1D query for 2D data. e.g.)pairix file.gz 'chr22:50000-60000'
is equivalent topairix file.gz 'chr22:50000-60000|chr22:50000-60000'
if file.gz is 2D indexed.
- Two utils are added:
duplicate_header_remover.pl
andcolumn_remover.pl
for pairs file format correction.
pairix
has now option-w
which specifies region split character (default '|') during indexing. A query string should use this character as a separater.pypairix
also now has a parameterregion_split_character
in functionbuild_index
(default '|')juicer_shortform2pairs.pl
is now available in theutil
folder.- Index structure changed - please re-index if you're using an older version of index.
- Updated magic number for the new index, to avoid crash caused by different index structure.
- Total linecount is added to the index now with virtually no added runtime or memory for indexing (
pairix -n
andpypairix
get_linecount
to retrieve the total line count) - Index structure changed - please re-index if you're using an older version of index.
- fixed -s option still not working in
old_merged_nodups2pairs.pl
.
- fixed a newly introdued error in
fragment_4dnpairs.pl
- fixed --split option still not working in
merged_noduds2pairs.pl
andold_merged_nodups2pairs.pl
.
- fixed an issue of not excluding chromosomes that are not in chrome size file in
merged_noduds2pairs.pl
andold_merged_nodups2pairs.pl
- fixed --split option not working in
merged_noduds2pairs.pl
andold_merged_nodups2pairs.pl
- mapq filtering option (-m|--mapq) is added to
merged_noduds2pairs.pl
andold_merged_nodups2pairs.pl
- Now util has an (updated) fragment_4dnpairs.pl script in it, which allows adding juicer fragment index information to 4DN-DCIC style pairs file.
- Now pairix index contains a pairix-specific magic number that differentiates it from a tabix index.
- Pypairix produces a better error message when the index file doesn't exist (instead of throwing a segfault).
merged_nodup2pairs.pl
andold_merged_nodup2pairs.pl
now take chromsize file and adds chromsize in the output file header. Upper triangle is also defined according to the chromsize file.bam2pairs
: the option description in the usage printed out and the command field in the output pairs file has not been fixed. (-l instead of -5 for the opposite effect)pairix': command
pairix --helpnow exits 0 after printing usage (
pairix` alone exits 1 as before).
pypairix
: functionbuild_index
now has optionzero
which created a zero-based index (defaut 1-based).bam2pairs
: now adds chromsize in the header. Optionally takes chromsize file to define mate ordering and filter chromosomes. If chromsize file is not fed, the mate ordering is alphanumeric.pypairix
: functionsget_header
andget_chromsize
are added.- pairs format now has chromsize in the header as requirement.
- fixed usage print for
merged_nodup2pairs.pl
andold_merged_nodup2pairs.pl
. pypairix
: functionexists2
is added
- added build_index methods. Now you can build index files (.px2) using command line, R, or Python
- R: px_build_index() : see Rpairix
- Python: pypairix.build_index()
- tests updated
- pairs_format_specification updated
- Now custom column set overrides file extension recognition
- bam2pairs now records 5end of a read for position (instead of leftmost)
- Version number in the binaries fixed.
- Now all source files are in
src/
. pypairix
: functionexists
is addedpairix
: indexing presets (-p option) now includespairs
,merged_nodups
,old_merged_nodups
. It also automatically recognizes extension.pairs.gz
.- merged_nodups.tab examples are now deprecated (since the original space-delimited files can be recognized as well)
pairs_merger
: memory error fixed- updated tests