-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathreadme.txt
109 lines (83 loc) · 4.29 KB
/
readme.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
copMEM, v0.2
============
copMEM is a program for efficient computation of MEMs (Maximal Exact Matches) in a pair of genomes.
Its main algorithmic idea requires that two internal parameters (k1 and k2) are coprime, hence the name.
Installation:
-------------
1) Linux
Use a 64-bit OS (e.g., Ubuntu / g++ 7.x).
Just "make all" in the directory with extracted files.
If you see any error messages, please contact the authors (wbieniec@kis.p.lodz.pl or sgrabow@kis.p.lodz.pl).
2) Windows (64-bit).
We recommend using Visual Studio 2017. The package contains the project file (.sln).
Of course, compile in Release 64-bit mode.
Usage:
------
./copmem -o <MEMs_file> [options] <R> <Q>
where the obligatory parameters are:
-o <MEMs_file> output file with the found MEM information,
<R> reference file with one or more FASTA sequences,
<Q> query file with one or more FASTA sequences,
and the optional one:
-l n minimal MEM length (the default value is 100). This value must be >= 50,
-v | -q verbose mode (prints more details) | quiet mode,
-f fast mode, loads two files in memory, faster I/O operations,
-r calculates reverse complement matches. Does not go with -f switch,
-b calculates forward and reverse complement matches. Does not go with -f switch,
-K 36|44|56 determines length of k-mer for hash calculation. Default is 44,
-e forces k2 == 1,
-H 1|2|3|4|5 selects hash function. Default is 1.
1: maRushPrime1HashSimplified (http://www.amsoftware.narod.ru/algo2.html);
2: xxHash32 by Yann Collet (https://github.com/Cyan4973/xxHash);
3: xxHash64 by Yann Collet (https://github.com/Cyan4973/xxHash);
4: MetroHash64 by J. Andrew Rogers (https://github.com/jandrewrogers/MetroHash);
5: CityHash64 from Google (https://github.com/google/cityhash).
copMEM prints maximal exact matches to the -o file.
Currently, copMEM is a stand-alone program. In a future version we are going to allow to use it also as a drop-in replacement for MUMmer.
MemoryFill:
-----------
MemoryFill is an auxiliary program that enables "cold start", especially for several runs with the same input files.
Usage:
./memoryfill <amount_of_memory>
Use with <amount_of_memory> close to the amount of the installed physical memory. One can use values succeeded by M or G, like 14G.
M = 2^20, G = 2^30.
Demo:
-----
Download genome archives:
Homo sapiens ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz
Mus musculus ftp://hgdownload.cse.ucsc.edu/goldenPath/mm10/bigZips/chromFa.tar.gz
Unzip files into separate directories and concatenate single sequence files info one file, e.g.:
In Windows:
type mus\*.fa > mus.all.fa
type hum\*.fa > hum.all.fa
In Linux:
cat mus/*.fa > mus.all.fa
cat hum/*.fa > hum.all.fa
Run demo.cmd or demo.sh depending on the operating system.
Notes:
------
In the current version, copMEM is single-threaded. Despite that, it still works faster than its contenders (e.g., about twice faster than E-MEM using 10 threads on a 6c/12t i7 machine).
The memory usage of copMEM is, roughly,
|R| + |maxQSeq| + 4 * (2^29 + |R| / k1) + |output|
bytes, where
maxQSeq is the longest sequence (i.e., chromosome) from file Q, |output| is the maximum output size *per sequence from file Q*, and k1 is an internal parameter approximately equal to sqrt(L - 44). For example, k1 = 8 for L = 100 and k1 = 13 for L = 200.
If the datasets are larger than >4 GB (e.g., Triticum plants), replace the multiplier 4 with 8 in the memory formula above.
The value of 2^29 is the number of slots in the hash table.
Cite:
-----
If you find copMEM useful, please cite (at this moment, only an arXiv report exists):
https://arxiv.org/abs/1805.08816
Szymon Grabowski, Wojciech Bieniecki
"copMEM: Finding maximal exact matches via sampling both genomes"
References:
-----------
[E-MEM, an important competitor]
N. Khiste, L. Ilie
E-MEM: efficient computation of maximal exact matches for very large genomes
Bioinformatics 31(4), 2015, pp. 509--514
(https://academic.oup.com/bioinformatics/article/31/4/509/2748225)
[essaMEM]
M. Vyverman, B. De Baets, V. Fack, P. Dawyndt
essaMEM: finding maximal exact matches using enhanced sparse suffix arrays
Bioinformatics 29(6), 2013, pp. 802--804
(https://academic.oup.com/bioinformatics/article/29/6/802/184020)