-
Notifications
You must be signed in to change notification settings - Fork 102
/
bismark_genome_preparation
executable file
·848 lines (682 loc) · 30.9 KB
/
bismark_genome_preparation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
#!/usr/bin/env perl
use strict;
use warnings;
use Cwd;
use Getopt::Long;
$|++;
## This program is Copyright (C) 2010-23, Felix Krueger (fkrueger@altoslabs.com)
## This program is free software: you can redistribute it and/or modify
## it under the terms of the GNU General Public License as published by
## the Free Software Foundation, either version 3 of the License, or
## (at your option) any later version.
## This program is distributed in the hope that it will be useful,
## but WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
## GNU General Public License for more details.
## You should have received a copy of the GNU General Public License
## along with this program. If not, see <http://www.gnu.org/licenses/>.
my $verbose;
my $help;
my $version;
my $man;
my $path_to_aligner;
my $parallel;
my $multi_fasta;
my $single_fasta;
my $bowtie2;
my $hisat2;
my $mm2;
my $slam;
my $large_index;
my $parent_dir = getcwd();
my $genomic_composition;
my %genomic_freqs; # storing the genomic sequence composition
my %freqs;
my $bismark_version = 'v0.24.2';
my $copyright_date = '2010-23';
my $last_modified = "19 May 2022";
my $command_line = GetOptions ('verbose' => \$verbose,
'help' => \$help,
'man' => \$man,
'version' => \$version,
'slam' => \$slam,
'path_to_aligner:s' => \$path_to_aligner,
'parallel:i' => \$parallel,
'single_fasta' => \$single_fasta,
'bowtie2' => \$bowtie2,
'hisat2' => \$hisat2,
'minimap2|mm2' => \$mm2,
'genomic_composition' => \$genomic_composition,
'large-index' => \$large_index,
);
die "Please respecify command line options\n\n" unless ($command_line);
if ($help or $man){
print_helpfile();
exit;
}
if ($version){
print << "VERSION";
Bismark - Bisulfite Mapper and Methylation Caller.
Bismark Genome Preparation Version: $bismark_version
Copyright $copyright_date, Felix Krueger, Altos Bioinformatics
https://github.com/FelixKrueger/Bismark
VERSION
exit;
}
my $genome_folder = shift @ARGV; # mandatory
my %chromosomes; # checking if chromosome names are unique (required)
# Ensuring a genome folder has been specified
if ($genome_folder){
unless ($genome_folder =~ /\/$/){
$genome_folder =~ s/$/\//;
}
$verbose and print "Path to genome folder specified as: $genome_folder\n";
chdir $genome_folder or die "Could't move to directory $genome_folder. Make sure the directory exists! $!";
# making the genome folder path abolsolute so it won't break if the path was specified relative
$genome_folder = getcwd();
unless ($genome_folder =~ /\/$/){
$genome_folder =~ s/$/\//;
}
}
else{
die "Please specify a genome folder to be used for bisulfite conversion\n\n";
}
if (defined $parallel){
die "--parallel should have a value of 2 or more. Please respecify\n\n" unless ($parallel > 1);
warn "Using $parallel threads for the top and bottom strand indexing processes each, so using ",$parallel*2," cores in total\n"; sleep(1);
}
else{
$parallel = 1;
}
if ($slam){
warn "This is a SLAM-seq test version. Proceed with care!\n\n"; sleep(3);
}
my $CT_dir;
my $GA_dir;
if ($hisat2){
if ($bowtie2){
die "You may not select both --bowtie2 and --hisat2. Make your pick! (default is Bowtie 2)\n";
}
if ($mm2){
die "You may not select both --minimap2 and --hisat2. Make your pick!\n";
}
$bowtie2 = 0;
$mm2 = 0;
$verbose and print "Aligner to be used: >> HISAT2 <<\n";
}
elsif($mm2){
if ($bowtie2){
die "You may not select both --minimap2 and --bowtie2. Make your pick! (default is Bowtie 2)\n";
}
$bowtie2 = 0;
$hisat2 = 0;
$verbose and print "Aligner to be used: >> Minimap2 <<\n";
}
else{ # Bowtie 2 the default mode
if ($bowtie2){
$verbose and print "Aligner to be used: >> Bowtie 2 << (user-defined)\n";
}
else{
$verbose and print "Aligner to be used: >> Bowtie 2 << (default)\n";
}
$bowtie2 = 1;
}
if ($single_fasta){
if ($mm2){
die "Minimap2 mode does not work in conjunction with --single_fasta. Please respecify!\n\n";
}
warn "Writing individual genomes out into single-entry fasta files (one per chromosome)\n\n";
$multi_fasta = 0;
}
else{
warn "Writing bisulfite genomes out into a single MFA (multi FastA) file\n\n";
$single_fasta = 0;
$multi_fasta = 1;
}
if ($slam){
warn "Genome will be generated and indexed in with in-silico T->C transitions, and NOT in BISULFITE MODE\n"; sleep(3);
if ($mm2){
die "Minimap2 mode does not work in conjunction with --slam. Please respecify!\n\n";
}
}
if($large_index){
if ($mm2){
die "Minimap2 mode does not work in conjunction with --large_index. Please drop one option and respecify!\n\n";
}
warn "Large-index specified. Forcing generated index to be 'large', even if reference has fewer than 4 billion nucleotides.\n\n"; sleep(1);
$large_index = '--large-index';
}
else{
$large_index = '';
}
my @filenames = create_bisulfite_genome_folders();
if ($genomic_composition){
get_genomic_frequencies();
warn "Finished processing genomic nucleotide frequencies\n\n";
%chromosomes = (); # resetting
}
process_sequence_files ();
launch_indexer();
sub launch_indexer{
if ($bowtie2){
warn "Bismark Genome Preparation - Step III: Launching the Bowtie 2 indexer\n";
}
elsif($mm2){
warn "Bismark Genome Preparation - Step III: Launching the Minimap2 indexing process\n";
}
else{
warn "Bismark Genome Preparation - Step III: Launching the HISAT2 indexer\n";
}
print "Please be aware that this process can - depending on genome size - take several hours!\n";
sleep(1);
### if the path to the aligner was specfified explicitely
if ($path_to_aligner){
if ($bowtie2){
$path_to_aligner =~ s/$/bowtie2-build/;
}
elsif ($mm2){
# Acccording to https://github.com/lh3/minimap2#general-usage, the indexing command is simply:
# minimap2 -d ref.mmi ref.fa # indexing
$path_to_aligner =~ s/$/minimap2/;
}
else{
$path_to_aligner =~ s/$/hisat2-build/;
}
}
### otherwise we assume that build path is in the PATH environment
else{
if ($bowtie2){
$path_to_aligner = 'bowtie2-build';
}
elsif ($mm2){
# Acccording to https://github.com/lh3/minimap2#general-usage, the indexing command is simply:
# minimap2 -d ref.mmi ref.fa # indexing
$path_to_aligner = 'minimap2';
}
else{
$path_to_aligner = 'hisat2-build';
}
}
$verbose and print "\n";
### Forking the program to run 2 instances of the indexing process (bowtie2-build, hisat2-build or minimap2)
my $pid = fork();
my $multicore = ''; # If not specified we simply use an empty string (= the default)
# In a conversation, Heng Li suggested:
# "Due to the reduced alphabet, you could probably increase the k-mer size such that "mid_occ" in stderr is about several hundred.
# the default is 15. See here for a little benchmark: https://github.com/FelixKrueger/Bismark/issues/446
# -k 20 seems to be something like a sweet spot, but I'd be happy to open it up as an adjustable parameter if that helps
my $kmer = "-k 20";
if ($parallel) {
if ($mm2){
# minimap2 indexes with a maximum of 3 cores
$multicore = "-t $parallel";
}
else{
$multicore = "--threads $parallel";
}
}
# parent process
if ($pid){
sleep(1);
chdir $CT_dir or die "Unable to change directory: $!\n";
$verbose and warn "Preparing indexing of CT converted genome in $CT_dir\n";
my @fasta_files = <*.fa>;
my $file_list = join (',',@fasta_files);
my $cmd;
if ($mm2){
$cmd = "$path_to_aligner $kmer $multicore -d BS_CT.mmi $file_list";
}
else{
$cmd = "$path_to_aligner $multicore $large_index -f $file_list BS_CT";
}
$verbose and print "Parent process: Starting to index C->T converted genome with the following command:\n\n";
$verbose and print "$cmd\n\n";
system ($cmd)== 0 or die "Parent process: Failed to build index\n";
waitpid( $pid, 0 );
$? == 0 or die "Parent process: Child process failed\n";
warn "\n=========================================\n\nParallel genome indexing complete. Enjoy!\n\n";
}
# child process
elsif ($pid == 0){
sleep(2);
chdir $GA_dir or die "Unable to change directory: $!\n";
$verbose and warn "Preparing indexing of GA converted genome in $GA_dir\n";
my @fasta_files = <*.fa>;
my $file_list = join (',',@fasta_files);
my $cmd;
if ($mm2){
$cmd = "$path_to_aligner $kmer $multicore -d BS_GA.mmi $file_list";
}
else{
$cmd = "$path_to_aligner $multicore $large_index -f $file_list BS_GA";
}
$verbose and print "Child process: Starting to index G->A converted genome with the following command:\n\n";
$verbose and print "$cmd\n\n";
exec ($cmd);
}
# if the platform doesn't support the fork command we will run the indexing processes one after the other
else{
print "Forking process was not successful, therefore performing the indexing sequentially instead\n";
### moving to CT genome folder
$verbose and warn "Preparing to index CT converted genome in $CT_dir\n";
chdir $CT_dir or die "Unable to change directory: $!\n";
my @fasta_files = <*.fa>;
my $file_list = join (',',@fasta_files);
$verbose and print "$file_list\n\n";
my $cmd;
if ($mm2){
$cmd = "$path_to_aligner $multicore -d BS_CT.mmi $file_list";
}
else{
$cmd = "$path_to_aligner $large_index -f $file_list BS_CT";
}
$verbose and print "Child process: Starting to index C->T converted genome with the following command:\n\n";
$verbose and print "$cmd\n\n";
system ($cmd)== 0 or die "Failed to build C>T index\n\n";
# Resetting
@fasta_files=();
$file_list= '';
$cmd = '';
### moving to GA genome folder
$verbose and warn "Preparing to index GA converted genome in $GA_dir\n";
chdir $GA_dir or die "Unable to change directory: $!\n";
@fasta_files = <*.fa>;
$file_list = join (',',@fasta_files);
$verbose and print "$file_list\n\n";
if ($mm2){
$cmd = "$path_to_aligner $multicore -d BS_GA.mmi $file_list";
}
else{
$cmd = "$path_to_aligner $large_index -f $file_list BS_GA";
}
$verbose and print "Child process: Starting to index G->A converted genome with the following command:\n\n";
$verbose and print "$cmd\n\n";
system ($cmd)== 0 or die "Failed to build G>A index\n\n";;
warn "Sequential genome indexing complete\n\n";
}
}
sub process_sequence_files {
my ($total_CT_conversions,$total_GA_conversions) = (0,0);
$verbose and print "Bismark Genome Preparation - Step II: Bisulfite converting reference genome\n\n";
sleep (1);
chdir $genome_folder or die "Could't move to directory $genome_folder $!";
$verbose and print "conversions performed:\n";
if ($slam){
$verbose and print join("\t",'chromosome','T->C','A->G'),"\n";
}
else{
$verbose and print join("\t",'chromosome','C->T','G->A'),"\n";
}
### If someone wants to index a genome which consists of thousands of contig and scaffold files we need to write the genome conversions into an MFA file
### Otherwise the list of comma separated chromosomes we provide for bowtie-build will get too long for the kernel to handle
### This is now the default option
if ($multi_fasta){
### Here we just use one multi FastA file name, append .CT_conversion or .GA_conversion and print all sequence conversions into these files
my $bisulfite_CT_conversion_filename = "$CT_dir/genome_mfa.CT_conversion.fa";
open (CT_CONVERT,'>',$bisulfite_CT_conversion_filename) or die "Can't write to file $bisulfite_CT_conversion_filename: $!\n";
my $bisulfite_GA_conversion_filename = "$GA_dir/genome_mfa.GA_conversion.fa";
open (GA_CONVERT,'>',$bisulfite_GA_conversion_filename) or die "Can't write to file $bisulfite_GA_conversion_filename: $!\n";
}
foreach my $filename(@filenames){
my ($chromosome_CT_conversions,$chromosome_GA_conversions) = (0,0);
if ($filename =~ /\.gz$/){
open (IN,"gunzip -c $filename |") or die "Failed to read from sequence file $filename $!\n";
}
else{
open (IN,$filename) or die "Failed to read from sequence file $filename $!\n";
}
# warn "Reading chromosome information from $filename\n\n";
### first line needs to be a fastA header
my $first_line = <IN>;
chomp $first_line;
### Extracting chromosome name from the FastA header
my $chromosome_name = extract_chromosome_name($first_line);
### Exiting if a chromosome with the same name was present already
if (exists $chromosomes{$chromosome_name}){
die "Exiting because chromosome name '$chromosome_name' already exists. Please make sure all chromosomes have a unique name!\n";
}
else{
$chromosomes{$chromosome_name}++;
}
### alternatively, chromosomes can be written out into single-entry FastA files. This will only work for genomes with up to a few hundred chromosomes.
unless ($multi_fasta){
my $bisulfite_CT_conversion_filename = "$CT_dir/$chromosome_name";
$bisulfite_CT_conversion_filename =~ s/$/.CT_conversion.fa/;
open (CT_CONVERT,'>',$bisulfite_CT_conversion_filename) or die "Can't write to file $bisulfite_CT_conversion_filename: $!\n";
my $bisulfite_GA_conversion_filename = "$GA_dir/$chromosome_name";
$bisulfite_GA_conversion_filename =~ s/$/.GA_conversion.fa/;
open (GA_CONVERT,'>',$bisulfite_GA_conversion_filename) or die "Can't write to file $bisulfite_GA_conversion_filename: $!\n";
}
print CT_CONVERT ">",$chromosome_name,"_CT_converted\n"; # first entry
print GA_CONVERT ">",$chromosome_name,"_GA_converted\n"; # first entry
### TODO: Change this for GrandSlam
while (<IN>){
### in case the line is a new fastA header
if ($_ =~ /^>/){
### printing out the stats for the previous chromosome
$verbose and print join ("\t",$chromosome_name,$chromosome_CT_conversions,$chromosome_GA_conversions),"\n";
### resetting the chromosome transliteration counters
($chromosome_CT_conversions,$chromosome_GA_conversions) = (0,0);
### Extracting chromosome name from the additional FastA header
$chromosome_name = extract_chromosome_name($_);
### alternatively, chromosomes can be written out into single-entry FastA files. This will only work for genomes with up to a few hundred chromosomes.
unless ($multi_fasta){
my $bisulfite_CT_conversion_filename = "$CT_dir/$chromosome_name";
$bisulfite_CT_conversion_filename =~ s/$/.CT_conversion.fa/;
open (CT_CONVERT,'>',$bisulfite_CT_conversion_filename) or die "Can't write to file $bisulfite_CT_conversion_filename: $!\n";
my $bisulfite_GA_conversion_filename = "$GA_dir/$chromosome_name";
$bisulfite_GA_conversion_filename =~ s/$/.GA_conversion.fa/;
open (GA_CONVERT,'>',$bisulfite_GA_conversion_filename) or die "Can't write to file $bisulfite_GA_conversion_filename: $!\n";
}
print CT_CONVERT ">",$chromosome_name,"_CT_converted\n";
print GA_CONVERT ">",$chromosome_name,"_GA_converted\n";
}
else{
my $sequence = uc$_;
### (I) First replacing all ambiguous sequence characters (such as M,S,R....) by N (G,A,T,C,N and the line endings \r and \n are added to a character group)
$sequence =~ s/[^ATCGN\n\r]/N/g;
### (II) Writing the chromosome out into a C->T converted version (equals forward strand conversion)
my $CT_sequence = $sequence;
my $CT_transliterations_performed;
if ($slam){
$CT_transliterations_performed = ($CT_sequence =~ tr/T/C/); # converts all Ts into Cs for SLAM-Seq
}
else{
$CT_transliterations_performed = ($CT_sequence =~ tr/C/T/); # converts all Cs into Ts
}
$total_CT_conversions += $CT_transliterations_performed;
$chromosome_CT_conversions += $CT_transliterations_performed;
print CT_CONVERT $CT_sequence;
### (III) Writing the chromosome out in a G->A converted version of the forward strand (this is equivalent to reverse-
### complementing the forward strand and then C->T converting it)
my $GA_sequence = $sequence;
my $GA_transliterations_performed;
if ($slam){
$GA_transliterations_performed = ($GA_sequence =~ tr/A/G/); # converts all As to Gs on the forward strand, for SLAM-Seq
}
else{
$GA_transliterations_performed = ($GA_sequence =~ tr/G/A/); # converts all Gs to As on the forward strand
}
$total_GA_conversions += $GA_transliterations_performed;
$chromosome_GA_conversions += $GA_transliterations_performed;
print GA_CONVERT $GA_sequence;
}
}
$verbose and print join ("\t",$chromosome_name,$chromosome_CT_conversions,$chromosome_GA_conversions),"\n";
}
close (CT_CONVERT) or die "Failed to close filehandle: $!\n";
close (GA_CONVERT) or die "Failed to close filehandle: $!\n";
print "\nTotal number of conversions performed:\n";
if ($slam){
print "T->C:\t$total_CT_conversions\n";
print "A->G:\t$total_GA_conversions\n";
warn "\nStep II - Genome SLAM conversions - completed\n\n\n";
}
else{
print "C->T:\t$total_CT_conversions\n";
print "G->A:\t$total_GA_conversions\n";
warn "\nStep II - Genome bisulfite conversions - completed\n\n\n";
}
}
sub get_genomic_frequencies{
warn "Calculating genomic frequencies (this may take several minutes depending on genome size) ...\n";
warn "="x164,"\n";
read_genome_into_memory($parent_dir);
foreach my $chr (keys %chromosomes){
warn "Processing chromosome >> $chr <<\n";
process_sequence($chromosomes{$chr});
}
%genomic_freqs = %freqs;
### Attempting to write a genomic nucleotide frequency table out to the genome folder so we can re-use it next time without the need to re-calculate
### if this fails
if ( open (FREQS,'>',"${genome_folder}genomic_nucleotide_frequencies.txt") ){
warn "Writing genomic nucleotide frequencies to the file >${genome_folder}genomic_nucleotide_frequencies.txt< for future re-use\n";
foreach my $f(sort keys %genomic_freqs){
warn "Writing count of (di-)nucleotide: $f\t$genomic_freqs{$f}\n";
print FREQS "$f\t$genomic_freqs{$f}\n";
}
close FREQS or warn "Failed to close filehandle FREQS: $!\n\n";
}
else{
warn "Failed to write out file ${genome_folder}genomic_nucleotide_frequencies.txt because of: $!. Skipping writing out genomic frequency table\n\n";
}
}
sub process_sequence{
my $seq = shift;
my $mono;
my $di;
foreach my $index (0..(length$seq)-1){
my $counted = 0;
if ($index%10000000==0){
# warn "Current index number is $index\n";
}
$mono = substr($seq,$index,1);
unless ( $mono eq 'N'){
$freqs{$mono}++;
}
unless ( ($index + 2) > length$seq ){
$di = substr($seq,$index,2);
if (index($di,'N') < 0) {
$freqs{$di}++;
}
}
}
}
sub extract_chromosome_name {
## Bowtie extracts the first string after the inition > in the FASTA file, so we are doing this as well
my $fasta_header = shift;
if ($fasta_header =~ s/^>//){
my ($chromosome_name) = split (/\s+/,$fasta_header);
return $chromosome_name;
}
else{
die "The specified chromosome ($fasta_header) file doesn't seem to be in FASTA format as required!\n";
}
}
sub create_bisulfite_genome_folders{
$verbose and print "Bismark Genome Preparation - Step I: Preparing folders\n\n";
if ($path_to_aligner){
unless ($path_to_aligner =~ /\/$/){
$path_to_aligner =~ s/$/\//;
}
if (chdir $path_to_aligner){
if ($bowtie2){
$verbose and print "Path to Bowtie 2 specified as: $path_to_aligner\n";
}
else{
$verbose and print "Path to HISAT2 specified as: $path_to_aligner\n";
}
}
else{
die "There was an error with the path to the aligner: $!\n";
}
}
chdir $genome_folder or die "Could't move to directory $genome_folder. Make sure the directory exists! $!";
# Exiting unless there are fastA files in the folder
my @filenames = <*.fa>;
### if there aren't any genomic files with the extension .fa we will look for files with the extension .fa.gz
unless (@filenames){
@filenames = <*.fa.gz>;
}
### if there aren't any genomic files with the extension .fa or .fa.gz we will look for files with the extension .fasta
unless (@filenames){
@filenames = <*.fasta>;
}
### if there aren't any genomic files with the extension .fa, .fa.gz or .fasta we will look for files with the extension .fasta.gz
unless (@filenames){
@filenames = <*.fasta.gz>;
}
unless (@filenames){
die "The specified genome folder $genome_folder does not contain any sequence files in FastA format (with .fa, .fa.gz, .fasta or .fasta.gz file extensions)\n";
}
warn "Bisulfite Genome Indexer version $bismark_version (last modified: $last_modified)\n";
sleep (1);
# creating a directory inside the genome folder to store the bisfulfite genomes unless it already exists
my $bisulfite_dir = "${genome_folder}Bisulfite_Genome/";
unless (-d $bisulfite_dir){
mkdir $bisulfite_dir or die "Unable to create directory $bisulfite_dir $!\n";
$verbose and print "Created Bisulfite Genome folder $bisulfite_dir\n";
}
else{
print "\nA directory called $bisulfite_dir already exists. Already existing converted sequences and/or already existing Bowtie 2, HISAT2 or Minimap2) indices will be overwritten!\n\n";
}
chdir $bisulfite_dir or die "Unable to move to $bisulfite_dir\n";
$CT_dir = "${bisulfite_dir}CT_conversion/";
$GA_dir = "${bisulfite_dir}GA_conversion/";
# creating 2 subdirectories to store a C->T (forward strand conversion) and a G->A (reverse strand conversion)
# converted version of the genome
unless (-d $CT_dir){
mkdir $CT_dir or die "Unable to create directory $CT_dir $!\n";
$verbose and print "Created Bisulfite Genome folder $CT_dir\n";
}
unless (-d $GA_dir){
mkdir $GA_dir or die "Unable to create directory $GA_dir $!\n";
$verbose and print "Created Bisulfite Genome folder $GA_dir\n";
}
# moving back to the original genome folder
chdir $genome_folder or die "Could't move to directory $genome_folder $!";
# $verbose and print "Moved back to genome folder folder $genome_folder\n";
warn "\nStep I - Prepare genome folders - completed\n\n\n";
return @filenames;
}
sub read_genome_into_memory{
## reading in and storing the specified genome in the %chromosomes hash
chdir ($genome_folder) or die "Can't move to $genome_folder: $!";
warn "Now reading in and storing sequence information of the genome specified in: $genome_folder\n\n";
my @chromosome_filenames = <*.fa>;
### if there aren't any genomic files with the extension .fa we will look for files with the extension .fa.gz
unless (@chromosome_filenames){
@chromosome_filenames = <*.fa.gz>;
}
### if there aren't any genomic files with the extension .fa or .fq.gz we will look for files with the extension .fasta
unless (@chromosome_filenames){
@chromosome_filenames = <*.fasta>;
}
### if there aren't any genomic files with the extension .fa or .fa.gz or .fasta we will look for files with the extension .fasta.gz
unless (@chromosome_filenames){
@chromosome_filenames = <*.fasta.gz>;
}
unless (@chromosome_filenames){
die "The specified genome folder $genome_folder does not contain any sequence files in FastA format (with .fa, .fa.gz, .fasta or .fasta.gz file extensions)\n";
}
foreach my $chromosome_filename (@chromosome_filenames){
# skipping the tophat entire mouse genome fasta file
next if ($chromosome_filename eq 'Mus_musculus.NCBIM37.fa');
if( $chromosome_filename =~ /\.gz$/){
open (CHR_IN,"gunzip -c $chromosome_filename |") or die "Failed to read from sequence file $chromosome_filename $!\n";
}
else{
open (CHR_IN,$chromosome_filename) or die "Failed to read from sequence file $chromosome_filename $!\n";
}
### first line needs to be a fastA header
my $first_line = <CHR_IN>;
chomp $first_line;
$first_line =~ s/\r//; # removing /r carriage returns
### Extracting chromosome name from the FastA header
my $chromosome_name = extract_chromosome_name($first_line);
my $sequence;
while (<CHR_IN>){
chomp;
$_ =~ s/\r//; # removing /r carriage returns
if ($_ =~ /^>/){
### storing the previous chromosome in the %chromosomes hash, only relevant for Multi-Fasta-Files (MFA)
if (exists $chromosomes{$chromosome_name}){
warn "chr $chromosome_name (",length $sequence ," bp)\n";
die "Exiting because chromosome name '$chromosome_name' already exists. Please make sure all chromosomes have a unique name!\n";
}
else {
if (length($sequence) == 0){
warn "Chromosome $chromosome_name in the multi-fasta file $chromosome_filename did not contain any sequence information!\n";
}
warn "chr $chromosome_name (",length $sequence ," bp)\n";
$chromosomes{$chromosome_name} = $sequence;
}
### resetting the sequence variable
$sequence = '';
### setting new chromosome name
$chromosome_name = extract_chromosome_name($_);
}
else{
$sequence .= uc$_;
}
}
if (exists $chromosomes{$chromosome_name}){
warn "chr $chromosome_name (",length $sequence ," bp)\t";
die "Exiting because chromosome name '$chromosome_name' already exists. Please make sure all chromosomes have a unique name.\n";
}
else{
if (length($sequence) == 0){
warn "Chromosome $chromosome_name in the file $chromosome_filename did not contain any sequence information!\n";
}
warn "chr $chromosome_name (",length $sequence ," bp)\n";
$chromosomes{$chromosome_name} = $sequence;
}
}
warn "\n";
chdir $parent_dir or die "Failed to move to directory $parent_dir\n";
}
sub print_helpfile{
print << 'HOW_TO';
DESCRIPTION
This script is supposed to convert a specified reference genome into two different bisulfite converted
versions and index them for alignments with Bowtie 2 (default), HISAT2 or Minimap2. The first bisulfite
genome will have all Cs converted to Ts (C->T), and the other one will have all Gs converted to As (G->A).
Both bisulfite genomes will be stored in subfolders within the reference genome folder. Once the bisulfite
conversion has been completed, the program will fork and launch two simultaneous instances of the Bowtie 2,
HISAT2 or minimap2 indexer (bowtie2-build or hisat2-build or minimap2 -d, resepctively). Be aware that the
indexing process can take up to several hours; this will mainly depend on genome size and system resources.
The new, and still experimental, --slam mode will produce T->C and A->G converted genomes instead.
The structure of the genome folder will remain the same as for BS-Seq data. This might, or might not,
change in a future release.
The following is a brief description of command line options and arguments to control the
Bismark Genome Preparation:
USAGE: bismark_genome_preparation [options] <arguments>
OPTIONS:
--help Displays this help file and exits.
--version Displays version information and exits.
--verbose Print verbose output for more details or debugging.
--path_to_aligner </../> The full path to the Bowtie 2, HISAT2 or minimap2 installation folder on your system
(depending on which aligner/indexer you intend to use; please note that this
is the folder and not any executable). Unless this path is specified, it is
assumed that the aligner in question (Bowtie 2/HISAT2/minimap2) is in the PATH.
--bowtie2 This will create bisulfite indexes for use with Bowtie 2. Recommended for most bisulfite
sequencing applications (Default: ON).
--hisat2 This will create bisulfite indexes for use with HISAT2. This is recommended for specialist
applications such as RNA methylation analyses or SLAM-seq type applications (see also: --slam).
(Default: OFF).
--minimap2/--mm2 This will create bisulfite indexes for use with minimap2 (https://github.com/lh3/minimap2).
This is recommended only for specialist applications such as EM-seq with ONT
(Oxford Nanopore Technologies) or PacBio reads. (Default: OFF).
--parallel INT Use several threads for each indexing process to speed up the genome preparation step.
Remember that the indexing is run twice in parallel already (for the top and bottom strand
separately), so e.g. '--parallel 4' will use 8 threads in total. Please also see --large-index
for parallel processing of VERY LARGE genomes (e.g. the axolotl)
--single_fasta Instruct the Bismark Indexer to write the converted genomes into
single-entry FastA files instead of making one multi-FastA file (MFA)
per chromosome. This might be useful if individual bisulfite converted
chromosomes are needed (e.g. for debugging), however it can cause a
problem with indexing if the number of chromosomes is vast (this is likely
to be in the range of several thousand files; the operating system can
only handle lists up to a certain length, and some newly assembled
genomes may contain 20000-500000 contigs of scaffold files which do exceed
this list length limit). Does not work in conjunction with --minimap2.
--genomic_composition Calculate and extract the genomic sequence composition for mono and di-nucleotides
and write the genomic composition table 'genomic_nucleotide_frequencies.txt' to the
genome folder. This may be useful later on when using bam2nuc or the Bismark option
--nucleotide_coverage.
--slam Instead of performing an in-silico bisulfite conversion, this mode transforms T to C (forward strand),
or A to G (reverse strand). The folder structure and rest of the indexing process is currently
exactly the same as for bisulfite sequences, but this might change at some point. This means
that a genome prepared in --slam mode is currently indistinguishable from a true Bisulfite Genome,
so please make sure you name the genome folder appropriately to avoid confusion. Does not
work in conjunction with --minimap2.
--large-index Force generated index to be 'large', even if reference has fewer than 4 billion nucleotides. At the time
of writing this is required for parallel processing of VERY LARGE genomes (e.g. the axolotl). Does not
work in conjunction with --minimap2.
ARGUMENTS:
<path_to_genome_folder> The path to the folder containing the genome to be bisulfite converted. The Bismark Genome
Preparation expects one or more fastA files in the folder (with the file extension: .fa or
.fasta (also ending in .gz)). Specifying this path is mandatory.
This script was last modified on 19 May 2022.
HOW_TO
}