Skip to content

Commit

Permalink
Simplify file input
Browse files Browse the repository at this point in the history
  • Loading branch information
Alexandre Souvorov committed Oct 9, 2018
1 parent bd6fafb commit b838b97
Show file tree
Hide file tree
Showing 4 changed files with 84 additions and 57 deletions.
43 changes: 25 additions & 18 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,10 +1,11 @@
# SKESA - Strategic Kmer Extension for Scrupulous Assemblies
Version 2.2
Version 2.3

For questions regarding SKESA, please contact
Alexandre Souvorov (souvorov@ncbi.nlm.nih.gov)
Richa Agarwala (agarwala@ncbi.nlm.nih.gov)

Please [cite](#citation) our paper.

## Compilation

Expand Down Expand Up @@ -62,7 +63,8 @@ For questions regarding SKESA, please contact
times for different runs) [string]
--fastq arg Input fastq file(s) (could be used multiple
times for different runs) [string]
--gz Input fasta/fastq files are gzipped [flag]
--use_paired_ends Indicates that a single (not comma separated)
fasta/fastq file contains paired reads [flag]
--sra_run arg Input sra run accession (could be used multiple
times for different runs) [string]
--contigs_out arg Output file for contigs (stdout if not
Expand All @@ -76,8 +78,6 @@ For questions regarding SKESA, please contact
the maximal kmer length in reads [integer]
--vector_percent arg (=0.05) Count for vectors as a fraction of the read
number (1. disables) [float (0,1]]
--use_paired_ends Use pairing information from paired reads in
input [flag]
--insert_size arg Expected insert size for paired reads (if not
provided, it will be estimated) [integer]
--steps arg (=11) Number of assembly iterations from minimal to
Expand All @@ -90,6 +90,7 @@ For questions regarding SKESA, please contact
--allow_snps Allow additional step for snp discovery [flag]

Debugging options:
--force_single_ends Don't use paired-end information [flag]
--seeds arg Input file with seeds [string]
--all arg Output fasta for each iteration [string]
--dbg_out arg Output kmer file [string]
Expand All @@ -102,7 +103,7 @@ For questions regarding SKESA, please contact


## Short description

SKESA is a de-novo sequence read assembler for microbial genomes
based on DeBruijn graphs. It uses conservative heuristics and is designed to
create breaks at repeat regions in the genome. This leads to excellent sequence
Expand All @@ -114,18 +115,24 @@ For questions regarding SKESA, please contact

SKESA can process read information by accessing reads from SRA (option --sra_run)
or from files in fasta (option --fasta) or fastq (option --fastq) format. Any
combination of input streams is allowed. For paired reads, if a single file is
specified, reads are expected to be interleaved with first mate followed by the
second. To specify a separate file for each mate, filenames separated by a comma
for first mate followed by the second mate are listed and in this case, the order
of reads is expected to be same in files for both mates. A limitation of the current
release is that in case multiple streams of paired reads are provided, it is assumed
that all streams have the same insert size. User can explicitly specify expected
combination of input streams is allowed. Files could be gzipped, which is recognized
automatically.

When accessing reads from SRA SKESA automatically determines if the read set consists of
paired-end or single-end reads. For fasta/fastq input of paired reads with separate files
for each mate, filenames separated by a comma for first mate followed by the second mate
are listed and in this case, the order of reads is expected to be same in files for both mates.
Alternatively, a single file with both mates could be specified. In this case the reads are
expected to be interleaved with first mate followed by the second, and the option --use_paired_ends
must be used.

A limitation of the current release is that in case multiple streams of paired reads are provided,
it is assumed that all streams have the same insert size. User can explicitly specify expected
insert size for the reads (option --insert_size). Otherwise, a sample of input
reads is used to estimate the expected insert size. This sampling may lead to very
small differences in assembly of the same read set if the order of reads is different
and selected sample gives a difference in expected insert size.

Two additional options users may wish to specify depending on the resources
available to them are as follows:
1. the number of cores (option --cores) and
Expand Down Expand Up @@ -168,23 +175,23 @@ For questions regarding SKESA, please contact

Example of an assembly that directly accesses SRA for a paired read set SRR1960353 is:

$ skesa --sra_run SRR1960353 --cores 4 --memory 48 --use_paired_ends > SRR1960353.skesa.fa
$ skesa --sra_run SRR1960353 --cores 4 --memory 48 > SRR1960353.skesa.fa

Example of an assembly that uses separate fastq files for each mate of SRR1703350 is:

$ skesa --fastq SRR1703350_1.fq,SRR1703350_2.fq --cores 4 --memory 48 --use_paired_ends > SRR1703350.skesa.fa
$ skesa --fastq SRR1703350_1.fq,SRR1703350_2.fq --cores 4 --memory 48 > SRR1703350.skesa.fa

Example of an assembly that uses interleaved mates for SRR1703350 as fastq input is:

$ skesa --fastq SRR1703350.fq --cores 4 --memory 48 --use_paired_ends > SRR1703350.skesa.fa
$ skesa --fastq SRR1703350.fq --use_paired_ends --cores 4 --memory 48 > SRR1703350.skesa.fa

Example of an assembly that uses reads from SRA for SRR1695624 and gzipped fasta for SRR1745628 is:

$ skesa --sra_run SRR1695624 --fasta SRR1745628.fa.gz --gz --cores 4 --memory 48 --use_paired_ends > SAMN03218571.skesa.fa
$ skesa --sra_run SRR1695624 --fasta SRR1745628.fa.gz --use_paired_ends --cores 4 --memory 48 > SAMN03218571.skesa.fa

Example of the same assembly as above done with both runs accessed from SRA is:

$ skesa --sra_run SRR1695624 --sra_run SRR1745628 --cores 4 --memory 48 --use_paired_ends > SAMN03218571.skesa.fa
$ skesa --sra_run SRR1695624 --sra_run SRR1745628 --cores 4 --memory 48 > SAMN03218571.skesa.fa

## Citation

Expand Down
23 changes: 15 additions & 8 deletions assembler.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -63,7 +63,6 @@ namespace DeBruijn {
// steps - number of assembly iterations from minimal to maximal kmer size in reads
// min_count - minimal kmer count to be included in a de Bruijn graph
// min_kmer - the minimal kmer size for the main steps
// usepairedends - whether or not to use paired ends
// max_kmer_paired - insert size (0 if not known)
// maxkmercount - the minimal average count for estimating the maximal kmer
// memory - the upper bound for memory use (GB)
Expand All @@ -73,9 +72,10 @@ namespace DeBruijn {
using GraphDigger = CDBGraphDigger<DBGraph>;

template<typename... GraphArgs>
CDBGAssembler(double fraction, int jump, int low_count, int steps, int min_count, int min_kmer, bool usepairedends,
int max_kmer_paired, int maxkmercount, int ncores, list<array<CReadHolder,2>>& raw_reads, TStrList seeds, bool allow_snps, bool estimate_min_count, GraphArgs... gargs) :
m_fraction(fraction), m_jump(jump), m_low_count(low_count), m_steps(steps), m_min_count(min_count), m_min_kmer(min_kmer), m_usepairedends(usepairedends),
CDBGAssembler(double fraction, int jump, int low_count, int steps, int min_count, int min_kmer, bool forcesinglereads,
int max_kmer_paired, int maxkmercount, int ncores, list<array<CReadHolder,2>>& raw_reads, TStrList seeds,
bool allow_snps, bool estimate_min_count, GraphArgs... gargs) :
m_fraction(fraction), m_jump(jump), m_low_count(low_count), m_steps(steps), m_min_count(min_count), m_min_kmer(min_kmer),
m_max_kmer_paired(max_kmer_paired), m_maxkmercount(maxkmercount), m_ncores(ncores), m_average_count(0), m_raw_reads(raw_reads) {

m_max_kmer = m_min_kmer;
Expand All @@ -88,10 +88,18 @@ namespace DeBruijn {

double total_seq = 0;
size_t total_reads = 0;
size_t paired = 0;
for(auto& reads : m_raw_reads) {
if(forcesinglereads) {
for(CReadHolder::string_iterator is = reads[0].sbegin(); is != reads[0].send(); ++is)
reads[1].PushBack(is);
reads[0].Clear();
}
total_seq += reads[0].TotalSeq()+reads[1].TotalSeq();
total_reads += reads[0].ReadNum()+reads[1].ReadNum();
paired += reads[0].ReadNum();
}
bool usepairedends = paired > 0;

//graph for minimal kmer
double average_count = GetGraph(m_min_kmer, m_raw_reads, true, estimate_min_count ? total_seq : 0, gargs...);
Expand Down Expand Up @@ -192,8 +200,8 @@ namespace DeBruijn {
cerr << endl << "Average count: " << average_count << " Max kmer: " << m_max_kmer << endl;

//estimate insert size
if(steps > 1 || m_usepairedends) {
if(m_max_kmer_paired == 0) {
if(steps > 1 || usepairedends) {
if(m_max_kmer_paired == 0 && usepairedends) {
size_t mates = 0;
for(auto& rh : m_raw_reads)
mates += rh[0].ReadNum();
Expand Down Expand Up @@ -273,7 +281,7 @@ namespace DeBruijn {
}

// three additional iterations with kmers (usually) longer than read length and upto insert size
if(m_usepairedends && m_insert_size > 0 && m_max_kmer_paired > 1.5*m_max_kmer) {
if(usepairedends && m_insert_size > 0 && m_max_kmer_paired > 1.5*m_max_kmer) {
ConnectPairsIteratively();

array<int,3> long_kmers;
Expand Down Expand Up @@ -887,7 +895,6 @@ namespace DeBruijn {
int m_steps; // number of main steps
int m_min_count; // minimal kmer count to be included in a de Bruijn graph
int m_min_kmer; // the minimal kmer size for the main steps
bool m_usepairedends; // whether or not to use paired ends
int m_max_kmer_paired; // insert size
int m_insert_size; // upper bound for the insert size
int m_maxkmercount; // the minimal average count for estimating the maximal kmer
Expand Down
40 changes: 20 additions & 20 deletions readsgetter.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -65,10 +65,9 @@ namespace DeBruijn {
// files specified as a list separated by comma with file for first mate followed by the file for second mate.
// ncores - number of cores
// usepairedends - flag to indicate that input reads are paired
// gzipped - flag to indicate that input files are gzipped

CReadsGetter(const vector<string>& sra_list, const vector<string>& fasta_list, const vector<string>& fastq_list, int ncores, bool usepairedends, bool gzipped) :
m_ncores(ncores), m_usepairedends(usepairedends), m_gzipped(gzipped) {
CReadsGetter(const vector<string>& sra_list, const vector<string>& fasta_list, const vector<string>& fastq_list, int ncores, bool usepairedends) :
m_ncores(ncores), m_usepairedends(usepairedends) {

CStopWatch timer;
timer.Restart();
Expand All @@ -86,16 +85,14 @@ namespace DeBruijn {
size_t total = 0;
size_t paired = 0;
for(auto& reads : m_reads) {


total += reads[0].ReadNum()+reads[1].ReadNum();
paired += reads[0].ReadNum();
}

if(total == 0)
throw runtime_error("No valid reads available for assembly");

if(m_usepairedends)
if(paired > 0)
cerr << "Total mates: " << total << " Paired reads: " << paired/2 << endl;
else
cerr << "Total reads: " << total << endl;
Expand Down Expand Up @@ -476,13 +473,21 @@ namespace DeBruijn {
return true;
};

auto OpenStream = [] (const string& file, bool gzipped, bool isfasta, boost::iostreams::filtering_istream& is) {
auto OpenStream = [] (const string& file, bool isfasta, boost::iostreams::filtering_istream& is) {

ifstream gztest(file, ios_base::in|ios_base::binary);
if(!gztest.is_open())
throw runtime_error("Error opening "+file);
array<uint8_t,2> gzstart;
if(!gztest.read(reinterpret_cast<char*>(gzstart.data()), 2))
throw runtime_error("Invalid file "+file);
bool gzipped = (gzstart[0] == 0x1f && gzstart[1] == 0x8b);
gztest.close();

ios_base::openmode mode = ios_base::in;
if(gzipped)
mode |= ios_base::binary;
boost::iostreams::file_source f{file, mode};
if(!f.is_open())
throw runtime_error("Error opening "+file);
if(gzipped)
is.push(boost::iostreams::gzip_decompressor());
is.push(f);
Expand Down Expand Up @@ -518,7 +523,7 @@ namespace DeBruijn {
size_t comma = file.find(',');
if(comma == string::npos) {
boost::iostreams::filtering_istream is;
OpenStream(file, m_gzipped, isfasta, is);
OpenStream(file, isfasta, is);

if(!m_usepairedends) {
while(NextRead(acc1, read1, isfasta, is, file))
Expand Down Expand Up @@ -547,20 +552,16 @@ namespace DeBruijn {
} else {
boost::iostreams::filtering_istream is1;
string file1 = file.substr(0,comma);
OpenStream(file1, m_gzipped, isfasta, is1);
OpenStream(file1, isfasta, is1);
boost::iostreams::filtering_istream is2;
string file2 = file.substr(comma+1);
OpenStream(file2, m_gzipped, isfasta, is2);
int p = m_usepairedends ? 0 : 1;
OpenStream(file2, isfasta, is2);
while(NextRead(acc1, read1, isfasta, is1, file1)) {
if(NextRead(acc2, read2, isfasta, is2, file2)) {
InsertRead(read1, all_reads[p], file1);
InsertRead(read2, all_reads[p], file2);
InsertRead(read1, all_reads[0], file1);
InsertRead(read2, all_reads[0], file2);
} else {
if(m_usepairedends)
throw runtime_error("Files "+file+" contain different number of mates");
else
InsertRead(read1, all_reads[p], file1);
throw runtime_error("Files "+file+" contain different number of mates");
}
}
}
Expand All @@ -586,7 +587,6 @@ namespace DeBruijn {

int m_ncores;
bool m_usepairedends;
bool m_gzipped;
list<array<CReadHolder,2>> m_reads;
int m_kmer_for_adapters;
CKmerMap<int> m_adapters;
Expand Down
Loading

0 comments on commit b838b97

Please sign in to comment.