Simplify file input

ncbi · Oct 9, 2018 · b838b97 · b838b97
1 parent bd6fafb
commit b838b97
Show file tree

Hide file tree

Showing 4 changed files with 84 additions and 57 deletions.
diff --git a/README.md b/README.md
@@ -1,10 +1,11 @@
 # SKESA - Strategic Kmer Extension for Scrupulous Assemblies
-Version 2.2
+Version 2.3
 
 For questions regarding SKESA, please contact
     Alexandre Souvorov (souvorov@ncbi.nlm.nih.gov)
     Richa Agarwala (agarwala@ncbi.nlm.nih.gov)
 
+Please [cite](#citation)  our paper.
 
 ## Compilation
 
@@ -62,7 +63,8 @@ For questions regarding SKESA, please contact
                                     times for different runs) [string]
       --fastq arg                   Input fastq file(s) (could be used multiple 
                                     times for different runs) [string]
-      --gz                          Input fasta/fastq files are gzipped [flag]
+      --use_paired_ends             Indicates that a single (not comma separated) 
+                                    fasta/fastq file contains paired reads [flag]
       --sra_run arg                 Input sra run accession (could be used multiple
                                     times for different runs) [string]
       --contigs_out arg             Output file for contigs (stdout if not 
@@ -76,8 +78,6 @@ For questions regarding SKESA, please contact
                                     the maximal kmer length in reads [integer]
       --vector_percent arg (=0.05)  Count for  vectors as a fraction of the read 
                                     number (1. disables) [float (0,1]]
-      --use_paired_ends             Use pairing information from paired reads in 
-                                    input [flag]
       --insert_size arg             Expected insert size for paired reads (if not 
                                     provided, it will be estimated) [integer]
       --steps arg (=11)             Number of assembly iterations from minimal to 
@@ -90,6 +90,7 @@ For questions regarding SKESA, please contact
       --allow_snps                  Allow additional step for snp discovery [flag]
 
     Debugging options:
+      --force_single_ends           Don't use paired-end information [flag]
       --seeds arg                   Input file with seeds [string]
       --all arg                     Output fasta for each iteration [string]
       --dbg_out arg                 Output kmer file [string]
@@ -102,7 +103,7 @@ For questions regarding SKESA, please contact
 
 
 ## Short description
-    
+
     SKESA is a de-novo sequence read assembler for microbial genomes
     based on DeBruijn graphs. It uses conservative heuristics and is designed to
     create breaks at repeat regions in the genome. This leads to excellent sequence
@@ -114,18 +115,24 @@ For questions regarding SKESA, please contact
 
     SKESA can process read information by accessing reads from SRA (option --sra_run)
     or from files in fasta (option --fasta) or fastq (option --fastq) format. Any
-    combination of input streams is allowed. For paired reads, if a single file is
-    specified, reads are expected to be interleaved with first mate followed by the
-    second. To specify a separate file for each mate, filenames separated by a comma
-    for first mate followed by the second mate are listed and in this case, the order
-    of reads is expected to be same in files for both mates. A limitation of the current
-    release is that in case multiple streams of paired reads are provided, it is assumed
-    that all streams have the same insert size. User can explicitly specify expected
+    combination of input streams is allowed. Files could be gzipped, which is recognized 
+    automatically. 
+
+    When accessing reads from SRA SKESA automatically determines if the read set consists of 
+    paired-end or single-end reads. For fasta/fastq input of paired reads with separate files 
+    for each mate, filenames separated by a comma for first mate followed by the second mate 
+    are listed and in this case, the order of reads is expected to be same in files for both mates. 
+    Alternatively, a single file with both mates could be specified. In this case the reads are 
+    expected to be interleaved with first mate followed by the second, and the option --use_paired_ends 
+    must be used.
+
+    A limitation of the current release is that in case multiple streams of paired reads are provided, 
+    it is assumed that all streams have the same insert size. User can explicitly specify expected
     insert size for the reads (option --insert_size). Otherwise, a sample of input
     reads is used to estimate the expected insert size. This sampling may lead to very
     small differences in assembly of the same read set if the order of reads is different
     and selected sample gives a difference in expected insert size.
-
+    
     Two additional options users may wish to specify depending on the resources
     available to them are as follows:
         1. the number of cores (option --cores) and
@@ -168,23 +175,23 @@ For questions regarding SKESA, please contact
 
      Example of an assembly that directly accesses SRA for a paired read set SRR1960353 is:
 
-       $ skesa --sra_run SRR1960353 --cores 4 --memory 48 --use_paired_ends > SRR1960353.skesa.fa
+       $ skesa --sra_run SRR1960353 --cores 4 --memory 48 > SRR1960353.skesa.fa
 
      Example of an assembly that uses separate fastq files for each mate of SRR1703350 is:
 
-       $ skesa --fastq SRR1703350_1.fq,SRR1703350_2.fq --cores 4 --memory 48 --use_paired_ends > SRR1703350.skesa.fa
+       $ skesa --fastq SRR1703350_1.fq,SRR1703350_2.fq --cores 4 --memory 48 > SRR1703350.skesa.fa
 
      Example of an assembly that uses interleaved mates for SRR1703350 as fastq input is:
 
-       $ skesa --fastq SRR1703350.fq --cores 4 --memory 48 --use_paired_ends > SRR1703350.skesa.fa
+       $ skesa --fastq SRR1703350.fq --use_paired_ends --cores 4 --memory 48 > SRR1703350.skesa.fa
 
      Example of an assembly that uses reads from SRA for SRR1695624 and gzipped fasta for SRR1745628 is:
 
-       $ skesa --sra_run SRR1695624 --fasta SRR1745628.fa.gz --gz --cores 4 --memory 48 --use_paired_ends > SAMN03218571.skesa.fa
+       $ skesa --sra_run SRR1695624 --fasta SRR1745628.fa.gz --use_paired_ends --cores 4 --memory 48 > SAMN03218571.skesa.fa
 
      Example of the same assembly as above done with both runs accessed from SRA is:
 
-       $ skesa --sra_run SRR1695624 --sra_run SRR1745628 --cores 4 --memory 48 --use_paired_ends > SAMN03218571.skesa.fa
+       $ skesa --sra_run SRR1695624 --sra_run SRR1745628 --cores 4 --memory 48 > SAMN03218571.skesa.fa
 
 ## Citation
 

diff --git a/assembler.hpp b/assembler.hpp
@@ -63,7 +63,6 @@ namespace DeBruijn {
         // steps - number of assembly iterations from minimal to maximal kmer size in reads
         // min_count - minimal kmer count to be included in a de Bruijn graph
         // min_kmer - the minimal kmer size for the main steps
-        // usepairedends - whether or not to use paired ends
         // max_kmer_paired - insert size (0 if not known)
         // maxkmercount - the minimal average count for estimating the maximal kmer
         // memory - the upper bound for memory use (GB)
@@ -73,9 +72,10 @@ namespace DeBruijn {
         using GraphDigger = CDBGraphDigger<DBGraph>;        
 
         template<typename... GraphArgs>
-        CDBGAssembler(double fraction, int jump, int low_count, int steps, int min_count, int min_kmer, bool usepairedends, 
-                      int max_kmer_paired, int maxkmercount, int ncores, list<array<CReadHolder,2>>& raw_reads, TStrList seeds, bool allow_snps, bool estimate_min_count, GraphArgs... gargs) : 
-            m_fraction(fraction), m_jump(jump), m_low_count(low_count), m_steps(steps), m_min_count(min_count), m_min_kmer(min_kmer), m_usepairedends(usepairedends),
+        CDBGAssembler(double fraction, int jump, int low_count, int steps, int min_count, int min_kmer, bool forcesinglereads,
+                      int max_kmer_paired, int maxkmercount, int ncores, list<array<CReadHolder,2>>& raw_reads, TStrList seeds, 
+                      bool allow_snps, bool estimate_min_count, GraphArgs... gargs) : 
+            m_fraction(fraction), m_jump(jump), m_low_count(low_count), m_steps(steps), m_min_count(min_count), m_min_kmer(min_kmer),
             m_max_kmer_paired(max_kmer_paired), m_maxkmercount(maxkmercount), m_ncores(ncores), m_average_count(0), m_raw_reads(raw_reads) {
 
             m_max_kmer = m_min_kmer;
@@ -88,10 +88,18 @@ namespace DeBruijn {
 
             double total_seq = 0;
             size_t total_reads = 0;
+            size_t paired = 0;
             for(auto& reads : m_raw_reads) {
+                if(forcesinglereads) {
+                    for(CReadHolder::string_iterator is = reads[0].sbegin(); is != reads[0].send(); ++is)
+                        reads[1].PushBack(is);
+                    reads[0].Clear();
+                }
                 total_seq += reads[0].TotalSeq()+reads[1].TotalSeq();
                 total_reads += reads[0].ReadNum()+reads[1].ReadNum();
+                paired += reads[0].ReadNum();
             }
+            bool usepairedends = paired > 0;
 
             //graph for minimal kmer
             double average_count = GetGraph(m_min_kmer, m_raw_reads, true, estimate_min_count ? total_seq : 0, gargs...);
@@ -192,8 +200,8 @@ namespace DeBruijn {
             cerr << endl << "Average count: " << average_count << " Max kmer: " << m_max_kmer << endl;
 
             //estimate insert size
-            if(steps > 1 || m_usepairedends) {
-                if(m_max_kmer_paired == 0) {
+            if(steps > 1 || usepairedends) {
+                if(m_max_kmer_paired == 0 && usepairedends) {
                     size_t mates = 0;
                     for(auto& rh : m_raw_reads)
                         mates += rh[0].ReadNum();
@@ -273,7 +281,7 @@ namespace DeBruijn {
             }
 
             // three additional iterations with kmers (usually) longer than read length and upto insert size
-            if(m_usepairedends && m_insert_size > 0 && m_max_kmer_paired > 1.5*m_max_kmer) {
+            if(usepairedends && m_insert_size > 0 && m_max_kmer_paired > 1.5*m_max_kmer) {
                 ConnectPairsIteratively();
 
                 array<int,3> long_kmers;
@@ -887,7 +895,6 @@ namespace DeBruijn {
         int m_steps;                                         // number of main steps
         int m_min_count;                                     // minimal kmer count to be included in a de Bruijn graph
         int m_min_kmer;                                      // the minimal kmer size for the main steps
-        bool m_usepairedends;                                // whether or not to use paired ends
         int m_max_kmer_paired;                               // insert size
         int m_insert_size;                                   // upper bound for the insert size
         int m_maxkmercount;                                  // the minimal average count for estimating the maximal kmer

diff --git a/readsgetter.hpp b/readsgetter.hpp
@@ -65,10 +65,9 @@ namespace DeBruijn {
 //           files specified as a list separated by comma with file for first mate followed by the file for second mate.
 //      ncores - number of cores
 //      usepairedends - flag to indicate that input reads are paired
-//      gzipped - flag to indicate that input files are gzipped
 
-        CReadsGetter(const vector<string>& sra_list, const vector<string>& fasta_list, const vector<string>& fastq_list, int ncores, bool usepairedends, bool gzipped) : 
-            m_ncores(ncores), m_usepairedends(usepairedends), m_gzipped(gzipped) {
+        CReadsGetter(const vector<string>& sra_list, const vector<string>& fasta_list, const vector<string>& fastq_list, int ncores, bool usepairedends) : 
+            m_ncores(ncores), m_usepairedends(usepairedends) {
 
             CStopWatch timer;
             timer.Restart();
@@ -86,16 +85,14 @@ namespace DeBruijn {
             size_t total = 0;
             size_t paired = 0;
             for(auto& reads : m_reads) {
-
-
                 total += reads[0].ReadNum()+reads[1].ReadNum();
                 paired += reads[0].ReadNum();
             }
 
             if(total == 0)
                 throw runtime_error("No valid reads available for assembly");
 
-            if(m_usepairedends)
+            if(paired > 0)
                 cerr << "Total mates: " << total << " Paired reads: " << paired/2 << endl;
             else
                 cerr << "Total reads: " << total << endl;
@@ -476,13 +473,21 @@ namespace DeBruijn {
                 return true;
             };
 
-            auto OpenStream = [] (const string& file, bool gzipped, bool isfasta, boost::iostreams::filtering_istream& is) {
+            auto OpenStream = [] (const string& file, bool isfasta, boost::iostreams::filtering_istream& is) {
+
+                ifstream gztest(file, ios_base::in|ios_base::binary);
+                if(!gztest.is_open())
+                    throw runtime_error("Error opening "+file);
+                array<uint8_t,2> gzstart;
+                if(!gztest.read(reinterpret_cast<char*>(gzstart.data()), 2))
+                    throw runtime_error("Invalid file "+file);
+                bool gzipped = (gzstart[0] == 0x1f && gzstart[1] == 0x8b);
+                gztest.close();
+
                 ios_base::openmode mode = ios_base::in;
                 if(gzipped)
                     mode |= ios_base::binary;
                 boost::iostreams::file_source f{file, mode};
-                if(!f.is_open())
-                    throw runtime_error("Error opening "+file);
                 if(gzipped)
                     is.push(boost::iostreams::gzip_decompressor());
                 is.push(f);
@@ -518,7 +523,7 @@ namespace DeBruijn {
                 size_t comma = file.find(',');
                 if(comma == string::npos) {
                     boost::iostreams::filtering_istream is;
-                    OpenStream(file, m_gzipped, isfasta, is);
+                    OpenStream(file, isfasta, is);
 
                     if(!m_usepairedends) {
                         while(NextRead(acc1, read1, isfasta, is, file))
@@ -547,20 +552,16 @@ namespace DeBruijn {
                 } else {
                     boost::iostreams::filtering_istream is1;
                     string file1 = file.substr(0,comma);
-                    OpenStream(file1, m_gzipped, isfasta, is1);
+                    OpenStream(file1, isfasta, is1);
                     boost::iostreams::filtering_istream is2;
                     string file2 = file.substr(comma+1);
-                    OpenStream(file2, m_gzipped, isfasta, is2);
-                    int p = m_usepairedends ? 0 : 1;
+                    OpenStream(file2, isfasta, is2);
                     while(NextRead(acc1, read1, isfasta, is1, file1)) {
                         if(NextRead(acc2, read2, isfasta, is2, file2)) {
-                            InsertRead(read1, all_reads[p], file1);
-                            InsertRead(read2, all_reads[p], file2);
+                            InsertRead(read1, all_reads[0], file1);
+                            InsertRead(read2, all_reads[0], file2);
                         } else {
-                            if(m_usepairedends)
-                                throw runtime_error("Files "+file+" contain different number of mates");
-                            else
-                                InsertRead(read1, all_reads[p], file1);
+                            throw runtime_error("Files "+file+" contain different number of mates");
                         }
                     }
                 }
@@ -586,7 +587,6 @@ namespace DeBruijn {
 
         int m_ncores;
         bool m_usepairedends;
-        bool m_gzipped;
         list<array<CReadHolder,2>> m_reads;
         int m_kmer_for_adapters;
         CKmerMap<int> m_adapters;