clusthash doesn't seem to use all given threads, and result2flat breaks with "segmentation fault" #261
Comments
I reviewed the code and found multiple possible issues with the module for smaller datasets. I'll try to finish up the refactoring either later today or tomorrow. Thanks for the bug report. We are still doing the free sticker thing (see https://twitter.com/thesteinegger/status/1201076220957315074). If you want a set send me your address at milot at mirdita dot de. |
Thanks for the super quick reply ! |
@UriNeri clusthash cannot find contained sequences. The best option would be to use linclust
|
@martin-steinegger @milot-mirdita |
Disregarding how much biological sense it makes, would you mind rerunning the Also the crash in
|
@milot-mirdita No problem! |
@martin-steinegger
|
Could you please provide the full log? Is it possible to provide your input? |
@martin-steinegger |
@UriNeri the log is only written to stdout. We do not store a copy of the log in the temp directory. So you probably need to rerun the whole job. If you use the same tmp folder and command then it will just perform the last step. |
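Since the log only goes to stdout, one way to keep a copy on a rerun is to pipe the output through `tee`. This is a minimal sketch, not something MMseqs2 provides: `run_logged` and the `mmseqs_run.log` file name are hypothetical, and the input, output prefix and tmp folder in the example call are placeholders.

```shell
#!/bin/sh
# Hypothetical helper: run any command, show its output on the terminal,
# and append a copy to a log file.
LOGFILE="${LOGFILE:-mmseqs_run.log}"

run_logged() {
    # Merge stderr into stdout so error messages end up in the log too.
    # Note: the pipeline's exit status is that of tee, not of the command.
    "$@" 2>&1 | tee -a "$LOGFILE"
}

# Example call (placeholder paths; reuse the same tmp folder as before):
# run_logged mmseqs easy-linclust input.fasta clusterRes tmp
```

The resulting log file can then be attached to the issue.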
@martin-steinegger I re-ran the easy-linclust command on a different input (sent by email). |
@UriNeri thank you for sending me the input. I could reproduce the issue and fix it. |
@martin-steinegger thank you ! |
46c843895 Update combine pval agg-mode 3 67d610136 Disable fancy progress bars on travis to reduce output 203a21736 Updated two more tests to use tighter ROC thresholds a9052f449 Update regression with tighter bounds for ROC tests c62736a6d Correctly parse keys from data files in filterdb --filter-file This was causing a linsearch instability fe007cb4e Use MultiParam for gapOpen, gapExtend costs 3513001d3 Add easy-rbh workflow d0d3032e9 Fix RBH search if using -a to show alignments ce1a43bf1 Merge branch 'master' of https://github.com/soedinglab/mmseqs2 ea24e4934 Fix issues with abs. path if using aria2c 5228745f5 Improve --alignment-mode parameter description and make it a non expert parameter fffa9b10e Fix various inconsistencies and usability issues with alignall: * alignall alignment-mode did not correspond to align alignment-mode * add-backtrace did not do anything, has to be specified now if backtrace is needed * Did return a alignment db type even though it is incompatible with that type, uses generic for now * various parameters were passed but unused - zdrop and scorebias are used now (however see below) - realign, alt ali, max accept/reject, wrapped are now gone 290668474 Fix wrong warning 813d81f29 Update regression 264d78117 Switch greedy clustering algorithm back to old idea c09f6574e Improve nucleotide clustering workflow 38a737708 Set k-mers in linclust to 0 for the nucleotide clustering 7df6e3f75 Replace characters that can not be reversed by N in extract frames e9678f625 Update regression f886e868f Add nucleotide support to cluster (workflow nucleotide_clustering), clust module will infer identity automatically if missing, Improve low. mem. 
greedy incremental algorithm, Update regression 5f8735872 Add kmers-per-sequence-scale to linsearch 0310eb607 Change --kmer-per-seq-scale to a multi parameter, add error if cluster is called with a nucleotide sequence e258bc8d8 Fix #299 PDB70 database creation was not working 7095f37e4 Add support reverse complemente in rescorediagonal --rescore-mode 0 and 1 61ca48883 Fix result2dnamsa 70d014e41 Add search-type 4 to Search 462f24cbb Add module result2dnamsa 5670d990e Fix regression error e4451d591 Add result direction parameter to kmersearch 12c499dcd Fix reverse sequences issues in linclust and linsearch 44499c3ce Update filterdb regression test 807b4a56a Fix issue soedinglab/MMseqs2#290. Filterdb checked for mode == true but mode was 2. 24479bc27 Fix Docker a578f52a7 Fix char signedness on PPC a0d64a989 Update regression a07a266f9 Working on PPC64LE support 09734177c Remove remaining _mm_shuffle_epi32 cdef78a69 Merge pull request #285 from hgsommer/misc_small 283c8d03f Replace goto end in ssw 6bfc50281 Fix c/p mistake in convertalignments e61da3447 Fix spelling of 'length' 9a63760fa Replace nested ternary operator 4349b5c6e Avoid repeatedly checking for profile db types c170a11f5 Call MsaFilter::shuffleSequences() from MsaFilter::filter() ef49ba220 Return value from MsaFilter::filter() d155dc36c Replace int by bool literals for bool variable ec6722adc Align headings with column in PSSMCalculator::printProfile() 548a9bd68 Avoid forward declaration of ScoreMatrix d0fbe471f Do some cleanup in StripedSmithWaterman.cpp 91d1aeddc Replace check for zero-sized containers by empty() e47b8eed9 Remove superfluous parameter from ssw_init() 250b1221d Simplify return statements 4fe1116ae Remove counting zero scores in Sequence::mapProfile() 4303728b5 Replace multiplication by zero 1bd602420 Remove increment by zero e4d4389f2 Move check for exit condition in front of allocations 556d26d1a Clean up function signatures in MultipleAlignment 3863af9ac Move include back to header to 
restore build e1208493a Remove unused TmpResult score field 1fd4db8f2 Die if DBReader cannot reopen files (e.g. no more file handles left) 1e21b87ba Purge sequenceLookup early since its recreate in split databases 40854ddcd Prefiltering and CacheFriendlyOperations refactoring 2433e086b WASM work in progress 14014cd0e Fix prefilter overflow instability e0f971848 Add conda forge to conda install instructions aa175d636 Fix off by one in kmermatcher soedinglab/MMseqs2#274 (comment) d1607bc8a Remove LINE_MAX eca2155d7 Clear string buffer instead of reassigning in swapresults 0f4645edd Fix wrong reverse marking in linsearch reported by UBSAN 5b612a327 Missing mpi binaries for travis regression 83d22417a Next try for ARM compiler flags 7ad122f0a Missed a few variables ac7914bea Do not require a cmake variable to build ARM 0dcfaadbb Update regression to fix broken samtools call on ARM 29927b4c4 More NEON fixes, we assume signed chars, ARM uses unsigned by default 7760220ff Next try to get the ARM regression to work cc6d0d52b Add hack to not break travis log size limit 5408c3d10 Try to get NEON to compile 83192cabd Fix search workflow parameters printed twice f6f001c8c Fix new clang-10 warnings and further travis fixes 259e64341 llvm-10 alias is not whitelisted in travis yet b1249fd54 Fix errors in Travis YAML from previous commit 18486d4c5 Update travis - use native aarch64 for neon - use xenial - shorten script 98c37f3c3 shortend MultiParam usage, improved line breaks in usage c9be07f1a Add gcc-9 to travis 2e5fb309a Fix travis clang build d5865c894 Remove MultiParam g++-9 warning 73679835b Rework target split merging ca5869397 Fix RESSIZE issue in slice search if sequences are used 491900b99 Improve usage text of cluster/linclust 0166850a2 Remove old greedy incremental clustering code and just run the memory efficient version instead. 
15163e64c Fix Verbosity in workflows aa78af463 Fix issue soedinglab/MMseqs2#274 7846dfce3 fixed clang template error e1206371c extended MultiParam class, replaced ScoreMatrixFile type by MultiParam<char*> b88b54756 rewrite alphabetSize as multi parameter ecb4e35d4 started template class MultiParam to store sequence type specific values e1a1c1226 changed dbtype comparision in AlignmentSymmetry 2a829aef7 Replace symlinkat call with getcwd/chdir/symlink/chdir to fix Conda build using macOS 10.9 SDK 28e83e8d5 Add OpenMP include to DBReader fb00aa0c3 Fix realloc issue while IndexTable creation of profiles 504e5021f Take max. seq. len of query and target db in prefilter and alignment 16e235214 Fix bug if seq. len > max seq. length in Alignment 80d0187de Fix asan issue 751f5c19f Make ZDROP an expert parameter, change description text 1b6edd0d4 Rework x detection (SIMD) 9677254ab Merge branch 'master' of https://github.com/soedinglab/mmseqs2 1ac1e6866 Fix max seq issues in prefilter cb737033c Reset download strategy to not use aria2c for the NCBI download c95f3ee0e fixed ksw2 test 72b95c0ce Error if we cannot download from NCBI 1d0aad50b Fix databases not piecing togehter all kalamari accessions 516723d53 Merge branch 'master' of https://github.com/soedinglab/MMseqs2 d81b6cca5 added zdrop parameter to control banded nucleotide alignment e2e39a971 Add Kalamari Contaminants database c0c538ea3 Various fixes in databases script 08cc95b3a Fix createtaxdb redownloading when taxdump already exists 018eb3498 Remove a bit whitespace in front of each parameter in usage message 8aa7513de add aggregatetax example, fix typos 8bcd7c740 Fix typo 8e581b762 Rework usage texts 7dc25764a Hide most parameters from createindex 2baa609e8 Add examples to many modules 00a7d7696 fixed bugs for long or wrapped nucleotide sequences a4bdcb478 eggNOG profiles should not depend on the deleted MSAs 4c7830954 Fix eggNOG database construction f7a5599c8 Cleanup not needed files immediately in databases 
workflow 3ed3690d4 Fix downloads always restarting in databases workflow 4cfac9a8a Fix aria warning with more than 16 connections e0a00e10d Revert "Use SW instead of BandedNucAln if we don't have diagonals" 7ac966b2e Fix result2msa could fail if it was writing compressed output 95729ac7c Fix wrong output DB type written in alignall f899e7c7a Use SW instead of BandedNucAln if we don't have diagonals c08d9fa8e Allow parameter descriptions to span multiple lines 57868498e MMseqs2 is not limited to proteins, update README to reflect that 11818b0a2 Cleanup hiding parameters in workflows c481cea60 Remove some useless includes 2f64aeeb8 Fix databases timestamp appending instead of overwriting ae9e9e329 Add eggNOG setup procedure to databases 31c8e5d50 Shorten two short parameter descriptions 2f49d3e3e Read header from lookup in msa2profile if available 1356869b0 add option to reverese profile dbs ac3482e80 More issues with zlib and tar2db aaafafe43 Fix tar2db keys c751d9e2f More tar2db fixes a9c93014c Fix variadic input to tar2db 51a761305 Add tar2db module to convert content of any tar to a DB 96f9a91e5 Use nedmalloc on Windows/Cygwin 73f5c2a2d Add databases workflow to README 5a7ac9e54 make align output consistent c5ebe5297 fixed setcover cluster mode (by fixing bug in similarity reading for short aln results e.g. 
hamming distance aln) 481696b5f Fix databases output c6b4a57a8 Beginning cleaning up parameter descriptions a9552a177 Show default value of bool parameters af89c4677 Add a proposed example text structure 9c17f4eba Rework module description texts, better categories, shorten all descriptions, prepare to replace long descriptions with examples 00ff199e8 Add Resfinder DB f1011ecb4 Fix krona again marked as vendored 02001ab03 missing mode resulted in different top1 4375463bc Header db should not have to be a unsplit db edccbf33f Actually fix extractorfs lookup creation 041e8e558 Improve README a8f2c7bad Remove correct workflow script in createtaxdb.sh 26c8202a9 print createdb cmd line again df02bae34 Refactor createseqfiledb, remove stringstream 2523ebe1a do not write null byte af847a724 Fix clang warning from DBConcat ef1ec596f extend dbconcat to handle auxillary files 528bd2134 not needed dec1b9215 Silence warning in GCC 4.8 casting function to void* 2d44c886d Fix extractorfs not being able to create lookup ffe66afac Replace isnumber with isdigit. 
Add more tests to TestTaxExpr fbe09867e Rework Taxon Expr parsing f58329ef5 Add constructor to define custom functions to ExpressionParser b6ef07281 Initialize expressionparser per thread, was not thread safe f966bfa62 Fix reallocation issue in BandedAlignment bbd3c2bb7 Add +1 to realloc in BandedNucleotideAligner but not to length 6b6e82ae6 Add +1 to realloc in mapSequence 75e2c8ec4 Fix off by one issues in realloc in rescorediagonal and BandedNucleotideAligner afd14c8c2 First step to get rid of maxSeqLen 13ca612db Fix allocation issue in kermatcher if sequences are longer than > 2^16 62de5ba93 Fix off by one in computation for splits in kmermatcher 35e95d180 Change int_sequence to char (big change) ecf82f2f4 Revert "Temporarily disable soft split mode for createdb in easy workflows" d19219dd4 Merge branch 'master' of https://github.com/soedinglab/mmseqs2 1a0d898ec Fix softlink issue in createdb soedinglab/MMseqs2#265 13e0fe466 Temporarily disable soft split mode for createdb in easy workflows 4487b6e14 Fix view module to work with softlinked createdb dbs c1e9eb0e3 Fix MPI issue if only one server is used e781c3fe5 fix MPI compile error 9bcff2844 Fix Filter2 bug of HH-suite in MMseqs2 soedinglab/hh-suite#182 01db79d33 Fix some bugs in splitting handling d9a887453 Fix memory splitting issues in kmermatcher, kmerindexdb 37880f083 Fix MPI in kmermatcher and indexdb bee93123f Update regression 03a89ff1c Merge branch 'master' of https://github.com/soedinglab/mmseqs2 6ca967362 Update the way how k-mers are extracted in kmermatcher. Extraction should be now ~3 times faster. 
f1388309d Introducing databases workflow to automatically setup and download common databases d78fdbb06 Add progress to convertmsa 18acba224 Do not recreate _mapping file if it already exists in createtaxdb 63a373f5a Skip validations steps correctly if a input db is neither INPUT nor OUTPUT d95caa1a7 Allow modules with zero parameters 9f8aff948 Allow modules to handle -h or --help themselves cf5691f92 Typo 8ebc9d16b fixed access mode 31895414d Clarify parameter help in createdb f644744a8 Merge branch 'master' of https://github.com/soedinglab/mmseqs2 c287719d9 Remove check for profiles for splice serach. It should also work with sequence databases. c75fe9acf regression submodule w filtertaxseqdb 7587a872f Add one more missing check in kmermatcher 8d4e9f4fc Remove +1 from size in initKmerPositionMemory aca141e95 Fix shellcheck error in splicesearch 8bdff50e1 Move +1 from initKmerPositionMemory outside f12821e35 Merge branch 'master' of https://github.com/soedinglab/mmseqs2 d74b76ca5 Avoid overflow in kmermatcher if split is needed fd90ff2c3 Move compiled data resources into subfolders 2fd9f25d2 Merge branch 'master' of https://github.com/soedinglab/mmseqs2 b439ce831 Make the slice search applicable to other databases types, not just profiles 589a2e276 Fix apply crashing on empty entries 82542a6ac Merge branch 'master' of https://github.com/soedinglab/mmseqs2 c0acdd8f3 Fix memory leak in createsubdb. 5129a956d Validate taxonomic ranks and make input/output formats consistent 53bb55b38 Fix issues in hash function soedinglab/MMseqs2#252 764c4a3e7 Fix lca message c013a6929 Fix LCA output message a1206690d Change db validator from result2stats 714f5b4fb Replace mmaped input file with std c io in createsubdb 6e43e9413 Add remove .source file to rmdb 3e58bb85b Fix result2flat soedinglab/MMseqs2#261 3e27833db Revert easycluster.sh back to result2flat. 
Reason is that createsubdb can not handle soft linked sequence databases (input.0 -> input.fas) 33354680f Merge branch 'master' of https://github.com/soedinglab/mmseqs2 1e92fb504 Replace result2repseq and result2flat with createsubdb and convert2fasta 55bcdd303 single step clustering could potential cluster unrelated sequences due to hash collisions fdd0646b1 Fix clusthash issues with parallelization and nucl input e62a1c717 Merge branch 'master' of https://github.com/soedinglab/mmseqs2 1336b7ad2 Add MSA to allDb and allDbAndFlat 48a037a2e Update Prefiltering.cpp a1adbf52d Fix warning: Remove useless copy constructor from Matcher::result_t d3ca42657 Remove truncatedCounter variable in QueryMatcher 4647525ec Show full help text if "Error in argument " occurs 4149ae457 Remove annoying message in prefilter (truncated result). Move it to the statistics section. d5aab5b86 Update regression 1f1e049e6 Fix output of unclassified hits in convertalis 83ff5c601 Fix permission issues for tmp directory cce6e6714 add support to output taxon in easy-search when using an indexed database f200bdd62 Merge branch 'master' of https://github.com/soedinglab/mmseqs2 6f28a29ae Fix seg. 
fault if all sequences could be classified 473d60580 Update batches b52668f6e Add chat icon af54c8e8e Update README.md 7eb6a0b70 Makde addtaxonomy more resilient against invalid taxonomy mappings 3482b0e91 Merge pull request #260 from RuoshiZhang/master 36f49f5b5 Fix issue in memory computation for split bcb97d63f Update README.md abcd97de7 write same number of fields even if no hit 38e102181 Update regression to hopefully fix windows failure f41511465 Fix spelling error 1fd24924e Add a search-type 4 for trans-trans search returning a nucl backtrace in offsetalignment 31f6d7ac3 add aggragatetax to assign set tax by majority vote b6e8ee239 allow more dbtypes in swapdb c9d02ef21 add option to view rank index 49db7258e typo fix 9c32930f3 Merge branch 'master' of github.com:soedinglab/MMseqs2 17b5494fe Fix auto detection of dbtype in createdb 8831df81d Merge branch 'master' of github.com:soedinglab/MMseqs2 be1a9822c Fix createseqfiledb soedinglab/MMseqs2#258 02be0c4ea Fix summarizeresult to support reverse position in alignment 7ef586276 added filtertaxseqdb 00f2fd2b8 added mode for all but index 127db8c6d minor tidying for filtertaxdb 8144e7653 Merge branch 'master' of github.com:soedinglab/MMseqs2 48f77fa7d Fix ASan issue in filterdb d722d5724 Fix warning in filterdb 4a4e6ea15 Update regression test for filterdb 31a7dc124 filterdb --join-db ignores lines it cannot join instead of crash 6c6faa96d filterdb's --extract-lines works together with --trim-to-one-column 12bee8142 filterdb can filter by rows with value within percentage #249 5c919ab95 Allow double parameters separately from floats in parsing f9be8a88d Remove broken filterdb paths 1dc04f5e1 Refactoring of filterdb 90e3a9aaf Fix bug for enforced dbtypes in createdb a4cee78db New regression to check stdin support 17ec97c78 Add stdin support to easy workflows 76c9e7c36 Fix compiler warnings in KSeqWrapper 0cc45536b Overwrite dbtype correctly in createdb c0045182b Add stdin to createdb 02a88e438 use https instead 
of ftp for downloading taxdb data a33bd27f4 offsetalignments now correctly returns a nucleotide backtrace if needed 456e1b5ab include VTML40 in binary for easier access 775de3850 Add missed target .source file for reading in convertalis c08c071b2 Overload patterncompiler isMatch for pos of match ba6aa8d12 avoid appending extra tabs besthitperset git-subtree-dir: lib/mmseqs git-subtree-split: 46c8438958edccd8fd09640eb174e2449529e4df
Expected Behavior
clusthash uses all threads and result2flat produces a complete fasta file (not ending in %)
Current Behavior
clusthash seems to use only 1 of the given threads, and eventually result2flat breaks with a "segmentation fault"
Steps to Reproduce (for bugs)
Ran from the terminal in the same directory as a contigs fasta file (DNA) (named "cated_sk100.fna"):
THREADS=10
mkdir resultsDB scafDB
mmseqs createdb cated_sk100.fna scafDB/cated_sk100
mmseqs clusthash scafDB/cated_sk100 resultsDB/resultDB --min-seq-id 0.99 --threads $THREADS
mmseqs clust scafDB/cated_sk100 resultsDB/resultDB clusterDB --threads $THREADS
mmseqs result2repseq scafDB/cated_sk100 clusterDB DB_clu_rep
mmseqs result2flat scafDB/cated_sk100 scafDB/cated_sk100 DB_clu_rep scafs_reps.fasta --use-fasta-header
When "Compute 1 unique hashes." is printed, there are 10 resultsDB files and 10 resultDB.index files; however, only one (resultDB.index.7) grows over time (and has a size > 0). Meanwhile, only one thread seems to be utilized (around 8% of the capacity of the 10 threads given).
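A quick way to check how many of the per-thread output files actually received data is to count the non-empty ones. This is only a sketch: `count_nonempty` is a hypothetical helper, and the `resultsDB` directory and `resultDB` prefix are taken from the commands above.

```shell
# Hypothetical helper: count the non-empty regular files in a directory
# whose names start with a given prefix.
count_nonempty() {
    dir="$1"; prefix="$2"; n=0
    for f in "$dir"/"$prefix"*; do
        # -s is true only for an existing file with size > 0
        if [ -f "$f" ] && [ -s "$f" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# Example, run while clusthash is still working:
# count_nonempty resultsDB resultDB
```

If all threads were writing, the count would be expected to grow beyond 1 while the module runs.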
When clusthash finishes, there is one resultsDB.index file and 10 resultsDB files, 8 with zero size (resultsDB.index7 and resultsDB.index both have the same size). After this, the process breaks in the last command:
mmseqs result2flat scafDB/cated_sk100 scafDB/cated_sk100 DB_clu_rep scafs_reps.fasta --use-fasta-header
With the message:
`result2flat scafDB/cated_sk100 scafDB/cated_sk100 DB_clu_rep scafs_reps.fasta --use-fasta-header
MMseqs Version: 48a037a
Use fasta header true
Verbosity 3
[1] 18252 segmentation fault (core dumped) mmseqs result2flat scafDB/cated_sk100 scafDB/cated_sk100 DB_clu_rep`
MMseqs Output (for bugs)
Which output should I upload?
Context
I'm trying to remove redundancy by collapsing sequences that are either highly similar (99%) or contained within longer sequences from other fasta entries in the file. This fasta file is <1 GB, but I first tried to run this process on a >80 GB file on a remote compute node and was concerned when I saw the job was using only a small part of the resources.
Not part of this issue but related: I also tried to do the same thing with a large protein file, but I get invalid fasta entry errors (maybe because of the "*" marking STOPs left by the ORF predictor, but that wouldn't happen in the nucleic acid file example above).
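To test the "*" hypothesis, the STOP markers could be stripped from the sequence lines before running createdb. This is a sketch under that assumption only; whether the "*" characters are really the cause of the invalid fasta entry errors is not confirmed. `strip_stops` and the file names are hypothetical.

```shell
# Hypothetical helper: remove '*' from sequence lines,
# leaving '>' header lines untouched.
strip_stops() {
    sed '/^>/!s/\*//g'
}

# Example (placeholder file names):
# strip_stops < proteins.faa > proteins.nostop.faa
```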
Your Environment
I tried on my personal machine and a compute node (PBS), similar behaviour in both.
Personal machine MMseqs2 Version: 48a037a.
Server MMseqs2 Version: 2a8c5f0.
Server: (2a8c5f0)
CPU: Intel(R) Xeon(R) Platinum 8168
Memory: 366 GB
Personal machine: (48a037a)
CPU: Intel Core i7-8700 6-Core model: bits: 64 type: L2 cache: 12.0 MiB
Memory: 15.33 GB
Personal machine: Linux Mint 19.2 Tina Kernel: 4.15.0-72-generic x86_64;
Server: Linux 3.10.0-693.el7.x86_64
Thanks for developing and maintaining this totally amazing tool !