Adding more species takes less time - unclear why? #955

000generic · 2024-12-23T17:25:55Z

Hi!

I'm running OrthoFinder 3 (3.0.1b1) on two sets of eukaryotic genomes - Species51 and Species67, where Species51 is a subset of Species67. I've run both twice as core with slight differences in the user-provided species tree - and both times Species67 finished in 4 days, while Species51 took 6 days.

Here is an example command line:

orthofinder -t 100 -a 100 -M msa -X -s input/Species67-hand_built_rooted_newick-14dec2024 -S diamond -A mafft -T fasttree -f input/species67-T1-protomes -n orthofinder_species67_fast

I was wondering what might be causing this substantial difference in time between the two species sets - and why the species set with all the same and more species is finishing significantly faster. Would like to make sure things are ok / make sense - as this seems counter-intuitive to me.

Thank you! Eric


lauriebelch · 2025-01-06T14:47:08Z

Hi Eric,

That's definitely interesting! If you have the log files, we can try to work out which stage of OrthoFinder is the bottleneck for the 51-species run.

000generic · 2025-01-06T20:02:12Z

Great! I've made the logs and scripts available here:

google drive
The time spent on the final MSAs seems inconsistent. Since I first posted this issue, Species51 with diamond ultrasensitive and 125 CPUs ran faster than Species51 with diamond default and 125 CPUs - despite diamond ultrasensitive being slower than diamond default and no other major differences between the two runs. The same kind of inconsistency shows up for Species51 with diamond default and 125 CPUs vs Species67 with diamond default and 100 CPUs - here again, it doesn't make sense to me given the number of species and now CPUs (the run with fewer CPUs and more species finishes faster).

More confusing - and in contrast - Species67 with diamond default and 100 CPUs is much faster than Species67 with diamond ultrasensitive and 125 CPUs - this is the opposite of Species51 using the two versions of Diamond.

In all cases, the time variability seems to occur in the final MSA step, in ways that seem inconsistent with the data and tools going in.

There must be something I am overlooking in the details - or missing in the bigger picture - to make sense of this...?

Species51 Diamond Default 125 CPUs 1000 Gb memory = 35,000+ MSAs at 5 days 6 hours final MSA step

Wed Dec 18 00:03:31 EST 2024
Running Species51 on 125 CPU cores 1000 Gb memory
OrthoFinder version 3.0.1b1 Copyright (C) 2014 David Emms
...
Inferring multiple sequence alignments and gene trees
-----------------------------------------------------
2024-12-18 03:01:27 : Done 0 of 35357
2024-12-18 22:04:09 : Done 1000 of 35357
...
2024-12-19 04:50:07 : Done 35000 of 35357
2024-12-24 10:58:04 : Done MSA/Trees
...
OrthoFinder assigned 968455 genes (89.6% of total) to 36662 orthogroups

Species51 Diamond Ultrasensitive 125 CPUs 600 Gb memory = 34,000+ MSAs at 2 days 3 hours final MSA step

Wed Dec 25 20:07:54 EST 2024
Running Species51 on 125 CPU cores 600 Gb memory 
OrthoFinder version 3.0.1b1 Copyright (C) 2014 David Emms
...
Inferring multiple sequence alignments and gene trees
-----------------------------------------------------
2024-12-26 03:33:41 : Done 0 of 34436
2024-12-27 00:18:11 : Done 1000 of 34436
...
2024-12-27 06:46:14 : Done 34000 of 34436
2024-12-29 09:43:33 : Done MSA/Trees
...
OrthoFinder assigned 971862 genes (90.0% of total) to 35883 orthogroups.

Species67 Diamond Default 100 CPUs 750 Gb memory = 43,000+ MSAs at 2 days 15 hours final MSA step

Thu Dec 19 07:02:27 EST 2024
Running Species67 on 100 CPU cores 750 Gb memory
OrthoFinder version 3.0.1b1 Copyright (C) 2014 David Emms
...
Inferring multiple sequence alignments and gene trees
-----------------------------------------------------
2024-12-19 11:22:09 : Done 0 of 43809
2024-12-20 11:16:03 : Done 1000 of 43809
...
2024-12-20 21:04:46 : Done 43000 of 43809
2024-12-23 12:07:56 : Done MSA/Trees
...
OrthoFinder assigned 1191685 genes (89.5% of total) to 49297 orthogroups.

Species67 Diamond Ultrasensitive 125 CPUs 750 Gb memory = 42,000+ MSAs at 9 days+ and still running

Thu Dec 26 12:56:05 EST 2024
Running Species67 on 125 CPU cores 750 Gb memory
OrthoFinder version 3.0.1b1 Copyright (C) 2014 David Emms
...
Inferring multiple sequence alignments and gene trees
-----------------------------------------------------
2024-12-26 23:50:19 : Done 0 of 42370
2024-12-27 21:52:44 : Done 1000 of 42370
2024-12-28 06:37:06 : Done 42000 of 42370
....STILL RUNNING January 6 2:56 pm
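
The "final MSA step" durations quoted in the headings above can be re-derived from the timestamps in the logs. Below is a minimal Python sketch, assuming the log lines keep the `YYYY-MM-DD HH:MM:SS : Done ...` shape shown in the excerpts. The function name `msa_phase_times` and the sample string are illustrative only and are not part of OrthoFinder.

```python
import re
from datetime import datetime

# Matches lines such as "2024-12-18 03:01:27 : Done 0 of 35357"
# and the closing line "2024-12-24 10:58:04 : Done MSA/Trees".
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) : Done (.+)$")


def msa_phase_times(log_text):
    """Split the MSA/tree stage into (bulk, tail) durations.

    bulk: first numbered progress line to the last numbered progress line.
    tail: last numbered progress line to "Done MSA/Trees",
          or None if the run has not finished yet.
    """
    progress = []
    finished = None
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip banners, separators and "..." lines
        when = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        what = match.group(2).strip()
        if what == "MSA/Trees":
            finished = when
        elif re.fullmatch(r"\d+ of \d+", what):
            progress.append(when)
    if not progress:
        raise ValueError("no 'Done N of M' progress lines found")
    bulk = progress[-1] - progress[0]
    tail = finished - progress[-1] if finished else None
    return bulk, tail


# Lines copied from the Species51 Diamond Default excerpt above.
SPECIES51_DEFAULT = """\
2024-12-18 03:01:27 : Done 0 of 35357
2024-12-18 22:04:09 : Done 1000 of 35357
...
2024-12-19 04:50:07 : Done 35000 of 35357
2024-12-24 10:58:04 : Done MSA/Trees
"""

bulk, tail = msa_phase_times(SPECIES51_DEFAULT)
print("bulk:", bulk)
print("tail:", tail)
```

For this excerpt the first 35,000 alignments take about a day, while the last few hundred take just over five days, which matches the heading for that run.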

lauriebelch · 2025-01-07T11:22:32Z

Thanks for sharing the log files - I think I have an explanation for what is happening

Diamond ultra-sensitive tends to make (marginally) fewer and smaller orthogroups. When it comes to multiple sequence alignment with MAFFT, the limiting factor is often the size of the largest orthogroups. Those alignments take a very long time to run.

Diamond ultra-sensitive is slow for the all-versus-all search; however, the alignment step ends up being quicker, because there are fewer super-large orthogroups.

My recommendation would be to switch to using FAMSA for alignment, instead of MAFFT (see https://github.com/davidemms/OrthoFinder?tab=readme-ov-file#configjson--adding-addtional-programs-for-tree-inference-local-alignment-or-msa)

FAMSA is much quicker at alignment, so you don't run into the same issue (and it will soon become the default option in the new version of OrthoFinder).

000generic · 2025-01-07T12:36:38Z

Thank you for the quick and detailed explanation!

I agree on diamond sensitive vs ultrasensitive - that makes sense. Thank you!

I can check and see if the largest orthogroup in Species51 Diamond sensitive is larger than for Species67 Diamond sensitive. That would be the prediction, I guess, to explain the unexpected time difference. I can update here later...
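
One way to run that check is to rank orthogroups by total gene count in each run and compare the top few. The sketch below assumes a tab-separated gene-count table whose first column is the orthogroup ID, followed by one column per species and optionally a final `Total` column. Earlier OrthoFinder releases write such a table as `Orthogroups.GeneCount.tsv`; whether 3.0.1b1 uses the same name and location is an assumption here, and `largest_orthogroups` is a hypothetical helper.

```python
import csv
import io


def largest_orthogroups(tsv_text, top_n=5):
    """Return the top_n (orthogroup_id, gene_count) pairs, largest first.

    Assumes a header row, an ID in column 1, integer counts per species,
    and optionally a trailing "Total" column that is ignored in the sum.
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    has_total = header[-1] == "Total"
    sizes = []
    for row in reader:
        if not row:
            continue  # tolerate blank lines
        counts = row[1:-1] if has_total else row[1:]
        sizes.append((row[0], sum(int(c) for c in counts)))
    sizes.sort(key=lambda item: item[1], reverse=True)
    return sizes[:top_n]


# Tiny made-up table, only to show the expected layout.
EXAMPLE = (
    "Orthogroup\tspeciesA\tspeciesB\tTotal\n"
    "OG0000000\t10\t5\t15\n"
    "OG0000001\t2\t1\t3\n"
    "OG0000002\t40\t2\t42\n"
)

for name, size in largest_orthogroups(EXAMPLE, top_n=2):
    print(name, size)
```

To use it on real output, read the file from each results directory with `open(path).read()` and compare the largest counts between the Species51 and Species67 runs.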

I'll give FAMSA a try - do you think it is better in all cases - even when there are only 10s or a few hundred sequences? Somehow I had the idea that it was for large-scale alignment but at small scales MAFFT was best. But I've no idea where I got that idea / impression.

Thank you again :)
