
Feature request: Limit the MSA depth in Jackhmmer/Nhmmer/Phmmer/Hmmsearch #323

Open
Augustin-Zidek opened this issue Apr 10, 2024 · 19 comments


@Augustin-Zidek

Would it be possible to add a flag that limits the number of sequences returned by Jackhmmer/Nhmmer/Phmmer/Hmmsearch?

When HMMER is used in the context of protein structure prediction, we usually care only about the top n hits (typically ~5,000) sorted by E-value; the rest are discarded.

Aligning all of the hits, outputting them to stdout, and saving them to disk is therefore wasteful, as we then read just the top n and discard the rest.

Moreover, queries with an extreme number of hits lead to unnecessarily long run times and require a lot of RAM to store the results.

We added our own limited and brittle workaround for this in Jackhmmer, but it only affects the output MSA and doesn't address the unnecessary computation that happens before:

 p7_tophits_SortBySortkey(info->th);
 p7_tophits_Threshold(info->th, info->pli);
+if (esl_opt_IsOn(go, "--seq_limit"))
+{
+    int seq_limit = esl_opt_GetInteger(go, "--seq_limit");
+    info->th->N = ESL_MIN(info->th->N, seq_limit);
+}
  p7_tophits_CompareRanking(info->th, kh, &nnew_targets);

A better and more robust solution would probably be to modify P7_TOPHITS to have a max_depth attribute, so that the top-hits list cannot grow beyond that number.

@cryptogenomicon
Member

It's a high priority for us to make better ways of dealing with tons of hits and large alignments.

It's unlikely that we'll do something like first-N hits, though, because that makes search results dependent on the order of sequences in the target db.

Or if you want the top-N, not just the first-N, then that doesn't save you computation, only space; you should be able to achieve a top-N strategy now by filtering or downsampling an output tabular data file any way you want, and immediately discarding the original large output file.

We're working on other downsampling methods, and also on highly compressed alignment representations.

@Augustin-Zidek
Author

Oh sorry, I wasn't clear: I meant the top-N sorted by E-value, not the first-N. I'm thinking of P7_TOPHITS essentially being something like a min-heap of size N.

> Or if you want the top-N, not just the first-N, then that doesn't save you computation, only space; you should be able to achieve a top-N strategy now by filtering or downsampling an output tabular data file any way you want, and immediately discarding the original large output file.

Yes, good point, you have to align to the query and sort anyway. But it would still be useful if this functionality were built in: multi-GB MSA files can be produced, and saving those can take a non-trivial amount of time, especially on remote filesystems.

We observed this with a sequence that has over 1M hits in UniRef90 at an E-value limit of 0.001, producing a 45 GB Stockholm MSA and a 256 MB tblout file.

@cryptogenomicon
Member

One way you can go is to generate the tblout file (and not the full MSA); filter or downsample the lines in the tblout file; fetch those subsequences (with esl-sfetch, for instance); and use hmmalign to align them to the query model. This saves storing the 45 GB MSA.
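A rough sketch of that workflow (file names are placeholders, and you may need to adapt the top-N cutoff and the model-building step to your setup):

# Index the target database once (creates uniref90.fasta.ssi).
esl-sfetch --index uniref90.fasta

# 1. Search, keeping only the tabular output.
jackhmmer -o /dev/null --tblout tblout.txt --noali \
    -N 1 -E 0.001 --incE 0.001 query.fasta uniref90.fasta

# 2. Keep the 5000 best hits by full-sequence E-value (column 5 of
#    tblout), extracting the target names (column 1).
grep -v '^#' tblout.txt | sort -g -k5,5 | head -n 5000 \
    | awk '{print $1}' > top_hits.txt

# 3. Fetch those sequences from the indexed database.
esl-sfetch -f uniref90.fasta top_hits.txt > top_hits.fasta

# 4. Align them to the query model. Here the model is built from the
#    single query sequence; with jackhmmer you could instead reuse a
#    checkpointed model saved via --chkhmm.
hmmbuild query.hmm query.fasta
hmmalign -o top_hits.sto query.hmm top_hits.fasta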

@Augustin-Zidek
Author

Thanks, that is an interesting thing to try.

However, even with -o /dev/null, won't it still do a lot of unnecessary I/O?

@cryptogenomicon
Member

Without -A <msafile> it won't even generate the multiple alignment internally (saving a ton of RAM and some computation), and -o /dev/null saves a lot of storage and probably a bunch of output time too. I definitely focus on the tabular output files when I'm manipulating large output volumes (which these days is nigh-always).

@Augustin-Zidek
Author

Augustin-Zidek commented Apr 10, 2024

Here are timings with and without -A:

Without -A

$ jackhmmer \
-o /dev/null --tblout tblout.txt \
--noali \
--cpu 22 \
-N 1 \
-E 0.001 --incE 0.001 \
query.fasta uniref90.fasta

real    18m26.775s
user    332m53.777s
sys     3m47.448s

With -A

$ jackhmmer \
-o /dev/null -A out.a3m --tblout tblout.txt \
--noali \
--cpu 22 \
-N 1 \
-E 0.001 --incE 0.001 \
query.fasta uniref90.fasta

real    20m1.193s
user    332m53.121s
sys     4m5.238s

I am testing with this sequence:

> query
PNPCPESSASFLSRITFWWITGMMVQGYRQPLESTDLWSLNKEDTSEQVVPVLVKNWKKECAKSRKQPVKIVYSSKDPAK
PKGSSKVDVNEEAEALIVKCPQKERDPSLFKVLYKTFGPYFLMSFLFKAVHDLMMFAGPEILKLLINFVNDKKAPEWQGY
FYTALLFISACLQTLVLHQYFHICFVSGMRIKTAVIGAVYRKALVITNAARKSSTVGEIVNLMSVDAQRFMDLATYINMI
WSAPLQVILALYLLWLNLGPSVLAGVAVMVLMVPLNAVMAMKTKTYQVAHMKSKDNRIKLMNEILNGIKVLKLYAWELAF
KDKVLAIRQEELKVLKKSAYLAAVGTFTWVCTPFLVALSTFAVYVTVDENNILDAQKAFVSLALFNILRFPLNILPMVIS
SIVQASVSLKRLRVFLSHEDLDPDSIQRRPIKDAGATNSITVKNATFTWARNDPPTLHGITFSVPEGSLVAVVGQVGCGK
SSLLSALLAEMDKVEGHVTVKGSVAYVPQQAWIQNISLRENILFGRQLQERYYKAVVEACALLPDLEILPSGDRTEIGEK
GVNLSGGQKQRVSLARAVYCDSDVYLLDDPLSAVDAHVGKHIFENVIGPKGLLKNKTRLLVTHAISYLPQMDVIIVMSGG
KISEMGSYQELLARDGAFAEFLRTYASAEQEQGQPEDGLAGVGGPGKEVKQMENGMLVTDTAGKQMQRQLSSSSSYSRDV
SQHHTSTAELRKPGPTEETWKLVEADKAQTGQVKLSVYWDYMKAIGLFISFLSIFLFLCNHVASLVSNYWLSLWTDDPIV
NGTQEHTQVRLSVYGALGISQGITVFGYSMAVSIGGIFASRRLHLDLLHNVLRSPISFFERTPSGNLVNRFSKELDTVDS
MIPQVIKMFMGSLFNVIGACIIILLATPMAAVIIPPLGLIYFFVQRFYVASSRQLKRLESVSRSPVYSHFNETLLGVSVI
RAFEEQERFIRQSDLKVDENQKAYYPSIVANRWLAVRLECVGNCIVLFASLFAVISRHSLSAGLVGLSVSYSLQVTTYLN
WLVRMSSEMETNIVAVERLKEYSETEKEAPWQIQDMAPPKDWPQVGRVEFRDYGLRYREDLDLVLKHINVTIDGGEKVGI
VGRTGAGKSSLTLGLFRIKESAEGEIIIDDINIAKIGLHDLRFKITIIPQDPVLFSGSLRMNLDPFSQYSDEEVWTSLEL
AHLKGFVSALPDKLNHECAEGGENLSVGQRQLVCLARALLRKTKILVLDEATAAVDLETDDLIQSTIRTQFDDCTVLTIA
HRLNTIMDYTRVIVLDKGEIQEWGSPSDLLQQRGLFYSMAKDSGLV

@cryptogenomicon
Member

(Pleasantly surprised to see the multithreaded parallelization scaling well to 22 cores; normally jackhmmer starts becoming input-bound very quickly, even at 2 cores or so. You must have some fast storage.)

@Augustin-Zidek
Author

Augustin-Zidek commented Apr 10, 2024

A few more observations.

We have one further patch that outputs a3m instead of Stockholm from Jackhmmer (afp_ext below holds the file extension of the -A output path, computed elsewhere in our patch):

-if (textw > 0) esl_msafile_Write(afp, msa, eslMSAFILE_STOCKHOLM);
+// Write the output as A2M unless a '.sto' file extension is provided.
+if (textw > 0) {
+    if (afp_ext && strcmp(afp_ext, "sto") == 0) {
+        esl_msafile_Write(afp, msa, eslMSAFILE_STOCKHOLM);
+    } else if (strcmp(esl_opt_GetString(go, "-A"), "/dev/stdout") == 0 ||
+               (afp_ext && (strcmp(afp_ext, "a3m") == 0 || strcmp(afp_ext, "a2m") == 0))) {
+        esl_msafile_Write(afp, msa, eslMSAFILE_A2M);
+    } else {
+        p7_Fail("Bad output MSA format %s, only a2m/a3m and sto are supported\n", afp_ext);
+    }
+}
 else           esl_msafile_Write(afp, msa, eslMSAFILE_PFAM);

This patch reduces the output MSA size from 45 GB (Stockholm) to 1.6 GB (a3m) and slightly reduces the runtime to:

real    19m2.367s
user    324m10.399s
sys     2m23.680s

Combining this patch with the --seq_limit patch, we see a significantly better runtime (and still the 1.6 GB a3m output MSA):

real    15m22.416s
user    321m15.616s
sys     1m49.013s

It therefore seems to me that a non-trivial amount of work is eliminated by applying the naive sequence limit patch.

@Augustin-Zidek
Author

I did a bit of profiling and here is the profile for the offending query with --cpu 1 (and with our --seq_limit and a3m-output patches applied).

[flame graph screenshot: profile of the offending query]

For comparison, here is the profile for a query that runs 2.5x faster because it doesn't have that many hits:

[flame graph screenshot: profile of the faster query]

Does the profile look reasonable to you? Is it reasonable for p7_domaindef_ByPosteriorHeuristics to account for most of the run time?

@npcarter
Member

npcarter commented Apr 19, 2024 via email

@Augustin-Zidek
Copy link
Author

Augustin-Zidek commented Apr 22, 2024

Thanks for the extra details, Nick!

> The time breakdown of a hmmsearch run will vary tremendously depending on the fraction of sequences in the database that are detected as homologs.

Indeed -- this query (see #323 (comment)) has over 1M homologs in UniRef90.

> For example, here's the top of the output of a perf report on the compute thread when I searched the Caudal_act HMM from Pfam against a version of the Uniprot TrEMBL database

Could you run this report with the sequence and database I reported above, to check whether you get numbers similar to what I am seeing when profiling?

> If you have the outputs of these searches, could you send me the pipeline statistics information that appears at the very end of the output?

Absolutely, here they are:

Internal pipeline statistics summary:
-------------------------------------
Query model(s):                            1  (1326 nodes)
Target sequences:                  135301051  (45360760554 residues searched)
Passed MSV filter:                   5218733  (0.0385713); expected 2706021.0 (0.02)
Passed bias filter:                  4358682  (0.0322147); expected 2706021.0 (0.02)
Passed Vit filter:                   1763645  (0.013035); expected 135301.1 (0.001)
Passed Fwd filter:                   1501121  (0.0110947); expected 1353.0 (1e-05)
Initial search space (Z):          135301051  [actual number of targets]
Domain search space  (domZ):         1061640  [number of targets reported over threshold]
# CPU time: 19273.26u 351.44s 05:27:04.69 Elapsed: 00:22:30.69
# Mc/sec: 44531.39

@@ New targets included:   1061640
@@ New alignment includes: 1072120 subseqs (was 1), including original query
# Alignment of 1072120 hits satisfying inclusion thresholds saved to: out.a3m

> Also, can you tell me what tool you used to generate those performance profiles?

I used the perf record command (https://perf.wiki.kernel.org/index.php/Tutorial#Sampling_with_perf_record):

perf record -g <jackhmmer command>

I then visualised the generated perf.data using pprof (https://github.com/google/pprof):

pprof -flame perf.data

This might require an extra conversion step (see https://github.com/google/pprof#using-pprof-with-linux-perf).
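For completeness, the full recipe I used looks roughly like this (assuming the perf_to_profile tool from github.com/google/perf_data_converter for the conversion step; flags may differ between versions):

# Record a profile of the search (call-graph sampling).
perf record -g -- jackhmmer [options...] query.fasta uniref90.fasta

# Convert perf.data to pprof's profile.proto format.
perf_to_profile -i perf.data -o profile.pb

# Visualise as a flame graph.
pprof -flame profile.pb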

@npcarter
Member

npcarter commented Apr 22, 2024 via email

@hegelab

hegelab commented Nov 21, 2024

Hi,
I attempted to run AlphaFold with CFTR_HUMAN and found that 250 GB of RAM was insufficient, requiring me to allocate 500 GB instead. The runtime was also significant, exceeding an hour (for the whole data pipeline; I did not benchmark the sequence search alone), despite using a system with decent SSD storage.
The AF3 team addressed this challenge by parsing and narrowing the alignment from a file, which helps reduce resource demands. However, this approach impacts performance.
It would be incredibly beneficial for the community to have a more efficient solution that handles such cases without compromising performance.

@npcarter
Member

npcarter commented Nov 21, 2024 via email

@npcarter
Member

npcarter commented Dec 4, 2024

As an update, I've added support for a2m alignment formatting to our develop branch, in case you want to give it a spin. The format of the alignment is controlled by the --pfam, --stockholm, and --a2m flags, only one of which may be provided. Otherwise, we default to pfam if the output width is unlimited, and stockholm if it is limited. This is only lightly tested, but we'd appreciate any comments you have.
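For example, on the develop branch something like the following should write the -A alignment in A2M format (an assumed invocation, pairing the new flag with jackhmmer; only lightly tested):

jackhmmer --a2m -A out.a2m -o /dev/null --tblout tblout.txt \
    -N 1 -E 0.001 --incE 0.001 query.fasta uniref90.fasta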

@hegelab

hegelab commented Dec 5, 2024

Thanks for your efforts, and sorry for the late response. I have thought a lot about this issue. However, I now think there is nothing to be done about the result size, either in RAM or on disk, since you have to have all the matches in order to rank them and select the top N hits.

No, there is! Writing the sentences down helps one think :-)
You know that you want the top N hits. So you start the search and collect the hits and scores, ordered by score. When the next hit (the N+1st) arrives, you check whether its score is better than the worst score among the current N hits; based on this, you either discard the new hit or keep it, replacing the worst-scoring one in the list.

I am using this within AlphaFold runs. I do not think the exact number of hits matters; I just know it is high enough that the results do not fit into 256 GB of RAM.

@npcarter
Member

npcarter commented Dec 5, 2024

This could be done, but it would have performance implications, because we'd have to keep the hit list sorted at all times rather than just sorting it once at the end. I'm not sure how big a deal that would be, but it would have an effect.

@hegelab

hegelab commented Dec 6, 2024

I do not have much experience with MSAs, and I am not a programmer, but I think this could be useful.
The overhead of maintaining the ordering should not be high, since it is not a full sort, just the insertion of a single element into an already ordered list, especially if N is around 10,000.

@Augustin-Zidek
Author

I think this could be done using a min-heap with no performance implications:

If `max_sequences` is not provided:
  Use the current algorithm: gather everything, sort at the end.

Complexity is O(n*log(n)) because of the final sort by sequence bit score.

If `max_sequences` is provided:
  Construct a min-heap of capacity max_sequences (keyed by sequence bit score).
  For every found sequence:
    If the heap is full AND the sequence bit score < heap min:
      Continue: skip this sequence (cheap shortcut that avoids an insert followed by a pop-min)

    Insert the sequence into the heap
    If the heap now holds more than max_sequences elements:
      Pop the min element

Complexity is O(n*log(max_sequences)), which is at most O(n*log(n)). Moreover, the number of memory accesses is much lower thanks to the shortcut: once the heap is full, most sequences can be rejected very cheaply (a single comparison against the heap min). The final sort can also use the min-heap directly, as a heapsort.

For simplicity, the second algorithm could also be used when max_sequences is not provided, but benchmarking would be needed to confirm that this doesn't cause a performance regression. I suspect the heap approach might even turn out to be faster because of better memory locality.
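For concreteness, here is a minimal self-contained sketch of the capped min-heap in C. This is not HMMER's actual P7_TOPHITS API; the TopNHeap type and topn_offer function are hypothetical names, and replacing the root is equivalent to the insert-then-pop-min step in the pseudocode above.

#include <stdlib.h>

typedef struct {
    double score;   /* sequence bit score: the heap key            */
    void  *hit;     /* opaque pointer to the hit record            */
} HeapEntry;

typedef struct {
    HeapEntry *e;   /* at most `cap` entries; e[0] is the worst kept hit */
    int        n;   /* current number of entries                   */
    int        cap; /* max_sequences                               */
} TopNHeap;

static void swap(HeapEntry *a, HeapEntry *b) { HeapEntry t = *a; *a = *b; *b = t; }

static void sift_up(TopNHeap *h, int i) {
    while (i > 0) {
        int p = (i - 1) / 2;                        /* parent index */
        if (h->e[p].score <= h->e[i].score) break;  /* min-heap order holds */
        swap(&h->e[p], &h->e[i]);
        i = p;
    }
}

static void sift_down(TopNHeap *h, int i) {
    for (;;) {
        int l = 2 * i + 1, r = 2 * i + 2, m = i;
        if (l < h->n && h->e[l].score < h->e[m].score) m = l;
        if (r < h->n && h->e[r].score < h->e[m].score) m = r;
        if (m == i) break;
        swap(&h->e[m], &h->e[i]);
        i = m;
    }
}

TopNHeap *topn_create(int cap) {
    TopNHeap *h = malloc(sizeof *h);
    h->e   = malloc((size_t)cap * sizeof *h->e);
    h->n   = 0;
    h->cap = cap;
    return h;
}

/* Offer a hit to the heap; returns 1 if it was kept, 0 if rejected. */
int topn_offer(TopNHeap *h, double score, void *hit) {
    if (h->n == h->cap) {
        if (score <= h->e[0].score) return 0; /* shortcut: no better than current worst */
        h->e[0].score = score;                /* replace the min, then restore order */
        h->e[0].hit   = hit;
        sift_down(h, 0);
        return 1;
    }
    h->e[h->n].score = score;                 /* heap not yet full: plain insert */
    h->e[h->n].hit   = hit;
    sift_up(h, h->n++);
    return 1;
}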

Would you be interested in us providing a proof of concept PR?
