Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unnecessary float[](BM25Scorer) allocations for non-scoring queries #12297

Closed
jainankitk opened this issue May 16, 2023 · 6 comments
Closed

Unnecessary float[](BM25Scorer) allocations for non-scoring queries #12297

jainankitk opened this issue May 16, 2023 · 6 comments
Labels
Milestone

Comments

@jainankitk
Copy link
Contributor

jainankitk commented May 16, 2023

Description

While looking into customer issue, I noticed increase in GC time from Lucene 7.x to 8.x. From the JVM histograms, one of the primary difference was float[] allocation. Took a heap dump to check the dominator and it was coming from BM25Scorer.

The change seems to have come in with 8fd7ead, which removed some of the special-case logic around the "non-scoring similarity" embedded in IndexSearcher (returned in the false case from the old IndexSearcher#scorer(boolean needsScores)).

 num     #instances         #bytes  class name (module)
-------------------------------------------------------
   1:      24601972     4773247024  [B (java.base@11.0.17)
   2:       2100779     2061684496  [F (java.base@11.0.17)
   3:      33501475      804035400  java.util.ArrayList (java.base@11.0.17)
   4:      16232322      716523504  [Ljava.lang.Object; (java.base@11.0.17)
   5:      14819347      711328656  java.util.HashMap (java.base@11.0.17)
...
...
  34:       1106011       79632792  org.apache.lucene.store.ByteBufferIndexInput$SingleBufferImpl
  35:       1979609       79184360  org.apache.lucene.search.similarities.BM25Similarity$BM25Scorer

I also validated that the scoring mode for these queries is COMPLETE_NO_SCORES, that has needsScore set to false:

method=org.apache.lucene.search.TermQuery$TermWeight.<init> location=AtExit
ts=2023-05-01 22:32:57; [cost=0.014381ms] result=@ArrayList[
    @ScoreMode[COMPLETE_NO_SCORES],
]
method=org.apache.lucene.search.TermQuery$TermWeight.<init> location=AtExit
ts=2023-05-01 22:32:56; [cost=0.029482ms] result=@ArrayList[
    @ScoreMode[COMPLETE_NO_SCORES],
]
method=org.apache.lucene.search.TermQuery$TermWeight.<init> location=AtExit
ts=2023-05-01 22:32:57; [cost=0.0135ms] result=@ArrayList[
    @ScoreMode[COMPLETE_NO_SCORES],
]

Version and environment details

Using Lucene 8.10.1, though the issue is there starting 8.x goes into 9.x as well

Screenshot

Screenshot 2023-05-19 at 5 56 08 PM

@jainankitk
Copy link
Contributor Author

jainankitk commented May 16, 2023

I also came across this discussion, but could not understand why this is not an issue. Seems like regression to me, for specific customer use case. I also tried quick hack patch (always returning non-scoring similarity) just to confirm, and the float[] did disappear from the histogram / heap dump.

 public class IndexSearcher {
 
+  /** A search-time {@link Similarity} that does not make use of scoring factors
+   *  and may be used when scores are not needed. */
+  private static final Similarity NON_SCORING_SIMILARITY = new Similarity() {
+ 
+    @Override
+    public long computeNorm(FieldInvertState state) {
+      throw new UnsupportedOperationException("This Similarity may only be used for searching, not indexing");
+    }
+ 
+    @Override
+    public SimScorer scorer(float boost, CollectionStatistics collectionStats, TermStatistics... termStats) {
+      return new SimScorer() {
+        @Override
+        public float score(float freq, long norm) {
+          return 0f;
+        }
+      };
+    }
+ 
+  };
+
   private static QueryCache DEFAULT_QUERY_CACHE;
   private static QueryCachingPolicy DEFAULT_CACHING_POLICY = new UsageTrackingQueryCachingPolicy();
   static {
@@ -314,7 +336,7 @@ public class IndexSearcher {
    *  {@link Similarity} that has been set through {@link #setSimilarity(Similarity)}
    *  or the default {@link Similarity} if none has been set explicitly. */
   public Similarity getSimilarity() {
-    return similarity;
+    return NON_SCORING_SIMILARITY;
   }
 
   /**

@jpountz - Can you check once and confirm?

@jainankitk jainankitk changed the title Unnecessary BM25Scorer allocations for non-scoring queries Unnecessary float[](BM25Scorer) allocations for non-scoring queries May 16, 2023
@jpountz
Copy link
Contributor

jpountz commented Jun 13, 2023

I agree with Robert that 1kB per segment doesn't sound like a crazy amount of allocations, which suggests that you are searching many segments. Does the memory allocation profile look much better after your change?

If yes, I wouldn't be against a small isolated change that avoids these allocations. It looks like one way to do it would be to update the TermWeight constructor to assign a dummy simScorer when scores are not needed.

@jainankitk
Copy link
Contributor Author

jainankitk commented Jun 16, 2023

I agree with Robert that 1kB per segment doesn't sound like a crazy amount of allocations, which suggests that you are searching many segments. Does the memory allocation profile look much better after your change?

I agree that per segment it isn't huge but those allocations add up to couple of GBs (dominated heap in image above). And I did notice that the float[] did not show up in the allocation profile after applying above patch

If yes, I wouldn't be against a small isolated change that avoids these allocations. It looks like one way to do it would be to update the TermWeight constructor to assign a dummy simScorer when scores are not needed.

I was thinking along similar lines. @sgup432 will be raising PR for making this change

@sgup432
Copy link
Contributor

sgup432 commented Jun 22, 2023

@jpountz I have opened a PR for this, can you check?. I have added dummy scored in TermsWeight in case score is not needed as discussed.
I was earlier thinking to take a needsScore parameter in IndexSearcher.getSimilarity to cover all use cases but seems like a lot of changes elsewhere.

@jpountz
Copy link
Contributor

jpountz commented Jun 25, 2023

+1 to not change the signature of getSimilarity.

@jpountz
Copy link
Contributor

jpountz commented Jul 11, 2023

This has been addressed via #12383

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants