Speed up advancing on the disjunction iterator. #14052

jpountz · 2024-12-10T14:54:00Z

Currently, the disjunction iterator puts all clauses in a heap in order to be able to merge doc IDs in a streaming fashion. This works well for exhaustive evaluation, where on average only one clause moves to a different doc ID per iteration, so the per-iteration cost is on the order of O(log(N)), where N is the number of clauses.

However, if a selective filter is applied, many clauses may move to a different doc ID on each advance. In the worst case, all clauses move to a different doc ID and the cost of maintaining heap invariants grows to O(N * log(N)), since every clause incurs an O(log(N)) cost. With many clauses, this is much higher than the cost of checking all clauses sequentially, which is O(N).

To protect against this reordering overhead, DisjunctionDISIApproximation now only puts the cheapest clauses in a heap, chosen so that at most about 1.5 clauses move to a different doc ID on average. More expensive clauses are checked linearly.
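A minimal sketch of the heap-plus-linear split in Java, for illustration only. This is not the code in this PR: the class `HybridDisjunctionSketch`, its `Clause` helper, the `leadCost` parameter, and the way the 1.5 budget is estimated (each clause is assumed to contribute `min(cost, leadCost) / leadCost` expected moves per advance) are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Illustrative only: a hypothetical stand-in, not Lucene's DisjunctionDISIApproximation. */
final class HybridDisjunctionSketch {
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  /** Hypothetical clause backed by a sorted array of doc IDs; its cost is the number of docs. */
  static final class Clause {
    final int[] docs;
    int idx = -1;

    Clause(int... docs) { this.docs = docs; }

    long cost() { return docs.length; }

    int docID() {
      if (idx < 0) return -1;
      return idx < docs.length ? docs[idx] : NO_MORE_DOCS;
    }

    /** Moves to the first doc ID >= target; only called when docID() < target. */
    int advance(int target) {
      do { idx++; } while (idx < docs.length && docs[idx] < target);
      return docID();
    }
  }

  final PriorityQueue<Clause> heap = new PriorityQueue<>(Comparator.comparingInt(Clause::docID));
  final List<Clause> linear = new ArrayList<>();

  /** leadCost is an assumed estimate of how many times advance() will be called. */
  HybridDisjunctionSketch(List<Clause> clauses, long leadCost) {
    long lead = Math.max(1, leadCost);
    List<Clause> byCost = new ArrayList<>(clauses);
    byCost.sort(Comparator.comparingLong(Clause::cost));
    double expectedMoves = 0;
    // Cheapest first: once the 1.5 budget is exceeded, every remaining
    // (more expensive) clause also exceeds it and is scanned linearly.
    for (Clause c : byCost) {
      // Assumed estimate: a clause moves on at most min(cost, lead) of the advances.
      double moves = (double) Math.min(c.cost(), lead) / lead;
      if (expectedMoves + moves <= 1.5) {
        expectedMoves += moves;
        heap.add(c);
      } else {
        linear.add(c);
      }
    }
  }

  /** Returns the smallest doc ID >= target matched by any clause, or NO_MORE_DOCS. */
  int advance(int target) {
    // Heap part: each clause that is behind target costs one O(log n) reordering.
    while (!heap.isEmpty() && heap.peek().docID() < target) {
      Clause top = heap.poll();
      top.advance(target);
      heap.add(top);
    }
    int min = heap.isEmpty() ? NO_MORE_DOCS : heap.peek().docID();
    // Linear part: a plain scan with no reordering cost.
    for (Clause c : linear) {
      int doc = c.docID() < target ? c.advance(target) : c.docID();
      min = Math.min(min, doc);
    }
    return min;
  }

  public static void main(String[] args) {
    Clause a = new Clause(3, 50);
    Clause b = new Clause(7, 50, 90);
    Clause c = new Clause(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
    HybridDisjunctionSketch d = new HybridDisjunctionSketch(Arrays.asList(a, b, c), 4);
    System.out.println(d.heap.size() + " in heap, " + d.linear.size() + " linear");
    for (int target : new int[] {0, 8, 11, 51}) {
      System.out.println("advance(" + target + ") = " + d.advance(target));
    }
  }
}
```

With these inputs and an assumed `leadCost` of 4, the two cheap clauses land in the heap (0.5 + 0.75 expected moves) and the 10-doc clause is scanned linearly. The four `advance` calls return 1, 8, 50 and 90.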


jpountz · 2024-12-10T14:56:53Z

luceneutil suggests that this change gives a small slowdown when a DisjunctionDISIApproximation leads iteration (AndHighOrMedMed, CombinedOrHighMed, CombinedAndHighMed, CombinedOrHighHigh, CombinedAndHighHigh) in exchange for a speedup when it has to catch up with another clause (CountFilteredOrHighMed, AndMedOrHighHigh, CountFilteredOrHighHigh, CountFilteredOrMany). Note that the slowdown is a fixed cost due to a few additional checks, while the speedup scales with the number of clauses.

                            Task    QPS baseline      StdDev    QPS my_modified_version      StdDev                Pct diff p-value
                          IntNRQ      115.88     (17.1%)      110.62     (12.7%)   -4.5% ( -29% -   30%) 0.366
                  FilteredIntNRQ      114.03     (16.1%)      109.46     (12.6%)   -4.0% ( -28% -   29%) 0.405
                 AndHighOrMedMed       46.62      (2.1%)       45.46      (1.3%)   -2.5% (  -5% -    0%) 0.000
                         Prefix3      130.51      (4.4%)      127.43      (5.9%)   -2.4% ( -12% -    8%) 0.175
               CombinedOrHighMed       74.76      (1.1%)       73.29      (1.9%)   -2.0% (  -4% -    1%) 0.000
              CombinedAndHighMed       57.30      (1.8%)       56.19      (2.0%)   -1.9% (  -5% -    1%) 0.002
              CombinedOrHighHigh       19.72      (1.3%)       19.36      (2.0%)   -1.8% (  -5% -    1%) 0.002
             CombinedAndHighHigh       15.75      (1.9%)       15.47      (2.2%)   -1.8% (  -5% -    2%) 0.012
             FilteredAndHighHigh       63.99      (1.8%)       63.06      (1.3%)   -1.4% (  -4% -    1%) 0.006
            FilteredAndStopWords       49.00      (1.7%)       48.40      (1.2%)   -1.2% (  -4% -    1%) 0.013
                     CountPhrase        4.34      (3.0%)        4.29      (3.1%)   -1.2% (  -7% -    5%) 0.251
                    AndStopWords       32.46      (3.0%)       32.11      (3.9%)   -1.1% (  -7% -    5%) 0.344
                        Wildcard       76.33      (3.2%)       75.52      (3.8%)   -1.1% (  -7% -    6%) 0.367
                    CombinedTerm       32.39      (2.7%)       32.08      (3.5%)   -1.0% (  -6% -    5%) 0.349
                       CountTerm     9409.48      (3.1%)     9321.18      (3.3%)   -0.9% (  -7% -    5%) 0.378
                          Fuzzy1       81.72      (2.4%)       81.00      (2.7%)   -0.9% (  -5% -    4%) 0.304
                  FilteredPhrase       30.89      (2.0%)       30.62      (1.8%)   -0.9% (  -4% -    2%) 0.165
                    FilteredTerm      154.99      (3.1%)      153.79      (2.3%)   -0.8% (  -6% -    4%) 0.400
                   TermTitleSort      157.71      (1.9%)      156.57      (1.7%)   -0.7% (  -4% -    2%) 0.235
                            Term      480.15      (6.3%)      476.76      (5.3%)   -0.7% ( -11% -   11%) 0.717
               FilteredAnd3Terms      198.21      (1.5%)      196.86      (1.8%)   -0.7% (  -3% -    2%) 0.218
     FilteredAnd2Terms2StopWords      200.76      (1.1%)      199.40      (1.0%)   -0.7% (  -2% -    1%) 0.055
                 FilteredPrefix3      123.17      (4.3%)      122.39      (5.9%)   -0.6% ( -10% -    9%) 0.712
                     OrStopWords       33.58      (5.5%)       33.37      (6.7%)   -0.6% ( -12% -   12%) 0.764
                          Fuzzy2       76.93      (2.0%)       76.46      (2.3%)   -0.6% (  -4% -    3%) 0.401
                  CountOrHighMed      140.99      (1.6%)      140.14      (1.1%)   -0.6% (  -3% -    2%) 0.198
               TermDayOfYearSort      637.24      (2.3%)      633.60      (2.6%)   -0.6% (  -5% -    4%) 0.484
                        PKLookup      279.61      (1.9%)      278.15      (2.9%)   -0.5% (  -5% -    4%) 0.517
                 CountOrHighHigh       75.89      (2.0%)       75.50      (1.7%)   -0.5% (  -4% -    3%) 0.398
                  FilteredOrMany       17.03      (3.2%)       16.95      (4.1%)   -0.5% (  -7% -    7%) 0.699
                        Or3Terms      172.16      (3.4%)      171.35      (3.7%)   -0.5% (  -7% -    6%) 0.689
                       And3Terms      181.15      (2.0%)      180.35      (2.9%)   -0.4% (  -5% -    4%) 0.597
              FilteredAndHighMed      132.36      (1.6%)      131.82      (2.3%)   -0.4% (  -4% -    3%) 0.535
                     CountOrMany        7.52      (1.9%)        7.49      (1.7%)   -0.4% (  -3% -    3%) 0.522
                     AndHighHigh       45.42      (2.4%)       45.30      (1.9%)   -0.3% (  -4% -    4%) 0.717
                DismaxOrHighHigh      115.82      (5.0%)      115.54      (4.4%)   -0.2% (  -9% -    9%) 0.876
                   TermMonthSort     3413.43      (1.9%)     3405.69      (2.0%)   -0.2% (  -4% -    3%) 0.729
                      AndHighMed      132.72      (2.0%)      132.45      (1.8%)   -0.2% (  -3% -    3%) 0.747
                          OrMany       19.85      (3.4%)       19.82      (2.6%)   -0.2% (  -5% -    6%) 0.880
             CountFilteredPhrase       26.01      (1.8%)       25.97      (1.6%)   -0.1% (  -3% -    3%) 0.832
             And2Terms2StopWords      166.77      (2.4%)      166.60      (2.9%)   -0.1% (  -5% -    5%) 0.908
                 CountAndHighMed      161.46      (1.5%)      161.39      (2.2%)   -0.0% (  -3% -    3%) 0.947
               FilteredOrHighMed      155.06      (1.2%)      155.02      (1.2%)   -0.0% (  -2% -    2%) 0.955
                 DismaxOrHighMed      168.40      (3.1%)      168.37      (2.5%)   -0.0% (  -5% -    5%) 0.981
                FilteredOr3Terms      167.26      (1.2%)      167.34      (1.0%)    0.1% (  -2% -    2%) 0.889
              Or2Terms2StopWords      162.55      (3.5%)      162.65      (3.7%)    0.1% (  -6% -    7%) 0.956
                      OrHighHigh       52.07      (4.3%)       52.11      (4.9%)    0.1% (  -8% -    9%) 0.960
             FilteredOrStopWords       43.70      (2.7%)       43.75      (2.7%)    0.1% (  -5% -    5%) 0.904
      FilteredOr2Terms2StopWords      148.84      (1.3%)      149.02      (1.3%)    0.1% (  -2% -    2%) 0.790
                      DismaxTerm      600.72      (5.2%)      601.50      (3.8%)    0.1% (  -8% -    9%) 0.932
              FilteredOrHighHigh       64.94      (2.5%)       65.03      (2.5%)    0.1% (  -4% -    5%) 0.872
                      OrHighRare      261.76     (10.9%)      262.11     (10.4%)    0.1% ( -19% -   24%) 0.970
                CountAndHighHigh       55.31      (1.8%)       55.45      (1.7%)    0.2% (  -3% -    3%) 0.677
                          Phrase       15.80      (5.4%)       15.84      (4.5%)    0.3% (  -9% -   10%) 0.865
                      TermDTSort      286.63      (5.6%)      287.51      (6.4%)    0.3% ( -11% -   13%) 0.878
                       OrHighMed      191.75      (3.8%)      192.45      (3.6%)    0.4% (  -6% -    8%) 0.766
          CountFilteredOrHighMed       68.17      (1.7%)       68.93      (1.4%)    1.1% (  -1% -    4%) 0.034
                AndMedOrHighHigh       60.11      (2.0%)       62.55      (2.2%)    4.1% (   0% -    8%) 0.000
         CountFilteredOrHighHigh       57.67      (2.0%)       64.28      (2.0%)   11.5% (   7% -   15%) 0.000
             CountFilteredOrMany        3.88      (4.8%)        8.71      (3.3%)  124.6% ( 111% -  139%) 0.000

jpountz added this to the 10.1.0 milestone Dec 10, 2024

javanna modified the milestones: 10.1.0, 10.2.0 Dec 14, 2024
