Move synonym map off-heap for SynonymGraphFilter #13054

Open: msfroh wants to merge 5 commits into main from synonym_fst_offheap


@msfroh
Contributor

msfroh commented Jan 30, 2024

Description

This stores the synonym map's FST and word lookup off-heap in a separate, configurable directory.

The initial implementation is rough, but the unit tests pass with this change randomly enabled.

Obvious things that need work are:

  1. I tried to do something codec-like (though not a real codec) for the synonym map files. For a solution that could evolve over time, we should probably at least write something to the metadata file saying which format was used.
  2. Right now it makes no effort to detect changes to the synonym files. I would suggest that SynonymGraphFilterFactory rebuild the directory if a checksum of the input files doesn't match a value recorded in the metadata file.
  3. I don't think I like the random seeks in OffHeapBytesRefHashLike, but I don't see an alternative (besides moving it on-heap). Given that the original issue was only about moving the FST off-heap, maybe we can keep the word dictionary on-heap.
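
A minimal sketch of the checksum idea from item 2, assuming a CRC32 over the raw bytes of the synonym input files and a previously recorded value read back from the metadata file. `SynonymChecksum`, `checksumOf`, and `needsRebuild` are hypothetical names used only for illustration; nothing here is part of the PR or of Lucene.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.zip.CRC32;

/** Hypothetical helper (not in the PR): decides whether compiled synonyms are stale. */
final class SynonymChecksum {
  /** Computes one CRC32 over the contents of all input files, in order. */
  static long checksumOf(List<byte[]> inputFiles) {
    CRC32 crc = new CRC32();
    for (byte[] file : inputFiles) {
      crc.update(file, 0, file.length);
    }
    return crc.getValue();
  }

  /** Rebuild when no checksum was recorded, or when the recorded one differs. */
  static boolean needsRebuild(Long recordedChecksum, List<byte[]> inputFiles) {
    return recordedChecksum == null || recordedChecksum != checksumOf(inputFiles);
  }

  public static void main(String[] args) {
    // Assumed example inputs; the second list fixes a one-letter typo in the first.
    List<byte[]> before = List.of("see, ocean".getBytes(StandardCharsets.UTF_8));
    List<byte[]> after = List.of("sea, ocean".getBytes(StandardCharsets.UTF_8));
    long recorded = checksumOf(before);
    System.out.println("unchanged input needs rebuild: " + needsRebuild(recorded, before));
    System.out.println("edited input needs rebuild: " + needsRebuild(recorded, after));
  }
}
```

A stronger digest (or a per-file checksum) would work the same way; CRC32 is used here only because it is in the JDK and cheap to compute.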

Fixes #13005

@dungba88
Contributor

I think measuring latency impact would also be important for this change as off-heap will be slower (I understand that we are only adding off-heap as an option here, not forcing, but still).

@msfroh
Contributor Author

msfroh commented Feb 1, 2024

I did some rough benchmarks using the large synonym file attached to https://issues.apache.org/jira/browse/LUCENE-3233

The benchmark code and input is at msfroh@174f98a

All times are in milliseconds.

| Attempt | On-heap load | Off-heap load | Off-heap reload | On-heap process | Off-heap process | Off-heap reload process |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 1146.022381 | 1117.004359 | 4.099065 | 569.120851 | 656.430684 | 613.475144 |
| 2 | 1079.578922 | 1060.926854 | 1.761465 | 456.203168 | 655.596275 | 622.534246 |
| 3 | 1035.911388 | 1076.611629 | 1.750233 | 579.41094 | 655.955431 | 614.788388 |
| 4 | 1037.825728 | 1085.513933 | 2.074129 | 696.390519 | 688.664985 | 613.266972 |
| 5 | 1017.489384 | 1008.209808 | 1.717748 | 485.510526 | 620.800148 | 620.708538 |
| 6 | 1014.641653 | 1024.412669 | 1.740371 | 483.617261 | 619.696259 | 619.910897 |
| 7 | 1027.691397 | 1045.129567 | 1.727786 | 670.49456 | 622.48759 | 616.303549 |
| 8 | 984.005971 | 1009.265777 | 1.736832 | 513.543926 | 615.448442 | 613.06279 |
| 9 | 1027.841112 | 1027.057453 | 1.732985 | 486.502644 | 622.535269 | 620.285635 |
| 10 | 981.689573 | 1074.613506 | 1.71059 | 707.810107 | 613.417977 | 624.34832 |
| 11 | 1026.165712 | 1065.3181 | 1.689407 | 479.610417 | 621.454353 | 616.183786 |
| 12 | 994.949905 | 1046.898091 | 1.730394 | 498.938696 | 612.279425 | 619.965444 |
| 13 | 1035.144288 | 1043.119169 | 1.739726 | 472.821155 | 619.267425 | 613.029508 |
| 14 | 996.056368 | 1017.663948 | 1.699742 | 692.135015 | 619.725163 | 620.454352 |
| 15 | 1046.605644 | 1018.287866 | 1.713526 | 470.391592 | 619.723699 | 612.068366 |
| 16 | 1007.579733 | 1042.062818 | 1.70251 | 508.481346 | 619.481298 | 619.178419 |
| 17 | 1038.166702 | 1054.039165 | 1.683814 | 485.439337 | 620.901934 | 616.017789 |
| 18 | 1000.900448 | 1058.492139 | 1.7267 | 515.185816 | 622.204031 | 627.560895 |
| 19 | 1236.416447 | 1080.877889 | 1.643654 | 434.73928 | 624.825435 | 625.622426 |
| 20 | 997.663619 | 1038.478411 | 1.657257 | 497.232157 | 623.337627 | 620.943519 |
| Mean | 1036.617319 | 1049.699158 | 1.8518967 | 535.1789657 | 628.7116725 | 618.4854492 |
| Stddev | 59.71799264 | 28.44516049 | 0.535792004 | 86.95026923 | 19.55324941 | 4.52695571 |

So, it looks like the time to load synonyms is mostly unchanged (1050ms off-heap versus 1037ms on-heap), and reloading "pre-compiled" synonyms is super-duper fast (about 2ms on average).

We do seem to take a 17.5% hit on processing time. (629ms versus 535ms.) I might try profiling to see where that time is being spent. If it's doing FST operations, I'll assume it's a cost of doing business. If it's spent loading the also off-heap output words, I might consider moving those (optionally?) back on heap.
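
For reference, the percentage can be reproduced from the mean processing times in the table above. This is just the arithmetic, written as a small sketch; `Slowdown` and `percentSlower` are illustrative names, not part of the benchmark code.

```java
/** Illustrative helper: relative slowdown of one mean timing versus another. */
final class Slowdown {
  /** Returns how much slower the candidate is than the baseline, in percent. */
  static double percentSlower(double baselineMs, double candidateMs) {
    return (candidateMs - baselineMs) / baselineMs * 100.0;
  }

  public static void main(String[] args) {
    // Mean processing times (ms) copied from the benchmark table.
    double onHeapProcess = 535.1789657;
    double offHeapProcess = 628.7116725;
    System.out.println(percentSlower(onHeapProcess, offHeapProcess));
  }
}
```

With the means above, the result is roughly 17.5, matching the figure quoted in the comment.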

@msfroh
Contributor Author

msfroh commented Feb 9, 2024

I decided to try experimenting with moving the output words back onto the heap, since I didn't like the fact that every word lookup was triggering a seek.
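
To illustrate why off-heap words cost a seek per lookup, here is a rough sketch under an assumed file layout: a word count, a table of offsets, then the word bytes. The layout and the `OffHeapWords` class are hypothetical and are not the format used in this PR; the point is only that resolving an ID means one positioned read for the offsets and another for the bytes.

```java
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical layout: [count:int][count+1 offsets: long each][word bytes]. */
final class OffHeapWords implements AutoCloseable {
  private final FileChannel channel;
  private final int count;

  private OffHeapWords(FileChannel channel, int count) {
    this.channel = channel;
    this.count = count;
  }

  /** Writes all words with an offset table so each word can be located by ID. */
  static void write(Path file, List<String> words) throws IOException {
    int headerSize = Integer.BYTES + (words.size() + 1) * Long.BYTES;
    int dataSize = 0;
    List<byte[]> encoded = new ArrayList<>();
    for (String word : words) {
      byte[] bytes = word.getBytes(StandardCharsets.UTF_8);
      encoded.add(bytes);
      dataSize += bytes.length;
    }
    ByteBuffer buf = ByteBuffer.allocate(headerSize + dataSize);
    buf.putInt(words.size());
    long offset = headerSize;
    for (byte[] bytes : encoded) {
      buf.putLong(offset);
      offset += bytes.length;
    }
    buf.putLong(offset); // end sentinel, so word i spans offsets[i]..offsets[i+1]
    for (byte[] bytes : encoded) {
      buf.put(bytes);
    }
    Files.write(file, buf.array());
  }

  static OffHeapWords open(Path file) throws IOException {
    FileChannel channel = FileChannel.open(file, StandardOpenOption.READ);
    ByteBuffer header = ByteBuffer.allocate(Integer.BYTES);
    readFully(channel, header, 0);
    return new OffHeapWords(channel, header.getInt(0));
  }

  int size() {
    return count;
  }

  /** Two positioned reads per lookup: one for the offsets, one for the bytes. */
  String get(int id) throws IOException {
    if (id < 0 || id >= count) {
      throw new IndexOutOfBoundsException("id=" + id);
    }
    ByteBuffer offsets = ByteBuffer.allocate(2 * Long.BYTES);
    readFully(channel, offsets, Integer.BYTES + (long) id * Long.BYTES);
    long start = offsets.getLong(0);
    long end = offsets.getLong(Long.BYTES);
    ByteBuffer word = ByteBuffer.allocate((int) (end - start));
    readFully(channel, word, start);
    return new String(word.array(), StandardCharsets.UTF_8);
  }

  private static void readFully(FileChannel ch, ByteBuffer dst, long pos) throws IOException {
    while (dst.hasRemaining()) {
      if (ch.read(dst, pos + dst.position()) < 0) {
        throw new EOFException("unexpected end of file");
      }
    }
  }

  @Override
  public void close() throws IOException {
    channel.close();
  }

  public static void main(String[] args) throws IOException {
    Path file = Files.createTempFile("synonyms", ".wrd");
    write(file, List.of("fast", "quick", "speedy"));
    try (OffHeapWords words = open(file)) {
      System.out.println(words.get(1));
    }
    Files.delete(file);
  }
}
```

An on-heap alternative would read the same file once into an array at load time, trading heap for lookups that never touch the file again.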

Running now, I got way less variance on the on-heap runs. I also added some GCs between iterations, since I wanted to measure the heap usage of each. That likely removed some GC pauses from the on-heap version.

I then switched back to the off-heap words to confirm the results that I saw last time (and compare against the implementation with on-heap words).

The conclusion seems to be roughly:

  • Existing on-heap FST averages about 444ms to process a lot of synonyms.
  • Off-heap FST with on-heap words averages 515 or 516ms. (About 16% slower than existing on-heap.)
  • Off-heap FST with off-heap words averages 620ms. (About 40% slower than existing on-heap.)

The on-heap FST seems to occupy about 36MB of heap. The off-heap FST with on-heap words occupies about 560kB. The off-heap FST with off-heap words occupies about 150kB.

With these trade-offs, I think off-heap FST with on-heap words may be a good choice for folks with large sets of synonyms. I don't think I would recommend off-heap FST with off-heap words.

All times are in milliseconds.

| Attempt | OnHeap FST load time | OffHeap FST (OnHeap words) load time | OffHeap FST (OnHeap words) reload time | OnHeap FST processing time | OffHeap FST (OnHeap words) processing time | OffHeap FST (OnHeap words) reloaded processing time | OffHeap FST (OffHeap words) processing time | OffHeap FST (OffHeap words) reloaded processing time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 1191.339685 | 1072.285824 | 9.669646 | 436.391631 | 520.550704 | 516.11297 | 623.451546 | 620.531215 |
| 2 | 1030.432454 | 1033.619768 | 8.874105 | 448.848403 | 516.784387 | 517.230739 | 621.522464 | 622.793343 |
| 3 | 984.83645 | 1037.807342 | 8.912252 | 443.789813 | 512.066535 | 517.716981 | 622.455444 | 620.468985 |
| 4 | 1049.63589 | 1048.60113 | 8.894401 | 449.237547 | 518.946226 | 516.868933 | 617.837364 | 616.810236 |
| 5 | 990.22176 | 1049.618665 | 8.861166 | 448.923912 | 512.559801 | 511.114898 | 616.555422 | 617.122551 |
| 6 | 978.41877 | 1063.824595 | 8.930418 | 440.251675 | 517.632376 | 518.175232 | 621.969759 | 622.828416 |
| 7 | 985.434177 | 1049.113913 | 8.872906 | 443.209607 | 511.210536 | 518.802292 | 624.151468 | 622.097039 |
| 8 | 985.376238 | 1046.102696 | 8.823786 | 440.815454 | 517.491411 | 517.905752 | 623.390319 | 625.387487 |
| 9 | 983.341325 | 1065.892279 | 8.871586 | 449.145252 | 516.029267 | 516.916524 | 622.811992 | 622.798858 |
| 10 | 985.438642 | 1046.71167 | 8.8518 | 445.970679 | 512.045037 | 518.934149 | 622.592098 | 614.661805 |
| 11 | 990.592624 | 1050.377106 | 8.832753 | 443.844237 | 515.758106 | 510.808005 | 611.62254 | 622.956946 |
| 12 | 986.747374 | 1066.052969 | 8.884928 | 444.398327 | 517.259451 | 524.770132 | 622.085785 | 619.311172 |
| 13 | 984.328191 | 1052.189621 | 8.88281 | 439.612497 | 517.861131 | 515.796013 | 617.862222 | 615.101452 |
| 14 | 984.405339 | 1049.06783 | 8.835775 | 438.871305 | 517.885493 | 515.853446 | 615.254987 | 623.464483 |
| 15 | 997.323593 | 1064.473985 | 8.90682 | 443.640208 | 515.329143 | 518.807239 | 623.020916 | 623.013801 |
| 16 | 997.253932 | 1066.558928 | 8.900308 | 442.534843 | 511.930766 | 516.365803 | 624.316916 | 615.037306 |
| 17 | 999.464751 | 1046.464149 | 8.895899 | 443.48306 | 514.841946 | 517.082166 | 617.615908 | 618.661376 |
| 18 | 1001.896073 | 1045.304622 | 8.877555 | 444.875225 | 515.029862 | 510.365428 | 618.540866 | 624.355309 |
| 19 | 986.055833 | 1045.208347 | 8.863339 | 441.647553 | 511.489699 | 517.213428 | 623.61503 | 621.198543 |
| 20 | 984.112667 | 1047.317164 | 8.940865 | 451.304206 | 514.762544 | 510.45981 | 621.057397 | 621.483146 |
| 21 | 988.310511 | 1046.154648 | 8.865301 | 447.25874 | 514.859414 | 517.24163 | 623.916511 | 614.185296 |
| 22 | 982.874582 | 1062.113889 | 8.867098 | 439.785463 | 510.387721 | 516.885653 | 623.494968 | 622.527091 |
| 23 | 980.96967 | 1048.050631 | 8.867966 | 439.05464 | 511.423329 | 516.984465 | 621.567988 | 621.204435 |
| 24 | 983.189843 | 1046.083632 | 8.81578 | 440.574651 | 518.390122 | 520.392926 | 622.34785 | 614.923018 |
| 25 | 987.033178 | 1074.553767 | 8.812579 | 446.687106 | 513.914686 | 521.952744 | 615.870183 | 621.089011 |
| 26 | 985.771758 | 1076.245942 | 8.845264 | 444.718264 | 516.274395 | 513.5547 | 615.927497 | 615.53522 |
| 27 | 981.748774 | 1046.85677 | 8.818164 | 443.252924 | 513.632714 | 515.919924 | 626.659516 | 622.307368 |
| 28 | 983.979894 | 1062.317764 | 8.869256 | 443.267803 | 513.965345 | 509.688356 | 615.790469 | 615.712761 |
| 29 | 980.908776 | 1045.006602 | 8.855109 | 444.452376 | 517.488159 | 509.770143 | 621.96871 | 621.582871 |
| 30 | 981.508232 | 1046.722776 | 8.790313 | 443.952753 | 513.840793 | 512.847346 | 621.747601 | 621.901271 |
| 31 | 999.165558 | 1063.517734 | 8.792905 | 440.356205 | 517.677777 | 517.920992 | 620.90204 | 613.422668 |
| 32 | 1000.854281 | 1060.766663 | 9.027399 | 444.385706 | 510.006231 | 514.006688 | 623.684492 | 620.742008 |
| 33 | 1001.620724 | 1046.329083 | 8.72687 | 443.912072 | 509.793229 | 513.313214 | 620.695915 | 621.266234 |
| 34 | 1008.677463 | 1044.437966 | 8.799494 | 447.077333 | 516.263674 | 514.751767 | 622.775084 | 620.885167 |
| 35 | 987.309353 | 1048.062722 | 8.763122 | 440.748052 | 518.972785 | 518.608101 | 621.032898 | 620.85482 |
| 36 | 980.960836 | 1052.037316 | 8.834358 | 445.210623 | 518.850346 | 511.742763 | 620.719135 | 621.679027 |
| 37 | 983.807955 | 1049.894433 | 8.798039 | 440.302584 | 511.351473 | 510.557417 | 615.059624 | 619.802549 |
| 38 | 982.144377 | 1071.744423 | 8.81421 | 444.036711 | 518.551589 | 515.779265 | 614.579103 | 615.092139 |
| 39 | 984.460101 | 1051.399337 | 8.781058 | 439.998112 | 518.709639 | 511.192122 | 614.797646 | 621.154429 |
| 40 | 981.16924 | 1047.739329 | 8.827411 | 446.515425 | 512.815557 | 519.138446 | 621.769829 | 613.83963 |
| Mean | 995.5780219 | 1053.415701 | 8.87387035 | 443.6585744 | 515.115835 | 515.7387151 | 620.4259376 | 619.7447621 |
| StdDev | 34.59108879 | 10.41692769 | 0.139732265 | 3.344053675 | 2.951489941 | 3.528886642 | 3.48797555 | 3.346472427 |

@msfroh
Contributor Author

msfroh commented Feb 23, 2024

@dungba88 -- I'm trying to resolve conflicts with your changes, but I'm a little stuck. I don't understand how we're supposed to use the FST APIs to write the FST to disk now.

After merging our changes, SynonymMap contains:

      FST<BytesRef> fst = FST.fromFSTReader(fstCompiler.compile(), fstCompiler.getFSTReader());
      if (directory != null) {
        fstOutput.close(); // TODO -- Should fstCompiler.compile take care of this?
        try (SynonymMapDirectory.WordsOutput wordsOutput = directory.wordsOutput()) {
          BytesRef scratchRef = new BytesRef();
          for (int i = 0; i < words.size(); i++) {
            words.get(i, scratchRef);
            wordsOutput.addWord(scratchRef);
          }
        }
        directory.writeMetadata(words.size(), maxHorizontalContext, fst);
        return directory.readMap();
      }

That call to FST.fromFSTReader(...) fails with:

The DataOutput must implement FSTReader, but got FSIndexOutput(path="/home/froh/ws/lucene/lucene/analysis/common/build/tmp/tests-tmp/lucene.analysis.synonym.TestSynonymGraphFilter_1171182AD5892267-001/tempDir-001/synonyms.fst")

Is there something else that I'm supposed to be calling on the write path? Note that in the "off-heap" case above (when directory != null), we just need to write the FST. The directory.readMap() call loads it fresh from disk, discarding the FST that we constructed on heap.

@dungba88
Contributor

@msfroh

As you only need to write the FST metadata, there is no need to create the FST. You can just call

directory.writeMetadata(words.size(), maxHorizontalContext, fstMetadata);

where fstMetadata is returned by fstCompiler.compile()

@msfroh
Contributor Author

msfroh commented Feb 25, 2024

@msfroh

As you only need to write the FST metadata, there is no need to create the FST. You can just call

directory.writeMetadata(words.size(), maxHorizontalContext, fstMetadata);

where fstMetadata is returned by fstCompiler.compile()

Hmm... I'm not sure that will work, because the logic to save the metadata is associated with the FST instance (i.e. in the FST::saveMetadata method).

I tried extracting that method out into a static, but the line outputs.writeFinalOutput(...) breaks things.

@dungba88
Contributor

You are right, the saveMetadata is still in FST.

Now to create the FST written off heap, you need to create the corresponding DataInput and use the FST constructor.

However, the saveMetadata method can be moved to FSTMetadata as well, since all of the information is stored there (including outputs).

@dungba88
Contributor

I could put up a PR for the saveMetadata change if you prefer.

@msfroh
Contributor Author

msfroh commented Feb 26, 2024

I could put up a PR for the saveMetadata change if you prefer.

I'll update to take care of that. Thanks for the pointers!

@msfroh force-pushed the synonym_fst_offheap branch 2 times, most recently from cfdb2fb to 32ff74b on February 26, 2024 19:01
@dungba88
Contributor

I realized I also need the saveMetadata change for #12985. Do you think we should make it a standalone PR and merge first? Otherwise I've cherry-picked from this PR :)

@msfroh
Contributor Author

msfroh commented Feb 27, 2024

I realized I also need the saveMetadata change for #12985. Do you think we should make it a standalone PR and merge first? Otherwise I've cherry-picked from this PR :)

Sure -- a standalone PR should work. We can both rebase onto that.

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Mar 13, 2024
@msfroh
Contributor Author

msfroh commented May 25, 2024

@dungba88 - I forgot about this change for a while. Did you create a separate PR for the saveMetadata change? Should I?

@github-actions github-actions bot removed the Stale label May 26, 2024

github-actions bot commented Jun 9, 2024

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Jun 9, 2024
@dungba88
Contributor

dungba88 commented Jul 8, 2024

@msfroh I also forgot about this. Let me create a PR

@dungba88
Contributor

dungba88 commented Jul 8, 2024

I published a PR here: #13549. Please take a look when you have time!

@github-actions github-actions bot removed the Stale label Jul 9, 2024
@dungba88
Contributor

Note: The above PR has been merged

@mikemccand
Member

mikemccand left a comment

I love this change! Synonym dictionaries can become massive, so having the option for off-heap'ing the FST at a smallish performance hit makes a lot of sense.

Plus it takes advantage of the newish capability of FSTs to be accessed off-heap (thank you Tantivy inspiration!) in Lucene.

I left a few comments...

* Wraps an {@link FSDirectory} to read and write a compiled {@link SynonymMap}. When reading, the
* FST and output words are kept off-heap.
*/
public class SynonymMapDirectory implements Closeable {
Member

Maybe mark this with @lucene.experimental so we are free to change the API within non-major releases?

Or: could this be package private? Does the user need to create this wrapper themselves for some reason?

Contributor Author

I was thinking that the user would create a SynonymMapDirectory and pass it to SynonymMap.Builder.build(SynonymMapDirectory) as the way of opting in to off-heap FSTs for their synonyms.

Given the issue you called out below regarding the need to close the IndexInput for the FST, I feel like the user needs to hold onto "something" (other than the SynonymMap) that gives them an obligation to close filesystem resources when they're done.

Alternatively, I'd be happy to make SynonymMap implement Closeable. Then I'd probably just ask the user to specify a Path instead. At that point, we could hide SynonymMapDirectory altogether.


/**
* Wraps an {@link FSDirectory} to read and write a compiled {@link SynonymMap}. When reading, the
* FST and output words are kept off-heap.
Member

How does the user control separately whether FST and output words are off heap or not?

Contributor Author

At this point, if you're using SynonymMapDirectory, you get off-heap FST and on-heap words.

If you don't use SynonymMapDirectory (i.e. you're using the existing constructor or the arg-less version of SynonymMap.Builder.build()), then everything is on-heap like before.

The numbers I posted in an earlier comment on this PR (#13054) (which, granted, came from just a single synthetic benchmark) seemed to suggest (to me, at least) that the "sweet spot" is off-heap FST with on-heap words. The performance hit from moving words off-heap (at least with my implementation) was pretty bad. Lots of seeking involved. Also, the vast majority of heap savings came from moving the FST.

I'm happy to bring back off-heap words as an option if we think someone would be willing to take that perf hit for slightly lower heap utilization.

}
}

abstract static class BytesRefHashLike {
Member

Does this need to be public since it's used in the public ctor for SynonymMap? I thought we had static checking for this though...

Member

Also, maybe rename to remove any reference to BytesRefHash? E.g. WordProvider or IDToWord or something?

Contributor Author

Hmm... you're right.

More importantly, since the ctor for SynonymMap is public, I probably shouldn't change its signature.

I'll leave the existing public constructor (that takes a BytesRefHash), add a new private constructor (that takes a SynonymDictionary -- the new name I picked for BytesRefHashLike), and have the public constructor delegate to the private one. (That way, SynonymDictionary can remain package-private.)
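
The delegation pattern described here can be sketched as follows. All types are simplified stand-ins: `WordHash` plays the role of the public `BytesRefHash`, `WordLookup` the role of the package-private `SynonymDictionary`, and `SketchSynonymMap` the role of `SynonymMap`. None of this is the actual PR code.

```java
import java.util.List;

/** Package-private abstraction (stand-in for SynonymDictionary). */
interface WordLookup {
  String get(int id);

  int size();
}

/** Stand-in for the existing public hash type (BytesRefHash in Lucene). */
final class WordHash {
  private final List<String> words;

  WordHash(List<String> words) {
    this.words = List.copyOf(words);
  }

  String get(int id) {
    return words.get(id);
  }

  int size() {
    return words.size();
  }
}

/** Stand-in for SynonymMap: public signature unchanged, new private constructor. */
final class SketchSynonymMap {
  private final WordLookup words;

  /** Existing public constructor keeps its signature and delegates. */
  public SketchSynonymMap(WordHash hash) {
    this(adapt(hash));
  }

  /** New private constructor accepts the package-private abstraction. */
  private SketchSynonymMap(WordLookup words) {
    this.words = words;
  }

  /** Wraps the on-heap hash so both paths share one lookup interface. */
  private static WordLookup adapt(WordHash hash) {
    return new WordLookup() {
      @Override
      public String get(int id) {
        return hash.get(id);
      }

      @Override
      public int size() {
        return hash.size();
      }
    };
  }

  String word(int id) {
    return words.get(id);
  }

  int wordCount() {
    return words.size();
  }

  public static void main(String[] args) {
    SketchSynonymMap map = new SketchSynonymMap(new WordHash(List.of("fast", "quick")));
    System.out.println(map.word(1));
  }
}
```

Because the adapter is built in a static helper, callers outside the package never see `WordLookup`, and existing code that passes the hash type keeps compiling.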

new SynonymMapFormat(); // TODO -- Should this be more flexible/codec-like? Less?
private final Directory directory;

public SynonymMapDirectory(Path path) throws IOException {
Member

Is it possible to store several SynonymMaps in one Directory? Or must one make a separate Directory for each? That's maybe fine ... e.g. one could make a FilterDirectory impl that can share a single underlying filesystem directory and distinguish the files by e.g. a unique filename prefix or so.

Contributor Author

@msfroh Aug 1, 2024

For now, since I split the synonyms across three files (synonyms.mdt, synonyms.wrd, and synonyms.fst), I assumed that there would be a single synonym map per directory.

That said, I suppose it wouldn't be hard to combine those three into a single file (with a .syn extension, say), where the SynonymMapDirectory could look at a prefix. Specifically, the current implementation reads the metadata and words once (keeping the words on-heap), then spends the rest of its time in the FST.

Then a single filesystem directory could have something like:

first_synonyms.syn
second_synonyms.syn
... etc. ...

What do you think? (I'm also happy to let each serialized SynonymMap live in its own directory.)
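
A small sketch of the prefix idea, assuming the `<prefix>_synonyms.syn` naming from the listing above. `PrefixedSynonymFiles`, `fileFor`, and `prefixOf` are hypothetical helpers for illustration and do not exist in the PR.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper: maps a synonym-map prefix to a file in a shared directory. */
final class PrefixedSynonymFiles {
  private static final String SUFFIX = "_synonyms.syn";

  /** Resolves the single combined file for the given prefix. */
  static Path fileFor(Path dir, String prefix) {
    if (prefix.isEmpty() || prefix.indexOf('/') >= 0 || prefix.indexOf('\\') >= 0) {
      throw new IllegalArgumentException("invalid prefix: " + prefix);
    }
    return dir.resolve(prefix + SUFFIX);
  }

  /** Returns the prefix encoded in a file name, or null if it is not a synonym file. */
  static String prefixOf(Path file) {
    String name = file.getFileName().toString();
    if (name.endsWith(SUFFIX) && name.length() > SUFFIX.length()) {
      return name.substring(0, name.length() - SUFFIX.length());
    }
    return null;
  }

  public static void main(String[] args) {
    Path dir = Paths.get("compiled-synonyms"); // assumed directory name
    System.out.println(fileFor(dir, "first"));
    System.out.println(prefixOf(fileFor(dir, "second")));
  }
}
```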

}

public SynonymMap readMap() throws IOException {
return synonymMapFormat.readSynonymMap(directory);
Member

Hmm, this will hold open IndexInput file handles right? How do these get closed? (SynonymMap doesn't have a close I think?).

Contributor Author

I think I like a model where the user creates a SynonymMapDirectory and passes it to SynonymMap.Builder.build() if they want to store/read compiled synonyms on the filesystem.

Before returning the IndexInput, the SynonymMapDirectory will keep a reference to it. When the user calls close on their SynonymMapDirectory, it will close the outstanding IndexInput.
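
The ownership model described here could look roughly like the sketch below, using plain `java.io.Closeable` in place of Lucene's `IndexInput`. `TrackingDirectory` and `track` are hypothetical names, not the PR's API.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical owner that remembers every input it hands out and closes them all. */
final class TrackingDirectory implements Closeable {
  private final List<Closeable> openInputs = new ArrayList<>();
  private boolean closed;

  /** Registers the input before returning it to the caller. */
  synchronized <T extends Closeable> T track(T input) {
    if (closed) {
      throw new IllegalStateException("directory is already closed");
    }
    openInputs.add(input);
    return input;
  }

  /** Closes every outstanding input; later failures are attached as suppressed. */
  @Override
  public synchronized void close() throws IOException {
    closed = true;
    IOException first = null;
    for (Closeable input : openInputs) {
      try {
        input.close();
      } catch (IOException e) {
        if (first == null) {
          first = e;
        } else {
          first.addSuppressed(e);
        }
      }
    }
    openInputs.clear();
    if (first != null) {
      throw first;
    }
  }

  public static void main(String[] args) throws IOException {
    try (TrackingDirectory dir = new TrackingDirectory()) {
      Closeable fstInput = () -> System.out.println("FST input closed");
      dir.track(fstInput);
    } // closing the directory closes the tracked input
  }
}
```

With this shape, the user holds one closeable object (the directory), and the map itself does not need a close method.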

msfroh added 3 commits August 1, 2024 17:18
This stores the synonym map's FST and word lookup off-heap in a
separate, configurable directory.
msfroh added 2 commits August 1, 2024 17:18
Moved FST metadata saving into FSTMetadata class per suggestion from
@dungba88.
- Reduced visibility of some things
- Brought back the old SynonymMap public constructor
- Renamed BytesRefHashLike to SynonymDictionary
- Hold a reference to FST's IndexInput to close it
@msfroh force-pushed the synonym_fst_offheap branch from 49b622f to cef4fac on August 2, 2024 00:37

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Aug 17, 2024
Successfully merging this pull request may close these issues.

SynonymGraphFilter should read FSTs off-heap?
3 participants