-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Move synonym map off-heap for SynonymGraphFilter #13054
base: main
Are you sure you want to change the base?
Conversation
I think measuring latency impact would also be important for this change as off-heap will be slower (I understand that we are only adding off-heap as an option here, not forcing, but still). |
I did some rough benchmarks using the large synonym file attached to https://issues.apache.org/jira/browse/LUCENE-3233 The benchmark code and input is at msfroh@174f98a
So, it looks like the time to load synonyms is mostly unchanged (1050ms versus 1037ms), and loading "pre-compiled" synonyms is super-duper fast. We do seem to take a 17.5% hit on processing time. (629ms versus 535ms.) I might try profiling to see where that time is being spent. If it's doing FST operations, I'll assume it's a cost of doing business. If it's spent loading the also off-heap output words, I might consider moving those (optionally?) back on heap. |
I decided to try experimenting with moving the output words back onto the heap, since I didn't like the fact that every word lookup was triggering a seek. Running now, I got way less variance on the on-heap runs. I also added some GCs between iterations, since I wanted to measure the heap usage of each. That likely removed some GC pauses from the on-heap version. I then switched back to the off-heap words to confirm the results that I saw last time (and compare against the implementation with on-heap words). The conclusion seems to be roughly:
The on-heap FST seems to occupy about 36MB of heap. The off-heap FST with on-heap words occupies about 560kB. The off-heap FST with off-heap words occupies about 150kB. With these trade-offs, I think off-heap FST with on-heap words may be a good choice for folks with large sets of synonyms. I don't think I would recommend off-heap FST with off-heap words.
|
@dungba88 -- I'm trying to resolve conflicts with your changes, but I'm a little stuck. I don't understand how we're supposed to use the FST APIs to write the FST to disk now. After merging our changes,
That call to
Is there something else that I'm supposed to be calling on the write path? Note that in the "off-heap" case above (when |
As you only need to write the FST metadata, there is no need to create the FST. You can just call
where as |
Hmm... I'm not sure that will work, because the logic to save the metadata is associated with the I tried extracting that method out into a static, but the line |
You are right, the saveMetadata is still in FST. Now to create the FST written off heap, you need to create the corresponding DataInput and use the FST constructor. However the saveMetadata method can be moved to FSTMetadata as well since all of the information are stored there (including outputs) |
I could put a PR for the saveMetadata change if you prefer. |
I'll update to take care of that. Thanks for the pointers! |
cfdb2fb
to
32ff74b
Compare
I realized I also need the |
Sure -- a standalone PR should work. We can both rebase onto that. |
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution! |
@dungba88 - I forgot about this change for a while. Did you create a separate PR for the saveMetadata change? Should I? |
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution! |
@msfroh I also forgot about this. Let me create a PR |
I published a PR here: #13549. Please take a look when you have time! |
Note: The above PR has been merged |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I love this change! Synonym dictionaries can become massive, so having the option for off-heap'ing the FST at a smallish performance hit makes a lot of sense.
Plus it takes advantage of the newish capability of FSTs to be accessed off-heap (thank you Tantivy inspiration!) in Lucene.
I left a few comments...
* Wraps an {@link FSDirectory} to read and write a compiled {@link SynonymMap}. When reading, the | ||
* FST and output words are kept off-heap. | ||
*/ | ||
public class SynonymMapDirectory implements Closeable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe mark this with @lucene.experimental
so we are free to change the API within non-major releases?
Or: could this be package private? Does the user need to create this wrapper themselves for some reason?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking that the user would create a SynonymMapDirectory
and pass it to SynonymMap.Builder.build(SynonymMapDirectory)
as the way of opting in to off-heap FSTs for their synonyms.
Given the issue you called out below regarding the need to close
the IndexInput
for the FST, I feel like the user needs to hold onto "something" (other than the SynonymMap
) that gives them an obligation to close filesystem resources when they're done.
Alternatively, I'd be happy to make SynonymMap
implement Closeable
. Then I'd probably just ask the user to specify a Path
instead. At that point, we could hide SynonymMapDirectory
altogether.
|
||
/** | ||
* Wraps an {@link FSDirectory} to read and write a compiled {@link SynonymMap}. When reading, the | ||
* FST and output words are kept off-heap. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does the user control separately whether FST and output words are off heap or not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At this point, if you're using SynonymMapDirectory
, you get off-heap FST and on-heap words.
If you don't use SynonymMapDirectory
(i.e. you're using the existing constructor or the arg-less version of SynonymMap.Builder.build()
), then everything is on-heap like before.
The numbers I posted in #13054 (comment) (which, granted, was just a single synthetic benchmark) seemed to suggest (to me, at least) that the "sweet spot" is off-heap FST with on-heap words. The performance hit from moving words off-heap (at least with my implementation) was pretty bad. Lots of seek
ing involved. Also, the vast majority of heap savings came from moving the FST.
I'm happy to bring back off-heap words as an option if we think someone would be willing to take that perf hit for slightly lower heap utilization.
} | ||
} | ||
|
||
abstract static class BytesRefHashLike { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need to be public
since it's used in the public
ctor for SynonymMap
? I thought we had static checking for this though...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, maybe rename to remove any reference to BytesRefHash
? E.g. WordProvider
or IDToWord
or something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm... you're right.
More importantly, since the ctor for SynonymMap
is public, I probably shouldn't change its signature.
I'll leave the existing public constructor (that takes a BytesRefHash
), add a new private constructor (that takes a SynonymDictionary
-- the new name I picked for BytesRefHashLike
), and have the public constructor delegate to the private one. (That way, SynonymDictionary
can remain package-private.)
lucene/analysis/common/src/java/org/apache/lucene/analysis/synonym/SynonymMapDirectory.java
Outdated
Show resolved
Hide resolved
new SynonymMapFormat(); // TODO -- Should this be more flexible/codec-like? Less? | ||
private final Directory directory; | ||
|
||
public SynonymMapDirectory(Path path) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to store several SynonymMap
s in one Directory
? Or one must make a separate Directory
for each? That's maybe fine ... e.g. one could make a FilterDirectory
impl that can share a single underlying filesystem directory and distinguish the files by e.g. a unique filename prefix or so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now, since I split the synonyms across three files (synonyms.mdt
, synonyms.wrd
, and synonyms.fst
), I assumed that there would be a single synonym map per directory.
That said, I suppose it wouldn't be hard to combine those three into a single file (with a .syn
extension, say), where the SynonymMapDirectory
could look at a prefix. Specifically, the current implementation reads the metadata and words once (keeping the words on-heap), then spends the rest of its time in the FST.
Then a single filesystem directory could have something like:
first_synonyms.syn
second_synonyms.syn
... etc. ...
What do you think? (I'm also happy to let each serialized SynonymMap
live in its own directory.)
} | ||
|
||
public SynonymMap readMap() throws IOException { | ||
return synonymMapFormat.readSynonymMap(directory); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, this will hold open IndexInput
file handles right? How do these get closed? (SynonymMap
doesn't have a close
I think?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I like a model where the user creates a SynonymMapDirectory
and passes it to SynonymMap.Builder.build())
if they want to store/read compiled synonyms on the filesystem.
Before returning the IndexInput
, the SynonymMapDirectory
will keep a reference to it. When the user calls close
on their SynonymMapDirectory
, it will close the outstanding IndexInput
.
This stores the synonym map's FST and word lookup off-heap in a separate, configurable directory.
Moved FST metadata saving into FSTMetadata class per suggestion from @dungba88.
- Reduced visibility of some things - Brought back the old SynonymMap public constructor - Renamed BytesRefHashLike to SynonymDictionary - Hold a reference to FST's IndexInput to close it
49b622f
to
cef4fac
Compare
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution! |
Description
This stores the synonym map's FST and word lookup off-heap in a separate, configurable directory.
The initial implementation is rough, but the unit tests pass with this change randomly enabled.
Obvious things that need work are:
OffHeapBytesRefHashLike
, but I don't see an alternative (besides moving it on-heap). Given that the original issue was only about moving the FST off-heap, maybe we can keep the word dictionary on-heap.Fixes #13005