Expose DelimitedTermFrequencyTokenFilter #9413
Comments
While including DelimitedTermFrequencyTokenFilter in the OpenSearch codebase is great, I am wondering whether storing terms and their corresponding frequencies in two separate document fields could be an alternative. That approach would use a script query to calculate scores from the term frequencies. I wanted to confirm whether this is the actual use case you are aiming for.
@noCharger the alternative approach looks very suboptimal:
Russ, if you have an idea for the implementation, feel free to submit it as a PR. cc: @msfroh, @rishabhmaurya, @jainankitk
Relates: opensearch-project#9413 This commit exposes Lucene's delimited term frequency token filter to be able to provide term frequencies along with terms. Signed-off-by: Russ Cam <russcam@canva.com>
I've opened #9479
* Expose DelimitedTermFrequencyTokenFilter

  Relates: #9413

  This commit exposes Lucene's delimited term frequency token filter to be able to provide term frequencies along with terms.

* fix format violations
* fix test and add to changelog
* Address PR feedback
  - Add unit tests for DelimitedTermFrequencyTokenFilterFactory
  - Remove IllegalArgumentException as caught exception
  - Add skip to yaml rest tests to skip for version < 2.10
* formatting
* Rename filter
* update naming in REST tests

---------

Signed-off-by: Russ Cam <russcam@canva.com>
* Expose DelimitedTermFrequencyTokenFilter

  Relates: #9413

  This commit exposes Lucene's delimited term frequency token filter to be able to provide term frequencies along with terms.

* fix format violations
* fix test and add to changelog
* Address PR feedback
  - Add unit tests for DelimitedTermFrequencyTokenFilterFactory
  - Remove IllegalArgumentException as caught exception
  - Add skip to yaml rest tests to skip for version < 2.10
* formatting
* Rename filter
* update naming in REST tests

---------

(cherry picked from commit 1126d2f)

Signed-off-by: Russ Cam <russcam@canva.com>
Signed-off-by: Andriy Redko <andriy.redko@aiven.io>
Co-authored-by: Russ Cam <russ.cam@forloop.co.uk>
Is your feature request related to a problem? Please describe.
Lucene provides DelimitedTermFrequencyTokenFilter, which supplies tokens together with their term frequency. For example, a document with a text field containing repeated terms like
can use DelimitedTermFrequencyTokenFilter as part of analysis, and provide term frequencies along with terms
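To make the idea concrete, here is a minimal Python sketch of the delimited form the filter consumes. It is only a model of the input format, not Lucene's implementation: the helper names `to_delimited` and `parse_delimited` are hypothetical, the sample text is made up for illustration, and the `|` delimiter is assumed to be the filter's default.

```python
from collections import Counter

DELIMITER = "|"  # assumed default delimiter; configurable on the real filter


def to_delimited(text: str, delimiter: str = DELIMITER) -> str:
    """Collapse repeated whitespace-separated terms into term<delimiter>count form."""
    counts = Counter(text.split())
    # Counter keeps first-seen order, so output order follows the input.
    return " ".join(f"{term}{delimiter}{freq}" for term, freq in counts.items())


def parse_delimited(text: str, delimiter: str = DELIMITER) -> dict[str, int]:
    """Read term<delimiter>count tokens; a token without a delimiter counts once."""
    freqs: dict[str, int] = {}
    for token in text.split():
        term, sep, number = token.partition(delimiter)
        freqs[term] = int(number) if sep else 1
    return freqs


if __name__ == "__main__":
    compact = to_delimited("foo foo foo bar")
    print(compact)
    print(parse_delimited(compact))
```

The point of the compact form is that a client which already knows the frequencies can send each term once instead of repeating it.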
DelimitedTermFrequencyTokenFilter is not exposed in OpenSearch, even though it can be useful. Care needs to be taken when using it, per the Lucene docs:
Describe the solution you'd like
Expose DelimitedTermFrequencyTokenFilter as a token filter in OpenSearch.
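As a rough sketch of what using such a filter might look like, the Python snippet below builds an index-creation body. Several things here are assumptions rather than anything stated in this issue: the filter type name `delimited_term_freq` and its `delimiter` parameter (check the OpenSearch token filter documentation for the released name), the field name `tags`, and the analyzer and filter names. It uses a `whitespace` tokenizer so the delimiter is not split away before the filter runs, and sets `index_options` to `freqs` on the assumption that the field must index frequencies without positions.

```python
import json


def build_index_body(delimiter: str = "|") -> dict:
    """Build a hypothetical index body that wires up a delimited term frequency filter."""
    return {
        "settings": {
            "analysis": {
                "filter": {
                    # Assumed type name; verify against the OpenSearch docs.
                    "term_freq_filter": {
                        "type": "delimited_term_freq",
                        "delimiter": delimiter,
                    }
                },
                "analyzer": {
                    "term_freq_analyzer": {
                        # whitespace keeps "foo|3" as one token for the filter
                        "tokenizer": "whitespace",
                        "filter": ["term_freq_filter"],
                    }
                },
            }
        },
        "mappings": {
            "properties": {
                "tags": {  # hypothetical field name
                    "type": "text",
                    "analyzer": "term_freq_analyzer",
                    # assumed requirement: frequencies only, no positions
                    "index_options": "freqs",
                }
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_index_body(), indent=2))
```

The printed JSON could then be sent as the body of an index-creation request; a document would supply the field value in delimited form.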
Describe alternatives you've considered
The alternatives are
or
Additional context
Exposing DelimitedTermFrequencyTokenFilter would be complementary to exposing termFreq in Painless scripting: #9081