Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expose DelimitedTermFrequencyTokenFilter #9413

Closed
russcam opened this issue Aug 17, 2023 · 4 comments · Fixed by #9479
Closed

Expose DelimitedTermFrequencyTokenFilter #9413

russcam opened this issue Aug 17, 2023 · 4 comments · Fixed by #9479
Labels
enhancement Enhancement or improvement to existing feature or request Search Search query, autocomplete ...etc v2.10.0

Comments

@russcam
Copy link
Contributor

russcam commented Aug 17, 2023

Is your feature request related to a problem? Please describe.
Lucene provides DelimitedTermFrequencyTokenFilter to be able to provide tokens along with their term frequency. For example, a document with a text field with repeated terms like

{
   "text": "dog dog dog cat cat"
}

can use DelimitedTermFrequencyTokenFilter as part of analysis, and provide term frequencies along with terms

{
   "text": "dog|3 cat|2"
}

DelimitedTermFrequencyTokenFilter is not exposed in OpenSearch, even though it can be useful. Care needs to be taken when using it, per the lucene docs:

  1. the field must be indexed with IndexOptions.DOCS_AND_FREQS but no positions or offsets.
  2. make sure your Tokenizer doesn't split on the delimiter, or this won't work

Describe the solution you'd like
Expose DelimitedTermFrequencyTokenFilter as a token filter in OpenSearch.

Describe alternatives you've considered
The alternatives are

  • to send documents with repeated terms. For many terms with large term frequencies, this can mean much larger payloads going over the wire.

or

  • include and maintain an analysis plugin to provide the functionality.

Additional context

Exposing DelimitedTermFrequencyTokenFilter would be complimentary to exposing termFreq in Painless scripting: #9081

@russcam russcam added enhancement Enhancement or improvement to existing feature or request untriaged labels Aug 17, 2023
@noCharger
Copy link
Contributor

While including DelimitedTermFrequencyTokenFilter into the OpenSearch codebase is great, I am wondering if using two separate fields in documents to store terms and their corresponding frequencies could be an alternative. This approach involves using a script query to calculate scores based on the term frequencies. I wanted to confirm if this is the actual use case you are aiming for.

PUT /myindex/_doc/1
{
  "text": "dog,cat",
  "freq": "3,2"
}

PUT /myindex/_doc/2
{
  "text": "dog,cat",
  "freq": "2,2"
}

GET /myindex/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "text": "dog"
          }
        }
      ],
      "should": [
        {
          "script_score": {
            "query": {
              "match_all": {}
            },
            "script": {
              "source": """
                String[] terms = doc['text'].value.split(',');
                String[] freqs = doc['freq'].value.split(',');
                double score = 0;
                for (int i = 0; i < terms.length; i++) {
                  if (terms[i].equals(params.value)) {
                    score += Double.parseDouble(freqs[i]);
                  }
                }
                return score;
              """,
              "params": {
                   "value": "dog"
                }
            }
          }
        }
      ]
    }
  }
}

@russcam
Copy link
Contributor Author

russcam commented Aug 21, 2023

@noCharger the alternative approach looks very suboptimal:

  1. Need two fields to carry information that can be readily provided in one
  2. Doesn't leverage the postings list, and instead requires use of doc values to make values available to scripting
  3. Every script execution needs to process (split) doc values to make use of them, as opposed to indexing in a form requiring little to no processing
  4. Can't make use of term statistics for multiple purposes i.e. scripting, boosting, etc. because individual terms and their freqs are not indexed in a form that persists this information

@macohen
Copy link
Contributor

macohen commented Aug 21, 2023

Russ, if you have an idea for implementation, feel free to submit as a PR.

cc: @msfroh, @rishabhmaurya, @jainankitk

@mingshl mingshl added Search Search query, autocomplete ...etc and removed untriaged labels Aug 21, 2023
russcam added a commit to russcam/OpenSearch that referenced this issue Aug 22, 2023
Relates: opensearch-project#9413

This commit exposes Lucene's delimited term frequency token filter to be
able to provide term frequencies along with terms.
russcam added a commit to russcam/OpenSearch that referenced this issue Aug 22, 2023
Relates: opensearch-project#9413

This commit exposes Lucene's delimited term frequency token filter to be
able to provide term frequencies along with terms.

Signed-off-by: Russ Cam <russcam@canva.com>
@russcam
Copy link
Contributor Author

russcam commented Aug 22, 2023

I've opened #9479

@reta reta added the v2.10.0 label Aug 22, 2023
msfroh pushed a commit that referenced this issue Aug 31, 2023
* Expose DelimitedTermFrequencyTokenFilter

Relates: #9413

This commit exposes Lucene's delimited term frequency token filter to be
able to provide term frequencies along with terms.

Signed-off-by: Russ Cam <russcam@canva.com>

* fix format violations

Signed-off-by: Russ Cam <russcam@canva.com>

* fix test and add to changelog

Signed-off-by: Russ Cam <russcam@canva.com>

* Address PR feedback

- Add unit tests for DelimitedTermFrequencyTokenFilterFactory
- Remove IllegalArgumentException as caught exception
- Add skip to yaml rest tests to skip for version < 2.10

Signed-off-by: Russ Cam <russcam@canva.com>

* formatting

Signed-off-by: Russ Cam <russcam@canva.com>

* Rename filter

Signed-off-by: Russ Cam <russcam@canva.com>

* update naming in REST tests

Signed-off-by: Russ Cam <russcam@canva.com>

---------

Signed-off-by: Russ Cam <russcam@canva.com>
reta pushed a commit to reta/OpenSearch that referenced this issue Aug 31, 2023
* Expose DelimitedTermFrequencyTokenFilter

Relates: opensearch-project#9413

This commit exposes Lucene's delimited term frequency token filter to be
able to provide term frequencies along with terms.

Signed-off-by: Russ Cam <russcam@canva.com>

* fix format violations

Signed-off-by: Russ Cam <russcam@canva.com>

* fix test and add to changelog

Signed-off-by: Russ Cam <russcam@canva.com>

* Address PR feedback

- Add unit tests for DelimitedTermFrequencyTokenFilterFactory
- Remove IllegalArgumentException as caught exception
- Add skip to yaml rest tests to skip for version < 2.10

Signed-off-by: Russ Cam <russcam@canva.com>

* formatting

Signed-off-by: Russ Cam <russcam@canva.com>

* Rename filter

Signed-off-by: Russ Cam <russcam@canva.com>

* update naming in REST tests

Signed-off-by: Russ Cam <russcam@canva.com>

---------

Signed-off-by: Russ Cam <russcam@canva.com>
(cherry picked from commit 1126d2f)
reta pushed a commit to reta/OpenSearch that referenced this issue Aug 31, 2023
* Expose DelimitedTermFrequencyTokenFilter

Relates: opensearch-project#9413

This commit exposes Lucene's delimited term frequency token filter to be
able to provide term frequencies along with terms.

Signed-off-by: Russ Cam <russcam@canva.com>

* fix format violations

Signed-off-by: Russ Cam <russcam@canva.com>

* fix test and add to changelog

Signed-off-by: Russ Cam <russcam@canva.com>

* Address PR feedback

- Add unit tests for DelimitedTermFrequencyTokenFilterFactory
- Remove IllegalArgumentException as caught exception
- Add skip to yaml rest tests to skip for version < 2.10

Signed-off-by: Russ Cam <russcam@canva.com>

* formatting

Signed-off-by: Russ Cam <russcam@canva.com>

* Rename filter

Signed-off-by: Russ Cam <russcam@canva.com>

* update naming in REST tests

Signed-off-by: Russ Cam <russcam@canva.com>

---------

Signed-off-by: Russ Cam <russcam@canva.com>
(cherry picked from commit 1126d2f)
Signed-off-by: Andriy Redko <andriy.redko@aiven.io>
reta pushed a commit to reta/OpenSearch that referenced this issue Aug 31, 2023
* Expose DelimitedTermFrequencyTokenFilter

Relates: opensearch-project#9413

This commit exposes Lucene's delimited term frequency token filter to be
able to provide term frequencies along with terms.

Signed-off-by: Russ Cam <russcam@canva.com>

* fix format violations

Signed-off-by: Russ Cam <russcam@canva.com>

* fix test and add to changelog

Signed-off-by: Russ Cam <russcam@canva.com>

* Address PR feedback

- Add unit tests for DelimitedTermFrequencyTokenFilterFactory
- Remove IllegalArgumentException as caught exception
- Add skip to yaml rest tests to skip for version < 2.10

Signed-off-by: Russ Cam <russcam@canva.com>

* formatting

Signed-off-by: Russ Cam <russcam@canva.com>

* Rename filter

Signed-off-by: Russ Cam <russcam@canva.com>

* update naming in REST tests

Signed-off-by: Russ Cam <russcam@canva.com>

---------

Signed-off-by: Russ Cam <russcam@canva.com>
(cherry picked from commit 1126d2f)
Signed-off-by: Andriy Redko <andriy.redko@aiven.io>
msfroh pushed a commit that referenced this issue Aug 31, 2023
* Expose DelimitedTermFrequencyTokenFilter

Relates: #9413

This commit exposes Lucene's delimited term frequency token filter to be
able to provide term frequencies along with terms.



* fix format violations



* fix test and add to changelog



* Address PR feedback

- Add unit tests for DelimitedTermFrequencyTokenFilterFactory
- Remove IllegalArgumentException as caught exception
- Add skip to yaml rest tests to skip for version < 2.10



* formatting



* Rename filter



* update naming in REST tests



---------


(cherry picked from commit 1126d2f)

Signed-off-by: Russ Cam <russcam@canva.com>
Signed-off-by: Andriy Redko <andriy.redko@aiven.io>
Co-authored-by: Russ Cam <russ.cam@forloop.co.uk>
kaushalmahi12 pushed a commit to kaushalmahi12/OpenSearch that referenced this issue Sep 12, 2023
* Expose DelimitedTermFrequencyTokenFilter

Relates: opensearch-project#9413

This commit exposes Lucene's delimited term frequency token filter to be
able to provide term frequencies along with terms.

Signed-off-by: Russ Cam <russcam@canva.com>

* fix format violations

Signed-off-by: Russ Cam <russcam@canva.com>

* fix test and add to changelog

Signed-off-by: Russ Cam <russcam@canva.com>

* Address PR feedback

- Add unit tests for DelimitedTermFrequencyTokenFilterFactory
- Remove IllegalArgumentException as caught exception
- Add skip to yaml rest tests to skip for version < 2.10

Signed-off-by: Russ Cam <russcam@canva.com>

* formatting

Signed-off-by: Russ Cam <russcam@canva.com>

* Rename filter

Signed-off-by: Russ Cam <russcam@canva.com>

* update naming in REST tests

Signed-off-by: Russ Cam <russcam@canva.com>

---------

Signed-off-by: Russ Cam <russcam@canva.com>
Signed-off-by: Kaushal Kumar <ravi.kaushal97@gmail.com>
brusic pushed a commit to brusic/OpenSearch that referenced this issue Sep 25, 2023
* Expose DelimitedTermFrequencyTokenFilter

Relates: opensearch-project#9413

This commit exposes Lucene's delimited term frequency token filter to be
able to provide term frequencies along with terms.

Signed-off-by: Russ Cam <russcam@canva.com>

* fix format violations

Signed-off-by: Russ Cam <russcam@canva.com>

* fix test and add to changelog

Signed-off-by: Russ Cam <russcam@canva.com>

* Address PR feedback

- Add unit tests for DelimitedTermFrequencyTokenFilterFactory
- Remove IllegalArgumentException as caught exception
- Add skip to yaml rest tests to skip for version < 2.10

Signed-off-by: Russ Cam <russcam@canva.com>

* formatting

Signed-off-by: Russ Cam <russcam@canva.com>

* Rename filter

Signed-off-by: Russ Cam <russcam@canva.com>

* update naming in REST tests

Signed-off-by: Russ Cam <russcam@canva.com>

---------

Signed-off-by: Russ Cam <russcam@canva.com>
Signed-off-by: Ivan Brusic <ivan.brusic@flocksafety.com>
shiv0408 pushed a commit to Gaurav614/OpenSearch that referenced this issue Apr 25, 2024
* Expose DelimitedTermFrequencyTokenFilter

Relates: opensearch-project#9413

This commit exposes Lucene's delimited term frequency token filter to be
able to provide term frequencies along with terms.

Signed-off-by: Russ Cam <russcam@canva.com>

* fix format violations

Signed-off-by: Russ Cam <russcam@canva.com>

* fix test and add to changelog

Signed-off-by: Russ Cam <russcam@canva.com>

* Address PR feedback

- Add unit tests for DelimitedTermFrequencyTokenFilterFactory
- Remove IllegalArgumentException as caught exception
- Add skip to yaml rest tests to skip for version < 2.10

Signed-off-by: Russ Cam <russcam@canva.com>

* formatting

Signed-off-by: Russ Cam <russcam@canva.com>

* Rename filter

Signed-off-by: Russ Cam <russcam@canva.com>

* update naming in REST tests

Signed-off-by: Russ Cam <russcam@canva.com>

---------

Signed-off-by: Russ Cam <russcam@canva.com>
Signed-off-by: Shivansh Arora <hishiv@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Enhancement or improvement to existing feature or request Search Search query, autocomplete ...etc v2.10.0
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

5 participants