Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add search_after parameter to new terms_enum api #72910

Closed
markharwood opened this issue May 11, 2021 · 9 comments
Closed

Add search_after parameter to new terms_enum api #72910

markharwood opened this issue May 11, 2021 · 9 comments
Assignees
Labels
>enhancement :Search/Search Search-related issues that do not fall into other categories Team:Search Meta label for search team

Comments

@markharwood
Copy link
Contributor

To allow client applications to page through many autocomplete suggestions we should offer a search_after parameter. Calling applications would pass the last term value from the previous set of results and the returned results would be all those matching the provided prefix string param but after the supplied search_after parameter. An example might be:

Request 1

localhost:9200/myindex/_terms_enum
{
"field" : "url",
"string" : "www."
}}

Result 1

{
   "terms": [
    "www.aa.com",
    "www.abc.com",
    ...
    "www.bbc.com"
    ]

Request 2 (with new search_after param)

localhost:9200/myindex/_terms_enum
{
"field" : "url",
"string" : "www.",
"search_after" : "www.bbc.com"
}}

Result 2

{
   "terms": [
    "www.breitling.com",
    "www.bt.com",
    ...
    "www.cnn.com"
    ]

It would be a client error to pass a search_after string that does not start with the provided string prefix.

Example invalid request that would produce an error

   "search_after": "www.bbc.com",
   "string": "some_other_prefix"

Related to #71550

@markharwood markharwood added >enhancement :Search/Search Search-related issues that do not fall into other categories labels May 11, 2021
@markharwood markharwood self-assigned this May 11, 2021
@elasticmachine elasticmachine added the Team:Search Meta label for search team label May 11, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search (Team:Search)

@markharwood
Copy link
Contributor Author

I'm not sure if this is viable - I coded up an implementation but was hitting UnsupportedOperationExceptions using search_after on a keyword field:

  1>               "reason" : "java.lang.UnsupportedOperationException
  1>    at org.apache.lucene.codecs.blocktree.IntersectTermsEnum.seekCeil(IntersectTermsEnum.java:566)
  1>    at org.apache.lucene.index.FilteredTermsEnum.next(FilteredTermsEnum.java:221)
  1>    at org.elasticsearch.xpack.core.termsenum.action.MultiShardTermsEnum.<init>(MultiShardTermsEnum.java:52)
  1>    at org.elasticsearch.xpack.core.termsenum.action.TransportTermsEnumAction.dataNodeOperation(TransportTermsEnumAction.java:319)
  1>    at org.elasticsearch.xpack.core.termsenum.action.TransportTermsEnumAction.lambda$asyncNodeOperation$1(TransportTermsEnumAction.java:632)
  1>    at org.elasticsearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:47)
  1>    at org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:62)
  1>    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
  1>    at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:33)
  1>    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:728)
  1>    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
  1>    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
  1>    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
  1>    at java.base/java.lang.Thread.run(Thread.java:834)

The seekCeil operation is not always handled in TermsEnum implementations and the keyword field's use of automaton.getTermsEnum(terms) is an example where this doesn't work.

We only suggested offering this API because we thought it was cheap/simple to do but based on this experience I'm not sure we should offer it.

@jimczi
Copy link
Contributor

jimczi commented May 11, 2021

The IntersectTermsEnum takes the startTerm as an argument in the constructor. That's also expose in Fields:

public TermsEnum intersect(CompiledAutomaton compiled, final BytesRef startTerm) throws IOException {

@jtibshirani
Copy link
Contributor

We've had discussions on a general approach to support paging in our REST APIs: #64099. I wonder if this aligns with what we're been planning for other APIs, maybe we're standardizing on using search_after? Tagging @ywangd and @cjcenizal in case they're interested.

markharwood added a commit to markharwood/elasticsearch that referenced this issue May 14, 2021
…s returning a TermsEnum implementation that supports seekCeil() but we fell at the first hurdle with the keyword field not working - UnsupportedOperationException.

See elastic#72910 for discussion.
markharwood added a commit that referenced this issue May 14, 2021
Adds an optional parameter to the _terms_enum request designed to allow paging.
The last term from a previous result can be passed as the search_after parameter to a subsequent request, meaning only terms after the given term (but still matching the provided string prefix) are returned
Relates to #72910
markharwood added a commit that referenced this issue May 14, 2021
Backport of ebb113a
Adds an optional parameter to the _terms_enum request designed to allow paging.
The last term from a previous result can be passed as the search_after parameter to a subsequent request, meaning only terms after the given term (but still matching the provided string prefix) are returned
Relates to #72910
@ywangd
Copy link
Member

ywangd commented May 17, 2021

We've had discussions on a general approach to support paging in our REST APIs: #64099. I wonder if this aligns with what we're been planning for other APIs, maybe we're standardizing on using search_after? Tagging @ywangd and @cjcenizal in case they're interested.

Thanks @jtibshirani We haven't really settled on an approach for user interface on the other issue. I have a separate document about search API key metadata where I also talked about pagination. The basic idea is to have a more generic parameter, I call it resumption_token, instead of search_after. The main reason is to hide implementation details about how the data is searched and where they come from:

  1. For termsEnum, just a single search_after is sufficient because there is no variation on how search is done.
  2. But sometimes (often times?) we want to iterate the results with a Point-In-Time reader. But instead of having both search_after and pit parameters, we can have a single resumption_token to encode both information and let the server handle it transparently for users.
  3. There are use cases where search_after may not make most of sense or won't maintain its canonical meaning. For example, the list of snapshots does not come from an index at all (@henningandersen), so search_after is somehow a misuse. But resumption_token is generic and the handler can decide how it should be handed out and handled. (Does termsEnum fall into this category as well? It's lucene data but does not from the actual "search"?)

Personally I feel it's easier for users to consume our APIs if they provide consistent interfaces for simliar concept, e.g. pagination, instead of similar unerlying implementations.

@albertzaharovits also suggested we could have this resumption_token provided by the search infrastructure itself. But since pagination is needed for places where index search is not required, I think it is also reasonable to have it outside of the search itself.

@henningandersen
Copy link
Contributor

I too would like to not have our generic pagination named search_after. I prefer just after, I think that is short and concise. Other necessary names is the parameter for the limit of how many results to include in a page and the value in the response to use to get the next page. I think limit and next can work out here.

Another choice is whether limit and after should go in the body or as query-params. We could opt to support both.

@henningandersen
Copy link
Contributor

@jtibshirani provided input through another channel that she prefers size over limit, I tend to agree that is more in line with previous conventions. We also need a sort parameter though I suppose it is implicit in the terms_enum api, since there is only one (?).

@markharwood
Copy link
Contributor Author

markharwood commented May 19, 2021

We also need a sort parameter though I suppose it is implicit in the terms_enum api, since there is only one (?).

In the interests of maximum performance, the terms_enum API doesn't support anything other than the sort order used internally to arrange terms in the index

@markharwood
Copy link
Contributor Author

Closed in #72933

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>enhancement :Search/Search Search-related issues that do not fall into other categories Team:Search Meta label for search team
Projects
None yet
Development

No branches or pull requests

6 participants