Does Opensearch need constant_keyword field type? #9981

Closed
jainankitk opened this issue Sep 11, 2023 · 10 comments · Fixed by #12285
Labels
discuss Issues intended to help drive brainstorming and decision making enhancement Enhancement or improvement to existing feature or request feature New feature or request good first issue Good for newcomers Search Search query, autocomplete ...etc

Comments

@jainankitk
Collaborator

Came across the constant_keyword field type added by Elasticsearch here. The idea is simple: maintain two indices and partition documents by the value of a specific field. All documents with value X for field F go to index I1, and all documents with any other value go to index I2. A search can target both indices; for a filter on field F, index I1 rewrites the filter to MatchAll or MatchNone depending on the filter value. In practice this is much more efficient than the default single-index approach, where the filter matches a large number of documents in the index.

That said, I have not come across many managed service customers asking for something like this. I would like community feedback on whether it would be useful.

@jainankitk jainankitk added enhancement Enhancement or improvement to existing feature or request untriaged discuss Issues intended to help drive brainstorming and decision making Search Search query, autocomplete ...etc feature New feature or request and removed discuss Issues intended to help drive brainstorming and decision making labels Sep 11, 2023
@msfroh
Collaborator

msfroh commented Sep 11, 2023

I had been thinking about doing something with ingest pipelines to route incoming documents to different indices based on some predicate(s) on field values, and then route queries to the right indices using search pipelines. (This assumes that both index pipelines and search pipelines can be evaluated before finalizing on a target index.) I worked on something like that to add capacity for "hot documents" a few years back.

With these constant_keyword fields, I guess you could do a similar thing at ingest time, but the search request processor would just need to add a filter on the constant_keyword field (hot_docs:true versus hot_docs:false?).

On the other hand, if we can just select target indices based on a search-time predicate, that feels easier to me, I think.

@jainankitk
Collaborator Author

With these constant_keyword fields, I guess you could do a similar thing at ingest time, but the search request processor would just need to add a filter on the constant_keyword field (hot_docs:true versus hot_docs:false?).

The experience with constant_keyword is not as seamless, because the user has to take care of ingesting documents into the right index.

PUT bicycles
{
  "mappings": {
    "properties": {
      "cycle_type": {
        "type": "constant_keyword",
        "value": "bicycle"
      },
      "name": {
        "type": "text"
      }
    }
  }
}

PUT other_cycles
{
  "mappings": {
    "properties": {
      "cycle_type": {
        "type": "keyword"
      },
      "name": {
        "type": "text"
      }
    }
  }
}

@jainankitk
Collaborator Author

Also, search is not zero cost on the other_cycles index: the filter cannot be rewritten as MatchNone there, because nothing ties the two indices together:

GET bicycles,other_cycles/_search
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "description": "dutch"
        }
      },
      "filter": {
        "term": {
          "cycle_type": "bicycle"
        }
      }
    }
  }
}
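
A rough sketch of that asymmetry, assuming the two mappings shown earlier. The function name `plan_term_filter` and the labels it returns are hypothetical and only illustrate the reasoning; this is not how OpenSearch represents query plans internally.

```python
from typing import Any, Dict

# Field mappings copied from the two index definitions in this thread.
BICYCLES_MAPPING: Dict[str, Dict[str, Any]] = {
    "cycle_type": {"type": "constant_keyword", "value": "bicycle"},
    "name": {"type": "text"},
}
OTHER_CYCLES_MAPPING: Dict[str, Dict[str, Any]] = {
    "cycle_type": {"type": "keyword"},
    "name": {"type": "text"},
}


def plan_term_filter(mapping: Dict[str, Dict[str, Any]], field: str, value: str) -> str:
    """Decide (hypothetically) how a term filter would be handled for one index."""
    field_mapping = mapping.get(field)
    if field_mapping is None:
        # A term filter on an unmapped field cannot match any document.
        return "match_none"
    if field_mapping["type"] == "constant_keyword":
        # The value is fixed for the whole index, so the answer is known
        # from the mapping alone.
        return "match_all" if field_mapping.get("value") == value else "match_none"
    # A regular keyword field says nothing about which values are present,
    # so the index has to be consulted.
    return "term_lookup"


print(plan_term_filter(BICYCLES_MAPPING, "cycle_type", "bicycle"))
print(plan_term_filter(OTHER_CYCLES_MAPPING, "cycle_type", "bicycle"))
```

The first call resolves from the mapping alone, while the second still needs an index lookup even though other_cycles is not expected to contain any bicycle documents.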

@jainankitk
Collaborator Author

On the other hand, if we can just select target indices based on a search-time predicate, that feels easier to me, I think.

Can you elaborate on this approach? Shouldn't target index selection happen during both search and ingestion? The ideal experience would be for the customer not to deal with multiple indices at all.

@msfroh
Collaborator

msfroh commented Sep 11, 2023

Can you elaborate on this approach? Shouldn't target index selection happen during both search and ingestion? The ideal experience would be for the customer not to deal with multiple indices at all.

I was imagining something where you could do e.g.

// The pipeline always overwrites the `_index` metadata field.
PUT /_ingest/pipeline/index_routing
{
  "processors": [
    {
      "set" : {
        "if" : "ctx?.cycle_type == 'bicycle'",
        "field" : "_index",
        "value": "bicycles"
      }
    },
    {
      "set" : {
        "if" : "ctx?.cycle_type != 'bicycle'",
        "field" : "_index",
        "value": "other_cycles"
      }
    }
  ]
}

PUT /<any_index>/_doc/1?pipeline=index_routing
{
  "cycle_type" : "bicycle",
  "description" : "Dutch step-through urban cruiser bike"
}

PUT /_search/pipeline/index_routing
{
  "request_processors" : [
    {
      "conditional_routing": {
        "required_clauses" : [{
          "term" : {
            "cycle_type" : "bicycle"
          }
        }],
        "target_index" : "bicycles",
        "else_index": "other_cycles"
      }
    }
  ]
}

POST /_search?search_pipeline=index_routing
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "description": "dutch"
        }
      },
      "filter": {
        "term": {
          "cycle_type": "bicycle"
        }
      }
    }
  }
}

Essentially, the defined pipelines would route the index and search requests to the right indices. The user would need to define the pipelines appropriately, but wouldn't need to worry about routing after that.
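
The routing idea can be modeled in a few lines of Python. This is only a sketch of the proposal as written above: `conditional_routing` is an imagined processor, and the helper names `route_document` and `route_search` are hypothetical, not an existing OpenSearch API.

```python
from typing import Any, Dict, List


def route_document(doc: Dict[str, Any]) -> str:
    """Mimic the two `set` processors: choose the index from cycle_type."""
    # doc.get() plays the role of the null-safe `ctx?.cycle_type` check.
    return "bicycles" if doc.get("cycle_type") == "bicycle" else "other_cycles"


def route_search(
    query: Dict[str, Any],
    required_clause: Dict[str, Any],
    target_index: str,
    else_index: str,
) -> str:
    """Mimic the imagined conditional_routing search request processor."""
    filters = query.get("bool", {}).get("filter", [])
    # The query DSL accepts either a single filter clause or a list of them.
    clauses: List[Dict[str, Any]] = [filters] if isinstance(filters, dict) else filters
    return target_index if required_clause in clauses else else_index


doc = {"cycle_type": "bicycle", "description": "Dutch step-through urban cruiser bike"}
query = {
    "bool": {
        "must": {"match": {"description": "dutch"}},
        "filter": {"term": {"cycle_type": "bicycle"}},
    }
}
print(route_document(doc))
print(route_search(query, {"term": {"cycle_type": "bicycle"}}, "bicycles", "other_cycles"))
```

One gap in this sketch: a query with no cycle_type filter at all would probably need to search both indices, and neither the sketch nor the processor definition above covers that case.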

@msfroh msfroh added discuss Issues intended to help drive brainstorming and decision making and removed untriaged labels Sep 20, 2023
@jainankitk
Collaborator Author

@msfroh - The pipeline-based ingestion experience defined above is much better and more seamless for users. It also addresses the efficiency concern by querying only the required index instead of all possible ones.

@jainankitk
Collaborator Author

To avoid having the user specify the search/index pipeline in each request, I am wondering whether we can create an alias on top of bicycles/other_cycles and have the pipelines applied to any indexing or search request to that alias.

@msfroh
Collaborator

msfroh commented Oct 27, 2023

Oh -- incidentally, it turns out that we already almost have the constant_keyword field type.

We already have ConstantFieldType:

public abstract class ConstantFieldType extends MappedFieldType {

Right now, the only implementation is IndexFieldType:

static final class IndexFieldType extends ConstantFieldType {

Essentially, that's how the _index field is implemented. Most of the query cleverness is already defined in ConstantFieldType. We'd just need to subclass it.
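
To illustrate the shape of that subclassing, here is a Python stand-in for the Java classes. The class and method names are hypothetical, and the use of shell-style wildcard matching is an assumption made for the sketch, not a statement about what ConstantFieldType does.

```python
from abc import ABC, abstractmethod
from fnmatch import fnmatchcase
from typing import Iterable


class ConstantFieldSketch(ABC):
    """Stand-in for a field type whose value is identical for every document."""

    @abstractmethod
    def matches(self, pattern: str) -> bool:
        """Return True if the index-wide constant satisfies the pattern."""

    def term_query(self, value: str) -> str:
        # Shared "cleverness": the query resolves without reading any document.
        return "match_all" if self.matches(value) else "match_none"

    def terms_query(self, values: Iterable[str]) -> str:
        return "match_all" if any(self.matches(v) for v in values) else "match_none"


class ConstantKeywordSketch(ConstantFieldSketch):
    """Subclass that only supplies the constant and how to compare against it."""

    def __init__(self, value: str) -> None:
        self.value = value

    def matches(self, pattern: str) -> bool:
        # Assumption: case-sensitive shell-style matching ("*" and "?").
        return fnmatchcase(self.value, pattern)


field = ConstantKeywordSketch("bicycle")
print(field.term_query("bicycle"))
print(field.terms_query(["tricycle", "unicycle"]))
```

The subclass only has to answer "does the constant match this pattern?"; the query-building logic stays in the base class, which is the reuse being suggested here.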

@msfroh msfroh added the good first issue Good for newcomers label Nov 20, 2023
@msfroh
Collaborator

msfroh commented Nov 27, 2023

@hasnain2808 -- you expressed some interest in working on this one in our OpenSearch Lucene Study Group meeting (https://forum.opensearch.org/t/opensearch-lucene-study-group-meeting-monday-november-20th/16729/9).

Can you please respond to this issue so we can assign it to you? It can only be assigned to a maintainer or someone who participates in the issue.

@hasnain2808
Contributor

Sure @msfroh
