From 849b87e299f6f03cd2266e0af1a8e83480bd5c06 Mon Sep 17 00:00:00 2001 From: Miguel Grinberg Date: Thu, 23 Jan 2025 11:07:38 +0000 Subject: [PATCH 1/2] Added DSL documentation to Guide --- docs/guide/dsl/asyncio.asciidoc | 103 ++++ docs/guide/dsl/configuration.asciidoc | 125 ++++ docs/guide/dsl/examples.asciidoc | 5 + docs/guide/dsl/faceted_search.asciidoc | 145 +++++ docs/guide/dsl/howto.asciidoc | 7 + docs/guide/dsl/persistence.asciidoc | 761 ++++++++++++++++++++++++ docs/guide/dsl/search_dsl.asciidoc | 704 ++++++++++++++++++++++ docs/guide/dsl/tutorials.asciidoc | 251 ++++++++ docs/guide/dsl/update_by_query.asciidoc | 168 ++++++ docs/guide/elasticsearch-dsl.asciidoc | 48 ++ docs/guide/index-custom-title-page.html | 5 +- docs/guide/index.asciidoc | 4 +- docs/guide/overview.asciidoc | 18 +- 13 files changed, 2327 insertions(+), 17 deletions(-) create mode 100644 docs/guide/dsl/asyncio.asciidoc create mode 100644 docs/guide/dsl/configuration.asciidoc create mode 100644 docs/guide/dsl/examples.asciidoc create mode 100644 docs/guide/dsl/faceted_search.asciidoc create mode 100644 docs/guide/dsl/howto.asciidoc create mode 100644 docs/guide/dsl/persistence.asciidoc create mode 100644 docs/guide/dsl/search_dsl.asciidoc create mode 100644 docs/guide/dsl/tutorials.asciidoc create mode 100644 docs/guide/dsl/update_by_query.asciidoc create mode 100644 docs/guide/elasticsearch-dsl.asciidoc diff --git a/docs/guide/dsl/asyncio.asciidoc b/docs/guide/dsl/asyncio.asciidoc new file mode 100644 index 000000000..6e92e84ab --- /dev/null +++ b/docs/guide/dsl/asyncio.asciidoc @@ -0,0 +1,103 @@ +[[asyncio]] +==== Using asyncio with Elasticsearch DSL + +The DSL module supports async/await with +https://docs.python.org/3/library/asyncio.html[asyncio]. To ensure that +you have all the required dependencies, install the `[async]` +extra: + +[source,bash] +---- +$ python -m pip install "elasticsearch[async]" +---- + +===== Connections + +Use the `async_connections` module to manage your asynchronous +connections. + +[source,python] +---- +from elasticsearch.dsl import async_connections + +async_connections.create_connection(hosts=['localhost'], timeout=20) +---- + +All the options available in the `connections` module can be used with +`async_connections`. + +====== How to avoid 'Unclosed client session / connector' warnings on exit + +These warnings come from the `aiohttp` package, which is used internally +by the `AsyncElasticsearch` client. They appear often when the +application exits and are caused by HTTP connections that are open when +they are garbage collected. To avoid these warnings, make sure that you +close your connections. + +[source,python] +---- +es = async_connections.get_connection() +await es.close() +---- + +===== Search DSL + +Use the `AsyncSearch` class to perform asynchronous searches. + +[source,python] +---- +from elasticsearch.dsl import AsyncSearch + +s = AsyncSearch().query("match", title="python") +async for hit in s: + print(hit.title) +---- + +Instead of using the `AsyncSearch` object as an asynchronous iterator, +you can explicitly call the `execute()` method to get a `Response` +object. + +[source,python] +---- +s = AsyncSearch().query("match", title="python") +response = await s.execute() +for hit in response: + print(hit.title) +---- + +An `AsyncMultiSearch` is available as well. 
+ +[source,python] +---- +from elasticsearch.dsl import AsyncMultiSearch + +ms = AsyncMultiSearch(index='blogs') + +ms = ms.add(AsyncSearch().filter('term', tags='python')) +ms = ms.add(AsyncSearch().filter('term', tags='elasticsearch')) + +responses = await ms.execute() + +for response in responses: + print("Results for query %r." % response.search.query) + for hit in response: + print(hit.title) +---- + +===== Asynchronous Documents, Indexes, and more + +The `Document`, `Index`, `IndexTemplate`, `Mapping`, `UpdateByQuery` and +`FacetedSearch` classes all have asynchronous versions that use the same +name with an `Async` prefix. These classes expose the same interfaces as +the synchronous versions, but any methods that perform I/O are defined +as coroutines. + +Auxiliary classes that do not perform I/O do not have asynchronous +versions. The same classes can be used in synchronous and asynchronous +applications. + +When using a custom analyzer in an asynchronous +application, use the `async_simulate()` method to invoke the Analyze +API on it. + +Consult the `api` section for details about each specific method. diff --git a/docs/guide/dsl/configuration.asciidoc b/docs/guide/dsl/configuration.asciidoc new file mode 100644 index 000000000..c8cb2b4f0 --- /dev/null +++ b/docs/guide/dsl/configuration.asciidoc @@ -0,0 +1,125 @@ +=== Configuration + +There are several ways to configure connections for the library. The +easiest and most useful approach is to define one default connection +that can be used every time an API call is made without explicitly +passing in other connections. + +[NOTE] +==== +Unless you want to access multiple clusters from your application, it is +highly recommended that you use the `create_connection` method and +all operations will use that connection automatically. +==== + +==== Default connection + +To define a default connection that can be used globally, use the +`connections` module and the `create_connection` method like this: + +[source,python] +---- +from elasticsearch.dsl import connections + +connections.create_connection(hosts=['localhost'], timeout=20) +---- + +===== Single connection with an alias + +You can define the `alias` or name of a connection so you can easily +refer to it later. The default value for `alias` is `default`. + +[source,python] +---- +from elasticsearch.dsl import connections + +connections.create_connection(alias='my_new_connection', hosts=['localhost'], timeout=60) +---- + +Additional keyword arguments (`hosts` and `timeout` in our example) will +be passed to the `Elasticsearch` class from `elasticsearch-py`. + +To see all possible configuration options refer to the +https://elasticsearch-py.readthedocs.io/en/latest/api/elasticsearch.html[documentation]. + +==== Multiple clusters + +You can define multiple connections to multiple clusters at the same +time using the `configure` method: + +[source,python] +---- +from elasticsearch.dsl import connections + +connections.configure( + default={'hosts': 'localhost'}, + dev={ + 'hosts': ['esdev1.example.com:9200'], + 'sniff_on_start': True + } +) +---- + +Such connections will be constructed lazily when requested for the first +time. 
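+
+As a brief illustration (a minimal sketch reusing the `dev` alias
+configured above), the client for an alias is only instantiated the
+first time it is requested, and is cached afterwards:
+
+[source,python]
+----
+# the 'dev' client is built on this first call and then reused
+dev_client = connections.get_connection('dev')
+----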
+ +You can alternatively define multiple connections by adding them one by +one as shown in the following example: + +[source,python] +---- +# if you have configuration options to be passed to Elasticsearch.__init__ +# this also shows creating a connection with the alias 'qa' +connections.create_connection('qa', hosts=['esqa1.example.com'], sniff_on_start=True) + +# if you already have an Elasticsearch instance ready +connections.add_connection('another_qa', my_client) +---- + +===== Using aliases + +When using multiple connections, you can refer to them using the string +alias specified when you created the connection. + +This example shows how to use an alias to a connection: + +[source,python] +---- +s = Search(using='qa') +---- + +A `KeyError` will be raised if there is no connection registered with +that alias. + +==== Manual + +If you don't want to supply a global configuration, you can always pass +in your own connection as an instance of `elasticsearch.Elasticsearch` +with the parameter `using` wherever it is accepted like this: + +[source,python] +---- +s = Search(using=Elasticsearch('localhost')) +---- + +You can even use this approach to override any connection the object +might be already associated with: + +[source,python] +---- +s = s.using(Elasticsearch('otherhost:9200')) +---- + +[NOTE] +==== +When using the `dsl` module, it is highly recommended that you +use the built-in serializer +(`elasticsearch.dsl.serializer.serializer`) to ensure your objects +are correctly serialized into `JSON` every time. The +`create_connection` method that is described here (and that the +`configure` method uses under the hood) will do that automatically for +you, unless you explicitly specify your own serializer. The built-in +serializer also allows you to serialize your own objects - just define a +`to_dict()` method on your objects and that method will be +automatically called when serializing your custom objects to `JSON`. +==== diff --git a/docs/guide/dsl/examples.asciidoc b/docs/guide/dsl/examples.asciidoc new file mode 100644 index 000000000..5d22f84e5 --- /dev/null +++ b/docs/guide/dsl/examples.asciidoc @@ -0,0 +1,5 @@ +=== Examples + +Please see the +https://github.com/elastic/elasticsearch-py/tree/master/examples/dsl[DSL examples] +directory to see some complex examples using the DSL module. diff --git a/docs/guide/dsl/faceted_search.asciidoc b/docs/guide/dsl/faceted_search.asciidoc new file mode 100644 index 000000000..6d05cae2d --- /dev/null +++ b/docs/guide/dsl/faceted_search.asciidoc @@ -0,0 +1,145 @@ +[[faceted_search]] +==== Faceted Search + +The library comes with a simple abstraction aimed at helping you develop +faceted navigation for your data. + +[NOTE] +==== +This API is experimental and will be subject to change. Any feedback is +welcome. +==== + +===== Configuration + +You can provide several configuration options (as class attributes) when +declaring a `FacetedSearch` subclass: + +- `index`: + the name of the index (as string) to search through, defaults to + `'_all'`. +- `doc_types`: + list of `Document` subclasses or strings to be used, defaults to + `['_all']`. +- `fields`: + list of fields on the document type to search through. The list will + be passes to `MultiMatch` query so can contain boost values + (`'title^5'`), defaults to `['*']`. +- `facets`: + dictionary of facets to display/filter on. 
The key is the name
+  displayed, and the values should be instances of any `Facet` subclass, for
+  example: `{'tags': TermsFacet(field='tags')}`
+
+====== Facets
+
+There are several different facets available:
+
+- `TermsFacet`:
+  provides an option to split documents into groups based on the value of
+  a field, for example `TermsFacet(field='category')`
+- `DateHistogramFacet`:
+  splits documents into time intervals, for example:
+  `DateHistogramFacet(field="published_date", calendar_interval="day")`
+- `HistogramFacet`:
+  similar to `DateHistogramFacet` but for numerical values:
+  `HistogramFacet(field="rating", interval=2)`
+- `RangeFacet`:
+  allows you to define your own ranges for numerical fields:
+  `RangeFacet(field="comment_count", ranges=[("few", (None, 2)), ("lots", (2, None))])`
+- `NestedFacet`:
+  a simple facet that wraps another facet to provide access to nested
+  documents:
+  `NestedFacet('variants', TermsFacet(field='variants.color'))`
+
+By default, facet results only calculate the document count. If you wish
+to use a different metric, you can pass in any single-value metric
+aggregation as the `metric` kwarg
+(`TermsFacet(field='tags', metric=A('max', field='timestamp'))`). When
+specifying `metric`, the results will by default be sorted in
+descending order by that metric. To change this to ascending, specify
+`metric_sort="asc"`; to sort by document count instead, use
+`metric_sort=False`.
+
+====== Advanced
+
+If you require any custom behavior or modifications, simply override one
+or more of the methods responsible for the class's functions:
+
+- `search(self)`:
+  responsible for constructing the `Search` object used. Override
+  this if you want to customize the search object (for example, by adding
+  a global filter for published articles only).
+- `query(self, search)`:
+  adds the query part of the search (if search input was specified), by
+  default using a `MultiMatch` query. Override this if you want to modify
+  the query type used.
+- `highlight(self, search)`:
+  defines the highlighting on the `Search` object and returns a new one.
+  The default behavior is to highlight on all fields specified for search.
+
+===== Usage
+
+The custom subclass can be instantiated empty to provide an empty search
+(matching everything), or with `query`, `filters` and `sort`.
+
+- `query`:
+  the text of the query to be performed. If `None` is passed in (the
+  default), a `MatchAll` query will be used. For example: `'python web'`.
+- `filters`:
+  a dictionary containing all the facet filters that you wish to apply.
+  Use the name of the facet (from the `.facets` attribute) as the key
+  and one of the possible values as the value. For example:
+  `{'tags': 'python'}`.
+- `sort`:
+  a tuple or list of fields on which the results should be sorted.
+  The format of the individual fields is the same as that of the fields
+  passed to `Search.sort()`.
+
+====== Response
+
+The response returned from the `FacetedSearch` object (by calling
+`.execute()`) is a subclass of the standard `Response` class that adds a
+property called `facets`, which contains a dictionary with lists of
+buckets, each represented by a tuple of key, document count, and a flag
+indicating whether this value has been filtered on.
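+
+Concretely, `response.facets.tags` from the example below might look like
+this (hypothetical data):
+
+[source,python]
+----
+# a list of (key, document count, selected) tuples
+[('python', 10, True), ('search', 8, False), ('web', 3, False)]
+----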
+ +===== Example + +[source,python] +---- +from datetime import date + +from elasticsearch.dsl import FacetedSearch, TermsFacet, DateHistogramFacet + +class BlogSearch(FacetedSearch): + doc_types = [Article, ] + # fields that should be searched + fields = ['tags', 'title', 'body'] + + facets = { + # use bucket aggregations to define facets + 'tags': TermsFacet(field='tags'), + 'publishing_frequency': DateHistogramFacet(field='published_from', interval='month') + } + + def search(self): + # override methods to add custom pieces + s = super().search() + return s.filter('range', publish_from={'lte': 'now/h'}) + +bs = BlogSearch('python web', {'publishing_frequency': date(2015, 6)}) +response = bs.execute() + +# access hits and other attributes as usual +total = response.hits.total +print('total hits', total.relation, total.value) +for hit in response: + print(hit.meta.score, hit.title) + +for (tag, count, selected) in response.facets.tags: + print(tag, ' (SELECTED):' if selected else ':', count) + +for (month, count, selected) in response.facets.publishing_frequency: + print(month.strftime('%B %Y'), ' (SELECTED):' if selected else ':', count) +---- diff --git a/docs/guide/dsl/howto.asciidoc b/docs/guide/dsl/howto.asciidoc new file mode 100644 index 000000000..3328234bb --- /dev/null +++ b/docs/guide/dsl/howto.asciidoc @@ -0,0 +1,7 @@ +=== How-To Guides + +include::search_dsl.asciidoc[] +include::persistence.asciidoc[] +include::faceted_search.asciidoc[] +include::update_by_query.asciidoc[] +include::asyncio.asciidoc[] diff --git a/docs/guide/dsl/persistence.asciidoc b/docs/guide/dsl/persistence.asciidoc new file mode 100644 index 000000000..ff478dadf --- /dev/null +++ b/docs/guide/dsl/persistence.asciidoc @@ -0,0 +1,761 @@ +==== Persistence + +You can use the DSL module to define your mappings and a basic +persistent layer for your application. + +For more comprehensive examples have a look at the +https://github.com/elastic/elasticsearch-py/tree/main/examples/dsl[DSL examples] +directory in the repository. + +[[doc_type]] +===== Document + +If you want to create a model-like wrapper around your documents, use +the `Document` class. It can also be used to create all the necessary +mappings and settings in elasticsearch (see `life-cycle` for details). + +[source,python] +---- +from datetime import datetime +from elasticsearch.dsl import Document, Date, Nested, Boolean, \ + analyzer, InnerDoc, Completion, Keyword, Text + +html_strip = analyzer('html_strip', + tokenizer="standard", + filter=["standard", "lowercase", "stop", "snowball"], + char_filter=["html_strip"] +) + +class Comment(InnerDoc): + author = Text(fields={'raw': Keyword()}) + content = Text(analyzer='snowball') + created_at = Date() + + def age(self): + return datetime.now() - self.created_at + +class Post(Document): + title = Text() + title_suggest = Completion() + created_at = Date() + published = Boolean() + category = Text( + analyzer=html_strip, + fields={'raw': Keyword()} + ) + + comments = Nested(Comment) + + class Index: + name = 'blog' + + def add_comment(self, author, content): + self.comments.append( + Comment(author=author, content=content, created_at=datetime.now())) + + def save(self, ** kwargs): + self.created_at = datetime.now() + return super().save(** kwargs) +---- + +====== Data types + +The `Document` instances use native python types like `str` and +`datetime`. 
In case of `Object` or `Nested` fields an instance of the +`InnerDoc` subclass is used, as in the `add_comment` method in the +above example where we are creating an instance of the `Comment` class. + +There are some specific types that were created as part of this library +to make working with some field types easier, for example the `Range` +object used in any of the +https://www.elastic.co/guide/en/elasticsearch/reference/current/range.html[range +fields]: + +[source,python] +---- +from elasticsearch.dsl import Document, DateRange, Keyword, Range + +class RoomBooking(Document): + room = Keyword() + dates = DateRange() + + +rb = RoomBooking( + room='Conference Room II', + dates=Range( + gte=datetime(2018, 11, 17, 9, 0, 0), + lt=datetime(2018, 11, 17, 10, 0, 0) + ) +) + +# Range supports the in operator correctly: +datetime(2018, 11, 17, 9, 30, 0) in rb.dates # True + +# you can also get the limits and whether they are inclusive or exclusive: +rb.dates.lower # datetime(2018, 11, 17, 9, 0, 0), True +rb.dates.upper # datetime(2018, 11, 17, 10, 0, 0), False + +# empty range is unbounded +Range().lower # None, False +---- + +====== Python Type Hints + +Document fields can be defined using standard Python type hints if +desired. Here are some simple examples: + +[source,python] +---- +from typing import Optional + +class Post(Document): + title: str # same as title = Text(required=True) + created_at: Optional[datetime] # same as created_at = Date(required=False) + published: bool # same as published = Boolean(required=True) +---- + +It is important to note that when using `Field` subclasses such as +`Text`, `Date` and `Boolean`, they must be given in the right-side of an +assignment, as shown in examples above. Using these classes as type +hints will result in errors. + +Python types are mapped to their corresponding field type according to +the following table: + +.Python type to DSL field mappings +[cols=",",options="header",] +|=== +|Python type |DSL field +|`str` |`Text(required=True)` +|`bool` |`Boolean(required=True)` +|`int` |`Integer(required=True)` +|`float` |`Float(required=True)` +|`bytes` |`Binary(required=True)` +|`datetime` |`Date(required=True)` +|`date` |`Date(format="yyyy-MM-dd", required=True)` +|=== + +To type a field as optional, the standard `Optional` modifier from the +Python `typing` package can be used. When using Python 3.10 or newer, +"pipe" syntax can also be used, by adding `| None` to a type. The +`List` modifier can be added to a field to convert it to an array, +similar to using the `multi=True` argument on the field object. + +[source,python] +---- +from typing import Optional, List + +class MyDoc(Document): + pub_date: Optional[datetime] # same as pub_date = Date() + middle_name: str | None # same as middle_name = Text() + authors: List[str] # same as authors = Text(multi=True, required=True) + comments: Optional[List[str]] # same as comments = Text(multi=True) +---- + +A field can also be given a type hint of an `InnerDoc` subclass, in +which case it becomes an `Object` field of that class. When the +`InnerDoc` subclass is wrapped with `List`, a `Nested` field is created +instead. + +[source,python] +---- +from typing import List + +class Address(InnerDoc): + ... + +class Comment(InnerDoc): + ... 
+ +class Post(Document): + address: Address # same as address = Object(Address, required=True) + comments: List[Comment] # same as comments = Nested(Comment, required=True) +---- + +Unfortunately it is impossible to have Python type hints that uniquely +identify every possible Elasticsearch field type. To choose a field type +that is different than the ones in the table above, the field instance +can be added explicitly as a right-side assignment in the field +declaration. The next example creates a field that is typed as +`Optional[str]`, but is mapped to `Keyword` instead of `Text`: + +[source,python] +---- +class MyDocument(Document): + category: Optional[str] = Keyword() +---- + +This form can also be used when additional options need to be given to +initialize the field, such as when using custom analyzer settings or +changing the `required` default: + +[source,python] +---- +class Comment(InnerDoc): + content: str = Text(analyzer='snowball', required=True) +---- + +When using type hints as above, subclasses of `Document` and `InnerDoc` +inherit some of the behaviors associated with Python dataclasses, as +defined by https://peps.python.org/pep-0681/[PEP 681] and the +https://typing.readthedocs.io/en/latest/spec/dataclasses.html#dataclass-transform[dataclass_transform +decorator]. To add per-field dataclass options such as `default` or +`default_factory`, the `mapped_field()` wrapper can be used on +the right side of a typed field declaration: + +[source,python] +---- +class MyDocument(Document): + title: str = mapped_field(default="no title") + created_at: datetime = mapped_field(default_factory=datetime.now) + published: bool = mapped_field(default=False) + category: str = mapped_field(Keyword(required=True), default="general") +---- + +When using the `mapped_field()` wrapper function, an explicit field +type instance can be passed as a first positional argument, as the +`category` field does in the example above. + +Static type checkers such as https://mypy-lang.org/[mypy] and +https://github.com/microsoft/pyright[pyright] can use the type hints and +the dataclass-specific options added to the `mapped_field()` +function to improve type inference and provide better real-time +suggestions in IDEs. + +One situation in which type checkers can't infer the correct type is +when using fields as class attributes. Consider the following example: + +[source,python] +---- +class MyDocument(Document): + title: str + +doc = MyDocument() +# doc.title is typed as "str" (correct) +# MyDocument.title is also typed as "str" (incorrect) +---- + +To help type checkers correctly identify class attributes as such, the +`M` generic must be used as a wrapper to the type hint, as shown in the +next examples: + +[source,python] +---- +from elasticsearch.dsl import M + +class MyDocument(Document): + title: M[str] + created_at: M[datetime] = mapped_field(default_factory=datetime.now) + +doc = MyDocument() +# doc.title is typed as "str" +# doc.created_at is typed as "datetime" +# MyDocument.title is typed as "InstrumentedField" +# MyDocument.created_at is typed as "InstrumentedField" +---- + +Note that the `M` type hint does not provide any runtime behavior and +its use is not required, but it can be useful to eliminate spurious type +errors in IDEs or type checking builds. 
+
+The `InstrumentedField` objects returned when fields are accessed as
+class attributes are proxies for the field instances that can be used
+anywhere a field needs to be referenced, such as when specifying sort
+options in a `Search` object:
+
+[source,python]
+----
+# sort by creation date descending, and title ascending
+s = MyDocument.search().sort(-MyDocument.created_at, MyDocument.title)
+----
+
+When specifying sorting order, the `{plus}` and `-` unary operators can
+be used on the class field attributes to indicate ascending and
+descending order.
+
+Finally, the `ClassVar` annotation can be used to define a regular class
+attribute that should not be mapped to the Elasticsearch index:
+
+[source,python]
+----
+from typing import ClassVar
+
+class MyDoc(Document):
+    title: M[str]
+    created_at: M[datetime] = mapped_field(default_factory=datetime.now)
+    my_var: ClassVar[str]  # regular class variable, ignored by Elasticsearch
+----
+
+====== Note on dates
+
+The DSL module will always respect the timezone information (or
+lack thereof) on the `datetime` objects passed in or stored in
+Elasticsearch. Elasticsearch itself interprets all datetimes with no
+timezone information as `UTC`. If you wish to reflect this in your
+Python code, you can specify `default_timezone` when instantiating a
+`Date` field:
+
+[source,python]
+----
+class Post(Document):
+    created_at = Date(default_timezone='UTC')
+----
+
+In that case any `datetime` object passed in (or parsed from
+Elasticsearch) will be treated as if it were in the `UTC` timezone.
+
+[[life-cycle]]
+====== Document life cycle
+
+Before you first use the `Post` document type, you need to create the
+mappings in Elasticsearch. For that you can either use the `index`
+object or create the mappings directly by calling the `init` class
+method:
+
+[source,python]
+----
+# create the mappings in Elasticsearch
+Post.init()
+----
+
+This code will typically be run in the setup for your application during
+a code deploy, similar to running database migrations.
+
+To create a new `Post` document, just instantiate the class and pass in
+any fields you wish to set; you can then use standard attribute setting
+to change or add more fields. Note that you are not limited to the
+fields defined explicitly:
+
+[source,python]
+----
+# instantiate the document
+first = Post(title='My First Blog Post, yay!', published=True)
+# assign some field values, can be values or lists of values
+first.category = ['everything', 'nothing']
+# every document has an id in meta
+first.meta.id = 47
+
+# save the document into the cluster
+first.save()
+----
+
+All the metadata fields (`id`, `routing`, `index`, etc.) can be accessed
+(and set) via a `meta` attribute or directly using the underscored
+variant:
+
+[source,python]
+----
+post = Post(meta={'id': 42})
+
+# prints 42
+print(post.meta.id)
+
+# override default index
+post.meta.index = 'my-blog'
+----
+
+[NOTE]
+====
+Having all metadata accessible through `meta` means that this name is
+reserved and you shouldn't have a field called `meta` on your document.
+If, however, you need it, you can still access the data using the get
+item (as opposed to attribute) syntax: `post['meta']`.
+====
+
+To retrieve an existing document use the `get` class method:
+
+[source,python]
+----
+# retrieve the document
+first = Post.get(id=42)
+# now we can call methods, change fields, ...
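+# (hypothetical edit) fields can be changed with plain attribute access
+first.published = False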
+first.add_comment('me', 'This is nice!') +# and save the changes into the cluster again +first.save() +---- + +The +https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-update.html[Update +API] can also be used via the `update` method. By default any keyword +arguments, beyond the parameters of the API, will be considered fields +with new values. Those fields will be updated on the local copy of the +document and then sent over as partial document to be updated: + +[source,python] +---- +# retrieve the document +first = Post.get(id=42) +# you can update just individual fields which will call the update API +# and also update the document in place +first.update(published=True, published_by='me') +---- + +In case you wish to use a `painless` script to perform the update you +can pass in the script string as `script` or the `id` of a +https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-scripting-using.html#script-stored-scripts[stored script] +via `script_id`. All additional keyword arguments to the +`update` method will then be passed in as parameters of the script. The +document will not be updated in place. + +[source,python] +---- +# retrieve the document +first = Post.get(id=42) +# we execute a script in elasticsearch with additional kwargs being passed +# as params into the script +first.update(script='ctx._source.category.add(params.new_category)', + new_category='testing') +---- + +If the document is not found in elasticsearch an exception +(`elasticsearch.NotFoundError`) will be raised. If you wish to return +`None` instead just pass in `ignore=404` to suppress the exception: + +[source,python] +---- +p = Post.get(id='not-in-es', ignore=404) +p is None +---- + +When you wish to retrieve multiple documents at the same time by their +`id` you can use the `mget` method: + +[source,python] +---- +posts = Post.mget([42, 47, 256]) +---- + +`mget` will, by default, raise a `NotFoundError` if any of the documents +wasn't found and `RequestError` if any of the document had resulted in +error. You can control this behavior by setting parameters: + +- `raise_on_error`: + If `True` (default) then any error will cause an exception to be + raised. Otherwise all documents containing errors will be treated as + missing. +- `missing`: + Can have three possible values: `'none'` (default), `'raise'` and + `'skip'`. If a document is missing or errored it will either be + replaced with `None`, an exception will be raised or the document will + be skipped in the output list entirely. + +The index associated with the `Document` is accessible via the +`_index` class property which gives you access to the `index` class. + +The `_index` attribute is also home to the `load_mappings` +method which will update the mapping on the `Index` from elasticsearch. +This is very useful if you use dynamic mappings and want the class to be +aware of those fields (for example if you wish the `Date` fields to be +properly (de)serialized): + +[source,python] +---- +Post._index.load_mappings() +---- + +To delete a document just call its `delete` method: + +[source,python] +---- +first = Post.get(id=42) +first.delete() +---- + +====== Analysis + +To specify `analyzer` values for `Text` fields you can just use the name +of the analyzer (as a string) and either rely on the analyzer being +defined (like built-in analyzers) or define the analyzer yourself +manually. 
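+
+For example, the `Comment` class shown at the top of this chapter relies
+on the built-in `snowball` analyzer simply by naming it (restated here
+as a minimal sketch):
+
+[source,python]
+----
+class Comment(InnerDoc):
+    content = Text(analyzer='snowball')
+----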
+ +Alternatively you can create your own analyzer and have the persistence +layer handle its creation, from our example earlier: + +[source,python] +---- +from elasticsearch.dsl import analyzer, tokenizer + +my_analyzer = analyzer('my_analyzer', + tokenizer=tokenizer('trigram', 'nGram', min_gram=3, max_gram=3), + filter=['lowercase'] +) +---- + +Each analysis object needs to have a name (`my_analyzer` and +`trigram` in our example) and tokenizers, token filters and char filters +also need to specify type (`nGram` in our example). + +Once you have an instance of a custom `analyzer` you can also call the +https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-analyze.html[analyze +API] on it by using the `simulate` method: + +[source,python] +---- +response = my_analyzer.simulate('Hello World!') + +# ['hel', 'ell', 'llo', 'lo ', 'o w', ' wo', 'wor', 'orl', 'rld', 'ld!'] +tokens = [t.token for t in response.tokens] +---- + +[NOTE] +==== +When creating a mapping which relies on a custom analyzer the index must +either not exist or be closed. To create multiple `Document`-defined +mappings you can use the `index` object. +==== + +====== Search + +To search for this document type, use the `search` class method: + +[source,python] +---- +# by calling .search we get back a standard Search object +s = Post.search() +# the search is already limited to the index and doc_type of our document +s = s.filter('term', published=True).query('match', title='first') + + +results = s.execute() + +# when you execute the search the results are wrapped in your document class (Post) +for post in results: + print(post.meta.score, post.title) +---- + +Alternatively you can just take a `Search` object and restrict it to +return our document type, wrapped in correct class: + +[source,python] +---- +s = Search() +s = s.doc_type(Post) +---- + +You can also combine document classes with standard doc types (just +strings), which will be treated as before. You can also pass in multiple +`Document` subclasses and each document in the response will be wrapped +in it's class. + +If you want to run suggestions, just use the `suggest` method on the +`Search` object: + +[source,python] +---- +s = Post.search() +s = s.suggest('title_suggestions', 'pyth', completion={'field': 'title_suggest'}) + +response = s.execute() + +for result in response.suggest.title_suggestions: + print('Suggestions for %s:' % result.text) + for option in result.options: + print(' %s (%r)' % (option.text, option.payload)) +---- + +====== `class Meta` options + +In the `Meta` class inside your document definition you can define +various metadata for your document: + +- `mapping`: + optional instance of `Mapping` class to use as base for the mappings + created from the fields on the document class itself. + +Any attributes on the `Meta` class that are instance of `MetaField` will +be used to control the mapping of the meta fields (`_all`, `dynamic` +etc). 
Just name the parameter (without the leading underscore) as the +field you wish to map and pass any parameters to the `MetaField` class: + +[source,python] +---- +class Post(Document): + title = Text() + + class Meta: + all = MetaField(enabled=False) + dynamic = MetaField('strict') +---- + +====== `class Index` options + +This section of the `Document` definition can contain any information +about the index, its name, settings and other attributes: + +- `name`: + name of the index to use, if it contains a wildcard (`*`) then it + cannot be used for any write operations and an `index` kwarg will have + to be passed explicitly when calling methods like `.save()`. +- `using`: + default connection alias to use, defaults to `'default'` +- `settings`: + dictionary containing any settings for the `Index` object like + `number_of_shards`. +- `analyzers`: + additional list of analyzers that should be defined on an index (see + `analysis` for details). +- `aliases`: + dictionary with any aliases definitions + +====== Document Inheritance + +You can use standard Python inheritance to extend models, this can be +useful in a few scenarios. For example if you want to have a +`BaseDocument` defining some common fields that several different +`Document` classes should share: + +[source,python] +---- +class User(InnerDoc): + username = Text(fields={'keyword': Keyword()}) + email = Text() + +class BaseDocument(Document): + created_by = Object(User) + created_date = Date() + last_updated = Date() + + def save(**kwargs): + if not self.created_date: + self.created_date = datetime.now() + self.last_updated = datetime.now() + return super(BaseDocument, self).save(**kwargs) + +class BlogPost(BaseDocument): + class Index: + name = 'blog' +---- + +Another use case would be using the +https://www.elastic.co/guide/en/elasticsearch/reference/current/parent-join.html[join +type] to have multiple different entities in a single index. You can see +an +https://github.com/elastic/elasticsearch-py/blob/master/examples/dsl/parent_child.py[example] +of this approach. Note that in this case, if the subclasses don't define +their own [.title-ref]#Index# classes, the mappings are merged and +shared between all the subclasses. + +===== Index + +In typical scenario using `class Index` on a `Document` class is +sufficient to perform any action. In a few cases though it can be useful +to manipulate an `Index` object directly. + +`Index` is a class responsible for holding all the metadata related to +an index in elasticsearch - mappings and settings. It is most useful +when defining your mappings since it allows for easy creation of +multiple mappings at the same time. 
This is especially useful when +setting up your elasticsearch objects in a migration: + +[source,python] +---- +from elasticsearch.dsl import Index, Document, Text, analyzer + +blogs = Index('blogs') + +# define custom settings +blogs.settings( + number_of_shards=1, + number_of_replicas=0 +) + +# define aliases +blogs.aliases( + old_blogs={} +) + +# register a document with the index +blogs.document(Post) + +# can also be used as class decorator when defining the Document +@blogs.document +class Post(Document): + title = Text() + +# You can attach custom analyzers to the index + +html_strip = analyzer('html_strip', + tokenizer="standard", + filter=["standard", "lowercase", "stop", "snowball"], + char_filter=["html_strip"] +) + +blogs.analyzer(html_strip) + +# delete the index, ignore if it doesn't exist +blogs.delete(ignore=404) + +# create the index in elasticsearch +blogs.create() +---- + +You can also set up a template for your indices and use the `clone` +method to create specific copies: + +[source,python] +---- +blogs = Index('blogs', using='production') +blogs.settings(number_of_shards=2) +blogs.document(Post) + +# create a copy of the index with different name +company_blogs = blogs.clone('company-blogs') + +# create a different copy on different cluster +dev_blogs = blogs.clone('blogs', using='dev') +# and change its settings +dev_blogs.setting(number_of_shards=1) +---- + +[[index-template]] +====== IndexTemplate + +The DSL module also exposes an option to manage +https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-templates.html[index +templates] in elasticsearch using the `IndexTemplate` class which has +very similar API to `Index`. + +Once an index template is saved in elasticsearch it's contents will be +automatically applied to new indices (existing indices are completely +unaffected by templates) that match the template pattern (any index +starting with `blogs-` in our example), even if the index is created +automatically upon indexing a document into that index. + +Potential workflow for a set of time based indices governed by a single +template: + +[source,python] +---- +from datetime import datetime + +from elasticsearch.dsl import Document, Date, Text + + +class Log(Document): + content = Text() + timestamp = Date() + + class Index: + name = "logs-*" + settings = { + "number_of_shards": 2 + } + + def save(self, **kwargs): + # assign now if no timestamp given + if not self.timestamp: + self.timestamp = datetime.now() + + # override the index to go to the proper timeslot + kwargs['index'] = self.timestamp.strftime('logs-%Y%m%d') + return super().save(**kwargs) + +# once, as part of application setup, during deploy/migrations: +logs = Log._index.as_template('logs', order=0) +logs.save() + +# to perform search across all logs: +search = Log.search() +---- + diff --git a/docs/guide/dsl/search_dsl.asciidoc b/docs/guide/dsl/search_dsl.asciidoc new file mode 100644 index 000000000..dfc3b256a --- /dev/null +++ b/docs/guide/dsl/search_dsl.asciidoc @@ -0,0 +1,704 @@ +[[search_dsl]] +==== Search DSL + +===== The `Search` object + +The `Search` object represents the entire search request: + +* queries +* filters +* aggregations +* k-nearest neighbor searches +* sort +* pagination +* highlighting +* suggestions +* collapsing +* additional parameters +* associated client + +The API is designed to be chainable. 
With the exception of the +aggregations functionality this means that the `Search` object is +immutable -all changes to the object will result in a shallow copy being +created which contains the changes. This means you can safely pass the +`Search` object to foreign code without fear of it modifying your +objects as long as it sticks to the `Search` object APIs. + +You can pass an instance of the +https://elasticsearch-py.readthedocs.io/[elasticsearch client] when +instantiating the `Search` object: + +[source,python] +---- +from elasticsearch import Elasticsearch +from elasticsearch.dsl import Search + +client = Elasticsearch() + +s = Search(using=client) +---- + +You can also define the client at a later time (for more options see the +`configuration` chapter): + +[source,python] +---- +s = s.using(client) +---- + +[NOTE] +==== +All methods return a _copy_ of the object, making it safe to pass to +outside code. +==== + +The API is chainable, allowing you to combine multiple method calls in +one statement: + +[source,python] +---- +s = Search().using(client).query("match", title="python") +---- + +To send the request to Elasticsearch: + +[source,python] +---- +response = s.execute() +---- + +If you just want to iterate over the hits returned by your search you +can iterate over the `Search` object: + +[source,python] +---- +for hit in s: + print(hit.title) +---- + +Search results will be cached. Subsequent calls to `execute` or trying +to iterate over an already executed `Search` object will not trigger +additional requests being sent to Elasticsearch. To force a request +specify `ignore_cache=True` when calling `execute`. + +For debugging purposes you can serialize the `Search` object to a `dict` +explicitly: + +[source,python] +---- +print(s.to_dict()) +---- + +====== Delete By Query + +You can delete the documents matching a search by calling `delete` on +the `Search` object instead of `execute` like this: + +[source,python] +---- +s = Search(index='i').query("match", title="python") +response = s.delete() +---- + +====== Queries + +The library provides classes for all Elasticsearch query types. Pass all +the parameters as keyword arguments. The classes accept any keyword +arguments, the dsl then takes all arguments passed to the constructor +and serializes them as top-level keys in the resulting dictionary (and +thus the resulting json being sent to elasticsearch). This means that +there is a clear one-to-one mapping between the raw query and its +equivalent in the DSL: + +[source,python] +---- +from elasticsearch.dsl.query import MultiMatch, Match + +# {"multi_match": {"query": "python django", "fields": ["title", "body"]}} +MultiMatch(query='python django', fields=['title', 'body']) + +# {"match": {"title": {"query": "web framework", "type": "phrase"}}} +Match(title={"query": "web framework", "type": "phrase"}) +---- + +[NOTE] +==== +In some cases this approach is not possible due to python's restriction +on identifiers - for example if your field is called `@timestamp`. 
In
+that case you have to fall back to unpacking a dictionary:
+`Range(**{'@timestamp': {'lt': 'now'}})`
+====
+
+You can use the `Q` shortcut to construct the instance using a name with
+parameters or the raw `dict`:
+
+[source,python]
+----
+from elasticsearch.dsl import Q
+
+Q("multi_match", query='python django', fields=['title', 'body'])
+Q({"multi_match": {"query": "python django", "fields": ["title", "body"]}})
+----
+
+To add the query to the `Search` object, use the `.query()` method:
+
+[source,python]
+----
+q = Q("multi_match", query='python django', fields=['title', 'body'])
+s = s.query(q)
+----
+
+The method also accepts all the same parameters as the `Q` shortcut:
+
+[source,python]
+----
+s = s.query("multi_match", query='python django', fields=['title', 'body'])
+----
+
+If you already have a query object, or a `dict` representing one, you
+can just override the query used in the `Search` object:
+
+[source,python]
+----
+s.query = Q('bool', must=[Q('match', title='python'), Q('match', body='best')])
+----
+
+====== Dotted fields
+
+Sometimes you want to refer to a field within another field, either as a
+multi-field (`title.keyword`) or in a structured `json` document like
+`address.city`. To make it easier, the `Q` shortcut (as well as the
+`query`, `filter`, and `exclude` methods on the `Search` class) allows
+you to use `__` (double underscore) in place of a dot in a keyword
+argument:
+
+[source,python]
+----
+s = Search()
+s = s.filter('term', category__keyword='Python')
+s = s.query('match', address__city='prague')
+----
+
+Alternatively you can always fall back to Python's kwarg unpacking if
+you prefer:
+
+[source,python]
+----
+s = Search()
+s = s.filter('term', **{'category.keyword': 'Python'})
+s = s.query('match', **{'address.city': 'prague'})
+----
+
+====== Query combination
+
+Query objects can be combined using logical operators:
+
+[source,python]
+----
+Q("match", title='python') | Q("match", title='django')
+# {"bool": {"should": [...]}}
+
+Q("match", title='python') & Q("match", title='django')
+# {"bool": {"must": [...]}}
+
+~Q("match", title="python")
+# {"bool": {"must_not": [...]}}
+----
+
+When you call the `.query()` method multiple times, the `&` operator
+will be used internally:
+
+[source,python]
+----
+s = s.query().query()
+print(s.to_dict())
+# {"query": {"bool": {...}}}
+----
+
+If you want to have precise control over the query form, use the `Q`
+shortcut to directly construct the combined query:
+
+[source,python]
+----
+q = Q('bool',
+    must=[Q('match', title='python')],
+    should=[Q(...), Q(...)],
+    minimum_should_match=1
+)
+s = Search().query(q)
+----
+
+====== Filters
+
+If you want to add a query in a
+https://www.elastic.co/guide/en/elasticsearch/reference/2.0/query-filter-context.html[filter
+context], you can use the `filter()` method to make things easier:
+
+[source,python]
+----
+s = Search()
+s = s.filter('terms', tags=['search', 'python'])
+----
+
+Behind the scenes this will produce a `Bool` query and place the
+specified `terms` query into its `filter` branch, making it equivalent
+to:
+
+[source,python]
+----
+s = Search()
+s = s.query('bool', filter=[Q('terms', tags=['search', 'python'])])
+----
+
+If you want to use the `post_filter` element for faceted navigation,
+use the `.post_filter()` method.
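+
+A short sketch of what this looks like (mirroring the `filter()` example
+above):
+
+[source,python]
+----
+s = Search()
+# applied after aggregations are computed, unlike a regular filter
+s = s.post_filter('terms', tags=['search', 'python'])
+----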
+ +You can also `exclude()` items from your query like this: + +[source,python] +---- +s = Search() +s = s.exclude('terms', tags=['search', 'python']) +---- + +which is shorthand for: +`s = s.query('bool', filter=[~Q('terms', tags=['search', 'python'])])` + +====== Aggregations + +To define an aggregation, you can use the `A` shortcut: + +[source,python] +---- +from elasticsearch.dsl import A + +A('terms', field='tags') +# {"terms": {"field": "tags"}} +---- + +To nest aggregations, you can use the `.bucket()`, `.metric()` and +`.pipeline()` methods: + +[source,python] +---- +a = A('terms', field='category') +# {'terms': {'field': 'category'}} + +a.metric('clicks_per_category', 'sum', field='clicks')\ + .bucket('tags_per_category', 'terms', field='tags') +# { +# 'terms': {'field': 'category'}, +# 'aggs': { +# 'clicks_per_category': {'sum': {'field': 'clicks'}}, +# 'tags_per_category': {'terms': {'field': 'tags'}} +# } +# } +---- + +To add aggregations to the `Search` object, use the `.aggs` property, +which acts as a top-level aggregation: + +[source,python] +---- +s = Search() +a = A('terms', field='category') +s.aggs.bucket('category_terms', a) +# { +# 'aggs': { +# 'category_terms': { +# 'terms': { +# 'field': 'category' +# } +# } +# } +# } +---- + +or + +[source,python] +---- +s = Search() +s.aggs.bucket('articles_per_day', 'date_histogram', field='publish_date', interval='day')\ + .metric('clicks_per_day', 'sum', field='clicks')\ + .pipeline('moving_click_average', 'moving_avg', buckets_path='clicks_per_day')\ + .bucket('tags_per_day', 'terms', field='tags') + +s.to_dict() +# { +# "aggs": { +# "articles_per_day": { +# "date_histogram": { "interval": "day", "field": "publish_date" }, +# "aggs": { +# "clicks_per_day": { "sum": { "field": "clicks" } }, +# "moving_click_average": { "moving_avg": { "buckets_path": "clicks_per_day" } }, +# "tags_per_day": { "terms": { "field": "tags" } } +# } +# } +# } +# } +---- + +You can access an existing bucket by its name: + +[source,python] +---- +s = Search() + +s.aggs.bucket('per_category', 'terms', field='category') +s.aggs['per_category'].metric('clicks_per_category', 'sum', field='clicks') +s.aggs['per_category'].bucket('tags_per_category', 'terms', field='tags') +---- + +[NOTE] +==== +When chaining multiple aggregations, there is a difference between what +`.bucket()` and `.metric()` methods return - `.bucket()` returns the +newly defined bucket while `.metric()` returns its parent bucket to +allow further chaining. +==== + +As opposed to other methods on the `Search` objects, defining +aggregations is done in-place (does not return a copy). + +====== K-Nearest Neighbor Searches + +To issue a kNN search, use the `.knn()` method: + +[source,python] +---- +s = Search() +vector = get_embedding("search text") + +s = s.knn( + field="embedding", + k=5, + num_candidates=10, + query_vector=vector +) +---- + +The `field`, `k` and `num_candidates` arguments can be given as +positional or keyword arguments and are required. In addition to these, +`query_vector` or `query_vector_builder` must be given as +well. + +The `.knn()` method can be invoked multiple times to include multiple +kNN searches in the request. + +====== Sorting + +To specify sorting order, use the `.sort()` method: + +[source,python] +---- +s = Search().sort( + 'category', + '-title', + {"lines" : {"order" : "asc", "mode" : "avg"}} +) +---- + +It accepts positional arguments which can be either strings or +dictionaries. 
String value is a field name, optionally prefixed by the +`-` sign to specify a descending order. + +To reset the sorting, just call the method with no arguments: + +[source,python] +---- +s = s.sort() +---- + +====== Pagination + +To specify the from/size parameters, use the Python slicing API: + +[source,python] +---- +s = s[10:20] +# {"from": 10, "size": 10} + +s = s[:20] +# {"size": 20} + +s = s[10:] +# {"from": 10} + +s = s[10:20][2:] +# {"from": 12, "size": 8} +---- + +If you want to access all the documents matched by your query you can +use the `scan` method which uses the scan/scroll elasticsearch API: + +[source,python] +---- +for hit in s.scan(): + print(hit.title) +---- + +Note that in this case the results won't be sorted. + +====== Highlighting + +To set common attributes for highlighting use the +`highlight_options` method: + +[source,python] +---- +s = s.highlight_options(order='score') +---- + +Enabling highlighting for individual fields is done using the +`highlight` method: + +[source,python] +---- +s = s.highlight('title') +# or, including parameters: +s = s.highlight('title', fragment_size=50) +---- + +The fragments in the response will then be available on each `Result` +object as `.meta.highlight.FIELD` which will contain the list of +fragments: + +[source,python] +---- +response = s.execute() +for hit in response: + for fragment in hit.meta.highlight.title: + print(fragment) +---- + +====== Suggestions + +To specify a suggest request on your `Search` object use the `suggest` +method: + +[source,python] +---- +# check for correct spelling +s = s.suggest('my_suggestion', 'pyhton', term={'field': 'title'}) +---- + +The first argument is the name of the suggestions (name under which it +will be returned), second is the actual text you wish the suggester to +work on and the keyword arguments will be added to the suggest's json +as-is which means that it should be one of `term`, `phrase` or +`completion` to indicate which type of suggester should be used. + +====== Collapsing + +To collapse search results use the `collapse` method on your `Search` +object: + +[source,python] +---- +s = Search().query("match", message="GET /search") +# collapse results by user_id +s = s.collapse("user_id") +---- + +The top hits will only include one result per `user_id`. You can +also expand each collapsed top hit with the `inner_hits` parameter, +`max_concurrent_group_searches` being the number of +concurrent requests allowed to retrieve the inner hits per group: + +[source,python] +---- +inner_hits = {"name": "recent_search", "size": 5, "sort": [{"@timestamp": "desc"}]} +s = s.collapse("user_id", inner_hits=inner_hits, max_concurrent_group_searches=4) +---- + +====== More Like This Query + +To use Elasticsearch's `more_like_this` functionality, you can use +the MoreLikeThis query type. + +A simple example is below + +[source,python] +---- +from elasticsearch.dsl.query import MoreLikeThis +from elasticsearch.dsl import Search + +my_text = 'I want to find something similar' + +s = Search() +# We're going to match based only on two fields, in this case text and title +s = s.query(MoreLikeThis(like=my_text, fields=['text', 'title'])) +# You can also exclude fields from the result to make the response quicker in the normal way +s = s.source(exclude=["text"]) +response = s.execute() + +for hit in response: + print(hit.title) +---- + +====== Extra properties and parameters + +To set extra properties of the search request, use the `.extra()` +method. 
This can be used to define keys in the body that cannot be
+defined via a specific API method like `explain` or `search_after`:
+
+[source,python]
+----
+s = s.extra(explain=True)
+----
+
+To set query parameters, use the `.params()` method:
+
+[source,python]
+----
+s = s.params(routing="42")
+----
+
+If you need to limit the fields being returned by Elasticsearch, use the
+`source()` method:
+
+[source,python]
+----
+# only return the selected fields
+s = s.source(['title', 'body'])
+# don't return any fields, just the metadata
+s = s.source(False)
+# explicitly include/exclude fields
+s = s.source(includes=["title"], excludes=["user.*"])
+# reset the field selection
+s = s.source(None)
+----
+
+====== Serialization and Deserialization
+
+The search object can be serialized into a dictionary by using the
+`.to_dict()` method.
+
+You can also create a `Search` object from a `dict` using the
+`from_dict` class method. This will create a new `Search` object and
+populate it using the data from the dict:
+
+[source,python]
+----
+s = Search.from_dict({"query": {"match": {"title": "python"}}})
+----
+
+If you wish to modify an existing `Search` object, overriding its
+properties, use the `update_from_dict` method instead, which alters an
+instance *in-place*:
+
+[source,python]
+----
+s = Search(index='i')
+s.update_from_dict({"query": {"match": {"title": "python"}}, "size": 42})
+----
+
+===== Response
+
+You can execute your search by calling the `.execute()` method, which
+will return a `Response` object. The `Response` object gives you access
+to any key from the response dictionary via attribute access. It also
+provides some convenient helpers:
+
+[source,python]
+----
+response = s.execute()
+
+print(response.success())
+# True
+
+print(response.took)
+# 12
+
+print(response.hits.total.relation)
+# eq
+print(response.hits.total.value)
+# 142
+
+print(response.suggest.my_suggestions)
+----
+
+If you want to inspect the contents of the `response` object, just use
+its `to_dict` method to get access to the raw data for pretty
+printing.
+
+====== Hits
+
+To access the hits returned by the search, access the `hits` property
+or just iterate over the `Response` object:
+
+[source,python]
+----
+response = s.execute()
+print('Total %d hits found.' % response.hits.total)
+for h in response:
+    print(h.title, h.body)
+----
+
+[NOTE]
+====
+If you are only seeing partial results (e.g. 10000 or even 10 results),
+consider using the option `s.extra(track_total_hits=True)` to
+get a full hit count.
+====
+
+====== Result
+
+Each individual hit is wrapped in a convenience class that allows
+attribute access to the keys in the returned dictionary. All the
+metadata for the results is accessible via `meta` (without the leading
+`_`):
+
+[source,python]
+----
+response = s.execute()
+h = response.hits[0]
+print('/%s/%s/%s returned with score %f' % (
+    h.meta.index, h.meta.doc_type, h.meta.id, h.meta.score))
+----
+
+[NOTE]
+====
+If your document has a field called `meta`, you have to access it using
+the get item syntax: `hit['meta']`.
+==== + +====== Aggregations + +Aggregations are available through the `aggregations` property: + +[source,python] +---- +for tag in response.aggregations.per_tag.buckets: + print(tag.key, tag.max_lines.value) +---- + +===== `MultiSearch` + +If you need to execute multiple searches at the same time you can use +the `MultiSearch` class which will use the `_msearch` API: + +[source,python] +---- +from elasticsearch.dsl import MultiSearch, Search + +ms = MultiSearch(index='blogs') + +ms = ms.add(Search().filter('term', tags='python')) +ms = ms.add(Search().filter('term', tags='elasticsearch')) + +responses = ms.execute() + +for response in responses: + print("Results for query %r." % response._search.query) + for hit in response: + print(hit.title) +---- + +===== `EmptySearch` + +The `EmptySearch` class can be used as a fully compatible version of +`Search` that will return no results, regardless of any queries +configured. + diff --git a/docs/guide/dsl/tutorials.asciidoc b/docs/guide/dsl/tutorials.asciidoc new file mode 100644 index 000000000..1b5ff0e2c --- /dev/null +++ b/docs/guide/dsl/tutorials.asciidoc @@ -0,0 +1,251 @@ +=== Tutorials + +==== Search + +Let's have a typical search request written directly as a `dict`: + +[source,python] +---- +from elasticsearch import Elasticsearch +client = Elasticsearch("https://localhost:9200") + +response = client.search( + index="my-index", + body={ + "query": { + "bool": { + "must": [{"match": {"title": "python"}}], + "must_not": [{"match": {"description": "beta"}}], + "filter": [{"term": {"category": "search"}}] + } + }, + "aggs" : { + "per_tag": { + "terms": {"field": "tags"}, + "aggs": { + "max_lines": {"max": {"field": "lines"}} + } + } + } + } +) + +for hit in response['hits']['hits']: + print(hit['_score'], hit['_source']['title']) + +for tag in response['aggregations']['per_tag']['buckets']: + print(tag['key'], tag['max_lines']['value']) +---- + +The problem with this approach is that it is very verbose, prone to +syntax mistakes like incorrect nesting, hard to modify (eg. adding +another filter) and definitely not fun to write. + +Let's rewrite the example using the DSL module: + +[source,python] +---- +from elasticsearch import Elasticsearch +from elasticsearch.dsl import Search + +client = Elasticsearch("https://localhost:9200") + +s = Search(using=client, index="my-index") \ + .filter("term", category="search") \ + .query("match", title="python") \ + .exclude("match", description="beta") + +s.aggs.bucket('per_tag', 'terms', field='tags') \ + .metric('max_lines', 'max', field='lines') + +response = s.execute() + +for hit in response: + print(hit.meta.score, hit.title) + +for tag in response.aggregations.per_tag.buckets: + print(tag.key, tag.max_lines.value) +---- + +As you see, the library took care of: + +* creating appropriate `Query` objects by name (eq. 
"match") +* composing queries into a compound `bool` query +* putting the `term` query in a filter context of the `bool` query +* providing a convenient access to response data +* no curly or square brackets everywhere + +==== Persistence + +Let's have a simple Python class representing an article in a blogging +system: + +[source,python] +---- +from datetime import datetime +from elasticsearch.dsl import Document, Date, Integer, Keyword, Text, connections + +# Define a default Elasticsearch client +connections.create_connection(hosts="https://localhost:9200") + +class Article(Document): + title = Text(analyzer='snowball', fields={'raw': Keyword()}) + body = Text(analyzer='snowball') + tags = Keyword() + published_from = Date() + lines = Integer() + + class Index: + name = 'blog' + settings = { + "number_of_shards": 2, + } + + def save(self, ** kwargs): + self.lines = len(self.body.split()) + return super(Article, self).save(** kwargs) + + def is_published(self): + return datetime.now() > self.published_from + +# create the mappings in elasticsearch +Article.init() + +# create and save and article +article = Article(meta={'id': 42}, title='Hello world!', tags=['test']) +article.body = ''' looong text ''' +article.published_from = datetime.now() +article.save() + +article = Article.get(id=42) +print(article.is_published()) + +# Display cluster health +print(connections.get_connection().cluster.health()) +---- + +In this example you can see: + +* providing a default connection +* defining fields with mapping configuration +* setting index name +* defining custom methods +* overriding the built-in `.save()` method to hook into the persistence +life cycle +* retrieving and saving the object into Elasticsearch +* accessing the underlying client for other APIs + +You can see more in the `persistence` chapter. + +==== Pre-built Faceted Search + +If you have your `Document`s defined you can very easily create a +faceted search class to simplify searching and filtering. + +[NOTE] +==== +This feature is experimental and may be subject to change. +==== + +[source,python] +---- +from elasticsearch.dsl import FacetedSearch, TermsFacet, DateHistogramFacet + +class BlogSearch(FacetedSearch): + doc_types = [Article, ] + # fields that should be searched + fields = ['tags', 'title', 'body'] + + facets = { + # use bucket aggregations to define facets + 'tags': TermsFacet(field='tags'), + 'publishing_frequency': DateHistogramFacet(field='published_from', interval='month') + } + +# empty search +bs = BlogSearch() +response = bs.execute() + +for hit in response: + print(hit.meta.score, hit.title) + +for (tag, count, selected) in response.facets.tags: + print(tag, ' (SELECTED):' if selected else ':', count) + +for (month, count, selected) in response.facets.publishing_frequency: + print(month.strftime('%B %Y'), ' (SELECTED):' if selected else ':', count) +---- + +You can find more details in the `faceted_search` chapter. + +==== Update By Query + +Let's resume the simple example of articles on a blog, and let's assume +that each article has a number of likes. For this example, imagine we +want to increment the number of likes by 1 for all articles that match a +certain tag and do not match a certain description. 
==== Update By Query

Let's return to the simple example of articles on a blog, and let's
assume that each article has a number of likes. For this example,
imagine we want to increment the number of likes by 1 for all articles
that match a certain tag and do not match a certain description.
Writing this as a `dict`, we would have the following code:

[source,python]
----
from elasticsearch import Elasticsearch
client = Elasticsearch()

response = client.update_by_query(
    index="my-index",
    body={
        "query": {
            "bool": {
                "must": [{"match": {"tag": "python"}}],
                "must_not": [{"match": {"description": "beta"}}]
            }
        },
        "script": {
            "source": "ctx._source.likes++",
            "lang": "painless"
        }
    },
)
----

Using the DSL, we can now express this query as follows:

[source,python]
----
from elasticsearch import Elasticsearch
from elasticsearch.dsl import UpdateByQuery

client = Elasticsearch()
ubq = UpdateByQuery(using=client, index="my-index") \
    .query("match", tag="python") \
    .exclude("match", description="beta") \
    .script(source="ctx._source.likes++", lang="painless")

response = ubq.execute()
----

As you can see, the `Update By Query` object provides many of the
conveniences offered by the `Search` object, and additionally lets you
update the documents matched by the search with a script assigned in
the same chainable manner.

==== Migration from the standard client

You don't have to port your entire application to get the benefits of
the DSL module; you can start gradually by creating a `Search` object
from your existing `dict`, modifying it using the API, and serializing
it back to a `dict`:

[source,python]
----
body = {...} # insert complicated query here

# Convert to Search object
s = Search.from_dict(body)

# Add some filters, aggregations, queries, ...
s = s.filter("term", tags="python")

# Convert back to dict to plug back into existing code
body = s.to_dict()
----
diff --git a/docs/guide/dsl/update_by_query.asciidoc b/docs/guide/dsl/update_by_query.asciidoc
new file mode 100644
index 000000000..b4a550a77
--- /dev/null
+++ b/docs/guide/dsl/update_by_query.asciidoc
@@ -0,0 +1,168 @@
[[update_by_query]]
==== Update By Query

===== The `Update By Query` object

The `Update By Query` object enables the use of the
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-update-by-query.html[_update_by_query]
endpoint to perform an update on documents that match a search query.

The object is implemented as a modification of the `Search` object,
containing a subset of its query methods, as well as a script method,
which is used to make updates.

The `Update By Query` object implements the following `Search` query
types:

* queries
* filters
* excludes

For more information on queries, see the `search_dsl` chapter.

As with the `Search` object, the API is designed to be chainable. This
means that the `Update By Query` object is immutable: all changes to the
object will result in a shallow copy being created that contains the
changes. You can therefore safely pass the `Update By Query` object to
foreign code without fear of it modifying your objects, as long as it
sticks to the `Update By Query` object APIs.

You can define your client in a number of ways, but the preferred method
is to use a global configuration. For more information on defining a
client, see the `configuration` chapter.

Once your client is defined, you can instantiate a copy of the
`Update By Query` object as seen below:

[source,python]
----
from elasticsearch.dsl import UpdateByQuery

ubq = UpdateByQuery().using(client)
# or
ubq = UpdateByQuery(using=client)
----

[NOTE]
====
All methods return a _copy_ of the object, making it safe to pass to
outside code.
====

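As a quick illustration of the copy semantics (a sketch, assuming a
configured default connection):

[source,python]
----
ubq = UpdateByQuery(index='blogs')
ubq2 = ubq.query("match", title="python")

# the original object is unchanged; only the copy carries the query
print('query' in ubq.to_dict())   # False
print('query' in ubq2.to_dict())  # True
----
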
The API is chainable, allowing you to combine multiple method calls in
one statement:

[source,python]
----
ubq = UpdateByQuery().using(client).query("match", title="python")
----

To send the request to Elasticsearch:

[source,python]
----
response = ubq.execute()
----

Note that there are limits to chaining with the `script` method: only a
single script can be sent with each call, so calling `script` multiple
times will overwrite the previous value.

Consider the example below:

[source,python]
----
ubq = UpdateByQuery() \
    .using(client) \
    .script(source="ctx._source.likes++") \
    .script(source="ctx._source.likes+=2")
----

Here the script stored by the client will be
`'source': 'ctx._source.likes{plus}=2'`, and the script from the first
call will be discarded.

For debugging purposes, you can serialize the `Update By Query` object
to a `dict` explicitly:

[source,python]
----
print(ubq.to_dict())
----

To use variables in a script, pass them in the `params` argument:

[source,python]
----
ubq = ubq.script(
    source="ctx._source.messages.removeIf(x -> x.somefield == params.some_var)",
    params={
        'some_var': 'some_string_val'
    }
)
----

====== Serialization and Deserialization

The search object can be serialized into a dictionary by using the
`.to_dict()` method.

You can also create an `Update By Query` object from a `dict` using the
`from_dict` class method. This will create a new `Update By Query`
object and populate it using the data from the dict:

[source,python]
----
ubq = UpdateByQuery.from_dict({"query": {"match": {"title": "python"}}})
----

If you wish to modify an existing `Update By Query` object, overriding
its properties, use the `update_from_dict` method instead, which alters
an instance *in-place*:

[source,python]
----
ubq = UpdateByQuery(index='i')
ubq.update_from_dict({"query": {"match": {"title": "python"}}, "size": 42})
----

====== Extra properties and parameters

To set extra properties of the search request, use the `.extra()`
method. This can be used to define keys in the body that cannot be
defined via a specific API method like `explain`:

[source,python]
----
ubq = ubq.extra(explain=True)
----

To set query parameters, use the `.params()` method:

[source,python]
----
ubq = ubq.params(routing="42")
----

===== Response

You can execute your search by calling the `.execute()` method, which
will return a `Response` object. The `Response` object gives you access
to any key from the response dictionary via attribute access. It also
provides some convenient helpers:

[source,python]
----
response = ubq.execute()

print(response.success())
# True

print(response.took)
# 12
----

If you want to inspect the contents of the `response` object, just use
its `to_dict` method to get access to the raw data for pretty
printing.

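The response also exposes the counters returned by the
`_update_by_query` endpoint through the same attribute access; for
example (a sketch, the exact fields depend on the server response):

[source,python]
----
response = ubq.execute()

# counters from the _update_by_query API response
print('%d of %d matching documents updated' % (response.updated, response.total))
print(response.version_conflicts)  # conflicts found during the update
----
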
diff --git a/docs/guide/elasticsearch-dsl.asciidoc b/docs/guide/elasticsearch-dsl.asciidoc
new file mode 100644
index 000000000..a45a79aca
--- /dev/null
+++ b/docs/guide/elasticsearch-dsl.asciidoc
@@ -0,0 +1,48 @@
[[elasticsearch-dsl]]
== Elasticsearch DSL

Elasticsearch DSL is a module of the official Python client that aims to help
with writing and running queries against Elasticsearch in a more convenient and
idiomatic way. It stays close to the Elasticsearch JSON DSL, mirroring its
terminology and structure. It exposes the whole range of the DSL from
Python either directly using defined classes or queryset-like
expressions. Here is an example:

[source,python]
----
from elasticsearch.dsl import Search

s = Search(index="my-index") \
    .filter("term", category="search") \
    .query("match", title="python") \
    .exclude("match", description="beta")
for hit in s:
    print(hit.title)
----

Or with asynchronous Python:

[source,python]
----
from elasticsearch.dsl import AsyncSearch

async def run_query():
    s = AsyncSearch(index="my-index") \
        .filter("term", category="search") \
        .query("match", title="python") \
        .exclude("match", description="beta")
    async for hit in s:
        print(hit.title)
----

It also provides an optional wrapper for working with documents as
Python objects: defining mappings, retrieving and saving documents, and
wrapping the document data in user-defined classes.

To use the other Elasticsearch APIs (e.g. cluster health), just use the
regular client.

include::dsl/configuration.asciidoc[]
include::dsl/tutorials.asciidoc[]
include::dsl/howto.asciidoc[]
include::dsl/examples.asciidoc[]
diff --git a/docs/guide/index-custom-title-page.html b/docs/guide/index-custom-title-page.html
index bd1d84081..219bc716b 100644
--- a/docs/guide/index-custom-title-page.html
+++ b/docs/guide/index-custom-title-page.html
@@ -135,6 +135,9 @@
       • Integrations
+      • Elasticsearch DSL
       • Client helpers
@@ -186,4 +189,4 @@
-      View all Elastic docs
\ No newline at end of file
+      View all Elastic docs
    diff --git a/docs/guide/index.asciidoc b/docs/guide/index.asciidoc index d2ae4ab63..5607a9f24 100644 --- a/docs/guide/index.asciidoc +++ b/docs/guide/index.asciidoc @@ -22,6 +22,8 @@ include::integrations.asciidoc[] include::examples.asciidoc[] +include::elasticsearch-dsl.asciidoc[] + include::helpers.asciidoc[] -include::release-notes.asciidoc[] \ No newline at end of file +include::release-notes.asciidoc[] diff --git a/docs/guide/overview.asciidoc b/docs/guide/overview.asciidoc index af7581eb0..f989a5fc3 100644 --- a/docs/guide/overview.asciidoc +++ b/docs/guide/overview.asciidoc @@ -84,18 +84,6 @@ some of the more engaging tasks like bulk indexing and reindexing. [discrete] === Elasticsearch DSL -For a more high level client library with more limited scope, have a look at -https://elasticsearch-dsl.readthedocs.org/[elasticsearch-dsl] - a more Pythonic library -sitting on top of `elasticsearch-py`. - -It provides a more convenient and idiomatic way to write and manipulate -https://elasticsearch-dsl.readthedocs.org/en/latest/search_dsl.html[queries]. It -stays close to the Elasticsearch JSON DSL, mirroring its terminology and -structure while exposing the whole range of the DSL from Python either directly -using defined classes or a queryset-like expressions. - -It also provides an optional -https://elasticsearch-dsl.readthedocs.org/en/latest/persistence.html#doctype[persistence -layer] for working with documents as Python objects in an ORM-like fashion: -defining mappings, retrieving and saving documents, wrapping the document data -in user-defined classes. +For a higher level access with more limited scope, have a look at the DSL module, +which provides a more convenient and idiomatic way to write and manipulate +queries. From a1bf979fce770f940b794943cd48ca98f5489193 Mon Sep 17 00:00:00 2001 From: Miguel Grinberg Date: Fri, 24 Jan 2025 16:40:39 +0000 Subject: [PATCH 2/2] clarify that this is a Python DSL --- docs/guide/dsl/asyncio.asciidoc | 2 +- docs/guide/elasticsearch-dsl.asciidoc | 2 +- docs/guide/index-custom-title-page.html | 2 +- docs/guide/overview.asciidoc | 2 +- 4 files changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/guide/dsl/asyncio.asciidoc b/docs/guide/dsl/asyncio.asciidoc index 6e92e84ab..ff7799dfb 100644 --- a/docs/guide/dsl/asyncio.asciidoc +++ b/docs/guide/dsl/asyncio.asciidoc @@ -1,5 +1,5 @@ [[asyncio]] -==== Using asyncio with Elasticsearch DSL +==== Using asyncio with Elasticsearch Python DSL The DSL module supports async/await with https://docs.python.org/3/library/asyncio.html[asyncio]. To ensure that diff --git a/docs/guide/elasticsearch-dsl.asciidoc b/docs/guide/elasticsearch-dsl.asciidoc index a45a79aca..bd3fb5d19 100644 --- a/docs/guide/elasticsearch-dsl.asciidoc +++ b/docs/guide/elasticsearch-dsl.asciidoc @@ -1,5 +1,5 @@ [[elasticsearch-dsl]] -== Elasticsearch DSL +== Elasticsearch Python DSL Elasticsearch DSL is a module of the official Python client that aims to help with writing and running queries against Elasticsearch in a more convenient and diff --git a/docs/guide/index-custom-title-page.html b/docs/guide/index-custom-title-page.html index 219bc716b..b7fd0f405 100644 --- a/docs/guide/index-custom-title-page.html +++ b/docs/guide/index-custom-title-page.html @@ -136,7 +136,7 @@
       • Integrations
-      • Elasticsearch DSL
+      • Elasticsearch Python DSL
       • Client helpers
diff --git a/docs/guide/overview.asciidoc b/docs/guide/overview.asciidoc
index f989a5fc3..2b331e08c 100644
--- a/docs/guide/overview.asciidoc
+++ b/docs/guide/overview.asciidoc
@@ -82,7 +82,7 @@ some of the more engaging tasks like bulk indexing and reindexing.
 
 [discrete]
-=== Elasticsearch DSL
+=== Elasticsearch Python DSL
 
 For a higher level access with more limited scope, have a look at the DSL module,
 which provides a more convenient and idiomatic way to write and manipulate
 queries.