diff --git a/040_Distributed_CRUD/00_Intro.asciidoc b/040_Distributed_CRUD/00_Intro.asciidoc
index 07c694f41..b996e61f2 100644
--- a/040_Distributed_CRUD/00_Intro.asciidoc
+++ b/040_Distributed_CRUD/00_Intro.asciidoc
@@ -1,25 +1,21 @@
 [[distributed-docs]]
-== Distributed Document Store
+== Distributed Document Store

-In the preceding chapter, we looked at all the ways to put data into your index and
-then retrieve it. But we glossed over many technical details surrounding how
-the data is distributed and fetched from the cluster. This separation is done
-on purpose; you don't really need to know how data is distributed to work
-with Elasticsearch. It just works.
+In the preceding chapters, we looked at how to index and retrieve your data,
+but we glossed over many low-level technical details, such as how documents are
+distributed across the cluster and how they are fetched from it.
+Elasticsearch deliberately hides these details so that you can focus on your
+application; you can use it perfectly well without understanding them in depth.

-In this chapter, we dive into those internal, technical details
-to help you understand how your data is stored in a distributed system.
+In this chapter, we dive into those internal technical details to help you
+better understand how your data is stored in a distributed system.

-.Content Warning
+
+.Note
 ****
-The information presented in this chapter is for your interest. You are not required to
-understand and remember all the detail in order to use Elasticsearch. The
-options that are discussed are for advanced users only.
+This chapter covers some advanced topics. As mentioned above, you can use
+Elasticsearch without understanding or remembering every detail. If you are
+interested, treat it as extra reading to broaden your knowledge.

-Read the section to gain a taste for how things work, and to know where the
-information is in case you need to refer to it in the future, but don't be
-overwhelmed by the details.
+Don't worry if you find this chapter heavy going. It exists only to show you
+how Elasticsearch works under the hood; if you ever need this material in your
+work, you can always come back and consult it.
 ****
-
diff --git a/050_Search/00_Intro.asciidoc b/050_Search/00_Intro.asciidoc
index 6b6516a6c..1c11ebd53 100644
--- a/050_Search/00_Intro.asciidoc
+++ b/050_Search/00_Intro.asciidoc
@@ -1,60 +1,43 @@
 [[search]]
-== Searching--The Basic Tools
+== Searching--The Basic Tools

-So far, we have learned how to use Elasticsearch as a simple NoSQL-style
-distributed document store.
-We can ((("searching")))throw JSON documents at Elasticsearch and
-retrieve each one by ID. But the real power of Elasticsearch lies in its
-ability to make sense out of chaos -- to turn Big Data into Big Information.
+Now we have learned how to use Elasticsearch as a simple NoSQL-style
+distributed document store. We can ((("searching")))throw a JSON document at
+Elasticsearch and retrieve it by ID. But the real power of Elasticsearch lies
+in its ability to find meaning in chaotic data--to turn "Big Data" into
+"Big Information."

-This is the reason that we use structured JSON documents, rather than
-amorphous blobs of data. Elasticsearch not only _stores_ the document, but
-also _indexes_ the content of the document in order to make it searchable.
+Elasticsearch not only _stores_ documents, it also _indexes_ their content to
+make them searchable, which is why we use structured JSON documents rather than
+unstructured blobs of data.

-_Every field in a document is indexed and can be queried_. ((("indexing"))) And it's not just
-that. During a single query, Elasticsearch can use _all_ of these indices, to
-return results at breath-taking speed. That's something that you could never
-consider doing with a traditional database.
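+Since every field is indexed, even the simplest request can query any of them.
+As a rough sketch (the `my_index` index and `tweet` field here are purely
+illustrative, not part of this chapter's dataset), a single-field query looks
+like this:
+
+[source,js]
+--------------------------------------------------
+GET /my_index/_search
+{
+    "query" : {
+        "match" : {
+            "tweet" : "elasticsearch"
+        }
+    }
+}
+--------------------------------------------------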
+_Every field in a document is indexed and can be queried._ ((("indexing")))
+And more than that: in a single query, Elasticsearch can use _all_ of these
+indices to return results at breathtaking speed--something you would never
+consider doing with a traditional database.

-A _search_ can be any of the following:
+A _search_ can be any of the following:

-* A structured query on concrete fields((("fields", "searching on")))((("searching", "types of searches"))) like `gender` or `age`, sorted by
-  a field like `join_date`, similar to the type of query that you could construct
-  in SQL
+* A structured query on fields((("fields", "searching on")))((("searching", "types of searches"))) like `gender` or `age`, sorted by
+  a field like `join_date`--much like a structured query in SQL.

-* A full-text query, which finds all documents matching the search keywords,
-  and returns them sorted by _relevance_
+* A full-text query, which finds all documents matching the search keywords
+  and returns them sorted by _relevance_.

-* A combination of the two
+* A combination of the two.

-While many searches will just work out of((("full text search"))) the box, to use Elasticsearch to
-its full potential, you need to understand three subjects:
+While many searches just work out of((("full text search"))) the box, to
+exploit the full potential of Elasticsearch you need to understand three
+concepts:

- _Mapping_::
-   How the data in each field is interpreted
-
- _Analysis_::
-   How full text is processed to make it searchable
-
- _Query DSL_::
-   The flexible, powerful query language used by Elasticsearch
+ _Mapping_::
+   How the data in each field is interpreted

-Each of these is a big subject in its own right, and we explain them in
-detail in <>. The chapters in this section introduce the
-basic concepts of all three--just enough to help you to get an overall
-understanding of how search works.
+ _Analysis_::
+   How full text is processed to make it searchable

-We will start by explaining the `search` API in its simplest form.
+ _Query DSL_::
+   The powerful, flexible query language used by Elasticsearch

-.Test Data
+Each of these is a big topic in its own right, and we explain them in detail
+in <>. This section introduces the basic concepts of all
+three--just enough to give you an overall understanding of how search works.
+
+We will start by introducing the `search` API in its simplest form.
+
+.Test Data
 ****
-The documents that we will use for test purposes in this chapter can be found
-in this gist: https://gist.github.com/clintongormley/8579281.
+The test data used in this chapter can be found in this gist:
+https://gist.github.com/clintongormley/8579281.

-You can copy the commands and paste them into your shell in order to follow
-along with this chapter.
+You can copy the commands and paste them into your shell to follow along with
+the examples in this chapter.

-Alternatively, if you're in the online version of this book, you can link:sense_widget.html?snippets/050_Search/Test_data.json[click here to open in Sense].
+Alternatively, if you are reading the online version of this book, you can
+link:sense_widget.html?snippets/050_Search/Test_data.json[click here to open the examples in Sense].
 ****
diff --git a/060_Distributed_Search/00_Intro.asciidoc b/060_Distributed_Search/00_Intro.asciidoc
index a6098a6c5..93b6303c1 100644
--- a/060_Distributed_Search/00_Intro.asciidoc
+++ b/060_Distributed_Search/00_Intro.asciidoc
@@ -1,34 +1,29 @@
 [[distributed-search]]
-== Distributed Search Execution
+== Distributed Search Execution

-Before moving on, we are going to take a detour and talk about how search is
-executed in a distributed environment.((("distributed search execution"))) It is a bit more complicated than the
-basic _create-read-update-delete_ (CRUD) requests((("CRUD (create-read-update-delete) operations"))) that we discussed in
-<>.
+Before moving on, let's take a detour and discuss how search is executed in a
+distributed environment.((("distributed search execution"))) It is a bit more
+complicated than the basic _create-read-update-delete_ (CRUD)((("CRUD (create-read-update-delete) operations")))
+requests we discussed in <>.

-.Content Warning
+
+.Content Note
 ****
-The information presented in this chapter is for your interest. You are not required to
-understand and remember all the detail in order to use Elasticsearch.
+You can read this chapter out of interest; you are not required to understand
+and remember every detail in order to use Elasticsearch.

-Read this chapter to gain a taste for how things work, and to know where the
-information is in case you need to refer to it in the future, but don't be
-overwhelmed by the detail.
+Read it to gain a first taste of how things work, so that you can find this
+material when you need it in the future, but don't be overwhelmed by the
+detail.
 ****

-A CRUD operation deals with a single document that has a unique combination of
-`_index`, `_type`, and <> (which defaults to the
-document's `_id`). This means that we know exactly which shard in the cluster
-holds that document.
+A CRUD operation deals with a single document, whose uniqueness is determined
+by the combination of `_index`, `_type`, and <> (which
+usually defaults to the document's `_id`). This means that we know exactly
+which shard in the cluster holds that document.
+
+Search requires a more complicated execution model, because we don't know
+which documents will match the query: they could be on any shard in the
+cluster. A search request has to consult a copy of every shard of the index or
+indices we are interested in, to find out whether they hold any matching
+documents.

-Search requires a more complicated execution model because we don't know which
-documents will match the query: they could be on any shard in the cluster. A
-search request has to consult a copy of every shard in the index or indices
-we're interested in to see if they have any matching documents.

-But finding all matching documents is only half the story. Results from
-multiple shards must be combined into a single sorted list before the `search`
-API can return a ``page'' of results. For this reason, search is executed in a
-two-phase process called _query then fetch_.
+But finding all matching documents is only half the story. Before the `search`
+API can return a ``page'' of results, the results from multiple shards must be
+combined into a single sorted list. For this reason, search is executed in a
+two-phase process called _query then fetch_.
diff --git a/070_Index_Mgmt/10_Settings.asciidoc b/070_Index_Mgmt/10_Settings.asciidoc
index ac7373fa6..439f7ff5f 100644
--- a/070_Index_Mgmt/10_Settings.asciidoc
+++ b/070_Index_Mgmt/10_Settings.asciidoc
@@ -1,28 +1,22 @@
-=== Index Settings
+[[index-settings]]
+=== Index Settings

-There are many many knobs((("index settings"))) that you can twiddle to
-customize index behavior, which you can read about in the
-{ref}/index-modules.html[Index Modules reference documentation],
-but...
+You can customize index behavior by tweaking its((("index settings")))
+configuration; for the full details, see the
+{ref}/index-modules.html[Index Modules reference documentation].

-TIP: Elasticsearch comes with good defaults. Don't twiddle these knobs until
-you understand what they do and why you should change them.
+TIP: Elasticsearch ships with well-tuned defaults. Don't change these settings
+until you understand what they do and why you want to change them.

-Two of the most important((("shards", "number_of_shards index setting")))((("number_of_shards setting")))((("index settings", "number_of_shards"))) settings are as follows:
+The two most important((("shards", "number_of_shards index setting")))((("number_of_shards setting")))((("index settings", "number_of_shards"))) settings are:

 `number_of_shards`::

-    The number of primary shards that an index should have,
-    which defaults to `5`. This setting cannot be changed
-    after index creation.
+    The number of primary shards that an index should have; defaults to `5`.
+    This setting cannot be changed after the index has been created.

 `number_of_replicas`::

-    The number of replica shards (copies) that each primary shard
-    should have, which defaults to `1`. This setting can be changed
-    at any time on a live index.
+    The number of replica shards (copies) of each primary shard; defaults to
+    `1`. This setting can be changed at any time on a live index.

-For instance, we could create a small index--just((("index settings", "number_of_replicas")))((("replica shards", "number_of_replicas index setting"))) one primary shard--and no replica shards with the following request:
+For example, we could create a small index with just((("index settings", "number_of_replicas")))((("replica shards", "number_of_replicas index setting"))) one primary shard and no replica shards:

 [source,js]
 --------------------------------------------------
@@ -36,8 +30,8 @@ PUT /my_temp_index
 --------------------------------------------------
 // SENSE: 070_Index_Mgmt/10_Settings.json

-Later, we can change the number of replica shards dynamically using the
-`update-index-settings` API as((("update-index-settings API"))) follows:
+Later, we can dynamically change the number of replicas with the
+`update-index-settings` API((("update-index-settings API"))):

 [source,js]
 --------------------------------------------------
@@ -47,5 +41,3 @@ PUT /my_temp_index/_settings
 }
 --------------------------------------------------
 // SENSE: 070_Index_Mgmt/10_Settings.json
-
-
diff --git a/070_Index_Mgmt/32_Metadata_all.asciidoc b/070_Index_Mgmt/32_Metadata_all.asciidoc
index d4b61c819..82dfe1677 100644
--- a/070_Index_Mgmt/32_Metadata_all.asciidoc
+++ 
b/070_Index_Mgmt/32_Metadata_all.asciidoc
@@ -1,15 +1,9 @@
 [[all-field]]
-==== Metadata: _all Field
+==== Metadata: The `_all` Field

-In <>, we introduced the `_all` field: a special field that
-indexes the ((("metadata, document", "_all field")))((("_all field", sortas="all field")))values from all other fields as one big string. The `query_string`
-query clause (and searches performed as `?q=john`) defaults to searching in
-the `_all` field if no other field is specified.
+In <>, we introduced the `_all` field: a special field that
+indexes the values((("metadata, document", "_all field")))((("_all field", sortas="all field")))
+of all other fields as one big string. The `query_string` query clause (and
+searches such as `?q=john`) defaults to the `_all` field if no other field is
+specified.

-The `_all` field is useful during the exploratory phase of a new application,
-while you are still unsure about the final structure that your documents will
-have. You can throw any query string at it and you have a good chance of
-finding the document you're after:
+The `_all` field is useful during the exploratory phase of a new application,
+while you are still unsure of the final structure of your documents. You can
+throw any query string at it and have a good chance of finding the document
+you are after:

 [source,js]
 --------------------------------------------------
@@ -22,24 +16,14 @@ GET /_search
 --------------------------------------------------

-As your application evolves and your search requirements become more exacting,
-you will find yourself using the `_all` field less and less. The `_all` field
-is a shotgun approach to search. By querying individual fields, you have more
-flexbility, power, and fine-grained control over which results are considered
-to be most relevant.
+As your application evolves and your search requirements become more precise,
+you will find yourself using the `_all` field less and less. The `_all` field
+is a shotgun approach to search. By querying individual fields, you gain more
+flexibility and power, and finer-grained control over which results are
+considered most relevant.

 [NOTE]
 ====
-One of the important factors taken into account by the
-<>
-is the length of the field: the shorter the field, the more important. A term
-that appears in a short `title` field is likely to be more important than the
-same term that appears somewhere in a long `content` field. This distinction
-between field lengths disappears in the `_all` field.
+One of the most important factors considered by <> is the
+length of a field: the shorter the field, the more important it is. A term
+appearing in a short `title` field is likely to matter more than the same term
+appearing somewhere in a long `content` field. This distinction between field
+lengths disappears in the `_all` field.
 ====

-If you decide that you no longer need the `_all` field, you can disable it
-with this mapping:
+If you no longer need the `_all` field, you can disable it with this mapping:

 [source,js]
 --------------------------------------------------
@@ -51,17 +35,9 @@ PUT /my_index/_mapping/my_type
 }
 --------------------------------------------------

+Whether a field is included in the `_all` field can be controlled on a
+field-by-field basis with the `include_in_all` setting,((("include_in_all setting")))
+which defaults to `true`. Setting `include_in_all` on an object (or on the
+root object) changes the default for all fields within that object.

-Inclusion in the `_all` field can be controlled on a field-by-field basis
-by using the `include_in_all` setting, ((("include_in_all setting")))which defaults to `true`. Setting
-`include_in_all` on an object (or on the root object) changes the
-default for all fields within that object.

-You may find that you want to keep the `_all` field around to use
-as a catchall full-text field just for specific fields, such as
-`title`, `overview`, `summary`, and `tags`. Instead of disabling the `_all`
-field completely, disable `include_in_all` for all fields by default,
-and enable it only on the fields you choose:
+You may want to keep the `_all` field as a catchall full-text field that
+includes only specific fields, such as `title`, `overview`, `summary`, and
+`tags`. Instead of disabling the `_all` field completely, disable
+`include_in_all` for all fields by default and enable it only on the fields
+you choose:

 [source,js]
 --------------------------------------------------
@@ -81,11 +57,7 @@ PUT /my_index/my_type/_mapping
 --------------------------------------------------

-Remember that the `_all` field is just((("analyzers", "configuring for all field"))) an analyzed `string` field. It
-uses the default analyzer to analyze its values, regardless of which
-analyzer has been set on the fields where the values originate.
-And,
-like any `string` field, you can configure which analyzer the `_all`
-field should use:
+Remember that the `_all` field is just((("analyzers", "configuring for all field")))
+an analyzed `string` field. It uses the default analyzer to analyze its
+values, regardless of which analyzer has been set on the fields where the
+values originate. And, like any `string` field, you can configure which
+analyzer the `_all` field should use:

 [source,js]
 --------------------------------------------------
diff --git a/070_Index_Mgmt/33_Metadata_ID.asciidoc b/070_Index_Mgmt/33_Metadata_ID.asciidoc
index 094d1ccda..12146b0be 100644
--- a/070_Index_Mgmt/33_Metadata_ID.asciidoc
+++ b/070_Index_Mgmt/33_Metadata_ID.asciidoc
@@ -1,25 +1,22 @@
-==== Metadata: Document Identity
+==== Metadata: Document Identity

-There are four metadata fields ((("metadata, document", "identity")))associated with document identity:
+Four metadata fields((("metadata, document", "identity"))) are associated with
+document identity:

 `_id`::

-    The string ID of the document
+    The string ID of the document

 `_type`::

-    The type name of the document
+    The type name of the document

 `_index`::

-    The index where the document lives
+    The index the document lives in

 `_uid`::

-    The `_type` and `_id` concatenated together as `type#id`
+    The `_type` and `_id` concatenated together as `type#id`

-By default, the `_uid` field is((("id field"))) stored (can be retrieved) and
-indexed (searchable). The `_type` field((("type field")))((("index field")))((("uid field"))) is indexed but not stored,
-and the `_id` and `_index` fields are neither indexed nor stored, meaning
-they don't really exist.
+By default, the `_uid` field is((("id field"))) both stored (retrievable) and
+indexed (searchable). The `_type` field((("type field")))((("index field")))((("uid field")))
+is indexed but not stored, while the `_id` and `_index` fields are neither
+indexed nor stored, meaning they don't really exist.

-In spite of this, you can query the `_id` field as though it were a real
-field. Elasticsearch uses the `_uid` field to derive the `_id`. Although you
-can change the `index` and `store` settings for these fields, you almost
-never need to do so.
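+As a small illustration of that last point (the index, type, and ID values
+below are hypothetical), a `terms` query against `_id` behaves just like a
+query against an ordinary field:
+
+[source,js]
+--------------------------------------------------
+GET /my_index/my_type/_search
+{
+    "query" : {
+        "terms" : {
+            "_id" : [ "1", "2" ]
+        }
+    }
+}
+--------------------------------------------------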
+Despite this, you can query the `_id` field as though it were a real field;
+Elasticsearch uses the `_uid` field to derive the `_id`. Although you can
+change the `index` and `store` settings for these fields, you almost never
+need to do so.
diff --git a/070_Index_Mgmt/45_Default_Mapping.asciidoc b/070_Index_Mgmt/45_Default_Mapping.asciidoc
index a122e7d6b..7ae86bd18 100644
--- a/070_Index_Mgmt/45_Default_Mapping.asciidoc
+++ b/070_Index_Mgmt/45_Default_Mapping.asciidoc
@@ -1,15 +1,9 @@
 [[default-mapping]]
-=== Default Mapping
+=== Default Mapping

-Often, all types in an index share similar fields and settings. ((("mapping (types)", "default")))((("default mapping"))) It can be
-more convenient to specify these common settings in the `_default_` mapping,
-instead of having to repeat yourself every time you create a new type. The
-`_default_` mapping acts as a template for new types. All types created
-_after_ the `_default_` mapping will include all of these default settings,
-unless explicitly overridden in the type mapping itself.
+Often, all the types in an index share similar fields and settings.((("mapping (types)", "default")))((("default mapping")))
+It can be more convenient to specify these common settings once in the
+`_default_` mapping than to repeat them every time you create a new type. The
+`_default_` mapping acts as a template for new types: all types created
+_after_ the `_default_` mapping is set will include these default settings,
+unless a type explicitly overrides them in its own mapping.

-For instance, we can disable the `_all` field for all types,((("_all field", sortas="all field"))) using the
-`_default_` mapping, but enable it just for the `blog` type, as follows:
+For example, we can use the `_default_` mapping to disable the `_all` field
+for all types,((("_all field", sortas="all field"))) while enabling it just
+for the `blog` type:

 [source,js]
 --------------------------------------------------
@@ -28,5 +22,4 @@ PUT /my_index
 // SENSE: 070_Index_Mgmt/45_Default_mapping.json

-The `_default_` mapping can also be a good place to specify index-wide
-<>.
+The `_default_` mapping is also a good place to specify index-wide
+<>.
diff --git a/070_Index_Mgmt/50_Reindexing.asciidoc b/070_Index_Mgmt/50_Reindexing.asciidoc
index a0d54ed14..de15cd59f 100644
--- a/070_Index_Mgmt/50_Reindexing.asciidoc
+++ b/070_Index_Mgmt/50_Reindexing.asciidoc
@@ -1,32 +1,24 @@
 [[reindex]]
-=== Reindexing Your Data
+=== Reindexing Your Data

-Although you can add new types to an index, or add new fields to a type, you
-can't add new analyzers or make changes to existing fields.((("reindexing")))((("indexing", "reindexing your data"))) If you were to do
-so, the data that had already been indexed would be incorrect and your
-searches would no longer work as expected.
+Although you can add new types to an index, or new fields to a type, you
+cannot add new analyzers or change existing fields.((("reindexing")))((("indexing", "reindexing your data")))
+If you were to do so, the data that has already been indexed would be
+incorrect and your searches would no longer work as expected.

-The simplest way to apply these changes to your existing data is to
-reindex: create a new index with the new settings and copy all of your
-documents from the old index to the new index.
+The simplest way to apply such changes to your existing data is to reindex:
+create a new index with the new settings and copy all of your documents from
+the old index into the new one.

-One of the advantages of the `_source` field is that you already have the
-whole document available to you in Elasticsearch itself. You don't have to
-rebuild your index from the database, which is usually much slower.
+One advantage of the `_source` field is that the whole document is already
+available to you in Elasticsearch itself. You don't have to rebuild your index
+from the database, which is usually much slower.

-To reindex all of the documents from the old index efficiently, use
-<> to retrieve batches((("using in reindexing documents"))) of documents from the old index,
-and the <> to push them into the new index.
+To reindex all of the documents in the old index efficiently, use
+<> to retrieve batches of documents((("using in reindexing documents")))
+from the old index, and the <> to push them into the new index.

-Beginning with Elasticsearch v2.3.0, a {ref}/docs-reindex.html[Reindex API] has been introduced. It enables you
-to reindex your documents without requiring any plugin nor external tool.
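+For readers on v2.3.0 or later, a minimal sketch of that Reindex API follows
+(the index names are placeholders); on older versions, the scan-and-scroll
+plus bulk approach described above remains the way to do it:
+
+[source,js]
+--------------------------------------------------
+POST /_reindex
+{
+    "source" : { "index" : "old_index" },
+    "dest"   : { "index" : "new_index" }
+}
+--------------------------------------------------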
+Starting with Elasticsearch v2.3.0, a {ref}/docs-reindex.html[Reindex API] has
+been introduced. It lets you reindex your documents without requiring any
+plugin or external tool.

-.Reindexing in Batches
+.Reindexing in Batches
 ****
-You can run multiple reindexing jobs at the same time, but you obviously don't
-want their results to overlap. Instead, break a big reindex down into smaller
-jobs by filtering on a date or timestamp field:
+You can run multiple reindexing jobs at the same time, but you obviously don't
+want their results to overlap. Instead, break a big reindex into smaller jobs
+by filtering on a date or timestamp field:

 [source,js]
 --------------------------------------------------
@@ -46,11 +38,8 @@ GET /old_index/_search?scroll=1m
 --------------------------------------------------

-If you continue making changes to the old index, you will want to make
-sure that you include the newly added documents in your new index as well.
-This can be done by rerunning the reindex process, but again filtering
-on a date field to match only documents that have been added since the
-last reindex process started.
+If you keep making changes to the old index, you will also want the newly
+added documents to appear in the new index. You can do this by rerunning the
+reindex process, again filtering on a date field to match only those documents
+added since the previous reindex started.
 ****
diff --git a/080_Structured_Search/10_compoundfilters.asciidoc b/080_Structured_Search/10_compoundfilters.asciidoc
index dfbd3989d..f11578316 100644
--- a/080_Structured_Search/10_compoundfilters.asciidoc
+++ b/080_Structured_Search/10_compoundfilters.asciidoc
@@ -1,9 +1,7 @@
 [[combining-filters]]
-=== Combining Filters
+=== Combining Filters

-The previous two examples showed a single filter in use.((("structured search", "combining filters")))((("filters", "combining")))
-In practice, you will probably need to filter on multiple values or fields.
-For example, how would you express this SQL in Elasticsearch?
+The previous two examples showed a single filter in use.((("structured search", "combining filters")))((("filters", "combining")))
+In practice, you will probably need to filter on multiple values or fields.
+For example, how would you express this SQL in Elasticsearch?
 [source,sql]
 --------------------------------------------------
@@ -13,14 +11,12 @@ WHERE (price = 20 OR productID = "XHDK-A-1293-#fJ3")
 AND   (price != 30)
 --------------------------------------------------

-In these situations, you will need to use a `bool` query((("filters", "combining", "in bool query")))((("bool query")))
-inside the `constant_score` query. This allows us to build
-filters that can have multiple components in boolean combinations.
+In these situations, we need the `bool` filter.((("filters", "combining", "in bool filter")))((("bool filter")))
+This is a _compound filter_ that accepts other filters as arguments and
+combines them into various Boolean (logical) combinations.

 [[bool-filter]]
-==== Bool Filter
+==== Bool Filter

-Recall that the `bool` query is composed of four sections:
+A `bool` filter is composed of three sections:

 [source,js]
 --------------------------------------------------
@@ -29,45 +25,34 @@
    "must" :     [],
    "should" :   [],
    "must_not" : [],
-   "filter":    []
 }
 }
 --------------------------------------------------

 `must`::
-   All of these clauses _must_ match. The equivalent of `AND`.
+   All of these clauses _must_ match. The equivalent of `AND`.

 `must_not`::
-   All of these clauses _must not_ match. The equivalent of `NOT`.
+   All of these clauses _must not_ match. The equivalent of `NOT`.

 `should`::
-   At least one of these clauses must match. The equivalent of `OR`.
+   At least one of these clauses must match. The equivalent of `OR`.

- `filter`::
-   Clauses that _must_ match, but are run in non-scoring, filtering mode.
-
-In this secondary boolean query, we can ignore the `filter` clause: the queries
-are already running in non-scoring mode, so the extra `filter` clause is useless.
+That's it!((("should clause", "in bool filters")))((("must_not clause", "in bool filters")))((("must clause", "in bool filters")))
+When we need multiple filters, we simply place them in the appropriate
+sections of the `bool` filter.

 [NOTE]
 ====
-Each section of the `bool` filter is optional (for example, you can have a `must`
-clause and nothing else), and each section can contain a single query or an
-array of queries.
+Each section of the `bool` filter is optional (for example, we can have just a
+`must` clause and nothing else), and each section can contain a single filter
+or an array of filters.
 ====

-To replicate the preceding SQL example, we will take the two `term` queries that
-we used((("term query", "placing inside bool query")))
-((("bool query", "with two term query in should clause and must_not clause"))) previously and
-place them inside the `should` clause of a `bool` query, and add another clause
-to deal with the `NOT` condition:
+To reproduce the SQL example at the start of this section, we take the two
+`term` filters we used previously and place them inside the `should` clause of
+a `bool` filter, then add another clause to handle the `NOT` condition:

 [source,js]
 --------------------------------------------------
 GET /my_store/products/_search
 {
    "query" : {
-      "constant_score" : { <1>
+      "filtered" : { <1>
          "filter" : {
            "bool" : {
              "should" : [
@@ -85,19 +70,11 @@
 --------------------------------------------------
 // SENSE: 080_Structured_Search/10_Bool_filter.json

-<1> Note that we still need to use a `constant_score` query to wrap everything with its
-`filter` clause. This is what enables non-scoring mode
-<2> These two `term` queries are _children_ of the `bool` query, and since they
-    are placed inside the `should` clause, at least one of them needs to match.
-<3> If a product has a price of `30`, it is automatically excluded because it
-    matches a `must_not` clause.
-
-Notice how boolean is placed inside the `constant_score`, but the individual term
-queries are just placed in the `should` and `must_not`. Because everything is wrapped
-with the `constant_score`, the rest of the queries are executing in filtering mode.
+<1> Note that we still need a `filtered` query to wrap everything up.
+<2> The two `term` filters are _children_ of the `bool` filter; because they
+    are placed inside the `should` clause, at least one of them needs to match.
+<3> If a product has a price of `30`, it is automatically excluded because it
+    matches the `must_not` clause.

-Our search results return two hits, each document satisfying a different clause
-in the `bool` query:
+Our search returns two hits, each document satisfying a different clause in
+the `bool` filter:

 [source,json]
 --------------------------------------------------
@@ -120,17 +97,14 @@
        }
    ]
 --------------------------------------------------
-<1> Matches the `term` query for `productID = "XHDK-A-1293-#fJ3"`
-<2> Matches the `term` query for `price = 20`
+<1> Matches the `term` filter for `productID = "XHDK-A-1293-#fJ3"`
+<2> Matches the `term` filter for `price = 20`

-==== Nesting Boolean Queries
+==== Nesting Boolean Filters

-You can already see how nesting boolean queries together can give rise to more
-sophisticated boolean logic. If you need to perform more complex operations, you
-can continue nesting boolean queries in any combination, giving rise to
-arbitrarily complex boolean logic.
+Although `bool` is a compound filter that accepts other filters, bear in mind
+that it is still just a filter itself.((("filters", "combining", "nesting bool filters")))((("bool filter", "nesting in another bool filter")))
+This means we can nest `bool` filters inside other `bool` filters, giving us
+the ability to handle arbitrarily complex Boolean logic.

-For example, if we have this SQL statement:
+Given this SQL statement:

 [source,sql]
 --------------------------------------------------
@@ -141,14 +115,14 @@ WHERE productID      = "KDKE-B-9947-#kL5"
        AND price     = 30 )
 --------------------------------------------------

-We can translate it into a pair of nested `bool` filters:
+we can translate it into a pair of nested `bool` filters:

 [source,js]
 --------------------------------------------------
 GET /my_store/products/_search
 {
    "query" : {
-      "constant_score" : {
+      "filtered" : {
          "filter" : {
            "bool" : {
              "should" : [
@@ -168,14 +142,10 @@
 --------------------------------------------------
 // SENSE: 080_Structured_Search/10_Bool_filter.json

-<1> Because the `term` and the `bool` are sibling clauses inside the
-    Boolean `should`, at least one of these queries must match for a document
-    to be a hit.
-
-<2> These two `term` clauses are siblings in a `must` clause, so they both
-    have to match for a document to be returned as a hit.
+<1> Because the `term` filter and the `bool` filter are sibling clauses inside
+    the outer Boolean `should`, a document must match at least one of them to
+    be a hit.
+<2> These two `term` clauses are siblings inside a `must` clause, so a
+    document must match both of them to be returned as a hit.

-The results show us two documents, one matching each of the `should` clauses:
+The results give us two documents, each matching a different `should` clause:

 [source,json]
 --------------------------------------------------
@@ -198,8 +168,7 @@
        }
    ]
 --------------------------------------------------
-<1> This `productID` matches the `term` in the first `bool`.
-<2> These two fields match the `term` filters in the nested `bool`.
+<1> This `productID` matches the single `term` in the outer `bool` filter's
+    `should` clause.
+<2> These two fields match the `term` filters in the nested `bool` filter's
+    `must` clause.

-This was a simple example, but it demonstrates how Boolean queries can be
-used as building blocks to construct complex logical conditions.
+This was a simple example, but it shows how Boolean filters can be used as
+building blocks for constructing complex logical conditions.
diff --git a/100_Full_Text_Search/10_Multi_word_queries.asciidoc b/100_Full_Text_Search/10_Multi_word_queries.asciidoc
index dcb5fa7d9..8ee757037 100644
--- a/100_Full_Text_Search/10_Multi_word_queries.asciidoc
+++ b/100_Full_Text_Search/10_Multi_word_queries.asciidoc
@@ -1,9 +1,7 @@
 [[match-multi-word]]
-=== Multiword Queries
+=== Multiword Queries

-If we could search for only one word at a time, full-text search would be
-pretty inflexible. Fortunately, the `match` query((("full text search", "multi-word queries")))((("match query", "multi-word query"))) makes multiword queries
-just as simple:
+If we could search for only one word at a time, full-text search would be
+rather inflexible. Fortunately, the `match` query makes multiword queries just
+as simple:((("full text search", "multi-word queries")))((("match query", "multi-word query")))

 [source,js]
 --------------------------------------------------
@@ -18,7 +16,7 @@ GET /my_index/my_type/_search
 --------------------------------------------------
 // SENSE: 100_Full_Text_Search/05_Match_query.json

-The preceding query returns all four documents in the results list:
+The preceding query returns all four documents:

 [source,js]
 --------------------------------------------------
@@ -56,33 +54,22 @@
 --------------------------------------------------

-<1> Document 4 is the most relevant because it contains `"brown"` twice and `"dog"`
-    once.
+<1> Document 4 is the most relevant, because it contains `"brown"` twice and
+    `"dog"` once.

-<2> Documents 2 and 3 both contain `brown` and `dog` once each, and the `title`
-    field is the same length in both docs, so they have the same score.
+<2> Documents 2 and 3 each contain `brown` and `dog` once, and their `title`
+    fields are the same length, so they have the same score.

-<3> Document 1 matches even though it contains only `brown`, not `dog`.
+<3> Document 1 still matches, even though it contains only `brown`, not `dog`.

-Because the `match` query has to look for two terms—`["brown","dog"]`—internally it has to execute two `term` queries and combine their individual
-results into the overall result. To do this, it wraps the two `term` queries
-in a `bool` query, which we examine in detail in <>.
+Because the `match` query has to look for two terms--`["brown","dog"]`--it
+internally executes two `term` queries and then combines their individual
+results into the overall result. To do this, it wraps the two `term` queries
+in a `bool` query, which we examine in detail in <>.

-The important thing to take away from this is that any document whose
-`title` field contains _at least one of the specified terms_ will match the
-query. The more terms that match, the more relevant the document.
+The important lesson here is that any document whose `title` field contains
+_at least one of the specified terms_ matches the query; the more terms that
+match, the more relevant the document.

 [[match-improving-precision]]
-==== Improving Precision
+==== Improving Precision

-Matching any document that contains _any_ of the query terms may result in a
-long tail of seemingly irrelevant results. ((("full text search", "multi-word queries", "improving precision")))((("precision", "improving for full text search multi-word queries"))) It's a shotgun approach to search.
-Perhaps we want to show only documents that contain _all_ of the query terms.
-In other words, instead of `brown OR dog`, we want to return only documents
-that match `brown AND dog`.
+Matching any document that contains _any_ of the query terms can produce a
+long tail of seemingly irrelevant results((("full text search", "multi-word queries", "improving precision")))((("precision", "improving for full text search multi-word queries")))--a
+shotgun approach to search. Perhaps we want to show only documents that
+contain _all_ of the query terms. In other words, instead of `brown OR dog`,
+we want to return only documents matching `brown AND dog`.

-The `match` query accepts an `operator` parameter((("match query", "operator parameter")))((("or operator", "in match queries")))((("and operator", "in match queries"))) that defaults to `or`.
-You can change it to `and` to require that all specified terms must match:
+The `match` query accepts an `operator` parameter,((("match query", "operator parameter")))((("or operator", "in match queries")))((("and operator", "in match queries")))
+which defaults to `or`. You can change it to `and` to require that all
+specified terms match:

 [source,js]
 --------------------------------------------------
@@ -100,27 +87,18 @@ GET /my_index/my_type/_search
 --------------------------------------------------
 // SENSE: 100_Full_Text_Search/05_Match_query.json

-<1> The structure of the `match` query has to change slightly in order to
-    accommodate the `operator` parameter.
+<1> The structure of the `match` query has to change slightly to accommodate
+    the `operator` parameter.

-This query would exclude document 1, which contains only one of the two terms.
+This query excludes document 1, which contains only one of the two terms.

 [[match-precision]]
-==== Controlling Precision
+==== Controlling Precision

-The choice between _all_ and _any_ is a bit((("full text search", "multi-word queries", "controlling precision"))) too black-or-white. What if the
-user specified five query terms, and a document contains only four of them?
-Setting `operator` to `and` would exclude this document.
+The choice between _all_ and _any_ is a bit((("full text search", "multi-word queries", "controlling precision")))
+too black-or-white. What if a user specifies five query terms and a document
+contains only four of them? Setting `operator` to `and` would exclude that
+document.

-Sometimes that is exactly what you want, but for most full-text search use
-cases, you want to include documents that may be relevant but exclude those
-that are unlikely to be relevant. In other words, we need something
-in-between.
+Sometimes that is exactly what you want, but in most full-text search use
+cases you want to include documents that may be relevant while excluding those
+that are unlikely to be. In other words, we need something in between.

-The `match` query supports((("match query", "minimum_should_match parameter")))((("minimum_should_match parameter"))) the `minimum_should_match` parameter, which allows
-you to specify the number of terms that must match for a document to be considered
-relevant.
While you can specify an absolute number of terms, it usually makes -sense to specify a percentage instead, as you have no control over the number of words the user may enter: +`match` 查询支持 `minimum_should_match` 最小匹配参数,((("match query", "minimum_should_match parameter")))((("minimum_should_match parameter")))这让我们可以指定必须匹配的词项数用来表示一个文档是否相关。我们可以将其设置为某个具体数字,更常用的做法是将其设置为一个百分数,因为我们无法控制用户搜索时输入的单词数量: [source,js] -------------------------------------------------- @@ -138,18 +116,12 @@ GET /my_index/my_type/_search -------------------------------------------------- // SENSE: 100_Full_Text_Search/05_Match_query.json -When specified as a percentage, `minimum_should_match` does the right thing: -in the preceding example with three terms, `75%` would be rounded down to `66.6%`, -or two out of the three terms. No matter what you set it to, at least one term -must match for a document to be considered a match. +当给定百分比的时候, `minimum_should_match` 会做合适的事情:在之前三词项的示例中, `75%` 会自动被截断成 `66.6%` ,即三个里面两个词。无论这个值设置成什么,至少包含一个词项的文档才会被认为是匹配的。 [NOTE] ==== -The `minimum_should_match` parameter is flexible, and different rules can -be applied depending on the number of terms the user enters. For the full -documentation see the +参数 `minimum_should_match` 的设置非常灵活,可以根据用户输入词项的数目应用不同的规则。完整的信息参考文档 {ref}/query-dsl-minimum-should-match.html#query-dsl-minimum-should-match ==== -To fully understand how the `match` query handles multiword queries, we need -to look at how to combine multiple queries with the `bool` query. 
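上文描述的百分比取整规则可以用几行 Python 验证( `required_terms` 是为演示虚构的函数,仅示意"按百分比向下取整、但至少匹配一个词项"的行为):

```python
import math

def required_terms(total_terms, minimum_should_match_pct):
    """按百分比计算至少需要匹配的词项数:结果向下取整,但不会小于 1。"""
    required = math.floor(total_terms * minimum_should_match_pct / 100)
    return max(required, 1)

print(required_terms(3, 75))   # 三个词项、75%:向下取整为 2
print(required_terms(1, 75))   # 无论设置成什么,至少要匹配 1 个词项
```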
+为了完全理解 `match` 是如何处理多词查询的,我们就需要查看如何使用 `bool` 查询将多个查询条件组合在一起。 diff --git a/100_Full_Text_Search/15_Combining_queries.asciidoc b/100_Full_Text_Search/15_Combining_queries.asciidoc index ee02fadba..8064a109b 100644 --- a/100_Full_Text_Search/15_Combining_queries.asciidoc +++ b/100_Full_Text_Search/15_Combining_queries.asciidoc @@ -1,16 +1,11 @@ [[bool-query]] -=== Combining Queries +=== 组合查询 -In <> we discussed how to((("full text search", "combining queries"))) use the `bool` filter to combine -multiple filter clauses with `and`, `or`, and `not` logic. In query land, the -`bool` query does a similar job but with one important difference. +在 <> 中,我们讨论过如何使用 `bool` 过滤器通过 `and` 、 `or` 和 `not` 逻辑组合将多个过滤器进行组合。在查询中, `bool` 查询有类似的功能,只有一个重要的区别。 -Filters make a binary decision: should this document be included in the -results list or not? Queries, however, are more subtle. They decide not only -whether to include a document, but also how _relevant_ that document is. +过滤器做二元判断:文档是否应该出现在结果中?但查询更精妙,它除了决定一个文档是否应该被包括在结果中,还会计算文档的 _相关程度_ 。 -Like the filter equivalent, the `bool` query accepts((("bool query"))) multiple query clauses -under the `must`, `must_not`, and `should` parameters. For instance: +与过滤器一样, `bool` 查询也可以接受 `must` 、 `must_not` 和 `should` 参数下的多个查询语句。((("bool query")))比如: [source,js] -------------------------------------------------- @@ -30,13 +25,9 @@ GET /my_index/my_type/_search -------------------------------------------------- // SENSE: 100_Full_Text_Search/15_Bool_query.json -The results from the preceding query include any document whose `title` field -contains the term `quick`, except for those that also contain `lazy`. So -far, this is pretty similar to how the `bool` filter works. 
+以上的查询结果返回 `title` 字段包含词项 `quick` 但不包含 `lazy` 的任意文档。目前为止,这与 `bool` 过滤器的工作方式非常相似。

-The difference comes in with the two `should` clauses, which say that: a document
-is _not required_ to contain ((("should clause", "in bool queries")))either `brown` or `dog`, but if it does, then
-it should be considered _more relevant_:
+区别就在于两个 `should` 语句,也就是说:一个文档不必包含((("should clause", "in bool queries"))) `brown` 或 `dog` 这两个词项,但如果包含了,就认为该文档 _更相关_ :

[source,js]
--------------------------------------------------
@@ -60,28 +51,19 @@ it should be considered _more relevant_:
}
--------------------------------------------------

-<1> Document 3 scores higher because it contains both `brown` and `dog`.
+<1> 文档 3 的评分比文档 1 高,因为它同时包含 `brown` 和 `dog` 。

-==== Score Calculation
+==== 评分计算

-The `bool` query calculates((("relevance scores", "calculation in bool queries")))((("bool query", "score calculation"))) the relevance `_score` for each document by adding
-together the `_score` from all of the matching `must` and `should` clauses,
-and then dividing by the total number of `must` and `should` clauses.
+`bool` 查询会为每个文档计算相关度评分 `_score` ,((("relevance scores", "calculation in bool queries")))((("bool query", "score calculation")))先将所有匹配的 `must` 和 `should` 语句的分数 `_score` 求和,再除以 `must` 和 `should` 语句的总数。

-The `must_not` clauses do not affect ((("must_not clause", "in bool queries")))the score; their only purpose is to
-exclude documents that might otherwise have been included.
+`must_not` 语句不会影响评分;((("must_not clause", "in bool queries")))它的作用只是将不相关的文档排除。

-==== Controlling Precision
+==== 控制精度

-All the `must` clauses must match, and all the `must_not` clauses must not
-match, but how many `should` clauses((("bool query", "controlling precision")))((("full text search", "combining queries", "controlling precision")))((("precision", "controlling for bool query"))) should match? 
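上面的评分规则可以写成一小段 Python 草图(假设我们已经拿到了各匹配语句的 `_score` ;这只是对书中描述的直接翻译,并非 Lucene 的真实实现):

```python
def bool_score(matched_scores, total_clauses):
    """bool 查询的简化评分:所有匹配的 must/should 语句得分求和,
    再除以 must 和 should 语句的总数(must_not 不参与评分)。"""
    return sum(matched_scores) / total_clauses

# 一个匹配的 must(得分 1.0)加上两个匹配的 should(得分 0.6 和 0.4),共 3 条语句:
print(bool_score([1.0, 0.6, 0.4], 3))
```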
By default, none of the `should` clauses are required to match, with one -exception: if there are no `must` clauses, then at least one `should` clause -must match. +所有 `must` 语句必须匹配,所有 `must_not` 语句都必须不匹配,但有多少 `should` 语句应该匹配呢?((("bool query", "controlling precision")))((("full text search", "combining queries", "controlling precision")))((("precision", "controlling for bool query")))默认情况下,没有 `should` 语句是必须匹配的,只有一个例外:那就是当没有 `must` 语句的时候,至少有一个 `should` 语句必须匹配。 -Just as we can control the <>, -we can control how many `should` clauses need to match by using the -`minimum_should_match` parameter,((("minimum_should_match parameter", "in bool queries"))) either as an absolute number or as a -percentage: +就像我们能控制 <> 一样,我们可以通过 `minimum_should_match` 参数控制需要匹配的 `should` 语句的数量,((("minimum_should_match parameter", "in bool queries")))它既可以是一个绝对的数字,又可以是个百分比: [source,js] -------------------------------------------------- @@ -101,9 +83,7 @@ GET /my_index/my_type/_search -------------------------------------------------- // SENSE: 100_Full_Text_Search/15_Bool_query.json -<1> This could also be expressed as a percentage. +<1> 这也可以用百分比表示。 -The results would include only documents whose `title` field contains `"brown" -AND "fox"`, `"brown" AND "dog"`, or `"fox" AND "dog"`. If a document contains -all three, it would be considered more relevant than those that contain -just two of the three. +这个查询结果会将所有满足以下条件的文档返回: `title` 字段包含 `"brown" +AND "fox"` 、 `"brown" AND "dog"` 或 `"fox" AND "dog"` 。如果有文档包含所有三个条件,它会比只包含两个的文档更相关。 diff --git a/110_Multi_Field_Search/00_Intro.asciidoc b/110_Multi_Field_Search/00_Intro.asciidoc index d6090dd5e..a0ea2ea0c 100644 --- a/110_Multi_Field_Search/00_Intro.asciidoc +++ b/110_Multi_Field_Search/00_Intro.asciidoc @@ -1,17 +1,8 @@ [[multi-field-search]] -== Multifield Search +== 多字段搜索 -Queries are seldom simple one-clause `match` queries. 
((("multifield search"))) We frequently need to -search for the same or different query strings in one or more fields, which -means that we need to be able to combine multiple query clauses and their -relevance scores in a way that makes sense. +查询很少是简单一句话的 `match` 匹配查询。((("multifield search")))通常我们需要用相同或不同的字符串查询一个或多个字段,也就是说,需要对多个查询语句以及它们相关度评分进行合理的合并。 -Perhaps we're looking for a book called _War and Peace_ by an author called -Leo Tolstoy. Perhaps we're searching the Elasticsearch documentation -for ``minimum should match,'' which might be in the title or the body of a -page. Or perhaps we're searching for users with first name John and last -name Smith. +有时候或许我们正查找作者 Leo Tolstoy 写的一本名为 _War and Peace_(战争与和平)的书。或许我们正用 “minimum should match” (最少应该匹配)的方式在文档中对标题或页面内容进行搜索,或许我们正在搜索所有名字为 John Smith 的用户。 -In this chapter, we present the available tools for constructing multiclause -searches and how to figure out which solution you should apply to your -particular use case. +在本章,我们会介绍构造多语句搜索的工具及在特定场景下应该采用的解决方案。 diff --git a/130_Partial_Matching/05_Postcodes.asciidoc b/130_Partial_Matching/05_Postcodes.asciidoc index d3a47a907..0b22b17e9 100644 --- a/130_Partial_Matching/05_Postcodes.asciidoc +++ b/130_Partial_Matching/05_Postcodes.asciidoc @@ -1,22 +1,20 @@ -=== Postcodes and Structured Data +[[postcodes-and-structured-data]] +=== 邮编与结构化数据 -We will use United Kingdom postcodes (postal codes in the United States) to illustrate how((("partial matching", "postcodes and structured data"))) to use partial matching with -structured data. UK postcodes have a well-defined structure. 
For instance, the
-postcode `W1V 3DG` can((("postcodes (UK), partial matching with"))) be broken down as follows:
+我们会使用英国邮编( United Kingdom postcodes ,在美国称作 postal codes )来说明如何用部分匹配查询结构化数据。((("partial matching", "postcodes and structured data")))英国邮编有良好的结构定义。例如,邮编 `W1V 3DG` 可以分解成如下形式:((("postcodes (UK), partial matching with")))

-* `W1V`: This outer part identifies the postal area and district:
+* `W1V` :这是邮编的外部,它定义了邮件的区域和行政区:

-** `W` indicates the area (one or two letters)
-** `1V` indicates the district (one or two numbers, possibly followed by a letter)
+** `W` 代表区域( 1 或 2 个字母)
+** `1V` 代表行政区( 1 或 2 个数字,可能跟着一个字母)

-* `3DG`: This inner part identifies a street or building:
+* `3DG` :这是邮编的内部,它定义了街道或建筑:

-** `3` indicates the sector (one number)
-** `DG` indicates the unit (two letters)
+** `3` 代表街区区块( 1 个数字)
+** `DG` 代表单元( 2 个字母)

-Let's assume that we are indexing postcodes as exact-value `not_analyzed`
-fields, so we could create our index as follows:
+假设将邮编作为 `not_analyzed` 的精确值字段索引,所以可以为其创建索引,如下:

[source,js]
--------------------------------------------------
@@ -36,7 +34,7 @@ PUT /my_index
--------------------------------------------------
// SENSE: 130_Partial_Matching/10_Prefix_query.json

-And index some ((("indexing", "postcodes")))postcodes:
+然后索引一些邮编:((("indexing", "postcodes")))

[source,js]
--------------------------------------------------
@@ -57,4 +55,4 @@ PUT /my_index/address/5
--------------------------------------------------
// SENSE: 130_Partial_Matching/10_Prefix_query.json

-Now our data is ready to be queried. 
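对 `not_analyzed` 的精确值字段来说,前缀匹配在概念上就是简单的字符串前缀比较。下面的 Python 草图用假设的示例邮编演示这个思路(并非 `prefix` 查询的真实实现):

```python
postcodes = ["W1V 3DG", "W2F 8HW", "W1F 7HW", "WC1N 1LZ", "SW5 0BE"]  # 假设的示例数据

def prefix_filter(docs, prefix):
    """在精确值字段上做前缀匹配:逐个文档比较字符串前缀。"""
    return [d for d in docs if d.startswith(prefix)]

print(prefix_filter(postcodes, "W1"))   # ['W1V 3DG', 'W1F 7HW']
```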
+现在这些数据已可查询。

diff --git a/170_Relevance/05_Intro.asciidoc b/170_Relevance/05_Intro.asciidoc
index 6b13bc923..f304f2147 100644
--- a/170_Relevance/05_Intro.asciidoc
+++ b/170_Relevance/05_Intro.asciidoc
@@ -1,30 +1,14 @@
[[controlling-relevance]]
-== Controlling Relevance
+== 控制相关度

-Databases that deal purely in structured data (such as dates, numbers, and
-string enums) have it easy: they((("relevance", "controlling"))) just have to check whether a document (or a
-row, in a relational database) matches the query.
+只处理结构化数据(比如:日期、数字、字符串、枚举)的数据库很轻松:((("relevance", "controlling")))它们只需检查文档(或关系数据库里的行)是否与查询匹配。

-While Boolean yes/no matches are an essential part of full-text search, they
-are not enough by themselves. Instead, we also need to know how relevant each
-document is to the query. Full-text search engines have to not only find the
-matching documents, but also sort them by relevance.
+布尔的是/非匹配是全文搜索的基础,但不止如此,我们还要知道每个文档与查询的相关度,在全文搜索引擎中不仅需要找到匹配的文档,还需根据它们相关度的高低进行排序。

-Full-text relevance ((("similarity algorithms")))formulae, or _similarity algorithms_, combine several
-factors to produce a single relevance `_score` for each document. In this
-chapter, we examine the various moving parts and discuss how they can be
-controlled.
+全文相关的公式或 _相似算法(similarity algorithms)_ ((("similarity algorithms")))会将多个因素合并起来,为每个文档生成一个相关度评分 `_score` 。本章中,我们会逐一查看评分计算中各个可变的组成部分,然后讨论如何对它们加以控制。

-Of course, relevance is not just about full-text queries; it may need to
-take structured data into account as well. Perhaps we are looking for a
-vacation home with particular features (air-conditioning, sea view, free
-WiFi). The more features that a property has, the more relevant it is. Or
-perhaps we want to factor in sliding scales like recency, price, popularity, or
-distance, while still taking the relevance of a full-text query into account. 
+当然,相关度不只与全文查询有关,也需要将结构化的数据考虑其中。可能我们正在找一个度假屋,需要一些特定的特征(空调、海景、免费 WiFi ),匹配的特征越多相关度越高。可能我们还希望把一些渐变的因素考虑进来,如新近程度、价格、受欢迎度或距离,当然也同时考虑全文查询的相关度。

-All of this is possible thanks to the powerful scoring infrastructure
-available in Elasticsearch.
+所有的这些都可以借助 Elasticsearch 强大的评分体系来实现。

-We will start by looking at the theoretical side of how Lucene calculates
-relevance, and then move on to practical examples of how you can control the
-process.
+本章会先从理论上介绍 Lucene 是如何计算相关度的,然后通过实际例子说明如何控制相关度的计算过程。

diff --git a/170_Relevance/65_Script_score.asciidoc b/170_Relevance/65_Script_score.asciidoc
index ca914ea49..6b202ec1f 100644
--- a/170_Relevance/65_Script_score.asciidoc
+++ b/170_Relevance/65_Script_score.asciidoc
@@ -1,22 +1,15 @@
[[script-score]]
-=== Scoring with Scripts
+=== 脚本评分

-Finally, if none of the `function_score`'s built-in functions suffice, you can
-implement the logic that you need with a script, using the `script_score`
-function.((("function_score query", "using script_score function")))((("script_score function")))((("relevance", "controlling", "scoring with scripts")))
+最后,如果所有 `function_score` 内置的函数都无法满足应用场景,可以使用 `script_score` 函数自行实现逻辑。((("function_score query", "using script_score function")))((("script_score function")))((("relevance", "controlling", "scoring with scripts")))

-For an example, let's say that we want to factor our profit margin into the
-relevance calculation. In our business, the profit margin depends on three
-factors:
+举个例子,想将利润空间作为因子加入到相关度评分计算,在业务中,利润空间和以下三点相关:

-* The `price` per night of the vacation home.
-* The user's membership level--some levels get a percentage `discount`
-  above a certain price per night `threshold`.
-* The negotiated `margin` as a percentage of the price-per-night, after user
-  discounts. 
+* `price` :度假屋每晚的价格。
+* 会员用户的级别——某些等级的用户可以在每晚房价高于某个 `threshold` 阈值价格的时候享受折扣 `discount` 。
+* 用户享受折扣后,经过议价的每晚房价的利润 `margin` 。

-The algorithm that we will use to calculate the profit for each home is as
-follows:
+计算每个度假屋利润的算法如下:

[source,groovy]
-------------------------
@@ -27,11 +20,8 @@ if (price < threshold) {
}
-------------------------

-We probably don't want to use the absolute profit as a score; it would
-overwhelm the other factors like location, popularity and features. Instead,
-we can express the profit as a percentage of our `target` profit. A profit
-margin above our target will have a positive score (greater than `1.0`), and a profit margin below our target will have a negative score (less than
-`1.0`):
+我们很可能不想用绝对利润作为评分,这会弱化其他如地点、受欢迎度和特性等因子的作用,而是将利润用目标利润 `target` 的百分比来表示,高于
+目标的利润空间会有一个正向评分(大于 `1.0` ),低于目标的利润空间会有一个负向分数(小于 `1.0` ):

[source,groovy]
-------------------------
@@ -43,9 +33,7 @@ if (price < threshold) { <2>
}
return price * (1 - discount) * margin / target <2>
-------------------------

-The default scripting language in Elasticsearch is
-http://groovy.codehaus.org/[Groovy], which for the most part looks a lot like
-JavaScript.((("Groovy", "script factoring profit margins into relevance calculations"))) The preceding algorithm as a Groovy script would look like this:
+Elasticsearch 里使用 http://groovy.codehaus.org/[Groovy] 作为默认的脚本语言,它与 JavaScript 很像,((("Groovy", "script factoring profit margins into relevance calculations")))上面这个算法用 Groovy 脚本表示如下:

[source,groovy]
-------------------------
@@ -57,13 +45,10 @@ if (price < threshold) { <2>
}
return price * (1 - discount) * margin / target <2>
-------------------------
-<1> The `price` and `margin` variables are extracted from the `price` and
-    `margin` fields in the document.
-<2> The `threshold`, `discount`, and `target` variables we will pass in as
-    `params`. 
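为便于验证,上面的 Groovy 算法可以等价地写成 Python(变量含义与书中一致,具体数值只是假设的示例):

```python
def profit_score(price, margin, threshold, discount, target):
    """高于阈值的价格先打折,再按利润率算出利润,最后除以目标利润得到评分因子。"""
    if price < threshold:
        profit = price * margin
    else:
        profit = price * (1 - discount) * margin
    return profit / target

# 价格 200、阈值 150、折扣 10%、利润率 30%、目标利润 50:
score = profit_score(price=200, margin=0.30, threshold=150, discount=0.10, target=50)
print(round(score, 2))   # 1.08,高于目标利润,因此评分因子大于 1.0
```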
+<1> `price` 和 `margin` 变量可以分别从文档的 `price` 和 `margin` 字段提取。
+<2> `threshold` 、 `discount` 和 `target` 是作为参数 `params` 传入的。

-Finally, we can add our `script_score` function to the list of other functions
-that we are already using:
+最终我们将 `script_score` 函数与其他函数一起使用:

[source,json]
-------------------------
@@ -89,35 +74,22 @@ GET /_search
}
}
-------------------------
-<1> The `location` and `price` clauses refer to the example explained in
-    <>.
-<2> By passing in these variables as `params`, we can change their values
-    every time we run this query without having to recompile the script.
-<3> JSON cannot include embedded newline characters. Newline characters in
-    the script should either be escaped as `\n` or replaced with semicolons.
+<1> `location` 和 `price` 语句在 <> 中解释过。
+<2> 将这些变量作为参数 `params` 传递,我们可以在每次执行查询时改变这些值,而无须重新编译脚本。
+<3> JSON 不能包含内嵌的换行符,脚本中的换行符应该转义为 `\n` 或用分号替代。

-This query would return the documents that best satisfy the user's
-requirements for location and price, while still factoring in our need to make
-a profit.
+这个查询根据用户对地点和价格的需求,返回用户最满意的文档,同时也考虑到我们对于盈利的要求。

[TIP]
========================================
-The `script_score` function provides enormous flexibility.((("scripts", "performance and"))) Within a script,
-you have access to the fields of the document, to the current `_score`, and
-even to the term frequencies, inverse document frequencies, and field length
-norms (see {ref}/modules-advanced-scripting.html[Text scoring in scripts]).
+`script_score` 函数提供了巨大的灵活性,((("scripts", "performance and")))可以通过脚本访问文档里的所有字段、当前评分 `_score` 甚至词频、逆向文档频率和字段长度规范值这样的信息(参见 {ref}/modules-advanced-scripting.html[脚本对文本评分])。

-That said, scripts can have a performance impact. If you do find that your
-scripts are not quite fast enough, you have three options:
+尽管如此,脚本的使用还是会对性能有所影响,如果确实发现脚本执行较慢,可以有以下三种选择:

-* Try to precalculate as much information as possible and include it in each
-  document. 
-* Groovy is fast, but not quite as fast as Java.((("Java", "scripting in"))) You could reimplement your - script as a native Java script. (See - {ref}/modules-scripting-native.html[Native Java Scripts]). -* Use the `rescore` functionality((("rescoring"))) described in <> to apply - your script to only the best-scoring documents. +* 尽可能多的提前计算各种信息并将结果存入每个文档中。 +* Groovy 很快,但没 Java 快。((("Java", "scripting in")))可以将脚本用原生的 Java 脚本重新实现。(参见 + {ref}/modules-scripting.html#native-java-scripts[原生 Java 脚本])。 +* 仅对那些最佳评分的文档应用脚本,使用 <> 中提到的 `rescore` 功能。((("rescoring"))) ======================================== - diff --git a/270_Fuzzy_matching/40_Fuzzy_match_query.asciidoc b/270_Fuzzy_matching/40_Fuzzy_match_query.asciidoc index 2a32ef3a4..8dc1ac471 100644 --- a/270_Fuzzy_matching/40_Fuzzy_match_query.asciidoc +++ b/270_Fuzzy_matching/40_Fuzzy_match_query.asciidoc @@ -1,7 +1,7 @@ [[fuzzy-match-query]] -=== Fuzzy match Query +=== 模糊匹配查询 -The `match` query supports ((("typoes and misspellings", "fuzzy match query")))((("match query", "fuzzy matching")))((("fuzzy matching", "match query")))fuzzy matching out of the box: +`match` 查询支持((("typoes and misspellings", "fuzzy match query")))((("match query", "fuzzy matching")))((("fuzzy matching", "match query")))开箱即用的模糊匹配: [source,json] ----------------------------------- @@ -19,11 +19,9 @@ GET /my_index/my_type/_search } ----------------------------------- -The query string is first analyzed, to produce the terms `[surprize, me]`, and -then each term is fuzzified using the specified `fuzziness`. 
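`fuzziness` 的底层度量是编辑距离( Elasticsearch 使用 Damerau-Levenshtein 距离,比下面的经典 Levenshtein 距离多支持一种相邻字符换位操作)。下面用 Python 写一个经典编辑距离的小示例,说明 `surprize` 为什么能在编辑距离 1 之内匹配 `surprise` :

```python
def levenshtein(a, b):
    """经典编辑距离:插入、删除、替换各计 1 步。
    (仅为示意;Elasticsearch 的 fuzziness 实际基于 Damerau-Levenshtein,还允许换位。)"""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

print(levenshtein("surprize", "surprise"))   # 1:只差一次替换
```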
+查询字符串首先进行分析,会产生词项 `[surprize, me]` ,并且每个词项根据指定的 `fuzziness` 进行模糊化。

-Similarly, the `multi_match` query also ((("multi_match queries", "fuzziness support")))supports `fuzziness`, but only when
-executing with type `best_fields` or `most_fields`:
+同样, `multi_match` 查询也((("multi_match queries", "fuzziness support")))支持 `fuzziness` ,但只有 `best_fields` 或者 `most_fields` 类型的查询才支持:

[source,json]
-----------------------------------
@@ -39,9 +37,6 @@ GET /my_index/my_type/_search
}
-----------------------------------

-Both the `match` and `multi_match` queries also support the `prefix_length`
-and `max_expansions` parameters.
-
-TIP: Fuzziness works only with the basic `match` and `multi_match` queries. It
-doesn't work with phrase matching, common terms, or `cross_fields` matches.
+`match` 和 `multi_match` 查询都支持 `prefix_length` 和 `max_expansions` 参数。
+
+TIP: 模糊性( Fuzziness )只能用于基本的 `match` 和 `multi_match` 查询,不能用于短语匹配、常用词项或 `cross_fields` 匹配。

diff --git a/320_Geohashes/40_Geohashes.asciidoc b/320_Geohashes/40_Geohashes.asciidoc
index e756ab090..35bdbb820 100644
--- a/320_Geohashes/40_Geohashes.asciidoc
+++ b/320_Geohashes/40_Geohashes.asciidoc
@@ -1,34 +1,15 @@
[[geohashes]]
== Geohashes

-http://en.wikipedia.org/wiki/Geohash[Geohashes] are a way of encoding
-`lat/lon` points as strings.((("geohashes")))((("latitude/longitude pairs", "encoding lat/lon points as strings with geohashes")))((("strings", "geohash"))) The original intention was to have a
-URL-friendly way of specifying geolocations, but geohashes have turned out to
-be a useful way of indexing geo-points and geo-shapes in databases.
-
-Geohashes divide the world into a grid of 32 cells--4 rows and 8 columns--each represented by a letter or number. The `g` cell covers half of
-Greenland, all of Iceland, and most of Great Britian. Each cell can be further
-divided into another 32 cells, which can be divided into another 32 cells,
-and so on. 
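这种反复二分网格的过程就是标准的 geohash 编码算法:经度和纬度区间交替二分,每 5 个比特映射为一个 base32 字符(字母表为 `0123456789bcdefghjkmnpqrstuvwxyz` )。下面是一个简短的 Python 实现草图:

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash(lat, lon, length):
    """标准 geohash 编码:经纬度区间交替二分,每 5 个比特映射为一个 base32 字符。"""
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    bits, even = [], True  # 从经度比特开始
    while len(bits) < length * 5:
        rng, val = (lon_range, lon) if even else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if val >= mid:
            bits.append(1)
            rng[0] = mid
        else:
            bits.append(0)
            rng[1] = mid
        even = not even
    chars = []
    for i in range(0, len(bits), 5):
        chars.append(BASE32[int("".join(map(str, bits[i:i + 5])), 2)])
    return "".join(chars)

# 白金汉宫入口(纬度 51.501568,经度 -0.141257):前缀与正文描述一致
print(geohash(51.501568, -0.141257, 5))   # gcpuu
```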
The `gc` cell covers Ireland and England, `gcp` covers most of
-London and part of Southern England, and `gcpuuz94k` is the entrance to
-Buckingham Palace, accurate to about 5 meters.
-
-In other words, the longer the geohash string, the more accurate it is. If
-two geohashes share a prefix— and `gcpuuz`—then it implies that
-they are near each other. The longer the shared prefix, the closer they
-are.
-
-That said, two locations that are right next to each other may have completely
-different geohashes. For instance, the
-http://en.wikipedia.org/wiki/Millennium_Dome[Millenium Dome] in London has
-geohash `u10hbp`, because it falls into the `u` cell, the next top-level cell
-to the east of the `g` cell.
-
-Geo-points can index their associated geohashes automatically, but more
-important, they can also index all geohash _prefixes_. Indexing the location
-of the entrance to Buckingham Palace--latitude `51.501568` and longitude
-`-0.141257`—would index all of the geohashes listed in the following table,
-along with the approximate dimensions of each geohash cell:
+http://en.wikipedia.org/wiki/Geohash[Geohashes] 是一种将经纬度坐标( `lat/lon` )编码成字符串的方式。((("geohashes")))((("latitude/longitude pairs", "encoding lat/lon points as strings with geohashes")))((("strings", "geohash")))这么做的初衷只是为了让地理位置在 URL 上呈现的形式更加友好,但现在 geohashes 已经变成一种在数据库中有效索引地理坐标点和地理形状的方式。
+
+Geohashes 把整个世界分为 32 个单元的格子 —— 4 行 8 列 —— 每一个格子都用一个字母或者数字标识。比如 `g` 这个单元覆盖了半个格林兰,冰岛的全部和大不列颠的大部分。每一个单元还可以进一步被分解成新的 32 个单元,这些单元又可以继续被分解成 32 个更小的单元,不断重复下去。 `gc` 这个单元覆盖了爱尔兰和英格兰, `gcp` 覆盖了伦敦的大部分和部分南英格兰, `gcpuuz94k` 是白金汉宫的入口,精确到约 5 米。
+
+换句话说, geohash 的长度越长,它的精度就越高。如果两个 geohashes 有一个共同的前缀— `gcpuuz`—就表示它们挨得很近。共同的前缀越长,距离就越近。
+
+这也意味着,两个刚好相邻的位置,可能会有完全不同的 geohash 。比如,伦敦 http://en.wikipedia.org/wiki/Millennium_Dome[Millenium Dome] 的 geohash 是 `u10hbp` ,因为它落在了 `u` 单元里,也就是 `g` 单元东面紧邻的那个顶层单元。
+
+地理坐标点可以自动索引相关的 geohashes ,更重要的是,它们也可以索引所有的 geohash _前缀_ 。如索引白金汉宫入口位置——纬度 `51.501568` ,经度 `-0.141257`—将会索引下面表格中列出的所有 geohashes ,表格中也给出了各个 geohash 
单元的近似尺寸:

[cols="1m,1m,3d",options="header"]
|=============================================
@@ -47,6 +28,5 @@ along with the approximate dimensions of each geohash cell:
|gcpuuz94kkp5 |12 | ~ 3.7cm x 1.8cm
|=============================================

-The {ref}/query-dsl-geohash-cell-query.html[`geohash_cell` filter] can use
-these geohash prefixes((("geohash_cell filter")))((("filters", "geohash_cell"))) to find locations near a specified `lat/lon` point.
+{ref}/query-dsl-geohash-cell-query.html[`geohash_cell` 过滤器] 可以使用这些 geohash 前缀((("geohash_cell filter")))((("filters", "geohash_cell")))来找出与指定坐标点( `lat/lon` )相邻的位置。

diff --git a/400_Relationships/20_Denormalization.asciidoc b/400_Relationships/20_Denormalization.asciidoc
index 9b72605f5..2a39c4e91 100644
--- a/400_Relationships/20_Denormalization.asciidoc
+++ b/400_Relationships/20_Denormalization.asciidoc
@@ -1,15 +1,11 @@
[[denormalization]]
-=== Denormalizing Your Data
+=== 非规范化你的数据

-The way to get the best search performance out of Elasticsearch is to use it
-as it is intended, by((("relationships", "denormalizing your data")))((("denormalization", "denormalizing data at index time")))
-http://en.wikipedia.org/wiki/Denormalization[denormalizing] your data at index
-time. Having redundant copies of data in each document that requires access to
-it removes the need for joins.
-If we want to be able to find a blog post by the name of the user who wrote it,
-include the user's name in the blog-post document itself:
+要从 Elasticsearch 获得最好的搜索性能,就要按照它的设计初衷来使用:在索引时对数据进行非规范化 ((("relationships", "denormalizing your data")))((("denormalization", "denormalizing data at index time")))
+http://en.wikipedia.org/wiki/Denormalization[denormalizing]。在每个需要访问这些数据的文档中冗余一份数据副本,就无需再执行联接操作。
+如果我们希望能够通过某个用户姓名找到他写的博客文章,可以在博客文档中包含这个用户的姓名:

[source,json]
--------------------------------
@@ -30,10 +26,9 @@ PUT /my_index/blogpost/2
}
}
--------------------------------
-<1> Part of the user's data has been denormalized into the `blogpost` document. 
+<1> 这部分用户的字段数据已被冗余到 `blogpost` 文档中。

-Now, we can find blog posts about `relationships` by users called `John`
-with a single query:
+现在,只需一次查询,就能找到名为 `John` 的用户所写的、与 `relationships` 相关的博客文章:

[source,json]
--------------------------------
@@ -50,7 +45,4 @@ GET /my_index/blogpost/_search
}
--------------------------------

-The advantage of data denormalization is speed. Because each document
-contains all of the information that is required to determine whether it
-matches the query, there is no need for expensive joins.
-
+数据非规范化的优点是速度快。因为每个文档都包含了判断其是否匹配查询所需的全部信息,无需执行昂贵的联接操作。

diff --git a/404_Parent_Child/40_Parent_child.asciidoc b/404_Parent_Child/40_Parent_child.asciidoc
index 15b64a23e..87cc431f1 100644
--- a/404_Parent_Child/40_Parent_child.asciidoc
+++ b/404_Parent_Child/40_Parent_child.asciidoc
@@ -1,54 +1,26 @@
[[parent-child]]
-== Parent-Child Relationship
+== 父-子关系文档

-The _parent-child_ relationship is ((("relationships", "parent-child")))((("parent-child relationship")))similar in nature to the
-<>: both allow you to associate one entity
-with another. ((("nested objects", "parent-child relationships versus")))The difference is that, with nested objects, all entities live
-within the same document while, with parent-child, the parent and children
-are completely separate documents.
+父-子关系文档 ((("relationships", "parent-child"))) ((("parent-child relationship"))) 在实质上类似于 <> :允许将一个对象实体和另外一个对象实体关联起来。((("nested objects", "parent-child relationships versus")))而这两种类型的主要区别是:在 <> 文档中,所有对象都是在同一个文档中,而在父-子关系文档中,父对象和子对象都是完全独立的文档。

-The parent-child functionality allows you to associate one document type with
-another, in a _one-to-many_ relationship--one parent to many children.((("one-to-many relationships"))) The
-advantages that parent-child has over <> are as follows:
+父-子关系的主要作用是允许把一个 type 的文档和另外一个 type 的文档关联起来,构成一对多的关系:一个父文档可以对应多个子文档 ((("one-to-many relationships"))) 。与 <> 相比,父-子关系的主要优势有:

-* The parent document can be updated without reindexing the children. 
+* 更新父文档时,不会重新索引子文档。
+* 创建、修改或删除子文档时,不会影响父文档或其他子文档。这一点在子文档数量较多、且需要频繁创建和修改时尤其有用。
+* 子文档可以作为搜索结果独立返回。

-* Child documents can be added, changed, or deleted without affecting either
-  the parent or other children. This is especially useful when child documents
-  are large in number and need to be added or changed frequently.
-
-* Child documents can be returned as the results of a search request.
-
-Elasticsearch maintains a map of which parents are associated with
-which children. It is thanks to this map that query-time joins are fast, but
-it does place a limitation on the parent-child relationship: _the parent
-document and all of its children must live on the same shard_.
-
-The parent-child ID maps are stored in <>, which allows them to execute
-quickly when fully hot in memory, but scalable enough to spill to disk when
-the map is very large.
+Elasticsearch 维护了一个父文档和子文档的映射关系,得益于这个映射,父-子文档关联查询操作非常快。但是这个映射也对父-子文档关系有个限制条件:父文档和其所有子文档,都必须要存储在同一个分片中。
+
+父-子文档 ID 映射存储在 <> 中。当映射完全在内存中时, <> 提供对映射的快速处理能力;另一方面,当映射非常大时,可以通过溢出到磁盘提供足够的扩展能力。

[[parent-child-mapping]]
-=== Parent-Child Mapping
+=== 父-子关系文档映射

-All that is needed in order to establish the parent-child relationship is to
-specify which document type should be the parent of a child type.((("mapping (types)", "parent-child")))((("parent-child relationship", "parent-child mapping"))) This must
-be done at index creation time, or with the `update-mapping` API before the
-child type has been created.
+建立父-子文档映射关系时只需要指定某一个文档 type 是另一个文档 type 的父类型。 ((("mapping (types)", "parent-child"))) ((("parent-child relationship", "parent-child mapping"))) 该关系可以在如下两个时间点设置:1)创建索引时;2)在子文档 type 创建之前,通过 `update-mapping` API 设置。

-As an example, let's say that we have a company that has branches in many
-cities. We would like to associate employees with the branch where they work.
-We need to be able to search for branches, individual employees, and employees
-who work for particular branches, so the nested model will not help. 
We
-could, of course,
-use <> or
-<> here instead, but for demonstration
-purposes we will use parent-child.
+举例说明,有一个公司在多个城市有分公司,并且每一个分公司下面都有很多员工。有这样的需求:按照分公司、员工的维度去搜索,并且把员工和他们工作的分公司联系起来。针对该需求,用嵌套模型是无法实现的。当然,如果使用 <> 或者 <> 也是可以实现的,但是为了演示的目的,在这里我们使用父-子文档。

-All that we have to do is to tell Elasticsearch that the `employee` type has
-the `branch` document type as its `_parent`, which we can do when we create
-the index:
+我们需要告诉 Elasticsearch :员工 `employee` 文档 type 以分公司 `branch` 的文档 type 作为其父类型( `_parent` ),这可以在创建索引时完成:

[source,json]
-------------------------
@@ -64,4 +36,4 @@ PUT /company
}
}
-------------------------
-<1> Documents of type `employee` are children of type `branch`.
+<1> `employee` 文档是 `branch` 文档的子文档。

diff --git a/404_Parent_Child/45_Indexing_parent_child.asciidoc b/404_Parent_Child/45_Indexing_parent_child.asciidoc
index 26fa7210a..9d6b9e15f 100644
--- a/404_Parent_Child/45_Indexing_parent_child.asciidoc
+++ b/404_Parent_Child/45_Indexing_parent_child.asciidoc
@@ -1,8 +1,7 @@
[[indexing-parent-child]]
-=== Indexing Parents and Children
+=== 构建父-子文档索引

-Indexing parent documents is no different from any other document. Parents
-don't need to know anything about their children:
+为父文档创建索引与为普通文档创建索引没有区别。父文档并不需要知道它有哪些子文档:

[source,json]
-------------------------
@@ -15,8 +14,7 @@ POST /company/branch/_bulk
{ "name": "Champs Élysées", "city": "Paris", "country": "France" }
-------------------------

-When indexing child documents, you must specify the ID of the associated
-parent document:
+创建子文档时,用户必须要通过 `parent` 参数来指定该子文档的父文档 ID:

[source,json]
-------------------------
@@ -27,31 +25,19 @@ PUT /company/employee/1?parent=london <1>
"hobby": "hiking"
}
-------------------------
-<1> This `employee` document is a child of the `london` branch.
+<1> 当前 `employee` 文档的父文档 ID 是 `london` 。

-This `parent` ID serves two purposes: it creates the link between the parent
-and the child, and it ensures that the child document is stored on the same
-shard as the parent. 
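父文档 ID 参与路由的效果可以用路由公式直观地验证。下面的 Python 草图用 `zlib.crc32` 代替 Elasticsearch 内部的哈希函数(仅为演示原理,真实的哈希算法并非如此):

```python
import zlib

def shard_for(routing, number_of_primary_shards=5):
    """shard = hash(routing) % number_of_primary_shards 的示意实现。"""
    return zlib.crc32(routing.encode()) % number_of_primary_shards

# 父文档 london 以自己的 _id 路由;子文档以 parent ID("london")而非自身 _id 路由,
# 因此两者必定落在同一个分片:
print(shard_for("london"))   # 父文档所在分片
print(shard_for("london"))   # 子文档(parent=london)所在分片,与父文档相同
print(shard_for("1"))        # 若子文档错误地以自身 _id 路由,则可能落到其他分片
```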
+父文档 ID 有两个作用:创建了父文档和子文档之间的关系,并且保证了父文档和子文档都在同一个分片上。 -In <>, we explained how Elasticsearch uses a routing value, -which defaults to the `_id` of the document, to decide which shard a document -should belong to. The routing value is plugged into this simple formula: +在 <> 中,我们解释了 Elasticsearch 如何通过路由值来决定该文档属于哪一个分片,路由值默认为该文档的 `_id` 。分片路由的计算公式如下: shard = hash(routing) % number_of_primary_shards -However, if a `parent` ID is specified, it is used as the routing value -instead of the `_id`. In other words, both the parent and the child use the -same routing value--the `_id` of the parent--and so they are both stored -on the same shard. +如果指定了父文档的 ID,那么就会使用父文档的 ID 进行路由,而不会使用当前文档 `_id` 。也就是说,如果父文档和子文档都使用相同的值进行路由,那么父文档和子文档都会确定分布在同一个分片上。 -The `parent` ID needs to be specified on all single-document requests: -when retrieving a child document with a `GET` request, or when indexing, -updating, or deleting a child document. Unlike a search request, which is -forwarded to all shards in an index, these single-document requests are -forwarded only to the shard that holds the document--if the `parent` ID is -not specified, the request will probably be forwarded to the wrong shard. +在执行单文档的请求时需要指定父文档的 ID,单文档请求包括:通过 `GET` 请求获取一个子文档;创建、更新或删除一个子文档。而执行搜索请求时是不需要指定父文档的ID,这是因为搜索请求是向一个索引中的所有分片发起请求,而单文档的操作是只会向存储该文档的分片发送请求。因此,如果操作单个子文档时不指定父文档的 ID,那么很有可能会把请求发送到错误的分片上。 -The `parent` ID should also be specified when using the `bulk` API: +父文档的 ID 应该在 `bulk` API 中指定 [source,json] ------------------------- @@ -64,8 +50,4 @@ POST /company/employee/_bulk { "name": "Adrien Grand", "dob": "1987-05-11", "hobby": "horses" } ------------------------- -WARNING: If you want to change the `parent` value of a child document, it is -not sufficient to just reindex or update the child document--the new parent -document may be on a different shard. Instead, you must first delete the old -child, and then index the new child. 
- +WARNING: 如果你想要改变一个子文档的 `parent` 值,仅通过更新这个子文档是不够的,因为新的父文档有可能在另外一个分片上。因此,你必须要先把子文档删除,然后再重新索引这个子文档。 diff --git a/404_Parent_Child/50_Has_child.asciidoc b/404_Parent_Child/50_Has_child.asciidoc index b448b202a..eb01e36f6 100644 --- a/404_Parent_Child/50_Has_child.asciidoc +++ b/404_Parent_Child/50_Has_child.asciidoc @@ -1,10 +1,7 @@ [[has-child]] -=== Finding Parents by Their Children - -The `has_child` query and filter can be used to find parent documents based on -the contents of their children.((("has_child query and filter")))((("parent-child relationship", "finding parents by their children"))) For instance, we could find all branches that -have employees born after 1980 with a query like this: +=== 通过子文档查询父文档 +`has_child` 的查询和过滤可以通过子文档的内容来查询父文档。((("has_child query and filter")))((("parent-child relationship", "finding parents by their children")))例如,我们根据如下查询,可查出所有80后员工所在的分公司: [source,json] ------------------------- GET /company/branch/_search @@ -24,16 +21,10 @@ GET /company/branch/_search } ------------------------- -Like the <>, the `has_child` query could -match several child documents,((("has_child query and filter", "query"))) each with a different relevance -score. How these scores are reduced to a single score for the parent document -depends on the `score_mode` parameter. The default setting is `none`, which -ignores the child scores and assigns a score of `1.0` to the parents, but it -also accepts `avg`, `min`, `max`, and `sum`. 
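`has_child` 的匹配行为可以用一小段 Python 来示意:只要某个父文档名下存在至少一个满足条件的子文档,该父文档就会被返回。下面的数据是仿照本章公司/雇员示例虚构的假设数据,字段取值并非原书示例的精确内容。

```python
# 假设数据:每个雇员(子文档)通过 parent 指向其分公司(父文档)
employees = [
    {"name": "Alice Smith",  "dob": "1970-10-24", "parent": "london"},
    {"name": "Mark Thomas",  "dob": "1982-05-16", "parent": "london"},
    {"name": "Barry Smith",  "dob": "1979-04-01", "parent": "liverpool"},
    {"name": "Adrien Grand", "dob": "1987-05-11", "parent": "paris"},
]

def branches_with_child(employees, predicate):
    # has_child 语义:存在任意一个匹配的子文档,父文档即被返回
    return sorted({e["parent"] for e in employees if predicate(e)})

born_after_1980 = lambda e: e["dob"] >= "1980-01-01"
print(branches_with_child(employees, born_after_1980))  # ['london', 'paris']
```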
+类似于 <> ,`has_child` 查询可以匹配多个子文档((("has_child query and filter", "query"))),并且每一个子文档的评分都不同。但是由于每一个子文档都带有评分,这些评分如何规约成父文档的总得分取决于 `score_mode` 这个参数。该参数有多种取值策略:默认为 `none` ,会忽略子文档的评分,并且会给父文档评分设置为 `1.0` ;
+除此以外还可以设置成 `avg` 、 `min` 、 `max` 和 `sum` 。

-The following query will return both `london` and `liverpool`, but `london`
-will get a better score because `Alice Smith` is a better match than
-`Barry Smith`:
+下面的查询将会同时返回 `london` 和 `liverpool` ,不过由于 `Alice Smith` 要比 `Barry Smith` 更加匹配查询条件,因此 `london` 会得到一个更高的评分。

[source,json]
-------------------------
@@ -53,19 +44,14 @@ GET /company/branch/_search
}
-------------------------

-TIP: The default `score_mode` of `none` is significantly faster than the other
-modes because Elasticsearch doesn't need to calculate the score for each child
-document. Set it to `avg`, `min`, `max`, or `sum` only if you care about the
-score.((("parent-child relationship", "finding parents by their children", "min_children and max_children")))
+TIP: `score_mode` 为默认的 `none` 时,会显著地比其他模式要快,这是因为 Elasticsearch 不需要计算每一个子文档的评分。只有当你真正需要关心评分结果时,才需要为 `score_mode` 设值,例如设成 `avg` 、 `min` 、 `max` 或 `sum` 。((("parent-child relationship", "finding parents by their children", "min_children and max_children")))

[[min-max-children]]
-==== min_children and max_children
+==== min_children 和 max_children

-The `has_child` query and filter both accept the `min_children` and
-`max_children` parameters,((("min_children parameter")))((("max_children parameter")))((("has_child query and filter", "min_children or max_children parameters"))) which will return the parent document only if the
-number of matching children is within the specified range.
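`score_mode` 的几种规约策略可以用一小段 Python 概念性地表达。这只是对参数语义的示意,并非 Elasticsearch 的实际实现:

```python
def reduce_child_scores(child_scores, score_mode="none"):
    # 将多个子文档的评分规约为父文档的单个评分
    if score_mode == "none":
        return 1.0                                   # 忽略子文档评分
    if score_mode == "avg":
        return sum(child_scores) / len(child_scores)
    if score_mode == "min":
        return min(child_scores)
    if score_mode == "max":
        return max(child_scores)
    if score_mode == "sum":
        return sum(child_scores)
    raise ValueError(f"unknown score_mode: {score_mode}")

scores = [0.2, 0.8]
print(reduce_child_scores(scores))          # 1.0(默认 none)
print(reduce_child_scores(scores, "avg"))   # 0.5
print(reduce_child_scores(scores, "max"))   # 0.8
```

从中也能直观看出为什么 `none` 最快:它根本不需要看子文档的评分。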
+`has_child` 的查询和过滤都可以接受这两个参数:`min_children` 和 `max_children` 。 ((("min_children parameter")))((("max_children parameter")))((("has_child query and filter", "min_children or max_children parameters"))) 使用这两个参数时,只有当子文档数量在指定范围内时,才会返回父文档。 -This query will match only branches that have at least two employees: +如下查询只会返回至少有两个雇员的分公司: [source,json] ------------------------- @@ -82,21 +68,14 @@ GET /company/branch/_search } } ------------------------- -<1> A branch must have at least two employees in order to match. +<1> 至少有两个雇员的分公司才会符合查询条件。 -The performance of a `has_child` query or filter with the `min_children` or -`max_children` parameters is much the same as a `has_child` query with scoring -enabled. +带有 `min_children` 和 `max_children` 参数的 `has_child` 查询或过滤,和允许评分的 `has_child` 查询的性能非常接近。 .has_child Filter ************************** -The `has_child` filter works((("has_child query and filter", "filter"))) in the same way as the `has_child` query, except -that it doesn't support the `score_mode` parameter. It can be used only in -_filter context_—such as inside a `filtered` query--and behaves -like any other filter: it includes or excludes, but doesn't score. - -While the results of a `has_child` filter are not cached, the usual caching -rules apply to the filter _inside_ the `has_child` filter. 
+`has_child` 查询和过滤在运行机制上类似,((("has_child query and filter", "filter")))区别是 `has_child` 过滤不支持 `score_mode` 参数。`has_child` 过滤仅能用于 _过滤上下文_ (例如在一个 `filtered` 查询内部),它和其他过滤行为类似:只做包含或者排除,不进行评分。
+`has_child` 过滤的结果不会被缓存,但是 `has_child` 过滤 _内部_ 的过滤器仍然适用通常的缓存规则。
**************************
diff --git a/404_Parent_Child/55_Has_parent.asciidoc b/404_Parent_Child/55_Has_parent.asciidoc
index fcc37e404..d616e2d17 100644
--- a/404_Parent_Child/55_Has_parent.asciidoc
+++ b/404_Parent_Child/55_Has_parent.asciidoc
@@ -1,14 +1,9 @@
[[has-parent]]
-=== Finding Children by Their Parents
+=== 通过父文档查询子文档

-While a `nested` query can always ((("parent-child relationship", "finding children by their parents")))return only the root document as a result,
-parent and child documents are independent and each can be queried
-independently. The `has_child` query allows us to return parents based on
-data in their children, and the `has_parent` query returns children based on
-data in their parents.((("has_parent query and filter", "query")))
+虽然 `nested` 查询只能返回最顶层的文档 ((("parent-child relationship", "finding children by their parents"))),但是父文档和子文档本身是彼此独立并且可被单独查询的。我们使用 `has_child` 语句可以基于子文档来查询父文档,使用 `has_parent` 语句则可以基于父文档来查询子文档。 ((("has_parent query and filter", "query")))

-It looks very similar to the `has_child` query. This example returns
-employees who work in the UK:
+`has_parent` 和 `has_child` 非常相似,下面的查询将会返回所有在 UK 工作的雇员:

[source,json]
-------------------------
@@ -26,19 +21,13 @@ GET /company/employee/_search
}
}
-------------------------
-<1> Returns children who have parents of type `branch`
+<1> 返回父文档 `type` 是 `branch` 的所有子文档

-The `has_parent` query also supports the `score_mode`,((("score_mode parameter"))) but it accepts only two
-settings: `none` (the default) and `score`. Each child can have only one
-parent, so there is no need to reduce multiple scores into a single score for
-the child. The choice is simply between using the score (`score`) or not
-(`none`).
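与 `has_child` 相反, `has_parent` 的匹配方向是:父文档满足条件时,返回它名下的所有子文档。下面用一小段 Python 示意这个方向(数据为仿照本章示例虚构的假设数据):

```python
# 假设数据:父文档(分公司)与子文档(雇员)
branches = {
    "london":    {"city": "London",    "country": "UK"},
    "liverpool": {"city": "Liverpool", "country": "UK"},
    "paris":     {"city": "Paris",     "country": "France"},
}
employees = [
    {"name": "Alice Smith",  "parent": "london"},
    {"name": "Barry Smith",  "parent": "liverpool"},
    {"name": "Adrien Grand", "parent": "paris"},
]

def children_with_parent(employees, branches, predicate):
    # has_parent 语义:父文档匹配时,返回它的所有子文档
    return [e["name"] for e in employees if predicate(branches[e["parent"]])]

print(children_with_parent(employees, branches, lambda b: b["country"] == "UK"))
# ['Alice Smith', 'Barry Smith']
```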
+`has_parent` 查询也支持 `score_mode` 这个参数,((("score_mode parameter")))但是该参数只支持两种值: `none` (默认)和 `score` 。每个子文档都只有一个父文档,因此这里不存在将多个评分规约为一个的情况, `score_mode` 的取值仅为 `score` 和 `none` 。 -.Non-scoring has_parent Query +.不带评分的 has_parent 查询 ************************** -When used in non-scoring mode (e.g. inside a `filter` clause), the `has_parent` -query no longer supports the `score_mode` parameter. Because it is merely -including/excluding documents and not scoring, the `score_mode` parameter -no longer applies. +当 `has_parent` 查询用于非评分模式(比如 filter 查询语句)时, `score_mode` 参数就不再起作用了。因为这种模式只是简单地包含或排除文档,没有评分,那么 `score_mode` 参数也就没有意义了。 + ************************** diff --git a/404_Parent_Child/60_Children_agg.asciidoc b/404_Parent_Child/60_Children_agg.asciidoc index 6af80f0ec..0363fa2e6 100644 --- a/404_Parent_Child/60_Children_agg.asciidoc +++ b/404_Parent_Child/60_Children_agg.asciidoc @@ -1,14 +1,10 @@ [[children-agg]] -=== Children Aggregation +=== 子文档聚合 -Parent-child supports a -http://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-children-aggregation.html[`children` aggregation] as ((("aggregations", "children aggregation")))((("children aggregation")))((("parent-child relationship", "children aggregation")))a direct analog to the `nested` aggregation discussed in -<>. A parent aggregation (the equivalent of -`reverse_nested`) is not supported. 
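`children` 聚合“先按父文档字段分桶、再通过父子关系关联子文档做统计”的过程,可以用一小段 Python 直观示意(数据为虚构的假设数据,仅演示分桶逻辑):

```python
from collections import Counter, defaultdict

# 假设数据:父文档(分公司)带 country 字段,子文档(雇员)带 hobby 字段
branches = {
    "london":    {"country": "UK"},
    "liverpool": {"country": "UK"},
    "paris":     {"country": "France"},
}
employees = [
    {"parent": "london",    "hobby": "hiking"},
    {"parent": "liverpool", "hobby": "hiking"},
    {"parent": "paris",     "hobby": "horses"},
]

# 先按 country 对父文档分桶,再借助父子关系统计子文档的 hobby
hobbies_by_country = defaultdict(Counter)
for e in employees:
    country = branches[e["parent"]]["country"]
    hobbies_by_country[country][e["hobby"]] += 1

print(dict(hobbies_by_country))
# {'UK': Counter({'hiking': 2}), 'France': Counter({'horses': 1})}
```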
-
-This example demonstrates how we could determine the favorite hobbies of our
-employees by country:
+在父-子文档中支持
+http://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-children-aggregation.html[子文档聚合],这一点和((("aggregations", "children aggregation")))((("children aggregation")))((("parent-child relationship", "children aggregation"))) <> 类似。但是,父聚合(即与 `reverse_nested` 等价的聚合)是不被支持的。
+我们通过下面的例子来演示按照国家维度查看最受雇员欢迎的业余爱好:

[source,json]
-------------------------
GET /company/branch/_search
@@ -37,7 +33,6 @@ GET /company/branch/_search
}
}
-------------------------
-<1> The `country` field in the `branch` documents.
-<2> The `children` aggregation joins the parent documents with
-    their associated children of type `employee`.
-<3> The `hobby` field from the `employee` child documents.
+<1> `country` 是 `branch` 文档的一个字段。
+<2> `children` 聚合将父文档和它们 `employee` 类型的子文档关联起来。
+<3> `hobby` 是 `employee` 子文档的一个字段。
diff --git a/410_Scaling/10_Intro.asciidoc b/410_Scaling/10_Intro.asciidoc
index fc4d8ec0d..8cd3e87c8 100644
--- a/410_Scaling/10_Intro.asciidoc
+++ b/410_Scaling/10_Intro.asciidoc
@@ -1,29 +1,17 @@
[[scale]]
-== Designing for Scale
+== 扩容设计

-Elasticsearch is used by some companies to index ((("scaling", "designing for scale")))and search petabytes of data
-every day, but most of us start out with something a little more humble in
-size. Even if we aspire to be the next Facebook, it is unlikely that our bank
-balance matches our aspirations. We need to build for what we have today, but
-in a way that will allow us to scale out flexibly and rapidly.
+一些公司每天使用 Elasticsearch((("scaling", "designing for scale"))) 索引检索 PB 级数据,
+但我们中的大多数都起步于规模稍逊的项目。即使我们立志成为下一个 Facebook,我们的银行卡余额却也跟不上梦想的脚步。
+我们需要为今日所需而构建,但也要允许我们可以灵活而又快速地进行水平扩展。

-Elasticsearch is built to scale. It will run very happily on your laptop or
-in a cluster containing hundreds of nodes, and the experience is almost
-identical.
Growing from a small cluster to a large cluster is almost entirely -automatic and painless. Growing from a large cluster to a very large cluster -requires a bit more planning and design, but it is still relatively painless. +Elasticsearch 为了可扩展性而生。它可以良好地运行于你的笔记本电脑又或者一个拥有数百节点的集群,同时用户体验基本相同。 +由小规模集群增长为大规模集群的过程几乎完全自动化并且无痛。由大规模集群增长为超大规模集群需要一些规划和设计,但还是相对地无痛。 -Of course, it is not magic. Elasticsearch has its limitations too. If you -are aware of those limitations and work with them, the growing process will be -pleasant. If you treat Elasticsearch badly, you could be in for a world of -pain. +当然这一切并不是魔法。Elasticsearch 也有它的局限性。如果你了解这些局限性并能够与之相处,集群扩容的过程将会是愉快的。 +如果你对 Elasticsearch 处理不当,那么你将处于一个充满痛苦的世界。 -The default settings in Elasticsearch will take you a long way, but to get the -most bang for your buck, you need to think about how data flows through your -system. We will talk about two common data flows: time-based data (such as log -events or social network streams, where relevance is driven by recency), and -user-based data (where a large document collection can be subdivided by user or -customer). +Elasticsearch 的默认设置会伴你走过很长的一段路,但为了发挥它最大的效用,你需要考虑数据是如何流经你的系统的。 +我们将讨论两种常见的数据流:时序数据(时间驱动相关性,例如日志或社交网络数据流),以及基于用户的数据(拥有很大的文档集但可以按用户或客户细分)。 -This chapter will help you make the right decisions up front, to avoid -nasty surprises later. +这一章将帮助你在遇到不愉快之前做出正确的选择。 diff --git a/500_Cluster_Admin/10_intro.asciidoc b/500_Cluster_Admin/10_intro.asciidoc index e9517685d..32729c0c5 100644 --- a/500_Cluster_Admin/10_intro.asciidoc +++ b/500_Cluster_Admin/10_intro.asciidoc @@ -1,15 +1,6 @@ -Elasticsearch is often deployed as a cluster of nodes.((("clusters", "administration"))) A variety of -APIs let you manage and monitor the cluster itself, rather than interact -with the data stored within the cluster. 
+Elasticsearch 经常以多节点集群的方式部署。((("clusters", "administration")))有多种 API 让你可以管理和监控集群本身,而不用和集群里存储的数据打交道。 -As with most functionality in Elasticsearch, there is an overarching design goal -that tasks should be performed through an API rather than by modifying static -configuration files. This becomes especially important as your cluster scales. -Even with a provisioning system (such as Puppet, Chef, and Ansible), a single HTTP API call -is often simpler than pushing new configurations to hundreds of physical machines. +和 Elasticsearch 里绝大多数功能一样,我们有一个总体的设计目标,即任务应该通过 API 执行,而不是通过修改静态的配置文件。这一点在你的集群扩容时尤为重要。即便通过配置管理系统(比如 Puppet,Chef 或者 Ansible),一个简单的 HTTP API 调用,也比往上百台物理设备上推送新配置文件简单多了。 -To that end, this chapter presents the various APIs that allow you to -dynamically tweak, tune, and configure your cluster. It also covers a -host of APIs that provide statistics about the cluster itself so you can -monitor for health and performance. +因此,本章将介绍各种可以让你动态调整、调优和调配集群的 API。同时,还会介绍一系列提供集群自身统计数据的 API,你可以用这些接口来监控集群健康状态和性能。 diff --git a/510_Deployment/20_hardware.asciidoc b/510_Deployment/20_hardware.asciidoc index ec3954462..d7feccb06 100644 --- a/510_Deployment/20_hardware.asciidoc +++ b/510_Deployment/20_hardware.asciidoc @@ -1,116 +1,69 @@ [[hardware]] -=== Hardware - -If you've been following the normal development path, you've probably been playing((("deployment", "hardware")))((("hardware"))) -with Elasticsearch on your laptop or on a small cluster of machines lying around. -But when it comes time to deploy Elasticsearch to production, there are a few -recommendations that you should consider. Nothing is a hard-and-fast rule; -Elasticsearch is used for a wide range of tasks and on a bewildering array of -machines. But these recommendations provide good starting points based on our experience with -production clusters. 
- -==== Memory - -If there is one resource that you will run out of first, it will likely be memory.((("hardware", "memory")))((("memory"))) -Sorting and aggregations can both be memory hungry, so enough heap space to -accommodate these is important.((("heap"))) Even when the heap is comparatively small, -extra memory can be given to the OS filesystem cache. Because many data structures -used by Lucene are disk-based formats, Elasticsearch leverages the OS cache to -great effect. - -A machine with 64 GB of RAM is the ideal sweet spot, but 32 GB and 16 GB machines -are also common. Less than 8 GB tends to be counterproductive (you end up -needing many, many small machines), and greater than 64 GB has problems that we will -discuss in <>. +=== 硬件 + +按照正常的流程,((("deployment", "hardware")))((("hardware")))你可能已经在自己的笔记本电脑或集群上使用了 Elasticsearch。 +但是当要部署 Elasticsearch 到生产环境时,有一些建议是你需要考虑的。这里没有什么必须要遵守的准则,Elasticsearch 被用于在众多的机器上处理各种任务。基于我们在生产环境使用 Elasticsearch 集群的经验,这些建议可以为你提供一个好的起点。 + +==== 内存 + +如果有一种资源是最先被耗尽的,它可能是内存。((("hardware", "memory")))((("memory")))排序和聚合都很耗内存,所以有足够的堆空间来应付它们是很重要的。((("heap")))即使堆空间是比较小的时候, +也能为操作系统文件缓存提供额外的内存。因为 Lucene 使用的许多数据结构是基于磁盘的格式,Elasticsearch 利用操作系统缓存能产生很大效果。 + +64 GB 内存的机器是非常理想的, 但是32 GB 和16 GB 机器也是很常见的。少于8 GB 会适得其反(你最终需要很多很多的小机器),大于64 GB 的机器也会有问题, +我们将在 <> 中讨论。 ==== CPUs -Most Elasticsearch deployments tend to be rather light on CPU requirements. As -such,((("CPUs (central processing units)")))((("hardware", "CPUs"))) the exact processor setup matters less than the other resources. You should -choose a modern processor with multiple cores. Common clusters utilize two- to eight-core machines. +大多数 Elasticsearch 部署往往对 CPU 要求不高。因此,((("CPUs (central processing units)")))((("hardware", "CPUs")))相对其它资源,具体配置多少个(CPU)不是那么关键。你应该选择具有多个内核的现代处理器,常见的集群使用两到八个核的机器。 -If you need to choose between faster CPUs or more cores, choose more cores. The -extra concurrency that multiple cores offers will far outweigh a slightly faster -clock speed. 
+如果你要在更快的 CPUs 和更多的核心之间选择,选择更多的核心更好。多个内核提供的额外并发远胜过稍微快一点点的时钟频率。 -==== Disks +==== 硬盘 -Disks are important for all clusters,((("disks")))((("hardware", "disks"))) and doubly so for indexing-heavy clusters -(such as those that ingest log data). Disks are the slowest subsystem in a server, -which means that write-heavy clusters can easily saturate their disks, which in -turn become the bottleneck of the cluster. +硬盘对所有的集群都很重要,((("disks")))((("hardware", "disks")))对大量写入的集群更是加倍重要(例如那些存储日志数据的)。硬盘是服务器上最慢的子系统,这意味着那些写入量很大的集群很容易让硬盘饱和,使得它成为集群的瓶颈。 -If you can afford SSDs, they are by far superior to any spinning media. SSD-backed -nodes see boosts in both query and indexing performance. If you can afford it, -SSDs are the way to go. +如果你负担得起 SSD,它将远远超出任何旋转介质(注:机械硬盘,磁带等)。 基于 SSD 的节点,查询和索引性能都有提升。如果你负担得起,SSD 是一个好的选择。 -.Check Your I/O Scheduler -**** -If you are using SSDs, make sure your OS I/O scheduler is((("I/O scheduler"))) configured correctly. -When you write data to disk, the I/O scheduler decides when that data is -_actually_ sent to the disk. The default under most *nix distributions is a -scheduler called `cfq` (Completely Fair Queuing). - -This scheduler allocates _time slices_ to each process, and then optimizes the -delivery of these various queues to the disk. It is optimized for spinning media: -the nature of rotating platters means it is more efficient to write data to disk -based on physical layout. - -This is inefficient for SSD, however, since there are no spinning platters -involved. Instead, `deadline` or `noop` should be used instead. The deadline -scheduler optimizes based on how long writes have been pending, while `noop` -is just a simple FIFO queue. - -This simple change can have dramatic impacts. We've seen a 500-fold improvement -to write throughput just by using the correct scheduler. 
+.检查你的 I/O 调度程序
****
+如果你正在使用 SSDs,确保你的系统 I/O 调度程序((("I/O scheduler")))配置正确。
+当你向硬盘写数据,I/O 调度程序决定何时把数据实际发送到硬盘。
+大多数默认 *nix 发行版下的调度程序都叫做 `cfq`(完全公平队列)。
+
+这个调度程序会为每个进程分配 _时间片_ ,并优化这些队列里的数据到硬盘的传递顺序。它是为旋转介质优化的:
+机械硬盘的固有特性意味着按照物理布局顺序写入数据会更高效。

-If you use spinning media, try to obtain the fastest disks possible (high-performance server disks, 15k RPM drives).
+然而这对 SSD 来说是低效的,因为 SSD 并不存在旋转的盘片。这种情况下应该改用 `deadline` 或者 `noop` 调度程序。`deadline` 调度程序基于写入等待时间进行优化,
+`noop` 只是一个简单的 FIFO 队列。

-Using RAID 0 is an effective way to increase disk speed, for both spinning disks
-and SSD. There is no need to use mirroring or parity variants of RAID, since
-high availability is built into Elasticsearch via replicas.
+这个简单的更改可以带来显著的影响。仅仅是使用正确的调度程序,我们就看到了 500 倍的写入能力提升。
+****

-Finally, avoid network-attached storage (NAS). People routinely claim their
-NAS solution is faster and more reliable than local drives. Despite these claims,
-we have never seen NAS live up to its hype. NAS is often slower, displays
-larger latencies with a wider deviation in average latency, and is a single
-point of failure.
+如果你使用旋转介质,尝试获取尽可能快的硬盘(高性能服务器硬盘,15k RPM 驱动器)。

-==== Network
+使用 RAID 0 是提高硬盘速度的有效途径,对机械硬盘和 SSD 来说都是如此。没有必要使用镜像或其它 RAID 变体,因为高可用已经通过 replicas 内建于 Elasticsearch 之中。

-A fast and reliable network is obviously important to performance in a distributed((("hardware", "network")))((("network")))
-system. Low latency helps ensure that nodes can communicate easily, while
-high bandwidth helps shard movement and recovery. Modern data-center networking
-(1 GbE, 10 GbE) is sufficient for the vast majority of clusters.
+最后,避免使用网络附加存储(NAS)。人们常声称他们的 NAS 解决方案比本地驱动器更快更可靠。除却这些声称,
+我们从没看到 NAS 能配得上它的大肆宣传。NAS 常常很慢,显露出更大的延时和更宽的平均延时方差,而且它是单点故障的。

-Avoid clusters that span multiple data centers, even if the data centers are
-colocated in close proximity. Definitely avoid clusters that span large geographic
-distances.
+==== 网络 -Elasticsearch clusters assume that all nodes are equal--not that half the nodes -are actually 150ms distant in another data center. Larger latencies tend to -exacerbate problems in distributed systems and make debugging and resolution -more difficult. +快速可靠的网络显然对分布式系统的性能是很重要的((("hardware", "network")))((("network")))。 +低延时能帮助确保节点间能容易的通讯,大带宽能帮助分片移动和恢复。现代数据中心网络(1 GbE, 10 GbE)对绝大多数集群都是足够的。 -Similar to the NAS argument, everyone claims that their pipe between data centers is -robust and low latency. This is true--until it isn't (a network failure will -happen eventually; you can count on it). From our experience, the hassle of -managing cross–data center clusters is simply not worth the cost. +即使数据中心们近在咫尺,也要避免集群跨越多个数据中心。绝对要避免集群跨越大的地理距离。 -==== General Considerations +Elasticsearch 假定所有节点都是平等的--并不会因为有一半的节点在150ms 外的另一数据中心而有所不同。更大的延时会加重分布式系统中的问题而且使得调试和排错更困难。 -It is possible nowadays to obtain truly enormous machines:((("hardware", "general considerations"))) hundreds of gigabytes -of RAM with dozens of CPU cores. Conversely, it is also possible to spin up -thousands of small virtual machines in cloud platforms such as EC2. Which -approach is best? +和 NAS 的争论类似,每个人都声称他们的数据中心间的线路都是健壮和低延时的。这是真的--直到它不是时(网络失败终究是会发生的,你可以相信它)。 +从我们的经验来看,处理跨数据中心集群的麻烦事是根本不值得的。 -In general, it is better to prefer medium-to-large boxes. Avoid small machines, -because you don't want to manage a cluster with a thousand nodes, and the overhead -of simply running Elasticsearch is more apparent on such small boxes. +==== 总则 -At the same time, avoid the truly enormous machines. They often lead to imbalanced -resource usage (for example, all the memory is being used, but none of the CPU) and can -add logistical complexity if you have to run multiple nodes per machine. +获取真正的高配机器在今天是可能的:((("hardware", "general considerations")))成百 GB 的 RAM 和几十个 CPU 核心。 +反之,在云平台上串联起成千的小虚拟机也是可能的,例如 EC2。哪种方式是最好的? 
+通常,选择中配或者高配机器更好。避免使用低配机器, +因为你不会希望去管理拥有上千个节点的集群,而且在这些低配机器上运行 Elasticsearch 的开销也是显著的。 +与此同时,避免使用真正的高配机器。它们通常会导致资源使用不均衡(例如,所有的内存都被使用,但 CPU 却没有)而且在单机上运行多个节点时,会增加逻辑复杂度。 diff --git a/510_Deployment/40_config.asciidoc b/510_Deployment/40_config.asciidoc index 94ea5404b..5d3183c0f 100644 --- a/510_Deployment/40_config.asciidoc +++ b/510_Deployment/40_config.asciidoc @@ -1,53 +1,43 @@ [[important-configuration-changes]] -=== Important Configuration Changes -Elasticsearch ships with _very good_ defaults,((("deployment", "configuration changes, important")))((("configuration changes, important"))) especially when it comes to performance- -related settings and options. When in doubt, just leave -the settings alone. We have witnessed countless dozens of clusters ruined -by errant settings because the administrator thought he could turn a knob -and gain 100-fold improvement. + +=== 重要配置的修改 +Elasticsearch 已经有了 _很好_ 的默认值,((("deployment", "configuration changes, important")))((("configuration changes, important")))特别是涉及到性能相关的配置或者选项。 +如果你有疑问,最好就不要动它。我们已经目睹了数十个因为错误的设置而导致毁灭的集群, +因为它的管理者总认为改动一个配置或者选项就可以带来 100 倍的提升。 [NOTE] ==== -Please read this entire section! All configurations presented are equally -important, and are not listed in any particular order. Please read -through all configuration options and apply them to your cluster. +请阅读整节文章,所有的配置项都同等重要,和描述顺序无关,请阅读所有的配置选项,并应用到你的集群中。 ==== -Other databases may require tuning, but by and large, Elasticsearch does not. -If you are hitting performance problems, the solution is usually better data -layout or more nodes. There are very few "magic knobs" in Elasticsearch. -If there were, we'd have turned them already! - -With that said, there are some _logistical_ configurations that should be changed -for production. These changes are necessary either to make your life easier, or because -there is no way to set a good default (because it depends on your cluster layout). 
+其它数据库可能需要调优,但总得来说,Elasticsearch 不需要。
+如果你遇到了性能问题,解决方法通常是更好的数据布局或者更多的节点。
+在 Elasticsearch 中很少有“神奇的配置项”,
+如果存在,我们也已经帮你优化了!
+话虽如此,有一些 _运维方面的_ (logistical)配置在生产环境中是应该调整的。
+之所以需要这些调整,要么是因为它们能让你的工作更轻松,要么是因为无法设定一个通用的默认值(它取决于你的集群布局)。

-==== Assign Names
+==== 指定名字

-Elasticseach by default starts a cluster named `elasticsearch`. ((("configuration changes, important", "assigning names"))) It is wise
-to rename your production cluster to something else, simply to prevent accidents
-whereby someone's laptop joins the cluster. A simple change to `elasticsearch_production`
-can save a lot of heartache.
+Elasticsearch 默认启动的集群名字叫 `elasticsearch` 。((("configuration changes, important", "assigning names")))你最好给你的生产环境的集群改个名字,改名字的目的很简单,
+就是防止某人的笔记本电脑加入了集群这种意外。简单修改成 `elasticsearch_production` 会很省心。

-This can be changed in your `elasticsearch.yml` file:
+你可以在你的 `elasticsearch.yml` 文件中修改:

[source,yaml]
----
cluster.name: elasticsearch_production
----

-Similarly, it is wise to change the names of your nodes. As you've probably
-noticed by now, Elasticsearch assigns a random Marvel superhero name
-to your nodes at startup. This is cute in development--but less cute when it is
-3a.m. and you are trying to remember which physical machine was Tagak the Leopard Lord.
+同样,最好也修改你的节点名字。就像你现在可能发现的那样,
+Elasticsearch 会在你的节点启动的时候随机给它指定一个名字。你可能会觉得这很有趣,但是当凌晨 3 点钟的时候,
+你还在尝试回忆哪台物理机是 Tagak the Leopard Lord 的时候,你就不觉得有趣了。

-More important, since these names are generated on startup, each time you
-restart your node, it will get a new name. This can make logs confusing,
-since the names of all the nodes are constantly changing.
+更重要的是,这些名字是在启动的时候产生的,每次启动节点,
+它都会得到一个新的名字。这会使日志变得很混乱,因为所有节点的名称都是不断变化的。

-Boring as it might be, we recommend you give each node a name that makes sense
-to you--a plain, descriptive name.
This is also configured in your `elasticsearch.yml`: +这可能会让你觉得厌烦,我们建议给每个节点设置一个有意义的、清楚的、描述性的名字,同样你可以在 `elasticsearch.yml` 中配置: [source,yaml] ---- @@ -55,19 +45,17 @@ node.name: elasticsearch_005_data ---- -==== Paths +==== 路径 -By default, Elasticsearch will place the plug-ins,((("configuration changes, important", "paths"))) -((("paths"))) logs, and--most important--your data in the installation directory. This can lead to -unfortunate accidents, whereby the installation directory is accidentally overwritten -by a new installation of Elasticsearch. If you aren't careful, you can erase all your data. +默认情况下,((("configuration changes, important", "paths")))((("paths")))Elasticsearch 会把插件、日志以及你最重要的数据放在安装目录下。这会带来不幸的事故, +如果你重新安装 Elasticsearch 的时候不小心把安装目录覆盖了。如果你不小心,你就可能把你的全部数据删掉了。 -Don't laugh--we've seen it happen more than a few times. +不要笑,这种情况,我们见过很多次了。 -The best thing to do is relocate your data directory outside the installation -location. You can optionally move your plug-in and log directories as well. +最好的选择就是把你的数据目录配置到安装目录以外的地方, +同样你也可以选择转移你的插件和日志目录。 -This can be changed as follows: +可以更改如下: [source,yaml] ---- @@ -79,77 +67,60 @@ path.logs: /path/to/logs # Path to where plugins are installed: path.plugins: /path/to/plugins ---- -<1> Notice that you can specify more than one directory for data by using comma-separated lists. +<1> 注意:你可以通过逗号分隔指定多个目录。 -Data can be saved to multiple directories, and if each directory -is mounted on a different hard drive, this is a simple and effective way to -set up a software RAID 0. Elasticsearch will automatically stripe -data between the different directories, boosting performance. +数据可以保存到多个不同的目录, +如果将每个目录分别挂载不同的硬盘,这可是一个简单且高效实现一个软磁盘阵列( RAID 0 )的办法。Elasticsearch 会自动把条带化(注:RAID 0 又称为 Stripe(条带化),在磁盘阵列中,数据是以条带的方式贯穿在磁盘阵列所有硬盘中的) +数据分隔到不同的目录,以便提高性能。 -.Multiple data path safety and performance +.多个数据路径的安全性和性能 [WARNING] ==================== -Like any RAID 0 configuration, only a single copy of your data is saved to the -hard drives. 
If you lose a hard drive, you are _guaranteed_ to lose a portion
-of your data on that machine. With luck you'll have replicas elsewhere in the
-cluster which can recover the data, and/or a recent <>.

-Elasticsearch attempts to minimize the extent of data loss by striping entire
-shards to a drive. That means that `Shard 0` will be placed entirely on a single
-drive. Elasticsearch will not stripe a shard across multiple drives, since the
-loss of one drive would corrupt the entire shard.

-This has ramifications for performance: if you are adding multiple drives
-to improve the performance of a single index, it is unlikely to help since
-most nodes will only have one shard, and thus one active drive. Multiple data
-paths only helps if you have many indices/shards on a single node.

-Multiple data paths is a nice convenience feature, but at the end of the day,
-Elasticsearch is not a software RAID package. If you need more advanced configuration,
-robustness and flexibility, we encourage you to use actual software RAID packages
-instead of the multiple data path feature.
+如同任何磁盘阵列( RAID 0 )的配置,硬盘上只会保存单一的一份数据拷贝。如果你失去了一个硬盘驱动器,你 _肯定_ 会失去该机器上的一部分数据。
+运气好的话,这部分数据的副本还存在于集群的其它地方,可以用来恢复数据;你也可能有一个最近的 <> 备份。
+
+Elasticsearch 会把整个分片完整地放到同一个驱动器上,以此把数据丢失的范围降到最小。这意味着 `分片 0` 将被完整地放置在单个驱动器上。
+Elasticsearch 不会把一个分片条带化到多个驱动器上,因为任何一个驱动器的损失都会破坏整个分片。
+
+这对性能的影响是:如果你添加多个驱动器是为了提升 _单个_ 索引的性能,那多半没什么帮助,因为大多数节点上只会有该索引的一个分片,也就只有一个处于活动状态的驱动器。只有当单个节点上有很多索引/分片时,多个数据路径才会有帮助。
+
+多个数据路径是一个非常方便的功能,但归根结底,Elasticsearch 并不是一个软 RAID( software RAID )软件包。如果你需要更高级、更稳健、更灵活的配置,
+我们建议你使用真正的软 RAID 软件包,而不是多个数据路径这个功能。
====================

-==== Minimum Master Nodes
+==== 最小主节点数

-The `minimum_master_nodes` setting is _extremely_ important to the
-stability of your cluster.((("configuration changes, important", "minimum_master_nodes setting")))((("minimum_master_nodes setting"))) This setting helps prevent _split brains_, the existence of two masters in a single cluster.
+`minimum_master_nodes` 设定对你的集群的稳定 _极其_ 重要。
+((("configuration changes, important", "minimum_master_nodes setting")))((("minimum_master_nodes setting")))
+这个配置有助于防止 _脑裂_ ,即同一个集群中同时存在两个主节点( master )的现象。

-When you have a split brain, your cluster is at danger of losing data. Because
-the master is considered the supreme ruler of the cluster, it decides
-when new indices can be created, how shards are moved, and so forth. If you have _two_
-masters, data integrity becomes perilous, since you have two nodes
-that think they are in charge.
+如果你的集群发生了脑裂,那么你的集群就会处在丢失数据的危险中,因为主节点被认为是这个集群的最高统治者,它决定了什么时候新的索引可以创建,分片是如何移动的等等。如果你有 _两个_ 主节点,
+你的数据的完整性将得不到保证,因为这两个节点都认为自己掌握着集群的控制权。

-This setting tells Elasticsearch to not elect a master unless there are enough
-master-eligible nodes available. Only then will an election take place.
+这个配置就是告诉 Elasticsearch 当没有足够 master 候选节点的时候,就不要进行 master 节点选举,等 master 候选节点足够了才进行选举。

-This setting should always be configured to a quorum (majority) of your master-eligible nodes.((("quorum"))) A quorum is `(number of master-eligible nodes / 2) + 1`.
-Here are some examples:
+此设置应该始终被配置为 master 候选节点的法定个数(大多数个)。((("quorum")))法定个数就是 `( master 候选节点个数 / 2) + 1` 。
+这里有几个例子:

-- If you have ten regular nodes (can hold data, can become master), a quorum is
-`6`.
-- If you have three dedicated master nodes and a hundred data nodes, the quorum is `2`,
-since you need to count only nodes that are master eligible.
-- If you have two regular nodes, you are in a conundrum. A quorum would be
-`2`, but this means a loss of one node will make your cluster inoperable. A
-setting of `1` will allow your cluster to function, but doesn't protect against
-split brain. It is best to have a minimum of three nodes in situations like this.
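法定个数的计算可以写成一行 Python,下面用它来验证几个典型场景的取值(仅为公式示意):

```python
def quorum(master_eligible_nodes: int) -> int:
    # minimum_master_nodes 应该设置为 master 候选节点的法定个数(大多数个)
    return master_eligible_nodes // 2 + 1

assert quorum(10) == 6   # 10 个 master 候选节点
assert quorum(3) == 2    # 3 个专职 master 节点(data 节点不计入)
assert quorum(2) == 2    # 两节点的难题:法定数为 2,挂掉一个节点集群即不可用
```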
+- 如果你有 10 个节点(能保存数据,同时能成为 master),法定数就是 `6` 。
+- 如果你有 3 个专职的 master 候选节点和 100 个 data 节点,法定数就是 `2` ,因为你只需要计算那些有 master 候选资格的节点。
+- 如果你有两个节点,你遇到难题了。法定数当然是 `2` ,但是这意味着如果有一个节点挂掉,你整个集群就不可用了。
+设置成 `1` 可以保证集群的功能,但是无法避免脑裂。像这样的情况,你最好至少保证有 3 个节点。

-This setting can be configured in your `elasticsearch.yml` file:
+你可以在你的 `elasticsearch.yml` 文件中这样配置:

[source,yaml]
----
discovery.zen.minimum_master_nodes: 2
----

-But because Elasticsearch clusters are dynamic, you could easily add or remove
-nodes that will change the quorum. It would be extremely irritating if you had
-to push new configurations to each node and restart your whole cluster just to
-change the setting.
+但是由于 Elasticsearch 集群是动态的,你可以很容易地添加和删除节点,
+而这会改变法定个数。
+如果仅仅为了修改这一个配置,就不得不向每个节点推送新配置并且重启整个集群,那将是非常痛苦的一件事情。

-For this reason, `minimum_master_nodes` (and other settings) can be configured
-via a dynamic API call. You can change the setting while your cluster is online:
+基于这个原因, `minimum_master_nodes` (还有一些其它配置)允许通过 API 调用的方式动态进行配置。
+当你的集群在线运行的时候,你可以这样修改配置:

[source,js]
----
@@ -161,56 +132,36 @@ PUT /_cluster/settings
}
----

-This will become a persistent setting that takes precedence over whatever is
-in the static configuration. You should modify this setting whenever you add or
-remove master-eligible nodes.
+这将成为一个永久的配置,并且无论你的静态配置文件里怎么写,这个设置都会优先生效。每当你添加或删除 master 候选节点的时候,都应该更改这个配置。

-==== Recovery Settings
+==== 集群恢复方面的配置

-Several settings affect the behavior of shard recovery when
-your cluster restarts.((("recovery settings")))((("configuration changes, important", "recovery settings"))) First, we need to understand what happens if nothing is
-configured.
+当你集群重启时,几个配置项影响你的分片恢复的表现。((("recovery settings")))((("configuration changes, important", "recovery settings")))首先,我们需要明白如果什么也没配置将会发生什么。

-Imagine you have ten nodes, and each node holds a single shard--either a primary
-or a replica--in a 5 primary / 1 replica index. You take your
-entire cluster offline for maintenance (installing new drives, for example).
When you
-restart your cluster, it just so happens that five nodes come online before
-the other five.

+想象一下假设你有 10 个节点,每个节点只保存一个分片,这个分片是一个主分片或者是一个副本分片,也就是说这个索引有 5 个主分片,每个主分片对应 1 个副本分片。有时你需要为整个集群做离线维护(比如,为了安装新的硬盘),
+当你重启你的集群,恰巧出现了 5 个节点已经启动,还有 5 个还没启动的场景。

-Maybe the switch to the other five is being flaky, and they didn't
-receive the restart command right away. Whatever the reason, you have five nodes
-online. These five nodes will gossip with each other, elect a master, and form a
-cluster. They notice that data is no longer evenly distributed, since five
-nodes are missing from the cluster, and immediately start replicating new
-shards between each other.

+也许是连接其余 5 个节点的交换机出了故障,也许它们根本没有及时收到重启的命令。不管什么原因,现在你有 5 个节点在线上,这五个节点会相互通信,选出一个 master,从而形成一个集群。
+他们注意到数据不再均匀分布,因为有 5 个节点在集群中丢失了,所以他们之间会立即启动分片复制。

-Finally, your other five nodes turn on and join the cluster. These nodes see
-that _their_ data is being replicated to other nodes, so they delete their local
-data (since it is now redundant, and may be outdated). Then the cluster starts
-to rebalance even more, since the cluster size just went from five to ten.

+最后,你的其它 5 个节点打开加入了集群。这些节点会发现 _它们_ 的数据正在被复制到其他节点,所以他们删除本地数据(因为这份数据要么是多余的,要么是过时的)。
+然后整个集群重新进行平衡,因为集群的大小已经从 5 变成了 10。

-During this whole process, your nodes are thrashing the disk and network, moving
-data around--for no good reason. For large clusters with terabytes of data,
-this useless shuffling of data can take a _really long time_. If all the nodes
-had simply waited for the cluster to come online, all the data would have been
-local and nothing would need to move.

+在整个过程中,你的节点会消耗磁盘和网络带宽,来回移动数据,而这其实毫无意义。对于有 TB 数据的大集群,
+这种无用的数据传输需要 _很长时间_ 。如果所有节点只是简单地等待整个集群上线,所有的数据就都在本地,什么都不需要移动。

-Now that we know the problem, we can configure a few settings to alleviate it.
-First, we need to give Elasticsearch a hard limit:

+现在我们知道问题的所在了,我们可以修改一些设置来缓解它。
+首先我们要给 Elasticsearch 一个严格的限制:

[source,yaml]
----
gateway.recover_after_nodes: 8
----

-This will prevent Elasticsearch from starting a recovery until at least eight (data or master) nodes
-are present. The value for this setting is a matter of personal preference: how
-many nodes do you want present before you consider your cluster functional?
-In this case, we are setting it to `8`, which means the cluster is inoperable
-unless there are at least eight nodes.

+这将阻止 Elasticsearch 在存在至少 8 个节点(数据节点或者 master 节点)之前进行数据恢复。
+这个值的设定取决于个人喜好:整个集群提供服务之前你希望有多少个节点在线?这种情况下,我们设置为 8,这意味着至少要有 8 个节点,该集群才可用。

-Then we tell Elasticsearch how many nodes _should_ be in the cluster, and how
-long we want to wait for all those nodes:

+现在我们要告诉 Elasticsearch 集群中 _应该_ 有多少个节点,以及我们愿意为这些节点等待多长时间:

[source,yaml]
----
@@ -218,50 +169,35 @@ gateway.expected_nodes: 10
gateway.recover_after_time: 5m
----

-What this means is that Elasticsearch will do the following:

+这意味着 Elasticsearch 会采取如下操作:

-- Wait for eight nodes to be present
-- Begin recovering after 5 minutes _or_ after ten nodes have joined the cluster,
-whichever comes first.

+- 等待集群至少存在 8 个节点
+- 等待 5 分钟,或者 10 个节点上线后,才进行数据恢复,这取决于哪个条件先达到。

-These three settings allow you to avoid the excessive shard swapping that can
-occur on cluster restarts. It can literally make recovery take seconds instead
-of hours.

+这三个设置可以在集群重启的时候避免过多的分片交换。这可能会让数据恢复从数个小时缩短为几秒钟。

-NOTE: These settings can only be set in the `config/elasticsearch.yml` file or on
-the command line (they are not dynamically updatable) and they are only relevant
-during a full cluster restart.

+注意:这些配置只能设置在 `config/elasticsearch.yml` 文件中或者是在命令行里(它们不能动态更新),并且只在整个集群重启的时候才有实质性作用。

[[unicast]]
-==== Prefer Unicast over Multicast
-
-Elasticsearch is configured to use unicast discovery out of the box to prevent
-nodes from accidentally joining a cluster.
Only nodes running on the same -machine will automatically form cluster. - -While multicast is still https://www.elastic.co/guide/en/elasticsearch/plugins/current/discovery-multicast.html[provided -as a plugin], it should never be used in production. The -last thing you want is for nodes to accidentally join your production network, simply -because they received an errant multicast ping. There is nothing wrong with -multicast _per se_. Multicast simply leads to silly problems, and can be a bit -more fragile (for example, a network engineer fiddles with the network without telling -you--and all of a sudden nodes can't find each other anymore). - -To use unicast, you provide Elasticsearch a list of nodes that it should try to contact. -When a node contacts a member of the unicast list, it receives a full cluster -state that lists all of the nodes in the cluster. It then contacts -the master and joins the cluster. - -This means your unicast list does not need to include all of the nodes in your cluster. -It just needs enough nodes that a new node can find someone to talk to. If you -use dedicated masters, just list your three dedicated masters and call it a day. 
-This setting is configured in `elasticsearch.yml`:
+
+==== 最好使用单播代替组播
+
+Elasticsearch 默认被配置为使用单播发现,以防止节点无意中加入集群。只有在同一台机器上运行的节点才会自动组成集群。
+
+虽然组播仍然 https://www.elastic.co/guide/en/elasticsearch/plugins/current/discovery-multicast.html[作为插件提供],
+但它永远不应该被用在生产环境,否则你得到的结果就是一个节点意外地加入到了你的生产集群,仅仅是因为它收到了一个错误的组播信号。
+组播 _本身_ 并没有错,但是它会导致一些愚蠢的问题,并且会让集群变得更脆弱(比如,一个网络工程师正在捣鼓网络,而没有告诉你,你会发现所有的节点突然发现不了对方了)。
+
+使用单播,你可以为 Elasticsearch 提供一些它应该去尝试连接的节点列表。
+当一个节点联系到单播列表中的成员时,它就会得到整个集群所有节点的状态,然后它会联系 master 节点,并加入集群。
+
+这意味着你的单播列表不需要包含你的集群中的所有节点,
+它只是需要足够的节点,让一个新节点能联系上其中一个就可以了。如果你使用专职的 master 候选节点,只要把这三个专职节点列出来就可以了。
+这个配置在 `elasticsearch.yml` 文件中:

[source,yaml]
----
discovery.zen.ping.unicast.hosts: ["host1", "host2:port"]
----

-For more information about how Elasticsearch nodes find eachother, see
-https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-discovery-zen.html[Zen Discovery]
-in the Elasticsearch Reference.

+关于 Elasticsearch 节点发现的详细信息,请参阅 Elasticsearch 官方文档中的 https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-discovery-zen.html[Zen Discovery] 章节。
diff --git a/510_Deployment/45_dont_touch.asciidoc b/510_Deployment/45_dont_touch.asciidoc
index 37506390f..ec78f4bd4 100644
--- a/510_Deployment/45_dont_touch.asciidoc
+++ b/510_Deployment/45_dont_touch.asciidoc
@@ -1,84 +1,50 @@
+[[dont-touch-these-settings]]
+=== 不要触碰这些配置!

-=== Don't Touch These Settings!

+在 Elasticsearch 中有一些热点配置,人们似乎总是忍不住想去调整它们。((("deployment", "settings to leave unaltered"))) 我们理解:旋钮生来就是让人拧的。但是在所有这些可调的配置中,下面这些你真的应该放着别动。它们经常会被乱用,从而造成系统的不稳定或者糟糕的性能,甚至两者兼有。

-There are a few hotspots in Elasticsearch that people just can't seem to avoid
-tweaking. ((("deployment", "settings to leave unaltered"))) We understand: knobs just beg to be turned. But of all the knobs to turn, these you should _really_ leave alone. They are
-often abused and will contribute to terrible stability or terrible performance.
-Or both.
+==== 垃圾回收器

-As briefly introduced in <>, the JVM uses a garbage
-collector to free unused memory.((("garbage collector"))) This tip is really an extension of the last tip,
-but deserves its own section for emphasis:

+正如 <> 中简要介绍过的,JVM 使用一个垃圾回收器来释放不再使用的内存。((("garbage collector"))) 这一条其实是上一条的延续,
+但是因为重要,所以值得单独拿出来作为一节强调:

-Do not change the default garbage collector!

+不要更改默认的垃圾回收器!

-The default GC for Elasticsearch is Concurrent-Mark and Sweep (CMS).((("Concurrent-Mark and Sweep (CMS) garbage collector"))) This GC
-runs concurrently with the execution of the application so that it can minimize
-pauses. It does, however, have two stop-the-world phases. It also has trouble
-collecting large heaps.

+Elasticsearch 默认的垃圾回收器( GC )是 CMS。((("Concurrent-Mark and Sweep (CMS) garbage collector"))) 这个垃圾回收器可以和应用程序并发执行,从而最大程度地减少停顿。
+然而,它有两个 stop-the-world 阶段,在回收大堆内存的时候也有点吃力。

-Despite these downsides, it is currently the best GC for low-latency server software
-like Elasticsearch. The official recommendation is to use CMS.

+尽管有这些缺点,它还是目前对于像 Elasticsearch 这样低延迟需求软件的最佳垃圾回收器。官方建议使用 CMS。

-There is a newer GC called the Garbage First GC (G1GC). ((("Garbage First GC (G1GC)"))) This newer GC is designed
-to minimize pausing even more than CMS, and operate on large heaps. It works
-by dividing the heap into regions and predicting which regions contain the most
-reclaimable space. By collecting those regions first (_garbage first_), it can
-minimize pauses and operate on very large heaps.

+现在有一款新的垃圾回收器,叫 G1 垃圾回收器( G1GC )。((("Garbage First GC (G1GC)"))) 这款新的 GC 旨在实现比 CMS 更短的暂停时间,并且能够应对大堆内存。
+它的原理是把堆内存分成许多区域,并且预测哪些区域有最多可以回收的空间。通过优先收集这些区域( _garbage first_ ),它可以最小化停顿,同时应对非常大的堆内存。

+听起来很棒!遗憾的是,G1GC 还是太新了,经常会发现新的 bug。这些 bug 通常是段错误( segfault )类型,会造成进程直接崩溃。
+Lucene 的测试套件对垃圾回收算法要求严格,看起来 G1GC 还没有把这些问题解决好。

-Sounds great! Unfortunately, G1GC is still new, and fresh bugs are found routinely.
-These bugs are usually of the segfault variety, and will cause hard crashes.
-The Lucene test suite is brutal on GC algorithms, and it seems that G1GC hasn't
-had the kinks worked out yet.

+我们很希望在将来某一天推荐使用 G1GC,但是对于现在,它还不够稳定,满足不了 Elasticsearch 和 Lucene 的要求。

-We would like to recommend G1GC someday, but for now, it is simply not stable
-enough to meet the demands of Elasticsearch and Lucene.

+==== 线程池

-==== Threadpools

+许多人 _喜欢_ 调整线程池。((("threadpools"))) 无论什么原因,人们似乎总是无法抗拒增加线程数。索引太多了?增加线程!搜索太多了?增加线程!节点有 95% 的时间都是空闲的?增加线程!

-Everyone _loves_ to tweak threadpools.((("threadpools"))) For whatever reason, it seems people
-cannot resist increasing thread counts. Indexing a lot? More threads! Searching
-a lot? More threads! Node idling 95% of the time? More threads!

+Elasticsearch 默认的线程设置已经是很合理的了。对于所有的线程池(除了 `search` ),线程个数是根据 CPU 核心数设置的。
+如果你有 8 个核,你同时也只能运行 8 个线程,所以只分配 8 个线程给任何特定的线程池是有道理的。

-The default threadpool settings in Elasticsearch are very sensible. For all
-threadpools (except `search`) the threadcount is set to the number of CPU cores.
-If you have eight cores, you can be running only eight threads simultaneously. It makes
-sense to assign only eight threads to any particular threadpool.
-
-Search gets a larger threadpool, and is configured to `int((# of cores * 3) / 2) + 1`.
-
-You might argue that some threads can block (such as on a disk I/O operation),
-which is why you need more threads. This is not a problem in Elasticsearch:
-much of the disk I/O is handled by threads managed by Lucene, not Elasticsearch.
-
-Furthermore, threadpools cooperate by passing work between each other. You don't
-need to worry about a networking thread blocking because it is waiting on a disk
-write. The networking thread will have long since handed off that work unit to
-another threadpool and gotten back to networking.
-
-Finally, the compute capacity of your process is finite. Having more threads just forces
-the processor to switch thread contexts.
A processor can run only one thread -at a time, so when it needs to switch to a different thread, it stores the current -state (registers, and so forth) and loads another thread. If you are lucky, the switch -will happen on the same core. If you are unlucky, the switch may migrate to a -different core and require transport on an inter-core communication bus. - -This context switching eats up cycles simply by doing administrative housekeeping; estimates can peg it as high as 30μs on modern CPUs. So unless the thread -will be blocked for longer than 30μs, it is highly likely that that time would -have been better spent just processing and finishing early. - -People routinely set threadpools to silly values. On eight core machines, we have -run across configs with 60, 100, or even 1000 threads. These settings will simply -thrash the CPU more than getting real work done. - -So. Next time you want to tweak a threadpool, please don't. And if you -_absolutely cannot resist_, please keep your core count in mind and perhaps set -the count to double. More than that is just a waste. 
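正文中的搜索线程池大小公式也可以用一小段 shell 脚本粗算一下( `search_threads` 仅为演示而假设的函数,并非实际的配置工具):

```bash
# 搜索线程池大小 = int((核心数 * 3) / 2) + 1,bash 的 $(( )) 本身就是整数运算
search_threads() {
  echo $(( $1 * 3 / 2 + 1 ))
}

search_threads 8   # 8 核机器 => 13
search_threads 4   # 4 核机器 => 7
```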
+搜索线程池设置得大一点,配置为 `int(( 核心数 * 3 )/ 2 )+ 1` 。

+你可能会认为某些线程可能会阻塞(如磁盘上的 I/O 操作),所以你才想加大线程数。对于 Elasticsearch 来说这并不是一个问题:因为大多数磁盘 I/O 操作是由 Lucene 管理的线程处理的,而不是 Elasticsearch。

+此外,各个线程池之间会通过传递任务来相互协作。你不必担心网络线程会因为等待磁盘写操作而阻塞,
+因为网络线程早已把这个任务交给了另外的线程池,然后回头继续处理网络相关的事情了。

+最后,你的处理器的计算能力是有限的,拥有更多的线程会导致你的处理器频繁切换线程上下文。
+一个处理器同时只能运行一个线程。所以当它需要切换到其它不同的线程的时候,它会存储当前的状态(寄存器等等),然后加载另外一个线程。
+如果幸运的话,这个切换发生在同一个核心,如果不幸的话,这个切换可能发生在不同的核心,这就需要在内核间总线上进行传输。

+这种上下文切换纯粹是在做管理调度,白白消耗掉 CPU 时钟周期;在现代的 CPU 上,这个开销估计高达 30 μs。所以除非线程会被阻塞超过 30 μs,否则这些时间本可以用来处理任务,而且任务很可能早就处理完了。

+人们经常稀里糊涂的设置线程池的值。8 个核的 CPU,我们遇到过有人配了 60、100 甚至 1000 个线程。
+这些设置只会让 CPU 实际工作效率更低。

+所以,下次请不要调整线程池的线程数。如果你真 _想调整_ ,
+一定要关注你的 CPU 核心数,最多设置成核心数的两倍,再多了都是浪费。
diff --git a/510_Deployment/50_heap.asciidoc b/510_Deployment/50_heap.asciidoc
index 7bac00cb6..dc5073915 100644
--- a/510_Deployment/50_heap.asciidoc
+++ b/510_Deployment/50_heap.asciidoc
@@ -1,107 +1,76 @@
[[heap-sizing]]
-=== Heap: Sizing and Swapping

+=== 堆内存:大小和交换

-The default installation of Elasticsearch is configured with a 1 GB heap. ((("deployment", "heap, sizing and swapping")))((("heap", "sizing and setting"))) For
-just about every deployment, this number is usually too small. If you are using the
-default heap values, your cluster is probably configured incorrectly.

+Elasticsearch 默认安装后设置的堆内存是 1 GB。((("deployment", "heap, sizing and swapping")))((("heap", "sizing and setting")))对于几乎所有的部署来说,
+这个设置都太小了。如果你正在使用这些默认堆内存配置,你的集群可能配置得并不正确。

-There are two ways to change the heap size in Elasticsearch. The easiest is to
-set an environment variable called `ES_HEAP_SIZE`.((("ES_HEAP_SIZE environment variable"))) When the server process
-starts, it will read this environment variable and set the heap accordingly.
-As an example, you can set it via the command line as follows: +这里有两种方式修改 Elasticsearch 的堆内存。最简单的一个方法就是指定 `ES_HEAP_SIZE` 环境变量。((("ES_HEAP_SIZE environment variable")))服务进程在启动时候会读取这个变量,并相应的设置堆的大小。 +比如,你可以用下面的命令设置它: [source,bash] ---- export ES_HEAP_SIZE=10g ---- -Alternatively, you can pass in the heap size via a command-line argument when starting -the process, if that is easier for your setup: +此外,你也可以通过命令行参数的形式,在程序启动的时候把内存大小传递给它,如果你觉得这样更简单的话: [source,bash] ---- ./bin/elasticsearch -Xmx10g -Xms10g <1> ---- -<1> Ensure that the min (`Xms`) and max (`Xmx`) sizes are the same to prevent -the heap from resizing at runtime, a very costly process. +<1> 确保堆内存最小值( `Xms` )与最大值( `Xmx` )的大小是相同的,防止程序在运行时改变堆内存大小, +这是一个很耗系统资源的过程。 -Generally, setting the `ES_HEAP_SIZE` environment variable is preferred over setting -explicit `-Xmx` and `-Xms` values. +通常来说,设置 `ES_HEAP_SIZE` 环境变量,比直接写 `-Xmx -Xms` 更好一点。 -==== Give (less than) Half Your Memory to Lucene +==== 把你的内存的一半给 Lucene -A common problem is configuring a heap that is _too_ large. ((("heap", "sizing and setting", "giving half your memory to Lucene"))) You have a 64 GB -machine--and by golly, you want to give Elasticsearch all 64 GB of memory. More -is better! +一个常见的问题是给 Elasticsearch 分配的内存 _太_ 大了。((("heap", "sizing and setting", "giving half your memory to Lucene")))假设你有一个 64 GB 内存的机器, +天啊,我要把 64 GB 内存全都给 Elasticsearch。因为越多越好啊! -Heap is definitely important to Elasticsearch. It is used by many in-memory data -structures to provide fast operation. But with that said, there is another major -user of memory that is _off heap_: Lucene. +当然,内存对于 Elasticsearch 来说绝对是重要的,它可以被许多内存数据结构使用来提供更快的操作。但是说到这里, +还有另外一个内存消耗大户 _非堆内存_ (off-heap):Lucene。 -Lucene is designed to leverage the underlying OS for caching in-memory data structures.((("Lucene", "memory for"))) -Lucene segments are stored in individual files. Because segments are immutable, -these files never change. 
This makes them very cache friendly, and the underlying -OS will happily keep hot segments resident in memory for faster access. These segments -include both the inverted index (for fulltext search) and doc values (for aggregations). +Lucene 被设计为可以利用操作系统底层机制来缓存内存数据结构。((("Lucene", "memory for"))) +Lucene 的段是分别存储到单个文件中的。因为段是不可变的,这些文件也都不会变化,这是对缓存友好的,同时操作系统也会把这些段文件缓存起来,以便更快的访问。 -Lucene's performance relies on this interaction with the OS. But if you give all -available memory to Elasticsearch's heap, there won't be any left over for Lucene. -This can seriously impact the performance. +Lucene 的性能取决于和操作系统的相互作用。如果你把所有的内存都分配给 Elasticsearch 的堆内存,那将不会有剩余的内存交给 Lucene。 +这将严重地影响全文检索的性能。 -The standard recommendation is to give 50% of the available memory to Elasticsearch -heap, while leaving the other 50% free. It won't go unused; Lucene will happily -gobble up whatever is left over. - -If you are not aggregating on analyzed string fields (e.g. you won't be needing -<>) you can consider lowering the heap even -more. The smaller you can make the heap, the better performance you can expect -from both Elasticsearch (faster GCs) and Lucene (more memory for caching). +标准的建议是把 50% 的可用内存作为 Elasticsearch 的堆内存,保留剩下的 50%。当然它也不会被浪费,Lucene 会很乐意利用起余下的内存。 [[compressed_oops]] -==== Don't Cross 32 GB! -There is another reason to not allocate enormous heaps to Elasticsearch. As it turns((("heap", "sizing and setting", "32gb heap boundary")))((("32gb Heap boundary"))) -out, the HotSpot JVM uses a trick to compress object pointers when heaps are less -than around 32 GB. - -In Java, all objects are allocated on the heap and referenced by a pointer. -Ordinary object pointers (OOP) point at these objects, and are traditionally -the size of the CPU's native _word_: either 32 bits or 64 bits, depending on the -processor. The pointer references the exact byte location of the value. - -For 32-bit systems, this means the maximum heap size is 4 GB. 
For 64-bit systems, -the heap size can get much larger, but the overhead of 64-bit pointers means there -is more wasted space simply because the pointer is larger. And worse than wasted -space, the larger pointers eat up more bandwidth when moving values between -main memory and various caches (LLC, L1, and so forth). - -Java uses a trick called https://wikis.oracle.com/display/HotSpotInternals/CompressedOops[compressed oops]((("compressed object pointers"))) -to get around this problem. Instead of pointing at exact byte locations in -memory, the pointers reference _object offsets_.((("object offsets"))) This means a 32-bit pointer can -reference four billion _objects_, rather than four billion bytes. Ultimately, this -means the heap can grow to around 32 GB of physical size while still using a 32-bit -pointer. - -Once you cross that magical ~32 GB boundary, the pointers switch back to -ordinary object pointers. The size of each pointer grows, more CPU-memory -bandwidth is used, and you effectively lose memory. In fact, it takes until around -40–50 GB of allocated heap before you have the same _effective_ memory of a -heap just under 32 GB using compressed oops. - -The moral of the story is this: even when you have memory to spare, try to avoid -crossing the 32 GB heap boundary. It wastes memory, reduces CPU performance, and -makes the GC struggle with large heaps. - -==== Just how far under 32gb should I set the JVM? - -Unfortunately, that depends. The exact cutoff varies by JVMs and platforms. -If you want to play it safe, setting the heap to `31gb` is likely safe. -Alternatively, you can verify the cutoff point for the HotSpot JVM by adding -`-XX:+PrintFlagsFinal` to your JVM options and checking that the value of the -UseCompressedOops flag is true. This will let you find the exact cutoff for your -platform and JVM. 
-For example, here we test a Java 1.7 installation on MacOSX and see the max heap
-size is around 32600mb (~31.83gb) before compressed pointers are disabled:

+==== 不要超过 32 GB!
+这里还有另外一个原因不要给 Elasticsearch 分配超大内存。事实上((("heap", "sizing and setting", "32gb heap boundary")))((("32gb Heap boundary"))),
+JVM 在堆内存小于 32 GB 的时候会采用一个内存对象指针压缩技术。

+在 Java 中,所有的对象都分配在堆上,并通过一个指针进行引用。
+普通对象指针(OOP)指向这些对象,通常为 CPU _字长_ 的大小:32 位或 64 位,取决于你的处理器。

+对于 32 位的系统,意味着堆内存大小最大为 4 GB。对于 64 位的系统,
+可以使用更大的内存,但是 64 位的指针意味着更大的浪费,因为你的指针本身大了。更糟糕的是,
+更大的指针在主内存和各级缓存(例如 LLC,L1 等)之间移动数据的时候,会占用更多的带宽。

+Java 使用一个叫作 https://wikis.oracle.com/display/HotSpotInternals/CompressedOops[内存指针压缩(compressed oops)]((("compressed object pointers")))的技术来解决这个问题。
+它的指针不再表示对象在内存中的精确位置,而是表示 _偏移量_ 。((("object offsets")))这意味着 32 位的指针可以引用 40 亿个 _对象_ ,
+而不是 40 亿个字节。最终,
+也就是说堆内存增长到 32 GB 的物理内存,也可以用 32 位的指针表示。

+一旦你越过那个神奇的 ~32 GB 的边界,指针就会切回普通对象的指针。
+每个对象的指针都变长了,就会使用更多的 CPU 内存带宽,也就是说你实际上失去了更多的内存。事实上,当内存到达
+40–50 GB 的时候,有效内存才相当于使用内存对象指针压缩技术时候的 32 GB 内存。

+这段描述的意思就是说:即便你有足够的内存,也尽量不要
+超过 32 GB。因为它浪费了内存,降低了 CPU 的性能,还要让 GC 艰难应对大堆内存。

+==== 到底需要比 32 GB 低多少,来设置我的 JVM?
+ +遗憾的是,这需要看情况。确切的划分要根据 JVMs 和操作系统而定。 +如果你想保证其安全可靠,设置堆内存为 `31 GB` 是一个安全的选择。 +另外,你可以在你的 JVM 设置里添加 `-XX:+PrintFlagsFinal` 用来验证 `JVM` 的临界值, +并且检查 UseCompressedOops 的值是否为 true。对于你自己使用的 JVM 和操作系统,这将找到最合适的堆内存临界值。 + +例如,我们在一台安装 Java 1.7 的 MacOSX 上测试,可以看到指针压缩在被禁用之前,最大堆内存大约是在 32600 mb(~31.83 gb): [source,bash] ---- @@ -111,8 +80,7 @@ $ JAVA_HOME=`/usr/libexec/java_home -v 1.7` java -Xmx32766m -XX:+PrintFlagsFinal bool UseCompressedOops = false ---- -In contrast, a Java 1.8 installation on the same machine has a max heap size -around 32766mb (~31.99gb): +相比之下,同一台机器安装 Java 1.8,可以看到指针压缩在被禁用之前,最大堆内存大约是在 32766 mb(~31.99 gb): [source,bash] ---- @@ -122,93 +90,75 @@ $ JAVA_HOME=`/usr/libexec/java_home -v 1.8` java -Xmx32767m -XX:+PrintFlagsFinal bool UseCompressedOops = false ---- -The moral of the story is that the exact cutoff to leverage compressed oops -varies from JVM to JVM, so take caution when taking examples from elsewhere and -be sure to check your system with your configuration and JVM. +这个例子告诉我们,影响内存指针压缩使用的临界值, +是会根据 JVM 的不同而变化的。 +所以从其他地方获取的例子,需要谨慎使用,要确认检查操作系统配置和 JVM。 -Beginning with Elasticsearch v2.2.0, the startup log will actually tell you if your -JVM is using compressed OOPs or not. You'll see a log message like: +如果使用的是 Elasticsearch v2.2.0,启动日志其实会告诉你 JVM 是否正在使用内存指针压缩。 +你会看到像这样的日志消息: [source, bash] ---- [2015-12-16 13:53:33,417][INFO ][env] [Illyana Rasputin] heap size [989.8mb], compressed ordinary object pointers [true] ---- -Which indicates that compressed object pointers are being used. If they are not, -the message will say `[false]`. - +这表明内存指针压缩正在被使用。如果没有,日志消息会显示 `[false]` 。 [role="pagebreak-before"] -.I Have a Machine with 1 TB RAM! +.我有一个 1 TB 内存的机器! **** -The 32 GB line is fairly important. So what do you do when your machine has a lot -of memory? It is becoming increasingly common to see super-servers with 512–768 GB -of RAM. - -First, we would recommend avoiding such large machines (see <>). +这个 32 GB 的分割线是很重要的。那如果你的机器有很大的内存怎么办呢? 
+一台有着 512–768 GB 内存的服务器愈发常见。

-First, we would recommend avoiding such large machines (see <>).

-But if you already have the machines, you have three practical options:

+首先,我们建议避免使用这样的高配机器(参考 <>)。

-- Are you doing mostly full-text search? Consider giving 4-32 GB to Elasticsearch
-and letting Lucene use the rest of memory via the OS filesystem cache. All that
-memory will cache segments and lead to blisteringly fast full-text search.

+但是如果你已经有了这样的机器,你有两个可选项:

-- Are you doing a lot of sorting/aggregations? Are most of your aggregations on numerics,
-dates, geo_points and `not_analyzed` strings? You're in luck, your aggregations will be done on
-memory-friendly doc values! Give Elasticsearch somewhere from 4-32 GB of memory and leave the
-rest for the OS to cache doc values in memory.

+- 你主要做全文检索吗?考虑给 Elasticsearch 4–32 GB 的内存,
+让 Lucene 通过操作系统文件缓存来利用余下的内存。那些内存都会用来缓存 segments,带来极速的全文检索。

-- Are you doing a lot of sorting/aggregations on analyzed strings (e.g. for word-tags,
-or SigTerms, etc)? Unfortunately that means you'll need fielddata, which means you
-need heap space. Instead of one node with a huge amount of RAM, consider running two or
-more nodes on a single machine. Still adhere to the 50% rule, though.
-+
-So if your machine has 128 GB of RAM, run two nodes each with just under 32 GB. This means that less
-than 64 GB will be used for heaps, and more than 64 GB will be left over for Lucene.

+- 你需要更多的排序和聚合?你可能会更希望把那些内存用在堆中。
+你可以考虑在一台机器上创建两个或者更多 ES 节点,而不要部署一个使用 32 GB 以上内存的节点。
+仍然要坚持 50% 原则。假设你有个机器有 128 GB 的内存,
+你可以创建两个节点,每个节点内存分配不超过 32 GB。
+也就是说不超过 64 GB 内存给 ES 的堆内存,剩下的超过 64 GB 的内存给 Lucene。
+
-If you choose this option, set `cluster.routing.allocation.same_shard.host: true`
-in your config. This will prevent a primary and a replica shard from colocating
-to the same physical machine (since this would remove the benefits of replica high availability).
+如果你选择第二种,你需要配置 `cluster.routing.allocation.same_shard.host: true` 。 +这会防止同一个分片(shard)的主副本存在同一个物理机上(因为如果存在一个机器上,副本的高可用性就没有了)。 **** -==== Swapping Is the Death of Performance +==== Swapping 是性能的坟墓 -It should be obvious,((("heap", "sizing and setting", "swapping, death of performance")))((("memory", "swapping as the death of performance")))((("swapping, the death of performance"))) but it bears spelling out clearly: swapping main memory -to disk will _crush_ server performance. Think about it: an in-memory operation -is one that needs to execute quickly. +这是显而易见的,((("heap", "sizing and setting", "swapping, death of performance")))((("memory", "swapping as the death of performance")))((("swapping, the death of performance")))但是还是有必要说的更清楚一点:内存交换 +到磁盘对服务器性能来说是 _致命_ 的。想想看:一个内存操作必须能够被快速执行。 -If memory swaps to disk, a 100-microsecond operation becomes one that take 10 -milliseconds. Now repeat that increase in latency for all other 10us operations. -It isn't difficult to see why swapping is terrible for performance. +如果内存交换到磁盘上,一个 100 微秒的操作可能变成 10 毫秒。 +再想想那么多 10 微秒的操作时延累加起来。 +不难看出 swapping 对于性能是多么可怕。 -The best thing to do is disable swap completely on your system. This can be done -temporarily: +最好的办法就是在你的操作系统中完全禁用 swap。这样可以暂时禁用: [source,bash] ---- sudo swapoff -a ---- -To disable it permanently, you'll likely need to edit your `/etc/fstab`. Consult -the documentation for your OS. +如果需要永久禁用,你可能需要修改 `/etc/fstab` 文件,这要参考你的操作系统相关文档。 -If disabling swap completely is not an option, you can try to lower `swappiness`. -This value controls how aggressively the OS tries to swap memory. -This prevents swapping under normal circumstances, but still allows the OS to swap -under emergency memory situations. 
+如果你并不打算完全禁用 swap,也可以选择降低 `swappiness` 的值。
+这个值决定操作系统交换内存的频率。
+这可以预防正常情况下发生交换,但仍允许操作系统在紧急情况下发生交换。

-For most Linux systems, this is configured using the `sysctl` value:

+对于大部分 Linux 操作系统,可以在 `sysctl` 中这样配置:

[source,bash]
----
vm.swappiness = 1 <1>
----

-<1> A `swappiness` of `1` is better than `0`, since on some kernel versions a `swappiness`
-of `0` can invoke the OOM-killer.

+<1> `swappiness` 设置为 `1` 比设置为 `0` 要好,因为在一些内核版本里, `swappiness` 设置为 `0` 会触发系统的 OOM-killer(注:Linux 内核的 Out of Memory(OOM)killer 机制)。

-Finally, if neither approach is possible, you should enable `mlockall`.
- file. This allows the JVM to lock its memory and prevent
-it from being swapped by the OS. In your `elasticsearch.yml`, set this:

+最后,如果上面的方法都不合适,你需要打开配置文件中的 `mlockall` 开关。
+它的作用就是允许 JVM 锁住内存,禁止操作系统交换出去。在你的 `elasticsearch.yml` 文件中,设置如下:

[source,yaml]
----
diff --git a/520_Post_Deployment/30_indexing_perf.asciidoc b/520_Post_Deployment/30_indexing_perf.asciidoc
index 7a15fb459..8cbdf8d5e 100644
--- a/520_Post_Deployment/30_indexing_perf.asciidoc
+++ b/520_Post_Deployment/30_indexing_perf.asciidoc
@@ -1,111 +1,64 @@
[[indexing-performance]]
-=== Indexing Performance Tips

+=== 索引性能技巧

-If you are in an indexing-heavy environment,((("indexing", "performance tips")))((("post-deployment", "indexing performance tips"))) such as indexing infrastructure
-logs, you may be willing to sacrifice some search performance for faster indexing
-rates. In these scenarios, searches tend to be relatively rare and performed
-by people internal to your organization. They are willing to wait several
-seconds for a search, as opposed to a consumer facing a search that must
-return in milliseconds.

+如果你是在一个索引负载很重的环境,((("indexing", "performance tips")))((("post-deployment", "indexing performance tips")))比如索引的是基础设施日志,你可能愿意牺牲一些搜索性能换取更快的索引速率。在这些场景里,搜索常常是很少见的操作,而且一般是由你公司内部的人发起的。他们也愿意为一个搜索等上几秒钟,而不像普通消费者,要求一个搜索必须毫秒级返回。

-Because of this unique position, certain trade-offs can be made
-that will increase your indexing performance.
+基于这种特殊的场景,我们可以有几种权衡办法来提高你的索引性能。

-.These Tips Apply Only to Elasticsearch 1.3+

+.这些技巧仅适用于 Elasticsearch 1.3 及以上版本
****
-This book is written for the most recent versions of Elasticsearch, although much
-of the content works on older versions.

+本书是为最新几个版本的 Elasticsearch 写的,虽然大多数内容在更老的版本也有效。

-The tips presented in this section, however, are _explicitly_ for version 1.3+. There
-have been multiple performance improvements and bugs fixed that directly impact
-indexing. In fact, some of these recommendations will _reduce_ performance on
-older versions because of the presence of bugs or performance defects.

+不过,本节提及的技巧, _只_ 针对 1.3 及以上版本。该版本后有不少性能提升和故障修复是直接影响到索引的。事实上,有些建议在老版本上反而会因为故障或性能缺陷而 _降低_ 性能。
****

-==== Test Performance Scientifically

+==== 科学的测试性能

-Performance testing is always difficult, so try to be as scientific as possible
-in your approach.((("performance testing")))((("indexing", "performance tips", "performance testing"))) Randomly fiddling with knobs and turning on ingestion is not
-a good way to tune performance. If there are too many _causes_, it is impossible
-to determine which one had the best _effect_. A reasonable approach to testing is as follows:

+性能测试永远是复杂的,所以你的方法应该尽可能科学。((("performance testing")))((("indexing", "performance tips", "performance testing")))随意调整各种配置再打开数据写入,可不是做性能调优的好办法。如果同时改动了太多个 _原因_ ,我们就无法判断到底哪一个产生了最好的 _效果_ 。合理的测试方法如下:

-1. Test performance on a single node, with a single shard and no replicas.
-2. Record performance under 100% default settings so that you have a baseline to
-measure against.
-3. Make sure performance tests run for a long time (30+ minutes) so you can
-evaluate long-term performance, not short-term spikes or latencies. Some events
-(such as segment merging, and GCs) won't happen right away, so the performance
-profile can change over time.
-4. Begin making single changes to the baseline defaults. Test these rigorously,
-and if performance improvement is acceptable, keep the setting and move on to the
-next one.

+1.
在单个节点上,对单个分片,无副本的场景测试性能。 +2. 在 100% 默认配置的情况下记录性能结果,这样你就有了一个对比基线。 +3. 确保性能测试运行足够长的时间(30 分钟以上)这样你可以评估长期性能,而不是短期的峰值或延迟。一些事件(比如段合并,GC)不会立刻发生,所以性能概况会随着时间继续而改变的。 +4. 开始在基线上逐一修改默认值。严格测试它们,如果性能提升可以接受,保留这个配置项,开始下一项。 -==== Using and Sizing Bulk Requests +==== 使用批量请求并调整其大小 -This should be fairly obvious, but use bulk indexing requests for optimal performance.((("indexing", "performance tips", "bulk requests, using and sizing")))((("bulk API", "using and sizing bulk requests"))) -Bulk sizing is dependent on your data, analysis, and cluster configuration, but -a good starting point is 5–15 MB per bulk. Note that this is physical size. -Document count is not a good metric for bulk size. For example, if you are -indexing 1,000 documents per bulk, keep the following in mind: +显而易见的,优化性能应该使用批量请求。((("indexing", "performance tips", "bulk requests, using and sizing")))((("bulk API", "using and sizing bulk requests")))批量的大小则取决于你的数据、分析和集群配置,不过每次批量数据 5–15 MB 大是个不错的起始点。注意这里说的是物理字节数大小。文档计数对批量大小来说不是一个好指标。比如说,如果你每次批量索引 1000 个文档,记住下面的事实: -- 1,000 documents at 1 KB each is 1 MB. -- 1,000 documents at 100 KB each is 100 MB. +- 1000 个 1 KB 大小的文档加起来是 1 MB 大。 +- 1000 个 100 KB 大小的文档加起来是 100 MB 大。 -Those are drastically different bulk sizes. Bulks need to be loaded into memory -at the coordinating node, so it is the physical size of the bulk that is more -important than the document count. +这可是完完全全不一样的批量大小了。批量请求需要在协调节点上加载进内存,所以批量请求的物理大小比文档计数重要得多。 -Start with a bulk size around 5–15 MB and slowly increase it until you do not -see performance gains anymore. Then start increasing the concurrency of your -bulk ingestion (multiple threads, and so forth). +从 5–15 MB 开始测试批量请求大小,缓慢增加这个数字,直到你看不到性能提升为止。然后开始增加你的批量写入的并发度(多线程等等办法)。 -Monitor your nodes with Marvel and/or tools such as `iostat`, `top`, and `ps` to see -when resources start to bottleneck. If you start to receive `EsRejectedExecutionException`, -your cluster can no longer keep up: at least one resource has reached capacity. 
Either reduce concurrency, provide more of the limited resource (such as switching from spinning disks to SSDs), or add more nodes. +用 Marvel 以及诸如 `iostat` 、 `top` 和 `ps` 等工具监控你的节点,观察资源什么时候达到瓶颈。如果你开始收到 `EsRejectedExecutionException` ,你的集群没办法再继续了:至少有一种资源到瓶颈了。或者减少并发数,或者提供更多的受限资源(比如从机械磁盘换成 SSD),或者添加更多节点。 [NOTE] ==== -When ingesting data, make sure bulk requests are round-robined across all your -data nodes. Do not send all requests to a single node, since that single node -will need to store all the bulks in memory while processing. +写数据的时候,要确保批量请求是轮询发往你的全部数据节点的。不要把所有请求都发给单个节点,因为这个节点会需要在处理的时候把所有批量请求都存在内存里。 ==== -==== Storage +==== 存储 -Disks are usually the bottleneck of any modern server. Elasticsearch heavily uses disks, and the more throughput your disks can handle, the more stable your nodes will be. Here are some tips for optimizing disk I/O: +磁盘在现代服务器上通常都是瓶颈。Elasticsearch 重度使用磁盘,你的磁盘能处理的吞吐量越大,你的节点就越稳定。这里有一些优化磁盘 I/O 的技巧: -- Use SSDs. As mentioned elsewhere, ((("storage")))((("indexing", "performance tips", "storage")))they are superior to spinning media. -- Use RAID 0. Striped RAID will increase disk I/O, at the obvious expense of -potential failure if a drive dies. Don't use mirrored or parity RAIDS since -replicas provide that functionality. -- Alternatively, use multiple drives and allow Elasticsearch to stripe data across -them via multiple `path.data` directories. -- Do not use remote-mounted storage, such as NFS or SMB/CIFS. The latency introduced -here is antithetical to performance. -- If you are on EC2, beware of EBS. Even the SSD-backed EBS options are often slower -than local instance storage. 
+- 使用 SSD。就像其他地方提过的,((("storage")))((("indexing", "performance tips", "storage")))它们比机械磁盘优秀多了。
+- 使用 RAID 0。条带化 RAID 会提高磁盘 I/O,代价显然就是当一块硬盘故障时整个阵列就故障了。不要使用镜像或者奇偶校验 RAID ,因为副本已经提供了这个功能。
+- 另外,使用多块硬盘,并允许 Elasticsearch 通过多个 `path.data` 目录配置把数据条带化分配到它们上面。
+- 不要使用远程挂载的存储,比如 NFS 或者 SMB/CIFS。这个引入的延迟对性能来说完全是背道而驰的。
+- 如果你用的是 EC2,当心 EBS。即便是基于 SSD 的 EBS,通常也比本地实例的存储要慢。

[[segments-and-merging]]
-==== Segments and Merging
+==== 段和合并

-Segment merging is computationally expensive,((("indexing", "performance tips", "segments and merging")))((("merging segments")))((("segments", "merging"))) and can eat up a lot of disk I/O.
-Merges are scheduled to operate in the background because they can take a long
-time to finish, especially large segments. This is normally fine, because the
-rate of large segment merges is relatively rare.
+段合并的计算量庞大,((("indexing", "performance tips", "segments and merging")))((("merging segments")))((("segments", "merging")))而且还要吃掉大量磁盘 I/O。合并操作被调度到后台执行,因为它们可能要很长时间才能完成,尤其是比较大的段。这个通常来说都没问题,因为大型段的合并相对少见。

-But sometimes merging falls behind the ingestion rate. If this happens, Elasticsearch
-will automatically throttle indexing requests to a single thread. This prevents
-a _segment explosion_ problem, in which hundreds of segments are generated before
-they can be merged. Elasticsearch will log `INFO`-level messages stating `now
-throttling indexing` when it detects merging falling behind indexing.
+不过有时候合并会落后于写入速率。如果这个真的发生了,Elasticsearch 会自动限制索引请求到单个线程里。这个可以防止出现 _段爆炸_ 问题,即数以百计的段在被合并之前就生成出来。如果 Elasticsearch 发现合并落后于索引写入了,它会记录一条包含 `now throttling indexing` 的 `INFO` 级别日志。

-Elasticsearch defaults here are conservative: you don't want search performance
-to be impacted by background merging. But sometimes (especially on SSD, or logging
-scenarios), the throttle limit is too low.
+Elasticsearch 默认设置在这块比较保守:不希望搜索性能被后台合并影响。不过有时候(尤其是 SSD,或者日志场景)限流阈值太低了。

-The default is 20 MB/s, which is a good setting for spinning disks.
If you have -SSDs, you might consider increasing this to 100–200 MB/s. Test to see what works -for your system: +默认值是 20 MB/s,对机械磁盘应该是个不错的设置。如果你用的是 SSD,可以考虑提高到 100–200 MB/s。测试验证对你的系统哪个值合适: [source,js] ---- @@ -117,9 +70,7 @@ PUT /_cluster/settings } ---- -If you are doing a bulk import and don't care about search at all, you can disable -merge throttling entirely. This will allow indexing to run as fast as your -disks will allow: +如果你在做批量导入,完全不在意搜索,你可以彻底关掉合并限流。这样让你的索引速度跑到你磁盘允许的极限: [source,js] ---- @@ -130,58 +81,31 @@ PUT /_cluster/settings } } ---- -<1> Setting the throttle type to `none` disables merge throttling entirely. When -you are done importing, set it back to `merge` to reenable throttling. +<1> 设置限流类型为 `none` 彻底关闭合并限流。等你完成了导入,记得改回 `merge` 重新打开限流。 -If you are using spinning media instead of SSD, you need to add this to your -`elasticsearch.yml`: +如果你使用的是机械磁盘而非 SSD,你需要添加下面这个配置到你的 `elasticsearch.yml` 里: [source,yaml] ---- index.merge.scheduler.max_thread_count: 1 ---- -Spinning media has a harder time with concurrent I/O, so we need to decrease -the number of threads that can concurrently access the disk per index. This setting -will allow `max_thread_count + 2` threads to operate on the disk at one time, -so a setting of `1` will allow three threads. - -For SSDs, you can ignore this setting. The default is -`Math.min(3, Runtime.getRuntime().availableProcessors() / 2)`, which works well -for SSD. - -Finally, you can increase `index.translog.flush_threshold_size` from the default -512 MB to something larger, such as 1 GB. This allows larger segments to accumulate -in the translog before a flush occurs. By letting larger segments build, you -flush less often, and the larger segments merge less often. All of this adds up -to less disk I/O overhead and better indexing rates. Of course, you will need -the corresponding amount of heap memory free to accumulate the extra buffering -space, so keep that in mind when adjusting this setting. 
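+上面提到的 `index.translog.flush_threshold_size` 同样可以通过索引设置 API 动态调整。下面是一个示意(这里的索引名 `my_index` 只是假设的例子;演示把阈值从默认的 512 MB 提高到 1 GB):
+
+[source,js]
+----
+PUT /my_index/_settings
+{
+    "index.translog.flush_threshold_size": "1gb"
+}
+----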
-
-==== Other
-
-Finally, there are some other considerations to keep in mind:
-
-- If you don't need near real-time accuracy on your search results, consider
-dropping the `index.refresh_interval` of((("indexing", "performance tips", "other considerations")))((("refresh_interval setting"))) each index to `30s`. If you are doing
-a large import, you can disable refreshes by setting this value to `-1` for the
-duration of the import. Don't forget to reenable it when you are finished!
-
-- If you are doing a large bulk import, consider disabling replicas by setting
-`index.number_of_replicas: 0`.((("replicas, disabling during large bulk imports"))) When documents are replicated, the entire document
-is sent to the replica node and the indexing process is repeated verbatim. This
-means each replica will perform the analysis, indexing, and potentially merging
-process.
+机械磁盘在并发 I/O 支持方面比较差,所以我们需要降低每个索引并发访问磁盘的线程数。这个设置允许 `max_thread_count + 2` 个线程同时进行磁盘操作,也就是设置为 `1` 允许三个线程。
+
+对于 SSD,你可以忽略这个设置,默认是 `Math.min(3, Runtime.getRuntime().availableProcessors() / 2)` ,对 SSD 来说运行得很好。
+
+最后,你可以增加 `index.translog.flush_threshold_size` 设置,从默认的 512 MB 到更大一些的值,比如 1 GB。这可以在清空(flush)触发之前,在事务日志里积累出更大的段。通过构建更大的段,清空的频率变低,大段合并的频率也变低。这一切合起来带来更少的磁盘 I/O 开销和更好的索引速率。当然,你会需要对应量级的 heap 内存用以积累更大的缓冲空间,调整这个设置的时候请记住这点。
+
+==== 其他
+
+最后,还有一些其他值得考虑的东西需要记住:
+
+- 如果你的搜索结果不需要近实时的准确度,考虑把每个索引的 `index.refresh_interval`((("indexing", "performance tips", "other considerations")))((("refresh_interval setting")))改到 `30s` 。如果你是在做大批量导入,导入期间你可以通过设置这个值为 `-1` 关掉刷新。别忘记在完工的时候重新开启它。
+
+- 如果你在做大批量导入,考虑通过设置 `index.number_of_replicas: 0`((("replicas, disabling during large bulk imports")))关闭副本。文档在复制的时候,整个文档内容都被发往副本节点,然后把索引过程原封不动地重复一遍。这意味着每个副本也会执行分析、索引以及可能的合并过程。
+
- -- If you don't have a natural ID for each document, use Elasticsearch's auto-ID -functionality.((("id", "auto-ID functionality of Elasticsearch"))) It is optimized to avoid version lookups, since the autogenerated -ID is unique. - -- If you are using your own ID, try to pick an ID that is http://blog.mikemccandless.com/2014/05/choosing-fast-unique-identifier-uuid.html[friendly to Lucene]. ((("UUIDs (universally unique identifiers)"))) Examples include zero-padded -sequential IDs, UUID-1, and nanotime; these IDs have consistent, sequential -patterns that compress well. In contrast, IDs such as UUID-4 are essentially -random, which offer poor compression and slow down Lucene. +相反,如果你的索引是零副本,然后在写入完成后再开启副本,恢复过程本质上只是一个字节到字节的网络传输。相比重复索引过程,这个算是相当高效的了。 + +- 如果你没有给每个文档自带 ID,使用 Elasticsearch 的自动 ID 功能。((("id", "auto-ID functionality of Elasticsearch")))这个为避免版本查找做了优化,因为自动生成的 ID 是唯一的。 + +- 如果你在使用自己的 ID,尝试使用一种 http://blog.mikemccandless.com/2014/05/choosing-fast-unique-identifier-uuid.html[Lucene 友好的] ID。((("UUIDs (universally unique identifiers)")))包括零填充序列 ID、UUID-1 和纳秒;这些 ID 都是有一致的,压缩良好的序列模式。相反的,像 UUID-4 这样的 ID,本质上是随机的,压缩比很低,会明显拖慢 Lucene。