diff --git a/050_Search/00_Intro.asciidoc b/050_Search/00_Intro.asciidoc
index 6b6516a6c..1c11ebd53 100644
--- a/050_Search/00_Intro.asciidoc
+++ b/050_Search/00_Intro.asciidoc
@@ -1,60 +1,43 @@
 [[search]]
-== Searching--The Basic Tools
+== Searching--The Basic Tools

-So far, we have learned how to use Elasticsearch as a simple NoSQL-style
-distributed document store. We can ((("searching")))throw JSON documents at Elasticsearch and
-retrieve each one by ID. But the real power of Elasticsearch lies in its
-ability to make sense out of chaos -- to turn Big Data into Big Information.
+By now we have learned how to use Elasticsearch as a simple NoSQL-style distributed document store: we can ((("searching")))throw a JSON document at Elasticsearch and retrieve it by ID. But the real power of Elasticsearch lies in its ability to make sense out of chaotic data -- to turn Big Data into Big Information.

-This is the reason that we use structured JSON documents, rather than
-amorphous blobs of data. Elasticsearch not only _stores_ the document, but
-also _indexes_ the content of the document in order to make it searchable.
+Elasticsearch not only _stores_ the document, it also _indexes_ the content of the document so that it can be searched. This is why we use structured JSON documents rather than amorphous blobs of data.

-_Every field in a document is indexed and can be queried_. ((("indexing"))) And it's not just
-that. During a single query, Elasticsearch can use _all_ of these indices, to
-return results at breath-taking speed. That's something that you could never
-consider doing with a traditional database.
+_Every field in a document is indexed and can be queried_. ((("indexing"))) And not only that: in a single query, Elasticsearch can use _all_ of these indexed fields to return results at breathtaking speed. That is something you would never consider doing with a traditional database.

-A _search_ can be any of the following:
+A _search_ can be any of the following:

-* A structured query on concrete fields((("fields", "searching on")))((("searching", "types of searches"))) like `gender` or `age`, sorted by
-  a field like `join_date`, similar to the type of query that you could construct
-  in SQL
+* A structured query on concrete fields((("fields", "searching on")))((("searching", "types of searches"))) such as `gender` or `age`, sorted by a field such as `join_date`, much like the structured queries you would write in SQL.

-* A full-text query, which finds all documents matching the search keywords,
-  and returns them sorted by _relevance_
+* A full-text query, which finds all documents matching the search keywords and returns them sorted by _relevance_.

-* A combination of the two
+* A combination of the two.

-While many searches will just work out of((("full text search"))) the box, to use Elasticsearch to
-its full potential, you need to understand three subjects:
+While many searches will just work out of the box,((("full text search"))) to use Elasticsearch to its full potential you need to understand three concepts:

- _Mapping_::
-    How the data in each field is interpreted
-
- _Analysis_::
-    How full text is processed to make it searchable
-
- _Query DSL_::
-    The flexible, powerful query language used by Elasticsearch
+ _Mapping_::
+   How the data in each field is interpreted

-Each of these is a big subject in its own right, and we explain them in
-detail in <>. The chapters in this section introduce the
-basic concepts of all three--just enough to help you to get an overall
-understanding of how search works.
+ _Analysis_::
+   How full text is processed to make it searchable

-We will start by explaining the `search` API in its simplest form.
+ _Query DSL_::
+   The flexible, powerful query language used by Elasticsearch

-.Test Data
+Each of these is a big topic in its own right, and we explain them in detail in <>. The chapters in this section introduce the basic concepts of all three, just enough to give you an overall understanding of how search works.
+
+We will start by introducing the `search` API in its simplest form.
+
+.Test Data
 ****
-The documents that we will use for test purposes in this chapter can be found
-in this gist: https://gist.github.com/clintongormley/8579281.
+The test data for this chapter can be found in this gist: https://gist.github.com/clintongormley/8579281 .
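The gist is just a series of REST calls, so any HTTP client will work. As a rough illustration of the kind of commands it contains (the index, type, and document below are hypothetical placeholders, not the actual test data), indexing and then searching a document from the shell might look like this:

[source,bash]
----
# Index a sample document (placeholder data; the real documents come from the gist)
curl -XPUT 'localhost:9200/website/blog/1?pretty' \
     -H 'Content-Type: application/json' \
     -d '{ "title": "My first blog entry", "text": "Just trying this out..." }'

# Confirm that the document was indexed and can be searched
curl -XGET 'localhost:9200/website/blog/_search?pretty'
----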
-You can copy the commands and paste them into your shell in order to follow
-along with this chapter.
+You can copy these commands and paste them into your shell to follow along with the examples in this chapter.

-Alternatively, if you're in the online version of this book, you can link:sense_widget.html?snippets/050_Search/Test_data.json[click here to open in Sense].
+Alternatively, if you are reading the online version of this book, you can link:sense_widget.html?snippets/050_Search/Test_data.json[click here to open the examples in Sense].
 ****
diff --git a/510_Deployment/50_heap.asciidoc b/510_Deployment/50_heap.asciidoc
index 2d45a8830..8e8105bba 100644
--- a/510_Deployment/50_heap.asciidoc
+++ b/510_Deployment/50_heap.asciidoc
@@ -1,101 +1,76 @@
 [[heap-sizing]]
-=== Heap: Sizing and Swapping
+=== Heap: Sizing and Swapping

-The default installation of Elasticsearch is configured with a 1 GB heap. ((("deployment", "heap, sizing and swapping")))((("heap", "sizing and setting"))) For
-just about every deployment, this number is far too small. If you are using the
-default heap values, your cluster is probably configured incorrectly.
+A default installation of Elasticsearch is configured with a 1 GB heap.((("deployment", "heap, sizing and swapping")))((("heap", "sizing and setting"))) For just about any real-world deployment this is far too small. If you are still running with the default heap settings, your cluster is probably misconfigured.

-There are two ways to change the heap size in Elasticsearch. The easiest is to
-set an environment variable called `ES_HEAP_SIZE`.((("ES_HEAP_SIZE environment variable"))) When the server process
-starts, it will read this environment variable and set the heap accordingly.
-As an example, you can set it via the command line as follows:
+There are two ways to change the heap size in Elasticsearch. The easiest is to set the `ES_HEAP_SIZE` environment variable.((("ES_HEAP_SIZE environment variable"))) The server process reads this variable at startup and sizes the heap accordingly. For example, you can set it from the command line like this:

 [source,bash]
 ----
 export ES_HEAP_SIZE=10g
 ----

-Alternatively, you can pass in the heap size via a command-line argument when starting
-the process, if that is easier for your setup:
+Alternatively, you can pass the heap size to the process as a command-line argument when starting it, if that is easier for your setup:

 [source,bash]
 ----
 ./bin/elasticsearch -Xmx10g -Xms10g <1>
 ----
-<1> Ensure that the min (`Xms`) and max (`Xmx`) sizes are the same to prevent
-the heap from resizing at runtime, a very costly process.
+<1> Make sure the minimum (`Xms`) and maximum (`Xmx`) heap sizes are the same, to prevent the heap from being resized at runtime, which is a very costly process.

-Generally, setting the `ES_HEAP_SIZE` environment variable is preferred over setting
-explicit `-Xmx` and `-Xms` values.
+Generally speaking, setting the `ES_HEAP_SIZE` environment variable is better than passing explicit `-Xmx` and `-Xms` values.

-==== Give Half Your Memory to Lucene
+==== Give Half Your Memory to Lucene

-A common problem is configuring a heap that is _too_ large. ((("heap", "sizing and setting", "giving half your memory to Lucene"))) You have a 64 GB
-machine--and by golly, you want to give Elasticsearch all 64 GB of memory. More
-is better!
+A common problem is giving Elasticsearch a heap that is _too_ large.((("heap", "sizing and setting", "giving half your memory to Lucene"))) Say you have a machine with 64 GB of RAM; naturally you want to give Elasticsearch all 64 GB, because more is better!

-Heap is definitely important to Elasticsearch. It is used by many in-memory data
-structures to provide fast operation. But with that said, there is another major
-user of memory that is _off heap_: Lucene.
+Heap is certainly important to Elasticsearch: it is used by many in-memory data structures to provide fast operations. That said, there is another major consumer of memory that lives _off heap_: Lucene.

-Lucene is designed to leverage the underlying OS for caching in-memory data structures.((("Lucene", "memory for")))
-Lucene segments are stored in individual files. Because segments are immutable,
-these files never change. This makes them very cache friendly, and the underlying
-OS will happily keep hot segments resident in memory for faster access.
+Lucene is designed to leverage the underlying OS for caching in-memory data structures.((("Lucene", "memory for"))) Lucene segments are each stored in individual files. Because segments are immutable, these files never change, which makes them very cache friendly, and the OS will happily keep hot segment files resident in memory for faster access.
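If you are curious how much memory the OS is currently devoting to this cache on a Linux box, the `free` command gives a rough picture. This is only a quick sketch; the exact column names (`cached` versus `buff/cache`) vary between distributions and procps versions:

[source,bash]
----
# The "buff/cache" (or "cached") column is memory the kernel is using for the
# filesystem cache, which is exactly the memory that Lucene's segment files benefit from.
free -m
----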
-Lucene's performance relies on this interaction with the OS. But if you give all
-available memory to Elasticsearch's heap, there won't be any left over for Lucene.
-This can seriously impact the performance of full-text search.
+Lucene's performance relies on this interaction with the OS. If you give all of the available memory to Elasticsearch's heap, there will be nothing left over for Lucene, and that can seriously hurt the performance of full-text search.

-The standard recommendation is to give 50% of the available memory to Elasticsearch
-heap, while leaving the other 50% free. It won't go unused; Lucene will happily
-gobble up whatever is left over.
+The standard recommendation is to give 50% of the available memory to the Elasticsearch heap and leave the other 50% free. It will not go to waste; Lucene will happily gobble up whatever is left over.
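If you want to turn the 50% rule into an actual setting, the following is a minimal sketch for a Linux machine dedicated to Elasticsearch. The variable names are made up for this example, and the result should still be capped by the ~32 GB limit discussed in the next section:

[source,bash]
----
# Read physical RAM (in kB) from /proc/meminfo and give half of it to the heap
total_kb=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
half_mb=$(( total_kb / 2 / 1024 ))
export ES_HEAP_SIZE=${half_mb}m

echo "ES_HEAP_SIZE=$ES_HEAP_SIZE"
----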
 [[compressed_oops]]
-==== Don't Cross 32 GB!
-There is another reason to not allocate enormous heaps to Elasticsearch. As it turns((("heap", "sizing and setting", "32gb heap boundary")))((("32gb Heap boundary")))
-out, the HotSpot JVM uses a trick to compress object pointers when heaps are less
-than around 32 GB.
-
-In Java, all objects are allocated on the heap and referenced by a pointer.
-Ordinary object pointers (OOP) point at these objects, and are traditionally
-the size of the CPU's native _word_: either 32 bits or 64 bits, depending on the
-processor. The pointer references the exact byte location of the value.
-
-For 32-bit systems, this means the maximum heap size is 4 GB. For 64-bit systems,
-the heap size can get much larger, but the overhead of 64-bit pointers means there
-is more wasted space simply because the pointer is larger. And worse than wasted
-space, the larger pointers eat up more bandwidth when moving values between
-main memory and various caches (LLC, L1, and so forth).
-
-Java uses a trick called https://wikis.oracle.com/display/HotSpotInternals/CompressedOops[compressed oops]((("compressed object pointers")))
-to get around this problem. Instead of pointing at exact byte locations in
-memory, the pointers reference _object offsets_.((("object offsets"))) This means a 32-bit pointer can
-reference four billion _objects_, rather than four billion bytes. Ultimately, this
-means the heap can grow to around 32 GB of physical size while still using a 32-bit
-pointer.
-
-Once you cross that magical ~32 GB boundary, the pointers switch back to
-ordinary object pointers. The size of each pointer grows, more CPU-memory
-bandwidth is used, and you effectively lose memory. In fact, it takes until around
-40–50 GB of allocated heap before you have the same _effective_ memory of a
-heap just under 32 GB using compressed oops.
-
-The moral of the story is this: even when you have memory to spare, try to avoid
-crossing the 32 GB heap boundary. It wastes memory, reduces CPU performance, and
-makes the GC struggle with large heaps.
-
-==== Just how far under 32gb should I set the JVM?
-
-Unfortunately, that depends. The exact cutoff varies by JVMs and platforms.
-If you want to play it safe, setting the heap to `31gb` is likely safe.
-Alternatively, you can verify the cutoff point for the HotSpot JVM by adding
-`-XX:+PrintFlagsFinal` to your JVM options and checking that the value of the
-UseCompressedOops flag is true. This will let you find the exact cutoff for your
-platform and JVM.
-
-For example, here we test a Java 1.7 installation on MacOSX and see the max heap
-size is around 32600mb (~31.83gb) before compressed pointers are disabled:
+==== Don't Cross 32 GB!
+
+There is another reason not to allocate an enormous heap to Elasticsearch. As it turns out,((("heap", "sizing and setting", "32gb heap boundary")))((("32gb Heap boundary"))) the HotSpot JVM uses a trick to compress object pointers when the heap is smaller than about 32 GB.
+
+In Java, all objects are allocated on the heap and referenced by a pointer. Ordinary object pointers (OOPs) point at these objects and are traditionally the size of the CPU's native _word_: either 32 bits or 64 bits, depending on your processor.
+
+For a 32-bit system this means the maximum heap size is 4 GB. A 64-bit system can use much larger heaps, but 64-bit pointers mean more wasted space simply because the pointers themselves are bigger. Worse than the wasted space, the larger pointers also eat up more bandwidth when moving values between main memory and the various caches (LLC, L1, and so forth).
+
+Java works around this problem with a technique called https://wikis.oracle.com/display/HotSpotInternals/CompressedOops[compressed oops].((("compressed object pointers"))) Instead of pointing at exact byte locations in memory, the pointers reference _object offsets_.((("object offsets"))) This means a 32-bit pointer can reference four billion _objects_ rather than four billion bytes. Ultimately, the heap can grow to around 32 GB of physical size while still being addressed with 32-bit pointers.
+
+Once you cross that magical ~32 GB boundary, the pointers switch back to ordinary object pointers. Each pointer grows larger, more CPU-to-memory bandwidth is consumed, and you effectively lose memory. In fact, it takes around 40–50 GB of allocated heap before you get back the same _effective_ memory as a heap just under 32 GB using compressed oops.
+
+The moral of the story: even when you have memory to spare, try not to cross the 32 GB heap boundary. It wastes memory, reduces CPU performance, and makes the GC struggle with a large heap.
+
+==== Just how far below 32 GB should I set the JVM?
+
+Unfortunately, that depends. The exact cutoff varies by JVM and platform. If you want to play it safe, setting the heap to `31gb` is a safe choice. Alternatively, you can add `-XX:+PrintFlagsFinal` to your JVM options and check that the value of the UseCompressedOops flag is true; this will find the exact cutoff for your own JVM and platform.
+
+For example, testing on MacOSX with Java 1.7 installed, we can see that the maximum heap size before compressed pointers are disabled is around 32600mb (~31.83gb):

 [source,bash]
 ----
@@ -105,8 +80,7 @@ $ JAVA_HOME=`/usr/libexec/java_home -v 1.7` java -Xmx32766m -XX:+PrintFlagsFinal
 bool UseCompressedOops = false
 ----

-In contrast, a Java 1.8 installation on the same machine has a max heap size
-around 32766mb (~31.99gb):
+In contrast, with Java 1.8 installed on the same machine, the maximum heap size before compressed pointers are disabled is around 32766mb (~31.99gb):

 [source,bash]
 ----
@@ -116,86 +90,75 @@ $ JAVA_HOME=`/usr/libexec/java_home -v 1.8` java -Xmx32767m -XX:+PrintFlagsFinal
 bool UseCompressedOops = false
 ----

-The morale of the story is that the exact cutoff to leverage compressed oops
-varies from JVM to JVM, so take caution when taking examples from elsewhere and
-be sure to check your system with your configuration and JVM.
+What these examples tell us is that the exact cutoff for using compressed object pointers varies from JVM to JVM, so be careful when borrowing numbers from elsewhere, and be sure to check with your own configuration and JVM.

-Beginning with Elasticsearch v2.2.0, the startup log will actually tell you if your
-JVM is using compressed OOPs or not. You'll see a log message like:
+Beginning with Elasticsearch v2.2.0, the startup log will actually tell you whether the JVM is using compressed object pointers. You will see a log message like:

 [source, bash]
 ----
 [2015-12-16 13:53:33,417][INFO ][env] [Illyana Rasputin] heap size [989.8mb], compressed ordinary object pointers [true]
 ----

-Which indicates that compressed object pointers are being used. If they are not,
-the message will say `[false]`.
-
+This shows that compressed object pointers are being used. If they are not, the message will say `[false]`.
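As a back-of-the-envelope check on where the ~32 GB figure comes from, assuming HotSpot's default 8-byte object alignment (which is what lets a 32-bit offset stretch that far):

[source,bash]
----
# 2^32 possible 32-bit offsets x 8 bytes of object alignment = 32 GB of addressable heap
echo $(( (2**32 * 8) / 1024**3 ))   # prints 32
----

On Elasticsearch v2.2.0 and later you can also simply search the startup log for the `compressed ordinary object pointers` line shown above; where that log file lives depends on how you installed Elasticsearch.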
 [role="pagebreak-before"]
-.I Have a Machine with 1 TB RAM!
+.I Have a Machine with 1 TB of RAM!
 ****
-The 32 GB line is fairly important. So what do you do when your machine has a lot
-of memory? It is becoming increasingly common to see super-servers with 512–768 GB
-of RAM.
+The 32 GB line is fairly important. So what do you do when your machine has a lot more memory than that? Super-servers with 512–768 GB of RAM are becoming increasingly common.

-First, we would recommend avoiding such large machines (see <>).
+First, we recommend avoiding such high-spec machines (see <>).

-But if you already have the machines, you have two practical options:
+But if you already have machines like this, you have two practical options:

-- Are you doing mostly full-text search? Consider giving just under 32 GB to Elasticsearch
-and letting Lucene use the rest of memory via the OS filesystem cache. All that
-memory will cache segments and lead to blisteringly fast full-text search.
+- Are you doing mostly full-text search? Consider giving Elasticsearch just under 32 GB and letting Lucene use the rest of the memory via the OS filesystem cache. All that memory will cache segments and lead to blisteringly fast full-text search.

-- Are you doing a lot of sorting/aggregations? You'll likely want that memory
-in the heap then. Instead of one node with more than 32 GB of RAM, consider running two or
-more nodes on a single machine. Still adhere to the 50% rule, though. So if your
-machine has 128 GB of RAM, run two nodes, each with just under 32 GB. This means that less
-than 64 GB will be used for heaps, and more than 64 GB will be left over for Lucene.
+- Are you doing a lot of sorting and aggregations? Then you will probably want that memory in the heap. Instead of deploying one node with a heap of 32 GB or more, consider running two or more Elasticsearch nodes on the same machine. Still stick to the 50% rule, though: with a 128 GB machine, run two nodes, each with just under 32 GB of heap, so that less than 64 GB goes to the heaps and more than 64 GB is left over for Lucene.
+
-If you choose this option, set `cluster.routing.allocation.same_shard.host: true`
-in your config. This will prevent a primary and a replica shard from colocating
-to the same physical machine (since this would remove the benefits of replica high availability).
+If you choose the second option, set `cluster.routing.allocation.same_shard.host: true` in your config. This prevents a primary shard and its replica from being placed on the same physical machine (because that would remove the high-availability benefit of the replica).
 ****

-==== Swapping Is the Death of Performance
+==== Swapping Is the Death of Performance

-It should be obvious,((("heap", "sizing and setting", "swapping, death of performance")))((("memory", "swapping as the death of performance")))((("swapping, the death of performance"))) but it bears spelling out clearly: swapping main memory
-to disk will _crush_ server performance. Think about it: an in-memory operation
-is one that needs to execute quickly.
+It should be obvious,((("heap", "sizing and setting", "swapping, death of performance")))((("memory", "swapping as the death of performance")))((("swapping, the death of performance"))) but it is worth spelling out clearly: swapping main memory to disk will _crush_ server performance. Think about it: an in-memory operation is one that needs to execute quickly.

-If memory swaps to disk, a 100-microsecond operation becomes one that take 10
-milliseconds. Now repeat that increase in latency for all other 10us operations.
-It isn't difficult to see why swapping is terrible for performance.
+If memory is swapped out to disk, a 100-microsecond operation becomes one that takes 10 milliseconds. Now think of all the other 10-microsecond operations whose latency increases in the same way, and it is not hard to see how terrible swapping is for performance.

-The best thing to do is disable swap completely on your system. This can be done
-temporarily:
+The best option is to disable swap completely on your system. This can be done temporarily with:

 [source,bash]
 ----
 sudo swapoff -a
 ----

-To disable it permanently, you'll likely need to edit your `/etc/fstab`. Consult
-the documentation for your OS.
+To disable it permanently, you will probably need to edit your `/etc/fstab` file; consult the documentation for your operating system.

-If disabling swap completely is not an option, you can try to lower `swappiness`.
-This value controls how aggressively the OS tries to swap memory.
-This prevents swapping under normal circumstances, but still allows the OS to swap
-under emergency memory situations.
+If disabling swap completely is not an option, you can try lowering `swappiness`. This value controls how aggressively the OS swaps memory. It prevents swapping under normal circumstances, while still allowing the OS to swap in an emergency low-memory situation.

-For most Linux systems, this is configured using the `sysctl` value:
+On most Linux systems, this is configured with the `sysctl` value:

 [source,bash]
 ----
 vm.swappiness = 1 <1>
 ----
-<1> A `swappiness` of `1` is better than `0`, since on some kernel versions a `swappiness`
-of `0` can invoke the OOM-killer.
+<1> A `swappiness` of `1` is better than `0`, because on some kernel versions a `swappiness` of `0` can invoke the OOM killer (the Linux kernel's out-of-memory killer).

-Finally, if neither approach is possible, you should enable `mlockall`.
- file. This allows the JVM to lock its memory and prevent
-it from being swapped by the OS. In your `elasticsearch.yml`, set this:
+Finally, if neither approach is possible, you should enable `mlockall`. This allows the JVM to lock its memory and prevents it from being swapped out by the OS. In your `elasticsearch.yml` file, set this:

 [source,yaml]
 ----