Move the terms index of `_id` off-heap. #52518

jpountz · 2020-02-19T15:08:41Z

Backport of #52405

In #42838 we moved the terms index of all fields off-heap except the
`_id` field, because we were worried that doing so might make indexing
slower. In general, the indexing rate is only affected if explicit IDs
are used, as otherwise Elasticsearch almost never performs lookups in
the terms dictionary for the purpose of indexing. So it is quite wasteful
to require the terms index of `_id` to be loaded on-heap for users who
have append-only workloads. Furthermore, I ran benchmarks indexing the
`http_logs` dataset with explicit IDs, and they suggest that the slowdown
is low enough that it's probably not worth forcing the terms index to be
kept on-heap. Here are some numbers for the median indexing rate in
docs/s:

| Run | Master  | Patch   |
| --- | ------- | ------- |
| 1   | 45851.2 | 46401.4 |
| 2   | 45192.6 | 44561.0 |
| 3   | 45635.2 | 44137.0 |
| 4   | 46435.0 | 44692.8 |
| 5   | 45829.0 | 44949.0 |

And now heap usage in MB for segments:

| Run | Master  | Patch    |
| --- | ------- | -------- |
| 1   | 41.1720 | 0.352083 |
| 2   | 45.1545 | 0.382534 |
| 3   | 41.7746 | 0.381285 |
| 4   | 45.3673 | 0.412737 |
| 5   | 45.4616 | 0.375063 |

Indexing rate decreased by 1.8% on average, while memory usage decreased
by more than 100x.
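
As a quick check of the summary figures, here is a minimal sketch that recomputes them from the two tables. It assumes the averages are taken as a ratio of the per-column means; the PR does not state exactly how the 1.8% was derived, and the variable and function names are illustrative only.

```python
from statistics import mean

# Median indexing rate in docs/s, copied from the first table.
rate_master = [45851.2, 45192.6, 45635.2, 46435.0, 45829.0]
rate_patch = [46401.4, 44561.0, 44137.0, 44692.8, 44949.0]

# Heap usage in MB for segments, copied from the second table.
heap_master = [41.1720, 45.1545, 41.7746, 45.3673, 45.4616]
heap_patch = [0.352083, 0.382534, 0.381285, 0.412737, 0.375063]


def relative_decrease(before, after):
    """Relative drop of mean(after) compared to mean(before), as a fraction."""
    return 1.0 - mean(after) / mean(before)


def reduction_factor(before, after):
    """How many times smaller mean(after) is than mean(before)."""
    return mean(before) / mean(after)


# Assumption: "on average" means a ratio of column means, not a mean of per-run ratios.
rate_drop_pct = 100.0 * relative_decrease(rate_master, rate_patch)
heap_factor = reduction_factor(heap_master, heap_patch)

print(f"indexing rate decrease: {rate_drop_pct:.1f}%")
print(f"heap reduction factor: {heap_factor:.0f}x")
```

Under that assumption the indexing rate drops by about 1.8% and segment heap usage shrinks by a factor of roughly 115, consistent with the numbers quoted above.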

The `http_logs` dataset contains small documents and has a simple
indexing chain. More complex indexing chains, e.g. with more fields or
ingest pipelines, would see an even smaller decrease in indexing rate.
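
To make the explicit-ID versus append-only distinction concrete, here is a small sketch that builds request bodies for the `_bulk` API in both styles. The `bulk_body` helper and the sample documents are hypothetical and not part of this PR; only the NDJSON action format (`{"index": {...}}` followed by the document source) is taken from the Elasticsearch bulk API. Nothing is sent to a cluster.

```python
import json


def bulk_body(docs, ids=None):
    """Build an NDJSON body for the _bulk API (hypothetical helper).

    When `ids` is None the action lines carry no `_id`, so Elasticsearch
    generates IDs itself (the append-only case). When `ids` is given, each
    document is indexed under an explicit ID, which is the case where the
    `_id` terms index may be consulted during indexing.
    """
    if ids is not None and len(ids) != len(docs):
        raise ValueError("ids and docs must have the same length")
    lines = []
    for i, doc in enumerate(docs):
        action = {"index": {}}
        if ids is not None:
            action["index"]["_id"] = ids[i]
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


# Made-up sample documents, loosely shaped like web server log entries.
docs = [{"status": 200, "size": 1024}, {"status": 404, "size": 0}]

append_only = bulk_body(docs)
explicit_ids = bulk_body(docs, ids=["log-1", "log-2"])

print(append_only)
print(explicit_ids)
```

Only the second body corresponds to the workload measured in the benchmark above; the first is the append-only pattern that benefits from the memory savings at essentially no indexing cost.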

elasticmachine · 2020-02-19T15:13:38Z

Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)

jpountz · 2020-02-19T20:46:43Z

@elasticmachine test this please

jpountz · 2020-02-21T17:14:47Z

@elasticmachine run elasticsearch-ci/packaging-sample-matrix-unix

jpountz · 2020-02-24T07:22:35Z

@elasticmachine run elasticsearch-ci/packaging-sample-matrix-unix

jpountz added the `:Analytics/Geo`, `:Analytics/Aggregations`, `backport`, and `v7.7.0` labels and removed the `:Analytics/Aggregations` and `:Analytics/Geo` labels on Feb 19, 2020

jpountz added 2 commits February 20, 2020 09:38

remove tab

c821502

space instead of tab

12bb4b3

Merge branch '7.x' into backport/52405

b1330e2

jpountz merged commit f993ef8 into elastic:7.x Feb 24, 2020

jpountz deleted the backport/52405 branch February 24, 2020 17:14
