
Cache the creation of parsers within DateProcessor #92880

Merged
merged 5 commits into elastic:main on Jan 31, 2023

Conversation

HiDAl
Contributor

@HiDAl HiDAl commented Jan 12, 2023

This PR introduces a caching mechanism within the DateProcessor class. The cache memoizes the creation of expensive objects, which are produced through calls to the DateFormatter.forPattern(...) method and other costly methods.
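
For illustration only, a minimal, hedged sketch of the memoization idea, using java.time's DateTimeFormatter as a stand-in for Elasticsearch's DateFormatter (the class and key names below are hypothetical, not the PR's code):

import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

// Hypothetical sketch: memoize the expensive parser construction per (format, locale) key.
public final class DateParserCacheSketch {
    private record Key(String format, Locale locale) {}

    private static final ConcurrentMap<Key, Function<String, ZonedDateTime>> PARSERS = new ConcurrentHashMap<>();

    static Function<String, ZonedDateTime> parserFor(String format, Locale locale) {
        return PARSERS.computeIfAbsent(new Key(format, locale), k -> {
            // In the real processor the expensive call is DateFormatter.forPattern(...); java.time stands in here.
            DateTimeFormatter formatter = DateTimeFormatter.ofPattern(k.format(), k.locale());
            return text -> ZonedDateTime.parse(text, formatter);
        });
    }

    public static void main(String[] args) {
        Function<String, ZonedDateTime> parse = parserFor("yyyy-MM-dd'T'HH:mm:ssXXX", Locale.ROOT);
        System.out.println(parse.apply("2023-01-12T10:15:30+01:00")); // parser is built once, then reused
    }
}

Repeated documents with the same format and locale then skip the expensive construction entirely and only pay for a map lookup.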

We have benchmarked the change and obtained a performance improvement of between 10% and 80% on some of the logging benchmarks (~80% for logs-postgresql.log-1.2.0).

Benchmark

  • Benchmarked using esbench (use-case=simple, track.name=elastic/logs)
  • Benchmarked 2 cache sizes (16 & 64). There's a slight difference in processing time between them (200 ns - 1 ms), with the smaller cache coming out ahead.
  • The results below are summarised (Full results)

BASELINE

Ingest Summary:
===============
          count  time_in_millis  time_in_nanos
total  32367760         1532279          47340

Pipeline Summary:
=================
                                          count  time_in_millis  time_in_nanos  percent
logs-nginx.access-1.2.0                 1769000          188822         106740    24.0%
logs-postgresql.log-1.2.0                320000          185256         578924    23.6%
logs-system.syslog-1.6.4                2234000          109214          48888    13.9%
logs-system.auth-1.6.4                   195881           72172         368447     9.2%
logs-redis.log-1.1.0                    1600000           50652          31658     6.4%
.fleet_final_pipeline-1                 9398881           47060           5007     6.0%
logs-kafka.log-0.5.0                     640000           28595          44680     3.6%
logs-apache.access-1.3.0                 320000           27248          85150     3.5%
logs-mysql.error-1.1.0                   560000           24513          43774     3.1%
logs-apache.error-1.3.0                  320000           19929          62278     2.5%
logs-postgresql.log-1.2.0-pipeline-log   320000           14646          45769     1.9%
logs-nginx.error-1.2.0                   320000           12374          38669     1.6%
logs-mysql.slowlog-1.1.0                 160000            5424          33900     0.7%

CACHE OF SIZE 16

Ingest Summary:
===============
          count  time_in_millis  time_in_nanos
total  33376000         1412871          42332

Pipeline Summary:
=================
                                          count  time_in_millis  time_in_nanos  percent
logs-nginx.access-1.2.0                 1920000          188206          98024    29.3%
logs-system.auth-1.6.4                   320000          116025         362577    18.0%
logs-system.syslog-1.6.4                2240000           91281          40751    14.2%
.fleet_final_pipeline-1                 9848000           44558           4525     6.9%
logs-postgresql.log-1.2.0                320000           44092         137788     6.9%
logs-redis.log-1.1.0                    1600000           37506          23442     5.8%
logs-kafka.log-0.5.0                     728000           26900          36951     4.2%
logs-apache.access-1.3.0                 320000           25326          79144     3.9%
logs-mysql.error-1.1.0                   640000           21607          33761     3.4%
logs-apache.error-1.3.0                  320000           18109          56591     2.8%
logs-postgresql.log-1.2.0-pipeline-log   320000           13532          42288     2.1%
logs-nginx.error-1.2.0                   320000           10569          33029     1.6%
logs-mysql.slowlog-1.1.0                 160000            5708          35675     0.9%

CACHE OF SIZE 64

Ingest Summary:
===============
          count  time_in_millis  time_in_nanos
total  33376000         1445614          43313

Pipeline Summary:
=================
                                          count  time_in_millis  time_in_nanos  percent
logs-nginx.access-1.2.0                 1920000          195122         101626    29.3%
logs-system.auth-1.6.4                   320000          123478         385868    18.5%
logs-system.syslog-1.6.4                2240000           94484          42181    14.2%
.fleet_final_pipeline-1                 9848000           46163           4688     6.9%
logs-postgresql.log-1.2.0                320000           43839         136997     6.6%
logs-redis.log-1.1.0                    1600000           37604          23503     5.6%
logs-kafka.log-0.5.0                     728000           27726          38086     4.2%
logs-apache.access-1.3.0                 320000           25662          80194     3.8%
logs-mysql.error-1.1.0                   640000           23132          36144     3.5%
logs-apache.error-1.3.0                  320000           18514          57857     2.8%
logs-postgresql.log-1.2.0-pipeline-log   320000           13572          42413     2.0%
logs-nginx.error-1.2.0                   320000           11415          35672     1.7%
logs-mysql.slowlog-1.1.0                 160000            6134          38338     0.9%

@HiDAl HiDAl added >enhancement :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP Team:Data Management Meta label for data/management team v8.7.0 labels Jan 12, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@elasticsearchmachine
Collaborator

Hi @HiDAl, I've created a changelog YAML for you.

@HiDAl HiDAl changed the title Add a cache in the DateProcessor for the formatters Cache the creation of parsers within DateProcessor Jan 12, 2023
@HiDAl
Contributor Author

HiDAl commented Jan 12, 2023

@elasticsearchmachine generate changelog

@joegallo
Contributor

joegallo commented Jan 13, 2023

@HiDAl can you grab a heap dump during a run of the Logging benchmark? I'm thinking we've got ~20 date processors running there, each with a cache of up to 16 of these things. So that's (up to) 320 of them. I'm guessing they're pretty small, but how small?

That is, clearly we're making a time/space tradeoff here. The stats above give us a clear sense of the time we're buying (tons! awesome!), but I'm curious how much space we're giving up to buy that.

A separate interesting question. What if we wanted to have zero cache evictions? Could you temporarily add some extra logging and run the Logging benchmark again? I'd be very curious what the normal usage pattern is here. For example, are some date processors just using one or two cached date-parsing-functions (embarrassingly cacheable! perfect!) and others are churning their cache constantly as they try to rotate through 512 different functions?

An alternative implementation might be a static cache based around a ConcurrentMap<Key, SoftReference<Function<String, ZonedDateTime>>> -- in that case we could make it bigger than 16, even something like 256 or 512 would probably be reasonable.

I'm super in favor of getting a cache like this in, but I want to be very thoughtful about it.
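
For illustration only, a rough sketch of what such a static, soft-reference-based cache could look like (the Key record and method names are hypothetical, not this PR's code):

import java.lang.ref.SoftReference;
import java.time.ZonedDateTime;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

// Sketch of a static, JVM-wide cache whose values the GC may reclaim under memory pressure.
final class StaticSoftCacheSketch {
    record Key(String format, String timezone, String locale) {}

    private static final ConcurrentMap<Key, SoftReference<Function<String, ZonedDateTime>>> CACHE =
        new ConcurrentHashMap<>();

    static Function<String, ZonedDateTime> getOrCompute(Key key, Function<Key, Function<String, ZonedDateTime>> build) {
        SoftReference<Function<String, ZonedDateTime>> ref = CACHE.get(key);
        Function<String, ZonedDateTime> fn = (ref == null) ? null : ref.get();
        if (fn == null) {
            // Rebuild if the entry is absent or the GC has cleared the soft reference.
            fn = build.apply(key);
            CACHE.put(key, new SoftReference<>(fn));
        }
        return fn;
    }
}

Because the values are only softly reachable, a larger capacity (256 or 512) stays relatively safe: under memory pressure the GC can drop the parsers, and they are simply rebuilt on the next use.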

@HiDAl
Contributor Author

HiDAl commented Jan 17, 2023

@joegallo well, I've taken the heap dump, and here's what I found out:

  1. DateProcessor is the second largest processor (right after GrokProcessor [1])

Screenshot 2023-01-17 at 10 38 03

  2. The whole set of DateProcessors (18 objects for this pipeline configuration) represents 0.03% of the reserved memory

Screenshot 2023-01-17 at 09 48 58

with the cache obviously accounting for a big portion of this memory (~128 KB):

Screenshot 2023-01-17 at 09 54 42

  3. Even though some of the DateProcessors use values coming from the documents for the locale or the timezone (part of the cache key, with the potential for high dispersion), they're still highly cacheable [2]: there are only 22 elements across all the caches

Screenshot 2023-01-17 at 10 10 05

Cache lifecycle

On the other hand, after adding some basic logging (89b9e0), we found the following:

  1. There are no evictions for this pipeline configuration:
../logs/server$ grep -i 'purging' default.log
../logs/server$
  2. The cache size is small [3]:
../logs/server$ grep -i 'cache size' default.log  | cut -d ']' -f 6 | sort -n | uniq -c
   7389  Cache size: 1
  10660  Cache size: 2
   6425  Cache size: 3
    173  Cache size: 4
   3235  Cache size: 6

About having a common static cache

Running a new benchmark with a big, shared cache (commit 86b35), we obtained the following results:

Ingest & Index Summary:
=======================
          count  time_in_millis  time_in_nanos  ingest_time_%  index_time_in_millis
total  33324250         1409999          42312          25.0%               4220354

Pipeline Summary:
=================
                                          count  time_in_millis  time_in_nanos  percent
logs-nginx.access-1.2.0                 1920000          185653          96695    28.4%
logs-system.auth-1.6.4                   320000          119544         373574    18.3%
logs-system.syslog-1.6.4                2240000           94827          42334    14.5%
logs-postgresql.log-1.2.0                320000           48548         151713     7.4%
.fleet_final_pipeline-1                 9822129           44440           4525     6.8%
logs-redis.log-1.1.0                    1600000           36968          23105     5.7%
logs-kafka.log-0.5.0                     702131           26312          37475     4.0%
logs-apache.access-1.3.0                 320000           25239          78872     3.9%
logs-mysql.error-1.1.0                   640000           22179          34655     3.4%
logs-apache.error-1.3.0                  320000           18090          56532     2.8%
logs-postgresql.log-1.2.0-pipeline-log   320000           15339          47935     2.3%
logs-nginx.error-1.2.0                   320000           11169          34904     1.7%
logs-mysql.slowlog-1.1.0                 160000            5838          36488     0.9%

These results are slightly better than with a small, per-processor cache. Also, when checking the lifecycle of the cache (size and evictions), we get the following:

../logs/server$ grep -i 'cache size' default.log  | cut -d ']' -f 6 | sort -n | uniq -c
    512  Cache size: 1
    558  Cache size: 3
   1530  Cache size: 4
    466  Cache size: 5

../logs/server$ grep -i 'purging' default.log
../logs/server$

As you can see, the cache holds only 5 different values over the course of the benchmark, compared with the 22 values of the non-shared caches (see footnote [3]). So, for this specific benchmark, and probably for other scenarios where the cache key comes from a broader domain, the static cache is the better option, because:

  1. When the configuration uses the default locale & timezone and only a small set of formats, the values are highly cacheable and the same default keys won't be duplicated across the DateProcessors
  2. Since we already calculated the size of a cache entry, a cache with a maximum of 512 entries would use only ~3.5 MB of memory (from screenshot 1: 131264 B / 18 * 512 = 3.56 MB); even if we want to avoid evictions entirely, we could enlarge the cache further
  3. Since it's a shared cache, the maximum size is constant regardless of the number of processors (compared with a static cache of 512, a cluster with more than 32 processors, each with a per-processor cache of 16, would use more memory in the worst case)

[1] This is interesting: why is Grok taking so much space? Let's pull at that thread later
[2] It's important to consider that this is true for this pipeline configuration and data set: from the heap dump, we can see that the domain of the cache key (the combination of format, timezone, and locale) is really small, and therefore highly cacheable
[3] I tried to find a way to relate each pipeline to its cache size, but I couldn't find an easy way to do it without hacking a lot of the code; I think this version is enough to give a global view

@HiDAl
Contributor Author

HiDAl commented Jan 26, 2023

The last push introduces the static cache with a default size of 256, configurable through a setting.

Contributor

@joegallo joegallo left a comment

+1, but I've been involved enough in the creation of this that I think a second reviewer would be a good thing.


element = supplier.get();
if (map.size() >= capacity) {
    map.clear();
Member

This is kind of surprising. Whenever the cache is full, we empty it and start over? Is this to avoid the complexity of a concurrent LRU cache? Or is the idea that we'll rarely see more than 256 formats in practice, so this is just for safety and will rarely be hit?

Contributor Author

Yes! This decision was made mainly because we want to avoid the complexity of an LRU (or any other eviction policy) within this code path. The decision to make the size 256 was based primarily on experiments with different benchmarks.
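
As a rough illustration of that policy (a sketch under those assumptions, not the PR's exact class), the cache simply wipes everything once it reaches capacity instead of tracking recency:

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Supplier;

// Illustrative "clear everything when full" cache: no LRU bookkeeping on the hot path.
final class ClearOnFullCacheSketch<K, V> {
    private final ConcurrentMap<K, V> map = new ConcurrentHashMap<>();
    private final int capacity;

    ClearOnFullCacheSketch(int capacity) {
        this.capacity = capacity;
    }

    V getOrCompute(K key, Supplier<V> supplier) {
        V element = map.get(key);
        if (element == null) {
            element = supplier.get();
            // Guard against unbounded growth; with the expected handful of distinct keys
            // this branch should almost never be taken.
            if (map.size() >= capacity) {
                map.clear();
            }
            map.put(key, element);
        }
        return element;
    }
}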

Contributor

I think it's worth adding a one-line comment in the code above the clear to explain its purpose.

@joegallo
Contributor

Oh, I missed that this version of things doesn't have the SoftReferences around the values. I think we want that, or at least, if we're not going to go that way I'd like to understand why.

@HiDAl
Contributor Author

HiDAl commented Jan 27, 2023

Oh, I missed that this version of things doesn't have the SoftReferences around the values. I think we want that, or at least, if we're not going to go that way I'd like to understand why.

@joegallo you're right, my bad! Just pushed the needed changes using SoftReference<>s and some rudimentary tests.

@HiDAl HiDAl requested a review from masseyke January 27, 2023 10:07
HiDAl and others added 4 commits January 27, 2023 11:29
cache potentially duped values in the DateProcessor, avoid the creation
of disposable objects during the different executions of the processor
private static final Cache CACHE;

static {
    var cacheSizeStr = System.getProperty(CACHE_CAPACITY_SETTING, "256");
Member

What's the reason for doing this as a system property instead of a Settings thing (and related, are we documenting this setting)?

Contributor Author

We don't expect users to need to change this value unless they see a big performance impact on their pipelines. In that case, we wanted to provide an easy way to change the value if we receive an SDH (with the obvious memory-consumption trade-off). I was inspired by this other PR #80902 by @joegallo

Contributor

+1, git grep 'System.getProperty("es' | grep -vi test/ shows a good number of examples of this as the pattern for non-customer-facing internal knobs that we don't expect to need to twiddle (but that we're making possible to twiddle as a favor to our future selves).

Member

OK, just wanted to make sure it was intentional.
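
For reference, a hedged sketch of that internal-knob pattern (the property name and parsing here are illustrative assumptions, not necessarily this PR's exact code):

// Sketch: size an internal cache from a JVM system property, falling back to a default of 256.
final class CacheCapacitySketch {
    // Hypothetical property name, for illustration only.
    static final String CACHE_CAPACITY_SETTING = "es.ingest.date_processor.cache_size";
    static final int CACHE_CAPACITY;

    static {
        String cacheSizeStr = System.getProperty(CACHE_CAPACITY_SETTING, "256");
        CACHE_CAPACITY = Integer.parseInt(cacheSizeStr);
    }
}

An operator would override such a knob with something like -Des.ingest.date_processor.cache_size=512 on the JVM command line (again, the property name here is hypothetical).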

@HiDAl HiDAl requested a review from masseyke January 30, 2023 09:29
@HiDAl HiDAl merged commit 7e0ed53 into elastic:main Jan 31, 2023
@HiDAl HiDAl deleted the palcantar/date-proc-cache branch January 31, 2023 14:08
mark-vieira pushed a commit to mark-vieira/elasticsearch that referenced this pull request Jan 31, 2023
cache potentially duped values in the `DateProcessor`, avoiding the creation of disposable objects during the different executions
@joegallo
Contributor

joegallo commented Mar 1, 2023

Related to #73918
