Slowlog loggers do not seem to be cleaned up when no longer needed #56171

Closed
DaveCTurner opened this issue May 4, 2020 · 6 comments · Fixed by #57239
Labels
>bug · :Core/Infra/Logging · Team:Core/Infra

Comments

@DaveCTurner (Contributor)

Investigating https://discuss.elastic.co/t/index-creation-slows-down-over-time/230775, I wrote a script that creates and deletes indices repeatedly and left it running for several hours against a 7.6.2 cluster. Each iteration had indeed slowed down as reported, from around 6s to around 10s.

The hot threads API often reported a substantial amount of time spent doing things with loggers:

   20.3% (101.5ms out of 500ms) cpu usage by thread 'elasticsearch[node-0][clusterApplierService#updateTask][T#1]'
     2/10 snapshots sharing following 25 elements
       java.base@14/java.util.Collections$UnmodifiableCollection$1.hasNext(Collections.java:1048)
       app//org.elasticsearch.common.logging.Loggers.setLevel(Loggers.java:127)
       app//org.elasticsearch.common.logging.Loggers.setLevel(Loggers.java:111)
       app//org.elasticsearch.index.SearchSlowLog.setLevel(SearchSlowLog.java:127)
       app//org.elasticsearch.index.SearchSlowLog.<init>(SearchSlowLog.java:123)
       app//org.elasticsearch.index.IndexModule.<init>(IndexModule.java:151)
       app//org.elasticsearch.indices.IndicesService.createIndexService(IndicesService.java:595)
       app//org.elasticsearch.indices.IndicesService.createIndex(IndicesService.java:542)
       app//org.elasticsearch.indices.IndicesService.createIndex(IndicesService.java:173)
       app//org.elasticsearch.indices.cluster.IndicesClusterStateService.createIndices(IndicesClusterStateService.java:484)
       app//org.elasticsearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:246)
       app//org.elasticsearch.cluster.service.ClusterApplierService.lambda$callClusterStateAppliers$5(ClusterApplierService.java:517)
       app//org.elasticsearch.cluster.service.ClusterApplierService$$Lambda$4041/0x000000080175d440.accept(Unknown Source)
       java.base@14/java.lang.Iterable.forEach(Iterable.java:75)
       app//org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:514)
       app//org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:485)
       app//org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:432)
       app//org.elasticsearch.cluster.service.ClusterApplierService.access$100(ClusterApplierService.java:73)
       app//org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:176)
       app//org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:633)
       app//org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252)
       app//org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215)
       java.base@14/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
       java.base@14/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
       java.base@14/java.lang.Thread.run(Thread.java:832)
     2/10 snapshots sharing following 24 elements
       app//org.elasticsearch.common.logging.Loggers.setLevel(Loggers.java:127)
       app//org.elasticsearch.common.logging.Loggers.setLevel(Loggers.java:111)
       app//org.elasticsearch.index.SearchSlowLog.setLevel(SearchSlowLog.java:128)
       app//org.elasticsearch.index.SearchSlowLog.<init>(SearchSlowLog.java:123)
       app//org.elasticsearch.index.IndexModule.<init>(IndexModule.java:151)
       app//org.elasticsearch.indices.IndicesService.createIndexService(IndicesService.java:595)
       app//org.elasticsearch.indices.IndicesService.createIndex(IndicesService.java:542)
       app//org.elasticsearch.indices.IndicesService.createIndex(IndicesService.java:173)
       app//org.elasticsearch.indices.cluster.IndicesClusterStateService.createIndices(IndicesClusterStateService.java:484)
       app//org.elasticsearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:246)
       app//org.elasticsearch.cluster.service.ClusterApplierService.lambda$callClusterStateAppliers$5(ClusterApplierService.java:517)
       app//org.elasticsearch.cluster.service.ClusterApplierService$$Lambda$4041/0x000000080175d440.accept(Unknown Source)
       java.base@14/java.lang.Iterable.forEach(Iterable.java:75)
       app//org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:514)
       app//org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:485)
       app//org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:432)
       app//org.elasticsearch.cluster.service.ClusterApplierService.access$100(ClusterApplierService.java:73)
       app//org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:176)
       app//org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:633)
       app//org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252)
       app//org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215)
       java.base@14/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
       java.base@14/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
       java.base@14/java.lang.Thread.run(Thread.java:832)
     3/10 snapshots sharing following 14 elements
       app//org.elasticsearch.cluster.service.ClusterApplierService.lambda$callClusterStateAppliers$5(ClusterApplierService.java:517)
       app//org.elasticsearch.cluster.service.ClusterApplierService$$Lambda$4041/0x000000080175d440.accept(Unknown Source)
       java.base@14/java.lang.Iterable.forEach(Iterable.java:75)
       app//org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:514)
       app//org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:485)
       app//org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:432)
       app//org.elasticsearch.cluster.service.ClusterApplierService.access$100(ClusterApplierService.java:73)
       app//org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:176)
       app//org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:633)
       app//org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252)
       app//org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215)
       java.base@14/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
       java.base@14/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
       java.base@14/java.lang.Thread.run(Thread.java:832)
     3/10 snapshots sharing following 10 elements
       java.base@14/jdk.internal.misc.Unsafe.park(Native Method)
       java.base@14/java.util.concurrent.locks.LockSupport.park(LockSupport.java:341)
       java.base@14/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionNode.block(AbstractQueuedSynchronizer.java:505)
       java.base@14/java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3137)
       java.base@14/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1614)
       java.base@14/java.util.concurrent.PriorityBlockingQueue.take(PriorityBlockingQueue.java:548)
       java.base@14/java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1056)
       java.base@14/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1116)
       java.base@14/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
       java.base@14/java.lang.Thread.run(Thread.java:832)

I took a heap dump and observed almost 150,000 loggers with names of the form index.indexing.slowlog.<INDEX_UUID>, index.search.slowlog.fetch.<INDEX_UUID> and index.search.slowlog.query.<INDEX_UUID> (in roughly equal parts). There were only 1280 other instances of org.apache.logging.log4j.core.Logger.

I suspect this is a consequence of #47234 which introduced per-index loggers.

@DaveCTurner added the >bug and :Core/Infra/Logging labels on May 4, 2020
@elasticmachine (Collaborator)

Pinging @elastic/es-core-infra (:Core/Infra/Logging)

@elasticmachine added the Team:Core/Infra label on May 4, 2020
@pgomulka self-assigned this on May 5, 2020
@jasontedor (Member) commented May 5, 2020

The underlying problem here is that Log4j never cleans up loggers: once a logger is created, it is registered in the logger registry forever. This is why we introduced PrefixLogger, so that we can have, for example, per-index loggers that are cleaned up when the index is no longer allocated to a node. However, using per-index loggers in the slow logs, with a separate log level per such logger, means that we cannot use the PrefixLogger mechanism: it requires that Log4j create a logger for each slow log logger, and each of those is maintained in the registry forever. Using per-index loggers in the slow log is therefore fatal. We effectively have two options then:

  • if we want to maintain per-index slow log settings, we have to use a single logger, but implement the logging level decision-making ourselves
  • remove per-index slow log settings

I'm not sure that I see the benefits of per-index slow log settings. If I want to log what is slow, I would want that for all of my indices. Therefore, I advocate that we remove per-index slow log settings.
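
A minimal sketch of what the first option might look like: a single Log4j logger shared by all indices, with the per-index level decision made by our own code. The class, method names, and the per-index bookkeeping below are illustrative assumptions, not the implementation that was eventually merged.

```java
import org.apache.logging.log4j.Level;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch: one Log4j logger shared by all indices, so the Log4j
// registry never grows as indices are created and deleted. The per-index
// level decision is made by this class rather than by per-index loggers.
class SharedSearchSlowLog {

    private static final Logger logger = LogManager.getLogger("index.search.slowlog.query");

    // Hypothetical per-index minimum level, updated from index settings and
    // dropped when the index is removed from the node.
    private final Map<String, Level> minimumLevelPerIndex = new ConcurrentHashMap<>();

    void setLevel(String indexName, Level level) {
        minimumLevelPerIndex.put(indexName, level);
    }

    void onIndexRemoved(String indexName) {
        minimumLevelPerIndex.remove(indexName);
    }

    void onQueryPhase(String indexName, long tookNanos, String source) {
        // messageLevel would normally be derived from the slow log thresholds
        // (warn/info/debug/trace); that logic is elided here.
        Level messageLevel = Level.WARN;
        Level indexMinimum = minimumLevelPerIndex.getOrDefault(indexName, Level.TRACE);
        if (messageLevel.isMoreSpecificThan(indexMinimum)) {
            logger.log(messageLevel, "[{}] took[{}ns], source[{}]", indexName, tookNanos, source);
        }
    }
}
```

The key point is that LogManager.getLogger is called with a fixed name, so deleting an index only requires dropping a map entry rather than unregistering a Log4j logger, which Log4j does not support.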

@pgomulka (Contributor) commented May 5, 2020

There is a LoggerRegistry.WeakMapFactory (https://logging.apache.org/log4j/2.x/log4j-api/apidocs/org/apache/logging/log4j/spi/LoggerRegistry.WeakMapFactory.html) that could possibly help us.
I probably need to educate myself better on this; sadly, the log4j2 docs are not very detailed on the subject.
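
For context, constructing such a registry in isolation is simple; the hard part, as the next comment explains, is that Elasticsearch cannot swap it into Log4j without also replacing the logger context and context factory that own the registry. A hedged sketch, assuming the LoggerRegistry constructor that takes a MapFactory:

```java
import org.apache.logging.log4j.core.Logger;
import org.apache.logging.log4j.spi.LoggerRegistry;

class WeakRegistrySketch {
    // A logger registry backed by the weak map factory referenced above.
    // Building it is easy; wiring it into Log4j's LoggerContext is not,
    // because the context keeps its own private registry.
    static final LoggerRegistry<Logger> REGISTRY =
            new LoggerRegistry<>(new LoggerRegistry.WeakMapFactory<Logger>());
}
```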

@jasontedor (Member)

The weak map factory is designed for web servers like Tomcat, where servlets carry various context used when creating the logger. Using the weak map factory in Elasticsearch means replacing the logger registry, which means replacing the logger context (since it holds a private reference to the logger registry that cannot be controlled), which in turn means replacing the logger context factory. That is a lot of hoops to jump through, and we would end up copying a lot of Log4j behavior. Again, I question the value of per-index slow log settings to begin with. Let's start with what features we want, and then engineer the solution.

@pgomulka (Contributor) commented May 7, 2020

We discussed this during the core-infra sync on 6th May. The conclusion is that we will deprecate changing the log level per index in 7.x and remove it in 8.0. We will still allow changing thresholds per index. To fix the memory leak in 7.x, we will revert the change that creates a new log4j logger per index and will instead do a "manual" check of the index name/package ourselves.

In ES 8 we will no longer allow changing the log level per index. That means index.search.slowlog.level: info will be removed.
If a slow log is required for a specific index or index pattern, it will have to be configured using thresholds.

The option of maintaining the current behaviour was rejected; it would require maintaining too many classes from log4j2.
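
For illustration, this is the kind of per-index threshold configuration meant above. The index name and durations are placeholders; the setting keys are the existing slow log threshold settings:

```
PUT /my-index/_settings
{
  "index.search.slowlog.threshold.query.warn": "10s",
  "index.search.slowlog.threshold.query.info": "2s",
  "index.search.slowlog.threshold.fetch.warn": "1s",
  "index.indexing.slowlog.threshold.index.warn": "10s"
}
```

Operations slower than a given threshold are logged at the corresponding level (warn, info, etc.), so per-index control survives without per-index log levels or per-index Log4j loggers.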

pgomulka added commits to elastic/elasticsearch and pgomulka/elasticsearch that referenced this issue on May 27–28, 2020

Slow loggers should use a single shared logger, as otherwise, when an index is deleted, the log4j logger remains reachable (log4j caches loggers) and creates a memory leak.

closes #56171
@pgomulka (Contributor)

closed by #56708

pgomulka added a commit that referenced this issue Jun 4, 2020

Setting a slow log level requires unnecessary conditional logic in SearchSlowLog and IndexingSlowLog. The behaviour of setting a level on a slow logger can be achieved with the right slow log threshold settings. This PR removes the slow log level setting and modifies the tests to achieve the same behaviour by changing thresholds.

relates #56171