Performance improvement in Category callAppenders() #51

Open
lauredogit opened this issue Jul 4, 2022 · 4 comments

lauredogit commented Jul 4, 2022

Hi,

We just hit a bottleneck in production with 2000+ threads blocked waiting on the lock in the synchronized block of callAppenders() in Category.

This had never happened to us before, including in stress tests.

However, I think a small performance improvement could be made with a simple open call (if I'm reading the code correctly).

    public void callAppenders(LoggingEvent event) {
        int writes = 0;

        for (Category c = this; c != null; c = c.parent) {
            // Protected against simultaneous calls to addAppender, removeAppender, ...
            // Open call: read the fields under the lock, then release it.
            AppenderAttachableImpl aai;
            boolean additive;
            synchronized (c) {
                aai = c.aai;
                additive = c.additive;
            }
            // Open call: append without holding the lock on the Category.
            if (aai != null) {
                writes += aai.appendLoopOnAppenders(event);
            }
            if (!additive) {
                break;
            }
        }

        // Unchanged tail of the existing method: warn if no appenders were reached.
        if (writes == 0) {
            repository.emitNoAppenderWarning(this);
        }
    }

Since aai is final and is a thread-safe data structure, we only need correct visibility of the aai and additive fields; we don't need to hold the lock on the whole Category object while calling appendLoopOnAppenders().
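
For comparison, this is roughly what the existing implementation does (paraphrased from the 1.2.x source, so treat it as a sketch rather than an exact quote): the monitor on each Category is held for the entire append loop, which is what the proposal above avoids.

    public void callAppenders(LoggingEvent event) {
        int writes = 0;

        for (Category c = this; c != null; c = c.parent) {
            // Protected against simultaneous call to addAppender, removeAppender, ...
            synchronized (c) {
                // appendLoopOnAppenders() runs while holding the lock on c,
                // so a slow appender blocks every other thread logging through c.
                if (c.aai != null) {
                    writes += c.aai.appendLoopOnAppenders(event);
                }
                if (!c.additive) {
                    break;
                }
            }
        }

        if (writes == 0) {
            repository.emitNoAppenderWarning(this);
        }
    }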

What do you think?

Best regards,
Dominique


ceki commented Jul 4, 2022

@lauredogit Which version of reload4j are you using?


lauredogit commented Jul 4, 2022

Production was using 1.2.20.

lauredogit commented

It appears the threads were piling up due to slowness induced by a remote diagnostic agent (RDA) running at the time.

It is strange that it materialized as a bottleneck in the synchronized block of Category#callAppenders().

Nonetheless, it might still be a performance improvement to invoke appendLoopOnAppenders(...) via an open call instead of calling it while holding the lock.


isoflavone-r commented Oct 14, 2024

TL;DR [as of "update 2"]: reload4j in Apache Spark 3.0.x definitely appears to behave differently from old log4j. Reducing the scope of synchronization didn't fix it.

  • Explicitly initializing logging with an /opt/spark/log4j.properties built from the template in that same dir, and setting the root log level to WARN (even though this was the logging level already), removed the contention.

Hello. FWIW, I'm also running into a thread pileup on this lock.

I'm using reload4j 1.2.25 retrofitted into an Apache Spark 3.0.1 application where Spark is used from PySpark (which may be relevant because the JVM's output is sent back to a Python process for printing).

Symptoms are that, on a GCP VM with 56 cores, the application has long periods (maybe 90% of wall-clock runtime in aggregate) of using only 6 cores, compared to the same tasks on stock Spark 3.0.1, which uses 50+ cores. The overall impact is that elapsed time for the task is ~8 hours vs. 1.5 hours.

JDK 8 (but the reload4j scenario was also tested with 11). I'm not aware of any JVM debug agents attached or running. (There's maybe some excess info below for search engines / anyone else who might walk this way.)

Based on jstack output during this period, there are many threads (25+) waiting at org.apache.log4j.Category.callAppenders(Category.java:192).

The stacks look something like this:

"Executor task launch worker for task 2565" #4463 daemon prio=5 os_prio=0 cpu=76064.28ms elapsed=322.80s tid=0x00007ef054009800 nid=0x8d664 waiting for monitor entry  [0x00007eef32  5bd000]
     java.lang.Thread.State: BLOCKED (on object monitor)
          at org.apache.log4j.Category.callAppenders(Category.java:192)
          - waiting to lock <0x00007ef35c5677f0> (a org.apache.log4j.spi.RootLogger)
          at org.apache.log4j.Category.forcedLog(Category.java:364)
          at org.apache.log4j.Category.log(Category.java:771)
          at org.slf4j.impl.Reload4jLoggerAdapter.debug(Reload4jLoggerAdapter.java:252)
          at org.apache.parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.log(MessageColumnIO.java:283)

In my app's case, this is parquet writing, as part of:

          at org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:128)
          at org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:182)
          at org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:44)

And in our setup I note that we use just the Spark deployment, so it says the following at startup:

24/10/12 02:45:44 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

[Update: no change in behavior after loading the native library.
I added entries to conf/spark-defaults.conf:

spark.driver.extraLibraryPath /opt/hadoop-3.2.0/lib/native
spark.executor.extraLibraryPath /opt/hadoop-3.2.0/lib/native

The warning disappeared, and I verified with lsof -c java | grep native that the java process had an open handle on /opt/hadoop-3.2.0/lib/native/libhadoop.so.1.0.0. But there was no change in the contention shown by jstack.]

[Update 2:

  1. I compiled a jar with the reduced synchronization scope proposed in this thread and it had no impact. Threads still piled up on that lock.
  2. Even removing the synchronization altogether (without regard to correctness concerns) increased CPU usage from 6 cores to 13, still well short of the 40-55 seen with the old log4j. Investigating with jstack, there were no locks per se, but sometimes tens of threads were in getEffectiveLevel:

          at org.apache.log4j.Category.getEffectiveLevel(Category.java:417)
          at org.apache.log4j.Category.isDebugEnabled(Category.java:672)
          at org.slf4j.impl.Reload4jLoggerAdapter.debug(Reload4jLoggerAdapter.java:250)
          at org.apache.parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.log(MessageColumnIO.java:283)

One or two threads were in isInfoEnabled() and several just in callAppenders().

  3. From code inspection it seemed that Category.getEffectiveLevel() should somehow be short-circuited, but for some reason wasn't (see the sketch after this update). On a whim I copied /opt/spark/conf/log4j.properties.template to /opt/spark/conf/log4j.properties, which among other things explicitly sets some loggers to ERROR or WARN, including log4j.logger.org.apache.parquet=ERROR (I also set the root to WARN). In terms of output this changed nothing, since I was only ever getting WARN+ messages anyway, just as with Spark's default, but now CPU usage is up to 30-55 cores and jstack never catches ANY threads in any log4j packages. I put the release reload4j-1.2.25 jar back in place and it still runs fine.

CPU usage is more variable and more time is spent in iowait; the impression is that the task is doing less overall work. Elapsed time is now as fast as or faster than stock log4j (though those earlier tests didn't have this init file). The area under the CPU usage graph looks similar, so my impression may be invalid. This is probably where I'm going to leave things.
]
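
For context on item 3: this is roughly what Category.getEffectiveLevel() does in log4j 1.2 / reload4j (paraphrased, so treat it as a sketch rather than an exact quote of the source). It walks up the parent chain until it finds a logger with an explicitly assigned level, so with only the root level configured, every isDebugEnabled()/isInfoEnabled() call walks the full hierarchy to the root logger. Giving org.apache.parquet its own level stops that walk much earlier (and, with the level at ERROR, debug/info calls from that package are rejected before any appender work happens).

    // Rough sketch of Category.getEffectiveLevel() (paraphrased from the 1.2.x source):
    public Level getEffectiveLevel() {
        for (Category c = this; c != null; c = c.parent) {
            if (c.level != null) {
                return c.level; // stops at the first logger with an explicit level
            }
        }
        return null; // in practice unreachable: the root logger always has a level
    }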
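
And for anyone else who lands here, the config change described in item 3 looks roughly like this in /opt/spark/conf/log4j.properties (an illustrative sketch based on the description above and the Spark template, not a verbatim copy):

    # Root logger at WARN, as described above
    log4j.rootCategory=WARN, console
    log4j.appender.console=org.apache.log4j.ConsoleAppender
    log4j.appender.console.target=System.err
    log4j.appender.console.layout=org.apache.log4j.PatternLayout
    log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

    # Explicit per-package level, so getEffectiveLevel() stops here
    log4j.logger.org.apache.parquet=ERROR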
