KAFKA-10579: Make Reflections thread safe to resolve flaky NPE scanning failure #14020
Summary
The Reflections library has a race condition that causes it to sometimes throw an NPE. This NPE crashes the connect worker on startup, both in live environments and in tests, and causes those tests to be flaky. This PR uses reflection to patch the library and eliminate the race condition. This is done instead of upstreaming the patch or forking the library because the library itself is unmaintained and should be phased out.
Alternatively, we can consider upgrading the library to a version which has patched this bug: #14029
Background
The Reflections library uses a data structure, Store, to hold the results of scanning for later querying. The scanner writes to the store during Reflections#scan() via the SubTypesScanner. The store is later queried by Reflections#getSubTypesOf.
Due to the slow speed of reflectively discovering all classes on the classpath and plugin.path, the Reflections library is used with a parallel executor, increasing the scanning speed. Unfortunately the parallel mode of the library has some bugs, one of which has already been patched via the InternalReflections subclass.
The parallel mode causes the Store to receive concurrent writes. The javadoc for the class does not say whether it is thread-safe, but given the use of ConcurrentHashMap and the support for parallel scanning in Reflections, the class seems intended to be thread-safe.
Symptoms
The failure appears as the following stack trace.
The stack trace refers to the old location of this code, but the failure persists in its new location. The line numbers inside the Reflections library are still accurate, since we haven't upgraded the version of this library in several years.
Diagnosis
From the stack trace, the NPE is caused by the argument of ConcurrentHashMap#get(String) being null. Tracing backwards, this value is ultimately read from Store#storeMap, meaning it contains a null in the innermost Collection. Also, it contains this null after all of the concurrent scanning has finished (see Reflections#scan, where it waits for all of the submitted futures), so there are no writes racing with the reads. The Store#storeMap contains a null at the end of scanning.
Analyzing the data flow into the Store#storeMap, we can see that there is only one method which writes to it: Store#put(String, String, String), where the last argument is added into the innermost collection. This method is called in various places:
- Reflections#expandSuperTypes: always non-null, and access is single-threaded
- Store#merge (not used): null iff already null in the other Store, and access is single-threaded
- XmlSerializer#read (not used): null iff null in the XML file, and access is single-threaded
- AbstractScanner#put: always non-null, but the method is called concurrently
Because the argument appears to be non-null on all active code paths, it doesn't appear that the null could be coming from a caller. Because there are no null values upon entering the method, and there are when exiting, I believe the method itself must be introducing a null. The only interesting property of this method is that it is called concurrently, so there could be a concurrency bug.
Following the hypothesis that this is due to concurrency, I looked for potentially non-thread-safe stuff in this method implementation:
The innermost Collection turns out to be a non-thread-safe ArrayList instance. From the ArrayList javadoc:
With this specific usage pattern, ArrayList#add() is not synchronized, and structurally modifies the list. Once the ConcurrentHashMap#computeIfAbsent calls complete, there is no synchronization between the different threads operating on the returned Collection instance, and the racing writes can cause unexpected behavior. I found references online which indicate that one symptom of concurrent use of an ArrayList is the appearance of nulls when none were explicitly inserted, the effect we were seeing in the Store#storeMap contents.
Reflective fix
The Store class is not final, so we can subclass it to override behaviors, and then inject the custom store into the existing InternalReflections class used for the existing fix. Unfortunately the Store#storeMap instance variable is private, meaning that we can't simply override the put method. I elected to use reflection to make the field visible and override the single put method, because patching the behavior without reflection required overriding every read method, essentially copying the whole class. If the reflective accesses fail, the InternalStore falls back to the Store behavior, re-introducing the race condition.
Rather than replacing the ArrayList, I chose to use Collections.synchronizedList to make it synchronized. I plan on running some manual performance tests to see how this impacts the scanning time, and we can evaluate changing the collection to something more performant.
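The approach described above can be illustrated with a minimal, self-contained sketch. It is not the code in this PR and does not use the Reflections library: DemoStore, SafeDemoStore and InternalStoreSketch are hypothetical stand-ins, and DemoStore only assumes the shape described in the diagnosis (a private nested map whose innermost collection is a plain ArrayList). The subclass reads the private field reflectively, overrides only put, and falls back to the inherited behavior if the reflective access fails.

```java
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the library's Store (an assumption, not the real class):
// a private nested map whose innermost collection is an unsynchronized ArrayList.
class DemoStore {
    private final ConcurrentHashMap<String, Map<String, Collection<String>>> storeMap =
            new ConcurrentHashMap<>();

    public boolean put(String index, String key, String value) {
        return storeMap.computeIfAbsent(index, i -> new ConcurrentHashMap<>())
                .computeIfAbsent(key, k -> new ArrayList<>())
                .add(value); // unsynchronized structural modification: the race
    }

    public Collection<String> get(String index, String key) {
        Map<String, Collection<String>> inner = storeMap.get(index);
        Collection<String> values = inner == null ? null : inner.get(key);
        return values == null ? Collections.emptyList() : values;
    }
}

// Hypothetical patched subclass: only put() is overridden, the read path is inherited.
class SafeDemoStore extends DemoStore {
    // Reference to the parent's private map, or null if reflection failed.
    private final ConcurrentHashMap<String, Map<String, Collection<String>>> sharedMap;

    SafeDemoStore() {
        this.sharedMap = lookupStoreMap();
    }

    @SuppressWarnings("unchecked")
    private ConcurrentHashMap<String, Map<String, Collection<String>>> lookupStoreMap() {
        try {
            Field field = DemoStore.class.getDeclaredField("storeMap");
            field.setAccessible(true);
            return (ConcurrentHashMap<String, Map<String, Collection<String>>>) field.get(this);
        } catch (ReflectiveOperationException | RuntimeException e) {
            return null; // fall back to the inherited (racy) behavior
        }
    }

    @Override
    public boolean put(String index, String key, String value) {
        if (sharedMap == null) {
            return super.put(index, key, value);
        }
        // Same map structure, but the innermost list is wrapped so add() is synchronized.
        return sharedMap.computeIfAbsent(index, i -> new ConcurrentHashMap<>())
                .computeIfAbsent(key, k -> Collections.synchronizedList(new ArrayList<>()))
                .add(value);
    }
}

class InternalStoreSketch {
    // Writes threads * perThread values under one key from several threads,
    // waits for all writers to finish, then returns the stored collection.
    static Collection<String> runConcurrentPuts(DemoStore store, int threads, int perThread)
            throws InterruptedException {
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            final int id = t;
            executor.submit(() -> {
                for (int i = 0; i < perThread; i++) {
                    store.put("SubTypes", "java.lang.Object", "Type" + id + "_" + i);
                }
            });
        }
        executor.shutdown();
        if (!executor.awaitTermination(1, TimeUnit.MINUTES)) {
            throw new IllegalStateException("writers did not finish in time");
        }
        return store.get("SubTypes", "java.lang.Object");
    }

    public static void main(String[] args) throws InterruptedException {
        Collection<String> values = runConcurrentPuts(new SafeDemoStore(), 8, 10_000);
        System.out.println("size=" + values.size() + " containsNull=" + values.contains(null));
    }
}
```

With SafeDemoStore every value is retained and no null appears. Passing a plain DemoStore to the same driver may, depending on timing, lose elements or leave nulls in the list, which is the symptom described in the diagnosis. Wrapping the list keeps the change local to the write path; a different concurrent collection could be swapped in at the same place if the performance tests call for it.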
Backport
This change is targeted at the ReflectionScanner on trunk, but I can easily re-write it to target the DelegatingLoader on <= 3.5. The bug has been present ever since the scanning was made parallel in #4561 so there are a lot of affected branches that could benefit from this patch, but the severity may not warrant any backporting.