
kafka-python #1744 Attempt 2 at race conditions with IFR #1766

Conversation

@isamaru isamaru commented Mar 27, 2019

Attempt 2 (after #1757) at fixing #1744 with more aggressive (and reentrant) locking.
I expect this to be less performant, especially on Python 2.x, but it's safer.

@dpkp I tried to avoid using RLock but eventually gave up: working around the callback problems (particularly during authorization) without it made the size and complexity of the change quickly get out of hand.
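A minimal sketch (hypothetical class and method names, not the actual kafka-python code) of why a reentrant lock helps here: a send holds the connection lock, and a callback fired during that send may call back into a method that takes the same lock.

```python
import threading

class Conn:
    def __init__(self):
        self._lock = threading.RLock()  # a plain Lock() would deadlock below
        self.sent = []

    def send(self, request, callback):
        with self._lock:
            self.sent.append(request)
            callback(self)  # callback runs while the lock is still held

    def close(self):
        with self._lock:  # re-entered from the callback: OK with RLock
            self.sent.clear()

conn = Conn()
conn.send('metadata', lambda c: c.close())  # completes without deadlock
print(conn.sent)  # []
```

With a non-reentrant `threading.Lock`, the `close()` call inside the callback would block forever on a lock its own thread already holds.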



@dpkp (Owner) left a comment

Some comments inline. I took a stab at resolving some of these issues in #1768 - what do you think?

@@ -617,7 +617,7 @@ def _poll(self, timeout):
conn = key.data
processed.add(conn)

-                if not conn.in_flight_requests:
+                if not conn.has_in_flight_requests():

Do these types of reads require additional locking? Perhaps naive, but my understanding is that a boolean check on a dict is atomic (via a single CALL_FUNCTION opcode):

>>> import dis
>>> n = {'a': 1, 'b': 2}
>>> def foo():
...     bool(n)
...
>>> dis.dis(foo)
  2           0 LOAD_GLOBAL              0 (bool)
              2 LOAD_GLOBAL              1 (n)
              4 CALL_FUNCTION            1
              6 POP_TOP
              8 LOAD_CONST               0 (None)
             10 RETURN_VALUE

@@ -273,7 +273,7 @@ def __init__(self, host, port, afi, **configs):
# per-connection locks to the upstream client, we will use this lock to
# make sure that access to the protocol buffer is synchronized
# when sends happen on multiple threads
-        self._lock = threading.Lock()
+        self._lock = threading.RLock()

I think we can avoid the RLock if we call lock acquire() and release() more strategically, and not rely exclusively on the context manager -- particularly when we want to release the lock before processing an exception
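The pattern suggested here can be sketched roughly as follows (hypothetical names, not the actual kafka-python implementation): use explicit acquire()/release() so the non-reentrant lock is released before callbacks run, rather than holding it across them via the context manager.

```python
import threading

class Conn:
    def __init__(self):
        self._lock = threading.Lock()  # non-reentrant is fine with this pattern
        self.ifrs = {1: (lambda: 'response-1')}  # in-flight request callbacks

    def fail_ifrs(self):
        # Drain shared state under the lock...
        self._lock.acquire()
        try:
            pending = list(self.ifrs.values())
            self.ifrs.clear()
        finally:
            self._lock.release()  # ...then release BEFORE invoking callbacks
        # Callbacks may now safely re-acquire self._lock without deadlocking.
        return [cb() for cb in pending]

conn = Conn()
print(conn.fail_ifrs())  # ['response-1']
```

The key point is that the callbacks see a consistent snapshot (`pending`) taken under the lock, but execute outside it.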

if selector is not None:
selector.close()
selector = None
with self._lock:

I am fine skipping the locking here for now -- we only use this method in check_version(), which is only called during initialization, and I think we can just make sure that always has the client lock.


def connect(self):
with self._lock:

I think this makes sense long term, but for now I think we can rely on connect being synchronized via the KafkaClient lock (in _maybe_connect)

if self.state is ConnectionStates.DISCONNECTED:
if error is not None:
log.warning('%s: Duplicate close() with error: %s', self, error)
self._fail_ifrs(error)

What is your thinking on this one? If we take care to always drain here and never add more ifrs while the state is disconnected, can't we assume that ifrs will also always be empty here?

return
log.info('%s: Closing connection. %s', self, error or '')
self.state = ConnectionStates.DISCONNECTING
self.config['state_change_callback'](self)

I think there may be some deadlock issues wrt this callback because it requires the client lock:

1. thread A acquires conn._lock
2. thread B acquires client._lock
3. thread A calls conn_state_change, blocks waiting for client._lock
4. thread B calls conn.close() or conn.send() etc. and blocks waiting for conn._lock

=> deadlock
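The standard fix for this interleaving is a consistent lock order. A minimal sketch (hypothetical lock and function names, not the actual kafka-python code) where both paths take the locks in the same order, so the cycle above cannot form:

```python
import threading

client_lock = threading.Lock()
conn_lock = threading.Lock()
log = []

def state_change():
    # Always client_lock -> conn_lock
    with client_lock:
        with conn_lock:
            log.append('state_change')

def close():
    # Same order as state_change, so the two threads cannot deadlock
    with client_lock:
        with conn_lock:
            log.append('close')

a = threading.Thread(target=state_change)
b = threading.Thread(target=close)
a.start(); b.start()
a.join(); b.join()
print(sorted(log))  # ['close', 'state_change']
```

The deadlock in the comment arises precisely because the two threads acquire the same pair of locks in opposite orders.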

self._protocol = KafkaProtocol(
client_id=self.config['client_id'],
api_version=self.config['api_version'])
self._fail_ifrs(error)

I think we want to release the conn lock before processing callbacks here

dpkp commented Apr 2, 2019

I merged #1768, which uses a non-reentrant lock. I'm planning to merge #1775 as well, and I think that should cover all of the concurrency issues I found on review. If you're able to test these changes in your setup, I would love that. We're planning to push out a patch release quickly.

Thanks again for all your excellent work on this issue. Fantastic bug report, debugging, and PRs. We really appreciate it, and I hope 1.4.6 will not fail you!
