Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NullPointerException caused by async connect/disconnect processing #553

Closed
andrew-tpz opened this issue Nov 19, 2020 · 7 comments · Fixed by #557 or #561
Closed

NullPointerException caused by async connect/disconnect processing #553

andrew-tpz opened this issue Nov 19, 2020 · 7 comments · Fixed by #557 or #561
Labels

Comments

@andrew-tpz
Copy link

andrew-tpz commented Nov 19, 2020

Expected behavior

Session is not removed by handleConnectionLost if new session is already reopened

Actual behavior

  1. handleConnectionLost starts
  2. processConnect starts and reopen session
  3. handleConnectionLost continues and removes session
  4. PostOffice.flushInFlight fails to use session

Steps to reproduce

  1. Run broker
  2. Use clean session=true in mqtt-client
  3. Start and stop mqtt-client several times. Try to do pause between stop and start as short as possible.
    P.S. It seems like one more mqtt-client connected permanently gives better chance to reproduce.

Minimal yet complete reproducer code (or URL to code) or complete log file

I added extra logging to catch the problem

14:59:39.254 [nioEventLoopGroup-3-6] io.moquette.broker.SessionRegistry - retrieve session OK by handleConnectionLost

14:59:39.254 [nioEventLoopGroup-3-7] io.moquette.broker.MQTTConnection  - processConnect started
14:59:39.254 [nioEventLoopGroup-3-7] io.moquette.broker.SessionRegistry - NEW SESSION
14:59:39.255 [nioEventLoopGroup-3-7] io.moquette.broker.SessionRegistry - REOPEN SESSION

14:59:39.255 [nioEventLoopGroup-3-6] io.moquette.broker.SessionRegistry - remove session by handleConnectionLost

14:59:39.255 [nioEventLoopGroup-3-7] io.moquette.broker.MQTTConnection  - processConnect ACK started
14:59:39.257 [nioEventLoopGroup-3-7] io.moquette.broker.MQTTConnection  - processConnect ACK end with true
14:59:39.257 [nioEventLoopGroup-3-7] io.moquette.broker.SessionRegistry - retrieve session NULL by flushInFlight

14:59:39.257 [nioEventLoopGroup-3-7] io.moquette.broker.NewNettyMQTTHandler  - Unexpected exception while processing MQTT message. Closing Netty channel. CId=
 java.lang.NullPointerException: Attempt to invoke virtual method 'void io.moquette.broker.Session.flushAllQueuedMessages()' on a null object reference
     at io.moquette.broker.PostOffice.flushInFlight(PostOffice.java:319)
     at io.moquette.broker.MQTTConnection.readCompleted(MQTTConnection.java:536)
     at io.moquette.broker.NewNettyMQTTHandler.channelReadComplete(NewNettyMQTTHandler.java:76)

Moquette MQTT version

0.12.1, master 4665c5e

JVM version (e.g. java -version)

1.8

@andsel andsel added the bug label Dec 5, 2020
@andsel
Copy link
Collaborator

andsel commented Dec 5, 2020

@andrew-tpz I've some questions, to recreate a reproducible test.

  • Does you client connect and then exit without an close ?
  • Do you have multiple client instances that uses the same clientId?

@andsel
Copy link
Collaborator

andsel commented Dec 5, 2020

I think that, supposing 2 instances of the same client with same client Id and clean session true, trying to connect to the broker generate the crash as:

already connected client new connecting client
t1 is CONNECTED
t2 process connect and find existing client, so close it
t3 the connection is notified of closing,
t4 replace the old session with new one
published = pool.replace(clientId, oldSession, newSession);
t5 removes the session from registry
t6 complete the connect flushing the CONNACK
t7 receive the readComplete notification which retrieve the session without checking it's valid and crash with NPE

The problem is not the missed null check, but the interference of the already connected client that doesn't check the status of Session in registry, or if another client is in his connection sequence

@andrew-tpz
Copy link
Author

* Does you client connect and then exit without an close ?

Client connects on app start and I close it immediately by Alt+F4 (on Windows). So, time gap between start and stop is fractions of second. I can't say if it have time to close gracefully.

* Do you have multiple client instances that uses the same clientId?

No, but I start and close one client many times so quickly, that sometimes broker got "connected" event while it is processing previous "disconnected" event. It's kind of stress test, to catch the problem.

@andrew-tpz
Copy link
Author

The error now appears very rarely. However, I've got it once, but can't reproduce.

 E/i.m.b.NewNettyMQTTHandler: [nioEventLoopGroup-5-5] Error processing protocol message: CONNECTjava.lang.NullPointerException: Attempt to invoke virtual method 'boolean io.moquette.broker.Session.disconnected()' on a null object reference
        at io.moquette.broker.SessionRegistry.reopenExistingSession(SessionRegistry.java:119)
        at io.moquette.broker.SessionRegistry.createOrReopenSession(SessionRegistry.java:109)
        at io.moquette.broker.MQTTConnection.processConnect(MQTTConnection.java:168)
        at io.moquette.broker.MQTTConnection.handleMessage(MQTTConnection.java:69)
        at io.moquette.broker.NewNettyMQTTHandler.channelRead(NewNettyMQTTHandler.java:58)

@andsel andsel reopened this Dec 23, 2020
@andsel
Copy link
Collaborator

andsel commented Dec 26, 2020

I think this problem is due to a check

separed by from the retrieval of what's checked
final Session oldSession = pool.get(clientId);
, and the executing thread is interrupted in the middle, while the other client is disconnecting and removing the session in the middle of the two statements

@andsel
Copy link
Collaborator

andsel commented Dec 26, 2020

Hi @andrew-tpz could you try the PR #561 and let me know if it fixes this, I think I caught the problem

@andrew-tpz
Copy link
Author

Hi @andsel, everything seems to be ok! Thanks!

andsel added a commit that referenced this issue Dec 28, 2020
…nnect of another with same clientId (#561)

This PR fixes #553 (#553 (comment)) when a thread is processing the loose of client connection and start the clean up of session from the registry, another thread is processing the CONNECT from the same client (with same clientId), this second thread has to retrieve the connection from the registry in only one step, avoiding to do a check-and-get but doing only a get, then checking for presence or not
jbutler pushed a commit to aws-greengrass/aws-greengrass-moquette-mqtt that referenced this issue Jun 10, 2021
…nnect of another with same clientId (#561)

This PR fixes #553 (moquette-io/moquette#553 (comment)) when a thread is processing the loose of client connection and start the clean up of session from the registry, another thread is processing the CONNECT from the same client (with same clientId), this second thread has to retrieve the connection from the registry in only one step, avoiding to do a check-and-get but doing only a get, then checking for presence or not
jbutler added a commit to aws-greengrass/aws-greengrass-moquette-mqtt that referenced this issue Jun 10, 2021
* Moquette 0.12.1 versions release

* Updated documentation to version 0.12.1

* Flush written data in MQTTConnection#sendIfWritableElseDrop (#454)

* Flush written data in MQTTConnection#sendIfWritableElseDrop

* Add a new configuration `immediateBufferFlush`

* Add BrokerConfigurationTest

* Fix NPE in SessionTest

* Fixes #456, update the H2 version

* Bumped dev cycle 0.13

* Minor, just added donate button to README.md

* Extended websocket configuration in order to configure websocket path and max frame size; default values are still the same (#461)

* Update Freedomotic references (#475)

* remove unnecessary quotes (#478)

removed unnecessary quotes on line 34 and  line 86; these quotes make an execution in Windows 10 impossible;

* fix retained messages with wildcard subscriptions (#489)

Thanks @davesters good catch of the errors.

 Andrea

* After fix #488 used also in H2RetainedRepository

* Updated travis ci config to openjdk12

* Avoid slow responses due to missed queue drain after a client sent out all its messages

* Bumped release 0.13 (#523)

* Minor, changed header (#536)

* Reduce log level for log messages caused by client side input mistakes (#537)

Co-authored-by: Pieter Hameete <pieter.hameete@blockbax.com>

* Re-generated keystores used in SSL tests (#539)

* Added specific handling for BindExcpetion when the port is occupied (#540)

* Switched to gradle 5.6 (#541)

* Reintroduce maven build, with wrapper, removed gradle. (#542)

* Removed gradle script files

* Updates Netty to 4.1.53 to have MQTT5 support (#546)

* Rolbacked Paho to 1.2.0 since starting from 1.2.1 TLS tests fails with: Caused by: javax.net.ssl.SSLHandshakeException: No name matching localhost found

* Bump junit from 4.8.2 to 4.13.1 (#544)

Bumps [junit](https://github.com/junit-team/junit4) from 4.8.2 to 4.13.1.
- [Release notes](https://github.com/junit-team/junit4/releases)
- [Changelog](https://github.com/junit-team/junit4/blob/main/doc/ReleaseNotes4.13.1.md)
- [Commits](junit-team/junit4@r4.8.2...r4.13.1)

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Moved all connect process flow into connection class, to avoid spread the connection logic in subsystems like SessionRegistry. (#548)

* Switched temp files in unit test to temp folder managed (#549)

* Updated paho to 1.2.5 to fix CVE-2019-11777 (fixed in Pago 1.2.1) (#550)

* Switched from connAck build method to message builder offered by netty

* Create maven_build.yml

Switch CI to GitHub actions

* GitHub actions (#552)

Added github action CI on every push:
- matrix for ubuntu as OS, JDK 11, 12, 15, arch x84_64
- Removed Java 8 from CI
- Removed travis CI

Co-authored-by: andsel <selva.andre@gmail.com>

Fix: #552

* Introduced test to verify that connectionLost remove the new client session (#557)

This PR add the link MQTTConnection -> Session so now the two entities are doubly linked. This permit the removal of session from session registry iff the session it's removing is the only one active in the session registry. This also avoid the continuous access to SessionRegistry from the MQTTConnection to retrieve the session linked to the MQTTConnection.

PR #557

* Fix bug in race condition between the shutdown of a client and the connect of another with same clientId (#561)

This PR fixes #553 (moquette-io/moquette#553 (comment)) when a thread is processing the loose of client connection and start the clean up of session from the registry, another thread is processing the CONNECT from the same client (with same clientId), this second thread has to retrieve the connection from the registry in only one step, avoiding to do a check-and-get but doing only a get, then checking for presence or not

* Update maven_build.yml

Fixed typo

* Upgraded JUnit dependency from 4.13.1 to 5.7.0

* Minor, removed link from main page

* Fix, recreate sessions when server restarts (#563)

When server restars the session must be recreated in the SessionRegistry pool, else the session is not found in lookup operation and the following published messages are not queued to the session.

* Test changes, switched some waiting stuff to Awaitatility library

* Minor, just fixed typos

* Log, update slf4j dep and added log4j 1.2 to distribution

* Log, added fishtagging to loglines, avoid to explicitly log clientId and channel, this job is left to logging framework leveraging log MDC (mapped diagnostic context)

* Fixed NPE due to TNode processing before CNode (#564)

This PR added accessors methods to topic, in this way the TNode could throw exception if it's called. This let us fix #505 and #529, implementing the suggestion in #505 to move forward the TNode check.

* Added GH workflow to publish artifacts to BinTray

* Bumped version 0.14

* fix: remove gradle directory

* Enable ServerLowLevel tests not running and add PUB QoS1 test (#580)

- Add unit test to check a second publish is sent if the not PUBACK is received in `FLIGHT_BEFORE_RESEND_MS` interval.
- Fixed ServerLowLevel tests not running.

* Added a test for #573

* Fixed #573 ByteBuf reference counting (#600)

Moved buffer release/retain calls to the outside interfaces of moquette.

* fix: migrate junit4 Ignore to junit5 Disabled

* chore: reduce PUBACK log to debug

* fix: update github workflows

* chore: update moquette from 0.12.1 to 0.14

Co-authored-by: andsel <selva.andre@gmail.com>
Co-authored-by: Mitsunori Komatsu <komamitsu@gmail.com>
Co-authored-by: aemaem <mmair@live.at>
Co-authored-by: Mauro Cicolella <mcicolella@libero.it>
Co-authored-by: Joel Strasser <strasser999@gmail.com>
Co-authored-by: David Corona <davesters81@gmail.com>
Co-authored-by: Pieter Hameete <phameete@gmail.com>
Co-authored-by: Pieter Hameete <pieter.hameete@blockbax.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Gerrit Grunwald <HanSolo@users.noreply.github.com>
Co-authored-by: Hylke van der Schaaf <hylke.vds@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment