[BUG] coreMQTT doesn't reconnect and keep alive handling fails #91

@NightSkySK

Describe the bug

At first glance, the title may suggest that this is the same issue as #47 and #48. However, although the symptoms are the same, the solutions from those issues haven't solved my problem. Let me explain step by step what happened, along with my observations and logs. (It will be quite a long thread, I'm sorry...)

The story begins when I found that my code was suffering from an mbedtls memory leak.
In the logs I found esp-tls-mbedtls: mbedtls_ssl_setup returned -0x7F00, which was a clear indication of an mbedtls memory leak:

2024-07-22 09:01:59	
I (38176357) coreMQTT: Publishing message to /gates/GA97DF0004.

2024-07-22 09:02:00	
E (38177577) coreMQTT: sendMessageVector: Unable to send packet: Network Error.
E (38177577) coreMQTT: MQTT PUBLISH failed with status MQTTSendFailed.
E (38177577) coreMQTT: MQTT operation failed with status MQTTSendFailed

W (38177577) gate_control: Error or timed out waiting for ack for publish message 12420. Re-attempting publish.
I (38177587) report: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
W (38177587) core_mqtt_agent_manager: Stack size uxMqttAgentTask: 1664

I (38177607) gate_control: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
I (38177597) core_mqtt_agent_manager: TLS connection was disconnected.
I (38177627) supervisor: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
I (38177597) gate_control: Task "GateReport" sending publish request to coreMQTT-Agent with message "{"openReceived": 0,"gateSensors":{ "IN1": 0, "IN2": 0},"keyboard":{ "new_key": 0, "key": ""}, "iteration": 4135}" on topic "/gates/GA97DF0004" with ID 12420.
I (38177647) ota_over_mqtt: coreMQTT-Agent disconnected. Suspending OTA agent.
I (38177667) gate_control: Task "GateReport" waiting for publish 12420 to complete.
I (38177677) core_mqtt_agent_manager: coreMQTT-Agent disconnected.

2024-07-22 09:02:00	
E (38177947) esp-tls-mbedtls: mbedtls_ssl_setup returned -0x7F00
E (38177947) esp-tls: create_ssl_handle failed
E (38177947) esp-tls: Failed to open new connection

2024-07-22 09:02:01	
I (38178267) core_mqtt_agent_manager: Retry attempt 1.

2024-07-22 09:02:01	
E (38178487) esp-tls-mbedtls: mbedtls_ssl_setup returned -0x7F00
E (38178487) esp-tls: create_ssl_handle failed
E (38178487) esp-tls: Failed to open new connection

2024-07-22 09:02:01	
W (38178577) AWS_OTA: OTA Timer handle NULL for Timerid=1, can't stop.
I (38178577) AWS_OTA: OTA Agent is suspended.
I (38178577) AWS_OTA: Current State=[Suspended], Event=[Suspend], New state=[Suspended]

2024-07-22 09:02:01	
I (38178947) core_mqtt_agent_manager: Retry attempt 2.

2024-07-22 09:02:01	
E (38179077) esp-tls-mbedtls: mbedtls_ssl_setup returned -0x7F00
E (38179077) esp-tls: create_ssl_handle failed
E (38179077) esp-tls: Failed to open new connection

2024-07-22 09:02:03	
I (38180557) core_mqtt_agent_manager: Retry attempt 3.
E (38180707) esp-tls-mbedtls: mbedtls_ssl_setup returned -0x7F00
E (38180717) esp-tls: create_ssl_handle failed
E (38180717) esp-tls: Failed to open new connection
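
For readers decoding the log above: -0x7F00 is the value mbedtls defines as MBEDTLS_ERR_SSL_ALLOC_FAILED, meaning an allocation failed during SSL setup, which is consistent with the heap being exhausted (for example by a leak). Below is a minimal sketch of a helper that flags this case. The names `is_ssl_alloc_failure`, `describe_ssl_setup_error` and `SKETCH_ERR_SSL_ALLOC_FAILED` are hypothetical and not part of esp-tls or mbedtls; the constant is redefined locally only so the sketch compiles without the mbedtls headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Same value as MBEDTLS_ERR_SSL_ALLOC_FAILED in mbedtls/ssl.h; redefined
 * here (assumption: no mbedtls headers available) to keep the sketch
 * self-contained. */
#define SKETCH_ERR_SSL_ALLOC_FAILED (-0x7F00)

/* Hypothetical helper: true when the code is the allocation-failure error. */
static bool is_ssl_alloc_failure(int ret)
{
    return ret == SKETCH_ERR_SSL_ALLOC_FAILED;
}

/* Hypothetical helper: format a return code in the "-0x7F00" style used by
 * the esp-tls log line and append a short hint.
 * Returns the snprintf result. */
static int describe_ssl_setup_error(int ret, char *buf, size_t len)
{
    assert(buf != NULL && len > 0);

    if (ret >= 0) {
        return snprintf(buf, len, "0x%04X: no error", (unsigned)ret);
    }

    const char *hint = is_ssl_alloc_failure(ret)
                           ? "allocation failed, check free heap"
                           : "other mbedtls error";
    return snprintf(buf, len, "-0x%04X: %s", (unsigned)(-ret), hint);
}
```

On the device, a check like this could be paired with logging the free heap right before each reconnect attempt, to see whether it shrinks from one retry to the next.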

In the same software version I noticed that MQTTKeepAliveTimeout appears from time to time, but the software can handle the reconnection without issue:

2024-07-21 16:33:28	
I (18501007) gate_control: Task "GateReport" sending subscribe request to coreMQTT-Agent for topic filter: /gates/GA97DF0005 with id 6002

2024-07-21 16:33:36	
E (18508867) coreMQTT: Handling of keep alive failed. Status=MQTTKeepAliveTimeout
E (18508867) coreMQTT: Call to receiveSingleIteration failed. Status=MQTTKeepAliveTimeout
E (18508867) coreMQTT: MQTT operation failed with status MQTTKeepAliveTimeout
I (18508877) report: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
I (18508887) gate_control: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
I (18508877) core_mqtt_agent_manager: TLS connection was disconnected.
I (18508907) supervisor: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
I (18508927) ota_over_mqtt: coreMQTT-Agent disconnected. Suspending OTA agent.
I (18508927) core_mqtt_agent_manager: coreMQTT-Agent disconnected.

2024-07-21 16:33:38	
I (18511227) coreMQTT: MQTT connection established with the broker.
I (18511227) core_mqtt_agent_manager: Session present: 0

I (18511227) core_mqtt_agent_manager: Resubscribe to the topic /gates/GA97DF0005/update will be attempted.
W (18511227) gate_control: Error or timed out waiting for ack to subscribe message 6002. Re-attempting subscribe.
I (18511237) report: coreMQTT-Agent connected.
I (18511257) gate_control: coreMQTT-Agent connected.
I (18511257) supervisor: coreMQTT-Agent connected.
I (18511257) gate_control: Task "GateReport" sending subscribe request to coreMQTT-Agent for topic filter: /gates/GA97DF0005 with id 6002
I (18511267) ota_over_mqtt: coreMQTT-Agent connected. Resuming OTA agent.
I (18511287) core_mqtt_agent_manager: coreMQTT-Agent connected.

So to fix the mbedtls memory issue I applied the following commits:
a0fe220, which touches sdkconfig.default,
and e2d407f, which touches main/networking/mqtt/core_mqtt_agent_manager.c

This worked well for the mbedtls memory issue: I can no longer find any occurrence of esp-tls-mbedtls: mbedtls_ssl_setup returned -0x7F00. However, the frequency of MQTTKeepAliveTimeout increased significantly, and the software is no longer able to recover the MQTT connection.

2024-07-24 01:30:43	
I (7749746) coreMQTT: Publishing message to /gates/GA97DF0004.


2024-07-24 01:30:46	
I (7753386) ota_over_mqtt: OTA agent resumed by timer.

2024-07-24 01:30:48	
I (7754386) AWS_OTA: otaPal_GetPlatformImageState
I (7754386) esp_ota_ops: aws_esp_ota_get_boot_flags: 1
I (7754386) esp_ota_ops: [0] aflags/seq:0x2/0x1, pflags/seq:0xffffffff/0x0
I (7754386) AWS_OTA: Current State=[RequestingJob], Event=[Resume], New state=[RequestingJob]

2024-07-24 01:30:49	
I (7755416) gate_control: Task "GateControl" iteration 128 completed.

2024-07-24 01:30:49	
E (7755746) coreMQTT: Handling of keep alive failed. Status=MQTTKeepAliveTimeout
E (7755746) coreMQTT: Call to receiveSingleIteration failed. Status=MQTTKeepAliveTimeout
E (7755746) coreMQTT: MQTT operation failed with status MQTTKeepAliveTimeout

I (7755756) report: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
I (7755766) gate_control: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
I (7755786) supervisor: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
I (7755796) ota_over_mqtt: coreMQTT-Agent disconnected. Suspending OTA agent.
I (7755806) core_mqtt_agent_manager: coreMQTT-Agent disconnected.

2024-07-24 01:30:50	W (7756396) AWS_OTA: OTA Timer handle NULL for Timerid=1, can't stop.
2024-07-24 01:32:11	E (7837816) coreMQTT: No command structure available.
2024-07-24 01:32:21	E (7847866) coreMQTT: No command structure available.
2024-07-24 01:32:31	E (7857916) coreMQTT: No command structure available.
2024-07-24 01:32:41	E (7867966) coreMQTT: No command structure available.

At this stage we have two issues:

  1. Why does MQTTKeepAliveTimeout occur even though a message was successfully published to the AWS IoT MQTT server a few seconds earlier? The AWS IoT documentation page says:

For MQTT (or MQTT over WebSockets) connections, a client can request a keep-alive interval between 30 - 1200 seconds as part of the MQTT CONNECT message. AWS IoT starts the keep-alive timer for a client when sending CONNACK in response to the CONNECT message. This timer is reset whenever AWS IoT receives a PUBLISH, SUBSCRIBE, PING, or PUBACK message from the client. AWS IoT will disconnect a client whose keep-alive timer has reached 1.5x the specified keep-alive interval (i.e., by a factor of 1.5). The default keep-alive interval is 1200 seconds. If a client requests a keep-alive interval of zero, the default keep-alive interval will be used. If a client requests a keep-alive interval greater than 1200 seconds, the default keep-alive interval will be used. If a client requests a keep-alive interval shorter than 30 seconds but greater than zero, the server treats the client as though it requested a keep-alive interval of 30 seconds.

where the key information is: This timer is reset whenever AWS IoT receives a PUBLISH, SUBSCRIBE, PING, or PUBACK message from the client.
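
The broker-side rules quoted above can be sketched as a small function. This only models the documented AWS IoT behaviour (clamping of the requested interval and the 1.5x disconnect deadline); it says nothing about when the coreMQTT client itself reports MQTTKeepAliveTimeout. The names `keep_alive_t` and `aws_iot_keep_alive` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint16_t effective_s;   /* interval the broker actually applies, seconds */
    uint32_t disconnect_ms; /* broker disconnects after 1.5 x that interval */
} keep_alive_t;

/* Hypothetical model of the quoted rules:
 *  - 0 or > 1200 s  -> default of 1200 s
 *  - 1..29 s        -> treated as 30 s
 *  - otherwise      -> used as requested */
static void aws_iot_keep_alive(uint16_t requested_s, keep_alive_t *out)
{
    assert(out != NULL);

    uint16_t s = requested_s;
    if (s == 0 || s > 1200) {
        s = 1200;
    } else if (s < 30) {
        s = 30;
    }

    out->effective_s = s;
    out->disconnect_ms = (uint32_t)s * 1500u; /* 1.5 x interval, in ms */
}
```

With the two values tried in this report, 60 s gives a broker-side deadline of 90 s and 600 s gives 900 s, both far longer than the few seconds between the last successful publish and the timeout in the logs.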

I've also tried increasing the keep-alive interval in sdkconfig.default from CONFIG_GRI_MQTT_AGENT_KEEP_ALIVE_INTERVAL_SECONDS=60 to CONFIG_GRI_MQTT_AGENT_KEEP_ALIVE_INTERVAL_SECONDS=600, without any major difference.

This is confirmed on the server side: the Lifecycle Connect/Disconnect events don't show MQTT_KEEP_ALIVE_TIMEOUT, only DUPLICATE_CLIENTID, which is caused by rebooting the ESP32 without properly disconnecting from the MQTT server.

  2. The application has lost the ability to recover from the coreMQTT-Agent disconnected state. My short-term solution is to trigger a device reboot from my supervisor task; it isn't an elegant solution, and I would prefer to fix it properly.
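
To make the supervisor workaround concrete, here is a minimal sketch of the kind of check such a task could run: remember when the agent went down and request a reboot once it has stayed down longer than a limit. This is not the actual supervisor code from this project; `link_state_t`, `link_on_event` and `link_should_reboot` are hypothetical names, and the timestamps are assumed to be a 32-bit millisecond tick like the one in the log prefix.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    bool connected;              /* last known agent state */
    uint32_t disconnected_at_ms; /* tick at which the agent went down */
} link_state_t;

/* Hypothetical: call from the connected/disconnected event handler. */
static void link_on_event(link_state_t *s, bool connected, uint32_t now_ms)
{
    assert(s != NULL);
    /* Record the time only on the connected -> disconnected edge, so
     * repeated disconnect events do not restart the countdown. */
    if (!connected && s->connected) {
        s->disconnected_at_ms = now_ms;
    }
    s->connected = connected;
}

/* Hypothetical: poll periodically from the supervisor task. */
static bool link_should_reboot(const link_state_t *s, uint32_t now_ms,
                               uint32_t limit_ms)
{
    assert(s != NULL);
    if (s->connected) {
        return false;
    }
    /* Unsigned subtraction stays correct across tick wrap-around. */
    return (uint32_t)(now_ms - s->disconnected_at_ms) >= limit_ms;
}
```

The reboot itself (for example via esp_restart() on ESP-IDF) is deliberately left out of the sketch; a proper fix would still need the agent to reconnect on its own.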

System information

  • Hardware board: ESP32
  • IDE used: 5.1.4
  • Operating System: Windows
