Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HandleKeepAlive frequently calls MQTT_Ping #1915

Closed
xiaomizhouya opened this issue Sep 11, 2024 · 15 comments
Closed

HandleKeepAlive frequently calls MQTT_Ping #1915

xiaomizhouya opened this issue Sep 11, 2024 · 15 comments

Comments

@xiaomizhouya
Copy link

When testing MQTT_ProcessLoop, we noticed a problem that MQTT_Ping was called frequently in the handleKeepAlive function, as shown in the red box in the figure, and every loop would come into this judgment logic. We found that in the SDK, when receiving a message from the server, pContext ->lastPacketTime does not change its value, which means that pContext ->lastPacketTime is always 0; When the current time exceeds PACKET-RX_TIMEOUT_S, it will continue to enter this logic and frequently send packets. Excuse me, is this a bug?

Also, I don't quite understand the logic of this logic. Why do I need to send ping packets based on the received message interval? My understanding is that sending ping packets based on the interval between the last message sent is sufficient. I hope to receive an answer.

Due to the above issue, we encountered a problem. We added log in the sdk and found that the device called sendBuffer function in MQTT_PING function to send data. When the return value is 2, can it indicate that the device has successfully sent data? However, according to the log of the server, no data packet was received from the device. I mean, what's the reason for this? Did the packet get lost in the network? May I ask if there is any log on the server side to assist us in debugging?

image
image

@aggarg
Copy link
Member

aggarg commented Sep 11, 2024

We found that in the SDK, when receiving a message from the server, pContext ->lastPacketTime does not change its value, which means that pContext ->lastPacketTime is always 0;

It is updated here - https://github.com/FreeRTOS/coreMQTT/blob/main/source/core_mqtt.c#L1794.

My understanding is that sending ping packets based on the interval between the last message sent is sufficient. I hope to receive an answer.

It is needed for both idle Tx or Rx and the rationale is explained in this PR - FreeRTOS/coreMQTT#191.

When the return value is 2, can it indicate that the device has successfully sent data?

This means the transport interface told us that it has sent 2 bytes.

However, according to the log of the server, no data packet was received from the device. I mean, what's the reason for this? Did the packet get lost in the network? May I ask if there is any log on the server side to assist us in debugging?

How did you determine that no data was received on the server?

@xiaomizhouya
Copy link
Author

So far, according to the client log, the sdk has successfully sent a 2bytes packet, but received no reply, and then returned to iot_core MQTTKeepAliveTimeout. We do not know if aws iot server received the data. We want to know what might cause this to happen and whether there is log support for debug on the server side.

@aggarg
Copy link
Member

aggarg commented Sep 12, 2024

We want to know what might cause this to happen and whether there is log support for debug on the server side.

You can try this - https://docs.aws.amazon.com/iot/latest/developerguide/cloud-watch-logs.html.

I'd also suggest to capture network traffic from the device.

@xiaomizhouya
Copy link
Author

thanks, I have an additional question to ask. We have found that there is a chance for the iot process to crash when executing MQTTDisconnect. By adding logs, we have discovered that there is a problem with the sendBuffer function. Have you encountered this problem before or have you solved it?
image

@aggarg
Copy link
Member

aggarg commented Sep 17, 2024

We have found that there is a chance for the iot process to crash when executing MQTTDisconnect. By adding logs, we have discovered that there is a problem with the sendBuffer function.

Would you please describe the problem that you discovered? I do not understand the problem from the image you shared.

@xiaomizhouya
Copy link
Author

By adding the log, we found that after the sdk enters the MQTT_Disconnect interface, the log only reaches line 3064, there is no log after that, and the iot_core process directly crashes, have you encountered this problem? This problem is not inevitable, I have tried seven or eight times encountered once.
image

@aggarg
Copy link
Member

aggarg commented Sep 18, 2024

Thank you for explaining. No, we are not aware of any such issue. Would you please run the process in the debugger and see why is crashes?

@xiaomizhouya
Copy link
Author

May I ask how to enter debugger mode? At present, we can only determine the location of the crash by adding logs.
image

@xiaomizhouya
Copy link
Author

We opened cloudwatch logs on the server side according to your suggestion. After the camera sends data at 14:49:58, the server logs from 14:45:50 to 14:50:50 as follows (in the json file, the time difference is 8 hours). It seems that there is no more detailed log. Does it mean that the server has not received the package? And because MQTT uses TLS protocol encryption, it is difficult to repeat the problem and capture the packet. Is there any suggestion to help us debug?

image
3C62F019B242.json

@aggarg
Copy link
Member

aggarg commented Sep 20, 2024

May I ask how to enter debugger mode?

Assuming you are running on Linux, you should be able to use GDB. There is plenty of tutorials available on how to use GDB.

Does it mean that the server has not received the package?

Most likely.

And because MQTT uses TLS protocol encryption, it is difficult to repeat the problem and capture the packet. Is there any suggestion to help us debug?

Your underlying transport is failing to send the data. Even though we wont be able to decrypt network traffic, it will tell us if the PINGREQ is put on wire or not.

@xiaomizhouya
Copy link
Author

I would like to ask some specific questions. Before, you said that MQTT_Ping has a return value, indicating that the client has sent a message out. Under what circumstances will this return value be obtained? Did it actually send the data? Because according to your later statement, it seems that you suspect that the client is not sending the data at the bottom. And how to tell if the server did not receive the packet rather than received the packet but did not reply?Is there any way to confirm that the server really did not receive the packet?

@aggarg
Copy link
Member

aggarg commented Sep 25, 2024

Before, you said that MQTT_Ping has a return value

MQTT_Ping does have a return value as you can see here - https://github.com/FreeRTOS/coreMQTT/blob/main/source/core_mqtt.c#L2914.

Under what circumstances will this return value be obtained?

A function's return value can be obtained under any circumstance. What do you mean by this question?

Did it actually send the data?

coreMQTT uses transport interface to send the data, the implementation of which is supplied by the application. In this case, you are likely using one of the following implementation of transport interface - https://github.com/aws/aws-iot-device-sdk-embedded-C/tree/main/platform/posix/transport/src. If the transport interface returns success indicating that it sent the data but does not send it, the MQTT library has no way to find if the data was not sent. Assuming you are using Linux, this scenario is possible because Linux copies the data into the kernel buffers and returns success to the application - it may later fail to send the data if the connection is broken.

And how to tell if the server did not receive the packet rather than received the packet but did not reply?Is there any way to confirm that the server really did not receive the packet?

That is why I have been asking you to capture network traffic and share? Are you able to do that?

vishwamartur added a commit to vishwamartur/aws-iot-device-sdk-embedded-C that referenced this issue Oct 30, 2024
Related to aws#1915

Update `handleKeepAlive` in `source/core_mqtt.c` to correctly update `pContext->lastPacketTime` when data is received.

* Ensure `MQTT_Ping` is not called frequently due to `pContext->lastPacketTime` not updating.
* Retain the logic for sending ping packets based on both received and sent message intervals.
* Address the issue related to the underlying transport failing to send the data, ensuring reliable data transmission.
@aggarg
Copy link
Member

aggarg commented Nov 7, 2024

@xiaomizhouya Did you get a chance to debug it further?

@DakshitBabbar
Copy link
Member

Hey @xiaomizhouya, Looking at the following in the CloudWatch logs, the client was disconnected by the broker because of MQTT_KEEP_ALIVE_TIMEOUT :

"@timestamp": "2024-09-19 06:50:42.700",
        "@message": {
            "timestamp": "2024-09-19 06:50:42.700",
            "logLevel": "INFO",
            "traceId": "a5fa7913-1f35-0976-5ae8-08f80e084f98",
            "accountId": "883998178536",
            "status": "Success",
            "eventType": "Disconnect",
            "protocol": "MQTT",
            "clientId": "3C62F019B242-2311DMM000269",
            "principalId": "d1c07375c9cddfbb9904a08aaf1bb5f2b6c8e14e7c7e6b1b3b6f98201674bdc0",
            "sourceIp": "223.112.52.222",
            "sourcePort": 48456,
            "disconnectReason": "MQTT_KEEP_ALIVE_TIMEOUT"
        },
        "@logStream": "e3201aac-bcad-4411-a64a-5cfc13353f6b_883998178536_0",
        "@log": "883998178536:AWSIotLogsV2"

According to this it means that there was no communication between the server and the client for 1.5x the keep alive time. Hence AWS IoT has not received the PINGREQ.

Also the fact that MQTT_Disconnect() and MQTT_Ping() both are not working as they should be and from the logs you shared for sendBuffer() confirm the earlier suspicion that the underlying transport interface is failing to send the data. Assuming that you are using Linux, your underlying connection is likely broken and attempting a reconnection is the right thing to do.

@aggarg
Copy link
Member

aggarg commented Dec 21, 2024

Closing this issue. Feel free to re-open if you need anything else.

@aggarg aggarg closed this as completed Dec 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants