Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bugfix: Fix interrupted network read operation #120

Merged
merged 34 commits into from
Dec 5, 2020

Conversation

aggarw13
Copy link
Contributor

@aggarw13 aggarw13 commented Dec 3, 2020

Problem

The MQTT_ProcessLoop and MQTT_ReceiveLoop read incoming MQTT packet payload over the network by calling the recvExact function. The recvExact function can be called multiple times to read the expected number of bytes for the MQTT packet but it also implements a timeout functionality of receiving the expected number of payload within the timeout value passed to the function.
This causes problems when the Transport_Recv call returns less than requested number of bytes, and there is a timeout (for example, when calling MQTT_ProcessLoop with 0ms duration) which causes the function to assume failure instead of reading the remaining payload of the MQTT packet by calling Transport_Recv again. Thus, in such cases, the MQTT connection is severed prematurely even though there is a high probability of receiving the remaining bytes of the MQTT packet over the network.

Solution

Instead of implementing a timeout on the entire duration of receiving the expected number of remaining MQTT packet bytes in recvExact, the use of timeout is being changed to be relevant only on the total time of receiving 0 bytes over the network over multiple calls to Transport_Recv.
As this modified meaning of the timeout duration is now unrelated to the timeout duration that the MQTT_ProcessLoop or MQTT_ReceiveLoop functions are called, a new configuration constant for the recvExact timeout value, MQTT_RECV_POLLING_TIMEOUT_MS, has been added to the library which will carry a default value of 10ms.

source/core_mqtt.c Outdated Show resolved Hide resolved
source/core_mqtt.c Outdated Show resolved Hide resolved
source/include/core_mqtt_config_defaults.h Outdated Show resolved Hide resolved
source/include/core_mqtt_config_defaults.h Outdated Show resolved Hide resolved
* repeatedly to read bytes from the network until either:
* 1. The requested number of bytes @a bytesToRecv are read.
* OR
* 2. There is a timeout of MQTT_RECV_POLLING_TIMEOUT_MS duration
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No data is received for MQTT_RECV_POLLING_TIMEOUT_MS.

@@ -683,13 +691,12 @@ static MQTTPubAckType_t getAckFromPacketType( uint8_t packetType )
/*-----------------------------------------------------------*/

static int32_t recvExact( const MQTTContext_t * pContext,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

static int32_t recvExact( const MQTTContext_t * pContext,
                          size_t bytesToRecv )
{
    uint8_t * pIndex = NULL;
    size_t bytesRemaining = bytesToRecv;
    int32_t totalBytesRecvd = 0, bytesRecvd;
    uint32_t timestampBeforeRecvMs = 0U, timeSpentInRecvMs = 0U;
    TransportRecv_t recvFunc = NULL;
    MQTTGetCurrentTimeFunc_t getTimeStampMs = NULL;
    bool receiveError = false;

    assert( pContext != NULL );
    assert( bytesToRecv <= pContext->networkBuffer.size );
    assert( pContext->getTime != NULL );
    assert( pContext->transportInterface.recv != NULL );
    assert( pContext->networkBuffer.pBuffer != NULL );

    pIndex = pContext->networkBuffer.pBuffer;
    recvFunc = pContext->transportInterface.recv;
    getTimeStampMs = pContext->getTime;

    while( ( bytesRemaining > 0U ) && ( receiveError == false ) )
    {
        timestampBeforeRecvMs = getTimeStampMs();

        bytesRecvd = recvFunc( pContext->transportInterface.pNetworkContext,
                               pIndex,
                               bytesRemaining );

        timeSpentInRecvMs = calculateElapsedTime( getTimeStampMs(), timestampBeforeRecvMs );

        if( bytesRecvd < 0 )
        {
            LogError( ( "Network error while receiving packet: ReturnCode=%ld.",
                        ( long int ) bytesRecvd ) );
            totalBytesRecvd = bytesRecvd;
            receiveError = true;
        }
        else if( bytesRecvd > 0 )
        {
            /* It is a bug in the application's transport receive implementation
             * if more bytes than expected are received. To avoid a possible
             * overflow in converting bytesRemaining from unsigned to signed,
             * this assert must exist after the check for bytesRecvd being
             * negative. */
            assert( ( size_t ) bytesRecvd <= bytesRemaining );

            bytesRemaining -= ( size_t ) bytesRecvd;
            totalBytesRecvd += ( int32_t ) bytesRecvd;
            pIndex += bytesRecvd;
            LogDebug( ( "BytesReceived=%ld, BytesRemaining=%lu, "
                        "TotalBytesReceived=%ld.",
                        ( long int ) bytesRecvd,
                        ( unsigned long ) bytesRemaining,
                        ( long int ) totalBytesRecvd ) );
        }
        else
        {
            /* No bytes were read from the network. */
        }

        /* If there is more data to be received and nothing was received
         * for MQTT_RECV_POLLING_TIMEOUT_MS, treat this as error. */
        if( ( bytesRemaining > 0U ) &&
            ( timeSpentInRecvMs >= MQTT_RECV_POLLING_TIMEOUT_MS ) &&
            ( bytesRecvd == 0 ) )
        {
            LogError( ( "Time expired while receiving packet." ) );
            receiveError = true;
        }
    }

    return totalBytesRecvd;
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will only calculate the time spent for each call of recvFunc. Wouldn't there be an infinite loop if every recvFunc call returned 0 bytes but spent slightly less than MQTT_RECV_POLLING_TIMEOUT_MS?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In your code suggestion, timeSpentInRecvMs represents the time spent only in a single transport recv call, where as my changes look at the time spent across multiple calls to transport recv function that return zero bytes to determine if there has been a polling timeout.
With your code suggestion, there can be an issue of getting into an infinite loop of the transport recv function returning zero bytes, but the underlying SOCKET_TIMEOUT of transport recv being < MQTT_RECV_POLLING_TIMEOUT_MS, which will prevent the receiveError from being set to true and thus, the loop will not terminate

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. In that case, changing the variable names would make it a bit more readable:

entryTimeMs  --> lastDataRecvdTimestampMs
noDataRecvdTimeMs  --> timeSinceLastDataWasRecvd

@@ -853,7 +867,7 @@ void test_MQTT_Connect_partial_receive()

setupTransportInterface( &transport );
setupNetworkBuffer( &networkBuffer );
transport.recv = transportRecvOneByte;
transport.recv = transportRecvNoData;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to write a test to replicate the scenario which caught this issue:

  • revc returns less than requested data (possibly return n/2 bytes so that second call with return complete data).
  • processLoop is called with 0 timeout.

This test should fail without this change and pass with this change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can achieve that with the existing mock recv function that returns only byte at a time.
As the bug being addressed is primarily with "less than expected data received" when the "remaining timeout" value is nearing zero, I can add a test to mimic that by passing 0 as the timeout to MQTT_ProcessLoop and using the transportRecvOneByte mock function

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the test

source/core_mqtt.c Outdated Show resolved Hide resolved
source/core_mqtt.c Outdated Show resolved Hide resolved
source/core_mqtt.c Outdated Show resolved Hide resolved
source/core_mqtt.c Outdated Show resolved Hide resolved
source/core_mqtt.c Outdated Show resolved Hide resolved
@@ -603,6 +603,13 @@ MQTTStatus_t MQTT_Disconnect( MQTTContext_t * pContext );
* @param[in] timeoutMs Minimum time in milliseconds that the receive loop will
* run, unless an error occurs.
*
* @note Calling this function blocks the calling context for a time period that
* depends on the passed @p timeoutMs, the configured macro, #MQTT_RECV_POLLING_TIMEOUT_MS
* and the underlying transport interface implementation timeouts.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to also mention that if not hitting an error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

@sarenameas sarenameas force-pushed the fix/recv-timeout-issue branch from a1d70b9 to e1de841 Compare December 4, 2020 05:30
muneebahmed10
muneebahmed10 previously approved these changes Dec 4, 2020
*
* @return Number of bytes received, or negative number on network error.
*/
static int32_t recvExact( const MQTTContext_t * pContext,
size_t bytesToRecv,
uint32_t timeoutMs );
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Due to the removal of this parameter, some more timeouts could probably be removed due to similar reasoning. I don't think it's critical though and definitely shouldn't be done in this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree that the timeout passed to discardPacket and the timeout logic in the function can also be removed (as hygiene). However, I decided to keep the scope of this PR limited to just fixing the recvExact issue.

source/core_mqtt.c Show resolved Hide resolved
source/interface/transport_interface.h Outdated Show resolved Hide resolved

if( ( bytesRemaining > 0U ) && ( elapsedTimeMs >= timeoutMs ) )
if( ( bytesRemaining > 0U ) &&
( timeSinceLastDataWasRecvdMs >= MQTT_RECV_POLLING_TIMEOUT_MS ) )
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I don't really like this name. Not saying it's not accurate, it's just long and I don't think the extra words are very meaningful. For example, something like timeSinceRecvMs still conveys the relevant information, but in a more concise manner.

I don't think the variable needs to be this explicit if it's clear from reading the rest of the code. Feel free to disagree.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I like the recommendation. Have made the name concise.

muneebahmed10
muneebahmed10 previously approved these changes Dec 4, 2020
muneebahmed10
muneebahmed10 previously approved these changes Dec 4, 2020
aggarg
aggarg previously approved these changes Dec 4, 2020
yourslab
yourslab previously approved these changes Dec 4, 2020
source/core_mqtt.c Outdated Show resolved Hide resolved
@aggarw13 aggarw13 dismissed stale reviews from yourslab, aggarg, and muneebahmed10 via b923ea5 December 4, 2020 21:23
muneebahmed10
muneebahmed10 previously approved these changes Dec 4, 2020
…b max tries.

- After MAX_NETWORK_RECV_TRIES a zero is always returned from the transport recv stub.
- When a zero is returned recvExact will exit immediately because the max polling timeout is 1.
@aggarw13 aggarw13 merged commit b581fc4 into FreeRTOS:main Dec 5, 2020
@aggarw13 aggarw13 deleted the fix/recv-timeout-issue branch December 5, 2020 03:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants