no return from panic_abort(const char *details) (IDFGH-14221) #15018

Open
3 tasks done
ves011 opened this issue Dec 11, 2024 · 14 comments
Labels
Awaiting Response awaiting a response from the author Status: Opened Issue is new

Comments

@ves011

ves011 commented Dec 11, 2024

Answers checklist.

  • I have read the documentation ESP-IDF Programming Guide and the issue is not addressed there.
  • I have updated my IDF branch (master or release) to the latest version and checked that the issue is present there.
  • I have searched the issue tracker for a similar issue and not found a similar issue.

General issue report

I have an application which ends up in
void IRAM_ATTR __attribute__((noreturn, no_sanitize_undefined)) panic_abort(const char *details)
in panic.c, line 449, and gets stuck there.
Is there any reason why a restart is not triggered?
The comment in the user exception vector in xtensa_vectors.S, line 623, is
/* never returns here - call0 is used as a jump (see note at top) */
But we need a way to recover from this situation without pressing the reset button. A restart, as with the other panic handlers, would be a reasonable option.
Are there other panic handlers which do not trigger a restart?

@espressif-bot espressif-bot added the Status: Opened Issue is new label Dec 11, 2024
@github-actions github-actions bot changed the title no return from panic_abort(const char *details) no return from panic_abort(const char *details) (IDFGH-14221) Dec 11, 2024
@sudeep-mohanty
Collaborator

Hi @ves011, could you check whether you have the reboot behavior enabled for the panic handler?
You can do that in menuconfig -> Component Config -> ESP System Settings -> Panic handler behavior.

Make sure you have either the Print registers and reboot or Silent reboot option selected.
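
As a rough cross-check of what the build actually picked up, the selected behavior can also be reported from code. The sketch below assumes the Kconfig symbols are named CONFIG_ESP_SYSTEM_PANIC_PRINT_REBOOT, CONFIG_ESP_SYSTEM_PANIC_SILENT_REBOOT, CONFIG_ESP_SYSTEM_PANIC_PRINT_HALT and CONFIG_ESP_SYSTEM_PANIC_GDBSTUB, as in recent ESP-IDF releases; please verify them against your own sdkconfig. The helper names panic_behavior_name() and panic_handler_reboots() are hypothetical and not part of ESP-IDF.

```c
#include <assert.h>
#include <string.h>

#ifdef ESP_PLATFORM
#include "sdkconfig.h"   /* generated by the ESP-IDF build; absent on a host build */
#endif

/* Hypothetical helper: maps the (assumed) Kconfig symbols to a readable name. */
static const char *panic_behavior_name(void)
{
#if defined(CONFIG_ESP_SYSTEM_PANIC_PRINT_REBOOT)
    return "print registers and reboot";
#elif defined(CONFIG_ESP_SYSTEM_PANIC_SILENT_REBOOT)
    return "silent reboot";
#elif defined(CONFIG_ESP_SYSTEM_PANIC_PRINT_HALT)
    return "print registers and halt";
#elif defined(CONFIG_ESP_SYSTEM_PANIC_GDBSTUB)
    return "GDB stub on panic";
#else
    return "unknown";   /* none of the symbols is defined, e.g. on a host build */
#endif
}

/* Hypothetical helper: returns 1 when the configured behavior ends in a reboot. */
static int panic_handler_reboots(void)
{
    const char *name = panic_behavior_name();
    assert(name != NULL);
    return strstr(name, "reboot") != NULL;
}
```

Logging panic_behavior_name() once at startup would show which option the firmware was really built with, which helps when several sdkconfig files are in play.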

@espressif-bot espressif-bot added the Awaiting Response awaiting a response from the author label Dec 11, 2024
@ves011
Author

ves011 commented Dec 11, 2024

Yes.
It's set to "Print registers and reboot".
So far all system panics have ended with a restart. It's the first time I have seen the code stuck with no message on the console.
I could capture the issue only when running under a debugger.

@sudeep-mohanty
Collaborator

If you have a debugger attached then this behavior is expected. The system would set a breakpoint and let users probe for the faults. You can read about the configuration option CONFIG_ESP_DEBUG_OCDAWARE.

@ves011
Author

ves011 commented Dec 11, 2024

The issue shows up without a debugger attached.
The code just gets stuck: the console is frozen, TCP is frozen, and the module is no longer accessible.
I could get to the point where it gets stuck only after attaching a debugger.

@sudeep-mohanty
Collaborator

Could you provide us with the following information:

  • Which target are you using?
  • Your sdkconfig file.
  • A minimal code example that lets us reproduce the issue and debug it on our end.

@ves011
Author

ves011 commented Dec 11, 2024

The target is ESP32S3 WROOM1N4R2.
I attached sdkconfig, the generated sdkconfig.h and the full source code of tcp_server.c.
The issue pops up when running tcp_server in poor network conditions.
Reproducing it is pretty difficult; I will try to explain here.

  • the register_tcp_server() function first creates the tcp_server() task
  • the tcp_server task creates the server socket and enters the accept-connections loop
  • if a connection is accepted, it creates 2 other tasks: send_task and process_message_task
  • if the 2 tasks are created successfully, it enters the receive loop
  • while in the receive loop, if a TCP error is detected, it breaks out of the loop, deletes the 2 created tasks and returns to the accept loop

Here is a meaningful extract of the code:
static void send_task(void *pvParameters)
    {
    socket_message_t msg;
    int written, ret;
    int sock = ((tParams_t *)pvParameters)->socket;
    xQueueReset(tcp_send_queue);
    ESP_LOGI(TAG, "send_task started, socket: %d", sock);
    send_error = 0;
    while(1)
        {
        if(xQueueReceive(tcp_send_queue, &msg, portMAX_DELAY))
            {
            
            written = 0, ret = 0;
            while(written < sizeof(socket_message_t))
                {
                ret = send(sock, (uint8_t *)(&msg + written), sizeof(socket_message_t) - written, 0);
                if (ret < 0) 
                    {
                    send_error = 1;
                    ESP_LOGE(TAG, "Error occurred during sending message: errno %d", errno);
                    break;
                    }
                else
                    written += ret;
                }
            if(ret <= 0)
                {
                // need to handle here error while sending
                // or its enough recv() error handling in tcp_server_task ???
                }
            }
        }
    }

void tcp_server(void *pvParameters)
    {
    ...
        
    listen_sock = socket(AF_INET, SOCK_STREAM, ip_protocol);
    if (listen_sock >= 0)
        {
        ...
        ret = bind(listen_sock, (struct sockaddr *)&dest_addr, sizeof(dest_addr));
        if (ret == 0)
            {
            ESP_LOGI(TAG, "Socket bound, port %d", DEFAULT_TCP_SERVER_PORT);
            ret = listen(listen_sock, 1);
            if (ret == 0)
                {
                while (1) // accept loop
                    {
                    ...
                    server_sock = accept(listen_sock, (struct sockaddr *)&source_addr, &addr_len);
                    if(server_sock >= 0)
                        { 
                        ESP_LOGI(TAG, "server task socket %d", server_sock);
                        ...
                        while(1) //communication loop
                            {
                            len = recv(server_sock, &message, sizeof(socket_message_t), MSG_WAITALL);
                            if (len < 0)
                                {
                                ESP_LOGE(TAG, "Error occurred during receiving: errno %d", errno);
                                close(server_sock);
                                break;
                                }
                            else if (len == 0) //socket closed by client
                                {
                                ESP_LOGI(TAG, "socket closed by peer");
                                close(server_sock);
                                break;
                                }
                            else //process message based on cmd_id && commstate
                                {
                                if(commstate == IDLE)
                                    {
                                    ...

                                    }
                                else //commstate == CONNECTED
                                     //just put the message in the queue
                                    {
                                    ...
                                    //posting the message, triggers process_message task
                                    xQueueSend(tcp_receive_queue, &message, 0);
                                    }
                                }
                            }
                        if(process_message_task_handle)
                            {
                            vTaskDelete(process_message_task_handle);
                            process_message_task_handle = NULL;
                            }
                        if(send_task_handle)
                            {
                            vTaskDelete(send_task_handle);
                            send_task_handle = NULL;
                            }
                        }
                    else
                        {
                        ESP_LOGE(TAG, "Unable to accept connection: errno %d", errno);
                        break;
                        }
                    }
                }
            else
                {
                ESP_LOGE(TAG, "Error occurred during listen: errno %d", errno);
                }
            }
        else
            {
            ESP_LOGE(TAG, "Socket unable to bind: errno %d", errno);
            }
        }
    else
        {
        ESP_LOGE(TAG, "Unable to create socket: errno %d", errno);
        }
    vTaskDelete(NULL);
    }

While in the receive loop, the tcp_server() task is blocked on recv() and only send_task() is active, continuously sending messages to the connected client.
TCP socket errors such as 113 or 104 pop up during recv() in the tcp_server() task and during send() in send_task().

In some rare cases the logs show the error during both recv() and send(), and in some of those cases the issue pops up and the module gets stuck.
tmp.zip

@ves011
Author

ves011 commented Dec 11, 2024

Looking at all of the above, I think it will be easier for you to create a small test app with my sdkconfig and target and force it to end up in void IRAM_ATTR __attribute__((noreturn, no_sanitize_undefined)) panic_abort(const char *details).

@sudeep-mohanty
Collaborator

@ves011 I was not successful in reproducing the problem that you see. May I ask whether you see the panic banner being printed and any core dump or register dump on the console? It is strange that the panic handler does not reboot the chip. The fact that the problem occurs only occasionally could indicate memory corruption somewhere that leads to a hang, so it is hard to say what the cause is here.

@ves011
Author

ves011 commented Dec 17, 2024

Sorry for the late reply, I was distracted by other tasks.

The issue is quite annoying and pops up on 2 different modules. It is not likely to be a memory issue.
In my setup I have 2 modules, one running a TCP server and the other a TCP client.
When one of the modules is reset, the peer should go back either to listen mode (if server) or to attempting to connect (if client), so that communication is re-established after the reset. This simulates one of the modules losing its network connection.
The issue happens, as I said, when an error is reported on both sending and receiving, and it happens every 3-4 resets.
If you want, I can give you the GitHub link to my code so you can build it and reproduce the issue.

@sudeep-mohanty
Collaborator

Hi @ves011,
It is unlikely that there could be something not working correctly with the panic handler as the issue should have been reproducible more consistently and with a more direct approach. The fact that you need a comprehensive wireless communication routine to see the problem suggests that there could be something wrong elsewhere in the code. I suggest that you tackle the problem by targeting specific parts of your application. You could remove/add them back to see if you can narrow down to the part that is causing the issue.

@ves011
Author

ves011 commented Dec 19, 2024

Hi @sudeep-mohanty

Is there something wrong in my code? Definitely yes. No piece of code should end up in a panic handler.
But this is not the point. The point is that the panic handler, instead of resetting the module, loops endlessly and makes the module unresponsive.

@sudeep-mohanty
Collaborator

sudeep-mohanty commented Dec 19, 2024

Hi @ves011,
I may not have understood the problem completely. Could you clarify whether the panic handler reboots the modules sometimes or never?

If it is the former, then it is rather strange that the panic handler only occasionally fails to reboot the system. Usually, such issues should be consistently reproducible. The only suspicion I have currently is that some memory corruption might be happening which causes the panic handler to misbehave. That is why it might be prudent to narrow down the problematic part by running a reduced set of the app code, if possible.

Another way we could rule out any bugs with the panic_abort() code would be to induce the abort() at other stages of the application to see if the behavior is consistent.
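
One way to do this could look like the sketch below. It assumes that abort() on ESP-IDF ends up in panic_abort(), which is what this thread implies but which I have not traced here. The names set_abort_stage() and checkpoint() are hypothetical helpers, not ESP-IDF functions.

```c
#include <assert.h>
#include <stdlib.h>

/* Stage at which to force an abort; -1 disables the forced abort. */
static int abort_at_stage = -1;
/* Number of checkpoints passed so far without aborting. */
static int stages_passed;

/* Hypothetical helper: choose the stage that should trigger abort(). */
static void set_abort_stage(int stage)
{
    abort_at_stage = stage;
}

/* Hypothetical helper: call at interesting points of the application,
 * e.g. after Wi-Fi start, after accept(), inside the receive loop. */
static int checkpoint(int stage)
{
    assert(stage >= 0);
    if (stage == abort_at_stage) {
        abort();   /* assumption: on ESP-IDF this reaches panic_abort() */
    }
    return ++stages_passed;
}
```

Moving the chosen stage from one build to the next would show whether the reboot after abort() depends on where in the application the abort happens.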

Nevertheless, could you also try the below options and let me know the results?

  • Could you enable the config option CONFIG_ESP_PANIC_HANDLER_IRAM and try to reproduce the problem? If the crash happens when the flash cache is disabled, then the panic handler should enable it. But it could also not work correctly if the flash cache status is somehow corrupted.
  • Turn off CONFIG_ESP_DEBUG_OCDAWARE from the menuconfig. This could rule out some unwanted activation of the part of the handler which halts the CPU rather than rebooting it.

@o-marshmallow
Collaborator

o-marshmallow commented Dec 19, 2024

Hello @ves011 ,

I also suspect that you have a memory corruption. It's hard to tell without a reproducible example, but in the snippet you wrote above I can see a potential corruption:

socket_message_t msg;
ret = send(sock, (uint8_t *)(&msg + written), sizeof(socket_message_t) - written, 0);

Here, you are trying to send your message in multiple chunks, but you made a mistake in the cast. I think you wanted to cast the address of msg to a byte pointer and send it as a byte array, so you meant to write (uint8_t *) &msg + written, but what you wrote is (uint8_t *)(&msg + written), which is not the same!

In your case, &msg has type socket_message_t *, so &msg + written advances the pointer by written * sizeof(socket_message_t) bytes, not by written bytes, and only then is the result cast to uint8_t *. It will therefore send unpredictable data from the stack.
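
The difference between the two expressions can be checked on a host machine with a small sketch. demo_message_t is a hypothetical stand-in for socket_message_t, whose real layout is not shown in this thread, and an array is used so that the pointer arithmetic stays inside a valid object.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for socket_message_t; the real layout is not shown. */
typedef struct {
    uint32_t cmd_id;
    uint8_t payload[28];
} demo_message_t;

#define DEMO_COUNT 4
static demo_message_t demo_msgs[DEMO_COUNT];

/* Bytes skipped by (uint8_t *)(ptr + n): the offset is applied BEFORE the
 * cast, so it is scaled by the size of the struct. */
static size_t offset_scaled_by_struct(size_t n)
{
    assert(n <= DEMO_COUNT);   /* stay within, or one past, the array */
    return (size_t)((const uint8_t *)(demo_msgs + n) - (const uint8_t *)demo_msgs);
}

/* Bytes skipped by (uint8_t *)ptr + n: the cast happens first, so the
 * offset is counted in single bytes. */
static size_t offset_in_bytes(size_t n)
{
    assert(n <= sizeof(demo_msgs));
    return (size_t)(((const uint8_t *)demo_msgs + n) - (const uint8_t *)demo_msgs);
}
```

For any n greater than zero the first function returns n * sizeof(demo_message_t) and the second returns n, which is why the original expression reads far outside msg after a partial send.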

You will only have issues if your send function doesn't send the whole message at once, which can happen with a poor network connection.
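
A minimal sketch of a send loop that copes with partial sends is shown below. It is not the poster's code: send_all(), send_fn_t, chunked_sink() and demo_payload are hypothetical names. The sender is passed in as a function pointer shaped like send() so the loop can be exercised without a socket.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Signature modelled on send(); a real caller would wrap send() here. */
typedef long (*send_fn_t)(int sock, const void *buf, size_t len, int flags);

/* Sends exactly len bytes, retrying after partial writes.
 * Returns 0 on success, -1 if the sender reports an error or no progress. */
static int send_all(send_fn_t send_fn, int sock, const void *data, size_t len)
{
    const uint8_t *bytes = (const uint8_t *)data;   /* cast first, then offset in bytes */
    size_t written = 0;

    assert(send_fn != NULL && data != NULL);
    while (written < len) {
        long ret = send_fn(sock, bytes + written, len - written, 0);
        if (ret <= 0) {
            return -1;   /* error, or zero bytes accepted: stop instead of spinning */
        }
        written += (size_t)ret;
    }
    return 0;
}

/* Test double: accepts at most 3 bytes per call, like a congested link. */
static uint8_t sink[64];
static size_t sink_len;
static const char demo_payload[] = "0123456789";

static long chunked_sink(int sock, const void *buf, size_t len, int flags)
{
    (void)sock;
    (void)flags;
    size_t n = len < 3 ? len : 3;
    if (sink_len + n > sizeof(sink)) {
        return -1;   /* sink full: report an error like a failed send() */
    }
    memcpy(sink + sink_len, buf, n);
    sink_len += n;
    return (long)n;
}
```

With chunked_sink() as the sender, the payload arrives in several pieces and the sink ends up holding the same bytes as the source, which is the property the original loop lost once the first partial send occurred.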

@ves011
Author

ves011 commented Dec 19, 2024

@o-marshmallow, thanks for pointing that out; it's already fixed, together with others. I'm sure there is still a lot to be fixed.
But, guys, do not waste your time debugging my code. It has a lot of issues, and that's normal: it's in the development stage. This is my job, so don't leave me unemployed :)
Please focus on the IDF issue, which is:

A piece of crap code ends up in the
void IRAM_ATTR __attribute__((noreturn, no_sanitize_undefined)) panic_abort(const char *details) function in panic.c, line 449,
and no reboot or reset happens!!!! It just loops forever.

I can give you a way to reproduce the issue, which I agree is not trivial, but don't ask me to debug an IDF problem which is no longer reproducible with the current version of the code I'm working on.
This IDF issue is quite critical in my opinion, because it makes us, the users of the framework, doubt the proper behavior of an internal component.

4 participants