Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding an error traceback to panic() #1119

Closed
TerryE opened this issue Mar 4, 2016 · 27 comments
Closed

Adding an error traceback to panic() #1119

TerryE opened this issue Mar 4, 2016 · 27 comments
Assignees
Labels

Comments

@TerryE
Copy link
Collaborator

TerryE commented Mar 4, 2016

As part of my LCD patch, I also reimplemented the Lua error traceback, so that by default an error give a proper Lua stack traceback, rather the a one-line goodbye. However, the RTS still has a catch-all panic() to trap any unprotected error calls. This puts out a very unhelpful error message: PANIC: unprotected error in call to Lua API (<Lua routine name>). It just struck me that it would be better to provide a proper Lua traceback. Is this a good idea?

This is a fairly easy change.

@TerryE TerryE self-assigned this Mar 4, 2016
@pjsg
Copy link
Member

pjsg commented Mar 5, 2016

Yes please.

@dnc40085
Copy link
Contributor

dnc40085 commented Mar 5, 2016

More debug info is always good.

@TerryE
Copy link
Collaborator Author

TerryE commented Mar 6, 2016

This isn't quite as simple as I first thought. The cb stubs in all of the modules typically follow this type of pattern as in net.c:

  if(nud->cb_disconnect_ref != LUA_NOREF && nud->self_ref != LUA_NOREF) {
    lua_rawgeti(gL, LUA_REGISTRYINDEX, nud->cb_disconnect_ref);
    lua_rawgeti(gL, LUA_REGISTRYINDEX, nud->self_ref);
    lua_call(gL, 1, 0);
  }

The issue is that if the cb_disconnect throws an error then an error handler hasn't been established so
ldo.c:luaD_throw() follows the code patch for the error handler not established that is it clears down the stack before calling G(L)->panic(L) if the registered. BTW commenting out the c_exit() is a definite bug. We should use Johny's trick of doing while(1) {} to trip the WDT.

I need to play around to see exactly when an error panics, but I suspect that we will need some form of lua_call variant (and not just the standard xpcall) so I can do a glocal search and replace of these cb's in the modules.

More work is needed.

@TerryE
Copy link
Collaborator Author

TerryE commented Mar 8, 2016

What is different from a standard Batch Lua is that with this there is a single call to Lua. An unprotected error results in program termination. A typical nodeMCU application is fragmented into many callback tasks. It would be a really nice development feature to be able to declare a node.atpanic(action_routine, type) so that uncaught errors could either be trapped by a user exception handler, or simply produce an error traceback and return you to the > prompt so that the developer can diagnose the error.

Note that this would be quite a complex change since you would want to turn off all further callbacks to prevent the application continuing in some indetermined way. Need to think about whether the advantages of doing this merit the effort involved.

@pjsg
Copy link
Member

pjsg commented Mar 8, 2016

(Warning: Stream of conciousness coming) To my way of thinking there are two modes of operation of the nodemcu. The first is during the development phase and the second during the operational (deployed) phase. I think that the requirements are different in these cases.

Development: On an unprotected error, you would like to be dropped into a debugger so that you can figure out what went wrong. This is beyond the capabilities of the platform, so generating a stack traceback is the first step. Having a way to run some lua code when this happens would be useful. I might (for example) dump out the globals table etc. After this runs, I think that I'm happy to have the platform restart.

Operation: Typically there isn't anything attached to the serial console (or those pins might actually be repurposed). However, I'm interested in how the platform fails and how often. I would report this (maybe using syslog or mqtt) to another server. Given the danger of doing stuff once the panic is triggered, it might make more sense to save the critical data somewhere and report it after the platform restarts.For example, the node.bootreason() returns information on a previous failure that can be reported after the restart.

Obviously, any of this functionality should be optional, but we could certainly document some design patterns and examples.

As a side note, it would be useful if the lua code could get access to the build information (e.g. the commit hash of the code, maybe the list of modules). This would allow the reporting to be more detailed.

@TerryE
Copy link
Collaborator Author

TerryE commented Mar 8, 2016

Any form of full implementation of the debug module is going to be fraught as a debug session is essentially synchronous to a given processing thread, but threads as such don't exist as a concept within the SDK. If we did have any form of interactive debug, then it would have to be some sort of diagnostic prologue to rebooting the chip because you would also need a global means of blocking any timer, network, GPIO cb's which would still continue to fire. I have been brooding about this one because it would be a really useful feature.

Even the ability to establish a default at panic handler would be useful both for dev and prod. Even if just to capture a debug.traceback() and to frame it up into a syslog UDP packet before restarting.

PS Edit: Sorry but it's difficult to post from a mobile phone when a passenger on a motorway!

@pjsg
Copy link
Member

pjsg commented Mar 9, 2016

I don't think that doing the debug module is easy (or maybe even possible). I would like the stack traceback, and I would like to be able to register a piece of LUA to be called on a panic. It won't stop the panic, but I can always log some more stuff. Yes, there may be cases when the callback faults as well (e.g. run out of memory), but you can't have everything. I'd like the panic message as an argument to this callback.

@TerryE
Copy link
Collaborator Author

TerryE commented Mar 9, 2016

.I'd like the panic message as an argument to this callback.

Notv sure if there is much point. There is only one Panic message and it's content is a subset of the debug.traceback. As to your wider point, agreed.

@TerryE
Copy link
Collaborator Author

TerryE commented Mar 13, 2016

I've been brooding about this one and what the best way to proceed is. Philip and I have also been batting around the issue of debugging development vs production, and I think that this play into this. The main complication with all this is that thinking all of this through is involved and so I don't want to rush into this. However, my current thinking is that

  • We add an extra API hook, say node.atpanic() which establishes a default handler for uncaught Lua errors.
  • The aim of this hook is to provide pre-restart diagnostics, returning from the hooked callback restarts the ESP. No resumption is facilitated.
  • The reason for this is that various library callbacks such as the net:on() and tmr.alarm() will continue to fire even after the hook has been activated because of an uncaught error, so the only safe thing to do is to have some sort of global off switch to prevent then carrying on with an application in an unknown state. At the moment these library wrappers typically retrieve the callback function details from the Lua registry and lua_call() it. I suggest that we add a new lua_tcall()which only makes the call if the trap hasn't fired.
  • I've already developed a library history following the ideas of @hvegh in Add a circular buffer cirbuf module #1062, but with a more flexible string circular buffer. As well as the add() method and a read(), the find(index) method returns the number of lines found and can be used to set the read pointer relative to the tail (e.g. +10) or to the head (e.g. -10) of the buffer. I think that I'll also add a dump() which will either dump the history to a UART, a SPIFFS file or as UDP packets to a syslog-style listener process on another server. This last will have to be a C function as doing this will involve SDK calback and we've disabled Lua ones.
  • This will also involve a small change to the Lua core as the current core code unwinds the Lua stack before calling the panic handler and we want access to the stack so don't want to do this.

This is really just a heads-up for comment and ideas. To give you an ideas of how this might work in an production environment we might do the following in init,lua:

do
  local b=history.buffer(4096, true) -- 4K wrap around buffer and collect times
  node.output(function(str) b:add(str) end) -- redirect output to wraparound logger
  node.atpanic(function() 
    print(debug.traceback())  -- do a Lua traceback
    b:find((b:find(0)<40) and 0 or -40)
    b:dump(22514, '192.16.172')  -- write last 40 or available lines to remote logger  
  end)  -- exiting the panic function reboots the ESP
end

Does this sounds like a useful addition? Of course, since I've used the object form, there is nothing to stop the application having multiple circular loggers, for example if one is also used for collecting history data. Likewise, there's no delete() or close() method as dereferencing the buffer will get it GCed.

@devyte
Copy link

devyte commented Mar 13, 2016

If I may make a suggestion, per discussion in #1085 with @devsaurus: Instead of node.atpanic(blah), consider node.on("panic", blah), for consistency with other callback setting interfaces (e.g.: socket:on() ).
There could be other node-wide callbacks to implement in the future (ottomh: vddlow, as in low batt, memlow, spacelow for spiffs space running out, etc).

@TerryE
Copy link
Collaborator Author

TerryE commented Mar 13, 2016

consider node.on("panic", blah)

@devyte Good suggestion. Will do :)

@pjsg
Copy link
Member

pjsg commented Mar 14, 2016

I like this. I do want to be able to do a complete RAM dump to the remote server. However, I don't see how to do this when the callbacks are disabled. Having said that, I don't want perfect to be the enemy of good -- and what you are proposing would be a big step forwards.

@TerryE
Copy link
Collaborator Author

TerryE commented Mar 14, 2016

Philip, it's only the Lua callbacks that we'd need to inhibit which is why the dump will have to be written in C. In my first version of history, I limited the record size to 255 bytes but now I've decided that the only limit is that the record size must be less than the buffer size which must be less than 32Kb.

BTW. Another one that I've gotten into the habit of doing is this which picks up any registry leaks:

for k,v in pairs(debug.getregistry()) do print(k,v) end

@devyte
Copy link

devyte commented Mar 14, 2016

If it's any help to anyone, I have an ESPlorer snippet button that does this:

do
local function listG(t)
    if t == nil then
       t = _G
       print("Globals:")
    end
    for k,v in pairs(t) do
        print(string.format("%-15s %s", tostring(k), tostring(v)))
    end
end

local function listRegistry()
     print("Registry:")
     for k,v in pairs(debug.getregistry()) do
        print(string.format("%-15s %s", tostring(k), tostring(v)))
     end
end

listG()
print("")
listRegistry()
end

I use it everyday to detect leaks in the globals and in the registry, and it's just a click of a mouse button. Only issue is when I get an output like this:

Registry:
1               function: 3fff1d18
3               6
4               9
5               7
6               4
7               8
8               3
0               5
_LOADLIB        romtable: 40261498
9               2
_LOADED         table: 3fff0cb8
crypto.hash     romtable: 4025f080
net.socket      romtable: 4025f92c
net.server      romtable: 4025f9dc

I don't understand those number pairs, I'm not using numbers as counters or anything anywhere in my code, so I can't figure out how I could be leaking them... but still, very useful!

listG() receives an optional argument. When given (i.e.: load the function manually into mem and call it from cmdline), it will list the contents of the arg t instead of _G.
E.g.:

> listG(encoder)
fromBase64      lightfunction: 4025422c
toBase64        lightfunction: 40254210
fromHex         lightfunction: 402541f4
toHex           lightfunction: 402541d8

@TerryE
Copy link
Collaborator Author

TerryE commented Mar 14, 2016

You have to look at the register and unregister API source, and the PiL write up. The registry is a standard Lua table that is not directly accessible from Lua but that the GC knows about. When a C routine registers a Lua variable, which could be a table or a function for example, then it is returns an index to entry whose value is this variable. The variable is now referenced, so the GC won't collect it as garbage. When it is unregistered the value of the entry is set to an integer, dereferencing the previous Lua variable. These integer values are reused (if you notice the integer->integer entries are a linked list starting at zero, and these are the ones that can be reused). You can do debug.getregistry()[1](), etc. to call a given entry!

BTW: I would just do:

function listG(t) for k,v in pairs(t or _G) print(k,v) end end

@TerryE
Copy link
Collaborator Author

TerryE commented Mar 15, 2016

I suggest that we add a new lua_tcall() which only makes the call if the trap hasn't fired.

It struck me (in the process of digging in some concrete spurs for fence posts -- it's funny what goes on in the subconscious) that since I am proposing to replace all lua_call() functions in the modules with a lua_tcall() then I don't need to modify the Lua core panic handling as this enables us to wrap the lua_pcall() functionality to implement this.

@jmattsson
Copy link
Member

On a purely instinctual level, the lua_call() -> lua_tcall() change feels wrong. Wouldn't we be better off saying that if you hit a panic, then we'll close interrupts, run the panic handler (if it exists), and then reboot?

@TerryE
Copy link
Collaborator Author

TerryE commented Mar 18, 2016

Johny your suggestion was my original idea, but this approach requires some difficult changes to the Lua core and limits the panic handler to UART output.

I switched to my last suggestion because the lua_tcall() enables you to do so much more but without any hairy mods to the core. You could add the debug.traceback() to the history buffers then dump the last N history records, or in a production environment to the same to a UDP syslog style listener. The user panic handler can be a user provided Lua function, but can't use any of the library calls which depends on (Lua) callbacks. Any routines that it calls can still be coded in C to use is own callbacks and tasks (which you would be need to dump the history to a UDP listener).

However, all of this is epilogue diagnostics; exiting the error handler function would trigger a reboot albeit possibly after a few seconds delay to allow the logger to complete any network logging.

@TerryE
Copy link
Collaborator Author

TerryE commented Mar 26, 2016

A couple of follow up Qs for feedback on my history module:

  • The logging method b:add() currently takes a scaler argument. I did think about process as list by passing it through string.format(), hence `b:add("Temp = %6.2f", temp) would have the obvious interpretation. _Now done_
  • It is worth including a b:save() method to roll the tail of the log either to SPIFFS or a UDP listener.

I would be interested in views.

@TerryE
Copy link
Collaborator Author

TerryE commented Mar 27, 2018

On a purely instinctual level, the lua_call() -> lua_tcall() change feels wrong. Wouldn't we be better off saying that if you hit a panic, then we'll close interrupts, run the panic handler (if it exists), and then reboot?

(And two years later I pick this up again, but with a new house built and lived in.) My previous comments about stuff in the registry is really a distraction. We have some facts:

  • When Lua code errors, this ultimately results in the RTS calling a luaD_throw(), and this takes one of two processing paths depending on L->errorJmp being NULL or not.
  • If not NULL then there is at least one protect call handler in the Lua call chain so the luaD_throw() does a C longjump into the error handler.
  • If NULL then it calls the atpanic routine which prints goodbye and reboots.
  • L->errorJmp is a linked list of error handlers that is maintained by the Lua call code in ldo.c Essentially, a lua_call() does not attach an error handler to the frame and lua_pcall() does.
  • We execute Lua code by pushing an LClosure and doing a lua_(p)call on it. A LClosure is a GCobject and so to avoid the luaGC reclaiming it, we need to keep this rooted, typically by it being a value in the Lua registry table.
  • There are three entry paths by which Lua code is executed
    1. The LUA_INIT path which is currently "@init.lua", that is the init.lua file in SPIFFS if it exists.
    2. An interactive command via the UART or node.input()
    3. A lua_call() in one of the many SDK callback functions in the modules.
  • Of these only (ii) has a error call handler to print out an error message; the others will messily panic and die.
  • There are currently 74 type (iii) calls over 34 modules, so whatever our implementation is we should either have
    • a simple 1-1 macro swap of the lua_call(L, n, 0) with a NODEMCU_CALL(L, n) and the macro does the necessary magic to wrap all this up, or
    • a major rewrite of the luaD_throw() to patch up the stack and call chain so that in effect the no call handler path is stitched up to substitute a panic traceback option.
  • Of these two, the second involves no changes to the modules but is technically difficult; the first is straight forward but involves a buld minor edit to modules which use callbacks.

Move vote is for a simple 1-1 macro swap. @pjsg, @jmattsson, if this is OK, then I will prepare the PR.

@pjsg
Copy link
Member

pjsg commented Mar 27, 2018

I'm actually quite happy with the current behavior -- if something fails in the app, then it panics and restarts. This protects you against all sorts of failures.

What would you do in the error case (e.g. if a net.socket:onconnected callback throws an error)?

@TerryE
Copy link
Collaborator Author

TerryE commented Mar 27, 2018

@pjsg you've got the wrong end of the stick here. I am not suggesting we change the behaviour; I am suggesting that we change the level of diagnostics that we emit. Take this somewhat contrived ESPlorer snippet:

loadstring([[
function fred(n,m)
  --
  print(m, n[m]())
end
]], "fred")()

loadstring([[
function dofred(t)
  for k,v in pairs(t) do
    fred(v,k)
  end
end]], "dofred")()

If I call this interactively, say dofred{heap=node,fsinfo = file, cpux = node} then I get the following diagnoistic:

heap	40416
[string "fred"]:3: attempt to call field '?' (a nil value)
stack traceback:
	[string "fred"]:3: in function 'fred'
	[string "dofred"]:3: in function 'dofred'
	stdin:1: in main chunktmr.alarm(0,3000,0,function() dofred{heap=node,fsinfo = file, cpux = node}  end)

but if wrap this with a function() end and call it on a timer alarm, I only get:

 heap	40440
PANIC: unprotected error in call to Lua API ([string "fred"]:3: attempt to call field '?' (a nil value))

 ets Jan  8 2013,rst cause:2, boot mode:(3,6) ...

whereas I am suggesting that it would be a lot more helpful to developer if we had something like:

 heap	40440
PANIC: unprotected error in call to Lua API ([string "fred"]:3: attempt to call field '?' (a nil value))
stack traceback:
	[string "fred"]:3: in function 'fred'
	[string "dofred"]:3: in function 'dofred'
	Library callback tmr.c:123

 ets Jan  8 2013,rst cause:2, boot mode:(3,6) ...

The full stack trace makes it far easier to analyse the error

@pjsg
Copy link
Member

pjsg commented Mar 27, 2018

I now 100% agree -- better diagnostics are always better!

@TerryE
Copy link
Collaborator Author

TerryE commented Mar 27, 2018

I can write a macro NODEMCU_CALL(L, n) and if I swap out the lua_call(L,n, 0) statements in all of our modules then the panic handler can do this, just like the interactive prompt does.

If we do this we strictly won't be calling panicking at all. The handler will return control if the execution was successful, or print the error and reboot the system (actually issue a break 0:0 which does the same thing is the debugger hook isn't installed).

My last Q is in the case of (i) above: do we think it better to reboot or to drop into the interactive prompt if init.lua throws an error?

@TerryE
Copy link
Collaborator Author

TerryE commented Jun 23, 2018

I still think that this one has legs. A lower priority TODO.

@stale
Copy link

stale bot commented Jul 21, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jul 21, 2019
@stale stale bot closed this as completed Aug 4, 2019
@TerryE
Copy link
Collaborator Author

TerryE commented Aug 4, 2019

Just out of interest I am doing this as part of the Lua 5.1 to Lua 5.3 alignment so we can let this close.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants