Log agent crashes to file and retrieve from console #1530

ChrisMaddock · 2018-01-03T12:59:06Z

Agent crashes are often hard to debug, as the exception information is lost along with the crashed process. If the console could display the exception stack trace, as it would if the tests were run in process, that would help debugging - especially in flaky cases where the crash isn't readily reproducible.

How about we use a temporary log file? The engine could pass the path to the agent on creation. We add an UnhandledExceptionEventHandler to the agent to write any exception to this file. If the console detects that the agent has crashed, it then looks for any content in the log file to write out to the user. (And of course, cleans up the log file after itself!)

Thoughts?
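
To make the proposal concrete, here is a minimal sketch of the file handoff. It is written in Python purely as an illustration, not as NUnit code: `sys.excepthook` stands in for an UnhandledExceptionEventHandler, a child Python process stands in for the agent, and the names `AGENT_SOURCE`, `write_crash` and `run_agent_and_collect_crash` are made up for the example.

```python
import os
import subprocess
import sys
import tempfile
import textwrap

# Hypothetical agent: installs a last-chance hook that writes the
# unhandled exception to the path passed as its first argument.
AGENT_SOURCE = textwrap.dedent("""
    import sys, traceback

    crash_path = sys.argv[1]

    def write_crash(exc_type, exc, tb):
        with open(crash_path, "w", encoding="utf-8") as f:
            f.write("".join(traceback.format_exception(exc_type, exc, tb)))

    sys.excepthook = write_crash
    raise RuntimeError("simulated agent crash")
""")


def run_agent_and_collect_crash():
    """Play the engine/console role: launch the agent, then read the crash file."""
    # The "engine" creates the temporary file and passes its path to the agent.
    fd, crash_path = tempfile.mkstemp(suffix=".crash.log")
    os.close(fd)
    try:
        result = subprocess.run(
            [sys.executable, "-c", AGENT_SOURCE, crash_path],
            capture_output=True,
        )
        crash_text = ""
        # Only look in the file if the agent did not exit cleanly.
        if result.returncode != 0:
            with open(crash_path, encoding="utf-8") as f:
                crash_text = f.read()
        return result.returncode, crash_text
    finally:
        # The temporary file is cleaned up by the side that created it.
        os.remove(crash_path)
```

Calling `run_agent_and_collect_crash()` should give back a non-zero exit code together with the traceback text the child wrote before it died. A crash that never reaches the hook would leave the file empty.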

CharliePoole (Maintainer) · 2018-01-03T13:04:51Z

Why not use the existing agent log file?

ChrisMaddock (Author) · 2018-01-03T13:10:22Z

Because that (currently) only exists under the --trace=xxx option. It would also require some additional processing to work out which part of the log to present to the user, and whether to clean up the log afterwards, depending on whether the user explicitly asked for it. And then there's the matter of refactoring our logging code to create a just-in-time file. (Something which I'd also considered - but a separate discussion!)

I envisaged this as a separate, temporary file, which is written and then deleted by the engine, and simply used as a way to pass information between processes, post-crash.

rprouse (Maintainer) · 2018-01-03T13:20:12Z

It is an interesting idea. It wouldn't work when/if we ever get agents running on additional machines, but we haven't made much progress in that direction and it might help with some immediate issues. What are you thinking of logging? Just the crashing exception and stack trace? Is the idea that it will be another mechanism to pass crash information back to the console runner when the remoting channel fails, so that we don't just display the socket exception? Not sure this would work for classes of exceptions that we cannot catch, which might limit its use.

Or are you thinking of logging more? If so, what?

ChrisMaddock (Author) · 2018-01-03T13:36:28Z

> It wouldn't work when/if we ever get agents running on additional machines

Fair. I imagine multiple agents on a single machine will remain the main usage however, which would make this worthwhile in my eyes. 🙂

> What are you thinking of logging? Just the crashing exception and stack trace? Is the idea that it will be another mechanism to pass crash information back to the console runner when the remoting channel fails, so that we don't just display the socket exception? Not sure this would work for classes of exceptions that we cannot catch, which might limit its use.

Yes to everything.

My thinking was: if we can reliably record everything we can catch, then we can say with more confidence that an 'unrecorded' crash was likely caused by something we can't catch, e.g. a StackOverflowException. (For my benefit - what other kinds can't we catch?) We can then supply some more useful information, e.g. suggest running with --inprocess to debug.
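
As a rough sketch of that reasoning, the console-side decision could look something like the following. It is in Python for illustration only; the function name `describe_agent_exit` and the message wording are hypothetical, and the only detail taken from this thread is the --inprocess suggestion.

```python
def describe_agent_exit(exit_code, crash_text):
    """Return a message for the user, or None if the agent exited cleanly."""
    if exit_code == 0:
        return None
    if crash_text.strip():
        # The agent's handler ran, so we have the real exception to show.
        return "Agent crashed with an unhandled exception:\n" + crash_text
    # Nothing was recorded: the crash was probably of a kind the handler
    # never sees, so point the user at a way to investigate further.
    return (
        "Agent crashed without recording an exception. This may indicate "
        "an error that cannot be caught, e.g. a StackOverflowException. "
        "Try running with --inprocess to debug."
    )
```

The useful property here is that an empty crash file becomes informative in its own right, rather than just a missing result.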

I'm not totally sure how this will tie in with the shutdown timing issues Joseph has been looking at - but I'm hopeful they will be solved by an eventual move to a new communication method. We suspect, however, that there are also other problems - but they're proving flaky and difficult to pin down. I think there's room for us to do better at recording/reporting these errors, so we (or our users, if it's in their code!) might eventually be able to get them fixed.

CharliePoole (Maintainer) · 2018-01-03T13:40:11Z

I see. That makes sense.

However, in addition, I think you just pointed out a weakness in our existing trace logging. We should obviously be capturing unhandled exceptions and logging them as errors in every agent process we create. Not sure if we need a separate issue for that, however, since it would all be implemented in one handler.

Let's stick with the convention of using the process id in any file names we create and make sure to create them in the defined work directory along with everything else. In fact, I don't even think we should delete such a file, but just keep it sitting next to the log files.

I'm for doing this with exception info, including inner exceptions, first and then adding more info if it proves necessary. Since a single agent can have multiple AppDomains running tests, we should indicate which domain caused the crash in the report as well.
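
A minimal sketch of what such a report and file name might look like, again in Python for illustration only. The `agent_crash_<pid>.log` pattern and both function names are hypothetical. Python's `__cause__` chain stands in for .NET inner exceptions, and the domain name is passed in as a plain string because Python has no AppDomains.

```python
import os


def crash_file_path(work_dir, pid):
    """Build a per-process crash file path inside the work directory."""
    # Hypothetical naming pattern; the point is only that the process id
    # is part of the name and the file lives in the work directory.
    return os.path.join(work_dir, f"agent_crash_{pid}.log")


def format_crash_report(exc, domain_name):
    """Describe an exception and its chain of inner (cause) exceptions."""
    lines = [f"Domain: {domain_name}"]
    current = exc
    depth = 0
    while current is not None:
        label = "Exception" if depth == 0 else "Inner exception"
        lines.append(f"{label}: {type(current).__name__}: {current}")
        current = current.__cause__  # follow explicit "raise ... from ..." links
        depth += 1
    return "\n".join(lines)
```

For example, a `RuntimeError` raised from a `ValueError` would produce a three-line report: the domain, the outer exception, then the inner one. More detail, such as stack traces, could be added later if this proves insufficient.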
