Log agent crashes to file and retrieve from console #335

ChrisMaddock · 2018-01-03T12:59:06Z

Agent crashes are often hard to debug, as the exception information is lost with the crashed process. If the console could display the exception stack trace, as it would if the tests were run in process, that would help debugging - especially of flaky cases where the crash isn't readily reproducible.

How about we use a temporary log file? The engine could pass the path to the agent on creation. We add an UnhandledExceptionEventHandler to the agent to write any exception to this file. If the console detects that the agent has crashed, it then looks for any content in the log file to write out to the user. (And of course, cleans up the log file after itself!)

Thoughts?
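A minimal sketch of the flow proposed above, written in Python rather than the agent's own C#, purely to illustrate the hand-off. The names `CRASHING_AGENT` and `run_agent` are hypothetical, and `sys.excepthook` stands in for the UnhandledExceptionEventHandler; none of this is NUnit code.

```python
import os
import subprocess
import sys
import tempfile

# Hypothetical agent: installs a last-chance hook that writes the exception to
# the file path passed as its first argument, then dies with an unhandled error.
CRASHING_AGENT = r'''
import sys, traceback

crash_log_path = sys.argv[1]

def write_crash(exc_type, exc, tb):
    with open(crash_log_path, "w", encoding="utf-8") as f:
        f.write("".join(traceback.format_exception(exc_type, exc, tb)))
    sys.__excepthook__(exc_type, exc, tb)  # keep the default stderr output

sys.excepthook = write_crash
raise RuntimeError("simulated agent crash")
'''


def run_agent(agent_source):
    """Play the engine/console role: create the temp file, launch the agent,
    read the file back if the agent failed, and always clean the file up."""
    fd, crash_log_path = tempfile.mkstemp(prefix="agent-crash-", suffix=".log")
    os.close(fd)
    try:
        result = subprocess.run(
            [sys.executable, "-c", agent_source, crash_log_path],
            capture_output=True,
            text=True,
        )
        crash_report = ""
        if result.returncode != 0:
            with open(crash_log_path, encoding="utf-8") as f:
                crash_report = f.read()
        return result.returncode, crash_report
    finally:
        os.remove(crash_log_path)


if __name__ == "__main__":
    exit_code, report = run_agent(CRASHING_AGENT)
    print(f"agent exit code: {exit_code}")
    print(report or "(no crash information recorded)")
```

The file is only a one-way channel from the dying process to its parent, so it does not depend on the remoting connection still being alive when the crash happens.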

CharliePoole · 2018-01-03T13:04:51Z

Why not use the existing agent log file?

ChrisMaddock · 2018-01-03T13:10:22Z

Because that only (currently) exists under the --trace=xxx option. It would also require an amount of additional processing, to work out which part of the log to present to the user, and whether to then clean up the log or not, based on whether the user has manually asked for it or not. And then there's the case of refactoring our logging code to create a just-in-time file. (Something which I'd also considered - but a separate discussion!)

I envisaged this as a separate, temporary file - which is written temporarily and then deleted by the engine, and simply used as a way to pass information between processes, post-crash.

rprouse · 2018-01-03T13:20:12Z

It is an interesting idea. It wouldn't work when/if we ever get agents running on additional machines, but we haven't made much progress in that direction and it might help with some immediate issues. What are you thinking of logging? Just the crashing exception and stack trace? Is the idea that it will be another mechanism to pass crash information back to the console runner when the remoting channel fails so that we don't just display the socket exception? Not sure this would work for classes of exceptions that we cannot catch, which might limit its use.

Or are you thinking of logging more? If so, what?

ChrisMaddock · 2018-01-03T13:36:28Z

> It wouldn't work when/if we ever get agents running on additional machines

Fair. I imagine multiple agents on a single machine will remain the main usage however, which would make this worthwhile in my eyes. 🙂

> What are you thinking of logging? Just the crashing exception and stack trace? Is the idea that it will be another mechanism to pass crash information back to the console runner when the remoting channel fails so that we don't just display the socket exception? Not sure this would work for classes of exceptions that we cannot catch, which might limit its use.

Yes to everything.

My thinking was, if we can definitely get everything we can catch - then we can say with more confidence that an 'unrecorded' crash is likely e.g. a StackOverflowException. (For my benefit - what other kinds can't we catch?) And we can then supply some more useful information, e.g. try running --inprocess to debug.

I'm not totally sure how this will tie in with the timing shutdown issues Joseph has been looking at - but I'm hopeful they will be solved by an eventual move to a new communication method. We're suspicious however that there are also other problems - but they're proving flaky and difficult to pin down. I think there's room for us to do better at recording/reporting these errors, so we (or our users, if it's in their code!) might eventually be able to get them fixed.
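The reasoning in this comment about 'unrecorded' crashes could be sketched roughly as follows. This is Python for illustration only; `describe_agent_exit` and the message wording are hypothetical, and only the `--inprocess` hint comes from the comment itself.

```python
def describe_agent_exit(exit_code, crash_report):
    """Choose what the console tells the user after an agent process exits."""
    if exit_code == 0:
        return "Agent exited normally."
    if crash_report.strip():
        # The handler caught the exception, so it can be shown directly.
        return "Agent crashed with an unhandled exception:\n" + crash_report
    # Nothing was recorded: the failure was probably one the handler cannot
    # catch, so the best the console can do is point the user somewhere useful.
    return (
        "Agent crashed without recording an exception. This may be an "
        "uncatchable failure such as a StackOverflowException; "
        "try running with --inprocess to debug."
    )


if __name__ == "__main__":
    print(describe_agent_exit(1, ""))
```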

CharliePoole · 2018-01-03T13:40:11Z

I see. That makes sense.

However, in addition, I think you just pointed out a weakness in our existing trace logging. We should obviously be capturing unhandled exceptions and logging them as errors in every agent process we create. Not sure if we need a separate issue for that, however, since it would all be implemented in one handler.

Let's stick with the convention of using the process id in any file names we create and make sure to create them in the defined work directory along with everything else. In fact, I don't even think we should delete such a file, but just keep it sitting next to the log files.

I'm for doing this with exception info, including inner exceptions, first and then adding more info if it proves necessary. Since a single agent can have multiple AppDomains running tests, we should indicate which domain caused the crash in the report as well.
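As an illustration of the conventions suggested in this comment, here is a small Python sketch (not NUnit code). The file name pattern `agent-crash_<pid>.log`, the report layout, and all function names are hypothetical; Python's `__cause__` chain stands in for .NET inner exceptions, and the domain name is passed in as a plain string.

```python
import os
import tempfile


def crash_file_path(work_dir, pid):
    """Name the file after the process id and keep it in the work directory."""
    return os.path.join(work_dir, f"agent-crash_{pid}.log")


def format_crash_report(exc, domain_name):
    """Describe the exception and every inner exception, naming the domain."""
    lines = [f"Unhandled exception in domain '{domain_name}':"]
    depth = 0
    current = exc
    while current is not None:
        label = "" if depth == 0 else "Inner exception: "
        lines.append(f"{'  ' * depth}{label}{type(current).__name__}: {current}")
        current = current.__cause__  # assumed analogue of InnerException
        depth += 1
    return "\n".join(lines)


def write_crash_report(work_dir, exc, domain_name, pid=None):
    """Write the report next to the other logs and leave it there."""
    path = crash_file_path(work_dir, os.getpid() if pid is None else pid)
    with open(path, "w", encoding="utf-8") as f:
        f.write(format_crash_report(exc, domain_name) + "\n")
    return path


if __name__ == "__main__":
    try:
        try:
            raise ValueError("missing fixture")
        except ValueError as inner:
            raise RuntimeError("test run failed") from inner
    except RuntimeError as outer:
        with tempfile.TemporaryDirectory() as work_dir:
            report_path = write_crash_report(work_dir, outer, "test-domain-1")
            with open(report_path, encoding="utf-8") as f:
                print(f.read())
```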

ChrisMaddock added the is:idea label Jan 3, 2018

ChrisMaddock mentioned this issue Feb 11, 2018

Nunit.Console 3.8 - Socket Exception #370

Closed

JVimes mentioned this issue Aug 2, 2019

Planning: replacing remoting with custom protocol #266

Closed

CharliePoole added the Needs Design label Mar 9, 2022

CharliePoole added this to the 4.0 milestone Mar 9, 2022

CharliePoole mentioned this issue Jul 15, 2022

Killing hung runner should output partial results to aid debugging #664

Closed

nunit locked and limited conversation to collaborators Dec 14, 2024

CharliePoole converted this issue into discussion #1530 Dec 14, 2024

CharliePoole added the Moved to Discussion label Dec 15, 2024

CharliePoole modified the milestones: 4.0, 4.0.0-beta.1 Dec 29, 2024
