This issue was moved to a discussion.

Log agent crashes to file and retrieve from console #335

Closed
ChrisMaddock opened this issue Jan 3, 2018 · 5 comments

Comments

@ChrisMaddock
Member

Agent crashes are often hard to debug, because the exception information is lost with the crashed process. If the console could display the exception stack trace, as it would if the tests were run in process, that would help debugging, especially for flaky cases where the crash isn't readily reproducible.

How about we use a temporary log file? The engine could pass the path to the agent on creation. We add an UnhandledExceptionEventHandler to the agent to write any exception to this file. If the console detects that the agent has crashed, it then looks for any content in the log file to write out to the user. (And of course, cleans up the log file after itself!)
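A minimal sketch of the agent side of this idea, not actual NUnit code. The class name `CrashLogSketch` and the `crashLogPath` parameter are hypothetical; the path is assumed to be whatever the engine hands to the agent at launch.

```csharp
using System;
using System.IO;

static class CrashLogSketch
{
    // Hypothetical entry point: crashLogPath is assumed to be supplied
    // by the engine when it creates the agent process.
    public static void Install(string crashLogPath)
    {
        AppDomain.CurrentDomain.UnhandledException += (sender, e) =>
        {
            try
            {
                // ExceptionObject is typed as object. For an Exception, the
                // string form includes the message, the stack trace and any
                // inner exceptions. Convert.ToString returns "" for null.
                File.WriteAllText(crashLogPath, Convert.ToString(e.ExceptionObject));
            }
            catch
            {
                // The process is already terminating; a failed write should
                // not replace the original crash with a second one.
            }
        };
    }
}
```

The handler only records the exception; it does not stop the process from terminating, so the console would still see the agent go away and could then read the file.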

Thoughts?

@CharliePoole
Member

Why not use the existing agent log file?

@ChrisMaddock
Member Author

Because that (currently) only exists under the --trace=xxx option. It would also require some additional processing to work out which part of the log to present to the user, and whether or not to clean up the log afterwards, depending on whether the user explicitly asked for it. And then there's the matter of refactoring our logging code to create a just-in-time file. (Something I'd also considered, but that's a separate discussion!)

I envisaged this as a separate, temporary file, which is written and then deleted by the engine, and simply used as a way to pass information between processes, post-crash.

@rprouse
Member

rprouse commented Jan 3, 2018

It is an interesting idea. It wouldn't work when/if we ever get agents running on additional machines, but we haven't made much progress in that direction and it might help with some immediate issues. What are you thinking of logging? Just the crashing exception and stack trace? Is the idea that it will be another mechanism to pass crash information back to the console runner when the remoting channel fails, so that we don't just display the socket exception? Not sure this would work for classes of exceptions that we cannot catch, which might limit its use.

Or are you thinking of logging more? If so, what?

@ChrisMaddock
Member Author

> It wouldn't work when/if we ever get agents running on additional machines

Fair. I imagine multiple agents on a single machine will remain the main usage however, which would make this worthwhile in my eyes. 🙂

> What are you thinking of logging? Just the crashing exception and stack trace? Is the idea that it will be another mechanism to pass crash information back to the console runner when the remoting channel fails so that we don't just display the socket exception? Not sure this would work for classes of exceptions that we cannot catch which might limit its use.

Yes to everything.

My thinking was, if we can definitely get everything we can catch - then we can say with more confidence that an 'unrecorded' crash is likely e.g. a StackOverflowException. (For my benefit - what other kinds can't we catch?) And we can then supply some more useful information, e.g. try running --inprocess to debug.
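To make that concrete, here is a rough sketch of what the console side could do once it has noticed that the agent is gone. `CrashReportSketch`, `Describe` and `crashLogPath` are hypothetical names, and the message wording is only an illustration.

```csharp
using System;
using System.IO;

static class CrashReportSketch
{
    // Hypothetical helper: assumed to be called by the runner after it has
    // detected that the agent process died.
    public static string Describe(string crashLogPath)
    {
        if (File.Exists(crashLogPath))
        {
            string details = File.ReadAllText(crashLogPath);
            if (details.Trim().Length > 0)
            {
                return "The agent crashed with an unhandled exception:"
                    + Environment.NewLine + details;
            }
        }

        // Nothing was recorded, so the crash was probably of a kind the
        // agent's handler never saw, e.g. a StackOverflowException.
        return "The agent crashed without recording an exception. "
            + "Try running with --inprocess to debug.";
    }
}
```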

I'm not totally sure how this will tie in with the shutdown timing issues Joseph has been looking at, but I'm hopeful they will be solved by an eventual move to a new communication method. We suspect, however, that there are also other problems, but they're proving flaky and difficult to pin down. I think there's room for us to do better at recording/reporting these errors, so we (or our users, if it's in their code!) might eventually be able to get them fixed.

@CharliePoole
Member

I see. That makes sense.

However, in addition, I think you just pointed out a weakness in our existing trace logging. We should obviously be capturing unhandled exceptions and logging them as errors in every agent process we create. Not sure if we need a separate issue for that, however, since it would all be implemented in one handler.

Let's stick with the convention of using the process id in any file names we create and make sure to create them in the defined work directory along with everything else. In fact, I don't even think we should delete such a file, but just keep it sitting next to the log files.

I'm for doing this with exception info, including inner exceptions, first and then adding more info if it proves necessary. Since a single agent can have multiple AppDomains running tests, we should indicate which domain caused the crash in the report as well.
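A small sketch of how the naming and report contents described here might look. The file name pattern `agent_crash_<pid>.log`, the class name and both method names are assumptions for illustration, not an existing NUnit convention.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Text;

static class CrashFileSketch
{
    // Builds a path in the work directory that includes the process id.
    // The "agent_crash_" prefix is a made-up example.
    public static string GetPath(string workDirectory)
    {
        int pid = Process.GetCurrentProcess().Id;
        return Path.Combine(workDirectory, "agent_crash_" + pid + ".log");
    }

    // Formats the exception and each inner exception in turn, and names
    // the AppDomain so the report shows which domain caused the crash.
    public static string Format(Exception ex, string domainName)
    {
        var sb = new StringBuilder();
        sb.AppendLine("AppDomain: " + domainName);
        for (Exception current = ex; current != null; current = current.InnerException)
        {
            sb.AppendLine(current.GetType().FullName + ": " + current.Message);
            sb.AppendLine(current.StackTrace);
        }
        return sb.ToString();
    }
}
```

Inside a handler, the domain name could be taken from `AppDomain.CurrentDomain.FriendlyName`.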
