Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support log rotation, don't append crash.log files but use per-peer. #1847

Merged
merged 3 commits into from
Aug 23, 2018

Conversation

rustyrussell
Copy link
Contributor

@rustyrussell rustyrussell commented Aug 15, 2018

This is based on ~~~#1846~~~ #1863 because I needed the start flag to get_node(), but it's basically the last two commits.

@cdecker
Copy link
Member

cdecker commented Aug 22, 2018

Rebased after applying #1863

@@ -56,7 +56,7 @@ def directory(request, test_base_dir, test_name):
# This uses the status set in conftest.pytest_runtest_makereport to
# determine whether we succeeded or failed.
if request.node.rep_call.outcome == 'passed':
shutil.rmtree(directory)
pass #shutil.rmtree(directory)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's probably not intended to be in this queue

logpath = os.path.join(l1.daemon.lightning_dir, 'logfile')
logpath_moved = os.path.join(l1.daemon.lightning_dir, 'logfile_moved')

# FIXME: I couldn't get super(TailableProc, l1.daemon).start() to work?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found super to be rather obscure. This should work however:

TailableProc.start(l1.daemon)

Since it'll take the classmethod start from TailableProc and passes in l1.daemon as self

lightningd/log.c Outdated
static void handle_sighup(int sig)
{
/* This may fail if we're hammered with SIGHUP. We don't care. */
if (write(signalfds[1], "", 1));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, this took me a while to figure out. This'll just write a 0x00 byte, since any C string is null-terminated, right?

lightningd/log.c Outdated
fd = open(logfile, O_WRONLY|O_CREAT, 0600);
}
/* We expect to be in config dir. */
snprintf(logfile, sizeof(logfile), "crash.log.%u", getpid());
Copy link
Member

@cdecker cdecker Aug 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if pid is the best identifier here, we might want to use something like strftime(buffer, 14, "%Y%m%d%H%M%S", tm_info); in order to easily identify the last crash or correlate a crash dump with external monitoring.

lightningd/log.c Outdated
if (write(signalfds[1], "", 1));
/* Writes a single 0x00 byte to the signalfds pipe. This may fail if
* we're hammered with SIGHUP. We don't care. */
if (write(signalfds[1], "", 1))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang will complain if this is not on a separate line to signal explicit opt-out of this case.

@cdecker
Copy link
Member

cdecker commented Aug 22, 2018

I took the liberty to add fixup! commits for the two suggestions in my comments. Other than these changes the PR is good, and if you're happy with the changes I proposed, feel free to squash and merge :-)

ACK 734b5f8

@cdecker
Copy link
Member

cdecker commented Aug 22, 2018

Actually Travis-CI keeps complaining about one of the two tests (it fails test_crashlog but the error message related to the fake-bitcoin-cli of test_logging, which is really strange).

/tmp/ltests-0xxoonjf/test_logging_1/lightning-1/fake-bitcoin-cli exec failed: No such file or directory
lightningd: Fatal signal 6 (version 36db31f)
lightningd/lightningd: libbacktrace: no debug info in ELF executable
lightningd/lightningd: libbacktrace: no debug info in ELF executable

This happens both before and after my datetime crashlog commit, and I can't figure out why...

@cdecker
Copy link
Member

cdecker commented Aug 22, 2018

This might actually be valgrind intercepting the crash and thus not producing the crashlog:

------------------------------- Valgrind errors --------------------------------
Valgrind error file: valgrind-errors.2382
==2382== Jump to the invalid address stated on the next line
==2382==    at 0x0: ???
==2382==    by 0x4C5B96: backtrace_full (backtrace.c:127)
==2382==    by 0x430F6B: crashdump (daemon.c:42)
==2382==    by 0x56634AF: ??? (in /lib/x86_64-linux-gnu/libc-2.23.so)
==2382==    by 0x47A658: brute_force_first (timer.c:211)
==2382==    by 0x47A7BE: get_first (timer.c:246)
==2382==    by 0x47A7D8: update_first (timer.c:251)
==2382==    by 0x47A829: timer_earliest (timer.c:264)
==2382==    by 0x46CC99: io_loop (poll.c:272)
==2382==    by 0x4152E8: main (lightningd.c:455)
==2382==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==2382== 
==2382== 
==2382== Process terminating with default action of signal 11 (SIGSEGV)
==2382==  Bad permissions for mapped region at address 0x0
==2382==    at 0x0: ???
==2382==    by 0x4C5B96: backtrace_full (backtrace.c:127)
==2382==    by 0x430F6B: crashdump (daemon.c:42)
==2382==    by 0x56634AF: ??? (in /lib/x86_64-linux-gnu/libc-2.23.so)
==2382==    by 0x47A658: brute_force_first (timer.c:211)
==2382==    by 0x47A7BE: get_first (timer.c:246)
==2382==    by 0x47A7D8: update_first (timer.c:251)
==2382==    by 0x47A829: timer_earliest (timer.c:264)
==2382==    by 0x46CC99: io_loop (poll.c:272)
==2382==    by 0x4152E8: main (lightningd.c:455)
--------------------------------------------------------------------------------

@rustyrussell
Copy link
Contributor Author

Backtrace does not play nicely with valgrind... We disable it in dev mode, but not in non-dev mode:

#if DEVELOPER
	/* Suppresses backtrace (breaks valgrind) */
	if (!getenv("LIGHTNINGD_DEV_NO_BACKTRACE"))
		backtrace_state = backtrace_create_state(argv0, 0, NULL, NULL);
#else
	backtrace_state = backtrace_create_state(argv0, 0, NULL, NULL);
#endif

I'll work around it in the test itself...

Closes: ElementsProject#1623
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
rustyrussell and others added 2 commits August 23, 2018 11:39
Someone had a 21GB crash.log, which doesn't help anyone!

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This should make it easier to identify the latest crash file and correlate
crashes with external monitoring tools.
@cdecker
Copy link
Member

cdecker commented Aug 23, 2018

The remaining test failure seems to be an instance of the flaky test tracked in #1866, restarting to see if it unflakes :-)

@cdecker
Copy link
Member

cdecker commented Aug 23, 2018

ACK d5425bc

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants