Support log rotation, don't append crash.log files but use per-peer. #1847

rustyrussell · 2018-08-15T11:08:53Z

This is based on ~~#1846~~ #1863 because I needed the start flag to get_node(), but it's basically the last two commits.

cdecker · 2018-08-22T10:15:40Z

Rebased after applying #1863

cdecker · 2018-08-22T10:28:00Z

tests/fixtures.py

@@ -56,7 +56,7 @@ def directory(request, test_base_dir, test_name):
    # This uses the status set in conftest.pytest_runtest_makereport to
    # determine whether we succeeded or failed.
    if request.node.rep_call.outcome == 'passed':
-        shutil.rmtree(directory)
+        pass #shutil.rmtree(directory)


That's probably not intended to be in this queue

cdecker · 2018-08-22T10:29:52Z

tests/test_misc.py

+    logpath = os.path.join(l1.daemon.lightning_dir, 'logfile')
+    logpath_moved = os.path.join(l1.daemon.lightning_dir, 'logfile_moved')
+
+    # FIXME: I couldn't get super(TailableProc, l1.daemon).start() to work?


I find super rather obscure. This should work, however:

TailableProc.start(l1.daemon)

It looks up start directly on TailableProc (a plain function when accessed through the class) and passes l1.daemon in explicitly as self.
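
To illustrate why the direct call works where the super() call from the FIXME did not, here is a minimal Python sketch. The two classes are stand-ins: it assumes the daemon object's class derives from TailableProc and overrides start(), which may not match the real test utilities exactly.

```python
class TailableProc:
    """Stand-in for the base class; not the real test utility."""

    def start(self):
        return "TailableProc.start"


class LightningD(TailableProc):
    """Hypothetical subclass that overrides start()."""

    def start(self):
        return "LightningD.start"


daemon = LightningD()

# Looking start up on the class yields a plain function, so the instance
# has to be passed explicitly as self.
direct = TailableProc.start(daemon)

# super(cls, obj) searches the classes that come *after* cls in the MRO,
# so the first argument must be the subclass to reach TailableProc.start.
via_super = super(LightningD, daemon).start()

# Passing TailableProc itself skips TailableProc and finds no start() at all.
try:
    super(TailableProc, daemon).start()
    skipped = False
except AttributeError:
    skipped = True
```

Under these assumptions, super(TailableProc, l1.daemon) raises AttributeError because the lookup begins past TailableProc, which would explain the FIXME.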

cdecker · 2018-08-22T10:31:24Z

lightningd/log.c

+static void handle_sighup(int sig)
+{
+	/* This may fail if we're hammered with SIGHUP.  We don't care. */
+	if (write(signalfds[1], "", 1));


Ok, this took me a while to figure out. This'll just write a 0x00 byte, since any C string is null-terminated, right?
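
The guess in this comment can be checked without a C compiler. The sketch below uses Python's ctypes to build the same one-byte buffer that the literal "" occupies in C and pushes its first byte through a pipe. The pipe is a local stand-in, not lightningd's signalfds, and the variable names are made up for illustration.

```python
import ctypes
import os

# Equivalent of `char empty[] = "";` in C: zero characters plus the NUL
# terminator, so the buffer is exactly one byte long.
empty = ctypes.create_string_buffer(b"")
buffer_size = ctypes.sizeof(empty)

# Stand-in pipe (not lightningd's signalfds).
read_fd, write_fd = os.pipe()

# Equivalent of write(fd, "", 1): send the first byte of the buffer.
written = os.write(write_fd, empty.raw[:1])
received = os.read(read_fd, 1)

os.close(read_fd)
os.close(write_fd)
```

The reading side only needs to notice that a byte arrived; its value carries no information.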

cdecker · 2018-08-22T10:34:29Z

lightningd/log.c

-			fd = open(logfile, O_WRONLY|O_CREAT, 0600);
-		}
+	/* We expect to be in config dir. */
+	snprintf(logfile, sizeof(logfile), "crash.log.%u", getpid());


I'm wondering whether the pid is the best identifier here. We might want to use a timestamp instead, something like strftime(buffer, 15, "%Y%m%d%H%M%S", tm_info); (14 characters plus the terminating NUL), so that it's easy to identify the last crash or to correlate a crash dump with external monitoring.
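
As a rough illustration of the timestamp idea, here is a Python sketch using the same format string. The "crash.log." prefix and the helper name crash_log_name are assumptions for the example, not the naming the PR ended up with. In C, the buffer passed to strftime must hold the 14 output characters plus the terminating NUL, so at least 15 bytes.

```python
import time


def crash_log_name(when):
    """Return a hypothetical crash log name for a time.struct_time."""
    # %Y%m%d%H%M%S always expands to 14 digits, most significant first.
    return "crash.log." + time.strftime("%Y%m%d%H%M%S", when)


# Two example timestamps in UTC: the epoch and the time of this comment.
older = crash_log_name(time.gmtime(0))
newer = crash_log_name(time.gmtime(1534934069))
```

Because the fields run from year down to second with fixed widths, sorting the file names alphabetically also sorts them chronologically, which is what makes the latest crash easy to find.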

cdecker · 2018-08-22T10:57:39Z

lightningd/log.c

-	if (write(signalfds[1], "", 1));
+	/* Writes a single 0x00 byte to the signalfds pipe. This may fail if
+	 * we're hammered with SIGHUP.  We don't care. */
+	if (write(signalfds[1], "", 1))


clang will complain about the empty if body unless the lone ; is on a separate line, which signals that the empty statement is an explicit opt-out.

cdecker · 2018-08-22T13:24:23Z

I took the liberty of adding fixup! commits for the two suggestions in my comments. Other than these changes the PR is good; if you're happy with what I proposed, feel free to squash and merge :-)

ACK 734b5f8

cdecker · 2018-08-22T14:28:06Z

Actually, Travis-CI keeps complaining about one of the two tests: it fails test_crashlog, but the error message is related to the fake-bitcoin-cli of test_logging, which is really strange.

/tmp/ltests-0xxoonjf/test_logging_1/lightning-1/fake-bitcoin-cli exec failed: No such file or directory
lightningd: Fatal signal 6 (version 36db31f)
lightningd/lightningd: libbacktrace: no debug info in ELF executable
lightningd/lightningd: libbacktrace: no debug info in ELF executable

This happens both before and after my datetime crashlog commit, and I can't figure out why...

cdecker · 2018-08-22T14:31:56Z

This might actually be valgrind intercepting the crash and thus not producing the crashlog:

------------------------------- Valgrind errors --------------------------------
Valgrind error file: valgrind-errors.2382
==2382== Jump to the invalid address stated on the next line
==2382==    at 0x0: ???
==2382==    by 0x4C5B96: backtrace_full (backtrace.c:127)
==2382==    by 0x430F6B: crashdump (daemon.c:42)
==2382==    by 0x56634AF: ??? (in /lib/x86_64-linux-gnu/libc-2.23.so)
==2382==    by 0x47A658: brute_force_first (timer.c:211)
==2382==    by 0x47A7BE: get_first (timer.c:246)
==2382==    by 0x47A7D8: update_first (timer.c:251)
==2382==    by 0x47A829: timer_earliest (timer.c:264)
==2382==    by 0x46CC99: io_loop (poll.c:272)
==2382==    by 0x4152E8: main (lightningd.c:455)
==2382==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==2382== 
==2382== 
==2382== Process terminating with default action of signal 11 (SIGSEGV)
==2382==  Bad permissions for mapped region at address 0x0
==2382==    at 0x0: ???
==2382==    by 0x4C5B96: backtrace_full (backtrace.c:127)
==2382==    by 0x430F6B: crashdump (daemon.c:42)
==2382==    by 0x56634AF: ??? (in /lib/x86_64-linux-gnu/libc-2.23.so)
==2382==    by 0x47A658: brute_force_first (timer.c:211)
==2382==    by 0x47A7BE: get_first (timer.c:246)
==2382==    by 0x47A7D8: update_first (timer.c:251)
==2382==    by 0x47A829: timer_earliest (timer.c:264)
==2382==    by 0x46CC99: io_loop (poll.c:272)
==2382==    by 0x4152E8: main (lightningd.c:455)
--------------------------------------------------------------------------------

rustyrussell · 2018-08-22T23:31:07Z

Backtrace does not play nicely with valgrind... We disable it in dev mode, but not in non-dev mode:

#if DEVELOPER
	/* Suppresses backtrace (breaks valgrind) */
	if (!getenv("LIGHTNINGD_DEV_NO_BACKTRACE"))
		backtrace_state = backtrace_create_state(argv0, 0, NULL, NULL);
#else
	backtrace_state = backtrace_create_state(argv0, 0, NULL, NULL);
#endif

I'll work around it in the test itself...

cdecker · 2018-08-23T10:17:07Z

The remaining test failure seems to be an instance of the flaky test tracked in #1866, restarting to see if it unflakes :-)

cdecker · 2018-08-23T10:42:39Z

ACK d5425bc

rustyrussell requested a review from cdecker August 15, 2018 11:08

rustyrussell added this to the v0.6.1 milestone Aug 20, 2018

rustyrussell force-pushed the logfiles branch 2 times, most recently from d6be27f to f5d5624 Compare August 22, 2018 10:07

cdecker force-pushed the logfiles branch from f5d5624 to 4f9e23b Compare August 22, 2018 10:15

cdecker reviewed Aug 22, 2018

cdecker force-pushed the logfiles branch from d7f164d to 3ee85c2 Compare August 22, 2018 14:08

log: implement reopening log-file on SIGHUP

8528d07

Closes: ElementsProject#1623 Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

rustyrussell force-pushed the logfiles branch from 734b5f8 to 5120402 Compare August 22, 2018 23:38

rustyrussell and others added 2 commits August 23, 2018 11:39

logging: always dump a crash log, but make files per-pid.

1ade440

Someone had a 21GB crash.log, which doesn't help anyone! Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

log: Append the current time to the crash log filename

d5425bc

This should make it easier to identify the latest crash file and correlate crashes with external monitoring tools.

rustyrussell force-pushed the logfiles branch from 5120402 to d5425bc Compare August 23, 2018 02:09

cdecker mentioned this pull request Aug 23, 2018

Invalid write when shutting down with pending sendpay command #1866

Closed

cdecker merged commit 8f56d64 into ElementsProject:master Aug 23, 2018

cdecker mentioned this pull request Aug 23, 2018

Write logs to disk automatically #1623

Closed

cdecker mentioned this pull request Sep 2, 2018

Rotate crash files #1200

Closed

ZmnSCPxj mentioned this pull request Dec 2, 2020

SIGHUP only properly handled once; subsequent times kill lightningd #4240

Closed
