Skip to content

Conversation

@jodavies
Copy link
Collaborator

Here are a few ideas which may help with debugging in case of crashes, particularly for longer running code and crashes which other people can't reproduce. The first uses eu-addr2line to print a stack trace in Terminate. The second replaces Terminate everywhere with a macro which inserts file, line and function information into the error message, so one can see where exactly Terminate was called. The third retains debugging symbols in the form build. This "probably" doesn't have a measurable performance impact (needs to be tested), but it means you get a useful trace from form crashes, and not just vorm.

Any thoughts?

@coveralls
Copy link

coveralls commented May 23, 2024

Coverage Status

coverage: 49.275% (+0.03%) from 49.245%
when pulling 7877c98 on jodavies:backtrace
into e1289ac on vermaseren:master.

@jodavies
Copy link
Collaborator Author

jodavies commented Jun 4, 2024

In the current state, a FORM crash looks like:

FORM 4.1 (May 31 2024, v4.1-20131025-641-g7062bd7)  Run: Tue Jun  4 10:49:26 2024
    Symbol x;
    CFunction f;
    Local test = 1/x;
    Multiply replace_(x,0);
    .end
Division by zero during normalization
Program terminating at crash.frm Line 4 -->
Terminate called from normal.c:4135 (Normalize). Backtrace:
#0: Terminate at startup.c:1799:20
#1: Normalize at normal.c:4139:2
#2: Generator at proces.c:3272:20
#3: Generator at proces.c:4216:8
#4: Generator at proces.c:4216:8
#5: Processor at proces.c:406:11
#6: DoExecute at execute.c:858:67
#7: PreProcessor at pre.c:1041:17
#8: main at startup.c:1692:2
#9: __libc_start_call_main at libc_start_call_main.h:58:16
#10: call_init at libc-start.c:128:20
#10:  (inlined by) __libc_start_main_impl at libc-start.c:379:5
#11: _start at ??:0
  0.00 sec out of 0.29 sec

or

TFORM 4.1 (May 31 2024, v4.1-20131025-641-g7062bd7) 4 workers  Run: Tue Jun  4 10:51:03 2024
    Symbol x;
    CFunction f;
    Local test = 1/x;
    Multiply replace_(x,0);
    .end
Division by zero during normalization
Program terminating in thread 1 at crash.frm Line 4 -->
Terminate called from normal.c:4135 (Normalize). Backtrace:
#0: Terminate at startup.c:1799:20
#1: Normalize at normal.c:4138:2
#2: Generator at proces.c:3272:20
#3: Generator at proces.c:4216:8
#4: Generator at proces.c:4216:8
#5: RunThread at threads.c:1393:10
#6: start_thread at pthread_create.c:442:8
#7: __clone3 at clone3.S:83
  0.00 sec + 0.00 sec: 0.00 sec out of 0.23 sec

@jodavies
Copy link
Collaborator Author

I cleaned this up a bit more, eg, it doesn't print any functions below main (or start_thread for tform).

I don't see any performance regressions due to keeping debug symbols and frame pointers in the release build.

My opinion would be to always print the trace by default, but if anyone strongly opposes this I would add a setup parameter to control it.

@tueda
Copy link
Collaborator

tueda commented Jun 13, 2024

It doesn't look nice when eu-addr2line is not available...

TFORM 5.0.0-beta.1 (Jun 13 2024, v5.0.0-beta.1-69-g4c9b872) 4 workers  Run: Thu Jun 13 19:26:48 2024
    Symbol x;
    CFunction f;
    Local test = 1/x;
    Multiply replace_(x,0);
    .end
Division by zero during normalization
Program terminating in thread 1 at test.frm Line 4 -->
Terminate called from ../../sources/normal.c:4135 (Normalize)
Backtrace:
# 0: sh: 1: eu-addr2line: not found
# 1: sh: 1: eu-addr2line: not found
# 2: sh: 1: eu-addr2line: not found
# 3: sh: 1: eu-addr2line: not found
# 4: sh: 1: eu-addr2line: not found
# 5: sh: 1: eu-addr2line: not found
# 6: sh: 1: eu-addr2line: not found
# 7: sh: 1: eu-addr2line: not found
  0.00 sec + 0.00 sec: 0.00 sec out of 0.08 sec

I wonder if we really need to replace Terminate with TERMINATE. In principle, Terminate can be macronized so that it calls the original Terminate (renamed to, say, TerminateImpl) using the current source file name and the line number. (I'm not sure which is better.)

@jodavies
Copy link
Collaborator Author

jodavies commented Jun 13, 2024

For TERMINATE I was vaguely following the "convention" that macros have uppercase names, but in the form sources it is true that plenty of macros do not have this. For me it would be fine to rename TERMINATE as you suggest.

As for eu-addr2line, we could check for it (with which eu-addr2line? Edit: command can be used for this, and is POSIX?) and print a useful message that the user might want to install it.

@tueda
Copy link
Collaborator

tueda commented Jun 13, 2024

I think command -v is in POSIX.

Another question would be: do we want FORM to depend on glibc? backtrace is neither POSIX nor Linux system call.

@jodavies
Copy link
Collaborator Author

FORM already links libc from somewhere, I did not need to provide the linker with -lc. Whether this is actually a required link, I am not sure.

@tueda
Copy link
Collaborator

tueda commented Jun 13, 2024

I'm not referring to libc, but rather to glibc, which is usually linked to programs on GNU/Linux. But, there are other implementations like musl libc on Alpine Linux.

@jodavies
Copy link
Collaborator Author

I see, sorry. Then as I understand the "proper" way to do this is for the configure script to check for backtrace, and then we include this code only on platforms where it exists?

@tueda
Copy link
Collaborator

tueda commented Jun 13, 2024

Yes, I think so. Or, optionally, we can consider a new configure script switch like --enable-backtrace, which is automatically enabled if backtrace is available.

@tueda
Copy link
Collaborator

tueda commented Jun 13, 2024

About Terminate vs. TERMINATE, I would like to hear others' opinions.

@jodavies
Copy link
Collaborator Author

These should implement the changes we discussed. The last can be reverted if TERMINATE is preferred after all.

@jodavies
Copy link
Collaborator Author

jodavies commented Jun 14, 2024

I thought the last commit broke the static build but actually it went wrong before. But github gave it a green tick regardless...? Anyway it looks like https://sourceware.org/bugzilla/show_bug.cgi?id=15648

If I link with -lpthread instead of -pthread it works but I am not really sure what the difference is there.

@tueda
Copy link
Collaborator

tueda commented Jun 14, 2024

The multiple __lll_lock_wait_private issue will be resolved if we bump the Ubuntu version from 20.04 to 22.04, where the glibc version is 2.35.

@jodavies
Copy link
Collaborator Author

jodavies commented Jun 14, 2024

This is true, but the build will still be broken for users who want to do their own static build (for whatever reason) on older systems. Edit: though of course in this (rare?) situation one can just --disable-backtrace.

@jodavies
Copy link
Collaborator Author

jodavies commented Oct 7, 2024

I cleaned these patches up, but we still need to decide what to do about the CI fail on 20.04. If we bump the version here things will be fine. Then in case someone reports that they can't do a static build on 20.04 in the future, we'd have to advise to configure with --disable-backtrace.

I'd like some input on having the backtrace "enabled" by default: in principle "simple form script errors" look more scary. It could be hidden behind On backtrace;. But to some extent this defeats the purpose of adding it, which is to get a bit more information on crashes which run for weeks and are hard to reproduce (and maybe On backtrace; was forgotten about or a user didn't know it exists in the first place).

@tueda
Copy link
Collaborator

tueda commented Oct 7, 2024

Maybe --enable-backtrace=check (the default) should also check if it works with options used for TFORM.

@jodavies
Copy link
Collaborator Author

jodavies commented Oct 7, 2024

By this do you mean that if we are doing a static link, disable backtrace if libpthread contains __lll_lock_wait_private? It looks like on 24.04, this symbol is now no longer present in libpthread.

@tueda
Copy link
Collaborator

tueda commented Oct 7, 2024

No. If I understand correctly, we can create a test program and check if the program can be linked with pthreads by AC_LINK_IFELSE.

@jodavies
Copy link
Collaborator Author

jodavies commented Oct 7, 2024

A simple pthread program suffices, if you build as gcc -lc -pthread -static main.c

#include <stdio.h>
#include <pthread.h>

void *print_function(void *ptr) {
        char *message = (char*) ptr;
        printf("%s\n", message);
}

int main(void) {

        pthread_t thread1;
        int ret1;
        char *message = "Hi from pthread";
        ret1 = pthread_create(&thread1, NULL, print_function, (void*) message);
        pthread_join(thread1, NULL);

        return 0;
}

@jodavies
Copy link
Collaborator Author

jodavies commented Oct 9, 2024

At least on Ubuntu 20.04, if I simply omit -lc from the link command, it works fine with backtrace enabled.

@tueda
Copy link
Collaborator

tueda commented Oct 9, 2024

So using AC_CHECK_FUNC instead of AC_CHECK_LIB may be enough?

@jodavies
Copy link
Collaborator Author

jodavies commented Oct 9, 2024

Yes, or AC_SEARCH_LIBS works too. Let's see what the github runner makes of it...

@tueda
Copy link
Collaborator

tueda commented Oct 9, 2024

Would it be useful if FORM prints some stack information, at least function names, even if eu-addr2line is unavailable?

TFORM 5.0.0-beta.1 (Oct  9 2024, v5.0.0-beta.1-73-g524ccf2-dirty) 4 workers  Run: Wed Oct  9 14:08:46 2024
    Symbol x;
    CFunction f;
    Local test = 1/x;
    Multiply replace_(x,0);
    .end
Division by zero during normalization
Program terminating in thread 3 at test.frm Line 4 -->
Terminate called from ../../sources/normal.c:4135 (Normalize)
Backtrace:
# 0: (TerminateImpl+0x1f4) [0x55d9ed4c68c4]
# 1: (Normalize+0x3272) [0x55d9ed3c9f72]
# 2: (Generator+0xf6) [0x55d9ed465dd6]
# 3: (Generator+0x28c2) [0x55d9ed4685a2]
# 4: (Generator+0x28c2) [0x55d9ed4685a2]
# 5: (RunThread+0x550) [0x55d9ed51e530]
# 6: (+0x84c1e) [0x7f0a90008c1e]
# 7: (+0x10587c) [0x7f0a9008987c]
Please install eu-addr2line for readable stack information.
  0.00 sec + 0.00 sec: 0.00 sec out of 0.07 sec
        else {
            /* eu-addr2line not found */
            char **strings;
            strings = backtrace_symbols(stack, stacksize);
            MesPrint("Backtrace:");
            for ( int i = 0; i < stacksize && !stop; i++ ) {
                char *p = strings[i];
                while ( *p && *p != '(' ) p++;
                MesPrint("%#%2d: %s\n", i, p);
                /* Maybe check stopping conditions? */
            }
            MesPrint("Please install eu-addr2line for readable stack information.");
            free(strings);
        }

This requires -rdynamic in LINKFLAGS/DEBUGLINKFLAGS.

@jodavies
Copy link
Collaborator Author

jodavies commented Oct 9, 2024

Yes, this is nice (I seem to remember having trouble getting this output before, maybe I messed up the rdynamic flag or something.)

Here we can stop printing at main, but not at start_thread for tform, though we could stop at RunThread or RunSortBot instead, we don't actually need to know start_thread at pthread_create.c:447:8 I suppose.

@jodavies
Copy link
Collaborator Author

I don't observe any performance impact due to the new build options (at least for a test with mincer).

@tueda
Copy link
Collaborator

tueda commented Oct 11, 2024

Maybe addr2line (in binutils) is more commonly installed than eu-addr2line (in elfutils) on Linux? Then, we can try to use addr2line if eu-addr2line is unavailable.

diff --git a/sources/startup.c b/sources/startup.c
index 425a4d7..6865bda 100644
--- a/sources/startup.c
+++ b/sources/startup.c
@@ -45,6 +45,10 @@
 #endif
 #ifdef ENABLE_BACKTRACE
 	#include <execinfo.h>
+#ifdef LINUX
+	#include <stdint.h>
+	#include <inttypes.h>
+#endif
 #endif
 
 /*
@@ -1820,8 +1824,64 @@ VOID TerminateImpl(int errorcode, const char* file, int line, const char* functi
 				pclose(fp);
 			}
 		}
+#ifdef LINUX
+		else if ( !system("command -v addr2line > /dev/null 2>&1") ) {
+			/* Get the executable path. */
+			char exe_path[PATH_MAX];
+			{
+				ssize_t len = readlink("/proc/self/exe", exe_path, sizeof(exe_path) - 1);
+				if ( len != -1 ) {
+					exe_path[len] = '\0';
+				}
+				else {
+					goto backtrace_fallback;
+				}
+			}
+			/* Assume PIE binary and get the base address. */
+			uintptr_t base_address = 0;
+			{
+				char line[256];
+				FILE *maps = fopen("/proc/self/maps", "r");
+				if ( !maps ) {
+					goto backtrace_fallback;
+				}
+				/* See the format used by nommu_region_show() in fs/proc/nommu.c of the Linux source. */
+				if ( fgets(line, sizeof(line), maps) ) {
+					sscanf(line, "%" SCNxPTR "-", &base_address);
+				}
+				else {
+					fclose(maps);
+					goto backtrace_fallback;
+				}
+				fclose(maps);
+			}
+			MesPrint("Backtrace:");
+			for ( int i = 0; i < stacksize && !stop; i++ ) {
+				FILE *fp;
+				char cmd[PATH_MAX + 512];
+				uintptr_t addr = (uintptr_t)stack[i] - base_address;
+				MesPrint("%#%2d: %", i);
+				snprintf(cmd, sizeof(cmd), "addr2line -e \"%s\" -i -p -s -f -C 0x%" PRIxPTR, exe_path, addr);
+				fp = popen(cmd, "r");
+				while ( fgets(cmd, sizeof(cmd), fp) != NULL ) {
+					MesPrint("%s", cmd);
+					/* Don't show functions lower than "main" */
+					if ( strstr(cmd, "main") || strstr(cmd, "RunThread") || strstr(cmd, "RunSortBot") ) {
+						stop = 1;
+					}
+				}
+				pclose(fp);
+			}
+		}
+#endif
 		else {
 			/* eu-addr2line not found */
+#ifdef LINUX
+backtrace_fallback:
+#endif
 			char **strings;
 			strings = backtrace_symbols(stack, stacksize);
 			MesPrint("Backtrace:");
@@ -1835,7 +1895,11 @@ VOID TerminateImpl(int errorcode, const char* file, int line, const char* functi
 				}
 				/* Maybe check stopping conditions? */
 			}
+#ifdef LINUX
+			MesPrint("Please install addr2line or eu-addr2line for readable stack information.");
+#else
 			MesPrint("Please install eu-addr2line for readable stack information.");
+#endif
 			free(strings);
 		}
 #else

@jodavies
Copy link
Collaborator Author

This patch works nicely for me. I just added the same skip of "main" etc as in the eu-addr2line code.

I've just noticed that none of this works with parform, however, if there is a straightforward fix for that.

@jodavies
Copy link
Collaborator Author

jodavies commented Dec 1, 2024

For this, maybe we should use the value passed to Terminate to determine whether to print a trace or not. Then for simple script syntax errors, the backtrace spam can be avoided, and it only prints for more "serious problems"? While I like the idea for better bug reports generally, it is not so nice to get the back trace for, eg, missing semicolon or undefined symbol.

This means a review of every Terminate call, and also in some cases it might be a bit tricky since functions return an error condition and Terminate is called further up the stack rather than terminating immediately.

Also an option On backtrace; could force a backtrace in all Terminate conditions.

@jodavies
Copy link
Collaborator Author

The last commit implements what we discussed on the video call: add an "on backtrace;" option which is off by default for form, on by default for vorm.

To clean up for merge probably I should squash all commits here apart from the first, which can stay separate.

@jodavies jodavies force-pushed the backtrace branch 2 times, most recently from b243618 to a0e59b4 Compare March 3, 2025 11:58
@jodavies
Copy link
Collaborator Author

I think I want to do a bit more benchmarking of this before merging. I did not find any issues before, but it is perhaps concerning that the form and tform binaries go from ~2.5M to ~13M, due to not stripping the symbols.

Slightly more useful information is given in case of a crash outside
of a debugger.
jodavies added 2 commits June 5, 2025 11:16
If enabled, use eu-addr2line or addr2line to print a stack trace on
crash, for easier debugging of long-running job crashes.

This builds FORM with -g -fno-omit-frame-pointer, which leads to a
performance penalty of about 1%.

For this reason, the feature is disabled by default.
This is on by default for both form and vorm.
@jodavies
Copy link
Collaborator Author

jodavies commented Jun 7, 2025

After some more benchmarking, I would say that the build with debug information is ~1% slower. For this reason, now the feature is disabled by default. If compiled in (configure --enable-backtrace), it is enabled by default for both form and vorm (but can be turned off with Off backtrace;).

In terms of bug reporting it is not ideal to have it disabled by default, but probably users do not want an opt-out 1% performance loss for something that is not a bug fix.

@jodavies jodavies merged commit a976801 into form-dev:master Jun 9, 2025
56 checks passed
@jodavies
Copy link
Collaborator Author

jodavies commented Jun 9, 2025

For information: I've cherry-picked the first commit (Terminate macro) into 4.3.2 but not the actual backtracing commits.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants