Skip to content

Conversation

@bneradt
Copy link
Contributor

@bneradt bneradt commented Jan 6, 2026

When traffic_server exits normally (e.g., after regression tests complete), traffic_crashlog was incorrectly logging a crash because it detected its parent process had terminated. This happened because traffic_crashlog uses PR_SET_PDEATHSIG to wake up when traffic_server exits, but it couldn't distinguish between a crash (where crash_logger_invoke sends signal info via the pipe) and a normal exit (where the pipe is simply closed). This fix adds a poll() check on stdin to verify that crash data was actually sent before logging a crash, preventing false positive crash logs.

When traffic_server exits normally (e.g., after regression tests
complete), traffic_crashlog was incorrectly logging a crash because it
detected its parent process had terminated. This happened because
traffic_crashlog uses PR_SET_PDEATHSIG to wake up when traffic_server
exits, but it couldn't distinguish between a crash (where
crash_logger_invoke sends signal info via the pipe) and a normal exit
(where the pipe is simply closed). This fix adds a poll() check on stdin
to verify that crash data was actually sent before logging a crash,
preventing false positive crash logs.
@bneradt bneradt added this to the 10.2.0 milestone Jan 6, 2026
@bneradt bneradt self-assigned this Jan 6, 2026
@bneradt bneradt requested a review from serrislew January 6, 2026 17:26
@bneradt bneradt merged commit dd3455c into apache:master Jan 6, 2026
15 checks passed
@bneradt bneradt deleted the address_regression_test_crashlog branch January 6, 2026 20:02
bneradt added a commit to bneradt/trafficserver that referenced this pull request Jan 7, 2026
The poll() check from apache#12783 relied on POLLIN to detect crash data, but
this is unreliable - on some systems poll() sets POLLIN for EOF (socket
closed without data). Instead, we now verify crash data by actually
reading from stdin: if read() returns 0 bytes in wait_mode,
traffic_server exited normally and crash_logger_invoke was never called,
so we exit without logging. This definitively distinguishes a real crash
(data written then pipe closed) from a normal exit (pipe just closed).
bneradt added a commit to bneradt/trafficserver that referenced this pull request Jan 7, 2026
The poll() check from apache#12783 relied on POLLIN to detect crash data, but
this is unreliable - on some systems poll() sets POLLIN for EOF (socket
closed without data). Instead, we now verify crash data by actually
reading from stdin: if read() returns 0 bytes in wait_mode,
traffic_server exited normally and crash_logger_invoke was never called,
so we exit without logging. This definitively distinguishes a real crash
(data written then pipe closed) from a normal exit (pipe just closed).
bneradt added a commit to bneradt/trafficserver that referenced this pull request Jan 7, 2026
The poll() check from apache#12783 relied on POLLIN to detect crash data, but
this is unreliable - on some systems poll() sets POLLIN for EOF (socket
closed without data). Instead, we now verify crash data by actually
reading from stdin: if read() returns 0 bytes in wait_mode,
traffic_server exited normally and crash_logger_invoke was never called,
so we exit without logging. This definitively distinguishes a real crash
(data written then pipe closed) from a normal exit (pipe just closed).

(cherry picked from commit 9edd4df)
bneradt added a commit to bneradt/trafficserver that referenced this pull request Jan 7, 2026
The poll() check from apache#12783 relied on POLLIN to detect crash data, but
this is unreliable - on some systems poll() sets POLLIN for EOF (socket
closed without data). Instead, we now verify crash data by actually
reading from stdin: if read() returns 0 bytes in wait_mode,
traffic_server exited normally and crash_logger_invoke was never called,
so we exit without logging. This actually (as opposed to the previous
patch) distinguishes a real crash (data written then pipe closed) from a
normal exit (pipe just closed).
bneradt added a commit to bneradt/trafficserver that referenced this pull request Jan 7, 2026
The poll() check from apache#12783 relied on POLLIN to detect crash data, but
this is unreliable - on some systems poll() sets POLLIN for EOF (socket
closed without data). Instead, we now verify crash data by actually
reading from stdin: if read() returns 0 bytes in wait_mode,
traffic_server exited normally and crash_logger_invoke was never called,
so we exit without logging. This actually (as opposed to the previous
patch) distinguishes a real crash (data written then pipe closed) from a
normal exit (pipe just closed).

(cherry picked from commit 49d8702)
bneradt added a commit to bneradt/trafficserver that referenced this pull request Jan 7, 2026
The poll() check from apache#12783 relied on POLLIN to detect crash data, but
this is unreliable - on some systems poll() sets POLLIN for EOF (socket
closed without data). This tweak adds reading on the socket to make sure
that there is data being sent to verify that there was indeed a crash.
This actually (as opposed to the previous patch) distinguishes a real
crash (data written then pipe closed) from a normal exit (pipe just
closed).
bneradt added a commit to bneradt/trafficserver that referenced this pull request Jan 7, 2026
The poll() check from apache#12783 relied on POLLIN to detect crash data, but
this is unreliable - on some systems poll() sets POLLIN for EOF (socket
closed without data). Instead, we now verify crash data by actually
reading from stdin: if read() returns 0 bytes in wait_mode,
traffic_server exited normally and crash_logger_invoke was never called,
so we exit without logging. This actually (as opposed to the previous
patch) distinguishes a real crash (data written then pipe closed) from a
normal exit (pipe just closed).

(cherry picked from commit 49d8702)
bneradt added a commit that referenced this pull request Jan 8, 2026
bneradt added a commit that referenced this pull request Jan 8, 2026
This reverts commit dd3455c.

Addressing this turns out to be more complicated than it at first appeared. I realize after more testing that while this patch removed the "false" crash log reports, it broke crash log generally. And attempts at fixing that haven't been very fruitful. Maybe we should look into this in the future, but in the meantime I want to restore things to the previous state so that crash logs will at least be created when there is an actual crash.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants