
SDK throws unhandled exception if OTEL Collector is not available #1447

Open
yahorsi opened this issue Dec 5, 2024 · 6 comments
Labels
bug Something isn't working

Comments

@yahorsi

yahorsi commented Dec 5, 2024

Steps to reproduce
Start an OpenTelemetry-instrumented app while the OTEL Collector is not available; the app cannot start.

What is the expected behavior?
App does not crash

What is the actual behavior?
App crashes
[screenshot attached]

@brettmc
Collaborator

brettmc commented Dec 5, 2024

Hi @yahorsi, are you certain that it crashed the app? The export failure exception has been captured and logged (and you can configure where the logs go).

I updated one of our example scripts here to use the batch span processor, and it seems to behave correctly:

# php examples/traces/exporters/otlp_http.php 
Starting OTLP example
OTLP example complete!  
OpenTelemetry: [error] Export failure [exception] Export retry limit exceeded [previous] cURL error 6: Could not resolve host: collector (see https://curl.haxx.se/libcurl/c/libcurl-errors.html) for http://collector:4318/v1/traces
#0 /usr/src/myapp/src/Contrib/Otlp/SpanExporter.php(38): OpenTelemetry\SDK\Common\Export\Http\PsrTransport->send('\n\x98\r\n\x9E\x06\n\e\n\thost....', NULL)
#1 /usr/src/myapp/src/SDK/Trace/SpanProcessor/BatchSpanProcessor.php(247): OpenTelemetry\Contrib\Otlp\SpanExporter->export(Array)
#2 /usr/src/myapp/src/SDK/Trace/SpanProcessor/BatchSpanProcessor.php(190): OpenTelemetry\SDK\Trace\SpanProcessor\BatchSpanProcessor->flush('shutdown', NULL)
#3 /usr/src/myapp/src/SDK/Trace/TracerSharedState.php(71): OpenTelemetry\SDK\Trace\SpanProcessor\BatchSpanProcessor->shutdown(NULL)
#4 /usr/src/myapp/src/SDK/Trace/TracerProvider.php(98): OpenTelemetry\SDK\Trace\TracerSharedState->shutdown(NULL)
#5 /usr/src/myapp/examples/traces/exporters/otlp_http.php(53): OpenTelemetry\SDK\Trace\TracerProvider->shutdown()
#6 {main}

# echo $?
0

The return code from the CLI is 0 (success). If you change the log destination, for example by dropping putenv('OTEL_PHP_LOG_DESTINATION=none'); into your code, it should suppress the logging of errors. Does that change whether or not you see a crash?

@brettmc
Collaborator

brettmc commented Dec 24, 2024

@yahorsi do you have a custom error handler that might be turning a NOTICE into an exception?
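
To illustrate the pattern being asked about (a hypothetical handler for illustration, not code from the SDK or from this app): an error handler like the following turns every notice or warning into a thrown ErrorException, so an error that a library merely reports can surface as an unhandled exception in the application.

set_error_handler(static function (int $severity, string $message, string $file, int $line): bool {
    // hypothetical strict handler: converts every reported error, including notices, into an exception
    throw new \ErrorException($message, 0, $severity, $file, $line);
});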

@smaddock

I am also getting this error, also noted in #1448.

E.g.,

Jan 11 12:58:05 staging php[20586]: OpenTelemetry: [error] Export failure [exception] Export retry limit exceeded [previous] cURL error 28: Operation timed out after 10000 milliseconds with 0 bytes received (see https://curl.haxx.se/libcurl/c/libcurl-errors.html) for http://localhost:4318/v1/logs
Jan 11 12:58:45 staging php[20586]: OpenTelemetry: [error] Export failure [exception] Export retry limit exceeded [previous] cURL error 28: Operation timed out after 10001 milliseconds with 0 bytes received (see https://curl.haxx.se/libcurl/c/libcurl-errors.html) for http://localhost:4318/v1/logs
Jan 11 12:59:26 staging php[20586]: OpenTelemetry: [error] Export failure [exception] Export retry limit exceeded [previous] cURL error 28: Operation timed out after 10000 milliseconds with 0 bytes received (see https://curl.haxx.se/libcurl/c/libcurl-errors.html) for http://localhost:4318/v1/logs
Jan 11 13:00:06 staging php[20586]: OpenTelemetry: [error] Export failure [exception] Export retry limit exceeded [previous] cURL error 28: Operation timed out after 10001 milliseconds with 0 bytes received (see https://curl.haxx.se/libcurl/c/libcurl-errors.html) for http://localhost:4318/v1/logs

My config is fairly simple, with no custom error handler:

$loggerProvider = \OpenTelemetry\SDK\Logs\LoggerProvider::builder()
    ->addLogRecordProcessor(
        new \OpenTelemetry\SDK\Logs\Processor\SimpleLogRecordProcessor(
            new \OpenTelemetry\Contrib\Otlp\LogsExporter(
                (new \OpenTelemetry\Contrib\Otlp\OtlpHttpTransportFactory())
                    ->create(
                        'http://localhost:4318/v1/logs',
                        'application/x-protobuf'
                    )
            )
        )
    )
    ->setResource(
        \OpenTelemetry\SDK\Resource\ResourceInfoFactory::emptyResource()
            ->merge(
                \OpenTelemetry\SDK\Resource\ResourceInfo::create(
                    \OpenTelemetry\SDK\Common\Attribute\Attributes::create([
                        \OpenTelemetry\SemConv\ResourceAttributes::SERVICE_NAME    => 'MyService',
                        \OpenTelemetry\SemConv\ResourceAttributes::SERVICE_VERSION => '1.0.0'
                    ])
                )
            )
    )
    ->build();

We run our PHP app as a daemon via a systemd service file, and I can POST to the collector via cURL from the shell without issue. As @brettmc mentioned in both issues, I could disable the logging to hide these errors, but from the wording I'm assuming the logs are being lost rather than retried, which is concerning.

PECL extensions:

  • opentelemetry 1.1.0
  • protobuf 4.29.3

Composer packages:

  • open-telemetry/api 1.2.0
  • open-telemetry/context 1.1.0
  • open-telemetry/exporter-otlp 1.2.0
  • open-telemetry/gen-otlp-protobuf 1.2.1
  • open-telemetry/opentelemetry-logger-monolog 1.0.0
  • open-telemetry/sdk 1.2.0
  • open-telemetry/sem-conv 1.27.1

@smaddock

After more testing: each "export failure" is reported as an error, not a notice. It also blocks our app for up to 40 seconds each time the export error occurs, so a process that normally takes ~6 seconds now takes ~5 minutes, which causes other async HTTP clients to time out, so our process does not even complete.

@brettmc
Collaborator

brettmc commented Jan 13, 2025

What's going on with your local collector? My understanding is that the collector was designed to be a high-availability, high-throughput solution to the problem of offloading telemetry data. For a shared-nothing PHP setup, it is usually a critical piece of infrastructure because each PHP process has to complete really quickly so as not to exhaust all your web server workers.

I'm assuming logs are being lost, not that it will keep trying, which is concerning

Yes, they'll be lost. We don't have a way to configure retries because the spec hasn't defined anything (e.g., see open-telemetry/opentelemetry-specification#3639). In this scenario, you likely don't want it to keep retrying forever.

I'd start by looking at your collector setup; in my experience, it's been very reliable. If for some reason you cannot get the collector working reliably, or you just don't trust it and want 100% telemetry collection, you could look at otlp/stdout exporting, or even revert to your preferred logging system for that signal. If you're happy with some manual configuration for the time being, you can also look into otlp/file (which is the same as otlp/stdout, but with one file per signal and an actual file instead of stdout).
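
For example, sending the logs signal as OTLP JSON to stdout instead of over HTTP might look roughly like this (a hedged sketch: it assumes the SDK's StreamTransportFactory and that the OTLP exporters accept an application/x-ndjson payload, which is how the otlp/stdout examples in the SDK repo are wired; verify against your installed version):

$transport = (new \OpenTelemetry\SDK\Common\Export\Stream\StreamTransportFactory())
    ->create('php://stdout', 'application/x-ndjson'); // or point at a real file for per-signal otlp/file output
$exporter = new \OpenTelemetry\Contrib\Otlp\LogsExporter($transport);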

I could disable the logging to hide these errors

Is the issue that the errors are directly interfering with your application, or just that they exist? If the former, you could send them to the php error_log instead, or some other destination that won't trigger an error_handler.
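
For example (assuming 'error_log' and 'none' are accepted values for OTEL_PHP_LOG_DESTINATION in your installed SDK version):

putenv('OTEL_PHP_LOG_DESTINATION=error_log'); // route SDK-internal errors to PHP's error_log
// or suppress them entirely:
putenv('OTEL_PHP_LOG_DESTINATION=none');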

How you configure otel to behave when the collector is unavailable is an important consideration. Clearly, waiting for the default 10s timeout and then retrying three times is very time-consuming. With a shared-nothing PHP webserver, each request is delayed and you'll likely run out of workers very quickly.
So, if waiting is not working, consider failing fast. Personally, I'd adjust the timeouts way down to <1s, because I'd rather lose telemetry than break the application.
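
Applied to the config above, failing fast might look roughly like this (a hedged sketch: it assumes the transport factory's create() still takes timeout, retry-delay, and max-retries arguments in these positions, as in the 1.x TransportFactoryInterface; check the signature of the SDK version you have installed):

$transport = (new \OpenTelemetry\Contrib\Otlp\OtlpHttpTransportFactory())
    ->create(
        'http://localhost:4318/v1/logs',
        'application/x-protobuf',
        [],    // headers
        null,  // compression
        0.5,   // timeout in seconds: fail fast instead of the 10s default
        100,   // delay between retries, in milliseconds
        0      // max retries: give up immediately when the collector is unreachable
    );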

@smaddock

Out of scope for this project, but for completeness: our proxy collector fails when a downstream collector is unreachable, so I filed an issue with the proxy repo to keep those problems separate: grafana/alloy#2424

For the immediate issue here, even after shortening the timeouts to 500 ms and disabling retries, our app is still timing out, so we may have to go back to logging to disk and scraping that.
