
SDK throws unhandled exception if OTEL Collector is not available #1447

Open
yahorsi opened this issue Dec 5, 2024 · 6 comments
Labels
bug Something isn't working

Comments

@yahorsi

yahorsi commented Dec 5, 2024

Steps to reproduce
Start an OpenTelemetry-instrumented app while the OTEL Collector is not available; the app cannot start.

What is the expected behavior?
App does not crash

What is the actual behavior?
App crashes
[screenshot attached]

@brettmc
Collaborator

brettmc commented Dec 5, 2024

Hi @yahorsi, are you certain that it crashed the app? The export failure exception has been captured and logged (and you can configure where the logs go).

I updated one of our example scripts here to use the batch span processor, and it seems to behave correctly:

# php examples/traces/exporters/otlp_http.php 
Starting OTLP example
OTLP example complete!  
OpenTelemetry: [error] Export failure [exception] Export retry limit exceeded [previous] cURL error 6: Could not resolve host: collector (see https://curl.haxx.se/libcurl/c/libcurl-errors.html) for http://collector:4318/v1/traces
#0 /usr/src/myapp/src/Contrib/Otlp/SpanExporter.php(38): OpenTelemetry\SDK\Common\Export\Http\PsrTransport->send('\n\x98\r\n\x9E\x06\n\e\n\thost....', NULL)
#1 /usr/src/myapp/src/SDK/Trace/SpanProcessor/BatchSpanProcessor.php(247): OpenTelemetry\Contrib\Otlp\SpanExporter->export(Array)
#2 /usr/src/myapp/src/SDK/Trace/SpanProcessor/BatchSpanProcessor.php(190): OpenTelemetry\SDK\Trace\SpanProcessor\BatchSpanProcessor->flush('shutdown', NULL)
#3 /usr/src/myapp/src/SDK/Trace/TracerSharedState.php(71): OpenTelemetry\SDK\Trace\SpanProcessor\BatchSpanProcessor->shutdown(NULL)
#4 /usr/src/myapp/src/SDK/Trace/TracerProvider.php(98): OpenTelemetry\SDK\Trace\TracerSharedState->shutdown(NULL)
#5 /usr/src/myapp/examples/traces/exporters/otlp_http.php(53): OpenTelemetry\SDK\Trace\TracerProvider->shutdown()
#6 {main}

# echo $?
0

The return code from the CLI is 0 (success). If you change the log destination, for example by dropping putenv('OTEL_PHP_LOG_DESTINATION=none'); into your code, it should suppress the logging of errors. Does that change whether or not you see a crash?

@brettmc
Collaborator

brettmc commented Dec 24, 2024

@yahorsi do you have a custom error handler that might be turning a NOTICE into an exception?
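
To illustrate the pattern being asked about (a hypothetical handler for illustration, not code from the SDK or from this app): an error handler like the following turns every notice or warning into a thrown ErrorException, so an error that a library merely reports can surface as an unhandled exception in the application.

set_error_handler(static function (int $severity, string $message, string $file, int $line): bool {
    // hypothetical strict handler: converts every reported error, including notices, into an exception
    throw new \ErrorException($message, 0, $severity, $file, $line);
});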

@smaddock

I am also getting this error, also noted in #1448.

E.g.,

Jan 11 12:58:05 staging php[20586]: OpenTelemetry: [error] Export failure [exception] Export retry limit exceeded [previous] cURL error 28: Operation timed out after 10000 milliseconds with 0 bytes received (see https://curl.haxx.se/libcurl/c/libcurl-errors.html) for http://localhost:4318/v1/logs
Jan 11 12:58:45 staging php[20586]: OpenTelemetry: [error] Export failure [exception] Export retry limit exceeded [previous] cURL error 28: Operation timed out after 10001 milliseconds with 0 bytes received (see https://curl.haxx.se/libcurl/c/libcurl-errors.html) for http://localhost:4318/v1/logs
Jan 11 12:59:26 staging php[20586]: OpenTelemetry: [error] Export failure [exception] Export retry limit exceeded [previous] cURL error 28: Operation timed out after 10000 milliseconds with 0 bytes received (see https://curl.haxx.se/libcurl/c/libcurl-errors.html) for http://localhost:4318/v1/logs
Jan 11 13:00:06 staging php[20586]: OpenTelemetry: [error] Export failure [exception] Export retry limit exceeded [previous] cURL error 28: Operation timed out after 10001 milliseconds with 0 bytes received (see https://curl.haxx.se/libcurl/c/libcurl-errors.html) for http://localhost:4318/v1/logs

My config is fairly simple, with no custom error handler:

$loggerProvider = \OpenTelemetry\SDK\Logs\LoggerProvider::builder()
    ->addLogRecordProcessor(
        new \OpenTelemetry\SDK\Logs\Processor\SimpleLogRecordProcessor(
            new \OpenTelemetry\Contrib\Otlp\LogsExporter(
                (new \OpenTelemetry\Contrib\Otlp\OtlpHttpTransportFactory())
                    ->create(
                        'http://localhost:4318/v1/logs',
                        'application/x-protobuf'
                    )
            )
        )
    )
    ->setResource(
        \OpenTelemetry\SDK\Resource\ResourceInfoFactory::emptyResource()
            ->merge(
                \OpenTelemetry\SDK\Resource\ResourceInfo::create(
                    \OpenTelemetry\SDK\Common\Attribute\Attributes::create([
                        \OpenTelemetry\SemConv\ResourceAttributes::SERVICE_NAME    => 'MyService',
                        \OpenTelemetry\SemConv\ResourceAttributes::SERVICE_VERSION => '1.0.0'
                    ])
                )
            )
    )
    ->build();

We run our PHP app as a daemon via a systemd service file, and I can POST to the collector via cURL from the shell without issue. As @brettmc mentioned in both issues, I could disable the logging to hide these errors, but from the wording I'm assuming the logs are being lost rather than retried, which is concerning.

PECL extensions:

  • opentelemetry 1.1.0
  • protobuf 4.29.3

Composer packages:

  • open-telemetry/api 1.2.0
  • open-telemetry/context 1.1.0
  • open-telemetry/exporter-otlp 1.2.0
  • open-telemetry/gen-otlp-protobuf 1.2.1
  • open-telemetry/opentelemetry-logger-monolog 1.0.0
  • open-telemetry/sdk 1.2.0
  • open-telemetry/sem-conv 1.27.1

@smaddock

After more testing: each "export failure" is reported as an error, not a notice. It also blocks our app for up to 40 seconds each time the export error occurs, so a process that normally takes ~6 seconds now takes ~5 minutes, which causes other async HTTP clients to time out, so our process does not even complete.

@brettmc
Collaborator

brettmc commented Jan 13, 2025

What's going on with your local collector? My understanding is that the collector was designed to be a high-availability, high-throughput solution to the problem of offloading telemetry data. For a shared-nothing PHP setup, it is usually a critical piece of infrastructure because each PHP process has to complete really quickly so as not to exhaust all your web server workers.

I'm assuming logs are being lost, not that it will keep trying, which is concerning

Yes, they'll be lost. We don't have a way to configure retries because the spec hasn't defined anything (e.g., see open-telemetry/opentelemetry-specification#3639). In this scenario, you likely don't want it to keep retrying forever.

I'd start by looking at your collector setup; in my experience, it's been very reliable. If for some reason you cannot get the collector working reliably, or you just don't trust it and want 100% telemetry collection, you could look at otlp/stdout exporting, or even revert to your preferred logging system for that signal. If you're happy with some manual configuration for the time being, you can also look into otlp/file (which is the same as otlp/stdout, but with one file per signal and an actual file instead of stdout).
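
For example, sending the logs signal as OTLP JSON to stdout instead of over HTTP might look roughly like this (a hedged sketch: it assumes the SDK's StreamTransportFactory and that the OTLP exporters accept an application/x-ndjson payload, which is how the otlp/stdout examples in the SDK repo are wired; verify against your installed version):

$transport = (new \OpenTelemetry\SDK\Common\Export\Stream\StreamTransportFactory())
    ->create('php://stdout', 'application/x-ndjson'); // or point at a real file for per-signal otlp/file output
$exporter = new \OpenTelemetry\Contrib\Otlp\LogsExporter($transport);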

I could disable the logging to hide these errors

Is the issue that the errors are directly interfering with your application, or just that they exist? If the former, you could send them to the php error_log instead, or some other destination that won't trigger an error_handler.
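
For example (assuming 'error_log' and 'none' are accepted values for OTEL_PHP_LOG_DESTINATION in your installed SDK version):

putenv('OTEL_PHP_LOG_DESTINATION=error_log'); // route SDK-internal errors to PHP's error_log
// or suppress them entirely:
putenv('OTEL_PHP_LOG_DESTINATION=none');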

How you configure otel to behave when the collector is unavailable is an important consideration. Clearly, waiting for the default 10s timeout and then retrying three times is very time-consuming. With a shared-nothing PHP webserver, each request is delayed and you'll likely run out of workers very quickly.
So, if waiting is not working, consider failing fast. Personally, I'd adjust the timeouts way down to <1s, because I'd rather lose telemetry than break the application.
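
Applied to the config above, failing fast might look roughly like this (a hedged sketch: it assumes the transport factory's create() still takes timeout, retry-delay, and max-retries arguments in these positions, as in the 1.x TransportFactoryInterface; check the signature of the SDK version you have installed):

$transport = (new \OpenTelemetry\Contrib\Otlp\OtlpHttpTransportFactory())
    ->create(
        'http://localhost:4318/v1/logs',
        'application/x-protobuf',
        [],    // headers
        null,  // compression
        0.5,   // timeout in seconds: fail fast instead of the 10s default
        100,   // delay between retries, in milliseconds
        0      // max retries: give up immediately when the collector is unreachable
    );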

@smaddock

Out of scope for this project, but for completeness: our proxy collector fails when a downstream collector is unreachable, so I filed an issue with the proxy repo to keep those problems separate: grafana/alloy#2424

For the immediate issue here, even after shortening the timeouts to 500 ms and disabling retries, our app is still timing out, so we may have to go back to logging to disk and scraping that.
