
Occasional Segfault with LongCounter instrument #1632

Closed · Ongy opened this issue Sep 23, 2022 · 6 comments · Fixed by #1638

Labels: bug (Something isn't working)

Comments
Ongy commented Sep 23, 2022

Describe your environment

  • Version: v1.6.1
  • Environment: Archlinux
  • Build System: cmake for opentelemetry, meson for application
  • gcc (GCC) 12.2.0
  • default_options: ['c_std=c2x', 'cpp_std=c++20'],

Steps to reproduce
I have a setup with multiple metric instruments.
The OpenTelemetry SDK is configured to export via the Prometheus integration, with the view/exporter setup largely following the example code.
The crash is nondeterministic and takes anywhere from a couple of seconds to over an hour to show up, with a counter being called a couple hundred times per second.

I'll try to build a smaller example to exhibit the error as well.
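
For illustration, here is a minimal sketch of the usage pattern described above (this is not my actual code, and the instrument/meter names are made up): a single thread calling one counter a few hundred times per second while the Prometheus pipeline collects in the background. CreateLongCounter() is assumed from the v1.6.x metrics API; the exporter/view setup is elided.

#include <chrono>
#include <thread>

#include <opentelemetry/metrics/provider.h>

int main()
{
  // Assumes the MeterProvider / Prometheus exporter pipeline has already been
  // installed via Provider::SetMeterProvider(), as in the repository's example code.
  auto provider = opentelemetry::metrics::Provider::GetMeterProvider();
  auto meter    = provider->GetMeter("conntracker", "1.0.0");
  auto counter  = meter->CreateLongCounter("connection_updates");

  for (;;)
  {
    counter->Add(1);  // hot path: a couple hundred calls per second, single-threaded
    std::this_thread::sleep_for(std::chrono::milliseconds(5));
  }
}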

What is the expected behavior?
No crashes.

What is the actual behavior?
Segfault in opentelemetry-cpp code.

Additional context
I've extracted a call stack and some more information:

std::__atomic_base<bool>::exchange(std::memory_order __m, std::__atomic_base<bool>::__int_type __i, std::__atomic_base<bool> * const this) (/usr/include/c++/12.2.0/bits/atomic_base.h:506)
std::atomic<bool>::exchange(std::atomic<bool> * const this, bool __i, std::memory_order __m) (/usr/include/c++/12.2.0/atomic:120)
opentelemetry::v1::common::SpinLockMutex::lock(opentelemetry::v1::common::SpinLockMutex * const this) (/usr/include/opentelemetry/common/spin_lock_mutex.h:101)
std::lock_guard<opentelemetry::v1::common::SpinLockMutex>::lock_guard(std::lock_guard<opentelemetry::v1::common::SpinLockMutex> * const this, std::lock_guard<opentelemetry::v1::common::SpinLockMutex>::mutex_type & __m) (/usr/include/c++/12.2.0/bits/std_mutex.h:229)
libopentelemetry_metrics.so!opentelemetry::v1::sdk::metrics::AttributesHashMap::GetOrSetDefault(opentelemetry::v1::sdk::common::OrderedAttributeMap const&, std::function<std::unique_ptr<opentelemetry::v1::sdk::metrics::Aggregation, std::default_delete<opentelemetry::v1::sdk::metrics::Aggregation> > ()>)(opentelemetry::v1::sdk::metrics::AttributesHashMap * const this, const opentelemetry::v1::sdk::metrics::MetricAttributes & attributes, std::function<std::unique_ptr<opentelemetry::v1::sdk::metrics::Aggregation, std::default_delete<opentelemetry::v1::sdk::metrics::Aggregation> >()> aggregation_callback) (/home/ongy/Source/aur/opentelemetry-cpp/src/opentelemetry-cpp-git/sdk/include/opentelemetry/sdk/metrics/state/attributes_hashmap.h:67)
libopentelemetry_metrics.so!opentelemetry::v1::sdk::metrics::SyncMetricStorage::RecordLong(opentelemetry::v1::sdk::metrics::SyncMetricStorage * const this, long value, const opentelemetry::v1::context::Context & context) (/home/ongy/Source/aur/opentelemetry-cpp/src/opentelemetry-cpp-git/sdk/include/opentelemetry/sdk/metrics/state/sync_metric_storage.h:55)
libopentelemetry_metrics.so!opentelemetry::v1::sdk::metrics::SyncMultiMetricStorage::RecordLong(opentelemetry::v1::sdk::metrics::SyncMultiMetricStorage * const this, long value, const opentelemetry::v1::context::Context & context) (/home/ongy/Source/aur/opentelemetry-cpp/src/opentelemetry-cpp-git/sdk/include/opentelemetry/sdk/metrics/state/multi_metric_storage.h:31)
libopentelemetry_metrics.so!opentelemetry::v1::sdk::metrics::LongCounter::Add(opentelemetry::v1::sdk::metrics::LongCounter * const this, long value) (/home/ongy/Source/aur/opentelemetry-cpp/src/opentelemetry-cpp-git/sdk/src/metrics/sync_instruments.cc:57)
handle_connection_update(TrackerState & state, uint64_t orig, uint64_t repl, const UUID & id, nf_conntrack * ct) (/home/ongy/Source/conntracker/collectors/conntrack/conntracker.cpp:404)
[...]

Stepping through with the debugger, I see that the this pointer is 0x0 in the GetOrSetDefault frame of the call stack, and the member lock_ passed to the guard is then at 0x38, which is still in the zero page.
What confuses me, though, is that one frame above I can see that this->attributes_hashmap_ has a non-zero value.
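
As a side note on why the fault looks like this (nothing opentelemetry-specific): when a member function is entered through a null object pointer, the crash is usually deferred until the first member access, and the faulting address is then the member's offset inside the class. The struct below is a made-up stand-in with a member placed 0x38 bytes into the object, mirroring the addresses seen above.

#include <cstddef>
#include <cstdio>

// Made-up stand-in for a class whose lock member sits at offset 0x38,
// mirroring lock_ showing up at address 0x38 when this == nullptr.
struct FakeHashMap
{
  char leading_members[0x38];  // hypothetical members preceding the lock
  int lock_;                   // stand-in for common::SpinLockMutex lock_
};

int main()
{
  // With a null object pointer, &obj->lock_ evaluates to the member offset,
  // so the lock_guard ends up trying to lock address 0x38 (still in the zero
  // page) instead of faulting at 0x0 directly.
  std::printf("offsetof(FakeHashMap, lock_) = %#zx\n", offsetof(FakeHashMap, lock_));
  return 0;
}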

Ongy added the bug label Sep 23, 2022
lalitb self-assigned this Sep 23, 2022
lalitb added this to the Metrics v1.0.0 (GA) milestone Sep 23, 2022

lalitb (Member) commented Sep 23, 2022

@Ongy - can you share your code, or tell us whether you are generating events across multiple threads?

Ongy (Author) commented Sep 23, 2022

I'm not sure yet whether I want to open up that code. Either way, I'll try to put together a minimal reproduction sample so it's not a pain to compile/run.

My side of the code is fully single-threaded. Events always come from the main task, and outside of the OpenTelemetry code there is nothing that starts threads (or forks).

Ongy (Author) commented Sep 23, 2022

https://github.com/Ongy/otlp-example exhibits this crash.
It's not very minimal yet, but it roughly matches the behaviour of my real code in terms of the spans/frequencies of updates.

Ongy (Author) commented Sep 23, 2022

@lalitb I reduced the linked example quite a bit.

A somewhat expected note: commenting out https://github.com/Ongy/otlp-example/blob/main/conntrack/conntracker.cpp#L76 helps.
But that removes all of the other threads/civetweb/prometheus as far as I can see.

lalitb (Member) commented Sep 23, 2022

But that removes all of the other threads/civetweb/prometheus as far as I can see.

Yes, commenting that out would break the pipeline and no metrics would be exported. Your code looks like a pretty simple scenario, but I suspect some issue in sync-storage here. Will debug this shortly. Thanks for reporting the issue with a minimal reproduction.
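
For context, a rough sketch of the kind of pipeline initialisation that line performs (based on the Prometheus example from this repository around the 1.6 line, not the reporter's code; class and option names here should be treated as assumptions since they have changed between releases). It is this wiring that starts the exporter/reader threads and the embedded HTTP endpoint:

#include <chrono>
#include <memory>

#include <opentelemetry/exporters/prometheus/exporter.h>
#include <opentelemetry/metrics/provider.h>
#include <opentelemetry/sdk/metrics/export/periodic_exporting_metric_reader.h>
#include <opentelemetry/sdk/metrics/meter_provider.h>

namespace metric_sdk       = opentelemetry::sdk::metrics;
namespace metrics_api      = opentelemetry::metrics;
namespace metrics_exporter = opentelemetry::exporter::metrics;

void InitMetricsPipeline()
{
  // Prometheus exporter; the URL is a hypothetical scrape endpoint.
  metrics_exporter::PrometheusExporterOptions exporter_opts;
  exporter_opts.url = "localhost:9464";
  std::unique_ptr<metric_sdk::PushMetricExporter> exporter{
      new metrics_exporter::PrometheusExporter(exporter_opts)};

  // Periodic reader that drives collection on its own thread.
  metric_sdk::PeriodicExportingMetricReaderOptions reader_opts;
  reader_opts.export_interval_millis = std::chrono::milliseconds(1000);
  reader_opts.export_timeout_millis  = std::chrono::milliseconds(500);
  std::unique_ptr<metric_sdk::MetricReader> reader{
      new metric_sdk::PeriodicExportingMetricReader(std::move(exporter), reader_opts)};

  // Install the SDK MeterProvider globally so instruments created through the
  // API record into this pipeline.
  auto provider = std::make_shared<metric_sdk::MeterProvider>();
  provider->AddMetricReader(std::move(reader));
  metrics_api::Provider::SetMeterProvider(provider);
}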

lalitb (Member) commented Sep 27, 2022

OK, I am able to reproduce the issue consistently using the shared sample, anywhere after 500K–2 million executions. I was suspecting sync-storage, but this seems to be different. Also, using std::mutex instead of common::SpinLockMutex doesn't fix the issue. Will be investigating further.
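
For reference, a sketch of the sort of driver loop implied by those numbers (counter creation as in the earlier sketch; the exact pointer type returned by the API is left generic on purpose):

#include <cstdint>

// Hammer the counter without sleeping; the crash reportedly shows up
// somewhere between roughly 500K and 2 million Add() calls.
template <typename CounterPtr>
void HammerCounter(CounterPtr &counter)
{
  for (std::uint64_t i = 0; i < 2000000; ++i)
  {
    counter->Add(1);
  }
}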
