Fix #1632 - Occasional Segfault with LongCounter instrument #1638

Merged
merged 8 commits into from
Sep 29, 2022

@lalitb (Member) commented Sep 27, 2022

Fixes #1632

Changes

The problem occurs when metric collection runs while measurements are being recorded on other threads (PeriodicExportingMetricReader creates a separate thread for collection).

Thread1:
Instrument::Record() -> SyncMetricStorage::Record() -> AttributeHashMap::GetOrSetDefault() -> mutex_lock -> map::insert()

Thread2:
SyncMetricStorage::Collect() -> std::move(AttributeHashMap) -> ...

While Thread1 is recording into the attribute hashmap, Thread2 can sometimes clear it (by moving it into a different hashmap), which causes Thread1 to try recording into invalid memory.
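
To make the failure mode concrete, here is a minimal sketch that replays the interleaving above on a single thread, so it is safe to run. It assumes the storage owns its map through a `std::unique_ptr`; `ToyAttributeMap`, `ToyStorage`, `TakeForCollection`, and `RecorderWouldSeeMissingMap` are hypothetical names for illustration, not the SDK's types. The point it illustrates is that a lock living inside the map cannot protect the owner's handle to the map.

```cpp
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for the attribute hashmap: attribute key -> running sum.
using ToyAttributeMap = std::unordered_map<std::string, long>;

// Hypothetical stand-in for a metric storage that owns its map via a pointer.
struct ToyStorage {
  std::unique_ptr<ToyAttributeMap> map = std::make_unique<ToyAttributeMap>();

  // What Thread2 does in this sketch: hand the current map to the collector.
  std::unique_ptr<ToyAttributeMap> TakeForCollection() { return std::move(map); }
};

// Replays the problematic ordering sequentially (no real threads involved).
// Returns true if the recorder's next write would find no map to write into.
bool RecorderWouldSeeMissingMap() {
  ToyStorage storage;
  (*storage.map)["route=/home"] += 1;            // Thread1: a normal Record()
  auto collected = storage.TakeForCollection();  // Thread2: Collect() moves the map away
  // Thread1's next Record() would dereference storage.map at this point.
  // A mutex inside the map does not help: the owning pointer itself is now empty.
  return storage.map == nullptr;
}
```

In a real multi-threaded run the move can also land in the middle of an insert, which is why the crash is only occasional.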

The solution is to remove the locks from the attribute hashmap and add them to SyncMetricStorage/AsyncMetricStorage wherever these read, update, or move the attribute hashmap,

i.e.,

Thread1:
Instrument::Record() -> SyncMetricStorage::Record() -> mutex_lock -> AttributeHashMap::GetOrSetDefault() -> map::insert()

Thread2:
SyncMetricStorage::Collect() -> mutex_lock -> std::move(AttributeHashMap) -> ...
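
The locking scheme described above can be sketched roughly as follows. This is an illustration under simplifying assumptions, not the code in this PR: `ToyAttributeMap`, `ToySyncStorage`, and `RunConcurrentDemo` are hypothetical names, the map is a plain `std::unordered_map` with no lock of its own, and a single mutex in the storage guards every read, update, and move of it.

```cpp
#include <mutex>
#include <string>
#include <thread>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for the attribute hashmap; it carries no lock itself.
using ToyAttributeMap = std::unordered_map<std::string, long>;

// Hypothetical storage: one mutex covers all access to the map.
class ToySyncStorage {
 public:
  void Record(const std::string &attributes, long value) {
    std::lock_guard<std::mutex> guard(lock_);
    map_[attributes] += value;  // lookup-or-insert and update, all under the lock
  }

  // Moves the current map out and leaves a fresh, empty one behind.
  ToyAttributeMap Collect() {
    std::lock_guard<std::mutex> guard(lock_);
    ToyAttributeMap collected = std::move(map_);
    map_ = ToyAttributeMap{};  // moved-from state is unspecified, so reset explicitly
    return collected;
  }

 private:
  std::mutex lock_;
  ToyAttributeMap map_;
};

// Runs recorder threads against a concurrently collecting thread and
// returns the sum of every value that was collected.
long RunConcurrentDemo(int recorders, int records_per_thread) {
  ToySyncStorage storage;
  long total = 0;

  std::vector<std::thread> threads;
  for (int i = 0; i < recorders; ++i) {
    threads.emplace_back([&storage, records_per_thread] {
      for (int n = 0; n < records_per_thread; ++n) storage.Record("route=/home", 1);
    });
  }

  // Collector thread: drain the storage repeatedly while recorders are running.
  std::thread collector([&storage, &total] {
    for (int round = 0; round < 100; ++round) {
      for (const auto &entry : storage.Collect()) total += entry.second;
    }
  });

  for (auto &t : threads) t.join();
  collector.join();

  // Pick up whatever was recorded after the collector's last round.
  for (const auto &entry : storage.Collect()) total += entry.second;
  return total;
}
```

Because `Record()` and `Collect()` take the same mutex, a move can never overlap an insert, and no measurement is lost: `RunConcurrentDemo(4, 10000)` accounts for all 40000 increments regardless of how the threads interleave.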

For significant contributions please make sure you have completed the following items:

  • CHANGELOG.md updated for non-trivial changes
  • Unit tests have been added
  • Changes in public API reviewed

@lalitb lalitb requested a review from a team September 27, 2022 19:05
codecov bot commented Sep 27, 2022

Codecov Report

Merging #1638 (78ce27f) into main (7278d83) will decrease coverage by 0.02%.
The diff coverage is 86.67%.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1638      +/-   ##
==========================================
- Coverage   85.10%   85.09%   -0.01%     
==========================================
  Files         159      159              
  Lines        4999     5001       +2     
==========================================
+ Hits         4254     4255       +1     
- Misses        745      746       +1     

| Impacted Files | Coverage Δ | |
|---|---|---|
| ...ntelemetry/sdk/metrics/state/sync_metric_storage.h | 63.16% <50.00%> (-1.54%) | ⬇️ |
| ...telemetry/sdk/metrics/state/async_metric_storage.h | 86.49% <100.00%> (+1.20%) | ⬆️ |
| ...entelemetry/sdk/metrics/state/attributes_hashmap.h | 95.84% <100.00%> (-0.94%) | ⬇️ |
| sdk/src/metrics/state/sync_metric_storage.cc | 100.00% <100.00%> (ø) | |
| sdk/src/trace/batch_span_processor.cc | 91.41% <0.00%> (+0.79%) | ⬆️ |

@marcalff (Member) left a comment

Good findings on the root cause.

Please see questions / comments on the fix.

@marcalff (Member) left a comment

LGTM, thanks for the fixes.

@lalitb lalitb merged commit 9e87a6e into open-telemetry:main Sep 29, 2022
yxue pushed a commit to yxue/opentelemetry-cpp that referenced this pull request Dec 5, 2022
Development

Successfully merging this pull request may close these issues.

Occasional Segfault with LongCounter instrument
2 participants