fix: safe kafka partition extraction #872

Merged

Contributor

@nozik nozik commented Jan 24, 2022

Description

In some cases, which aren't consistently reproducible, the partition extraction of the kafka-python instrumentation fails. Since this isn't a crucial part of the instrumentation, we simply protect it with try/except to avoid a crash.

Fixes # (issue)

Type of change

  • Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

A simple try/except to prevent an issue that happened in our production environment.

Does This PR Require a Core Repo Change?

  • Yes. - Link to PR:
  • No.

Checklist:

  • Followed the style guidelines of this project
  • Changelogs have been updated
  • Unit tests have been added
  • Documentation has been updated

@nozik nozik requested a review from a team January 24, 2022 16:20
@nozik nozik removed their assignment Jan 25, 2022
        return instance._partition(
            topic, partition, key, value, key_bytes, value_bytes
        )
    except Exception as exception:  # pylint: disable=W0703
Contributor

How does this affect users? If this case does trigger, will we end up not recording a span, omitting some info from the span, or affecting the instrumented service or Kafka client in some way?

Contributor Author

It will affect the instrumented service, hence the need to protect it with try/except.

Contributor

I understand that. I'm asking how this fix affects users. Will it result in missing spans, missing attributes on spans, or something else entirely?

Contributor Author

Missing attribute - the partition

            topic, partition, key, value, key_bytes, value_bytes
        )
    except Exception as exception:  # pylint: disable=W0703
        _LOG.debug("Unable to extract partition: %s", exception)
Contributor

Should this be info/warn so we can actually find and fix the issue?

Contributor Author

Since the impact of not collecting this data is low, I would not want to "bother" the user with a warning. I can go with info, but debug is also ok - whatever you think is best.

Contributor

It would be nice to know what exactly in this function is brittle so we can fix it later instead of wrapping it in a try/except and forgetting about it forever.

Contributor Author

I definitely agree. The line that I've added (all_partitions = instance._metadata.partitions_for_topic(topic)) would've solved the crash, but I wanted to be on the safe side.
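A rough sketch of the kind of defensive check being discussed here, for illustration only. It assumes that the crash came from calling `_partition` while no partition metadata was known for the topic, and that `partitions_for_topic` returns `None` or an empty collection in that situation; neither assumption is confirmed in this thread. The helper name and the stub classes are hypothetical and exist only so the example runs without a Kafka client.

```python
def partition_if_metadata_known(
    instance, topic, partition, key, value, key_bytes, value_bytes
):
    """Hypothetical helper: skip the lookup when the topic has no known partitions."""
    all_partitions = instance._metadata.partitions_for_topic(topic)
    if not all_partitions:
        # Assumption: None or an empty set means metadata is not available yet.
        return None
    return instance._partition(
        topic, partition, key, value, key_bytes, value_bytes
    )


class _StubMetadata:
    """Hypothetical stand-in for the producer's cluster metadata."""

    def __init__(self, partitions_by_topic):
        self._partitions_by_topic = partitions_by_topic

    def partitions_for_topic(self, topic):
        # Returns None for unknown topics, mirroring the assumption above.
        return self._partitions_by_topic.get(topic)


class _StubProducer:
    """Hypothetical stand-in for a producer (not the kafka-python class)."""

    def __init__(self, partitions_by_topic):
        self._metadata = _StubMetadata(partitions_by_topic)

    def _partition(self, topic, partition, key, value, key_bytes, value_bytes):
        if partition is not None:
            return partition
        # Pick the lowest known partition; fails if the topic is unknown.
        return min(self._metadata.partitions_for_topic(topic))


producer = _StubProducer({"orders": {0, 1, 2}})
known = partition_if_metadata_known(
    producer, "orders", None, None, b"v", None, b"v"
)
unknown = partition_if_metadata_known(
    producer, "payments", None, None, b"v", None, b"v"
)
```

A targeted check like this handles one known failure mode explicitly, while the surrounding try/except still covers anything unexpected.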

Contributor

owais commented Jan 26, 2022

Please update the changelog.

Contributor

@owais owais left a comment

I think we should find out the real reason for the exceptions and add defensive checks around that, but in the spirit of getting this into the next release, we can merge as is.

Contributor Author

nozik commented Jan 26, 2022

@owais Do I need to update the changelog, even though this instrumentation hasn't been released yet?

Contributor

owais commented Jan 26, 2022

Yes, please update the changelog.

Contributor Author

nozik commented Jan 26, 2022

@owais Done

@owais owais merged commit ef7769c into open-telemetry:main Jan 28, 2022
@nozik nozik deleted the fix_kafka_partition_extraction_failure branch April 11, 2022 14:01