Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat: Add stats and performance logging #360

Merged
merged 8 commits into from
Sep 12, 2024
Merged

Conversation

aaronsteers
Copy link
Contributor

@aaronsteers aaronsteers commented Sep 12, 2024

Summary by CodeRabbit

  • New Features

    • Introduced a configuration hash for tracking changes and validating configuration integrity.
    • Added performance metrics logging capabilities, including structured logging for enhanced monitoring.
    • Enhanced telemetry information for sources and destinations with configuration hash tracking.
    • Implemented new methods for retrieving runtime information for connectors and writers.
    • Added structured telemetry logging for job-related information, improving clarity and organization.
  • Bug Fixes

    • Improved error handling in the hashing functions to prevent exceptions from unhashable values.
  • Documentation

    • Updated documentation to reflect new features and changes in telemetry and logging functionalities.

Copy link

coderabbitai bot commented Sep 12, 2024

Warning

Rate limit exceeded

@aaronsteers has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 19 minutes and 48 seconds before requesting another review.

How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

Commits

Files that changed from the base of the PR and between 95dead4 and c7217b1.

Walkthrough

Walkthrough

The pull request introduces several enhancements across multiple files, including the addition of methods for retrieving connector and writer runtime information, a new function for generating structured stream success messages, and the introduction of data classes for managing runtime information. It also simplifies telemetry handling and enhances logging capabilities for performance metrics. These changes collectively aim to improve the tracking of connector performance, configuration integrity, and overall logging functionalities within the Airbyte application.

Changes

File Change Summary
airbyte/_connector_base.py Added _get_connector_runtime_info method and config_hash property for tracking connector metadata and configuration hash.
airbyte/_message_iterators.py Introduced _new_stream_success_message function to create structured stream success messages.
airbyte/_util/connector_info.py Added RuntimeInfoBase, WriterRuntimeInfo, and ConnectorRuntimeInfo classes for managing connector runtime information and serialization.
airbyte/_util/telemetry.py Simplified telemetry handling by replacing SourceTelemetryInfo and DestinationTelemetryInfo with direct usage of ConnectorRuntimeInfo, removed one_way_hash function.
airbyte/_writers/base.py Introduced _get_writer_runtime_info method and config_hash property for tracking writer metadata and configuration hash.
airbyte/caches/base.py Added config_hash property for cache configuration tracking (currently unimplemented).
airbyte/logs.py Added get_global_stats_log_path() and get_global_stats_logger() functions for structured logging of performance metrics.
airbyte/progress.py Introduced _send_telemetry method for consolidating telemetry sending, and added _job_info property for structured job-related information.
tests/unit_tests/test_anonymous_usage_stats.py Updated telemetry test functions to utilize runtime information methods for improved accuracy in telemetry tracking.

Possibly related PRs

Suggested labels

enable-ai-review

Wdyt?


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

Share
Tips

Chat

There are 3 ways to chat with CodeRabbit:

‼️ IMPORTANT
Auto-reply has been disabled for this repository in the CodeRabbit settings. The CodeRabbit bot will not respond to your replies unless it is explicitly tagged.

  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai generate interesting stats about this repository and render them as a table.
    • @coderabbitai show all the console.log statements in this repository.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai or @coderabbitai title anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Outside diff range and nitpick comments (3)
airbyte/_util/hashing.py (1)

14-35: LGTM! Just a couple of suggestions to improve the function's readability and maintainability. WDYT?

The function looks great! It correctly handles different types of input objects, recursively hashes nested dictionaries and lists, and uses a secure hash algorithm (SHA-256) to generate one-way hashes. The use of a seed to ensure a unique domain of hashes is also a good practice.

I have a couple of suggestions to improve the function's readability and maintainability:

  1. Consider adding type hints for the return value and the input parameter. This will make it easier for other developers to understand the expected input and output of the function.
  2. Consider adding a docstring that explains the purpose of the function and the input and output parameters. This will make it easier for other developers to understand what the function does and how to use it.

Here's an example of how you could apply these suggestions:

+from typing import Any
+
 def one_way_hash(
-    obj: Mapping | list | object,
+    obj: Mapping[Any, Any] | list[Any] | object,
     /,
-) -> str:
+) -> str:
+    """
+    Generate a one-way hash of the given object.
+
+    Args:
+        obj: The object to hash. Can be a mapping, list, or any other object.
+
+    Returns:
+        A one-way hash of the given object as a string.
+    """
     ...
airbyte/logs.py (1)

213-267: Looks good! Just a minor suggestion.

The function correctly sets up a structured logger for performance metrics using the structlog library. It handles the case when the logging root or log file path is not valid and configures the logger with appropriate processors and a file handler. The implementation looks solid.

One minor suggestion: Consider adding a docstring to document the return type and behavior of the function, especially the case when it returns a no-op logger. Wdyt?

airbyte/_connector_base.py (1)

128-142: LGTM! Just a minor suggestion.

The config_hash property method looks good! It correctly handles the case when _config_dict is None and potential exceptions during hashing.

What do you think about adding a docstring to explain the purpose and behavior of this method? It could help with maintainability and readability.

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between ddbf6f4 and 71ba085.

Files selected for processing (5)
  • airbyte/_connector_base.py (2 hunks)
  • airbyte/_util/hashing.py (1 hunks)
  • airbyte/_util/telemetry.py (4 hunks)
  • airbyte/logs.py (1 hunks)
  • airbyte/progress.py (5 hunks)
Additional comments not posted (11)
airbyte/logs.py (1)

195-210: LGTM!

The function correctly handles the case when AIRBYTE_LOGGING_ROOT is None and attempts to create the logging directory. It also logs a warning if the directory creation fails and returns None. The implementation looks good to me.

airbyte/_util/telemetry.py (7)

48-50: LGTM!

The import statements have been updated correctly to reflect the removal of the one_way_hash function from this file and importing it from the airbyte._util.hashing module instead.


202-202: Looks good!

The addition of the config_hash attribute to the SourceTelemetryInfo dataclass is a valuable enhancement for tracking the configuration states of sources in the telemetry data.


216-216: LGTM!

Assigning the config_hash from the source object to the SourceTelemetryInfo instance in the from_source class method is the correct way to populate the newly added attribute.


227-227: Looks good!

Adding the config_hash attribute to the DestinationTelemetryInfo dataclass enables tracking the configuration states of destinations in the telemetry data, which is a valuable enhancement.


235-246: LGTM!

The updates to the from_destination class method of DestinationTelemetryInfo handle the cases when the destination parameter is None or a string by returning a DestinationTelemetryInfo instance with default values. This ensures consistency in the telemetry data.


248-254: Looks good!

The additional condition in the from_destination class method of DestinationTelemetryInfo handles the case when the destination parameter is an instance of Destination. It correctly populates the DestinationTelemetryInfo instance with the relevant attributes, including the config_hash, ensuring comprehensive telemetry data for destinations.


260-260: LGTM!

Populating the version attribute of the DestinationTelemetryInfo instance with the reported_version of the destination's executor is the correct approach. The explicit casting of destination.executor to the Executor type adds type safety and improves code readability.

airbyte/progress.py (3)

436-453: LGTM!

The _job_info property method is a nice addition that provides a structured way to retrieve job information. The use of asdict is a good choice for converting dataclass instances to dictionaries. The function also handles the case when self._source, self._cache, or self._destination is None by not including their information in the returned dictionary.


Line range hint 454-517: Excellent refactoring!

The changes to the _log_read_metrics method significantly improve the structure and readability of the logging functionality. The use of the _job_info property simplifies the code and provides a more organized way to include job information in the logs. The switch to using a BoundLogger for performance metrics is also a great choice as it enhances the clarity and organization of log entries.

The function correctly handles the case when no records have been read or the file logger is not available by returning early. Overall, these changes make it easier to capture and analyze job-related telemetry data.


Line range hint 28-56: LGTM!

The changes to the imports in the _get_status_message method are necessary to support the modifications made to the _log_read_metrics method. The use of asdict and the telemetry-related imports indicate that the status message generation now includes more detailed telemetry information.

These changes are consistent with the overall goal of improving the logging and telemetry capabilities of the Airbyte application. Nice work!

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Outside diff range and nitpick comments (5)
airbyte/_util/connector_info.py (1)

25-30: LGTM, but consider adding a docstring to the class.

The ConnectorRuntimeInfo class is correctly defined as a data class that inherits from RuntimeInfoBase. The name, executor_type, version, and config_hash attributes are properly defined with the correct types and default values.

Wdyt about adding a docstring to the class to provide a brief description of its purpose? This could help improve the code's readability and maintainability.

airbyte/_writers/base.py (1)

34-39: Looks good, but consider implementing config_hash.

The _get_writer_runtime_info method is implemented correctly and follows the expected return type. However, I noticed that the config_hash property currently returns None.

What do you think about implementing the config_hash property to return a meaningful hash of the writer's configuration? This would enable better tracking and logging of writer configurations. Let me know if you need any help with that!

airbyte/_message_iterators.py (1)

36-51: LGTM! Just a minor suggestion.

The _new_stream_success_message function looks good! It correctly constructs an AirbyteMessage object for a stream success message using the appropriate types and models. The docstring is also clear and concise.

One minor suggestion: Consider adding a type hint for the return value to make it explicit that the function returns an AirbyteMessage object. Wdyt?

def _new_stream_success_message(stream_name: str) -> AirbyteMessage:

Also, did you consider including additional details in the success message, such as the number of records processed or the processing time? That could provide more insights into the stream processing. What are your thoughts?

airbyte/_util/telemetry.py (1)

Line range hint 1-309: Overall, the changes look great! 🚀

The changes enhance the telemetry capabilities by simplifying the data structures used for tracking sources and destinations while removing redundant hashing logic. This is a nice improvement to the telemetry code.

Just a minor suggestion: What do you think about adding a brief comment explaining the purpose of the ConnectorRuntimeInfo and WriterRuntimeInfo classes? This could help future readers understand their role in the telemetry code. wdyt?

airbyte/logs.py (1)

218-272: Looks good! Just a minor suggestion.

The function is well-implemented and correctly sets up a structured logger for performance metrics using the structlog library. The logger configuration is appropriate, and it handles the scenarios where the logging root or log file path is invalid.

One minor suggestion: The directory creation logic at lines 249-257 seems to be a duplicate of the logic at lines 206-213 in the get_global_stats_log_path function. Since get_global_stats_log_path is already called before this block, the directory should already exist at this point. Wdyt about removing this duplicate logic to simplify the code a bit?

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 71ba085 and 06175c9.

Files selected for processing (8)
  • airbyte/_connector_base.py (3 hunks)
  • airbyte/_message_iterators.py (2 hunks)
  • airbyte/_util/connector_info.py (1 hunks)
  • airbyte/_util/telemetry.py (6 hunks)
  • airbyte/_writers/base.py (2 hunks)
  • airbyte/caches/base.py (1 hunks)
  • airbyte/logs.py (3 hunks)
  • airbyte/progress.py (8 hunks)
Additional comments not posted (20)
airbyte/_util/connector_info.py (2)

13-16: LGTM!

The RuntimeInfoBase class is correctly defined and the to_dict method is properly implemented to convert the class attributes to a dictionary, excluding the attributes with None values.


19-22: LGTM!

The WriterRuntimeInfo class is correctly defined as a data class that inherits from RuntimeInfoBase. The type and config_hash attributes are properly defined with the correct types and default values.

airbyte/_util/telemetry.py (10)

40-41: LGTM!

The imports are necessary for type hinting and casting.


48-51: LGTM!

The imports are necessary for using the ConnectorRuntimeInfo and WriterRuntimeInfo classes in the telemetry code.


52-52: LGTM!

The import is necessary for using the one_way_hash function in the telemetry code.


201-203: LGTM!

The changes to the function signature are consistent with the simplification of telemetry information for sources and destinations. This looks good to me!


224-225: LGTM!

The change is consistent with the simplification of telemetry information for sources. This looks good to me!


226-228: LGTM!

The change is consistent with the simplification of telemetry information for destinations. This looks good to me!


229-230: LGTM!

The change is consistent with the addition of cache information to the telemetry payload. This looks good to me!


267-268: LGTM!

The change is consistent with the simplification of telemetry information for sources. This looks good to me!


287-288: LGTM!

The change is consistent with the simplification of telemetry information for sources. This looks good to me!


303-303: LGTM!

The change is consistent with the simplification of telemetry information for sources. This looks good to me!

airbyte/logs.py (1)

200-215: LGTM!

The function logic is correct, and the implementation is accurate. It handles the directory creation failure scenario appropriately by logging a warning and returning None.

airbyte/_connector_base.py (2)

81-88: LGTM!

The _get_connector_runtime_info method looks good. It correctly constructs and returns a ConnectorRuntimeInfo object with the relevant metadata. Using the config_hash property to get the hash of the current configuration is a nice touch for tracking configuration integrity.


138-152: Looks good!

The config_hash property method is implemented correctly. It computes the hash of the current configuration using the one_way_hash function and handles edge cases appropriately by returning None when the config is not set or when an exception occurs during hashing. This ensures the method doesn't fail unexpectedly. Nice work!

airbyte/progress.py (5)

385-408: LGTM!

The _send_telemetry method looks good. It consolidates the telemetry sending process and is used appropriately in the other methods. Accessing the non-public APIs to get the runtime info for participants is acceptable for an internal method.


433-448: Looks good!

The _job_info property is a nice addition. It constructs a structured dictionary containing job-related information, which is used to improve the organization of logged performance metrics in the _log_read_metrics method. Accessing the non-public APIs to get the runtime info for participants is acceptable for an internal property.


Line range hint 450-512: Looks great!

The updates to the _log_read_metrics method look good. It now leverages the new _job_info property to improve the organization of logged performance metrics, replacing the previous approach of separately logging each component of the job description. The method logs the performance metrics to both the file logger and the global stats logger, which is a nice touch.


Line range hint 526-542: LGTM!

The updates to the log_success method look good. It now leverages the new _send_telemetry method to send telemetry data with the EventState.SUCCEEDED state and the total number of records read.


Line range hint 544-559: Looks good!

The updates to the log_failure method look good. It now leverages the new _send_telemetry method to send telemetry data with the EventState.FAILED state, the total number of records read, and the exception that caused the failure.

airbyte/_writers/base.py Show resolved Hide resolved
airbyte/caches/base.py Outdated Show resolved Hide resolved
@aaronsteers aaronsteers marked this pull request as ready for review September 12, 2024 20:47
@aaronsteers aaronsteers merged commit c5fea25 into main Sep 12, 2024
13 checks passed
@aaronsteers aaronsteers deleted the aj/feat/perf-log-file branch September 12, 2024 22:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant