Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GEN-356] Use ServiceSpec for loading sources based on connectors #18322

Merged
merged 20 commits into from
Oct 24, 2024

Conversation

sushi30
Copy link
Contributor

@sushi30 sushi30 commented Oct 18, 2024

Introducing the OpenMetadata service specification.

Main things to look out for:

TODO

  • Implement the BigQuery system metric source class
  • run e2e tests (link to GA)

Describe your changes:

Fixes

I worked on ... because ...

Type of change:

  • Bug fix
  • Improvement
  • New feature
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation

Checklist:

  • I have read the CONTRIBUTING document.
  • My PR title is Fixes <issue-number>: <short explanation>
  • I have commented on my code, particularly in hard-to-understand areas.
  • For JSON Schema changes: I updated the migration scripts or explained why it is not needed.

- use source classes that can be overridden in system profiles
- use a manifest class instead of factory to specify which class to resolve for connectors
- example usage can be seen in redshift and snowflake
Copy link
Contributor

The Python checkstyle failed.

Please run make py_format and py_format_check in the root of your repository and commit the changes to this PR.
You can also use pre-commit to automate the Python code formatting.

You can install the pre-commit hooks with make install_test precommit_install.

@sushi30 sushi30 changed the title ref(profiler): use di for system profile [GEN-356] ref(profiler): use di for system profile Oct 18, 2024
Copy link

)
from metadata.utils.manifest import BaseManifest, get_class_path

RedshiftManifest = BaseManifest(profler_class=get_class_path(RedshiftProfiler))
Copy link
Contributor

@TeddyCr TeddyCr Oct 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am thinking since we control the framework and limit "meta" management of class import and what not when creating additional classes, can we just dynamically import the class based on the service type?

So if I need to add a new class to support some functionality just for Athena, then I just need to implement that class and I don't need to thinking about managing this manifest.py file for the specific source.

I think it will help with the goal of minimizing the amount of pieces you need to touch.

Copy link
Contributor Author

@sushi30 sushi30 Oct 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using a dynamic import approach for all our classes raises 2 issues I can think of:

  1. How does one implement a profiler for a CustomDatabase in this case?
    The idea behind this implementation is to have a default map at the service type level and allow the contributors to have more flexibility when developing other modules for their connectors.

  2. Our dynamic module resolution is not type safe. One such example is that BigQuery is sometimes resolved as bigquery, Bigquery, or BigQuery. A dynamic import path approach requires the user to have this knowledge and implementing each separate class with these conventions in mind. One can easily forget what is the right casing fora class name and use the wrong conventions when writing the class (for example BigQuerySource instead of BigquerySource). Without type-safety the only way we can such errors is at runtime. With the current approach there is a single point of failure (the manifest). The connectors themselves are contained within a system that can be validated independently.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi let's stick with strict imports here. It adds a bit of more "boilerplate" when we're developing a new connector but removes magical layers from the code

- used super() dependency injection in order for system metrics source
- formatting
- added docs for the new specification
- added some pylint ignores in the importer module
@sushi30
Copy link
Contributor Author

sushi30 commented Oct 22, 2024

@pmbrull re __init

I changed how the constructors are called to remove this. LMK if this addresses your comment.

- fixed postgres native lineage test
@sushi30 sushi30 changed the title [GEN-356] ref(profiler): use di for system profile [GEN-356] Use ServiceSpec for loading sources based on connectors Oct 23, 2024
@sushi30 sushi30 requested review from TeddyCr and pmbrull October 23, 2024 12:52
Copy link

@sushi30 sushi30 merged commit 95982b9 into main Oct 24, 2024
19 of 21 checks passed
@sushi30 sushi30 deleted the system-metrics-refactor branch October 24, 2024 05:47
harshach pushed a commit that referenced this pull request Nov 3, 2024
…8322)

* ref(profiler): use di for system profile

- use source classes that can be overridden in system profiles
- use a manifest class instead of factory to specify which class to resolve for connectors
- example usage can be seen in redshift and snowflake

* - added manifests for all custom profilers
- used super() dependency injection in order for system metrics source
- formatting

* - implement spec for all source types
- added docs for the new specification
- added some pylint ignores in the importer module

* remove TYPE_CHECKING in core.py

* - deleted valuedispatch function
- deleted get_system_metrics_by_dialect
- implemented BigQueryProfiler with a system metrics source
- moved import_source_class to BaseSpec

* - removed tests related to the profiler factory

* - reverted start_time
- removed DML_STAT_TO_DML_STATEMENT_MAPPING
- removed unused logger

* - reverted start_time
- removed DML_STAT_TO_DML_STATEMENT_MAPPING
- removed unused logger

* fixed tests

* format

* bigquery system profile e2e tests

* fixed module docstring

* - removed import_side_effects from redshift. we still use it in postgres for the orm conversion maps.
- removed leftover methods

* - tests for BaseSpec
- moved get_class_path to importer

* - moved constructors around to get rid of useless kwargs

* - changed test_system_metric

* - added linage and usage to service_spec
- fixed postgres native lineage test

* add comments on collaborative constructors
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Ingestion safe to test Add this label to run secure Github workflows on PRs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants