
feat(integration/airbyte): Airbyte source ingestion integration #11

Open · wants to merge 10 commits into master

Conversation

shubhamjagtap639 (Owner):

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@siddiquebagwan (Collaborator) left a comment:

Initial review comments

@config_class(AirbyteConfig)
@support_status(SupportStatus.CERTIFIED)
@capability(SourceCapability.PLATFORM_INSTANCE, "Enabled by default")
class AirbyteSource(Source):
Collaborator:

Enable stateful ingestion.

Owner Author:

Done
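For context, enabling stateful ingestion in a DataHub source generally means inheriting from the stateful base class and registering a stale-entity-removal processor. A minimal sketch, assuming the helper names used by recent datahub versions (the PR's actual wiring may differ):

```python
from typing import List, Optional

from datahub.ingestion.api.source import MetadataWorkUnitProcessor
from datahub.ingestion.source.state.stale_entity_removal_handler import (
    StaleEntityRemovalHandler,
)
from datahub.ingestion.source.state.stateful_ingestion_base import (
    StatefulIngestionSourceBase,
)


class AirbyteSource(StatefulIngestionSourceBase):
    # Assumes self.config and self.ctx are set in __init__ as usual.
    def get_workunit_processors(self) -> List[Optional[MetadataWorkUnitProcessor]]:
        return [
            *super().get_workunit_processors(),
            # Soft-deletes entities that disappeared from Airbyte since the last run.
            StaleEntityRemovalHandler.create(
                self, self.config, self.ctx
            ).workunit_processor,
        ]
```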

class AirbyteSource(Source):
"""
This plugin extracts airbyte workspace, connections, sources, destinations and jobs.
This plugin is in beta and has only been tested on PostgreSQL.
Collaborator:

Could you please add a description related to when to provide api_key vs username/password?

Owner Author:

Added that in the source config description.


@platform_name("Airbyte")
@config_class(AirbyteSourceConfig)
@support_status(SupportStatus.CERTIFIED)
Collaborator:

Change it to SupportStatus.INCUBATING

Owner Author:

Done

class AirbyteSourceConfig(StatefulIngestionConfigBase, DatasetSourceConfigMixin):
cloud_deploy: bool = pydantic.Field(
default=False,
description="Whether to fetch metadata from Airbyte Cloud or Airbyte OSS. For Airbyte Cloud provide api_key and for Airbyte OSS provide username/password",
Collaborator:

Enabled to fetch metadata from Airbyte Cloud. If it is enabled then provide api_key in the recipe. username & password are required for Airbyte OSS only.

Owner Author:

Done



class AirbyteSourceConfig(StatefulIngestionConfigBase, DatasetSourceConfigMixin):
cloud_deploy: bool = pydantic.Field(
Collaborator:

Add a method to validate that if cloud_deploy is true then api_key is present. Check the powerbi config class.

Owner Author:

Done
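For illustration, a minimal sketch of such a validator in the pydantic v1 style DataHub configs use (the api_key, username, and password field names are taken from the surrounding discussion, and the base class is simplified here):

```python
from typing import Optional

import pydantic


class AirbyteSourceConfig(pydantic.BaseModel):  # base class simplified for the sketch
    cloud_deploy: bool = False
    api_key: Optional[str] = None
    username: Optional[str] = None
    password: Optional[str] = None

    @pydantic.root_validator()
    def validate_auth_fields(cls, values: dict) -> dict:
        # Airbyte Cloud authenticates with an API key; OSS uses basic auth.
        if values.get("cloud_deploy"):
            if not values.get("api_key"):
                raise ValueError("api_key is required when cloud_deploy is true")
        elif not (values.get("username") and values.get("password")):
            raise ValueError("username and password are required for Airbyte OSS")
        return values
```

With this in place, a recipe that sets cloud_deploy: true without an api_key fails at config-parse time instead of mid-ingestion.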

config = AirbyteSourceConfig.parse_obj(config_dict)
return cls(config, ctx)

def get_workspace_workunit(
Collaborator:

Why this pass method?

Owner Author:

Removed pass and added code in that method



class OssAPIResolver(DataResolverBase):
BASE_URL = "http://localhost:8000/api/v1"
Collaborator:

Take this from config and keep this value as the default.

Owner Author:

Done
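The change being requested would look roughly like this: the URL moves into the config with the previously hardcoded value as its default (the base_url field name is an assumption):

```python
import pydantic


class AirbyteSourceConfig(pydantic.BaseModel):  # base class simplified for the sketch
    base_url: str = pydantic.Field(
        default="http://localhost:8000/api/v1",
        description="Airbyte OSS API base URL.",
    )


class OssAPIResolver:
    def __init__(self, config: AirbyteSourceConfig) -> None:
        # Replaces the hardcoded BASE_URL class attribute.
        self.base_url = config.base_url
```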

# If any entity does not support aspect 'status' then skip that entity from adding status aspect.
# For example, dataProcessInstance doesn't support the status aspect.
# If not skipped gives error: java.lang.RuntimeException: Unknown aspect status for entity dataProcessInstance
skip_urns.add(urn)
Collaborator:

We could use `continue` instead of collecting urns in the skip_urns array.

Owner Author:

At source_helpers.py:100 we are yielding all workunits that need to be ingested, hence we cannot have a continue statement at source_helpers.py:98.
We could add the continue logic after source_helpers.py:102, but there we would need to extract the entity type from the string urn.
I think the current changes are small and simple, so let's keep them; otherwise let me know and we can modify it.

Collaborator:

Let's discuss this on a call.
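To make the trade-off concrete, here is a simplified standalone sketch of the two-pass shape under discussion, with plain dicts standing in for DataHub workunits (the real helper lives in source_helpers.py):

```python
from typing import Dict, Iterable, Iterator

ENTITIES_WITHOUT_STATUS = {"dataProcessInstance"}


def auto_status_aspect(workunits: Iterable[Dict]) -> Iterator[Dict]:
    """Yield every workunit unchanged, then append a status aspect for
    each urn whose entity type supports it."""
    skip_urns = set()
    seen_urns = set()
    for wu in workunits:
        if wu["entity_type"] in ENTITIES_WITHOUT_STATUS:
            # A `continue` here would also drop the original workunit,
            # which still has to be ingested; hence the skip_urns set.
            skip_urns.add(wu["urn"])
        seen_urns.add(wu["urn"])
        yield wu
    for urn in seen_urns - skip_urns:
        yield {"urn": urn, "aspect": "status", "removed": False}
```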

self.config.connector_platform_details[connector_type]
)
else:
connector_platform_detail = PlatformDetail()
Collaborator:

Why this else and empty PlatformDetail?

Owner Author:

An empty PlatformDetail will take the default platform instance and env.
I will remove the else part; initialization of connector_platform_detail will be done prior to the if condition.
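The refactor described in the reply would look roughly like this, with the empty PlatformDetail (which falls back to the default platform instance and env) as the starting value instead of an else branch:

```python
# Initialize with defaults up front; no else branch needed.
connector_platform_detail = PlatformDetail()
if connector_type in self.config.connector_platform_details:
    connector_platform_detail = self.config.connector_platform_details[
        connector_type
    ]
```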


return DatasetUrn.create_from_ids(
platform_id=supported_data_platform[connector_type],
table_name=connector.name,
Collaborator:

Is this correct as per your concept mapping?

Owner Author:

Yes, correct. We decided to map source/destination as input/output datasets of the datajob entity.
I forgot to add it to the concept mapping table, but it is already mentioned in the Approach paragraph.
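That mapping is typically emitted through the DataJobInputOutput aspect, with the source dataset as the job's input and the destination dataset as its output. A sketch (the urn variables are placeholders built via DatasetUrn.create_from_ids as in the snippet above):

```python
from datahub.metadata.schema_classes import DataJobInputOutputClass

# Placeholder urns for the Connection's source and destination datasets.
input_output = DataJobInputOutputClass(
    inputDatasets=[str(source_dataset_urn)],        # Airbyte Source -> Dataset
    outputDatasets=[str(destination_dataset_urn)],  # Airbyte Destination -> Dataset
)
```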

logger.info(f"Processing workspace id: {workspace.workspace_id}")
yield from self.get_workspace_workunit(workspace)

def get_report(self) -> SourceReport:
Collaborator:

Check powerbi; add a test connection method.

Owner Author:

I already added test connection methods for both Cloud and OSS in airbyte/rest_api_wrapper/data_resolver.py, and they are used at airbyte_api.py:33.
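For reference, a hypothetical sketch of what the OSS-side test connection method could look like; the /health endpoint, auth scheme, and ValueError are assumptions for a self-contained example, not the PR's actual code in data_resolver.py:

```python
import requests


class OssAPIResolver:
    def __init__(self, base_url: str, username: str, password: str) -> None:
        self.base_url = base_url
        self.auth = (username, password)

    def test_connection(self) -> None:
        # Fail fast on unreachable hosts or bad credentials before ingestion starts.
        response = requests.get(f"{self.base_url}/health", auth=self.auth, timeout=10)
        if response.status_code == 401:
            raise ValueError("Configured credentials don't have required permissions.")
        response.raise_for_status()
```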


)
if response.status_code == 401:
raise ConfigurationError(
"Please check if provided api key is correct or not."
Collaborator:

Configured credentials don't have required permissions.

if response.status_code == 401:
raise ConfigurationError(
"Please check if provided api key is correct or not."
)
Collaborator:

@shubhamjagtap639 Please document what minimum permissions a user or API key needs to fetch metadata.

Owner Author:

There is no permission concept at the API level.

## Configuration Notes
1. Airbyte source is available for both Airbyte Cloud and Airbyte Open Source (OSS) users.
2. For Airbyte Cloud user need to provide api_key in recipe to ingest metadata and for Airbyte OSS username and password.
3. Refer Walkthrough demo [here](https://www.loom.com/share/7997a7c67cd642cc8d1c72ef0dfcc4bc) to create a api_key from [Developer Portal](https://portal.airbyte.com/) in case you are using Airbyte Cloud.

Collaborator:

For Airbyte Cloud refer demo here to create a api_key.

Owner Author:

Done

@@ -0,0 +1,16 @@
## Configuration Notes
1. Airbyte source is available for both Airbyte Cloud and Airbyte Open Source (OSS) users.
2. For Airbyte Cloud user need to provide api_key in recipe to ingest metadata and for Airbyte OSS username and password.

Collaborator:

Airbyte Cloud users need to provide api_key in the recipe for authentication and Airbyte OSS users need to provide username and password.

Owner Author:

Done

|--------------------------|-----------------------|
| `Workspace` | `DataFlow` |
| `Connection` | `DataJob` |
| `Sourc` | `Dataset` |

Collaborator:

Source

Owner Author:

Done

@@ -377,6 +377,7 @@ def get_long_description():
"powerbi-report-server": powerbi_report_server,
"vertica": sql_common | {"vertica-sqlalchemy-dialect[vertica-python]==0.0.1"},
"unity-catalog": databricks | sqllineage_lib,
"airbyte": {"requests"},

Collaborator:

requests is a common library, check if it might be already there in this setup.py

Owner Author:

No. If you look at the dbt-cloud and pulsar sources, they are also defined in the same way.

Collaborator:

ok


def _get_jobs_from_response(self, response: List[Dict]) -> List[Job]:
jobs: List[Job] = [
Job(

Collaborator:

What is the purpose of these if/else branches? Let's discuss this on a call.

Owner Author:

Airbyte Cloud and OSS responses contain some of the same metadata under different keys.
Instead of defining separate functions in both CloudApiResolver and OssApiResolver, I did this and defined one function in the abstract class.

Collaborator:

Please define two separate methods; eventually they will get called as per the instance.

Owner Author:

Done
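The agreed shape, one method per resolver dispatched by instance type, might look like this standalone sketch (the payload key names are illustrative, not the actual Airbyte API contract):

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Job:
    job_id: str
    status: str


class DataResolverBase(ABC):
    @abstractmethod
    def _get_jobs_from_response(self, response: List[Dict]) -> List[Job]:
        ...


class CloudApiResolver(DataResolverBase):
    def _get_jobs_from_response(self, response: List[Dict]) -> List[Job]:
        # Cloud-shaped payload: flat keys (illustrative).
        return [Job(job_id=str(j["jobId"]), status=j["status"]) for j in response]


class OssApiResolver(DataResolverBase):
    def _get_jobs_from_response(self, response: List[Dict]) -> List[Job]:
        # OSS-shaped payload: job fields nested under "job" (illustrative).
        return [
            Job(job_id=str(j["job"]["id"]), status=j["job"]["status"])
            for j in response
        ]
```

Each caller already holds the right resolver instance, so the correct parsing runs without any if/else.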

| `Destination` | `Dataset` |
| `Connection Job History` | `DataProcessInstance` |

Source and destination gets mapped with Dataset as an Input and Output of Connection.

Collaborator:

Source and destination are mapped to Dataset as an Input and Output of Connection.


from tests.test_helpers import mce_helpers


def enable_logging():

Collaborator:

Remove this; no need.

Owner Author:

Done
