
Dagster PoC for Metadata service #23989

Merged
merged 35 commits into master from bnchrch/poc-dagster-gcs-sensor on Mar 20, 2023

Conversation

bnchrch
Contributor

@bnchrch bnchrch commented Mar 13, 2023

Notes for reviewers

  1. The large line-change count corresponds to having the catalogs checked in as test resources

Video walk through

https://www.loom.com/share/60d78c1e6b74491186bfe2d49d4f459d

What

Closes #24008

This adds the beginning of the metadata service and the Dagster orchestrator.

How

Add a Dagster job that, on catalog change, outputs a list of unique dockerImage versions and the catalog(s) in which each is available.

Example here
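The job described above boils down to a set computation over the two catalogs. A minimal stdlib sketch of that computation (the connector entries here are hypothetical, not taken from the real catalogs):

```python
# Each catalog is modeled as a set of (dockerRepository, dockerImageTag) pairs.
# These entries are made up for illustration.
cloud_catalog = {("airbyte/source-faker", "0.1.0")}
oss_catalog = {
    ("airbyte/source-faker", "0.1.0"),
    ("airbyte/source-pokeapi", "0.1.5"),
}

# Unique image/version pairs, tagged with the catalogs they appear in.
availability = {
    image: {
        "is_cloud": image in cloud_catalog,
        "is_oss": image in oss_catalog,
    }
    for image in cloud_catalog | oss_catalog
}
```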

Recommended reading order

  1. airbyte-ci/connectors_ci/metadata_service/orchestrator/README.md
  2. airbyte-ci/connectors_ci/metadata_service/orchestrator/orchestrator/__init__.py
  3. airbyte-ci/connectors_ci/metadata_service/orchestrator/orchestrator/assets/catalog_assets.py
  4. ...te-ci/connectors_ci/metadata_service/orchestrator/orchestrator/resources/gcp_resources.py
  5. ...te-ci/connectors_ci/metadata_service/orchestrator/orchestrator/sensors/catalog_sensors.py

@bnchrch bnchrch marked this pull request as draft March 13, 2023 16:21
Comment on lines 38 to 39
@asset(required_resource_keys={"catalog_report_directory_manager"})
def connector_catalog_location_markdown(context, all_destinations_dataframe, all_sources_dataframe):
Contributor

👋 Bye, #23367

Contributor Author

Haha not quite! But its days are numbered 🗡️

@bnchrch bnchrch force-pushed the bnchrch/poc-dagster-gcs-sensor branch from 9ed8c7e to d705241 Compare March 13, 2023 22:14
@bnchrch bnchrch changed the title Draft: Dagster PoC for Metadata service Dagster PoC for Metadata service Mar 13, 2023
@bnchrch bnchrch requested a review from evantahler March 13, 2023 22:56
@bnchrch bnchrch marked this pull request as ready for review March 13, 2023 22:56
Contributor

@alafanechere alafanechere left a comment

It's very cool to discover Dagster with your PR.
I left a couple of comments on the README because of my unfamiliarity with Dagster. Feel free to return these suggestions about Dagger 😄

airbyte-ci/connectors_ci/README.md (outdated; resolved)
@@ -0,0 +1,2 @@
METADATA_BUCKET="ben-ab-test-bucket"
GCP_GSM_CREDENTIALS=
Contributor

Missing newline at the end of the file.
Question: how are the variables defined in .env loaded as environment variables?

Contributor Author

```bash
cp .env.template .env
```

### Create a GCP Service Account and Dev Bucket
Contributor

We've been using Minio for local development on buckets. Do you think we could use it in this context too, so development doesn't require interacting with gcloud (console)?

Contributor Author

I like the idea of using Minio! Where are we currently using it?

Dagster has the concept of file and IO managers, so we should be able to swap between local and remote implementations really easily.

Also, I've logged this as a separate issue in our backlog for now.

https://github.com/airbytehq/airbyte/issues/24090

Contributor

I think platform uses (or used) Minio to store the logs on local deployment:
https://github.com/airbytehq/airbyte-cloud/blob/49530001e0732d40e187c5b4fd68e544ae5e980a/.env#L68

Comment on lines 35 to 37
4. Add the following environment variables to your `.env` file:
- `METADATA_BUCKET`
- `GCP_GSM_CREDENTIALS`
Contributor

Could you explain why we need to have the GCP_GSM_CREDENTIALS env var in the orchestrator context?

Contributor Author

Sure, it's because we both upload files to GCS and watch them for changes.

But I imagine you're asking why we're doing this via an env var?

If that's the question, it's simply because Dagster Cloud is currently best set up to handle config values through environment variables, not local files.

return ":".join(etags)


def catalog_updated_sensor(job, resources_def) -> SensorDefinition:
Contributor

Suggested change
def catalog_updated_sensor(job, resources_def) -> SensorDefinition:
def catalog_updated_sensor(job) -> SensorDefinition:

Don't you think it should be the sensor's role to define which resources it needs, and maybe create them?
If I'm following your implementation correctly, we're defining global resources in `__init__.py` and passing all of them to this function. What if the resources dict grows to include resources this sensor does not need? It will still `build_resources` on resources it does not need, right?

Contributor Author

That's a fair critique! If we add more resources we will end up building more than we need to.

The only reason I did it this way is I didn't want new developers to add new resources to one of our reports and accidentally (and silently) break the sensor.

Hmm, let me investigate whether there's a way we can solve both issues.
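The composite-etag cursor under discussion can be sketched in plain Python (helper names here are hypothetical; the real sensor reads etags from GCS blobs): the sensor joins the watched files' etags into one cursor string and only triggers a run when that string differs from the last stored cursor.

```python
from typing import Iterable, Optional


def composite_etag_cursor(etags: Iterable[str]) -> str:
    """Combine per-file etags into a single cursor string."""
    return ":".join(etags)


def should_trigger(current_etags: Iterable[str], last_cursor: Optional[str]) -> bool:
    """Trigger a run only when any watched file's etag has changed.

    The first poll (no stored cursor) always triggers.
    """
    return composite_etag_cursor(current_etags) != last_cursor
```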

Contributor

@evantahler evantahler left a comment

Approving because this is a great start to the project and things work!

Comments below are nits, and I also defer to the Pythonic reviews of others.

@@ -0,0 +1,3 @@
# Airbyte Connectors CI

This folder is a collection of systems, tools and scripts that are used to run CI/CD systems specific to our connectors.
Contributor Author

Yup! All part of phase 3 of the tech spec

All commands below assume you are in the `metadata_service/orchestrator` directory.
### Installation
```bash
poetry install
```
Contributor

How do I get poetry? (https://python-poetry.org/docs/)

Contributor Author

Good callout! I'll add this.

Comment on lines 69 to 117
def all_destinations_dataframe(cloud_destinations_dataframe, oss_destinations_dataframe) -> pd.DataFrame:
    """
    Merge the cloud and oss destinations catalogs into a single dataframe.
    """

    # Add a column 'is_cloud' to indicate if an image/version pair is in the cloud catalog
    cloud_destinations_dataframe["is_cloud"] = True

    # Add a column 'is_oss' to indicate if an image/version pair is in the oss catalog
    oss_destinations_dataframe["is_oss"] = True

    composite_key = ["destinationDefinitionId", "dockerRepository", "dockerImageTag"]

    # Merge the two catalogs on the composite key, keeping only the unique pairs
    merged_catalog = pd.merge(
        cloud_destinations_dataframe, oss_destinations_dataframe, how="outer", on=composite_key
    ).drop_duplicates(subset=composite_key)

    # Replace NaN values in the 'is_cloud' and 'is_oss' columns with False
    merged_catalog[["is_cloud", "is_oss"]] = merged_catalog[["is_cloud", "is_oss"]].fillna(False)

    # Return the merged catalog with the desired columns
    return merged_catalog


@asset
def all_sources_dataframe(cloud_sources_dataframe, oss_sources_dataframe) -> pd.DataFrame:
    """
    Merge the cloud and oss source catalogs into a single dataframe.
    """

    # Add a column 'is_cloud' to indicate if an image/version pair is in the cloud catalog
    cloud_sources_dataframe["is_cloud"] = True

    # Add a column 'is_oss' to indicate if an image/version pair is in the oss catalog
    oss_sources_dataframe["is_oss"] = True

    composite_key = ["sourceDefinitionId", "dockerRepository", "dockerImageTag"]

    # Merge the two catalogs on the composite key, keeping only the unique pairs
    merged_catalog = pd.merge(
        cloud_sources_dataframe, oss_sources_dataframe, how="outer", on=composite_key
    ).drop_duplicates(subset=composite_key)

    # Replace NaN values in the 'is_cloud' and 'is_oss' columns with False
    merged_catalog[["is_cloud", "is_oss"]] = merged_catalog[["is_cloud", "is_oss"]].fillna(False)

    # Return the merged catalog with the desired columns
    return merged_catalog
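A quick sanity check of the merge pattern above on miniature dataframes (the connector entries are hypothetical) shows how the availability flags come out:

```python
import pandas as pd

# Made-up miniature catalogs; real ones come from the GCS catalog files.
cloud = pd.DataFrame([
    {"sourceDefinitionId": "1", "dockerRepository": "airbyte/source-faker", "dockerImageTag": "0.1.0"},
])
oss = pd.DataFrame([
    {"sourceDefinitionId": "1", "dockerRepository": "airbyte/source-faker", "dockerImageTag": "0.1.0"},
    {"sourceDefinitionId": "2", "dockerRepository": "airbyte/source-pokeapi", "dockerImageTag": "0.1.5"},
])

cloud["is_cloud"] = True
oss["is_oss"] = True

composite_key = ["sourceDefinitionId", "dockerRepository", "dockerImageTag"]

# Outer merge keeps rows present in either catalog; missing flags become NaN.
merged = pd.merge(cloud, oss, how="outer", on=composite_key).drop_duplicates(subset=composite_key)
merged[["is_cloud", "is_oss"]] = merged[["is_cloud", "is_oss"]].fillna(False)
# source-faker ends up flagged for both catalogs; source-pokeapi is OSS-only.
```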
Contributor

+1 on merging these projects. See comment above

@sensor(
    name=f"{job.name}_on_catalog_updated",
    job=job,
    minimum_interval_seconds=30,
Contributor

Looks like the minimum is where things start when initialized

    name=f"{job.name}_on_catalog_updated",
    job=job,
    minimum_interval_seconds=30,
    default_status=DefaultSensorStatus.STOPPED,
Contributor

should the defaults be to be enabled?

Contributor Author

Currently, I'm for having them off by default.

The reason is that if this code goes to production, we only need to flip the switch the first time; the sensor then stays on indefinitely (or at least until we purge the database backing Dagster, which doesn't really happen).

Also, by leaving it off we protect devs from unintentionally running the full DAG when they start the system.

@bnchrch bnchrch force-pushed the bnchrch/poc-dagster-gcs-sensor branch from f6167f4 to 8af5a08 Compare March 17, 2023 03:26
@bnchrch bnchrch force-pushed the bnchrch/poc-dagster-gcs-sensor branch from 22a4be1 to 465c00d Compare March 17, 2023 18:35
Contributor

@pedroslopez pedroslopez left a comment

Nice! Interesting to see how the pipelines/dags are built up

General Q that doesn't necessarily have to be addressed in this PR: should we start referencing the catalog with the new Registry name instead? I know there are quite a few references to "catalog" in this particular PR 😛

Contributor

Nit: feels like we should be able to get away with specifying a minimal catalog with 2-3 connectors for testing purposes. This is ok though.


metadata = {
    "preview": MetadataValue.md(markdown),
    "gcs_path": MetadataValue.url(file_handle.gcs_path),
Contributor

@pedroslopez pedroslopez Mar 17, 2023

How does this know to upload to GCS? Is this just something Dagster has support for internally, via an output that has this gcs_path attribute?

Contributor Author

All part of the resource we pull in, `catalog_report_directory_manager`.

You can see the GCS pieces get wired up in the resources folder.

Contributor

Just checking: do we have CI set up to run these tests, or a ticket created to do so if not?

Contributor Author

I'll add it to the cut-over phase!

@bnchrch
Contributor Author

bnchrch commented Mar 17, 2023

@pedroslopez I like the callout for removing the use of catalog in favour of registry.

After the types repo is all set next week, I'm planning to come back to the lib/ folder to refactor the jsonschema definitions out.

I'm thinking that would be a good time for a rename.

Thoughts?

@bnchrch bnchrch merged commit 2d6f5ee into master Mar 20, 2023
@bnchrch bnchrch deleted the bnchrch/poc-dagster-gcs-sensor branch March 20, 2023 19:28
erohmensing pushed a commit that referenced this pull request Mar 22, 2023
* Add airbyte-ci folders

* Add poetry

* Add first dagster job

* Get sensors properly registering

* Trigger job on new files

* Add etag cursor

* Wire up resources and ops

* Parse destinations dataframe

* Add multiple dataframes

* Compute markdown

* Write html and markdown

* Move columns to variable

* move to a folder centered file structure

* Move to sensor factory

* Add resource def to sensor

* Use appropriate credentials

* Use GCSFileManager

* use catalog_directory_manager

* Generalize the gcp catalog resources

* Move bucket to env var

* Clean up and add comments

* Update readme

* Move dependencies into orchestrator

* Add gcs section to readme

* Clean up debug

* Add merge catalog tests

* Run code formatter

* Apply flake8 fixes

* Remove suffix

* Move tests up one level

* Folder rename

* Update readme and rename env

* Add jinja templates

* Rename connectors_ci to connectors for lib
erohmensing pushed a commit that referenced this pull request Mar 22, 2023

Successfully merging this pull request may close these issues.

Create a Dagster PoC
4 participants