Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dagster PoC for Metadata service #23989

Merged
merged 35 commits into from
Mar 20, 2023
Merged
Changes from 1 commit
Commits
Show all changes
35 commits
Select commit Hold shift + click to select a range
1b81ee1
Add airbyte-ci folders
bnchrch Mar 9, 2023
5bfb78d
Add poetry
bnchrch Mar 9, 2023
41031e5
Add first dagster job
bnchrch Mar 9, 2023
0e7e837
Get sensors properly registering
bnchrch Mar 10, 2023
d7901bb
Trigger job on new files
bnchrch Mar 10, 2023
65a55c4
Add etag cursor
bnchrch Mar 10, 2023
3394aa4
Wire up resources and ops
bnchrch Mar 10, 2023
fe35573
Parse destinations dataframe
bnchrch Mar 11, 2023
56ac7a8
Add multiple dataframes
bnchrch Mar 11, 2023
9bc0fd2
Compute markdown
bnchrch Mar 11, 2023
5445994
Write html and markdown
bnchrch Mar 11, 2023
4895484
Move columns to variable
bnchrch Mar 11, 2023
5c7e276
move to a folder centered file structure
bnchrch Mar 11, 2023
07b0937
Move to sensor factory
bnchrch Mar 11, 2023
6ba22e5
Add resource def to sensor
bnchrch Mar 11, 2023
23748d4
Use appropriate credentials
bnchrch Mar 11, 2023
46bcadf
Use GCSFileManager
bnchrch Mar 11, 2023
86a85ef
use catalog_directory_manager
bnchrch Mar 11, 2023
16ba801
Generalize the gcp catalog resources
bnchrch Mar 11, 2023
53eb3e0
Move bucket to env var
bnchrch Mar 11, 2023
4932228
Clean up and add comments
bnchrch Mar 11, 2023
fbc3e40
Update readme
bnchrch Mar 13, 2023
e029a6f
Move dependencies into orchestrator
bnchrch Mar 13, 2023
867e7f3
Add gcs section to readme
bnchrch Mar 13, 2023
348ab2d
Clean up debug
bnchrch Mar 13, 2023
be69597
Add merge catalog tests
bnchrch Mar 13, 2023
7293821
Run code formatter
bnchrch Mar 13, 2023
5ee0b9d
Apply flake8 fixes
bnchrch Mar 13, 2023
e71885d
Remove suffix
bnchrch Mar 14, 2023
8dadfd3
Move tests up one level
bnchrch Mar 17, 2023
d57a161
Folder rename
bnchrch Mar 17, 2023
fde8acc
Update readme and rename env
bnchrch Mar 17, 2023
465c00d
Add jinja templates
bnchrch Mar 17, 2023
fc8d1f3
Rename connectors_ci to connectors for lib
bnchrch Mar 17, 2023
f46cf62
Merge remote-tracking branch 'origin/master' into bnchrch/poc-dagster…
bnchrch Mar 20, 2023
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Prev Previous commit
Next Next commit
Add jinja templates
  • Loading branch information
bnchrch committed Mar 17, 2023

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
commit 465c00d56963520c45d76b7fc3a7610ff22b4316
Original file line number Diff line number Diff line change
@@ -1,2 +1 @@
METADATA_BUCKET="ben-ab-test-bucket"
GCP_GCS_CREDENTIALS=
Original file line number Diff line number Diff line change
@@ -61,13 +61,13 @@ To create a development bucket:
- Storage Object Viewer
4. Add the following environment variables to your `.env` file:
- `METADATA_BUCKET`
- `GCP_GCS_CREDENTIALS`
- `GCS_CREDENTIALS`

Note that the `GCP_GCS_CREDENTIALS` should be the raw json string of the service account credentials.
Note that the `GCS_CREDENTIALS` should be the raw json string of the service account credentials.

Here is an example of how to import the service account credentials into your environment:
```bash
export GCP_GCS_CREDENTIALS=`cat /path/to/credentials.json`
export GCS_CREDENTIALS=`cat /path/to/credentials.json`
```

## The Orchestrator
Original file line number Diff line number Diff line change
@@ -35,7 +35,7 @@
RESOURCES = {
"gcp_gcs_client": gcp_gcs_client.configured(
{
"gcp_gsm_cred_string": {"env": "GCP_GCS_CREDENTIALS"},
"gcp_gcs_cred_string": {"env": "GCS_CREDENTIALS"},
}
),
"gcs_bucket_manager": gcs_bucket_manager.configured({"gcs_bucket": {"env": "METADATA_BUCKET"}}),
Original file line number Diff line number Diff line change
@@ -2,9 +2,17 @@
import json

from dagster import MetadataValue, Output, asset, OpExecutionContext
from jinja2 import Environment, PackageLoader

from ..utils.html import html_body
def render_connector_catalog_locations_html(destinations_table, sources_table):
env = Environment(loader=PackageLoader("orchestrator", "templates"))
template = env.get_template("connector_catalog_locations.html")
return template.render(destinations_table=destinations_table, sources_table=sources_table)

def render_connector_catalog_locations_markdown(destinations_markdown, sources_markdown):
env = Environment(loader=PackageLoader("orchestrator", "templates"))
template = env.get_template("connector_catalog_locations.md")
return template.render(destinations_markdown=destinations_markdown, sources_markdown=sources_markdown)

@asset(required_resource_keys={"catalog_report_directory_manager"})
def connector_catalog_location_html(context, all_destinations_dataframe, all_sources_dataframe):
@@ -18,19 +26,16 @@ def connector_catalog_location_html(context, all_destinations_dataframe, all_sou
all_sources_dataframe.replace({True: "✅", False: "❌"}, inplace=True)
all_destinations_dataframe.replace({True: "✅", False: "❌"}, inplace=True)

title = "Connector Catalogs"
content = f"<h1>{title}</h1>"
content += "<h2>Sources</h2>"
content += all_sources_dataframe[columns_to_show].to_html()
content += "<h2>Destinations</h2>"
content += all_destinations_dataframe[columns_to_show].to_html()

html = html_body(title, content)
html = render_connector_catalog_locations_html(
destinations_table=all_destinations_dataframe[columns_to_show].to_html(),
sources_table=all_sources_dataframe[columns_to_show].to_html(),
)

catalog_report_directory_manager = context.resources.catalog_report_directory_manager
file_handle = catalog_report_directory_manager.write_data(html.encode(), ext="html", key="connector_catalog_locations")

metadata = {
"preview": html,
"gcs_path": MetadataValue.url(file_handle.gcs_path),
}

@@ -49,11 +54,10 @@ def connector_catalog_location_markdown(context, all_destinations_dataframe, all
all_sources_dataframe.replace({True: "✅", False: "❌"}, inplace=True)
all_destinations_dataframe.replace({True: "✅", False: "❌"}, inplace=True)

markdown = "# Connector Catalog Locations\n\n"
markdown += "## Sources\n"
markdown += all_sources_dataframe[columns_to_show].to_markdown()
markdown += "\n\n## Destinations\n"
markdown += all_destinations_dataframe[columns_to_show].to_markdown()
markdown = render_connector_catalog_locations_markdown(
destinations_markdown=all_destinations_dataframe[columns_to_show].to_markdown(),
sources_markdown=all_sources_dataframe[columns_to_show].to_markdown(),
)

catalog_report_directory_manager = context.resources.catalog_report_directory_manager
file_handle = catalog_report_directory_manager.write_data(markdown.encode(), ext="md", key="connector_catalog_locations")
Original file line number Diff line number Diff line change
@@ -7,13 +7,13 @@
from dagster_gcp.gcs.file_manager import GCSFileManager


@resource(config_schema={"gcp_gsm_cred_string": StringSource})
@resource(config_schema={"gcp_gcs_cred_string": StringSource})
def gcp_gcs_client(resource_context: InitResourceContext) -> storage.Client:
"""Create a connection to gcs."""

resource_context.log.info("retrieving gcp_gcs_client")
gcp_gsm_cred_string = resource_context.resource_config["gcp_gsm_cred_string"]
gcp_gsm_cred_json = json.loads(gcp_gsm_cred_string)
gcp_gcs_cred_string = resource_context.resource_config["gcp_gcs_cred_string"]
gcp_gsm_cred_json = json.loads(gcp_gcs_cred_string)
credentials = service_account.Credentials.from_service_account_info(gcp_gsm_cred_json)
return storage.Client(
credentials=credentials,
Original file line number Diff line number Diff line change
@@ -0,0 +1,13 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta http-equiv="X-UA-Compatible" content="ie=edge">
<title>{% block title %}{% endblock %}</title>
</head>
<body>
{% block content%}
{% endblock %}
</body>
</html>
Original file line number Diff line number Diff line change
@@ -0,0 +1,12 @@
{% extends 'base.html' %}

{% block title%}Connector Catalogs{% endblock %}

{% block content %}
<h1>Connector Catalogs</h1>
<h2>Sources</h2>
{{ sources_table }}

<h2>Destinations</h2>
{{ destinations_table }}
{% endblock %}
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
# Connector Catalog Locations
## Sources
{{ sources_markdown }}

## Destinations
{{ destinations_markdown }}

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

Original file line number Diff line number Diff line change
@@ -13,6 +13,7 @@ dagster = "^1.1.21"
pandas = "^1.5.3"
dagster-gcp = "^0.17.21"
google = "^3.0.0"
jinja2 = "^3.1.2"


[tool.poetry.group.dev.dependencies]
Original file line number Diff line number Diff line change
@@ -29,7 +29,7 @@ def debug_catalog_projection():
resources = {
"gcp_gcs_client": gcp_gcs_client.configured(
{
"gcp_gsm_cred_string": {"env": "GCP_GCS_CREDENTIALS"},
"gcp_gcs_cred_string": {"env": "GCS_CREDENTIALS"},
}
),
"gcs_bucket_manager": gcs_bucket_manager.configured({"gcs_bucket": {"env": "METADATA_BUCKET"}}),