feat(ingest): bigquery - Promoting bigquery-beta to bigquery source (datahub-project#6222)

Co-authored-by: Shirshanka Das <shirshanka@apache.org>
2 people authored and cccs-tom committed Nov 18, 2022
1 parent c387816 commit 8d1e731
Showing 11 changed files with 45 additions and 35 deletions.
16 changes: 11 additions & 5 deletions docs/cli.md
@@ -11,6 +11,7 @@ CLI release is made through a different repository and release notes can be foun
If a server with version `0.8.28` is being used, then the CLI used to connect to it should be `0.8.28.x`. Tests of the new CLI are not run with older server versions, so it is not recommended to update the CLI if the server is not updated.
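
As an illustrative check only (the exact output format may differ between releases), you can confirm the installed CLI version before deciding whether to upgrade:

```shell
# Print the version of the locally installed DataHub CLI
datahub version
```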

## Installation

### Using pip

We recommend Python virtual environments (venvs) to namespace pip modules. The folks over at [Acryl Data](https://www.acryl.io/) maintain a PyPI package for DataHub metadata ingestion. Here's an example setup:
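
The concrete commands are collapsed in this hunk; a minimal sketch along those lines, assuming a Unix-like shell, would be:

```shell
# Create and activate an isolated environment for the DataHub CLI
python3 -m venv datahub-env
source datahub-env/bin/activate
# Install the core CLI package; source-specific plugins are added later as extras
python3 -m pip install --upgrade pip wheel setuptools
python3 -m pip install --upgrade acryl-datahub
datahub version  # sanity check
```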
@@ -66,7 +67,6 @@ We use a plugin architecture so that you can install only the dependencies you a
| [file](./generated/ingestion/sources/file.md) | _included by default_ | File source and sink |
| [athena](./generated/ingestion/sources/athena.md) | `pip install 'acryl-datahub[athena]'` | AWS Athena source |
| [bigquery](./generated/ingestion/sources/bigquery.md) | `pip install 'acryl-datahub[bigquery]'` | BigQuery source |
| [bigquery-usage](./generated/ingestion/sources/bigquery.md#module-bigquery-usage) | `pip install 'acryl-datahub[bigquery-usage]'` | BigQuery usage statistics source |
| [datahub-lineage-file](./generated/ingestion/sources/file-based-lineage.md) | _no additional dependencies_ | Lineage File source |
| [datahub-business-glossary](./generated/ingestion/sources/business-glossary.md) | _no additional dependencies_ | Business Glossary File source |
| [dbt](./generated/ingestion/sources/dbt.md) | _no additional dependencies_ | dbt source |
@@ -126,7 +126,9 @@ datahub check plugins
[extra requirements]: https://www.python-ldap.org/en/python-ldap-3.3.0/installing.html#build-prerequisites

## Environment variables supported

The env variables take precedence over what is in the DataHub CLI config created through the `init` command. The list of supported environment variables is as follows:

- `DATAHUB_SKIP_CONFIG` (default `false`) - Set to `true` to skip creating the configuration file.
- `DATAHUB_GMS_URL` (default `http://localhost:8080`) - Set to the URL of the GMS instance.
- `DATAHUB_GMS_HOST` (default `localhost`) - Set to the host of the GMS instance. Prefer using `DATAHUB_GMS_URL` to set the URL.
@@ -136,7 +138,7 @@ The env variables take precedence over what is in the DataHub CLI config created
- `DATAHUB_TELEMETRY_ENABLED` (default `true`) - Set to `false` to disable telemetry. If the CLI is being run in an environment with no access to the public internet, this should be disabled.
- `DATAHUB_TELEMETRY_TIMEOUT` (default `10`) - Set to a custom integer value to specify the timeout in seconds when sending telemetry.
- `DATAHUB_DEBUG` (default `false`) - Set to `true` to enable debug logging for CLI. Can also be achieved through `--debug` option of the CLI.
- `DATAHUB_VERSION` (default `head`) - Set to a specific version to run quickstart with the particular version of docker images.
- `ACTIONS_VERSION` (default `head`) - Set to a specific version to run quickstart with that image tag of `datahub-actions` container.
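
As an illustration (the values below are placeholders, not part of the original docs), these variables can simply be exported before invoking the CLI:

```shell
# Point the CLI at a hypothetical GMS instance and disable telemetry for an air-gapped environment
export DATAHUB_GMS_URL="http://datahub-gms.internal:8080"
export DATAHUB_TELEMETRY_ENABLED=false
datahub check plugins
```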

```shell
@@ -271,6 +273,7 @@ datahub get --urn "urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PR
The `put` group of commands allows you to write metadata into DataHub. This is a flexible way for you to issue edits to metadata from the command line.

#### put aspect

The **put aspect** (also the default `put`) command instructs `datahub` to set a specific aspect for an entity to a specified value.
For example, the command shown below sets the `ownership` aspect of the dataset `urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)` to the value in the file `ownership.json`.
The JSON in the `ownership.json` file needs to conform to the [`Ownership`](https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/common/Ownership.pdl) Aspect model as shown below.
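
The concrete invocation is collapsed in this hunk; a sketch of the command the paragraph describes (assuming `ownership.json` sits in the current directory) is:

```shell
# Write the ownership aspect for the dataset from a local JSON file
datahub put --urn "urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)" --aspect ownership -d ownership.json
```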
@@ -299,14 +302,14 @@ Update succeeded with status 200
```

#### put platform

The **put platform** command (available in versions > 0.8.44.4) instructs `datahub` to create or update metadata about a data platform. This is very useful if you are using a custom data platform and want to set up its logo and display name for a native UI experience.

```shell
datahub put platform --name longtail_schemas --display_name "Long Tail Schemas" --logo "https://flink.apache.org/img/logo/png/50/color_50.png"
✅ Successfully wrote data platform metadata for urn:li:dataPlatform:longtail_schemas to DataHub (DataHubRestEmitter: configured to talk to https://longtailcompanions.acryl.io/api/gms with token: eyJh**********Cics)
```


### migrate

The `migrate` group of commands allows you to perform certain kinds of migrations.
@@ -316,6 +319,7 @@ The `migrate` group of commands allows you to perform certain kinds of migration
The `dataplatform2instance` migration command allows you to migrate your entities from an instance-agnostic platform identifier to an instance-specific platform identifier. If you have ingested metadata in the past for this platform and would like to transfer any important metadata over to the new instance-specific entities, then you should use this command. For example, if your users have added documentation or added tags or terms to your datasets, then you should run this command to transfer this metadata over to the new entities. For further context, read the Platform Instance Guide [here](./platform-instances.md).

A few important options worth calling out:

- `--dry-run` / `-n`: Use this to get a report of what will be migrated before running.
- `--force` / `-F`: Use this if you know what you are doing and do not want a confirmation prompt before the migration is started.
- `--keep`: When enabled, the old entities are preserved and not deleted. The default behavior is to soft-delete old entities.
@@ -324,6 +328,7 @@ A few important options worth calling out:
**_Note_**: Timeseries aspects such as Usage Statistics and Dataset Profiles are not migrated over to the new entity instances; new data points will be created when you re-run ingestion using the `usage` sources or sources with profiling turned on.

##### Dry Run

```console
datahub migrate dataplatform2instance --platform elasticsearch --instance prod_index --dry-run
Starting migration: platform:elasticsearch, instance=prod_index, force=False, dry-run=True
@@ -341,6 +346,7 @@ Starting migration: platform:elasticsearch, instance=prod_index, force=False, dr
```

##### Real Migration (with soft-delete)

```
datahub migrate dataplatform2instance --platform hive --instance warehouse
@@ -373,7 +379,7 @@ to get the raw JSON difference in addition to the API output you can add the `--
```console
datahub timeline --urn "urn:li:dataset:(urn:li:dataPlatform:mysql,User.UserAccount,PROD)" --category TAG --start 7daysago
2022-02-17 14:03:42 - 0.0.0-computed
MODIFY TAG dataset:mysql:User.UserAccount : A change in aspect editableSchemaMetadata happened at time 2022-02-17 20:03:42.0
2022-02-17 14:17:30 - 0.0.0-computed
MODIFY TAG dataset:mysql:User.UserAccount : A change in aspect editableSchemaMetadata happened at time 2022-02-17 20:17:30.118
```
2 changes: 1 addition & 1 deletion metadata-ingestion/docs/sources/bigquery/README.md
@@ -1 +1 @@
To get all metadata from BigQuery you need to use two plugins `bigquery` and `bigquery-usage`. Both of them are described in this page. These will require 2 separate recipes. We understand this is not ideal and we plan to make this easier in the future.
Ingesting metadata from BigQuery requires either using the **bigquery** module with just one recipe (recommended) or the two separate modules **bigquery-legacy** and **bigquery-usage-legacy** (soon to be deprecated) with two separate recipes.
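
As a quick illustration of the two paths (the extras names come from this change; the `pip` invocations are standard):

```shell
# Recommended: the consolidated plugin, driven by a single recipe
pip install 'acryl-datahub[bigquery]'

# Legacy path (soon to be deprecated): two plugins, two recipes
pip install 'acryl-datahub[bigquery-legacy,bigquery-usage-legacy]'
```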
@@ -1,22 +1,21 @@
source:
  type: bigquery-beta
  type: bigquery-legacy
  config:
    # Coordinates
    project_id: my_project_id

    # `schema_pattern` for BQ Datasets
    schema_pattern:
      allow:
        - finance_bq_dataset

    table_pattern:
      deny:
        # The exact name of the table is revenue_table_name
        # The reason we have this `.*` at the beginning is that the current implementation of table_pattern is testing
        # project_id.dataset_name.table_name
        # We will improve this in the future
        - .*revenue_table_name
    include_table_lineage: true
    include_usage_statistics: true
    profiling:
      enabled: true
      profile_table_level_only: true

sink:
  # sink configs
@@ -1,5 +1,5 @@
source:
  type: bigquery-usage
  type: bigquery-usage-legacy
  config:
    # Coordinates
    projects:
@@ -15,14 +15,14 @@ There are two important concepts to understand and identify:
##### Basic Requirements (needed for metadata ingestion)
1. Identify your Extractor Project where the service account will run queries to extract metadata.

| Permission                      | Description                                                                                                                                                                                                     | Capability |
|---------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------|
| `bigquery.jobs.create`          | Run jobs (e.g. queries) within the project. *This is only needed for the extractor project where the service account belongs.*                                                                                  |            |
| `bigquery.jobs.list`            | Manage the queries that the service account has sent. *This is only needed for the extractor project where the service account belongs.*                                                                        |            |
| `bigquery.readsessions.create`  | Create a session for streaming large results. *This is only needed for the extractor project where the service account belongs.*                                                                                |            |
| `bigquery.readsessions.getData` | Get data from the read session. *This is only needed for the extractor project where the service account belongs.*                                                                                              |            |
| `bigquery.tables.create`        | Create temporary tables when profiling tables. Tip: use `profiling.bigquery_temp_table_schema` to ensure that all temp tables (across multiple projects) are created in this project under a specific dataset.  | Profiling  |
| `bigquery.tables.delete`        | Delete temporary tables when profiling tables. Tip: use `profiling.bigquery_temp_table_schema` to ensure that all temp tables (across multiple projects) are created in this project under a specific dataset.  | Profiling  |
2. Grant the following permissions to the Service Account on every project from which you would like to extract metadata:

:::info
9 changes: 5 additions & 4 deletions metadata-ingestion/docs/sources/bigquery/bigquery_recipe.yml
@@ -1,21 +1,22 @@
source:
  type: bigquery
  config:
    # Coordinates
    project_id: my_project_id

    # `schema_pattern` for BQ Datasets
    schema_pattern:
      allow:
        - finance_bq_dataset

    table_pattern:
      deny:
        # The exact name of the table is revenue_table_name
        # The reason we have this `.*` at the beginning is that the current implementation of table_pattern is testing
        # project_id.dataset_name.table_name
        # We will improve this in the future
        - .*revenue_table_name
    include_table_lineage: true
    include_usage_statistics: true
    profiling:
      enabled: true
      profile_table_level_only: true

sink:
  # sink configs
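
A recipe like the one above is typically executed with the CLI's ingest command; the filename below is a placeholder for wherever the recipe is saved:

```shell
# Run the consolidated BigQuery source with a single recipe
datahub ingest -c bigquery_recipe.yml
```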
18 changes: 11 additions & 7 deletions metadata-ingestion/setup.py
@@ -224,13 +224,16 @@ def get_long_description():
# PyAthena is pinned with exact version because we use private method in PyAthena
"athena": sql_common | {"PyAthena[SQLAlchemy]==2.4.1"},
"azure-ad": set(),
"bigquery": sql_common
"bigquery-legacy": sql_common
| bigquery_common
| {"sqlalchemy-bigquery>=1.4.1", "sqllineage==1.3.6", "sqlparse"},
"bigquery-usage": bigquery_common | usage_common | {"cachetools"},
"bigquery-beta": sql_common
"bigquery-usage-legacy": bigquery_common | usage_common | {"cachetools"},
"bigquery": sql_common
| bigquery_common
| {"sqllineage==1.3.6", "sql_metadata"},
"bigquery-beta": sql_common
| bigquery_common
| {"sqllineage==1.3.6", "sql_metadata"}, # deprecated, but keeping the extra for backwards compatibility
"clickhouse": sql_common | {"clickhouse-sqlalchemy==0.1.8"},
"clickhouse-usage": sql_common
| usage_common
@@ -379,7 +382,8 @@ def get_long_description():
dependency
for plugin in [
"bigquery",
"bigquery-usage",
"bigquery-legacy",
"bigquery-usage-legacy",
"clickhouse",
"clickhouse-usage",
"delta-lake",
@@ -480,9 +484,9 @@ def get_long_description():
"sqlalchemy = datahub.ingestion.source.sql.sql_generic:SQLAlchemyGenericSource",
"athena = datahub.ingestion.source.sql.athena:AthenaSource",
"azure-ad = datahub.ingestion.source.identity.azure_ad:AzureADSource",
"bigquery = datahub.ingestion.source.sql.bigquery:BigQuerySource",
"bigquery-beta = datahub.ingestion.source.bigquery_v2.bigquery:BigqueryV2Source",
"bigquery-usage = datahub.ingestion.source.usage.bigquery_usage:BigQueryUsageSource",
"bigquery-legacy = datahub.ingestion.source.sql.bigquery:BigQuerySource",
"bigquery = datahub.ingestion.source.bigquery_v2.bigquery:BigqueryV2Source",
"bigquery-usage-legacy = datahub.ingestion.source.usage.bigquery_usage:BigQueryUsageSource",
"clickhouse = datahub.ingestion.source.sql.clickhouse:ClickHouseSource",
"clickhouse-usage = datahub.ingestion.source.usage.clickhouse_usage:ClickHouseUsageSource",
"delta-lake = datahub.ingestion.source.delta_lake:DeltaLakeSource",
@@ -118,7 +118,7 @@ def cleanup(config: BigQueryV2Config) -> None:

@platform_name("BigQuery", doc_order=1)
@config_class(BigQueryV2Config)
@support_status(SupportStatus.INCUBATING)
@support_status(SupportStatus.CERTIFIED)
@capability(SourceCapability.PLATFORM_INSTANCE, "Enabled by default")
@capability(SourceCapability.DOMAINS, "Supported via the `domain` config field")
@capability(SourceCapability.CONTAINERS, "Enabled by default")