feat: add ingestion overview pages (datahub-project#9210)
yoonhyejin authored Nov 20, 2023
1 parent a704290 commit 1ad4f96
Showing 7 changed files with 359 additions and 224 deletions.
34 changes: 21 additions & 13 deletions docs-website/sidebars.js
@@ -30,17 +30,16 @@ module.exports = {
],
},
{
Integrations: [
type: "category",
label: "Integrations",
link: { type: "doc", id: "metadata-ingestion/README" },
items: [
// The purpose of this section is to provide a deeper understanding of how ingestion works.
// Readers should be able to find details for ingesting from all systems, apply transformers, understand sinks,
// and understand key concepts of the Ingestion Framework (Sources, Sinks, Transformers, and Recipes)
{
type: "doc",
label: "Introduction",
id: "metadata-ingestion/README",
},
{
"Quickstart Guides": [
"metadata-ingestion/cli-ingestion",
{
BigQuery: [
"docs/quick-ingestion-guides/bigquery/overview",
@@ -85,15 +84,18 @@
},
],
},
"metadata-ingestion/recipe_overview",
{
Sources: [
type: "category",
label: "Sources",
link: { type: "doc", id: "metadata-ingestion/source_overview" },
items: [
// collapse these; add push-based at top
{
type: "doc",
id: "docs/lineage/airflow",
label: "Airflow",
},

//"docker/airflow/local_airflow",
"metadata-integration/java/spark-lineage/README",
"metadata-ingestion/integration_docs/great-expectations",
@@ -106,18 +108,24 @@
],
},
{
Sinks: [
type: "category",
label: "Sinks",
link: { type: "doc", id: "metadata-ingestion/sink_overview" },
items: [
{
type: "autogenerated",
dirName: "metadata-ingestion/sink_docs",
},
],
},
{
Transformers: [
"metadata-ingestion/docs/transformer/intro",
"metadata-ingestion/docs/transformer/dataset_transformer",
],
type: "category",
label: "Transformers",
link: {
type: "doc",
id: "metadata-ingestion/docs/transformer/intro",
},
items: ["metadata-ingestion/docs/transformer/dataset_transformer"],
},
{
"Advanced Guides": [
61 changes: 61 additions & 0 deletions docs/cli.md
@@ -99,6 +99,36 @@ Command Options:
--strict-warnings If enabled, ingestion runs with warnings will yield a non-zero error code
--test-source-connection When set, ingestion will only test the source connection details from the recipe
```
#### ingest --dry-run

The `--dry-run` option of the `ingest` command performs all of the ingestion steps except writing to the sink. This is useful for validating that the ingestion recipe produces the desired metadata events before ingesting them into DataHub.

```shell
# Dry run
datahub ingest -c ./examples/recipes/example_to_datahub_rest.dhub.yml --dry-run
# Short-form
datahub ingest -c ./examples/recipes/example_to_datahub_rest.dhub.yml -n
```
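
Conceptually, a dry run exercises the whole pipeline but swaps the sink's write step for a no-op. A minimal Python sketch of the idea (all names here are hypothetical, not DataHub's actual API):

```python
from dataclasses import dataclass, field

@dataclass
class DryRunSink:
    """Hypothetical sink: records events for inspection but never emits them."""
    seen: list = field(default_factory=list)

    def write(self, event):
        # A real sink would serialize and send the event here;
        # the dry-run variant only records it.
        self.seen.append(event)

def run_pipeline(events, sink):
    # All ingestion steps run; only the final emit is suppressed.
    for event in events:
        sink.write(event)
    return sink

sink = run_pipeline(["dataset-snapshot", "schema-metadata"], DryRunSink())
print(len(sink.seen))  # number of events that would have been written
```

Because every step short of the write still executes, a dry run surfaces recipe and serialization problems without touching the target instance.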

#### ingest --preview

The `--preview` option of the `ingest` command performs all of the ingestion steps, but limits the processing to only the first 10 workunits produced by the source.
This option helps with quick end-to-end smoke testing of the ingestion recipe.

```shell
# Preview
datahub ingest -c ./examples/recipes/example_to_datahub_rest.dhub.yml --preview
# Preview with dry-run
datahub ingest -c ./examples/recipes/example_to_datahub_rest.dhub.yml -n --preview
```

By default, `--preview` processes the first 10 workunits. To produce more, use the `--preview-workunits` option:

```shell
# Preview 20 workunits without sending anything to sink
datahub ingest -c ./examples/recipes/example_to_datahub_rest.dhub.yml -n --preview --preview-workunits=20
```
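
The preview behaviour amounts to truncating the source's workunit stream after the first N items. A hedged Python sketch using `itertools.islice` (the generator is a stand-in for a source, not DataHub code):

```python
from itertools import islice

def workunit_stream():
    """Stand-in for a source producing an arbitrarily long stream of workunits."""
    n = 0
    while True:
        yield f"workunit-{n}"
        n += 1

# --preview: process only the first 10 workunits (the default limit).
preview = list(islice(workunit_stream(), 10))
# --preview-workunits=20: raise the limit to 20.
larger = list(islice(workunit_stream(), 20))
print(len(preview), len(larger))  # prints 10 20
```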

#### ingest deploy

@@ -115,6 +145,37 @@ To update an existing recipe please use the `--urn` parameter to specify the id
**Note:** Updating a recipe replaces the existing options with what was specified in the CLI command. For example, omitting a schedule in the update command will remove the schedule from the recipe being updated.
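
The replace-rather-than-merge semantics described above can be pictured with a short Python sketch (the dicts are purely illustrative, not the stored recipe format):

```python
# Stored ingestion source, previously deployed with a schedule.
stored = {"recipe": "source: ...", "schedule": "0 * * * *"}

# A later update command that specifies only the recipe.
update = {"recipe": "source: ..."}

# The update replaces the stored options wholesale instead of merging,
# so the previously configured schedule is dropped.
stored = dict(update)
print("schedule" in stored)  # prints False
```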

#### ingest --no-default-report

By default, the CLI sends an ingestion report to DataHub, which allows you to see the results of all CLI-based ingestion runs in the UI. This can be turned off with the `--no-default-report` flag.

```shell
# Running ingestion with reporting to DataHub turned off
datahub ingest -c ./examples/recipes/example_to_datahub_rest.dhub.yaml --no-default-report
```

The report includes the recipe that was used for ingestion. Recipe reporting can be turned off by adding a `reporting` section to the ingestion recipe.

```yaml
source:
  # source configs

sink:
  # sink configs

# Add configuration for the datahub reporter
reporting:
  - type: datahub
    config:
      report_recipe: false

# Optional log to put failed JSONs into a file
# Helpful in case you are trying to debug some issue with specific ingestion failing
failure_log:
  enabled: false
  log_config:
    filename: ./path/to/failure.json
```
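
As a rough illustration of what a failure log like the one configured above might contain, here is a hypothetical Python sketch that appends failed records as JSON lines (not DataHub's implementation; the record fields are invented):

```python
import json
import os
import tempfile

def log_failure(record, filename):
    """Hypothetical helper: append one failed record per line as JSON."""
    with open(filename, "a") as f:
        f.write(json.dumps(record) + "\n")

# Write two failed events to a temporary failure log and read them back.
path = os.path.join(tempfile.mkdtemp(), "failure.json")
log_failure({"event": "dataset-1", "error": "schema mismatch"}, path)
log_failure({"event": "dataset-2", "error": "connection reset"}, path)

with open(path) as f:
    failures = [json.loads(line) for line in f]
print(len(failures))  # prints 2
```

One JSON object per line keeps the file easy to grep and to replay individual failed events during debugging.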
### init

The `init` command tells `datahub` where your DataHub instance is located. By default, the CLI points to a DataHub instance on localhost.
