docs: final grammar fix! pages 120-139 #1913

Merged
merged 3 commits into from
Oct 2, 2024
@@ -10,13 +10,11 @@ keywords: [airflow, github, google cloud composer]

This setup will allow you to deploy the main branch of your Airflow project from GitHub to Cloud Composer.

- Create a GitHub repository ie. by following our how-to guide on [deployment for Airflow](../../walkthroughs/deploy-a-pipeline/deploy-with-airflow-composer.md)
- Create a GitHub repository, for example, by following our how-to guide on [deployment for Airflow](../../walkthroughs/deploy-a-pipeline/deploy-with-airflow-composer.md).

- In Google Cloud web interface, go to Source Repositories and create a repository that mirrors your
GitHub repository. This will simplify the authentication by doing it through this mirroring
service.
- In the Google Cloud web interface, go to Source Repositories and create a repository that mirrors your GitHub repository. This will simplify authentication by using this mirroring service.

- In Cloud Build, add a trigger on commit to main.
- In Cloud Build, add a trigger on commit to the main branch.

- Point it to your Cloud Build file. In our example, we place our file at `build/cloudbuild.yaml`.

@@ -26,31 +24,29 @@ This setup will allow you to deploy the main branch of your Airflow project from

![test-composer](/img/test-composer.png)

- In your `cloudbuild.yaml`, set the bucket name
- In your `cloudbuild.yaml`, set the bucket name.

- Make sure your repository code is pushed to main.
- Make sure your repository code is pushed to the main branch.

- Run the trigger you build (in Cloud Build).
- Run the trigger you built (in Cloud Build).

- Wait a minute, and check if your files arrived in the bucket. In our case, we added a `pipedrive`
folder, and we can see it appeared.
- Wait a minute, and check if your files have arrived in the bucket. In our case, we added a `pipedrive` folder, and we can see it appeared.

![bucket-details](/img/bucket-details.png)

### Airflow setup

### Adding the libraries needed

Assuming you already spun up a Cloud Composer.
Assuming you have already spun up a Cloud Composer:

- Make sure the user you added has rights to change the base image (add libraries). I already had
these added, you may get away with less (not clear in docs):
- Make sure the user you added has rights to change the base image (add libraries). I already had these added; you may get away with fewer (not clear in docs):

- Artifact Registry Administrator;
- Artifact Registry Repository Administrator;
- Remote Build Execution Artifact Admin;

- Navigate to your composer environment and add the needed libraries. In the case of this example
pipedrive pipeline, we only need dlt, so add `dlt` library.
- Navigate to your composer environment and add the needed libraries. In the case of this example pipedrive pipeline, we only need `dlt`, so add the `dlt` library.

![add-package](/img/add-package.png)

13 changes: 7 additions & 6 deletions docs/website/docs/reference/explainers/how-dlt-works.md
@@ -7,8 +7,8 @@ keywords: [architecture, extract, normalize, load]
# How `dlt` works

`dlt` automatically turns JSON returned by any [source](../../general-usage/glossary.md#source)
(e.g. an API) into a live dataset stored in the
[destination](../../general-usage/glossary.md#destination) of your choice (e.g. Google BigQuery). It
(e.g., an API) into a live dataset stored in the
[destination](../../general-usage/glossary.md#destination) of your choice (e.g., Google BigQuery). It
does this by first [extracting](how-dlt-works.md#extract) the JSON data, then
[normalizing](how-dlt-works.md#normalize) it to a schema, and finally [loading](how-dlt-works.md#load)
it to the location where you will store it.
@@ -24,14 +24,15 @@ JSON and provides it to `dlt` as input, which then normalizes that data.
## Normalize

The configurable normalization engine in `dlt` recursively unpacks this nested structure into
relational tables (i.e. inferring data types, linking tables to create nested relationships,
relational tables (i.e., inferring data types, linking tables to create nested relationships,
etc.), making it ready to be loaded. This creates a
[schema](../../general-usage/glossary.md#schema), which will automatically evolve to any future
source data changes (e.g. new fields or tables).
[schema](../../general-usage/glossary.md#schema), which will automatically evolve to accommodate any future
source data changes (e.g., new fields or tables).
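
The unpacking described above can be pictured with a small, self-contained sketch. This is not `dlt`'s normalizer: the `normalize` function, the `_id` and `_parent_id` columns, and the `__` separator used for flattened columns and child table names are all assumptions made for illustration only.

```python
from typing import Any, Dict, List, Optional

Row = Dict[str, Any]


def normalize(table: str, records: List[Row]) -> Dict[str, List[Row]]:
    """Unpack nested records into flat parent and child tables (illustrative only)."""
    tables: Dict[str, List[Row]] = {}

    def flatten(prefix: str, value: Row, out: Row, table_name: str, row_id: str) -> None:
        for key, item in value.items():
            column = f"{prefix}{key}"
            if isinstance(item, dict):
                # Nested objects become extra columns on the same row.
                flatten(f"{column}__", item, out, table_name, row_id)
            elif isinstance(item, list):
                # Nested lists become a child table linked to this row.
                add_rows(f"{table_name}__{column}", item, parent_id=row_id)
            else:
                out[column] = item

    def add_rows(table_name: str, items: List[Any], parent_id: Optional[str]) -> None:
        rows = tables.setdefault(table_name, [])
        for item in items:
            row_id = f"{table_name}/{len(rows)}"  # hypothetical row identifier
            out: Row = {"_id": row_id}
            if parent_id is not None:
                out["_parent_id"] = parent_id
            rows.append(out)
            if isinstance(item, dict):
                flatten("", item, out, table_name, row_id)
            else:
                out["value"] = item

    add_rows(table, records, parent_id=None)
    return tables


if __name__ == "__main__":
    record = {
        "id": 1,
        "name": "Alice",
        "address": {"city": "Berlin"},
        "deals": [{"title": "A"}, {"title": "B"}],
    }
    for name, rows in normalize("contact", [record]).items():
        print(name, rows)
```

Here the nested `address` object is flattened into a column on the parent row, while the `deals` list becomes a separate table whose rows point back to their parent.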

## Load

The data is then loaded into your chosen [destination](../../general-usage/glossary.md#destination).
`dlt` uses configurable, idempotent, atomic loads that ensure data safely ends up there. For
example, you don't need to worry about the size of the data you are loading and if the process is
example, you don't need to worry about the size of the data you are loading, and if the process is
interrupted, it is safe to retry without creating errors.
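
To make "idempotent, atomic" concrete, here is a minimal sketch using Python's built-in `sqlite3` module as a stand-in destination. It does not show how `dlt` implements loading: the `_loads` bookkeeping table, the `load_id` values, and both function names are hypothetical. The idea is that the rows and the "this package is done" marker are written in one transaction, so a retry either finds the marker and does nothing, or finds no partial data and starts cleanly.

```python
import sqlite3
from typing import Iterable, Tuple


def init_destination(conn: sqlite3.Connection) -> None:
    """Create the target table and a hypothetical bookkeeping table."""
    with conn:
        conn.execute("CREATE TABLE IF NOT EXISTS contact (id INTEGER PRIMARY KEY, name TEXT)")
        conn.execute("CREATE TABLE IF NOT EXISTS _loads (load_id TEXT PRIMARY KEY)")


def load_package(conn: sqlite3.Connection, load_id: str, rows: Iterable[Tuple[int, str]]) -> bool:
    """Apply one load package at most once. Returns True if rows were written."""
    with conn:  # commits on success, rolls back if an exception is raised
        already_loaded = conn.execute(
            "SELECT 1 FROM _loads WHERE load_id = ?", (load_id,)
        ).fetchone()
        if already_loaded:
            return False  # retrying a finished package is a no-op
        conn.executemany("INSERT INTO contact (id, name) VALUES (?, ?)", rows)
        conn.execute("INSERT INTO _loads (load_id) VALUES (?)", (load_id,))
    return True


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    init_destination(conn)
    package = [(1, "Alice"), (2, "Bob")]
    print(load_package(conn, "load-1", package))  # first attempt writes the rows
    print(load_package(conn, "load-1", package))  # retry is skipped
```

If a package fails halfway through, the transaction is rolled back, so neither the partial rows nor the marker remain and the same package can be retried.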

12 changes: 6 additions & 6 deletions docs/website/docs/walkthroughs/add-a-verified-source.md
@@ -27,7 +27,7 @@ List available sources to see their names and descriptions:
dlt init --list-sources
```

Now pick one of the source names, for example `pipedrive` and a destination i.e. `bigquery`:
Now pick one of the source names, for example, `pipedrive`, and a destination, for example, `bigquery`:

```sh
dlt init pipedrive bigquery
@@ -80,7 +80,7 @@ For adding them locally or on your orchestrator, please see the following guide

## 3. Customize or write a pipeline script

Once you initialized the pipeline, you will have a sample file `pipedrive_pipeline.py`.
Once you have initialized the pipeline, you will have a sample file `pipedrive_pipeline.py`.

This is the developer's suggested way to use the pipeline, so you can use it as a starting point -
in our case, we can choose to run a method that loads all data, or we can choose which endpoints
@@ -95,7 +95,7 @@ You can modify an existing verified source in place.
- If that modification is generally useful for anyone using this source, consider contributing it
back via a PR. This way, we can ensure it is tested and maintained.
- If that modification is not a generally shared case, then you are responsible for maintaining it.
We suggest making any of your own customisations modular is possible, so you can keep pulling the
We suggest making any of your own customizations modular if possible, so you can keep pulling the
updated source from the community repo in the event of source maintenance.

## 5. Add more sources to your project
@@ -120,7 +120,7 @@ the parent folder:
dlt init pipedrive bigquery
```

## 7. Advanced: Using dlt init with branches, local folders or git repos
## 7. Advanced: Using dlt init with branches, local folders, or git repos

To find out more info about this command, use --help:

@@ -134,9 +134,9 @@ To deploy from a branch of the `verified-sources` repo, you can use the followin
dlt init source destination --branch <branch_name>
```

To deploy from another repo, you could fork the verified-sources repo and then provide the new repo
url as below, replacing `dlt-hub` with your fork name:
To deploy from another repo, you could fork the verified-sources repo and then provide the new repo URL as below, replacing `dlt-hub` with your fork name:

```sh
dlt init pipedrive bigquery --location "https://github.com/dlt-hub/verified-sources"
```

49 changes: 22 additions & 27 deletions docs/website/docs/walkthroughs/add-incremental-configuration.md
@@ -7,12 +7,12 @@ slug: sql-incremental-configuration

# Add incremental configuration to SQL resources
Incremental loading is the act of loading only new or changed data and not old records that have already been loaded.
For example, a bank loading only the latest transactions or a company updating its database with new or modified user
For example, a bank loads only the latest transactions, or a company updates its database with new or modified user
information. In this article, we’ll discuss a few incremental loading strategies.

:::important
Processing data incrementally, or in batches, enhances efficiency, reduces costs, lowers latency, improves scalability,
and optimizes resource utilization.
and optimizes resource utilization.
:::

### Incremental loading strategies
@@ -28,25 +28,26 @@ In this guide, we will discuss various incremental loading methods using `dlt`,

## Code examples



### 1. Full load (replace)

A full load strategy completely overwrites the existing data with the new dataset. This is useful when you want to
refresh the entire table with the latest data.
A full load strategy completely overwrites the existing data with the new dataset. This is useful when you want to refresh the entire table with the latest data.

:::note
This strategy technically does not load only new data but instead reloads all data: old and new.
:::

Here’s a walkthrough:

1. The initial table, named "contact", in the SQL source looks like this:
1. The initial table, named "contact," in the SQL source looks like this:

| id | name | created_at |
| --- | --- | --- |
| 1 | Alice | 2024-07-01 |
| 2 | Bob | 2024-07-02 |

2. The python code illustrates the process of loading data from an SQL source into BigQuery using the `dlt` pipeline. Please note the `write_disposition = "replace` used below.
2. The Python code illustrates the process of loading data from an SQL source into BigQuery using the `dlt` pipeline. Please note the `write_disposition = "replace"` used below.

```py
def load_full_table_resource() -> None:
@@ -94,24 +95,22 @@ Here’s a walkthrough:

**What happened?**

After running the pipeline, the original data in the "contact" table (Alice and Bob) is completely replaced with the new
updated table with data “Charlie” and “Dave” added and “Bob” removed. This strategy is useful for scenarios where the entire
dataset needs to be refreshed/replaced with the latest information.
After running the pipeline, the original data in the "contact" table (Alice and Bob) is completely replaced with the updated table, in which “Charlie” and “Dave” have been added and “Bob” has been removed. This strategy is useful for scenarios where the entire dataset needs to be refreshed or replaced with the latest information.

### 2. Append new records based on incremental ID

This strategy appends only new records to the table based on an incremental ID. It is useful for scenarios where each new record has a unique, incrementing identifier.

Here’s a walkthrough:

1. The initial table, named "contact", in the SQL source looks like this:
1. The initial table, named "contact," in the SQL source looks like this:

| id | name | created_at |
| --- | --- | --- |
| 1 | Alice | 2024-07-01 |
| 2 | Bob | 2024-07-02 |

2. The python code demonstrates loading data from an SQL source into BigQuery using an incremental variable, `id`. This variable tracks new or updated records in the `dlt` pipeline. Please note the `write_disposition = "append` used below.
2. The Python code demonstrates loading data from an SQL source into BigQuery using an incremental variable, `id`. This variable tracks new or updated records in the `dlt` pipeline. Please note the `write_disposition = "append"` used below.

```py
def load_incremental_id_table_resource() -> None:
@@ -133,7 +132,7 @@ Here’s a walkthrough:
print(info)
```

3. After running the `dlt` pipeline, the data loaded into BigQuery "contact" table looks like:
3. After running the `dlt` pipeline, the data loaded into the BigQuery "contact" table looks like:

| Row | id | name | created_at | _dlt_load_id | _dlt_id |
| --- | --- | --- | --- | --- | --- |
@@ -161,20 +160,20 @@ Here’s a walkthrough:

In this scenario, the pipeline appends new records (Charlie and Dave) to the existing data (Alice and Bob) without affecting the pre-existing entries. This strategy is ideal when only new data needs to be added, preserving the historical data.
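
The append behaviour can be sketched in plain Python, independent of `dlt`. This is a simplified model, not the library's implementation: `append_new_rows`, the `state` dictionary, and the strict "greater than the last seen value" comparison are assumptions, and `dlt`'s real cursor handling (for example, how it treats rows at the boundary value) may differ.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def append_new_rows(
    source_rows: List[Row],
    destination: List[Row],
    state: Dict[str, Any],
    cursor_column: str = "id",
) -> int:
    """Append only rows whose cursor value is above the last one seen."""
    last_value = state.get("last_value")
    new_rows = [
        row for row in source_rows
        if last_value is None or row[cursor_column] > last_value
    ]
    destination.extend(new_rows)
    if new_rows:
        # Remember the highest cursor value for the next run.
        state["last_value"] = max(row[cursor_column] for row in new_rows)
    return len(new_rows)


if __name__ == "__main__":
    source = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]
    destination: List[Row] = []
    state: Dict[str, Any] = {}

    append_new_rows(source, destination, state)  # first run loads Alice and Bob
    source += [{"id": 3, "name": "Charlie"}, {"id": 4, "name": "Dave"}]
    append_new_rows(source, destination, state)  # second run loads only the new rows
    print([row["name"] for row in destination])
```

The same function works with a timestamp cursor such as `created_at`, as long as the values compare in chronological order (ISO-formatted strings or `datetime` objects do).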

### 3. Append new records based on timestamp ("created_at")
### 3. Append new records based on timestamp ("created_at")

This strategy appends only new records to the table based on a date/timestamp field. It is useful for scenarios where records are created with a timestamp, and you want to load only those records created after a certain date.

Here’s a walkthrough:

1. The initial dataset, named "contact", in the SQL source looks like this:
1. The initial dataset, named "contact," in the SQL source looks like this:

| id | name | created_at |
| --- | --- | --- |
| 1 | Alice | 2024-07-01 00:00:00 |
| 2 | Bob | 2024-07-02 00:00:00 |

2. The python code illustrates the process of loading data from an SQL source into BigQuery using the `dlt` pipeline. Please note the `write_disposition = "append"`, with `created_at` being used as the incremental parameter.
2. The Python code illustrates the process of loading data from an SQL source into BigQuery using the `dlt` pipeline. Please note the `write_disposition = "append"`, with `created_at` being used as the incremental parameter.

```py
def load_incremental_timestamp_table_resource() -> None:
@@ -199,7 +198,7 @@ Here’s a walkthrough:
load_incremental_timestamp_table_resource()
```

3. After running the `dlt` pipeline, the data loaded into BigQuery "contact" table looks like:
3. After running the `dlt` pipeline, the data loaded into the BigQuery "contact" table looks like:

| Row | id | name | created_at | _dlt_load_id | _dlt_id |
| --- | --- | --- | --- | --- | --- |
@@ -225,13 +224,11 @@ Here’s a walkthrough:

**What happened?**

The pipeline adds new records (Charlie and Dave) that have a `created_at` timestamp after the specified initial value while
retaining the existing data (Alice and Bob). This approach is useful for loading data incrementally based on when it was created.
The pipeline adds new records (Charlie and Dave) that have a `created_at` timestamp after the specified initial value while retaining the existing data (Alice and Bob). This approach is useful for loading data incrementally based on when it was created.

### 4. Merge (Update/Insert) records based on timestamp ("last_modified_at") and ID
### 4. Merge (update/insert) records based on timestamp ("last_modified_at") and ID

This strategy merges records based on a composite key of ID and a timestamp field. It updates existing records and inserts
new ones as necessary.
This strategy merges records based on a composite key of ID and a timestamp field. It updates existing records and inserts new ones as necessary.

Here’s a walkthrough:

@@ -242,7 +239,7 @@ Here’s a walkthrough:
| 1 | Alice | 2024-07-01 00:00:00 |
| 2 | Bob | 2024-07-02 00:00:00 |

2. The Python code illustrates the process of loading data from an SQL source into BigQuery using the `dlt` pipeline Please note the `write_disposition = "merge"`, with `last_modified_at` being used as the incremental parameter.
2. The Python code illustrates the process of loading data from an SQL source into BigQuery using the `dlt` pipeline. Please note the `write_disposition = "merge"`, with `last_modified_at` being used as the incremental parameter.

```py
def load_merge_table_resource() -> None:
@@ -292,9 +289,7 @@ Here’s a walkthrough:

**What happened?**

The pipeline updates the record for Alice with the new data, including the updated `last_modified_at` timestamp, and adds a
new record for Hank. This method is beneficial when you need to ensure that records are both updated and inserted based on a
specific timestamp and ID.
The pipeline updates the record for Alice with the new data, including the updated `last_modified_at` timestamp, and adds a new record for Hank. This method is beneficial when you need to ensure that records are both updated and inserted based on a specific timestamp and ID.
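
A merge can be modelled as "select rows changed since the last run, then insert or overwrite by primary key". The sketch below illustrates that idea in plain Python and is not `dlt`'s merge implementation: `merge_rows`, the `state` dictionary, and the sample timestamps are hypothetical.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def merge_rows(
    source_rows: List[Row],
    destination: Dict[Any, Row],
    state: Dict[str, Any],
    primary_key: str = "id",
    cursor_column: str = "last_modified_at",
) -> int:
    """Upsert rows modified since the last run, keyed by primary key."""
    last_value = state.get("last_value")
    changed = [
        row for row in source_rows
        if last_value is None or row[cursor_column] > last_value
    ]
    for row in changed:
        # Existing keys are overwritten (update); new keys are added (insert).
        destination[row[primary_key]] = dict(row)
    if changed:
        state["last_value"] = max(row[cursor_column] for row in changed)
    return len(changed)


if __name__ == "__main__":
    destination: Dict[Any, Row] = {}
    state: Dict[str, Any] = {}
    source = [
        {"id": 1, "name": "Alice", "last_modified_at": "2024-07-01 00:00:00"},
        {"id": 2, "name": "Bob", "last_modified_at": "2024-07-02 00:00:00"},
    ]
    merge_rows(source, destination, state)

    # Hypothetical later changes: Alice is modified and Hank is new.
    source[0] = {"id": 1, "name": "Alice", "last_modified_at": "2024-07-08 00:00:00"}
    source.append({"id": 3, "name": "Hank", "last_modified_at": "2024-07-08 00:00:00"})
    merge_rows(source, destination, state)
    print(sorted(destination))
```

Because rows are keyed by `id`, Alice's record is replaced rather than duplicated, while Bob's untouched record is left as it was.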

The examples provided explain how to use `dlt` to achieve different incremental loading scenarios, highlighting the changes before and after running each pipeline.

The examples provided explain how to use `dlt` to achieve different incremental loading scenarios, highlighting the changes
before and after running each pipeline.