Docs: fix capitalization of some terms, fix typos #1988

Merged
merged 7 commits into from
Oct 24, 2024

14 changes: 8 additions & 6 deletions docs/website/docs/dlt-ecosystem/destinations/athena.md
Original file line number Diff line number Diff line change
Expand Up @@ -100,7 +100,9 @@ Athena tables store timestamps with millisecond precision, and with that precisi

Athena does not support JSON fields, so JSON is stored as a string.

> ❗**Athena does not support TIME columns in parquet files**. `dlt` will fail such jobs permanently. Convert `datetime.time` objects to `str` or `datetime.datetime` to load them.
:::caution
**Athena does not support TIME columns in parquet files**. `dlt` will fail such jobs permanently. Convert `datetime.time` objects to `str` or `datetime.datetime` to load them.
:::
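The conversion recommended in the caution above can be sketched with the standard library alone. This is a minimal illustration, not part of `dlt`: the helper name `convert_time_fields` and its `anchor_date` parameter are hypothetical, and the row layout is assumed to be a flat dict.

```python
import datetime
from typing import Any, Dict, Optional


def convert_time_fields(
    row: Dict[str, Any], anchor_date: Optional[datetime.date] = None
) -> Dict[str, Any]:
    """Return a copy of `row` with every `datetime.time` value replaced.

    Hypothetical helper: without `anchor_date`, times become ISO 8601
    strings; with it, they become `datetime.datetime` values on that date.
    """
    converted: Dict[str, Any] = {}
    for key, value in row.items():
        if isinstance(value, datetime.time):
            if anchor_date is None:
                # str path: keeps only the time of day
                converted[key] = value.isoformat()
            else:
                # datetime path: attach the time to a calendar date
                converted[key] = datetime.datetime.combine(anchor_date, value)
        else:
            converted[key] = value
    return converted
```

A function like this could be applied to each item before it is yielded from a resource, so that no `TIME` column reaches the Parquet file.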

### Table and column identifiers

Expand Down Expand Up @@ -137,9 +139,10 @@ For every table created as an Iceberg table, the Athena destination will create

The `merge` write disposition is supported for Athena when using Iceberg tables.

> Note that:
> 1. There is a risk of tables ending up in an inconsistent state in case a pipeline run fails mid-flight because Athena doesn't support transactions, and `dlt` uses multiple DELETE/UPDATE/INSERT statements to implement `merge`.
> 2. `dlt` creates additional helper tables called `insert_<table name>` and `delete_<table name>` in the staging schema to work around Athena's lack of temporary tables.
:::note
1. There is a risk of tables ending up in an inconsistent state in case a pipeline run fails mid-flight because Athena doesn't support transactions, and `dlt` uses multiple DELETE/UPDATE/INSERT statements to implement `merge`.
2. `dlt` creates additional helper tables called `insert_<table name>` and `delete_<table name>` in the staging schema to work around Athena's lack of temporary tables.
:::

### dbt support

Expand All @@ -156,8 +159,7 @@ aws_data_catalog="awsdatacatalog"

## Supported file formats

You can choose the following file formats:
* [parquet](../file-formats/parquet.md) is used by default
* [Parquet](../file-formats/parquet.md) is used by default.

## Athena adapter

Expand Down
20 changes: 11 additions & 9 deletions docs/website/docs/dlt-ecosystem/destinations/bigquery.md
Original file line number Diff line number Diff line change
Expand Up @@ -146,8 +146,8 @@ this moment (they are stored as JSON), may be created. You can select certain re
[destination.bigquery]
autodetect_schema=true
```
We recommend yielding [arrow tables](../verified-sources/arrow-pandas.md) from your resources and using the `parquet` file format to load the data. In that case, the schemas generated by `dlt` and BigQuery
will be identical. BigQuery will also preserve the column order from the generated parquet files. You can convert `json` data into arrow tables with [pyarrow or duckdb](../verified-sources/arrow-pandas.md#loading-json-documents).
We recommend yielding [Arrow tables](../verified-sources/arrow-pandas.md) from your resources and using the Parquet file format to load the data. In that case, the schemas generated by `dlt` and BigQuery
will be identical. BigQuery will also preserve the column order from the generated Parquet files. You can convert JSON data into Arrow tables with [pyarrow or duckdb](../verified-sources/arrow-pandas.md#loading-json-documents).

```py
import pyarrow.json as paj
Expand Down Expand Up @@ -187,25 +187,25 @@ pipeline.run(
In the example below, we represent JSON data as tables up to nesting level 1. Above this nesting level, we let BigQuery create nested fields.

:::caution
If you yield data as Python objects (dicts) and load this data as `parquet`, the nested fields will be converted into strings. This is one of the consequences of
If you yield data as Python objects (dicts) and load this data as Parquet, the nested fields will be converted into strings. This is one of the consequences of
`dlt` not being able to infer nested fields.
:::

## Supported file formats

You can configure the following file formats to load data to BigQuery:

* [jsonl](../file-formats/jsonl.md) is used by default.
* [parquet](../file-formats/parquet.md) is supported.
* [JSONL](../file-formats/jsonl.md) is used by default.
* [Parquet](../file-formats/parquet.md) is supported.

When staging is enabled:

* [jsonl](../file-formats/jsonl.md) is used by default.
* [parquet](../file-formats/parquet.md) is supported.
* [JSONL](../file-formats/jsonl.md) is used by default.
* [Parquet](../file-formats/parquet.md) is supported.

:::caution
**BigQuery cannot load JSON columns from Parquet files**. `dlt` will fail such jobs permanently. Instead:
* Switch to `jsonl` to load and parse JSON properly.
* Switch to JSONL to load and parse JSON properly.
* Use schema [autodetect and nested fields](#use-bigquery-schema-autodetect-for-nested-fields)
:::

Expand Down Expand Up @@ -344,7 +344,8 @@ Some things to note with the adapter's behavior:
- You can cluster on as many columns as you would like.
- Sequential adapter calls on the same resource accumulate parameters, akin to an OR operation, for a unified execution.

> ❗ At the time of writing, table level options aren't supported for `ALTER` operations.
:::caution
At the time of writing, table level options aren't supported for `ALTER` operations.

Note that `bigquery_adapter` updates the resource *in place*, but returns the resource for convenience, i.e., both the following are valid:

Expand All @@ -354,6 +355,7 @@ my_resource = bigquery_adapter(my_resource, partition="partition_column_name")
```

Refer to the [full API specification](../../api_reference/destinations/impl/bigquery/bigquery_adapter) for more details.
:::

<!--@@@DLT_TUBA bigquery-->

21 changes: 10 additions & 11 deletions docs/website/docs/dlt-ecosystem/destinations/clickhouse.md
Original file line number Diff line number Diff line change
Expand Up @@ -24,7 +24,7 @@ Let's start by initializing a new `dlt` project as follows:
dlt init chess clickhouse
```

> 💡 This command will initialize your pipeline with chess as the source and ClickHouse as the destination.
The `dlt init` command will initialize your pipeline with chess as the source and ClickHouse as the destination.

The above command generates several files and directories, including `.dlt/secrets.toml` and a requirements file for ClickHouse. You can install the necessary dependencies specified in the requirements file by executing it as follows:

Expand Down Expand Up @@ -118,29 +118,28 @@ Data is loaded into ClickHouse using the most efficient method depending on the

## Datasets

`Clickhouse` does not support multiple datasets in one database; dlt relies on datasets to exist for multiple reasons.
To make `clickhouse` work with `dlt`, tables generated by `dlt` in your `clickhouse` database will have their names prefixed with the dataset name, separated by
ClickHouse does not support multiple datasets in one database; `dlt` relies on datasets to exist for multiple reasons.
To make ClickHouse work with `dlt`, tables generated by `dlt` in your ClickHouse database will have their names prefixed with the dataset name, separated by
the configurable `dataset_table_separator`.
Additionally, a special sentinel table that doesn't contain any data will be created, so dlt knows which virtual datasets already exist in a
clickhouse
destination.
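The naming scheme described above can be illustrated with a short sketch. The function name `prefixed_table_name` is hypothetical, and the separator value `___` used as the default here is an assumption for illustration; the actual value comes from the configurable `dataset_table_separator`.

```python
def prefixed_table_name(
    dataset_name: str, table_name: str, separator: str = "___"
) -> str:
    """Build the physical table name for a table in a virtual dataset.

    Hypothetical helper: joins the dataset name and the table name with
    the separator (assumed default, see `dataset_table_separator`).
    """
    return f"{dataset_name}{separator}{table_name}"


# Tables from two virtual datasets can share one ClickHouse database
# because the dataset prefix keeps their names apart.
names = [
    prefixed_table_name("chess_data", "players"),
    prefixed_table_name("chess_data_staging", "players"),
]
```

The separator should be chosen so that it does not occur in dataset names, otherwise the prefix cannot be told apart from the table name.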

## Supported file formats

- [jsonl](../file-formats/jsonl.md) is the preferred format for both direct loading and staging.
- [parquet](../file-formats/parquet.md) is supported for both direct loading and staging.
- [JSONL](../file-formats/jsonl.md) is the preferred format for both direct loading and staging.
- [Parquet](../file-formats/parquet.md) is supported for both direct loading and staging.

The `clickhouse` destination has a few specific deviations from the default SQL destinations:

1. `Clickhouse` has an experimental `object` datatype, but we've found it to be a bit unpredictable, so the dlt clickhouse destination will load the `json` datatype to a `text` column.
1. ClickHouse has an experimental `object` datatype, but we've found it to be a bit unpredictable, so the dlt `clickhouse` destination will load the `json` datatype to a `text` column.
If you need
this feature, get in touch with our Slack community, and we will consider adding it.
2. `Clickhouse` does not support the `time` datatype. Time will be loaded to a `text` column.
3. `Clickhouse` does not support the `binary` datatype. Binary will be loaded to a `text` column. When loading from `jsonl`, this will be a base64 string; when loading from parquet, this will be
2. ClickHouse does not support the `time` datatype. Time will be loaded to a `text` column.
3. ClickHouse does not support the `binary` datatype. Binary will be loaded to a `text` column. When loading from JSONL, this will be a base64 string; when loading from Parquet, this will be
the `binary` object converted to `text`.
4. `Clickhouse` accepts adding columns to a populated table that aren’t null.
5. `Clickhouse` can produce rounding errors under certain conditions when using the float/double datatype. Make sure to use decimal if you can’t afford to have rounding errors. Loading the value
12.7001 to a double column with the loader file format jsonl set will predictably produce a rounding error, for example.
4. ClickHouse accepts adding columns to a populated table that aren’t null.
5. ClickHouse can produce rounding errors under certain conditions when using the float/double datatype. Make sure to use decimal if you can’t afford to have rounding errors. For example, loading the value 12.7001 to a double column with the loader file format set to JSONL will predictably produce a rounding error.
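The rounding issue in the last item is a property of binary floating point in general, and it can be observed in plain Python. The sketch below does not involve ClickHouse or `dlt`; the helper name `is_exact_as_float` is hypothetical and only shows why 12.7001 cannot be stored exactly in a double while a decimal type keeps it intact.

```python
from decimal import Decimal


def is_exact_as_float(text: str) -> bool:
    """Check whether a decimal literal survives conversion to a double.

    Decimal(float) exposes the exact binary value the float holds, so
    comparing it with Decimal(text) reveals any rounding.
    """
    return Decimal(float(text)) == Decimal(text)


# 12.7001 has no finite binary expansion, so the double is only close to it.
rounded = not is_exact_as_float("12.7001")

# 12.5 is a sum of powers of two (8 + 4 + 0.5) and is stored exactly.
exact = is_exact_as_float("12.5")
```

Declaring such columns with a decimal data type avoids the conversion through a double altogether.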

## Supported column hints

Expand Down
2 changes: 1 addition & 1 deletion docs/website/docs/dlt-ecosystem/destinations/databricks.md
Original file line number Diff line number Diff line change
Expand Up @@ -143,7 +143,7 @@ The JSONL format has some limitations when used with Databricks:

1. Compression must be disabled to load jsonl files in Databricks. Set `data_writer.disable_compression` to `true` in the dlt config when using this format.
2. The following data types are not supported when using the JSONL format with `databricks`: `decimal`, `json`, `date`, `binary`. Use `parquet` if your data contains these types.
3. The `bigint` data type with precision is not supported with the `jsonl` format.
3. The `bigint` data type with precision is not supported with the JSONL format.

## Staging support

Expand Down
8 changes: 6 additions & 2 deletions docs/website/docs/dlt-ecosystem/destinations/dremio.md
Original file line number Diff line number Diff line change
Expand Up @@ -74,13 +74,17 @@ profile_name="dlt-ci-user"
- `replace`
- `merge`

> The `merge` write disposition uses the default DELETE/UPDATE/INSERT strategy to merge data into the destination. Be aware that Dremio does not support transactions, so a partial pipeline failure can result in the destination table being in an inconsistent state. The `merge` write disposition will eventually be implemented using [MERGE INTO](https://docs.dremio.com/current/reference/sql/commands/apache-iceberg-tables/apache-iceberg-merge/) to resolve this issue.
:::note
The `merge` write disposition uses the default DELETE/UPDATE/INSERT strategy to merge data into the destination. Be aware that Dremio does not support transactions, so a partial pipeline failure can result in the destination table being in an inconsistent state. The `merge` write disposition will eventually be implemented using [MERGE INTO](https://docs.dremio.com/current/reference/sql/commands/apache-iceberg-tables/apache-iceberg-merge/) to resolve this issue.
:::

## Data loading

Data loading happens by copying staged parquet files from an object storage bucket to the destination table in Dremio using [COPY INTO](https://docs.dremio.com/cloud/reference/sql/commands/copy-into-table/) statements. The destination table format is specified by the storage format for the data source in Dremio. Typically, this will be Apache Iceberg.

> ❗ **Dremio cannot load `fixed_len_byte_array` columns from `parquet` files**.
:::caution
Dremio cannot load `fixed_len_byte_array` columns from Parquet files.
:::

## Dataset creation

Expand Down
8 changes: 4 additions & 4 deletions docs/website/docs/dlt-ecosystem/destinations/duckdb.md
Original file line number Diff line number Diff line change
Expand Up @@ -33,7 +33,7 @@ python3 chess_pipeline.py
All write dispositions are supported.

## Data loading
`dlt` will load data using large INSERT VALUES statements by default. Loading is multithreaded (20 threads by default). If you are okay with installing `pyarrow`, we suggest switching to `parquet` as the file format. Loading is faster (and also multithreaded).
`dlt` will load data using large INSERT VALUES statements by default. Loading is multithreaded (20 threads by default). If you are okay with installing `pyarrow`, we suggest switching to Parquet as the file format. Loading is faster (and also multithreaded).

### Data types
`duckdb` supports various [timestamp types](https://duckdb.org/docs/sql/data_types/timestamp.html). These can be configured using the column flags `timezone` and `precision` in the `dlt.resource` decorator or the `pipeline.run` method.
Expand Down Expand Up @@ -95,11 +95,11 @@ dlt.config["schema.naming"] = "duck_case"
## Supported file formats
You can configure the following file formats to load data into duckdb:
* [insert-values](../file-formats/insert-format.md) is used by default.
* [parquet](../file-formats/parquet.md) is supported.
* [Parquet](../file-formats/parquet.md) is supported.
:::note
`duckdb` cannot COPY many parquet files to a single table from multiple threads. In this situation, `dlt` serializes the loads. Still, that may be faster than INSERT.
`duckdb` cannot COPY many Parquet files to a single table from multiple threads. In this situation, `dlt` serializes the loads. Still, that may be faster than INSERT.
:::
* [jsonl](../file-formats/jsonl.md)
* [JSONL](../file-formats/jsonl.md)

:::tip
`duckdb` has [timestamp types](https://duckdb.org/docs/sql/data_types/timestamp.html) with resolutions from milliseconds to nanoseconds. However,
Expand Down
10 changes: 5 additions & 5 deletions docs/website/docs/dlt-ecosystem/destinations/filesystem.md
Original file line number Diff line number Diff line change
Expand Up @@ -612,9 +612,9 @@ Adopting this layout offers several advantages:
## Supported file formats

You can choose the following file formats:
* [jsonl](../file-formats/jsonl.md) is used by default
* [parquet](../file-formats/parquet.md) is supported
* [csv](../file-formats/csv.md) is supported
* [JSONL](../file-formats/jsonl.md) is used by default
* [Parquet](../file-formats/parquet.md) is supported
* [CSV](../file-formats/csv.md) is supported

## Supported table formats

Expand Down Expand Up @@ -643,7 +643,7 @@ def my_delta_resource():
...
```

> `dlt` always uses `parquet` as `loader_file_format` when using the `delta` table format. Any setting of `loader_file_format` is disregarded.
> `dlt` always uses Parquet as `loader_file_format` when using the `delta` table format. Any setting of `loader_file_format` is disregarded.

#### Delta table partitioning
A Delta table can be partitioned ([Hive-style partitioning](https://delta.io/blog/pros-cons-hive-style-partionining/)) by specifying one or more `partition` column hints. This example partitions the Delta table by the `foo` column:
Expand Down Expand Up @@ -709,7 +709,7 @@ When a load generates a new state, for example when using incremental loads, a n
When running your pipeline, you might encounter an error like `[Errno 36] File name too long Error`. This error occurs because the generated file name exceeds the maximum allowed length on your filesystem.

To prevent the file name length error, set the `max_identifier_length` parameter for your destination. This truncates all identifiers (including filenames) to a specified maximum length.
For example:
For example:

```py
from dlt.destinations import duckdb
Expand Down
16 changes: 9 additions & 7 deletions docs/website/docs/dlt-ecosystem/destinations/redshift.md
Original file line number Diff line number Diff line change
Expand Up @@ -75,16 +75,18 @@ All [write dispositions](../../general-usage/incremental-loading#choosing-a-writ
[SQL Insert](../file-formats/insert-format) is used by default.

When staging is enabled:
* [jsonl](../file-formats/jsonl.md) is used by default.
* [parquet](../file-formats/parquet.md) is supported.
* [JSONL](../file-formats/jsonl.md) is used by default.
* [Parquet](../file-formats/parquet.md) is supported.

> ❗ **Redshift cannot load `VARBYTE` columns from `json` files**. `dlt` will fail such jobs permanently. Switch to `parquet` to load binaries.
:::caution
- **Redshift cannot load `VARBYTE` columns from JSON files**. `dlt` will fail such jobs permanently. Switch to Parquet to load binaries.

> ❗ **Redshift cannot load `TIME` columns from `json` or `parquet` files**. `dlt` will fail such jobs permanently. Switch to direct `insert_values` to load time columns.
- **Redshift cannot load `TIME` columns from JSON or Parquet files**. `dlt` will fail such jobs permanently. Switch to direct `insert_values` to load time columns.

> ❗ **Redshift cannot detect compression type from `json` files**. `dlt` assumes that `jsonl` files are gzip compressed, which is the default.
- **Redshift cannot detect compression type from JSON files**. `dlt` assumes that JSONL files are gzip compressed, which is the default.

> ❗ **Redshift loads `json` types as strings into SUPER with `parquet`**. Use `jsonl` format to store JSON in SUPER natively or transform your SUPER columns with `PARSE_JSON`.
- **Redshift loads JSON types as strings into SUPER with Parquet**. Use JSONL format to store JSON in SUPER natively or transform your SUPER columns with `PARSE_JSON`.
:::

## Supported column hints

Expand Down Expand Up @@ -147,7 +149,7 @@ pipeline = dlt.pipeline(

## Supported loader file formats

Supported loader file formats for Redshift are `sql` and `insert_values` (default). When using a staging location, Redshift supports `parquet` and `jsonl`.
Supported loader file formats for Redshift are `sql` and `insert_values` (default). When using a staging location, Redshift supports Parquet and JSONL.

<!--@@@DLT_TUBA redshift-->

14 changes: 7 additions & 7 deletions docs/website/docs/dlt-ecosystem/destinations/snowflake.md
Original file line number Diff line number Diff line change
Expand Up @@ -170,17 +170,17 @@ pipeline.run(events())

## Supported file formats
* [insert-values](../file-formats/insert-format.md) is used by default.
* [parquet](../file-formats/parquet.md) is supported.
* [jsonl](../file-formats/jsonl.md) is supported.
* [csv](../file-formats/csv.md) is supported.
* [Parquet](../file-formats/parquet.md) is supported.
* [JSONL](../file-formats/jsonl.md) is supported.
* [CSV](../file-formats/csv.md) is supported.

When staging is enabled:
* [jsonl](../file-formats/jsonl.md) is used by default.
* [parquet](../file-formats/parquet.md) is supported.
* [csv](../file-formats/csv.md) is supported.
* [JSONL](../file-formats/jsonl.md) is used by default.
* [Parquet](../file-formats/parquet.md) is supported.
* [CSV](../file-formats/csv.md) is supported.

:::caution
When loading from `parquet`, Snowflake will store `json` types (JSON) in `VARIANT` as a string. Use the `jsonl` format instead or use `PARSE_JSON` to update the `VARIANT` field after loading.
When loading from Parquet, Snowflake will store `json` types (JSON) in `VARIANT` as a string. Use the JSONL format instead or use `PARSE_JSON` to update the `VARIANT` field after loading.
:::

### Custom CSV formats
Expand Down
2 changes: 1 addition & 1 deletion docs/website/docs/dlt-ecosystem/destinations/sqlalchemy.md
Original file line number Diff line number Diff line change
Expand Up @@ -154,7 +154,7 @@ For example, SQLite does not have `DATETIME` or `TIMESTAMP` types, so `timestamp
## Supported file formats

* [typed-jsonl](../file-formats/jsonl.md) is used by default. JSON-encoded data with typing information included.
* [parquet](../file-formats/parquet.md) is supported.
* [Parquet](../file-formats/parquet.md) is supported.

## Supported column hints

Expand Down
4 changes: 2 additions & 2 deletions docs/website/docs/dlt-ecosystem/destinations/synapse.md
Original file line number Diff line number Diff line change
Expand Up @@ -138,10 +138,10 @@ Data is loaded via `INSERT` statements by default.

## Supported file formats
* [insert-values](../file-formats/insert-format.md) is used by default
* [parquet](../file-formats/parquet.md) is used when [staging](#staging-support) is enabled
* [Parquet](../file-formats/parquet.md) is used when [staging](#staging-support) is enabled

## Data type limitations
* **Synapse cannot load `TIME` columns from `parquet` files**. `dlt` will fail such jobs permanently. Use the `insert_values` file format instead, or convert `datetime.time` objects to `str` or `datetime.datetime` to load `TIME` columns.
* **Synapse cannot load `TIME` columns from Parquet files**. `dlt` will fail such jobs permanently. Use the `insert_values` file format instead, or convert `datetime.time` objects to `str` or `datetime.datetime` to load `TIME` columns.
* **Synapse does not have a nested/JSON/struct data type**. The `dlt` `json` data type is mapped to the `nvarchar` type in Synapse.

## Table index type
Expand Down
2 changes: 1 addition & 1 deletion docs/website/docs/dlt-ecosystem/file-formats/csv.md
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,7 @@ import SetTheFormat from './_set_the_format.mdx';
`dlt` uses it for specific use cases - mostly for performance and compatibility reasons.

Internally, we use two implementations:
- **pyarrow** csv writer - a very fast, multithreaded writer for [arrow tables](../verified-sources/arrow-pandas.md)
- **pyarrow** CSV writer - a very fast, multithreaded writer for [Arrow tables](../verified-sources/arrow-pandas.md)
- **python stdlib writer** - a csv writer included in the Python standard library for Python objects

## Supported destinations
Expand Down