docs: fixing grammar files 60-80 (#1864)
* fixing grammar files 60-80

* Update docs/website/docs/dlt-ecosystem/verified-sources/google_sheets.md

* Apply suggestions from code review

* Update docs/website/docs/dlt-ecosystem/verified-sources/matomo.md

* Update docs/website/docs/dlt-ecosystem/verified-sources/personio.md

* Update docs/website/docs/dlt-ecosystem/verified-sources/personio.md

---------

Co-authored-by: Anton Burnashev <anton.burnashev@gmail.com>
sh-rp and burnash authored Sep 25, 2024
1 parent d9e9dea commit 27c110a
Showing 20 changed files with 350 additions and 436 deletions.
7 changes: 4 additions & 3 deletions docs/website/docs/dlt-ecosystem/file-formats/insert-format.md
Original file line number Diff line number Diff line change
@@ -5,7 +5,7 @@ keywords: [insert values, file formats]
---
import SetTheFormat from './_set_the_format.mdx';

# SQL INSERT File Format
# SQL INSERT file format

This file format contains an INSERT...VALUES statement to be executed on the destination during the `load` stage.

@@ -18,12 +18,13 @@ Additional data types are stored as follows:

This file format is [compressed](../../reference/performance.md#disabling-and-enabling-file-compression) by default.

## Supported Destinations
## Supported destinations

This format is used by default by: **DuckDB**, **Postgres**, **Redshift**, **Synapse**, **MSSQL**, **Motherduck**

It is also supported by: **Filesystem** if you'd like to store INSERT VALUES statements for some reason
It is also supported by: **Filesystem** if you'd like to store INSERT VALUES statements for some reason.

## How to configure

<SetTheFormat file_type="insert_values"/>

11 changes: 5 additions & 6 deletions docs/website/docs/dlt-ecosystem/file-formats/jsonl.md
@@ -5,10 +5,9 @@ keywords: [jsonl, file formats]
---
import SetTheFormat from './_set_the_format.mdx';

# jsonl - JSON Delimited
# jsonl - JSON delimited

JSON Delimited is a file format that stores several JSON documents in one file. The JSON
documents are separated by a new line.
JSON delimited is a file format that stores several JSON documents in one file. The JSON documents are separated by a new line.

Additional data types are stored as follows:

@@ -18,13 +17,13 @@ Additional data types are stored as follows:
- `HexBytes` is stored as a hex encoded string;
- `json` is serialized as a string.

This file format is
[compressed](../../reference/performance.md#disabling-and-enabling-file-compression) by default.
This file format is [compressed](../../reference/performance.md#disabling-and-enabling-file-compression) by default.

## Supported Destinations
## Supported destinations

This format is used by default by: **BigQuery**, **Snowflake**, **Filesystem**.

## How to configure

<SetTheFormat file_type="jsonl"/>
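
To illustrate the layout described in the jsonl.md changes above (several JSON documents in one file, one per line, compressed), here is a minimal sketch using only the Python standard library. The helper names `write_jsonl` and `read_jsonl` are hypothetical, and gzip is assumed as the compression codec purely for illustration; this is not how `dlt` itself writes the files.

```python
import gzip
import json
import os
import tempfile


def write_jsonl(path, documents):
    """Write one JSON document per line into a gzip-compressed file."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for doc in documents:
            # json.dumps escapes newlines inside strings, so every
            # document is guaranteed to occupy exactly one line.
            f.write(json.dumps(doc) + "\n")


def read_jsonl(path):
    """Read the documents back, one per non-empty line."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical sample data; the second document contains an embedded newline.
documents = [{"id": 1, "name": "first"}, {"id": 2, "note": "line\nbreak"}]
path = os.path.join(tempfile.mkdtemp(), "items.jsonl.gz")
write_jsonl(path, documents)
restored = read_jsonl(path)
```

Because the line break is the document separator, a reader can process the file one document at a time without parsing the whole file first.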

47 changes: 23 additions & 24 deletions docs/website/docs/dlt-ecosystem/file-formats/parquet.md
@@ -9,21 +9,21 @@ import SetTheFormat from './_set_the_format.mdx';

[Apache Parquet](https://en.wikipedia.org/wiki/Apache_Parquet) is a free and open-source column-oriented data storage format in the Apache Hadoop ecosystem. `dlt` is capable of storing data in this format when configured to do so.

To use this format, you need a `pyarrow` package. You can get this package as a `dlt` extra as well:
To use this format, you need the `pyarrow` package. You can get this package as a `dlt` extra as well:

```sh
pip install "dlt[parquet]"
```

## Supported Destinations
## Supported destinations

Supported by: **BigQuery**, **DuckDB**, **Snowflake**, **Filesystem**, **Athena**, **Databricks**, **Synapse**

## How to configure

<SetTheFormat file_type="parquet"/>

## Destination AutoConfig
## Destination autoconfig
`dlt` uses [destination capabilities](../../walkthroughs/create-new-destination.md#3-set-the-destination-capabilities) to configure the parquet writer:
* It uses decimal and wei precision to pick the right **decimal type** and sets precision and scale.
* It uses timestamp precision to pick the right **timestamp type** resolution (seconds, micro, or nano).
@@ -32,17 +32,17 @@ Supported by: **BigQuery**, **DuckDB**, **Snowflake**, **Filesystem**, **Athena*

Under the hood, `dlt` uses the [pyarrow parquet writer](https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetWriter.html) to create the files. The following options can be used to change the behavior of the writer:

- `flavor`: Sanitize schema or set other compatibility options to work with various target systems. Defaults to None which is **pyarrow** default.
- `flavor`: Sanitize schema or set other compatibility options to work with various target systems. Defaults to None, which is the **pyarrow** default.
- `version`: Determine which Parquet logical types are available for use, whether the reduced set from the Parquet 1.x.x format or the expanded logical types added in later format versions. Defaults to "2.6".
- `data_page_size`: Set a target threshold for the approximate encoded size of data pages within a column chunk (in bytes). Defaults to None which is **pyarrow** default.
- `data_page_size`: Set a target threshold for the approximate encoded size of data pages within a column chunk (in bytes). Defaults to None, which is the **pyarrow** default.
- `row_group_size`: Set the number of rows in a row group. [See here](#row-group-size) how this can optimize parallel processing of queries on your destination over the default setting of `pyarrow`.
- `timestamp_timezone`: A string specifying timezone, default is UTC.
- `coerce_timestamps`: resolution to which coerce timestamps, choose from **s**, **ms**, **us**, **ns**
- `allow_truncated_timestamps` - will raise if precision is lost on truncated timestamp.
- `timestamp_timezone`: A string specifying the timezone, default is UTC.
- `coerce_timestamps`: resolution to which to coerce timestamps, choose from **s**, **ms**, **us**, **ns**
- `allow_truncated_timestamps` - will raise if precision is lost on truncated timestamps.

:::tip
Default parquet version used by `dlt` is 2.4. It coerces timestamps to microseconds and truncates nanoseconds silently. Such setting
provides best interoperability with database systems, including loading panda frames which have nanosecond resolution by default
The default parquet version used by `dlt` is 2.4. It coerces timestamps to microseconds and truncates nanoseconds silently. Such a setting
provides the best interoperability with database systems, including loading panda frames which have nanosecond resolution by default.
:::

Read the [pyarrow parquet docs](https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetWriter.html) to learn more about these settings.
@@ -68,28 +68,27 @@ NORMALIZE__DATA_WRITER__TIMESTAMP_TIMEZONE
```

### Timestamps and timezones
`dlt` adds timezone (UTC adjustment) to all timestamps regardless of a precision (from seconds to nanoseconds). `dlt` will also create TZ aware timestamp columns in
the destinations. [duckdb is an exception here](../destinations/duckdb.md#supported-file-formats)
`dlt` adds timezone (UTC adjustment) to all timestamps regardless of the precision (from seconds to nanoseconds). `dlt` will also create TZ-aware timestamp columns in
the destinations. [DuckDB is an exception here](../destinations/duckdb.md#supported-file-formats).

### Disable timezones / utc adjustment flags
### Disable timezones / UTC adjustment flags
You can generate parquet files without timezone adjustment information in two ways:
1. Set the **flavor** to spark. All timestamps will be generated via deprecated `int96` physical data type, without the logical one
2. Set the **timestamp_timezone** to empty string (ie. `DATA_WRITER__TIMESTAMP_TIMEZONE=""`) to generate logical type without UTC adjustment.
1. Set the **flavor** to spark. All timestamps will be generated via the deprecated `int96` physical data type, without the logical one.
2. Set the **timestamp_timezone** to an empty string (i.e., `DATA_WRITER__TIMESTAMP_TIMEZONE=""`) to generate a logical type without UTC adjustment.

To our best knowledge, arrow will convert your timezone aware DateTime(s) to UTC and store them in parquet without timezone information.
To our best knowledge, Arrow will convert your timezone-aware DateTime(s) to UTC and store them in parquet without timezone information.
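
The conversion described in the sentence above (a timezone-aware value is shifted to UTC and then stored without timezone information) can be sketched with the standard library alone. The helper `to_naive_utc` and the fixed +02:00 offset are assumptions made for this example; the code mimics the described behavior and is not Arrow's or `dlt`'s implementation.

```python
from datetime import datetime, timedelta, timezone


def to_naive_utc(value: datetime) -> datetime:
    """Shift an aware datetime to UTC, then drop the timezone information."""
    if value.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    return value.astimezone(timezone.utc).replace(tzinfo=None)


# Hypothetical input: 14:30 at a fixed UTC+02:00 offset.
plus_two = timezone(timedelta(hours=2))
local = datetime(2024, 9, 25, 14, 30, tzinfo=plus_two)
stored = to_naive_utc(local)
```

Here `stored` is 12:30 with no `tzinfo`, so a reader of the stored value can no longer tell which offset the original value carried.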


### Row group size
The `pyarrow` parquet writer writes each item, i.e. table or record batch, in a separate row group.
This may lead to many small row groups which may not be optimal for certain query engines. For example, `duckdb` parallelizes on a row group.
`dlt` allows controlling the size of the row group by
[buffering and concatenating tables](../../reference/performance.md#controlling-in-memory-buffers) and batches before they are written. The concatenation is done as a zero-copy to save memory.
You can control the size of the row group by setting the maximum number of rows kept in the buffer.

The `pyarrow` parquet writer writes each item, i.e., table or record batch, in a separate row group. This may lead to many small row groups, which may not be optimal for certain query engines. For example, `duckdb` parallelizes on a row group. `dlt` allows controlling the size of the row group by [buffering and concatenating tables](../../reference/performance.md#controlling-in-memory-buffers) and batches before they are written. The concatenation is done as a zero-copy to save memory. You can control the size of the row group by setting the maximum number of rows kept in the buffer.

```toml
[extract.data_writer]
buffer_max_items=10e6
```
Mind that `dlt` holds the tables in memory. Thus, 1,000,000 rows in the example above may consume a significant amount of RAM.

`row_group_size` configuration setting has limited utility with `pyarrow` writer. It may be useful when you write single very large pyarrow tables
or when your in memory buffer is really large.
Keep in mind that `dlt` holds the tables in memory. Thus, 1,000,000 rows in the example above may consume a significant amount of RAM.

The `row_group_size` configuration setting has limited utility with the `pyarrow` writer. It may be useful when you write single very large pyarrow tables or when your in-memory buffer is really large.
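
The effect of buffering on row group count, as described in the row group size text above, can be shown with a small simulation. This is a sketch under stated assumptions: `row_group_sizes` is a hypothetical helper, and the flush rule (write one row group as soon as the buffered row count reaches the limit) is an assumption for illustration, not `dlt`'s actual writer logic.

```python
from typing import Iterable, List


def row_group_sizes(batch_sizes: Iterable[int], buffer_max_items: int) -> List[int]:
    """Return the row group sizes produced when incoming batches are buffered.

    Batches accumulate until the buffered row count reaches buffer_max_items,
    then everything buffered is written as a single row group.
    """
    groups: List[int] = []
    buffered = 0
    for size in batch_sizes:
        buffered += size
        if buffered >= buffer_max_items:
            groups.append(buffered)
            buffered = 0
    if buffered:
        # Flush whatever is left when the input ends.
        groups.append(buffered)
    return groups


batches = [1_000] * 50  # fifty small batches of 1,000 rows each

# A limit of 1 flushes after every batch: one row group per batch.
unbuffered = row_group_sizes(batches, buffer_max_items=1)

# A larger limit concatenates batches into fewer, bigger row groups.
buffered = row_group_sizes(batches, buffer_max_items=20_000)
```

With these inputs, the unbuffered run yields 50 row groups of 1,000 rows, while the buffered run yields three row groups of 20,000, 20,000, and 10,000 rows. The trade-off is the one the text names: larger buffers mean fewer row groups but more rows held in memory at once.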

Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
import Admonition from "@theme/Admonition";
import Link from '../../_book-onboarding-call.md';

<Admonition title="Need help deploying these sources, or figuring out how to run them in your data stack?">
<Admonition title="Need help deploying these sources or figuring out how to run them in your data stack?">
<a href="https://dlthub.com/community">Join our Slack community</a> or <Link/>.
</Admonition>
</Admonition>
