Docs: make naming consistent in the cloud storage & file system source (
burnash authored Oct 4, 2024
1 parent d44d2be commit 6c504d0
Showing 6 changed files with 37 additions and 54 deletions.
8 changes: 4 additions & 4 deletions docs/website/docs/dlt-ecosystem/file-formats/csv.md
@@ -1,13 +1,13 @@
---
-title: csv
-description: The csv file format
+title: CSV
+description: The CSV file format
keywords: [csv, file formats]
---
import SetTheFormat from './_set_the_format.mdx';

# CSV file format

-**csv** is the most basic file format for storing tabular data, where all values are strings and are separated by a delimiter (typically a comma).
+**CSV** is the most basic file format for storing tabular data, where all values are strings and are separated by a delimiter (typically a comma).
`dlt` uses it for specific use cases - mostly for performance and compatibility reasons.
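
For illustration, a minimal CSV file is a header row followed by one record per line (the column names here are made up):

```
id,name,age
1,Alice,30
2,Bob,25
```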

Internally, we use two implementations:
@@ -16,7 +16,7 @@ Internally, we use two implementations:

## Supported destinations

-The `csv` format is supported by the following destinations: **Postgres**, **Filesystem**, **Snowflake**
+The CSV format is supported by the following destinations: **Postgres**, **Filesystem**, **Snowflake**

## How to configure

8 changes: 4 additions & 4 deletions docs/website/docs/dlt-ecosystem/file-formats/jsonl.md
@@ -1,11 +1,11 @@
---
-title: jsonl
-description: The jsonl file format
-keywords: [jsonl, file formats]
+title: JSONL
+description: The JSONL file format, or JSON Delimited, stores several JSON documents in one file. The JSON documents are separated by a new line.
+keywords: [jsonl, file formats, json delimited, jsonl file format]
---
import SetTheFormat from './_set_the_format.mdx';

-# jsonl - JSON delimited
+# JSONL - JSON Lines - JSON Delimited

JSON delimited is a file format that stores several JSON documents in one file. The JSON documents are separated by a new line.
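
For illustration, a two-document JSONL file looks like this (the field names are made up):

```
{"id": 1, "name": "Alice"}
{"id": 2, "name": "Bob"}
```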

@@ -1,5 +1,5 @@
---
-title: Advanced Filesystem Usage
+title: Advanced filesystem usage
description: Use filesystem source as a building block
keywords: [readers source and filesystem, files, filesystem, readers source, cloud storage]
---
@@ -54,7 +54,7 @@ When using a nested or recursive glob pattern, `relative_path` will include the

## Create your own transformer

-Although the `filesystem` resource yields the files from cloud storage or a local filesystem, you need to apply a transformer resource to retrieve the records from files. `dlt` natively supports three file types: `csv`, `parquet`, and `jsonl` (more details in [filesystem transformer resource](../filesystem/basic#2-choose-the-right-transformer-resource)).
+Although the `filesystem` resource yields the files from cloud storage or a local filesystem, you need to apply a transformer resource to retrieve the records from files. dlt natively supports three file types: [CSV](../../file-formats/csv.md), [Parquet](../../file-formats/parquet.md), and [JSONL](../../file-formats/jsonl.md) (more details in [filesystem transformer resource](../filesystem/basic#2-choose-the-right-transformer-resource)).

But you can easily create your own. In order to do this, you just need a function that takes as input a `FileItemDict` iterator and yields a list of records (recommended for performance) or individual records.

@@ -1,14 +1,14 @@
---
title: Filesystem source
description: Learn how to set up and configure
-keywords: [readers source and filesystem, files, filesystem, readers source, cloud storage]
+keywords: [readers source and filesystem, files, filesystem, readers source, cloud storage, object storage, local file system]
---
import Header from '../_source-info-header.md';
<Header/>

-Filesystem source allows loading files from remote locations (AWS S3, Google Cloud Storage, Google Drive, Azure Blob Storage, SFTP server) or the local filesystem seamlessly. Filesystem source natively supports `csv`, `parquet`, and `jsonl` files and allows customization for loading any type of structured files.
+Filesystem source allows loading files from remote locations (AWS S3, Google Cloud Storage, Google Drive, Azure Blob Storage, SFTP server) or the local filesystem seamlessly. Filesystem source natively supports [CSV](../../file-formats/csv.md), [Parquet](../../file-formats/parquet.md), and [JSONL](../../file-formats/jsonl.md) files and allows customization for loading any type of structured files.

-To load unstructured data (`.pdf`, `.txt`, e-mail), please refer to the [unstructured data source](https://github.com/dlt-hub/verified-sources/tree/master/sources/unstructured_data).
+To load unstructured data (PDF, plain text, e-mail), please refer to the [unstructured data source](https://github.com/dlt-hub/verified-sources/tree/master/sources/unstructured_data).

## How filesystem source works

@@ -145,11 +145,8 @@ You don't need any credentials for the local filesystem.

### Add credentials to dlt pipeline

-To provide credentials to the filesystem source, you can use [any method available](../../../general-usage/credentials/setup#available-config-providers) in `dlt`.
-One of the easiest ways is to use configuration files. The `.dlt` folder in your working directory
-contains two files: `config.toml` and `secrets.toml`. Sensitive information, like passwords and
-access tokens, should only be put into `secrets.toml`, while any other configuration, like the path to
-a bucket, can be specified in `config.toml`.
+To provide credentials to the filesystem source, you can use [any method available](../../../general-usage/credentials/setup#available-config-providers) in dlt.
+One of the easiest ways is to use configuration files. The `.dlt` folder in your working directory contains two files: `config.toml` and `secrets.toml`. Sensitive information, like passwords and access tokens, should only be put into `secrets.toml`, while any other configuration, like the path to a bucket, can be specified in `config.toml`.

<Tabs
groupId="filesystem-type"
@@ -252,35 +249,28 @@ bucket_url='~\Documents\csv_files\'

</Tabs>

-You can also specify the credentials using environment variables. The name of the corresponding environment
-variable should be slightly different from the corresponding name in the TOML file. Simply replace dots `.` with double
-underscores `__`:
+You can also specify the credentials using environment variables. The name of the corresponding environment variable should be slightly different from the corresponding name in the TOML file. Simply replace dots `.` with double underscores `__`:

```sh
export SOURCES__FILESYSTEM__AWS_ACCESS_KEY_ID="Please set me up!"
export SOURCES__FILESYSTEM__AWS_SECRET_ACCESS_KEY="Please set me up!"
```

:::tip
-`dlt` supports more ways of authorizing with cloud storage, including identity-based
-and default credentials. To learn more about adding credentials to your pipeline, please refer to the
-[Configuration and secrets section](../../../general-usage/credentials/complex_types#gcp-credentials).
+dlt supports more ways of authorizing with cloud storage, including identity-based and default credentials. To learn more about adding credentials to your pipeline, please refer to the [Configuration and secrets section](../../../general-usage/credentials/complex_types#gcp-credentials).
:::

## Usage

-The filesystem source is quite unique since it provides you with building blocks for loading data from files.
-First, it iterates over files in the storage and then processes each file to yield the records.
-Usually, you need two resources:
+The filesystem source is quite unique since it provides you with building blocks for loading data from files. First, it iterates over files in the storage and then processes each file to yield the records. Usually, you need two resources:

1. The `filesystem` resource enumerates files in a selected bucket using a glob pattern, returning details as `FileItem` in customizable page sizes.
2. One of the available transformer resources to process each file in a specific transforming function and yield the records.

### 1. Initialize a `filesystem` resource

:::note
-If you use just the `filesystem` resource, it will only list files in the storage based on glob parameters and yield the
-files [metadata](advanced#fileitem-fields). The `filesystem` resource itself does not read or copy files.
+If you use just the `filesystem` resource, it will only list files in the storage based on glob parameters and yield the files [metadata](advanced#fileitem-fields). The `filesystem` resource itself does not read or copy files.
:::
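
A minimal sketch of this listing-only behavior (the bucket URL and glob are illustrative; `file_name` is one of the [FileItem](advanced#fileitem-fields) fields):

```py
from dlt.sources.filesystem import filesystem

# Yields file metadata only; file contents are not read at this stage
files = filesystem(bucket_url="s3://my-bucket/data", file_glob="*.csv")
for file_item in files:
    print(file_item["file_name"])
```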

All parameters of the resource can be specified directly in code:
Expand Down Expand Up @@ -319,9 +309,8 @@ Full list of `filesystem` resource parameters:

### 2. Choose the right transformer resource

-The current implementation of the filesystem source natively supports three file types: `csv`, `parquet`, and `jsonl`.
-You can apply any of the above or [create your own transformer](advanced#create-your-own-transformer). To apply the selected transformer
-resource, use pipe notation `|`:
+The current implementation of the filesystem source natively supports three file types: CSV, Parquet, and JSONL.
+You can apply any of the above or [create your own transformer](advanced#create-your-own-transformer). To apply the selected transformer resource, use pipe notation `|`:

```py
from dlt.sources.filesystem import filesystem, read_csv

# The bucket URL and glob below are illustrative -- replace them with your own
filesystem_pipe = filesystem(
    bucket_url="s3://my-bucket/data", file_glob="*.csv"
) | read_csv()
```
@@ -334,17 +323,13 @@ filesystem_pipe = filesystem(

#### Available transformers

-- `read_csv()` - processes `csv` files using `pandas`
-- `read_jsonl()` - processes `jsonl` files chunk by chunk
-- `read_parquet()` - processes `parquet` files using `pyarrow`
-- `read_csv_duckdb()` - this transformer processes `csv` files using DuckDB, which usually shows better performance than `pandas`.
+- `read_csv()` - processes CSV files using [Pandas](https://pandas.pydata.org/)
+- `read_jsonl()` - processes JSONL files chunk by chunk
+- `read_parquet()` - processes Parquet files using [PyArrow](https://arrow.apache.org/docs/python/)
+- `read_csv_duckdb()` - this transformer processes CSV files using DuckDB, which usually shows better performance than pandas.
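
Each transformer plugs into the same pipe notation; for example, a minimal sketch with `read_jsonl()` (the bucket URL and glob are illustrative):

```py
from dlt.sources.filesystem import filesystem, read_jsonl

# Same pipe notation as with read_csv, but for JSONL files
jsonl_pipe = filesystem(bucket_url="s3://my-bucket/data", file_glob="*.jsonl") | read_jsonl()
```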

:::tip
-We advise that you give each resource a
-[specific name](../../../general-usage/resource#duplicate-and-rename-resources)
-before loading with `pipeline.run`. This will ensure that data goes to a table with the name you
-want and that each pipeline uses a
-[separate state for incremental loading.](../../../general-usage/state#read-and-write-pipeline-state-in-a-resource)
+We advise that you give each resource a [specific name](../../../general-usage/resource#duplicate-and-rename-resources) before loading with `pipeline.run`. This will ensure that data goes to a table with the name you want and that each pipeline uses a [separate state for incremental loading.](../../../general-usage/state#read-and-write-pipeline-state-in-a-resource)
:::
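
A sketch of such renaming, assuming the `with_name()` method on the piped resource (the bucket URL, glob, and the name `csv_data` are illustrative):

```py
from dlt.sources.filesystem import filesystem, read_csv

# The resource name also becomes the name of the destination table
csv_data = (
    filesystem(bucket_url="s3://my-bucket/data", file_glob="*.csv") | read_csv()
).with_name("csv_data")
```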

### 3. Create and run a pipeline
@@ -406,6 +391,7 @@ print(load_info)

In this example, we load only new records based on the field called `updated_at`. This method may be useful if you are not able to
filter files by modification date because, for example, all files are modified each time a new record appears.

```py
import dlt
from dlt.sources.filesystem import filesystem, read_csv

# The bucket URL, glob, and pipeline name are illustrative -- replace them with your own
new_table = filesystem(bucket_url="s3://my-bucket/data", file_glob="*.csv") | read_csv()
# Load only records whose "updated_at" value is greater than in the previous run
new_table.apply_hints(incremental=dlt.sources.incremental("updated_at"))

pipeline = dlt.pipeline(pipeline_name="incremental_example", destination="duckdb")
load_info = pipeline.run(new_table)
print(load_info)
```
@@ -462,6 +448,7 @@ print(load_info)

:::tip
You could also use `file_glob` to filter files by names. It works very well in simple cases, for example, filtering by extension:

```py
from dlt.sources.filesystem import filesystem

# Illustrative: "*.csv" keeps only CSV files; prepend "**/" to search subfolders too
filesystem_resource = filesystem(bucket_url="s3://my-bucket/data", file_glob="*.csv")
```
@@ -505,16 +492,12 @@ bucket_url = '\\?\C:\a\b\c'

### If you get an empty list of files

-If you are running a `dlt` pipeline with the filesystem source and get zero records, we recommend you check
+If you are running a dlt pipeline with the filesystem source and get zero records, we recommend you check
the configuration of `bucket_url` and `file_glob` parameters.

-For example, with Azure Blob storage, people sometimes mistake the account name for the container name. Make sure
-you've set up a URL as `"az://<container name>/"`.
+For example, with Azure Blob Storage, people sometimes mistake the account name for the container name. Make sure you've set up a URL as `"az://<container name>/"`.

-Also, please reference the [glob](https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.spec.AbstractFileSystem.glob)
-function to configure the resource correctly. Use `**` to include recursive files. Note that the local
-filesystem supports full Python [glob](https://docs.python.org/3/library/glob.html#glob.glob) functionality,
-while cloud storage supports a restricted `fsspec` [version](https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.spec.AbstractFileSystem.glob).
+Also, please reference the [glob](https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.spec.AbstractFileSystem.glob) function to configure the resource correctly. Use `**` to include recursive files. Note that the local filesystem supports full Python [glob](https://docs.python.org/3/library/glob.html#glob.glob) functionality, while cloud storage supports a restricted `fsspec` [version](https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.spec.AbstractFileSystem.glob).
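
A quick sketch of the difference between a shallow and a recursive glob (the bucket URL is illustrative):

```py
from dlt.sources.filesystem import filesystem

# Shallow: matches CSV files directly under the bucket root only
shallow = filesystem(bucket_url="s3://my-bucket/data", file_glob="*.csv")

# Recursive: "**" also matches CSV files in all subfolders
recursive = filesystem(bucket_url="s3://my-bucket/data", file_glob="**/*.csv")
```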

<!--@@@DLT_TUBA filesystem-->

@@ -1,18 +1,18 @@
---
title: Filesystem & cloud storage
description: dlt-verified source for Filesystem & cloud storage
keywords: [readers source and filesystem, files, filesystem, readers source, cloud storage]
title: Cloud storage and filesystem
description: dlt-verified source for reading files from cloud storage and local file system
keywords: [file system, files, filesystem, readers source, cloud storage, object storage, local file system]
---

-The Filesystem source allows seamless loading of files from the following locations:
+The filesystem source allows seamless loading of files from the following locations:
* AWS S3
* Google Cloud Storage
* Google Drive
* Azure Blob Storage
* remote filesystem (via SFTP)
* local filesystem

-The Filesystem source natively supports `csv`, `parquet`, and `jsonl` files and allows customization for loading any type of structured file.
+The filesystem source natively supports [CSV](../../file-formats/csv.md), [Parquet](../../file-formats/parquet.md), and [JSONL](../../file-formats/jsonl.md) files and allows customization for loading any type of structured file.

import DocCardList from '@theme/DocCardList';

2 changes: 1 addition & 1 deletion docs/website/sidebars.js
@@ -107,7 +107,7 @@ const sidebars = {
},
{
type: 'category',
-label: 'Filesystem & cloud storage',
+label: 'Cloud storage and filesystem',
description: 'AWS S3, Google Cloud Storage, Azure, SFTP, local file system',
link: {
type: 'doc',
