Docs: update terminology in the filesystem destination #1804

Merged
merged 1 commit into from
Sep 16, 2024
40 changes: 25 additions & 15 deletions docs/website/docs/dlt-ecosystem/destinations/filesystem.md
@@ -1,10 +1,14 @@
# Filesystem & buckets
The Filesystem destination stores data in remote file systems and bucket storages like **S3**, **Google Storage**, or **Azure Blob Storage**. Underneath, it uses [fsspec](https://github.com/fsspec/filesystem_spec) to abstract file operations. Its primary role is to be used as a staging for other destinations, but you can also quickly build a data lake with it.
# Cloud storage and filesystem
The filesystem destination stores data in remote file systems and cloud storage services like **AWS S3**, **Google Cloud Storage**, or **Azure Blob Storage**. Underneath, it uses [fsspec](https://github.com/fsspec/filesystem_spec) to abstract file operations. Its primary role is to be used as a staging area for other destinations, but you can also quickly build a data lake with it.

> 💡 Please read the notes on the layout of the data files. Currently, we are getting feedback on it. Please join our Slack (icon at the top of the page) and help us find the optimal layout.
:::tip
Please read the notes on the layout of the data files. Currently, we are getting feedback on it. Please join our Slack (icon at the top of the page) and help us find the optimal layout.
:::

## Install dlt with filesystem
**To install the dlt library with filesystem dependencies:**

Install the dlt library with filesystem dependencies:

```sh
pip install "dlt[filesystem]"
```
@@ -24,22 +28,24 @@ so pip does not fail on backtracking.
## Initialise the dlt project

Let's start by initializing a new dlt project as follows:
```sh
dlt init chess filesystem
```

:::note
This command will initialize your pipeline with chess as the source and the AWS S3 filesystem as the destination.
This command will initialize your pipeline with chess as the source and AWS S3 as the destination.
:::
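
For orientation, here is a minimal sketch of running a pipeline against the filesystem destination in Python. The inline list of dicts only stands in for the chess source that the scaffolded `chess_pipeline.py` wires up; the pipeline and dataset names are illustrative.

```python
import dlt

# A minimal sketch: load a tiny in-memory table to the filesystem destination.
# The scaffolded chess_pipeline.py uses the real chess source instead of this list.
pipeline = dlt.pipeline(
    pipeline_name="chess_pipeline",
    destination="filesystem",
    dataset_name="chess_players_games_data",
)

# Any iterable of dicts can be loaded; dlt infers the schema automatically.
load_info = pipeline.run(
    [{"player": "magnuscarlsen"}, {"player": "hikaru"}],
    table_name="players",
)
print(load_info)
```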

## Set up bucket storage and credentials
## Set up the destination and credentials

### AWS S3
The command above creates a sample `secrets.toml` and a requirements file for the AWS S3 bucket. You can install those dependencies by running:
```sh
pip install -r requirements.txt
```

To edit the `dlt` credentials file with your secret info, open `.dlt/secrets.toml`, which looks like this:
To edit the dlt credentials file with your secret info, open `.dlt/secrets.toml`, which looks like this:

```toml
[destination.filesystem]
bucket_url = "s3://[your_bucket_name]" # replace with your bucket name,
@@ -49,19 +55,21 @@ aws_access_key_id = "please set me up!" # copy the access key here
aws_secret_access_key = "please set me up!" # copy the secret access key here
```

If you have your credentials stored in `~/.aws/credentials`, just remove the **[destination.filesystem.credentials]** section above, and `dlt` will fall back to your **default** profile in local credentials. If you want to switch the profile, pass the profile name as follows (here: `dlt-ci-user`):
If you have your credentials stored in `~/.aws/credentials`, just remove the **[destination.filesystem.credentials]** section above, and dlt will fall back to your **default** profile in local credentials. If you want to switch the profile, pass the profile name as follows (here: `dlt-ci-user`):

```toml
[destination.filesystem.credentials]
profile_name="dlt-ci-user"
```

You can also pass an AWS region:
You can also specify an AWS region:

```toml
[destination.filesystem.credentials]
region_name="eu-central-1"
```
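
If you prefer to keep this configuration in code rather than in `secrets.toml`, a hedged sketch using the `dlt.destinations.filesystem` factory could look as follows. It assumes the factory accepts credentials as a plain dict; the bucket name, keys, and region are placeholders.

```python
import dlt

# A sketch of configuring the destination in code instead of .dlt/secrets.toml.
# All values below are placeholders, not working credentials.
dest = dlt.destinations.filesystem(
    bucket_url="s3://[your_bucket_name]",
    credentials={
        "aws_access_key_id": "please set me up!",
        "aws_secret_access_key": "please set me up!",
        "region_name": "eu-central-1",
    },
)

pipeline = dlt.pipeline(
    pipeline_name="chess_pipeline",
    destination=dest,
    dataset_name="chess_players_games_data",
)
```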

You need to create an S3 bucket and a user who can access that bucket. `dlt` does not create buckets automatically.
You need to create an S3 bucket and a user who can access that bucket. dlt does not create buckets automatically.

1. You can create the S3 bucket in the AWS console by clicking on "Create Bucket" in S3 and assigning the appropriate name and permissions to the bucket.
2. Once the bucket is created, you'll have the bucket URL. For example, if the bucket name is `dlt-ci-test-bucket`, then the bucket URL will be:
@@ -71,7 +79,7 @@ You need to create an S3 bucket and a user who can access that bucket. `dlt` does
```

3. To grant permissions to the user being used to access the S3 bucket, go to IAM > Users and click on “Add Permissions”.
4. Below you can find a sample policy that gives a minimum permission required by `dlt` to a bucket we created above. The policy contains permissions to list files in a bucket, get, put, and delete objects. **Remember to place your bucket name in the Resource section of the policy!**
4. Below you can find a sample policy that grants the minimum permissions required by dlt for the bucket we created above. The policy contains permissions to list files in a bucket, and to get, put, and delete objects. **Remember to place your bucket name in the Resource section of the policy!**

```json
{
@@ -146,6 +154,7 @@ if you have default google cloud credentials in your environment (i.e. on cloud
Use **Cloud Storage** admin to create a new bucket. Then assign the **Storage Object Admin** role to your service account.
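
If you would rather not keep the service account values in `secrets.toml`, the same keys can be supplied through environment variables. This is only a sketch and assumes the GCP credential fields dlt expects are `project_id`, `private_key`, and `client_email`; all values are placeholders.

```python
import os

# A sketch of supplying GCS service account credentials via environment variables;
# dlt resolves configuration keys using double-underscore separators.
# All values below are placeholders.
os.environ["DESTINATION__FILESYSTEM__BUCKET_URL"] = "gs://[your_bucket_name]"
os.environ["DESTINATION__FILESYSTEM__CREDENTIALS__PROJECT_ID"] = "please set me up!"
os.environ["DESTINATION__FILESYSTEM__CREDENTIALS__PRIVATE_KEY"] = "please set me up!"
os.environ["DESTINATION__FILESYSTEM__CREDENTIALS__CLIENT_EMAIL"] = "please set me up!"
```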

### Azure Blob Storage

Run `pip install "dlt[az]"`, which will install the `adlfs` package to interface with Azure Blob Storage.

Edit the credentials in `.dlt/secrets.toml`. You'll see AWS credentials by default; replace them with your Azure credentials.
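
The same values can also be provided through environment variables instead of `secrets.toml`; dlt resolves configuration keys using double-underscore separators. A small sketch with placeholder values:

```python
import os

# A sketch of supplying the Azure credentials via environment variables instead
# of .dlt/secrets.toml. All values below are placeholders.
os.environ["DESTINATION__FILESYSTEM__BUCKET_URL"] = "az://[your_container_name]"
os.environ["DESTINATION__FILESYSTEM__CREDENTIALS__AZURE_STORAGE_ACCOUNT_NAME"] = "please set me up!"
os.environ["DESTINATION__FILESYSTEM__CREDENTIALS__AZURE_STORAGE_ACCOUNT_KEY"] = "please set me up!"
```
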
@@ -196,6 +205,7 @@ max_concurrency=3
:::

### Local file system

If for any reason you want to have those files in a local folder, set up the `bucket_url` as follows (you are free to use `config.toml` for that, as there are no secrets required):

```toml
@@ -318,7 +328,7 @@ For more details on managing file compression, please visit our documentation on
All the files are stored in a single folder with the name of the dataset that you passed to the `run` or `load` methods of the `pipeline`. In our example chess pipeline, it is **chess_players_games_data**.

:::note
Bucket storages are, in fact, key-blob storage so the folder structure is emulated by splitting file names into components by separator (`/`).
Object storages are, in fact, key-blob stores, so the folder structure is emulated by splitting file names into components on the separator (`/`).
:::

You can control the file layout by specifying the desired configuration. There are several ways to do this.
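
One of them is the `layout` setting of the filesystem destination, which composes file paths from placeholders such as `{table_name}`, `{load_id}`, `{file_id}`, and `{ext}`. A hedged sketch of overriding it from Python (the pattern shown mirrors the default layout and is only illustrative):

```python
import os

# A sketch of overriding the file layout via an environment variable; the same
# key can also live under [destination.filesystem] in config.toml.
# The pattern below mirrors the default layout and is only illustrative.
os.environ["DESTINATION__FILESYSTEM__LAYOUT"] = "{table_name}/{load_id}.{file_id}.{ext}"
```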