Quickstart now uses DuckDB WASM instead of CLI #6092

Merged
merged 9 commits into from
Jun 26, 2023
Binary file added docs/assets/img/quickstart/duckdb-editor-02.png
Binary file added docs/assets/img/quickstart/duckdb-editor-03.png
Binary file added docs/assets/img/quickstart/duckdb-editor-04.png
Binary file added docs/assets/img/quickstart/duckdb-editor-06.png
Binary file modified docs/assets/img/quickstart/duckdb-main-01.png
Binary file modified docs/assets/img/quickstart/duckdb-main-02.png
Binary file modified docs/assets/img/quickstart/duckdb-main-03.png
Binary file added docs/assets/img/quickstart/quickstart-repo.gif
Binary file removed docs/assets/img/quickstart/quickstart-repo.png
Binary file not shown.
Binary file modified docs/assets/img/quickstart/repo-contents.png
2 changes: 1 addition & 1 deletion docs/index.md
@@ -163,7 +163,7 @@ implement this logic yourself.

Instead, make updates to the desired data assets on a branch and then utilize a lakeFS merge to atomically expose the data to downstream consumers.

To learn more about atomic cross-collection updates, check out [this video](https://www.youtube.com/watch?v=9OsjUvk5UJU) which describes the concept in more detail, along with [this notebook](https://github.com/treeverse/lakeFS-samples/blob/main/notebooks/write-audit-publish/wap-lakefs.ipynb) in the [lakeFS samples repository](https://github.com/treeverse/lakeFS-samples/).
To learn more about atomic cross-collection updates, check out [this video](https://www.youtube.com/watch?v=9OsjUvk5UJU) which describes the concept in more detail, along with [this notebook](https://github.com/treeverse/lakeFS-samples/blob/main/00_notebooks/write-audit-publish/wap-lakefs.ipynb) in the [lakeFS samples repository](https://github.com/treeverse/lakeFS-samples/).



93 changes: 26 additions & 67 deletions docs/quickstart/branch.md
@@ -27,8 +27,8 @@ Now that lakectl is configured, we can use it to create the branch. Run the foll
```bash
docker exec lakefs \
lakectl branch create \
lakefs://quickstart/denmark-lakes \
--source lakefs://quickstart/main
lakefs://quickstart/denmark-lakes \
--source lakefs://quickstart/main
```

You should get a confirmation message like this:
@@ -40,41 +40,24 @@ created branch 'denmark-lakes' 3384cd7cdc4a2cd5eb6249b52f0a709b49081668bb1574ce8

## Transforming the Data

Now we'll make a change to the data. lakeFS has several native clients, as well as an S3-compatible endpoint. This means that anything that can use S3 will work with lakeFS. Pretty neat. We're going to use DuckDB, but unlike in the previous step where it was run within the lakeFS web page, we've got a standalone container running.
Now we'll make a change to the data. lakeFS has several native clients, as well as an [S3-compatible endpoint](https://docs.lakefs.io/understand/architecture.html#s3-gateway). This means that anything that can use S3 will work with lakeFS. Pretty neat.

### Setting up DuckDB
We're going to use DuckDB which is embedded within the web interface of lakeFS.

Run the following in a new terminal window to launch the DuckDB CLI:
From the lakeFS **Objects** page select the `lakes.parquet` file to open the DuckDB editor:

```bash
docker exec --interactive --tty lakefs duckdb
```

The first thing to do is configure the S3 connection so that DuckDB can access lakeFS, as well as tell DuckDB to report back how many rows are changed by the query we'll soon be executing. Run this from the DuckDB prompt:

```sql
SET s3_url_style='path';
SET s3_region='us-east-1';
SET s3_use_ssl=false;
SET s3_endpoint='localhost:8000';
.changes on
```
<img src="/assets/img/quickstart/duckdb-main-01.png" alt="The lakeFS object viewer with embedded DuckDB to query parquet files. A query has run automagically to preview the contents of the selected parquet file." class="quickstart"/>

In addition, replace your credentials in the following and then run it too.
To start with, we'll load the lakes data into a DuckDB table so that we can manipulate it. Replace the previous text in the DuckDB editor with this:

```sql
SET s3_access_key_id='YOUR-ACCESS-KEY-ID';
SET s3_secret_access_key='YOUR-SECRET-ACCESS-KEY';
CREATE OR REPLACE TABLE lakes AS
SELECT * FROM READ_PARQUET('lakefs://quickstart/denmark-lakes/lakes.parquet');
```

Now we'll load the lakes data into a DuckDB table so that we can manipulate it:
You'll see a row count of 100,000 to confirm that the DuckDB table has been created.

```sql
CREATE TABLE lakes AS
SELECT * FROM read_parquet('s3://quickstart/denmark-lakes/lakes.parquet');
```

Just to check that it's the same we saw before we're run the same query:
Just to check that it's the same data that we saw before we'll run the same query. Note that we are querying a DuckDB table (`lakes`), rather than using a function to query a parquet file directly.

```sql
SELECT country, COUNT(*)
@@ -84,18 +67,7 @@ ORDER BY COUNT(*)
DESC LIMIT 5;
```

```
┌──────────────────────────┬──────────────┐
│ Country │ count_star() │
│ varchar │ int64 │
├──────────────────────────┼──────────────┤
│ Canada │ 83819 │
│ United States of America │ 6175 │
│ Russia │ 2524 │
│ Denmark │ 1677 │
│ China │ 966 │
└──────────────────────────┴──────────────┘
```
<img src="/assets/img/quickstart/duckdb-editor-02.png" alt="The DuckDB editor pane querying the lakes table" class="quickstart"/>

### Making a Change to the Data

@@ -105,13 +77,10 @@ Now we can change our table, which was loaded from the original `lakes.parquet`,
DELETE FROM lakes WHERE Country != 'Denmark';
```

You'll see that 98k rows have been deleted:

```sql
changes: 98323 total_changes: 198323
```
<img src="/assets/img/quickstart/duckdb-editor-03.png" alt="The DuckDB editor pane deleting rows from the lakes table" class="quickstart"/>

We can verify that it's worked by reissuing the same query as before:

```sql
SELECT country, COUNT(*)
FROM lakes
@@ -120,23 +89,19 @@ ORDER BY COUNT(*)
DESC LIMIT 5;
```

```
┌─────────┬──────────────┐
│ Country │ count_star() │
│ varchar │ int64 │
├─────────┼──────────────┤
│ Denmark │ 1677 │
└─────────┴──────────────┘
```

<img src="/assets/img/quickstart/duckdb-editor-04.png" alt="The DuckDB editor pane querying the lakes table showing only rows for Denmark remain" class="quickstart"/>

## Write the Data back to lakeFS

The changes so far have only been to DuckDB's copy of the data. Let's now push it back to lakeFS. Note the S3 path is different this time as we're writing it to the `denmark-lakes` branch, not `main`:
The changes so far have only been to DuckDB's copy of the data. Let's now push it back to lakeFS. Note the path is different this time as we're writing it to the `denmark-lakes` branch, not `main`:

```sql
COPY lakes TO 's3://quickstart/denmark-lakes/lakes.parquet'
(FORMAT 'PARQUET', ALLOW_OVERWRITE TRUE);
COPY lakes TO 'lakefs://quickstart/denmark-lakes/lakes.parquet';
```

<img src="/assets/img/quickstart/duckdb-editor-05.png" alt="The DuckDB editor pane writing data back to the denmark-lakes branch" class="quickstart"/>
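
The only thing that separates the branch copy from the original is the second segment of the path. As a minimal sketch, assuming the `lakefs://<repository>/<ref>/<path>` layout that the URIs in this quickstart follow, the hypothetical helper below (`parse_lakefs_uri` is not part of `lakectl` or lakeFS) splits a URI into those three parts. It assumes a complete object URI and a path without spaces.

```bash
# Hypothetical helper for illustration only; not part of lakectl or lakeFS.
# Assumes a full object URI of the form lakefs://<repository>/<ref>/<path>.
parse_lakefs_uri() {
  local uri="$1"
  # Strip the scheme; if nothing was stripped, the scheme was missing.
  local rest="${uri#lakefs://}"
  if [ "$rest" = "$uri" ]; then
    echo "not a lakefs:// URI: $uri" >&2
    return 1
  fi
  # First segment is the repository, second is the ref (branch), rest is the path.
  local repo="${rest%%/*}"
  rest="${rest#*/}"
  local ref="${rest%%/*}"
  local path="${rest#*/}"
  printf '%s %s %s\n' "$repo" "$ref" "$path"
}

parse_lakefs_uri 'lakefs://quickstart/denmark-lakes/lakes.parquet'
# prints: quickstart denmark-lakes lakes.parquet
parse_lakefs_uri 'lakefs://quickstart/main/lakes.parquet'
# prints: quickstart main lakes.parquet
```

Both URIs share the repository (`quickstart`) and the object path (`lakes.parquet`); only the ref differs, which is why the `COPY` above leaves `main` untouched.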

## Verify that the Data's Changed on the Branch

Let's just confirm for ourselves that the parquet file itself has the new data. We'll drop the `lakes` table just to be sure, and then query the parquet file directly:
@@ -145,28 +110,22 @@ Let's just confirm for ourselves that the parquet file itself has the new data.
DROP TABLE lakes;

SELECT country, COUNT(*)
FROM read_parquet('s3://quickstart/denmark-lakes/lakes.parquet')
FROM READ_PARQUET('lakefs://quickstart/denmark-lakes/lakes.parquet')
GROUP BY country
ORDER BY COUNT(*)
DESC LIMIT 5;
```

```
┌─────────┬──────────────┐
│ Country │ count_star() │
│ varchar │ int64 │
├─────────┼──────────────┤
│ Denmark │ 1677 │
└─────────┴──────────────┘
```
<img src="/assets/img/quickstart/duckdb-editor-06.png" alt="The DuckDB editor pane showing that the parquet file on the denmark-lakes branch has been changed" class="quickstart"/>


## What about the data in `main`?

So we've changed the data in our `denmark-lakes` branch, deleting swathes of the dataset. What's this done to our original data in the `main` branch? Absolutely nothing! See for yourself by returning to the lakeFS object view and re-running the same query:
So we've changed the data in our `denmark-lakes` branch, deleting swathes of the dataset. What's this done to our original data in the `main` branch? Absolutely nothing! See for yourself by running the same query as above, but against the `main` branch:

```sql
SELECT country, COUNT(*)
FROM read_parquet('lakefs://quickstart/main/lakes.parquet')
FROM READ_PARQUET('lakefs://quickstart/main/lakes.parquet')
GROUP BY country
ORDER BY COUNT(*)
DESC LIMIT 5;
4 changes: 2 additions & 2 deletions docs/quickstart/commit-and-merge.md
@@ -39,8 +39,8 @@ As above, we'll use `lakectl` to do this too. The syntax just requires us to spe

```bash
docker exec lakefs \
lakectl merge \
lakefs://quickstart/denmark-lakes \
lakectl merge \
lakefs://quickstart/denmark-lakes \
lakefs://quickstart/main
```

8 changes: 4 additions & 4 deletions docs/quickstart/launch.md
@@ -18,9 +18,9 @@ _The quickstart uses Docker to bring up the lakeFS container, pre-populate it wi
Launch the lakeFS container:

```bash
docker run --name lakefs \
--rm --publish 8000:8000 \
treeverse/lakefs:latest-duckdb \
docker run --name lakefs --pull always \
--rm --publish 8000:8000 \
treeverse/lakefs:latest \
run --local-settings
```

@@ -50,4 +50,4 @@ You're now ready to dive into lakeFS!

You will see the sample repository created and the quickstart guide within it. You can follow along there, or here - it's the same :)

<img width="75%" src="/assets/img/quickstart/quickstart-repo.png" alt="The quickstart sample repo in lakeFS" class="quickstart"/>
<img width="75%" src="/assets/img/quickstart/quickstart-repo.gif" alt="The quickstart sample repo in lakeFS" class="quickstart"/>
2 changes: 1 addition & 1 deletion docs/quickstart/query.md
@@ -28,7 +28,7 @@ Copy and paste the following SQL statement into the DuckDB query panel and click

```sql
SELECT country, COUNT(*)
FROM read_parquet('lakefs://quickstart/main/lakes.parquet')
FROM READ_PARQUET('lakefs://quickstart/main/lakes.parquet')
GROUP BY country
ORDER BY COUNT(*)
DESC LIMIT 5;