Quickstart now uses DuckDB WASM instead of CLI (#6092)

* Update Quickstart to use DuckDB WASM
* Updated Quickstart docs and samplerepo README to use DuckDB WASM
* Fix images for obsolete references to lakefs_object, also a typo in alt text
* Use repo name placeholder
* Add statement terminator (c.f. #6107)
treeverse · Jun 26, 2023 · ad1bdda
1 parent ed46ba9
Showing 29 changed files with 70 additions and 179 deletions.
diff --git a/docs/assets/img/quickstart/duckdb-editor-02.png b/docs/assets/img/quickstart/duckdb-editor-02.png
diff --git a/docs/assets/img/quickstart/duckdb-editor-03.png b/docs/assets/img/quickstart/duckdb-editor-03.png
diff --git a/docs/assets/img/quickstart/duckdb-editor-04.png b/docs/assets/img/quickstart/duckdb-editor-04.png
diff --git a/docs/assets/img/quickstart/duckdb-editor-05.png b/docs/assets/img/quickstart/duckdb-editor-05.png
diff --git a/docs/assets/img/quickstart/duckdb-editor-06.png b/docs/assets/img/quickstart/duckdb-editor-06.png
diff --git a/docs/assets/img/quickstart/duckdb-main-01.png b/docs/assets/img/quickstart/duckdb-main-01.png
diff --git a/docs/assets/img/quickstart/duckdb-main-02.png b/docs/assets/img/quickstart/duckdb-main-02.png
diff --git a/docs/assets/img/quickstart/duckdb-main-03.png b/docs/assets/img/quickstart/duckdb-main-03.png
diff --git a/docs/assets/img/quickstart/quickstart-repo.gif b/docs/assets/img/quickstart/quickstart-repo.gif
diff --git a/docs/assets/img/quickstart/quickstart-repo.png b/docs/assets/img/quickstart/quickstart-repo.png
diff --git a/docs/assets/img/quickstart/repo-contents.png b/docs/assets/img/quickstart/repo-contents.png
diff --git a/docs/quickstart/branch.md b/docs/quickstart/branch.md
@@ -27,8 +27,8 @@ Now that lakectl is configured, we can use it to create the branch. Run the foll
 ```bash
 docker exec lakefs \
     lakectl branch create \
-	    lakefs://quickstart/denmark-lakes \
-		--source lakefs://quickstart/main
+            lakefs://quickstart/denmark-lakes \
+		    --source lakefs://quickstart/main
 ```
 
 You should get a confirmation message like this:
@@ -40,41 +40,24 @@ created branch 'denmark-lakes' 3384cd7cdc4a2cd5eb6249b52f0a709b49081668bb1574ce8
 
 ## Transforming the Data
 
-Now we'll make a change to the data. lakeFS has several native clients, as well as an S3-compatible endpoint. This means that anything that can use S3 will work with lakeFS. Pretty neat. We're going to use DuckDB, but unlike in the previous step where it was run within the lakeFS web page, we've got a standalone container running. 
+Now we'll make a change to the data. lakeFS has several native clients, as well as an [S3-compatible endpoint](https://docs.lakefs.io/understand/architecture.html#s3-gateway). This means that anything that can use S3 will work with lakeFS. Pretty neat.
 
-### Setting up DuckDB
+We're going to use DuckDB which is embedded within the web interface of lakeFS. 
 
-Run the following in a new terminal window to launch the DuckDB CLI:
+From the lakeFS **Objects** page select the `lakes.parquet` file to open the DuckDB editor: 
 
-```bash
-docker exec --interactive --tty lakefs duckdb
-```
-
-The first thing to do is configure the S3 connection so that DuckDB can access lakeFS, as well as tell DuckDB to report back how many rows are changed by the query we'll soon be executing. Run this from the DuckDB prompt: 
-
-```sql
-SET s3_url_style='path';
-SET s3_region='us-east-1';
-SET s3_use_ssl=false;
-SET s3_endpoint='localhost:8000';
-.changes on
-```
+<img src="/assets/img/quickstart/duckdb-main-01.png" alt="The lakeFS object viewer with embedded DuckDB to query parquet files. A query has run automagically to preview the contents of the selected parquet file." class="quickstart"/>
 
-In addition, replace your credentials in the following and then run it too. 
+To start with, we'll load the lakes data into a DuckDB table so that we can manipulate it. Replace the previous text in the DuckDB editor with this: 
 
 ```sql
-SET s3_access_key_id='YOUR-ACCESS-KEY-ID';
-SET s3_secret_access_key='YOUR-SECRET-ACCESS-KEY';
+CREATE OR REPLACE TABLE lakes AS 
+    SELECT * FROM READ_PARQUET('lakefs://quickstart/denmark-lakes/lakes.parquet');
 ```
 
-Now we'll load the lakes data into a DuckDB table so that we can manipulate it:
+You'll see a row count of 100,000 to confirm that the DuckDB table has been created. 
 
-```sql
-CREATE TABLE lakes AS 
-    SELECT * FROM read_parquet('s3://quickstart/denmark-lakes/lakes.parquet');
-```
-
-Just to check that it's the same we saw before we're run the same query: 
+Just to check that it's the same data that we saw before we'll run the same query. Note that we are querying a DuckDB table (`lakes`), rather than using a function to query a parquet file directly. 
 
 ```sql
 SELECT   country, COUNT(*)
@@ -84,18 +67,7 @@ ORDER BY COUNT(*)
 DESC LIMIT 5;
 ```
 
-```
-┌──────────────────────────┬──────────────┐
-│         Country          │ count_star() │
-│         varchar          │    int64     │
-├──────────────────────────┼──────────────┤
-│ Canada                   │        83819 │
-│ United States of America │         6175 │
-│ Russia                   │         2524 │
-│ Denmark                  │         1677 │
-│ China                    │          966 │
-└──────────────────────────┴──────────────┘
-```
+<img src="/assets/img/quickstart/duckdb-editor-02.png" alt="The DuckDB editor pane querying the lakes table" class="quickstart"/>
 
 ### Making a Change to the Data
 
@@ -105,13 +77,10 @@ Now we can change our table, which was loaded from the original `lakes.parquet`,
 DELETE FROM lakes WHERE Country != 'Denmark';
 ```
 
-You'll see that 98k rows have been deleted: 
-
-```sql
-changes: 98323   total_changes: 198323
-```
+<img src="/assets/img/quickstart/duckdb-editor-03.png" alt="The DuckDB editor pane deleting rows from the lakes table" class="quickstart"/>
 
 We can verify that it's worked by reissuing the same query as before:
+
 ```sql
 SELECT   country, COUNT(*)
 FROM     lakes
@@ -120,23 +89,19 @@ ORDER BY COUNT(*)
 DESC LIMIT 5;
 ```
 
-```
-┌─────────┬──────────────┐
-│ Country │ count_star() │
-│ varchar │    int64     │
-├─────────┼──────────────┤
-│ Denmark │         1677 │
-└─────────┴──────────────┘
-```
+
+<img src="/assets/img/quickstart/duckdb-editor-04.png" alt="The DuckDB editor pane querying the lakes table showing only rows for Denmark remain" class="quickstart"/>
+
 ## Write the Data back to lakeFS
 
-The changes so far have only been to DuckDB's copy of the data. Let's now push it back to lakeFS. Note the S3 path is different this time as we're writing it to the `denmark-lakes` branch, not `main`: 
+The changes so far have only been to DuckDB's copy of the data. Let's now push it back to lakeFS. Note the path is different this time as we're writing it to the `denmark-lakes` branch, not `main`: 
 
 ```sql
-COPY lakes TO 's3://quickstart/denmark-lakes/lakes.parquet' 
-    (FORMAT 'PARQUET', ALLOW_OVERWRITE TRUE);
+COPY lakes TO 'lakefs://quickstart/denmark-lakes/lakes.parquet';
 ```
 
+<img src="/assets/img/quickstart/duckdb-editor-05.png" alt="The DuckDB editor pane writing data back to the denmark-lakes branch" class="quickstart"/>
+
 ## Verify that the Data's Changed on the Branch
 
 Let's just confirm for ourselves that the parquet file itself has the new data. We'll drop the `lakes` table just to be sure, and then query the parquet file directly:
@@ -145,28 +110,22 @@ Let's just confirm for ourselves that the parquet file itself has the new data.
 DROP TABLE lakes;
 
 SELECT   country, COUNT(*)
-FROM     read_parquet('s3://quickstart/denmark-lakes/lakes.parquet')
+FROM     READ_PARQUET('lakefs://quickstart/denmark-lakes/lakes.parquet')
 GROUP BY country
 ORDER BY COUNT(*) 
 DESC LIMIT 5;
 ```
 
-```
-┌─────────┬──────────────┐
-│ Country │ count_star() │
-│ varchar │    int64     │
-├─────────┼──────────────┤
-│ Denmark │         1677 │
-└─────────┴──────────────┘
-```
+<img src="/assets/img/quickstart/duckdb-editor-06.png" alt="The DuckDB editor pane show the parquet file on denmark-lakes branch has been changed" class="quickstart"/>
+
 
 ## What about the data in `main`?
 
-So we've changed the data in our `denmark-lakes` branch, deleting swathes of the dataset. What's this done to our original data in the `main` branch? Absolutely nothing! See for yourself by returning to the lakeFS object view and re-running the same query:
+So we've changed the data in our `denmark-lakes` branch, deleting swathes of the dataset. What's this done to our original data in the `main` branch? Absolutely nothing! See for yourself by running the same query as above, but against the `main` branch:
 
 ```sql
 SELECT   country, COUNT(*)
-FROM     read_parquet('lakefs://quickstart/main/lakes.parquet')
+FROM     READ_PARQUET('lakefs://quickstart/main/lakes.parquet')
 GROUP BY country
 ORDER BY COUNT(*) 
 DESC LIMIT 5;

diff --git a/docs/quickstart/commit-and-merge.md b/docs/quickstart/commit-and-merge.md
@@ -39,8 +39,8 @@ As above, we'll use `lakectl` to do this too. The syntax just requires us to spe
 
 ```bash
 docker exec lakefs \
-    lakectl merge \
-	    lakefs://quickstart/denmark-lakes \
+	lakectl merge \
+		lakefs://quickstart/denmark-lakes \
 		lakefs://quickstart/main
 ```
 

diff --git a/docs/quickstart/launch.md b/docs/quickstart/launch.md
@@ -18,9 +18,9 @@ _The quickstart uses Docker to bring up the lakeFS container, pre-populate it wi
 Launch the lakeFS container:
 
 ```bash
-docker run --name lakefs \
-           --rm --publish 8000:8000 \
-           treeverse/lakefs:latest-duckdb \
+docker run --name lakefs --pull always \
+             --rm --publish 8000:8000 \
+             treeverse/lakefs:latest \
              run --local-settings
 ```
 
@@ -50,4 +50,4 @@ You're now ready to dive into lakeFS!
 
 You will see the sample repository created and the quickstart guide within it. You can follow along there, or here - it's the same :) 
 
-<img width="75%" src="/assets/img/quickstart/quickstart-repo.png" alt="The quickstart sample repo in lakeFS" class="quickstart"/>
+<img width="75%" src="/assets/img/quickstart/quickstart-repo.gif" alt="The quickstart sample repo in lakeFS" class="quickstart"/>
diff --git a/docs/quickstart/query.md b/docs/quickstart/query.md
@@ -28,7 +28,7 @@ Copy and paste the following SQL statement into the DuckDB query panel and click
 
 ```sql
 SELECT   country, COUNT(*)
-FROM     read_parquet('lakefs://quickstart/main/lakes.parquet')
+FROM     READ_PARQUET('lakefs://quickstart/main/lakes.parquet')
 GROUP BY country
 ORDER BY COUNT(*) 
 DESC LIMIT 5;