treeverse · yonipeleg33 · Oct 29, 2024 · Oct 27, 2024 · Oct 27, 2024 · Oct 27, 2024
diff --git a/docs/_includes/toc_2-4.html b/docs/_includes/toc_2-4.html
@@ -0,0 +1,8 @@
+<div class="toc-block">
+## Table of contents
+{: .no_toc .text-delta }
+
+1. TOC
+{:toc}
+{::options toc_levels="2..4" /}
+</div>
diff --git a/docs/howto/garbage-collection/standalone-gc.md b/docs/howto/garbage-collection/standalone-gc.md
@@ -23,7 +23,7 @@ experimental
 {: .note .warning }
 > Standalone GC is experimental and offers limited capabilities compared to the [Spark-backed GC]({% link howto/garbage-collection/gc.md %}). Read through the [limitations](./standalone-gc.md#limitations) carefully before using it.
 
-{% include toc_2-3.html %}
+{% include toc_2-4.html %}
 
 ## About
 
@@ -38,8 +38,6 @@ Standalone GC is a limited version of the Spark-backed GC that runs without any
 
 ### Lab tests
 
-<TODO: update with final results once ready>
-
 Repository spec:
 
 - 100k objects
@@ -52,14 +50,14 @@ Machine spec:
 
 In this setup, we measured:
 
-- Time: ~1 minute
+- Time: < 5m
 - Disk space: 120MiB
 
 ## Installation
 
 ### Step 1: Obtain Dockerhub token
 As an enterprise customer, you should already have a dockerhub token for the `externallakefs` user.
-If not, contact us at ___ (TODO: add mail/whaterver).
+If not, contact us at [support@treeverse.io](mailto:support@treeverse.io).
 
 ### Step 2: Login to Dockerhub with this token
 ```bash
@@ -69,7 +67,7 @@ docker login -u <your token>
 ### Step 3: Download the docker image
 Download the image from the [lakefs-sgc](https://hub.docker.com/repository/docker/treeverse/lakefs-sgc/general) repository:
 ```bash
-docker pull treeverse/lakefs-sgc:tagname
+docker pull treeverse/lakefs-sgc:<tag>
 ```
 
 ## Usage
@@ -81,23 +79,45 @@ to be set up correctly, and reads the AWS credentials from the machine.
 This means, you should set up your machine however AWS expects you to set it. \
 For example, by following their guide on [configuring the AWS CLI](https://docs.aws.amazon.com/cli/v1/userguide/cli-chap-configure.html).
 
+#### S3-compatible clients
+This configuration method also lets `lakefs-sgc` work with any S3-compatible storage service (such as [MinIO](https://min.io/)). \
+An example setup for working with MinIO:
+1. Add a profile to your `~/.aws/config` file:
+    ```
+   [profile minio]
+   region = us-east-1
+   endpoint_url = <your MinIO URL>
+   s3 =
+       signature_version = s3v4
+    ```
+
+2. Add your access and secret keys to your `~/.aws/credentials` file:
+    ```
+   [minio]
+   aws_access_key_id     = <your MinIO access key>
+   aws_secret_access_key = <your MinIO secret key>
+    ```
+3. Run the `lakefs-sgc` docker image and pass it the `minio` profile - see [example](./standalone-gc.md#mounting-the-aws-directory) below.
+
 ### Configuration
 The following configuration keys are available:
 
-| Key                        | Description                                                                                        | Default value                  | Possible values                                         |
-|----------------------------|----------------------------------------------------------------------------------------------------|--------------------------------|---------------------------------------------------------|
-| `logging.format`           | Logs output format                                                                                 | "text"                         | "text","json"                                           |
-| `logging.level`            | Logs level                                                                                         | "info"                         | "error","warn",info","debug","trace"                    |
-| `logging.output`           | Where to output the logs to                                                                        | "-"                            | "-" (stdout), "=" (stderr), or any string for file path |
-| `logging.file_max_size_mb` | Max file size for logs output (relevant only if `logging.output` is set to a file path)            | 102400 (100MiB)                | number                                                  |
-| `logging.files_keep`       | Number of files to keep for logs rotation (relevant only if `logging.output` is set to a file path | 100                            | number                                                  |
-| `cache_dir`                | Directory to use for caching data during run                                                       | ~/.lakefs-sgc/data             | string                                                  |
-| `aws.max_page_size`        | Max number of items per page when listing objects in AWS                                           | not set (AWS defaults to 1000) | number                                                  |
-| `objects_min_age`          | Ignore any object that is last modified within this time frame ("cutoff time")                     | "6h"                           | duration                                                |
-| `lakefs.endpoint_url`      | The URL to the lakeFS installation - should end with `/api/v1`                                     | NOT SET                        | URL                                                     |
-| `lakefs.access_key_id`     | Access key to the lakeFS installation                                                              | NOT SET                        | string                                                  |
-| `lakefs.secret_access_key` | Secret access key to the lakeFS installation                                                       | NOT SET                        | string                                                  |
+| Key                            | Description                                                                                                                                                   | Default value      | Possible values                                         |
+|--------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------|---------------------------------------------------------|
+| `logging.format`               | Logs output format                                                                                                                                            | "text"             | "text","json"                                           |
+| `logging.level`                | Logs level                                                                                                                                                    | "info"             | "error","warn","info","debug","trace"                   |
+| `logging.output`               | Where to output the logs to                                                                                                                                   | "-"                | "-" (stdout), "=" (stderr), or any string for file path |
+| `cache_dir`                    | Directory to use for caching data during run                                                                                                                  | ~/.lakefs-sgc/data | string                                                  |
+| `aws.max_page_size`            | Max number of items per page when listing objects in AWS                                                                                                      | 1000               | number                                                  |
+| `aws.s3.addressing_path_style` | Whether or not to use [path-style](https://docs.aws.amazon.com/AmazonS3/latest/userguide/VirtualHosting.html#path-style-access) when reading objects from AWS | true               | boolean                                                 |
+| `objects_min_age`*             | Ignore any object that is last modified within this time frame ("cutoff time")                                                                                | "6h"               | duration                                                |
+| `lakefs.endpoint_url`          | The URL to the lakeFS installation - should end with `/api/v1`                                                                                                | NOT SET            | URL                                                     |
+| `lakefs.access_key_id`         | Access key to the lakeFS installation                                                                                                                         | NOT SET            | string                                                  |
+| `lakefs.secret_access_key`     | Secret access key to the lakeFS installation                                                                                                                  | NOT SET            | string                                                  |
 
+{: .note }
+> **WARNING:** Changing `objects_min_age` is dangerous and can lead to undesired behaviour, such as causing ongoing writes to fail.
+> We recommend leaving this property at its default value.
 
 These keys can be provided in the following ways:
 1. Config file: Create a YAML file with the keys, each `.` is a new nesting level. \
@@ -139,8 +159,9 @@ Flags:
 - `--parallelism`: number of parallel downloads for metadataDir (default 10)
 - `--presign`: use pre-signed URLs when downloading/uploading data (recommended) (default true)
 
-### Example - docker run command
-Here's an example for running the `treeverse/lakefs-sgc` docker image, with AWS credentials parsed from the `~/.aws/credentials` file:
+### Example run commands
-### Example run commands
+## How to Run Standalone GC 
+### Run Commands 
+
+#### Directly passing in credentials parsed from `~/.aws/credentials`
 
 ```bash
 docker run \
@@ -149,41 +170,75 @@ docker run \
 -e AWS_ACCESS_KEY_ID="$(grep 'aws_access_key_id' ~/.aws/credentials | awk -F' = ' '{print $2}')" \
 -e AWS_SECRET_ACCESS_KEY="$(grep 'aws_secret_access_key' ~/.aws/credentials | awk -F' = ' '{print $2}')" \
 -e LAKEFS_SGC_LAKEFS_ENDPOINT_URL=<your lakefs URL> \
-e LAKEFS_SGC_LAKEFS_ENDPOINT_URL=<your lakefs URL> \
+-e LAKEFS_SGC_LAKEFS_ENDPOINT_URL=<lakeFS Endpoint URL> \
--e LAKEFS_SGC_LAKEFS_ACCESS_KEY_ID=<your accesss key> \
--e LAKEFS_SGC_LAKEFS_SECRET_ACCESS_KEY=<your secret key> \
+-e LAKEFS_SGC_LAKEFS_ACCESS_KEY_ID=<your lakefs access key> \
+-e LAKEFS_SGC_LAKEFS_SECRET_ACCESS_KEY=<your lakefs secret key> \
-e LAKEFS_SGC_LAKEFS_ACCESS_KEY_ID=<your lakefs access key> \
-e LAKEFS_SGC_LAKEFS_SECRET_ACCESS_KEY=<your lakefs secret key> \
+-e LAKEFS_SGC_LAKEFS_ACCESS_KEY_ID=<lakeFS access key> \
+-e LAKEFS_SGC_LAKEFS_SECRET_ACCESS_KEY=<lakeFS secret key> \
 -e LAKEFS_SGC_LOGGING_LEVEL=debug \
-treeverse/lakefs-sgc:<version> run <repository>
+treeverse/lakefs-sgc:<tag> run <repository>
 ```
 
+#### Mounting the `~/.aws` directory
+
+When working with S3-compatible clients, it's often more convenient to mount the `~/.aws` directory and pass in the desired profile.
+
+First, change the permissions for `~/.aws/*` to allow the docker container to read this directory:
+```bash
+chmod 644 ~/.aws/*
+```
+
+Then, run the docker image and mount `~/.aws` to the `lakefs-sgc` home directory on the docker container:
+```bash
+docker run \
+--network=host \
+-v ~/.aws:/home/lakefs-sgc/.aws \
+-e AWS_REGION=us-east-1 \
+-e AWS_PROFILE=<your profile> \
+-e LAKEFS_SGC_LAKEFS_ENDPOINT_URL=<your endpoint URL> \
+-e LAKEFS_SGC_LAKEFS_ACCESS_KEY_ID=<your lakefs access key> \
+-e LAKEFS_SGC_LAKEFS_SECRET_ACCESS_KEY=<your lakefs secret key> \
+-e LAKEFS_SGC_LOGGING_LEVEL=debug \
+treeverse/lakefs-sgc:<tag> run <repository>
+```
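
The environment variables in the run commands above appear to follow a pattern: the configuration key is upper-cased, each `.` becomes `_`, and the result is prefixed with `LAKEFS_SGC_` (for example, `logging.level` becomes `LAKEFS_SGC_LOGGING_LEVEL`). The sketch below assumes that pattern holds for every key in the configuration table, which this document does not confirm. `sgc_env_name` is a hypothetical helper, not part of `lakefs-sgc`.

```bash
# Hypothetical helper (not part of lakefs-sgc): derive the environment
# variable name for a configuration key, ASSUMING the LAKEFS_SGC_ prefix
# pattern seen in the run commands applies to all keys.
sgc_env_name() {
  local key="$1"
  # Upper-case the key, then replace every "." with "_"
  printf 'LAKEFS_SGC_%s\n' "$(printf '%s' "$key" | tr '[:lower:]' '[:upper:]' | tr '.' '_')"
}

sgc_env_name logging.level          # prints LAKEFS_SGC_LOGGING_LEVEL
sgc_env_name lakefs.endpoint_url    # prints LAKEFS_SGC_LAKEFS_ENDPOINT_URL
```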
 ### Output
-### Output
+### Get the List of Objects Marked for Deletion
+
+The output of an SGC job includes the list of objects marked for deletion. It is located at...
 `lakefs-sgc` will write its reports to `<REPOSITORY_STORAGE_NAMESPACE>/_lakefs/retention/gc/reports/<RUN_ID>/`. \
 _RUN_ID_ is generated during runtime by the Standalone GC. You can find it in the logs:
 ```
 "Marking objects for deletion" ... run_id=gcoca17haabs73f2gtq0
 ```
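
A minimal sketch for pulling the run ID out of saved log output, assuming the `text` logging format and the `run_id=<id>` shape shown in the sample line above. `extract_run_id` is a hypothetical helper name, and the pattern may need adjusting for `json` logs.

```bash
# Hypothetical helper: print the first run_id found in lakefs-sgc log text on stdin.
# ASSUMES the text log format shown above, where the ID appears as run_id=<id>.
extract_run_id() {
  grep -o 'run_id=[A-Za-z0-9]*' | head -n 1 | cut -d= -f2
}

# Example using the sample log line from this section
echo '"Marking objects for deletion" ... run_id=gcoca17haabs73f2gtq0' | extract_run_id
```

If the logs were redirected to a file, the same helper can read it with `extract_run_id < <log file>`.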
 
-In this prefix, you'll find 3 objects:
-- `deleted.json` - Containing all marked objects in a json format. 
-- `deleted.parquet` - Containing all marked objects in a parquet format, with the following schema:
+In this prefix, you'll find 2 objects:
+- `deleted.csv` - Contains all marked objects as a CSV with a single `address` column. Example:
+   ```csv
+   address
+   "data/gcnobu7n2efc74lfa5ug/csfnri7n2efc74lfa69g,_e7P9j-1ahTXtofw7tWwJUIhTfL0rEs_dvBrClzc_QE"
+   "data/gcnobu7n2efc74lfa5ug/csfnri7n2efc74lfa78g,mKZnS-5YbLzmK0pKsGGimdxxBlt8QZzCyw1QeQrFvFE"
+   ...
    ```
-  ┌─────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
-  │ column_name │ column_type │  null   │   key   │ default │  extra  │
-  │   varchar   │   varchar   │ varchar │ varchar │ varchar │ varchar │
-  ├─────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
-  │ address     │ VARCHAR     │ YES     │         │         │         │
-  └─────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┘
-  ```
 - `summary.json` - A small json summarizing the GC run. Example:
-```json
-{
-    "run_id": "gcoca17haabs73f2gtq0",
-    "success": true,
-    "first_slice": "gcss5tpsrurs73cqi6e0",
-    "start_time": "2024-10-27T13:19:26.890099059Z",
-    "cutoff_time": "2024-10-27T07:19:26.890099059Z",
-    "num_deleted_objects": 33000
-}
-```
+   ```json
+   {
+       "run_id": "gcoca17haabs73f2gtq0",
+       "success": true,
+       "first_slice": "gcss5tpsrurs73cqi6e0",
+       "start_time": "2024-10-27T13:19:26.890099059Z",
+       "cutoff_time": "2024-10-27T07:19:26.890099059Z",
+       "num_deleted_objects": 33000
+   }
+   ```
 
 ### Deleting marked objects
-### Deleting marked objects
+### Delete marked objects
-To delete the objects marked by the GC, you'll need to read the `deleted.parquet` or `deleted.json` files, and manually delete each address from AWS.
+
+To delete the objects marked by the GC, you'll need to read the `deleted.csv` file, and manually delete each address from AWS.
+
+Example bash command to move all the marked objects to a different bucket on S3:
+```bash
+# Change these to your correct values
+storage_ns=<your storage namespace (s3://...)>
+output_bucket=<your output bucket (s3://...)>
+run_id=<GC run id>
+
+# Download the CSV file
+aws s3 cp "$storage_ns/_lakefs/retention/gc/reports/$run_id/deleted.csv" "./run_id-$run_id.csv"
+
+# Move the first 10 addresses to the output bucket under the run_id prefix (drop `head -n 10` to move all of them)
+tail -n +2 "run_id-$run_id.csv" | head -n 10 | xargs -I {} aws s3 mv "$storage_ns/{}" "$output_bucket/run_id=$run_id/"
+```
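
Before moving or deleting anything, it can help to preview the full object URIs that the CSV resolves to. The sketch below assumes the `deleted.csv` layout shown earlier (an `address` header followed by one double-quoted address per line). `list_marked_uris` is a hypothetical helper name, and it only prints URIs without calling AWS.

```bash
# Hypothetical helper: print the full URI of every address listed in a deleted.csv file.
# Usage: list_marked_uris <storage namespace> <path to deleted.csv>
# ASSUMES one double-quoted address per line after the "address" header.
list_marked_uris() {
  local storage_ns="${1%/}"   # drop a trailing slash, if present
  local csv_file="$2"
  # Skip the header line, strip the surrounding double quotes,
  # and prefix each remaining non-empty line with the storage namespace.
  tail -n +2 "$csv_file" \
    | sed -e 's/^"//' -e 's/"$//' \
    | while IFS= read -r address || [ -n "$address" ]; do
        if [ -n "$address" ]; then
          printf '%s/%s\n' "$storage_ns" "$address"
        fi
      done
}
```

With the variables from the example above, this could be called as `list_marked_uris "$storage_ns" "./run_id-$run_id.csv"` and the output reviewed before running the `aws s3 mv` command.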