Add documentation for standalone (sparkless) GC #8307

Merged: 20 commits, Oct 29, 2024
Changes from 7 commits
117 changes: 87 additions & 30 deletions docs/howto/garbage-collection/standalone-gc.md
@@ -33,16 +33,16 @@ Standalone GC is a limited version of the Spark-backed GC that runs without any

1. Except for the [Lab tests](./standalone-gc.md#lab-tests) performed, there are no further guarantees about the performance profile of the Standalone GC.
2. Horizontal scale is not supported - Only a single instance of `lakefs-sgc` can operate at a time on a given repository.
3. It only marks objects and does not delete them - Equivalent to the GC's [mark only mode]({% link howto/garbage-collection/gc.md %}#mark-only-mode). \
More about that in the [Output](./standalone-gc.md#output) section.
3. Standalone GC only marks objects and does not delete them - Equivalent to the GC's [mark only mode]({% link howto/garbage-collection/gc.md %}#mark-only-mode). \
More about that in the [Get the List of Objects Marked for Deletion](./standalone-gc.md#get-the-list-of-objects-marked-for-deletion) section.

### Lab tests

Repository spec:

- 100k objects
- < 200 commits
- 1 branch
- 250 commits
- 100 branches

Machine spec:
- 4GiB RAM
@@ -51,7 +51,7 @@ Machine spec:
In this setup, we measured:

- Time: < 5m
- Disk space: 120MiB
- Disk space: 123MB

Contributor:

We should add a limitation that says that sgc only implements the mark stage without sweeping, and sweep requires user action

Contributor Author:

Done - added a bullet to "Limitations", and a new "Output" section describing this.

## Installation

@@ -61,7 +61,7 @@ If not, contact us at [support@treeverse.io](mailto:support@treeverse.io).

### Step 2: Log in to Docker Hub with this token
```bash
docker login -u <your token>
docker login -u <token>
```

### Step 3: Download the docker image
@@ -72,6 +72,66 @@ docker pull treeverse/lakefs-sgc:<tag>

## Usage
Contributor (@talSofer, Oct 27, 2024):

Can you please add these two steps here:
3. running the job with example params
4. How to find the output and guidance for how to read it and a CTA to delete the objects manually

Contributor Author:

> running the job with example params

I already added an example - take a look at "Example - docker run command"

> How to find the output and guidance for how to read it and a CTA to delete the objects manually

Done - in the new "Output" section. Not sure WDYM by a "CTA", I just added a sentence explaining that the user should read the report and delete manually.

Contributor Author:

Done - in the new "Output" section. Not sure WDYM by a "CTA", I just added a sentence explaining that the user should read the report and delete manually.

Update: I added a dedicated section for "Deleting marked objects" with the same sentence ^


### Permissions
To run `lakefs-sgc`, you'll need AWS and lakeFS users with the following permissions:
#### AWS
The minimal required permissions on AWS are:
```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject"
            ],
            "Resource": [
                "arn:aws:s3:::<bucket>/*"
```

Contributor:

Too permissive. Should be this, no?

Suggested change:
- "arn:aws:s3:::<bucket>/*"
+ "arn:aws:s3:::<storage_namespace>/_lakefs/*"

Contributor Author:

Can't use only _lakefs as it needs access to the _data prefix as well.
But you're right that it doesn't need permissions for the entire bucket, only the storage namespace prefix.
Changed accordingly.


```json
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::<bucket>"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListAllMyBuckets"
            ],
            "Resource": [
                "arn:aws:s3:::*"
            ]
        }
    ]
}
```
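For reference, one way to attach such a policy to the AWS user that will run `lakefs-sgc` is an inline policy via the AWS CLI. This is only a sketch: the user name, policy name, and file path below are placeholders, and a managed policy or an instance role would work just as well.

```bash
# Save the JSON policy above to sgc-s3-policy.json, then attach it to the user
aws iam put-user-policy \
  --user-name <sgc-user> \
  --policy-name lakefs-sgc-s3-access \
  --policy-document file://sgc-s3-policy.json
```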

#### lakeFS
The minimal required permissions on lakeFS are:
```json
{
    "statement": [
        {
            "action": [
                "fs:ReadConfig",
                "fs:ReadRepository",
                "retention:PrepareGarbageCollectionCommits",
                "retention:PrepareGarbageCollectionUncommitted",
                "fs:ListObjects"
            ],
            "effect": "allow",
            "resource": "arn:lakefs:fs:::repository/<repository>"
        }
    ]
}
```
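If you manage lakeFS permissions from the command line, a policy like this can typically be created with `lakectl`. The snippet below is a hedged sketch: the policy ID and file path are placeholders, and you should verify the exact flags with `lakectl auth policies create --help` for your version, or create the policy through your usual lakeFS administration flow instead.

```bash
# Save the statement JSON above to sgc-lakefs-policy.json, then create the policy
lakectl auth policies create --id StandaloneGC --statement-document sgc-lakefs-policy.json
```
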
### AWS Credentials
Currently, `lakefs-sgc` does not provide an option to explicitly set AWS credentials. It relies on the hosting machine
to be set up correctly, and reads the AWS credentials from the machine.
@@ -86,16 +146,16 @@ An example setup for working with MinIO:
```
[profile minio]
region = us-east-1
endpoint_url = <your MinIO URL>
endpoint_url = <MinIO URL>
s3 =
signature_version = s3v4
```

2. Add access and secret keys to your `~/.aws/credentials` file:
```
[minio]
aws_access_key_id = <your MinIO access key>
aws_secret_access_key = <your MinIO secret key>
aws_access_key_id = <MinIO access key>
aws_secret_access_key = <MinIO secret key>
```
3. Run the `lakefs-sgc` docker image and pass it the `minio` profile - see [example](./standalone-gc.md#mounting-the-aws-directory) below.
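Before running the GC, it can help to sanity-check that the profile resolves to your MinIO endpoint. This is an optional step and only a sketch: it relies on the `endpoint_url` configured in the profile above (on older AWS CLI versions, pass `--endpoint-url <MinIO URL>` explicitly), and the bucket name is a placeholder.

```bash
# Should list the contents of your storage-namespace bucket on MinIO
aws --profile minio s3 ls s3://<bucket>
```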

@@ -110,15 +170,10 @@ The following configuration keys are available:
| `cache_dir` | Directory to use for caching data during run | ~/.lakefs-sgc/data | string |
| `aws.max_page_size` | Max number of items per page when listing objects in AWS | 1000 | number |
| `aws.s3.addressing_path_style` | Whether or not to use [path-style](https://docs.aws.amazon.com/AmazonS3/latest/userguide/VirtualHosting.html#path-style-access) when reading objects from AWS | true | boolean |
| `objects_min_age`* | Ignore any object that is last modified within this time frame ("cutoff time") | "6h" | duration |
| `lakefs.endpoint_url` | The URL to the lakeFS installation - should end with `/api/v1` | NOT SET | URL |
| `lakefs.access_key_id` | Access key to the lakeFS installation | NOT SET | string |
| `lakefs.secret_access_key` | Secret access key to the lakeFS installation | NOT SET | string |

{: .note }
> **WARNING:** Changing `objects_min_age` is dangerous and can lead to undesired behaviour, such as causing ongoing writes to fail.
It's recommended to not change this property.

These keys can be provided in the following ways:
1. Config file: Create a YAML file with the keys, where each `.` is a new nesting level. \
For example, `logging.level` will be:
@@ -139,8 +194,8 @@ logging:
level: debug
lakefs:
endpoint_url: https://your.url/api/v1
access_key_id: <your lakeFS access key>
secret_access_key: <your lakeFS secret key>
access_key_id: <lakeFS access key>
secret_access_key: <lakeFS secret key>
```
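The same keys can also be passed as environment variables with the `LAKEFS_SGC_` prefix, replacing each `.` with `_` - this is the form used in the `docker run` examples below (the mapping here is inferred from those examples):

```bash
# logging.level        ->  LAKEFS_SGC_LOGGING_LEVEL
export LAKEFS_SGC_LOGGING_LEVEL=debug
# lakefs.endpoint_url  ->  LAKEFS_SGC_LAKEFS_ENDPOINT_URL
export LAKEFS_SGC_LAKEFS_ENDPOINT_URL=https://your.url/api/v1
```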

### Command line reference
@@ -159,7 +214,7 @@ Flags:
- `--parallelism`: number of parallel downloads for metadataDir (default 10)
- `--presign`: use pre-signed URLs when downloading/uploading data (recommended) (default true)
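
These flags are appended to the `run` command shown in the examples below; for instance (an illustrative sketch - the tag and repository are placeholders, and the credential/environment flags are omitted for brevity):

```bash
docker run treeverse/lakefs-sgc:<tag> run --parallelism 20 --presign=true <repository>
```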

### Example run commands
### How to Run Standalone GC

#### Directly passing in credentials parsed from `~/.aws/credentials`

@@ -169,9 +224,9 @@ docker run \
-e AWS_SESSION_TOKEN="$(grep 'aws_session_token' ~/.aws/credentials | awk -F' = ' '{print $2}')" \
-e AWS_ACCESS_KEY_ID="$(grep 'aws_access_key_id' ~/.aws/credentials | awk -F' = ' '{print $2}')" \
-e AWS_SECRET_ACCESS_KEY="$(grep 'aws_secret_access_key' ~/.aws/credentials | awk -F' = ' '{print $2}')" \
-e LAKEFS_SGC_LAKEFS_ENDPOINT_URL=<your lakefs URL> \
-e LAKEFS_SGC_LAKEFS_ACCESS_KEY_ID=<your lakefs accesss key> \
-e LAKEFS_SGC_LAKEFS_SECRET_ACCESS_KEY=<your lakefs secret key> \
-e LAKEFS_SGC_LAKEFS_ENDPOINT_URL=<lakefs endpoint URL> \
-e LAKEFS_SGC_LAKEFS_ACCESS_KEY_ID=<lakefs access key> \
-e LAKEFS_SGC_LAKEFS_SECRET_ACCESS_KEY=<lakefs secret key> \
-e LAKEFS_SGC_LOGGING_LEVEL=debug \
treeverse/lakefs-sgc:<tag> run <repository>
```
@@ -191,14 +246,14 @@ docker run \
--network=host \
-v ~/.aws:/home/lakefs-sgc/.aws \
-e AWS_REGION=us-east-1 \
-e AWS_PROFILE=<your profile> \
-e LAKEFS_SGC_LAKEFS_ENDPOINT_URL=<your endpoint URL> \
-e LAKEFS_SGC_LAKEFS_ACCESS_KEY_ID=<your lakefs accesss key> \
-e LAKEFS_SGC_LAKEFS_SECRET_ACCESS_KEY=<your lakefs secret key> \
-e AWS_PROFILE=<profile> \
-e LAKEFS_SGC_LAKEFS_ENDPOINT_URL=<lakefs endpoint URL> \
-e LAKEFS_SGC_LAKEFS_ACCESS_KEY_ID=<lakefs access key> \
-e LAKEFS_SGC_LAKEFS_SECRET_ACCESS_KEY=<lakefs secret key> \
-e LAKEFS_SGC_LOGGING_LEVEL=debug \
treeverse/lakefs-sgc:<tag> run <repository>
```
### Output
### Get the List of Objects Marked for Deletion
`lakefs-sgc` will write its reports to `<REPOSITORY_STORAGE_NAMESPACE>/_lakefs/retention/gc/reports/<RUN_ID>/`. \
_RUN_ID_ is generated during runtime by the Standalone GC. You can find it in the logs:
@@ -207,7 +262,7 @@

In this prefix, you'll find 2 objects:
- `deleted.csv` - Contains all marked objects, as a CSV with a single `address` column. Example:
Contributor:

Not a docs question, but why is this file called deleted if it contains objects that are marked for deletion?

Contributor Author:

It's aligned with the GC's output
(cc @itaiad200 @Jonathan-Rosenberg - right?)

```csv
```
address
"data/gcnobu7n2efc74lfa5ug/csfnri7n2efc74lfa69g,_e7P9j-1ahTXtofw7tWwJUIhTfL0rEs_dvBrClzc_QE"
"data/gcnobu7n2efc74lfa5ug/csfnri7n2efc74lfa78g,mKZnS-5YbLzmK0pKsGGimdxxBlt8QZzCyw1QeQrFvFE"
@@ -225,15 +280,17 @@ In this prefix, you'll find 2 objects:
}
```
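To fetch a run's report for inspection, the AWS CLI is enough. This is only a sketch: substitute your repository's storage namespace (an `s3://...` URI) and the RUN_ID found in the logs.

```bash
# List all GC report runs written under the repository's storage namespace
aws s3 ls "<REPOSITORY_STORAGE_NAMESPACE>/_lakefs/retention/gc/reports/"

# Download the marked-objects list for a specific run
aws s3 cp "<REPOSITORY_STORAGE_NAMESPACE>/_lakefs/retention/gc/reports/<RUN_ID>/deleted.csv" .
```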

### Deleting marked objects
### Delete marked objects

To delete the objects marked by the GC, you'll need to read the `deleted.csv` file and manually delete each address from AWS.

Example bash command to move all the marked objects to a different bucket on S3:
It is recommended to move all the marked objects to a different bucket instead of deleting them directly.

Here's an example bash script to perform this operation:
```bash
# Change these to your correct values
storage_ns=<your storage namespace (s3://...)>
output_bucket=<your output bucket (s3://...)>
storage_ns=<storage namespace (s3://...)>
output_bucket=<output bucket (s3://...)>
run_id=<GC run id>

# Download the CSV file
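# NOTE: the rest of the script is collapsed in this diff view. The lines below are
# only an illustrative sketch of how such a script could continue (not the PR's code);
# they assume the addresses in deleted.csv are relative to the storage namespace,
# as in the CSV example above.
aws s3 cp "$storage_ns/_lakefs/retention/gc/reports/$run_id/deleted.csv" deleted.csv

# Move every marked object to the output bucket (skip the CSV header, drop the quotes)
tail -n +2 deleted.csv | tr -d '"' | while read -r addr; do
    aws s3 mv "$storage_ns/$addr" "$output_bucket/$addr"
done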