Add documentation for standalone (sparkless) GC #8307

Merged: 20 commits, Oct 29, 2024
Changes from 7 commits
117 changes: 87 additions & 30 deletions docs/howto/garbage-collection/standalone-gc.md
@@ -33,16 +33,16 @@ Standalone GC is a limited version of the Spark-backed GC that runs without any

1. Except for the [Lab tests](./standalone-gc.md#lab-tests) performed, there are no further guarantees about the performance profile of the Standalone GC.
2. Horizontal scale is not supported - Only a single instance of `lakefs-sgc` can operate at a time on a given repository.
3. It only marks objects and does not delete them - Equivalent to the GC's [mark only mode]({% link howto/garbage-collection/gc.md %}#mark-only-mode). \
More about that in the [Output](./standalone-gc.md#output) section.
3. Standalone GC only marks objects and does not delete them - Equivalent to the GC's [mark only mode]({% link howto/garbage-collection/gc.md %}#mark-only-mode). \
More about that in the [Get the List of Objects Marked for Deletion](./standalone-gc.md#get-the-list-of-objects-marked-for-deletion) section.

### Lab tests

Repository spec:

- 100k objects
- < 200 commits
- 1 branch
- 250 commits
- 100 branches

Machine spec:
- 4GiB RAM
@@ -51,7 +51,7 @@ Machine spec:
In this setup, we measured:

- Time: < 5m
- Disk space: 120MiB
- Disk space: 123MB

Contributor:

We should add a limitation that says that sgc only implements the mark stage without sweeping, and sweep requires user action

Contributor Author:

Done - added a bullet to "Limitations", and a new "Output" section describing this.

## Installation

@@ -61,7 +61,7 @@ If not, contact us at [support@treeverse.io](mailto:support@treeverse.io).

### Step 2: Log in to Docker Hub with this token
```bash
docker login -u <your token>
docker login -u <token>
```

### Step 3: Download the docker image
@@ -72,6 +72,66 @@ docker pull treeverse/lakefs-sgc:<tag>

## Usage
Contributor (@talSofer, Oct 27, 2024):

Can you please add these two steps here:
3. running the job with example params
4. How to find the output and guidance for how to read it and a CTA to delete the objects manually

Contributor Author:

> running the job with example params

I already added an example - take a look at "Example - docker run command"

> How to find the output and guidance for how to read it and a CTA to delete the objects manually

Done - in the new "Output" section. Not sure WDYM by a "CTA", I just added a sentence explaining that the user should read the report and delete manually.

Contributor Author:

Done - in the new "Output" section. Not sure WDYM by a "CTA", I just added a sentence explaining that the user should read the report and delete manually.

Update: I added a dedicated section for "Deleting marked objects" with the same sentence ^


### Permissions
To run `lakefs-sgc`, you'll need AWS and lakeFS users with the following permissions:
#### AWS
The minimal required permissions on AWS are:
```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject"
            ],
            "Resource": [
                "arn:aws:s3:::<bucket>/*"
```

Contributor:

Too permissive. Should be this, no?

Suggested change:
- "arn:aws:s3:::<bucket>/*"
+ "arn:aws:s3:::<storage_namespace>/_lakefs/*"

Contributor Author:

Can't use only _lakefs as it needs access to the _data prefix as well.
But you're right that it doesn't need permissions for the entire bucket, only the storage namespace prefix.
Changed accordingly.


```json
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::<bucket>"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListAllMyBuckets"
            ],
            "Resource": [
                "arn:aws:s3:::*"
            ]
        }
    ]
}
```
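For reference, one way to attach such a policy to the AWS user that will run `lakefs-sgc` is an inline policy via the AWS CLI. This is only a sketch: the user name, policy name, and file path below are placeholders, and a managed policy or an instance role would work just as well.

```bash
# Save the JSON policy above to sgc-s3-policy.json, then attach it to the user
aws iam put-user-policy \
  --user-name <sgc-user> \
  --policy-name lakefs-sgc-s3-access \
  --policy-document file://sgc-s3-policy.json
```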

#### lakeFS
The minimal required permissions on lakeFS are:
```json
{
    "statement": [
        {
            "action": [
                "fs:ReadConfig",
                "fs:ReadRepository",
                "retention:PrepareGarbageCollectionCommits",
                "retention:PrepareGarbageCollectionUncommitted",
                "fs:ListObjects"
            ],
            "effect": "allow",
            "resource": "arn:lakefs:fs:::repository/<repository>"
        }
    ]
}
```
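If you manage lakeFS permissions from the command line, a policy like this can typically be created with `lakectl`. The snippet below is a hedged sketch: the policy ID and file path are placeholders, and you should verify the exact flags with `lakectl auth policies create --help` for your version, or create the policy through your usual lakeFS administration flow instead.

```bash
# Save the statement JSON above to sgc-lakefs-policy.json, then create the policy
lakectl auth policies create --id StandaloneGC --statement-document sgc-lakefs-policy.json
```
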
### AWS Credentials
Currently, `lakefs-sgc` does not provide an option to explicitly set AWS credentials. It relies on the hosting machine
to be set up correctly, and reads the AWS credentials from the machine.
@@ -86,16 +146,16 @@ An example setup for working with MinIO:
```
[profile minio]
region = us-east-1
endpoint_url = <your MinIO URL>
endpoint_url = <MinIO URL>
s3 =
signature_version = s3v4
```

2. Add access and secret keys to your `~/.aws/credentials` file:
```
[minio]
aws_access_key_id = <your MinIO access key>
aws_secret_access_key = <your MinIO secret key>
aws_access_key_id = <MinIO access key>
aws_secret_access_key = <MinIO secret key>
```
3. Run the `lakefs-sgc` docker image and pass it the `minio` profile - see [example](./standalone-gc.md#mounting-the-aws-directory) below.
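Before running the GC, it can help to sanity-check that the profile resolves to your MinIO endpoint. This is an optional step and only a sketch: it relies on the `endpoint_url` configured in the profile above (on older AWS CLI versions, pass `--endpoint-url <MinIO URL>` explicitly), and the bucket name is a placeholder.

```bash
# Should list the contents of your storage-namespace bucket on MinIO
aws --profile minio s3 ls s3://<bucket>
```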

@@ -110,15 +170,10 @@ The following configuration keys are available:
| `cache_dir` | Directory to use for caching data during run | ~/.lakefs-sgc/data | string |
| `aws.max_page_size` | Max number of items per page when listing objects in AWS | 1000 | number |
| `aws.s3.addressing_path_style` | Whether or not to use [path-style](https://docs.aws.amazon.com/AmazonS3/latest/userguide/VirtualHosting.html#path-style-access) when reading objects from AWS | true | boolean |
| `objects_min_age`* | Ignore any object that is last modified within this time frame ("cutoff time") | "6h" | duration |
| `lakefs.endpoint_url` | The URL to the lakeFS installation - should end with `/api/v1` | NOT SET | URL |
| `lakefs.access_key_id` | Access key to the lakeFS installation | NOT SET | string |
| `lakefs.secret_access_key` | Secret access key to the lakeFS installation | NOT SET | string |

{: .note }
> **WARNING:** Changing `objects_min_age` is dangerous and can lead to undesired behaviour, such as causing ongoing writes to fail.
It's recommended to not change this property.

These keys can be provided in the following ways:
1. Config file: Create a YAML file with the keys, where each `.` is a new nesting level. \
For example, `logging.level` will be:
@@ -139,8 +194,8 @@ logging:
level: debug
lakefs:
endpoint_url: https://your.url/api/v1
access_key_id: <your lakeFS access key>
secret_access_key: <your lakeFS secret key>
access_key_id: <lakeFS access key>
secret_access_key: <lakeFS secret key>
```
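The same keys can also be passed as environment variables with the `LAKEFS_SGC_` prefix, replacing each `.` with `_` - this is the form used in the `docker run` examples below (the mapping here is inferred from those examples):

```bash
# logging.level        ->  LAKEFS_SGC_LOGGING_LEVEL
export LAKEFS_SGC_LOGGING_LEVEL=debug
# lakefs.endpoint_url  ->  LAKEFS_SGC_LAKEFS_ENDPOINT_URL
export LAKEFS_SGC_LAKEFS_ENDPOINT_URL=https://your.url/api/v1
```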

### Command line reference
@@ -159,7 +214,7 @@ Flags:
- `--parallelism`: number of parallel downloads for metadataDir (default 10)
- `--presign`: use pre-signed URLs when downloading/uploading data (recommended) (default true)
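
These flags are appended to the `run` command shown in the examples below; for instance (an illustrative sketch - the tag and repository are placeholders, and the credential/environment flags are omitted for brevity):

```bash
docker run treeverse/lakefs-sgc:<tag> run --parallelism 20 --presign=true <repository>
```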

### Example run commands
### How to Run Standalone GC

#### Directly passing in credentials parsed from `~/.aws/credentials`

@@ -169,9 +224,9 @@ docker run \
-e AWS_SESSION_TOKEN="$(grep 'aws_session_token' ~/.aws/credentials | awk -F' = ' '{print $2}')" \
-e AWS_ACCESS_KEY_ID="$(grep 'aws_access_key_id' ~/.aws/credentials | awk -F' = ' '{print $2}')" \
-e AWS_SECRET_ACCESS_KEY="$(grep 'aws_secret_access_key' ~/.aws/credentials | awk -F' = ' '{print $2}')" \
-e LAKEFS_SGC_LAKEFS_ENDPOINT_URL=<your lakefs URL> \
-e LAKEFS_SGC_LAKEFS_ACCESS_KEY_ID=<your lakefs accesss key> \
-e LAKEFS_SGC_LAKEFS_SECRET_ACCESS_KEY=<your lakefs secret key> \
-e LAKEFS_SGC_LAKEFS_ENDPOINT_URL=<lakefs endpoint URL> \
-e LAKEFS_SGC_LAKEFS_ACCESS_KEY_ID=<lakefs access key> \
-e LAKEFS_SGC_LAKEFS_SECRET_ACCESS_KEY=<lakefs secret key> \
-e LAKEFS_SGC_LOGGING_LEVEL=debug \
treeverse/lakefs-sgc:<tag> run <repository>
```
@@ -191,14 +246,14 @@ docker run \
--network=host \
-v ~/.aws:/home/lakefs-sgc/.aws \
-e AWS_REGION=us-east-1 \
-e AWS_PROFILE=<your profile> \
-e LAKEFS_SGC_LAKEFS_ENDPOINT_URL=<your endpoint URL> \
-e LAKEFS_SGC_LAKEFS_ACCESS_KEY_ID=<your lakefs accesss key> \
-e LAKEFS_SGC_LAKEFS_SECRET_ACCESS_KEY=<your lakefs secret key> \
-e AWS_PROFILE=<profile> \
-e LAKEFS_SGC_LAKEFS_ENDPOINT_URL=<lakefs endpoint URL> \
-e LAKEFS_SGC_LAKEFS_ACCESS_KEY_ID=<lakefs access key> \
-e LAKEFS_SGC_LAKEFS_SECRET_ACCESS_KEY=<lakefs secret key> \
-e LAKEFS_SGC_LOGGING_LEVEL=debug \
treeverse/lakefs-sgc:<tag> run <repository>
```
### Output
### Get the List of Objects Marked for Deletion
`lakefs-sgc` will write its reports to `<REPOSITORY_STORAGE_NAMESPACE>/_lakefs/retention/gc/reports/<RUN_ID>/`. \
_RUN_ID_ is generated during runtime by the Standalone GC. You can find it in the logs:
@@ -207,7 +262,7 @@

In this prefix, you'll find 2 objects:
- `deleted.csv` - Contains all marked objects, as a CSV with a single `address` column. Example:
Contributor:

Not a docs question, but why is this file called deleted if it contains objects that are marked for deletion?

Contributor Author:

It's aligned with the GC's output
(cc @itaiad200 @Jonathan-Rosenberg - right?)

```csv
```
address
"data/gcnobu7n2efc74lfa5ug/csfnri7n2efc74lfa69g,_e7P9j-1ahTXtofw7tWwJUIhTfL0rEs_dvBrClzc_QE"
"data/gcnobu7n2efc74lfa5ug/csfnri7n2efc74lfa78g,mKZnS-5YbLzmK0pKsGGimdxxBlt8QZzCyw1QeQrFvFE"
@@ -225,15 +280,17 @@ In this prefix, you'll find 2 objects:
}
```
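To fetch a run's report for inspection, the AWS CLI is enough. This is only a sketch: substitute your repository's storage namespace (an `s3://...` URI) and the RUN_ID found in the logs.

```bash
# List all GC report runs written under the repository's storage namespace
aws s3 ls "<REPOSITORY_STORAGE_NAMESPACE>/_lakefs/retention/gc/reports/"

# Download the marked-objects list for a specific run
aws s3 cp "<REPOSITORY_STORAGE_NAMESPACE>/_lakefs/retention/gc/reports/<RUN_ID>/deleted.csv" .
```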

### Deleting marked objects
### Delete marked objects

To delete the objects marked by the GC, you'll need to read the `deleted.csv` file and manually delete each address from AWS.

Example bash command to move all the marked objects to a different bucket on S3:
It is recommended to move all the marked objects to a different bucket instead of deleting them directly.

Here's an example bash script to perform this operation:
```bash
# Change these to your correct values
storage_ns=<your storage namespace (s3://...)>
output_bucket=<your output bucket (s3://...)>
storage_ns=<storage namespace (s3://...)>
output_bucket=<output bucket (s3://...)>
run_id=<GC run id>

# Download the CSV file
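# NOTE: the rest of the script is collapsed in this diff view. The lines below are
# only an illustrative sketch of how such a script could continue (not the PR's code);
# they assume the addresses in deleted.csv are relative to the storage namespace,
# as in the CSV example above.
aws s3 cp "$storage_ns/_lakefs/retention/gc/reports/$run_id/deleted.csv" deleted.csv

# Move every marked object to the output bucket (skip the CSV header, drop the quotes)
tail -n +2 deleted.csv | tr -d '"' | while read -r addr; do
    aws s3 mv "$storage_ns/$addr" "$output_bucket/$addr"
done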