Add documentation for standalone (sparkless) GC #8307

Merged Oct 29, 2024 · 20 commits · Changes from 7 commits
8 changes: 8 additions & 0 deletions docs/_includes/toc_2-4.html
@@ -0,0 +1,8 @@
<div class="toc-block">
## Table of contents
{: .no_toc .text-delta }

1. TOC
{:toc}
{::options toc_levels="2..4" /}
</div>
145 changes: 100 additions & 45 deletions docs/howto/garbage-collection/standalone-gc.md
@@ -23,7 +23,7 @@ experimental
{: .note .warning }
> Standalone GC is experimental and offers limited capabilities compared to the [Spark-backed GC]({% link howto/garbage-collection/gc.md %}). Read through the [limitations](./standalone-gc.md#limitations) carefully before using it.

{% include toc_2-3.html %}
{% include toc_2-4.html %}

## About
**Contributor:** nit; add a whitespace after every markdown heading

**Contributor Author:** Done

@@ -38,8 +38,6 @@ Standalone GC is a limited version of the Spark-backed GC that runs without any

### Lab tests

<TODO: update with final results once ready>

Repository spec:

- 100k objects
@@ -52,14 +50,14 @@ Machine spec:

In this setup, we measured:

- Time: ~1 minute
- Time: < 5m
- Disk space: 120MiB

**Contributor:** We should add a limitation that says that sgc only implements the mark stage without sweeping, and sweep requires user action

**Contributor Author:** Done - added a bullet to "Limitations", and a new "Output" section describing this.

## Installation

### Step 1: Obtain Dockerhub token
As an enterprise customer, you should already have a dockerhub token for the `externallakefs` user.
If not, contact us at ___ (TODO: add mail/whaterver).
If not, contact us at [support@treeverse.io](mailto:support@treeverse.io).

### Step 2: Login to Dockerhub with this token
```bash
@@ -69,7 +67,7 @@ docker login -u <your token>
### Step 3: Download the docker image
Download the image from the [lakefs-sgc](https://hub.docker.com/repository/docker/treeverse/lakefs-sgc/general) repository:
```bash
docker pull treeverse/lakefs-sgc:tagname
docker pull treeverse/lakefs-sgc:<tag>
```

## Usage
**Contributor** (@talSofer, Oct 27, 2024): Can you please add these two steps here:
3. running the job with example params
4. How to find the output and guidance for how to read it and a CTA to delete the objects manually

**Contributor Author:**
> running the job with example params

I already added an example - take a look at "Example - docker run command"

> How to find the output and guidance for how to read it and a CTA to delete the objects manually

Done - in the new "Output" section. Not sure WDYM by a "CTA", I just added a sentence explaining that the user should read the report and delete manually.

Update: I added a dedicated section for "Deleting marked objects" with the same sentence ^

@@ -81,23 +79,45 @@ to be set up correctly, and reads the AWS credentials from the machine.
This means you should set up your machine the way AWS expects it to be set up. \
**Contributor:** How do configurations work for on-prem users who use Minio?

**Contributor Author:** Done - added "S3-compatible clients" section and example (cc @itaiad200)

For example, by following their guide on [configuring the AWS CLI](https://docs.aws.amazon.com/cli/v1/userguide/cli-chap-configure.html).

#### S3-compatible clients
Naturally, this method of configuration allows `lakefs-sgc` to work with any S3-compatible client (such as [MinIO](https://min.io/)). \
An example setup for working with MinIO:
1. Add a profile to your `~/.aws/config` file:
```
[profile minio]
region = us-east-1
endpoint_url = <your MinIO URL>
s3 =
signature_version = s3v4
```

2. Add access and secret keys to your `~/.aws/credentials` file:
```
[minio]
aws_access_key_id = <your MinIO access key>
aws_secret_access_key = <your MinIO secret key>
```
3. Run the `lakefs-sgc` docker image and pass it the `minio` profile - see [example](./standalone-gc.md#mounting-the-aws-directory) below.
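
Before running the image, you can sanity-check the profile with the AWS CLI; a minimal sketch, assuming the `minio` profile above (`<your bucket>` is a placeholder for an existing MinIO bucket):
```bash
# Hedged smoke test: list a bucket through the "minio" profile defined above.
aws --profile minio s3 ls s3://<your bucket>
```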

### Configuration
The following configuration keys are available:

| Key                        | Description                                                                                          | Default value                  | Possible values                                          |
|----------------------------|------------------------------------------------------------------------------------------------------|--------------------------------|----------------------------------------------------------|
| `logging.format`           | Logs output format                                                                                   | "text"                         | "text","json"                                            |
| `logging.level`            | Logs level                                                                                           | "info"                         | "error","warn","info","debug","trace"                    |
| `logging.output`           | Where to output the logs to                                                                          | "-"                            | "-" (stdout), "=" (stderr), or any string for file path  |
| `logging.file_max_size_mb` | Max file size for logs output (relevant only if `logging.output` is set to a file path)              | 102400 (100MiB)                | number                                                   |
| `logging.files_keep`       | Number of files to keep for logs rotation (relevant only if `logging.output` is set to a file path)  | 100                            | number                                                   |
| `cache_dir`                | Directory to use for caching data during run                                                         | ~/.lakefs-sgc/data             | string                                                   |
| `aws.max_page_size`        | Max number of items per page when listing objects in AWS                                             | not set (AWS defaults to 1000) | number                                                   |
| `objects_min_age`          | Ignore any object that is last modified within this time frame ("cutoff time")                       | "6h"                           | duration                                                 |
| `lakefs.endpoint_url`      | The URL to the lakeFS installation - should end with `/api/v1`                                       | NOT SET                        | URL                                                      |
| `lakefs.access_key_id`     | Access key to the lakeFS installation                                                                | NOT SET                        | string                                                   |
| `lakefs.secret_access_key` | Secret access key to the lakeFS installation                                                         | NOT SET                        | string                                                   |

| Key                            | Description                                                                                                                                                    | Default value      | Possible values                                          |
|--------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------|----------------------------------------------------------|
| `logging.format`               | Logs output format                                                                                                                                             | "text"             | "text","json"                                            |
| `logging.level`                | Logs level                                                                                                                                                     | "info"             | "error","warn","info","debug","trace"                    |
| `logging.output`               | Where to output the logs to                                                                                                                                    | "-"                | "-" (stdout), "=" (stderr), or any string for file path  |
| `cache_dir`                    | Directory to use for caching data during run                                                                                                                   | ~/.lakefs-sgc/data | string                                                   |
| `aws.max_page_size`            | Max number of items per page when listing objects in AWS                                                                                                       | 1000               | number                                                   |
| `aws.s3.addressing_path_style` | Whether or not to use [path-style](https://docs.aws.amazon.com/AmazonS3/latest/userguide/VirtualHosting.html#path-style-access) when reading objects from AWS  | true               | boolean                                                  |
| `objects_min_age`*             | Ignore any object that is last modified within this time frame ("cutoff time")                                                                                 | "6h"               | duration                                                 |
| `lakefs.endpoint_url`          | The URL to the lakeFS installation - should end with `/api/v1`                                                                                                 | NOT SET            | URL                                                      |
| `lakefs.access_key_id`         | Access key to the lakeFS installation                                                                                                                          | NOT SET            | string                                                   |
| `lakefs.secret_access_key`     | Secret access key to the lakeFS installation                                                                                                                   | NOT SET            | string                                                   |

**Contributor** (on `objects_min_age`): Why do we need this if we have a retention policy? And if it is risky to change it, why is it configurable?

**Contributor Author:** Removed.

It's on top of the retention policy; this configuration exists in the GC as well. But maybe it's worth not documenting it, to prevent mishaps...

{: .note }
> **WARNING:** Changing `objects_min_age` is dangerous and can lead to undesired behaviour, such as causing ongoing writes to fail.
It's recommended not to change this property.

These keys can be provided in the following ways:
1. Config file: Create a YAML file with the keys, where each `.` in a key is a new nesting level (see the sketch below). \
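
For illustration, a minimal sketch of such a YAML file, using key names from the configuration table above (values are placeholders, not recommendations):
```yaml
# Hypothetical lakefs-sgc config file: each "." in a key becomes a nesting level.
logging:
  format: json            # logging.format
  level: debug            # logging.level
cache_dir: /tmp/lakefs-sgc/data
lakefs:
  endpoint_url: https://lakefs.example.com/api/v1   # must end with /api/v1
  access_key_id: <lakeFS access key>
  secret_access_key: <lakeFS secret key>
```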
@@ -139,8 +159,9 @@ Flags:
- `--parallelism`: number of parallel downloads for metadataDir (default 10)
- `--presign`: use pre-signed URLs when downloading/uploading data (recommended) (default true)

### Example - docker run command
Here's an example for running the `treeverse/lakefs-sgc` docker image, with AWS credentials parsed from the `~/.aws/credentials` file:
### Example run commands
**Contributor:** Suggested change: replace "### Example run commands" with "## How to Run Standalone GC" followed by "### Run Commands"

**Contributor Author:** Done


#### Directly passing in credentials parsed from `~/.aws/credentials`

@@ -149,41 +170,75 @@
```bash
docker run \
-e AWS_ACCESS_KEY_ID="$(grep 'aws_access_key_id' ~/.aws/credentials | awk -F' = ' '{print $2}')" \
-e AWS_SECRET_ACCESS_KEY="$(grep 'aws_secret_access_key' ~/.aws/credentials | awk -F' = ' '{print $2}')" \
-e LAKEFS_SGC_LAKEFS_ENDPOINT_URL=<your lakefs URL> \
-e LAKEFS_SGC_LAKEFS_ACCESS_KEY_ID=<your accesss key> \
-e LAKEFS_SGC_LAKEFS_SECRET_ACCESS_KEY=<your secret key> \
-e LAKEFS_SGC_LAKEFS_ACCESS_KEY_ID=<your lakefs accesss key> \
-e LAKEFS_SGC_LAKEFS_SECRET_ACCESS_KEY=<your lakefs secret key> \
-e LAKEFS_SGC_LOGGING_LEVEL=debug \
treeverse/lakefs-sgc:<version> run <repository>
treeverse/lakefs-sgc:<tag> run <repository>
```

**Contributor:** Suggested change: `-e LAKEFS_SGC_LAKEFS_ENDPOINT_URL=<your lakefs URL> \` → `-e LAKEFS_SGC_LAKEFS_ENDPOINT_URL=<lakeFS Endpoint URL> \`

**Contributor Author:** Done

**Contributor:** Suggested change: `<your lakefs accesss key>` / `<your lakefs secret key>` → `<lakeFS accesss key>` / `<lakeFS secret key>`

**Contributor Author:** Done

**Contributor:** Did we mention somewhere which lakeFS user this user is? If I'm a customer who wants to use that, I'd wish to give it the minimal permissions possible

**Contributor Author:** Done - Added a "Permissions" section

#### Mounting the `~/.aws` directory

When working with S3-compatible clients, it's often more convenient to mount the `~/.aws` directory and pass in the desired profile.

First, change the permissions for `~/.aws/*` to allow the docker container to read this directory:
```bash
chmod 644 ~/.aws/*
```

Then, run the docker image and mount `~/.aws` to the `lakefs-sgc` home directory on the docker container:
```bash
docker run \
--network=host \
-v ~/.aws:/home/lakefs-sgc/.aws \
-e AWS_REGION=us-east-1 \
-e AWS_PROFILE=<your profile> \
-e LAKEFS_SGC_LAKEFS_ENDPOINT_URL=<your endpoint URL> \
-e LAKEFS_SGC_LAKEFS_ACCESS_KEY_ID=<your lakefs accesss key> \
-e LAKEFS_SGC_LAKEFS_SECRET_ACCESS_KEY=<your lakefs secret key> \
-e LAKEFS_SGC_LOGGING_LEVEL=debug \
treeverse/lakefs-sgc:<tag> run <repository>
```

**Contributor:** Nit; here and elsewhere, drop the "your" prefix

**Contributor Author:** Done
### Output
**Contributor:** Suggested change: rename "### Output" to "### Get the List of Objects Marked for Deletion" and open with "The output of an SGC job includes the list of objects marked for deletion. It is located at..."

**Contributor Author:** Done
`lakefs-sgc` will write its reports to `<REPOSITORY_STORAGE_NAMESPACE>/_lakefs/retention/gc/reports/<RUN_ID>/`. \
_RUN_ID_ is generated during runtime by the Standalone GC. You can find it in the logs:
```
"Marking objects for deletion" ... run_id=gcoca17haabs73f2gtq0
```
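
If the logs are verbose, a hedged one-liner to pull the run ID out of them (assumes the `run_id=` format shown above; `<container name>` is a placeholder):
```bash
# Extract the first run_id value from the container logs.
docker logs <container name> 2>&1 | grep -o 'run_id=[a-z0-9]*' | head -n 1 | cut -d'=' -f2
```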

In this prefix, you'll find 3 objects:
- `deleted.json` - Containing all marked objects in a json format.
- `deleted.parquet` - Containing all marked objects in a parquet format, with the following schema:

```
┌─────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │ column_type │  null   │   key   │ default │  extra  │
│   varchar   │   varchar   │ varchar │ varchar │ varchar │ varchar │
├─────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ address     │ VARCHAR     │ YES     │         │         │         │
└─────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┘
```

In this prefix, you'll find 2 objects:
- `deleted.csv` - A CSV with a single `address` column, containing all marked objects. Example:

```csv
address
"data/gcnobu7n2efc74lfa5ug/csfnri7n2efc74lfa69g,_e7P9j-1ahTXtofw7tWwJUIhTfL0rEs_dvBrClzc_QE"
"data/gcnobu7n2efc74lfa5ug/csfnri7n2efc74lfa78g,mKZnS-5YbLzmK0pKsGGimdxxBlt8QZzCyw1QeQrFvFE"
...
```

**Contributor:** Not a docs question, but why is this file called deleted if it contains objects that are marked for deletion?

**Contributor Author:** It's aligned with the GC's output (cc @itaiad200 @Jonathan-Rosenberg - right?)

- `summary.json` - A small json summarizing the GC run. Example:

```json
{
    "run_id": "gcoca17haabs73f2gtq0",
    "success": true,
    "first_slice": "gcss5tpsrurs73cqi6e0",
    "start_time": "2024-10-27T13:19:26.890099059Z",
    "cutoff_time": "2024-10-27T07:19:26.890099059Z",
    "num_deleted_objects": 33000
}
```
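
To sanity-check a run from its report, a minimal sketch assuming the report path above and that `jq` is installed (`storage_ns` and `run_id` are placeholders):
```bash
# Placeholders: storage namespace (s3://...) and the GC run ID from the logs.
storage_ns=<your storage namespace>
run_id=<GC run id>

# Stream summary.json from S3 and print the success flag and object count.
aws s3 cp "$storage_ns/_lakefs/retention/gc/reports/$run_id/summary.json" - | jq '.success, .num_deleted_objects'
```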

### Deleting marked objects
**Contributor:** Suggested change: "### Deleting marked objects" → "### Delete marked objects"

**Contributor Author:** Done

To delete the objects marked by the GC, you'll need to read the `deleted.parquet` or `deleted.json` files, and manually delete each address from AWS.

To delete the objects marked by the GC, you'll need to read the `deleted.csv` file, and manually delete each address from AWS.

Example bash command to move all the marked objects to a different bucket on S3:
**Contributor:** I would add a note about playing it safe and moving the objects instead of deleting them

**Contributor Author:** Done

```bash
# Change these to your correct values
storage_ns=<your storage namespace (s3://...)>
output_bucket=<your output bucket (s3://...)>
run_id=<GC run id>

# Download the CSV file
aws s3 cp "$storage_ns/_lakefs/retention/gc/reports/$run_id/deleted.csv" "./run_id-$run_id.csv"

# Move all addresses to the output bucket under the run_id prefix
tail -n +2 "run_id-$run_id.csv" | xargs -I {} aws s3 mv "$storage_ns/{}" "$output_bucket/run_id=$run_id/"
```
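
Once the report has been reviewed and the objects are confirmed safe to drop, a hedged sketch for deleting them outright instead of moving them (destructive; reuses the `storage_ns` and `run_id` variables from the snippet above):
```bash
# DESTRUCTIVE: permanently deletes every address listed in the report.
# Reuses storage_ns and run_id from the previous snippet.
tail -n +2 "run_id-$run_id.csv" | xargs -I {} aws s3 rm "$storage_ns/{}"
```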