Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Garbage Collector: Eliminate double slash in URL #3525

Merged
merged 3 commits into from
Jun 21, 2022

Conversation

Jonathan-Rosenberg
Copy link
Contributor

@Jonathan-Rosenberg Jonathan-Rosenberg commented Jun 19, 2022

Both the lakeFS server and the Garbage Collector client write to the underlying storage during the garbage collection process.
At times, and in both parts, it generates and writes to paths with double slashes.
For example: s3://some-bucket/some-path//_lakefs/logs/gc/summary.
This PR aims to fix this problem.

Problems:

  1. The lakeFS server sends the suffix of the different paths with / as a delimiter as the first character to the formatPathWithNamespace function that expects it to not include the delimiter as a prefix.
  2. The GC code attached a / prefix to the _lakefs path. That resulted in a double slash if the storage namespace ended with a slash.

Mitigation:

  1. lakeFS: Pass the appropriate suffixes to the formatPathWithNamespace function.
  2. GarbageCollector: Make sure that the storage namespace ends with a slash and build the location properly.

How was this tested?

Ran the GC process with a previously double slashed generated path and made sure it changed its behavior to a single slash.


Clarification
Although the code itself generates a double slash path, the final entry, in S3, has no double slash in it.

Closes #2732

@Jonathan-Rosenberg Jonathan-Rosenberg added include-changelog PR description should be included in next release changelog team/ecosystem Team Ecosystem labels Jun 19, 2022
@Jonathan-Rosenberg Jonathan-Rosenberg changed the title Garbage Collector: Eliminate double slash in URL in GC Garbage Collector: Eliminate double slash in URL Jun 19, 2022
Copy link
Contributor

@arielshaqed arielshaqed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like us to make the upgrade process explicit here. This is the requested change.

Also what do we want to do about GCS and Azure? We could check what happens and fix if needed before releasing GC for these two backends.


spark.close()
}

private def concatToGCLogsPrefix(storageNameSpace: String, key: String): String = {
val strippedKey = key.stripPrefix("/")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Why do we need this? No call has a leading slash.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The method is not aware of the calls, it sanitizes the input...

Copy link
Contributor

@talSofer talSofer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for working on this bug! this is a must-fix for Azure on GC so timing is great.

How was this tested?

Comment on lines +22 to +24
configFileSuffixTemplate = "%s/retention/gc/rules/config.json"
addressesFilePrefixTemplate = "%s/retention/gc/addresses/"
commitsFileSuffixTemplate = "%s/retention/gc/commits/run_id=%s/commits.csv"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part fixes another bug we found in the GC-Azure integration! double thanks @Jonathan-Rosenberg :)
Up until this fix, while trying to get gc rules we got the following error:

: invalid namespace
500 Internal Server Error

But removing the superfluous "/" fixed it for the following reason -
Before the fix, "/_lakefs/settings/retention/gc/rules/config.json" was parsed without returning an error to

parsedKey, err := url.ParseRequestURI(key)
, leading to an incorrect URL resolution trying to resolve it as a fully qualified URI (that includes a namespace). After the fix, "_lakefs/settings/retention/gc/rules/config.json" is resolved as a non-fully qualified URI.

Copy link
Contributor

@arielshaqed arielshaqed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@Jonathan-Rosenberg Jonathan-Rosenberg merged commit b4e9ac4 into master Jun 21, 2022
@Jonathan-Rosenberg Jonathan-Rosenberg deleted the feature/eliminate-double-slash-in-gc branch June 21, 2022 08:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
include-changelog PR description should be included in next release changelog team/ecosystem Team Ecosystem
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Garbage collection uses superfluous '/' at beginning of storage identifier
3 participants