Garbage Collector: Eliminate double slash in URL #3525
…ce end with a slash
…uffix as it was erroneous
I'd like us to make the upgrade process explicit here. This is the requested change.
Also, what do we want to do about GCS and Azure? We could check what happens and fix it if needed before releasing GC for these two backends.
clients/spark/core/src/main/scala/io/treeverse/clients/GarbageCollector.scala
    spark.close()
  }

  private def concatToGCLogsPrefix(storageNameSpace: String, key: String): String = {
    val strippedKey = key.stripPrefix("/")
Nit: Why do we need this? No call has a leading slash.
The method is not aware of its callers; it sanitizes the input...
Thank you for working on this bug! This is a must-fix for Azure on GC, so the timing is great.
How was this tested?
clients/spark/core/src/main/scala/io/treeverse/clients/GarbageCollector.scala
    configFileSuffixTemplate = "%s/retention/gc/rules/config.json"
    addressesFilePrefixTemplate = "%s/retention/gc/addresses/"
    commitsFileSuffixTemplate = "%s/retention/gc/commits/run_id=%s/commits.csv"
This part fixes another bug we found in the GC-Azure integration! Double thanks, @Jonathan-Rosenberg :)
Until this fix, trying to get the GC rules returned the following error:
: invalid namespace
500 Internal Server Error
Removing the superfluous "/" fixed it, for the following reason:
Before the fix, "/_lakefs/settings/retention/gc/rules/config.json" was parsed without returning an error to
lakeFS/pkg/block/azure/adapter.go
Line 91 in a21c0fc
parsedKey, err := url.ParseRequestURI(key)
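To make the difference concrete, here is a minimal sketch of how Go's standard `url.ParseRequestURI` treats a key with and without a leading slash. It only exercises the standard library; `isRequestURI` is a hypothetical helper introduced for illustration, and nothing here reproduces what the Azure adapter does with the parse result.

```go
package main

import (
	"fmt"
	"net/url"
)

// isRequestURI is a hypothetical helper (not lakeFS code). It reports whether
// net/url accepts key as a request URI, that is, an absolute URI or an
// absolute path.
func isRequestURI(key string) bool {
	_, err := url.ParseRequestURI(key)
	return err == nil
}

func main() {
	keys := []string{
		// Leading slash: treated as an absolute path, so parsing succeeds.
		"/_lakefs/settings/retention/gc/rules/config.json",
		// No leading slash and no scheme: rejected as a request URI.
		"_lakefs/settings/retention/gc/rules/config.json",
	}
	for _, k := range keys {
		fmt.Printf("%q parsed=%v\n", k, isRequestURI(k))
	}
}
```

With the leading slash the key parses cleanly as a path, so any code path that runs only when parsing succeeds is taken; without it, `ParseRequestURI` returns an error instead.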
Thanks!
Both the lakeFS server and the Garbage Collector client write to the underlying storage during the garbage collection process.
At times, both components generate and write to paths that contain a double slash, for example:
s3://some-bucket/some-path//_lakefs/logs/gc/summary
This PR aims to fix this problem.
Problems:
- … "/" as a delimiter as the first character to the formatPathWithNamespace function, which expects it not to include the delimiter as a prefix.
- … "/" prefix to the _lakefs path. That resulted in a double slash if the storage namespace ended with a slash.
Mitigation:
- … formatPathWithNamespace function.
How was this tested?
Ran the GC process on a path that was previously generated with a double slash and verified that it is now generated with a single slash.
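The double-slash problem described in this PR can be illustrated with a short Go sketch. `joinNamespaceKey` is a hypothetical helper written for this example, not the code changed in the PR. It assumes the goal is exactly one "/" between the storage namespace and the key, whether or not the namespace ends with a slash or the key starts with one.

```go
package main

import (
	"fmt"
	"strings"
)

// joinNamespaceKey is a hypothetical helper (not the PR's implementation).
// It drops one trailing "/" from the namespace and one leading "/" from the
// key, then joins the two with a single "/".
func joinNamespaceKey(storageNamespace, key string) string {
	ns := strings.TrimSuffix(storageNamespace, "/")
	k := strings.TrimPrefix(key, "/")
	return ns + "/" + k
}

func main() {
	// Namespace ends with a slash: naive concatenation would yield "//".
	fmt.Println(joinNamespaceKey("s3://some-bucket/some-path/", "_lakefs/logs/gc/summary"))
	// Key starts with a slash: same result after sanitizing.
	fmt.Println(joinNamespaceKey("s3://some-bucket/some-path", "/_lakefs/logs/gc/summary"))
}
```

The sketch trims strings by hand and does not use `path.Join`, because `path.Join` cleans its result and would also collapse the "//" in the "s3://" scheme.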
Clarification
Although the code itself generates a double slash path, the final entry, in S3, has no double slash in it.
Closes #2732