Support MinIO #4878
Conversation
Force-pushed from 18285f7 to f5f7654
I am investigating test failures on non-Hadoop3. This is strange; I would not expect AWS S3 to stop working as a result of unrelated changes :-/
Force-pushed from bc82800 to 43f991b
```
@@ -29,9 +30,14 @@ object S3ClientBuilder extends io.treeverse.clients.S3ClientBuilder {
      )
    } else None

    val builder = AmazonS3ClientBuilder
      .standard()
      .withPathStyleAccessEnabled(hc.getBoolean("fs.s3a.path.style.access", true))
```
Can you make `"fs.s3a.path.style.access"` a constant too? (to keep things unified)
Done!
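For illustration, hoisting the key into a constant could look like the sketch below. The object and constant names here are hypothetical; the PR may name or place the constant differently.

```scala
import com.amazonaws.services.s3.{AmazonS3, AmazonS3ClientBuilder}
import org.apache.hadoop.conf.Configuration

object S3ClientExample {
  // Hypothetical constant; keeps the S3A key in one place instead of
  // repeating the string literal at each use site.
  val S3A_PATH_STYLE_ACCESS = "fs.s3a.path.style.access"

  def build(hc: Configuration): AmazonS3 =
    AmazonS3ClientBuilder
      .standard()
      // Default to path-style access, which most MinIO installations need.
      .withPathStyleAccessEnabled(hc.getBoolean(S3A_PATH_STYLE_ACCESS, true))
      .build()
}
```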
```
            new AwsClientBuilder.EndpointConfiguration(endpoint, region)
          )
        else
          builder
```
Suggested change:
```diff
-          builder
+          builder.withRegion(region)
```
Ouch! Good catch, thanks!
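To make the bug concrete, here is a sketch of the endpoint branch with the suggestion applied. The helper and parameter names are illustrative; the point is that the no-endpoint branch must still set the region, since `withEndpointConfiguration` and `withRegion` are mutually exclusive on the AWS SDK builder.

```scala
import com.amazonaws.client.builder.AwsClientBuilder
import com.amazonaws.services.s3.AmazonS3ClientBuilder

// Illustrative helper: `endpoint` and `region` would come from the GC job's
// configuration; the actual code in the PR may differ in detail.
def configureEndpoint(
    builder: AmazonS3ClientBuilder,
    endpoint: Option[String],
    region: String
): AmazonS3ClientBuilder =
  endpoint match {
    case Some(e) =>
      // A custom endpoint (e.g. MinIO) carries its signing region with it.
      builder.withEndpointConfiguration(
        new AwsClientBuilder.EndpointConfiguration(e, region)
      )
    case None =>
      // Before the fix this branch returned `builder` unchanged and the
      // region was silently dropped.
      builder.withRegion(region)
  }
```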
Awesomeeee
Consider adding the documentation change to this PR (the change from hadoop-aws 2.7.7 to 2.8.0). Also the comment from StorageUtils.scala on how to fix a misconfigured client.
Thanks!
I want to update documentation separately, under #4913.
#4795 was either to discover how to make MinIO work or to make it work. Since the only way to discover all the roadblocks is to fix each one, discovery is as hard as fixing. So this PR does the latter.
Most of the code already works, of course. But GC creates its own S3
client. It is not the same as the S3A client and these differences
were significant:
* S3 endpoint needs to be configured. Otherwise, the client accesses the wrong "S3".
* For most MinIO installations the client needs to be configured with path style. Otherwise it will attempt host-style access by default and time out in some cases, such as when running in Docker. (A configuration sketch follows this list.)
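For illustration, both settings are ordinary Hadoop configuration keys, so they can be passed to the Spark job that runs GC (for example as `--conf spark.hadoop.fs.s3a.*` flags to spark-submit). A minimal Scala sketch, with a placeholder MinIO endpoint:

```scala
import org.apache.spark.sql.SparkSession

// http://minio:9000 is a placeholder; use your MinIO endpoint. The
// spark.hadoop.* prefix copies these keys into the Hadoop configuration,
// where both S3A and (with this PR) GC's own S3 client can read them.
val spark = SparkSession
  .builder()
  .appName("gc-minio-example") // hypothetical app name
  .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
  .config("spark.hadoop.fs.s3a.path.style.access", "true")
  .getOrCreate()
```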
Note that this cannot be configured on Hadoop versions before 2.8.0, so users will need a suitable Hadoop build to GC on Spark 2.4.7. A mitigating factor is that this is anyway needed to use Spark with MinIO -- regardless of GC or indeed lakeFS!
With these in place I was able to GC using the Spark 3.1.2 / Hadoop 3
build.
Fixes #4795. I will split testing on MinIO in Esti off of that issue
for separate work.