Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support MinIO #4878

Merged
merged 7 commits into from
Jan 1, 2023
Merged

Support MinIO #4878

merged 7 commits into from
Jan 1, 2023

Conversation

arielshaqed
Copy link
Contributor

@arielshaqed arielshaqed commented Dec 27, 2022

#4795 was either to discover how to make MinIO work or to make it work. Since the only way to discover all the roadblocks is to fix each one, discovery is as hard as fixing. So this PR does the latter.

Most of the code already works, of course. But GC creates its own S3
client. It is not the same as the S3A client and these differences
were significant:

  • S3 endpoint needs to be configured. Otherwise, the client accesses
    the wrong "S3".

  • For most MinIO installations the client needs to be configured with
    path style. Otherwise it will attempt host style access by default
    and time out in some cases such as when running in Docker.

    Note that this cannot be configured on Hadoop versions before 2.8.0
    so users will need to use a suitable build to GC on Spark 2.4.7. A
    mitigating factor is that this is anyway needed to use Spark with
    MinIO -- regardless of GC or indeed lakeFS!

With these in place I was able to GC using the Spark 3.1.2 / Hadoop 3
build.

Fixes #4795. I will split testing on MinIO in Esti off of that issue
for separate work.

hard as fixing, so this PR does the latter.

Most of the code already works, of course.  But GC creates its own S3
client.  It is _not_ the same as the S3A client and these differences
were significant:

* S3 endpoint needs to be configured.  Otherwise, the client accesses
  the wrong "S3".
* For most MinIO installations the client needs to be configured with
  path style.  Otherwise it will attempt host style access by default
  and time out in some cases such as when running in Docker.

  Note that this cannot be configured on Hadoop versions before 2.8.0
  so users will need to use a suitable build to GC on Spark 2.4.7.  A
  mitigating factor is that this is _anyway_ needed to use Spark with
  MinIO -- regardless of GC or indeed lakeFS!

With these in place I was able to GC using the Spark 3.1.2 / Hadoop 3
build.

Fixes #4795.  I will split testing on MinIO in Esti off of that issue
for separate work.
@arielshaqed arielshaqed added area/client/spark area/integrations bug Something isn't working team/ecosystem Team Ecosystem include-changelog PR description should be included in next release changelog labels Dec 27, 2022
@arielshaqed arielshaqed force-pushed the chore/4795-test-committed-gc-on-minio branch from 18285f7 to f5f7654 Compare December 27, 2022 16:05
@arielshaqed
Copy link
Contributor Author

I am investigating test failures on non-Hadoop3. This is strange, I would not expect AWS S3 to stop working as a result of unrelated changes :-/

@arielshaqed arielshaqed force-pushed the chore/4795-test-committed-gc-on-minio branch from bc82800 to 43f991b Compare December 28, 2022 08:56
@@ -29,9 +30,14 @@ object S3ClientBuilder extends io.treeverse.clients.S3ClientBuilder {
)
} else None

val builder = AmazonS3ClientBuilder
.standard()
.withPathStyleAccessEnabled(hc.getBoolean("fs.s3a.path.style.access", true))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you make "fs.s3a.path.style.access" a constant too? (to keep things unified)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

new AwsClientBuilder.EndpointConfiguration(endpoint, region)
)
else
builder
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
builder
builder.withRegion(region)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ouch! Good catch, thanks!

Copy link
Contributor

@Jonathan-Rosenberg Jonathan-Rosenberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesomeeee
Consider adding the documentation change to this PR (change from hadoop-aws 2.7.7 to 2.8.0).
Also the comment from StorageUtils.scala to fix a misconfigured client.

Copy link
Contributor Author

@arielshaqed arielshaqed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

I want to update documentation separately, under #4913.

@@ -29,9 +30,14 @@ object S3ClientBuilder extends io.treeverse.clients.S3ClientBuilder {
)
} else None

val builder = AmazonS3ClientBuilder
.standard()
.withPathStyleAccessEnabled(hc.getBoolean("fs.s3a.path.style.access", true))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

new AwsClientBuilder.EndpointConfiguration(endpoint, region)
)
else
builder
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ouch! Good catch, thanks!

@arielshaqed arielshaqed merged commit dda7b29 into master Jan 1, 2023
@arielshaqed arielshaqed deleted the chore/4795-test-committed-gc-on-minio branch January 1, 2023 13:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/client/spark area/integrations bug Something isn't working include-changelog PR description should be included in next release changelog team/ecosystem Team Ecosystem
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Discovery: GC committed for MinIO
2 participants