How to correctly set up SparkDataSet using S3 #1352

paulofbmarcon · 2022-03-16T21:42:21Z


I tried the configuration below in catalog.yml, but I couldn't authenticate to AWS S3. The only way I could access the S3 files was to set com.amazonaws.auth.profile.ProfileCredentialsProvider in spark.yml. Is this the expected behavior?
I expected SparkDataSet to read the credentials from my credentials configuration file.

some_database:
  type: spark.SparkDataSet
  filepath: s3a://some-bucket/somepath/hugefile.csv
  credentials: dev_s3
  file_format: csv
  load_args:
    delimiter: ";"
    header: "True"
    inferSchema: True
    overwrite: "True"

with credentials.yml as follows:

dev_s3:
  key: "mykey"
  secret: "mysecret"

datajoely (Collaborator) · 2022-03-17T10:11:26Z

When it comes to Spark, our experience is that managing cloud credentials (AWS, and often Azure) through Spark's own configuration, as you have done, is typically the easiest way to get up and running. IAM roles make this even easier, but they're not always an option.

The solution you've ended up with is good. The only thing we'd recommend is putting the credentials configuration in conf/local/spark.yml. The keys will be merged correctly at runtime, but because conf/local is git-ignored by default, the credentials won't be committed to Git/VCS.
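
For illustration, a local Spark config along these lines might look like the sketch below. It assumes the project builds its Spark session from the spark.yml entries (as in the usual Kedro PySpark setup), that the hadoop-aws and AWS SDK jars are on the classpath, and that the AWS profile is defined in ~/.aws/credentials. The commented-out static-key entries are an assumed alternative, not something confirmed in this thread.

```yaml
# conf/local/spark.yml (kept out of version control)

# Ask the S3A connector to resolve credentials from the local AWS profile.
spark.hadoop.fs.s3a.aws.credentials.provider: com.amazonaws.auth.profile.ProfileCredentialsProvider

# Assumed alternative: pass static keys directly through Hadoop's S3A options.
# The values are placeholders.
# spark.hadoop.fs.s3a.access.key: "mykey"
# spark.hadoop.fs.s3a.secret.key: "mysecret"
```

With the provider set this way, the catalog entry can keep its s3a:// filepath, and Spark picks up the credentials from its own session configuration.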
