Is it still required to set S3SingleDriverLogStore when using Delta with S3? #324

Closed

DdMad opened this issue Feb 18, 2020 · 9 comments

DdMad commented Feb 18, 2020

According to the doc, it is required to set spark.delta.logStore.class=org.apache.spark.sql.delta.storage.S3SingleDriverLogStore.

However, when I try reading and writing on S3 without this config, it still succeeds. So I wonder whether setting the S3 log store config is no longer required, or whether reads and writes on S3 will have hidden issues if I don't set it?

PS: I'm using delta 0.5.0 (delta-core_2.11-0.5.0.jar)

Thanks.
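
For reference, a minimal sketch of how that config is typically passed when building the session; the Maven coordinate below just matches the delta-core_2.11-0.5.0 jar mentioned above and is not something prescribed by this thread:

    from pyspark.sql import SparkSession

    # Sketch: supply the S3 log store class from the docs as a Spark config.
    spark = SparkSession.builder \
        .config("spark.jars.packages", "io.delta:delta-core_2.11:0.5.0") \
        .config("spark.delta.logStore.class",
                "org.apache.spark.sql.delta.storage.S3SingleDriverLogStore") \
        .getOrCreate()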

@mukulmurthy (Collaborator)

See https://github.com/delta-io/delta#requirements-for-underlying-storage-systems - S3 is eventually consistent, so even though Delta appears to work in most cases, you'll get subtle race conditions that will probably manifest as silently dropped data.

zbstof commented Apr 26, 2021

It seems that S3 is not eventually consistent anymore:
https://aws.amazon.com/blogs/aws/amazon-s3-update-strong-read-after-write-consistency/
So do we still need to set spark.delta.logStore.class?

tdas (Contributor) commented Apr 26, 2021

Yes, we still need to. S3's read-after-write consistency only ensures that a writer will always be able to list and read all the files already written; it does not guarantee that, when multiple writers concurrently attempt to write the same file, only one of them will win. We need that to ensure mutual exclusion, that is, only one concurrent attempt to write a file in the Delta log directory must succeed, and the other attempts must fail.
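
To make the mutual-exclusion point concrete, here is a toy illustration (not Delta's actual implementation): committing version N amounts to creating _delta_log/N.json with "fail if it already exists" semantics, which a local filesystem can express but a plain S3 PUT cannot.

    import os

    def try_commit(log_dir: str, version: int, content: str) -> bool:
        """Toy sketch of a commit that must win exclusively."""
        path = os.path.join(log_dir, f"{version:020d}.json")
        try:
            # mode "x" = exclusive create; raises FileExistsError if another
            # writer already committed this version.
            with open(path, "x") as f:
                f.write(content)
            return True
        except FileExistsError:
            return False  # lost the race; retry against the next version

    # A plain S3 PUT silently overwrites, so two concurrent writers could both
    # "succeed" and one commit would be lost; hence the special log store.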

@emanuelh-cloud

If only one Spark job is writing to the Delta table, that means there are no concurrent writes to the table's _delta_log directory, correct?
Do I still need to define "spark.delta.logStore.class=org.apache.spark.sql.delta.storage.S3SingleDriverLogStore"?

chinmaychandak commented May 25, 2021

Hi guys,

I am trying to write to S3 in the Delta format, but am getting this error in PySpark: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3".

This is my PySpark snippet:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder \
        .config("spark.sql.execution.arrow.pyspark.enabled", "true") \
        .config("spark.sql.execution.arrow.pyspark.fallback.enabled", "true") \
        .config("spark.sql.execution.arrow.maxRecordsPerBatch", 10) \
        .config("spark.jars.packages", "io.delta:delta-core_2.12:1.0.0,org.apache.hadoop:hadoop-aws:3.2.0") \
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
        .config("spark.delta.logStore.class", "org.apache.spark.sql.delta.storage.S3SingleDriverLogStore") \
        .getOrCreate()

    spark_df = spark.read.format("csv").load("some_CSV.csv", header="True")

    spark_df.write.format("delta").option("mergeSchema", "true").mode("append") \
        .save("s3://demo/delta/test")

I am new to PySpark; can you please help me out? I am able to write to S3 outside of PySpark, since I'm running on an EC2 instance which has an IAM role configured, so I did not add access keys to the Hadoop config.

zsxwing (Member) commented May 25, 2021

@chinmaychandak could you try s3a://demo/delta/test instead? I remember hadoop-aws doesn't register a FileSystem for the s3 scheme by default; s3a is the one it supports out of the box.
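
In other words, a minimal sketch of the corrected call, assuming the rest of the snippet above stays the same:

    # Same write as before, but using the s3a:// scheme that hadoop-aws provides
    spark_df.write.format("delta") \
        .option("mergeSchema", "true") \
        .mode("append") \
        .save("s3a://demo/delta/test")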

chinmaychandak commented May 25, 2021

@zsxwing, thank you so much for responding! Really appreciate the help.

That worked like a charm, although I now have to explicitly specify IAM keys in the Spark config, otherwise I get Access Denied. Any workaround for not having to specify the keys?

zsxwing (Member) commented May 26, 2021

@chinmaychandak what's the error you hit? Is it in the driver or an executor? hadoop-aws should support IAM roles. By the way, it's better to use the Slack channel or mailing list to ask questions (see https://github.com/delta-io/delta#community); it's hard to track messages in a closed ticket.
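
For reference, a sketch of the two usual ways to wire up S3A credentials; the property names are standard hadoop-aws (S3A) options and the provider class comes from the bundled AWS SDK, so it lets S3A pick up the EC2 instance role without keys in the config, but check both against your hadoop-aws version since none of this is spelled out in the thread:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder \
        .config("spark.jars.packages",
                "io.delta:delta-core_2.12:1.0.0,org.apache.hadoop:hadoop-aws:3.2.0") \
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
        .config("spark.hadoop.fs.s3a.aws.credentials.provider",
                "com.amazonaws.auth.InstanceProfileCredentialsProvider") \
        .getOrCreate()

    # Alternative: explicit keys (the workaround mentioned above), e.g.
    #   .config("spark.hadoop.fs.s3a.access.key", "<ACCESS_KEY>")
    #   .config("spark.hadoop.fs.s3a.secret.key", "<SECRET_KEY>")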

@chinmaychandak

My bad, I had an incorrect IAM policy; it seems to work now.

I'll definitely keep the Slack channel in mind the next time I have a question, thanks for pointing me to the community resources, @zsxwing!
