Is it still required to set S3SingleDriverLogStore when using Delta with S3? #324
See https://github.com/delta-io/delta#requirements-for-underlying-storage-systems - S3 is eventually consistent, so even though Delta appears to work in most cases, you'll get subtle race conditions that will probably manifest as silently dropped data.
It seems that S3 is not eventually consistent anymore.
Yes, we still need to. S3's read-after-write consistency only ensures that a reader will always be able to list and read all the files already written; it does not guarantee that when multiple concurrent attempts write the same file, only one of the writers wins. We need that to ensure mutual exclusion: only one concurrent attempt to write a given file in the Delta log directory must succeed, and the other attempts must fail.
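To illustrate the mutual-exclusion requirement described above: on a POSIX filesystem there is an atomic "create only if absent" primitive, which is exactly what a plain S3 PUT lacks. A minimal sketch (the file name and JSON payload are illustrative, not Delta's actual commit protocol):

```python
import os
import tempfile

def try_commit(path: str, data: bytes) -> bool:
    """Atomically create `path`; return False if it already exists.
    O_CREAT | O_EXCL makes creation fail when the file is present,
    so exactly one concurrent caller can win for a given path."""
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    return True

# Two writers racing to create the same commit file: only the first wins.
log_dir = tempfile.mkdtemp()
commit = os.path.join(log_dir, "00000000000000000001.json")
first = try_commit(commit, b'{"commitInfo": {}}')
second = try_commit(commit, b'{"commitInfo": {}}')
print(first, second)  # True False
```

A LogStore for S3 has to supply this put-if-absent guarantee by some other means, since S3 itself will happily overwrite an existing object.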
If only one Spark job is writing to the Delta table, that means there are no concurrent writes to the table's _delta_log directory, correct?
Hi, I am trying to write to S3 in the Delta format, but am getting this error in PySpark: This is my PySpark snippet:
I am new to PySpark, can you please help me out? I am able to write to S3 outside of PySpark, since I'm running on an EC2 instance which has an IAM role configured, so I did not add access keys to the Hadoop config.
@chinmaychandak could you try
@zsxwing, thank you so much for responding! Really appreciate the help. That worked like a charm, although I now had to explicitly specify IAM keys in the Spark config, otherwise I get
@chinmaychandak what's the error you hit? Is it in the driver or an executor?
My bad, I had the incorrect IAM policy; it seems to work now. I'll definitely keep the Slack channel in mind the next time I have a question. Thanks for pointing me to the community resources, @zsxwing!
According to the docs, it is required to set
`spark.delta.logStore.class=org.apache.spark.sql.delta.storage.S3SingleDriverLogStore`.
However, when I read and write on S3 without this config, it also succeeds. So I wonder: is setting the S3 log store config no longer required? Or will reads and writes on S3 have hidden issues if I don't set that log store config?
PS: I'm using delta 0.5.0 (delta-core_2.11-0.5.0.jar)
Thanks.
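For reference, the setting in question is typically supplied when building the Spark session. A minimal sketch, assuming the delta-core jar is already on the classpath; the app name, bucket, and table path are placeholders:

```python
from pyspark.sql import SparkSession

# Configure the S3 log store from the Delta docs at session-build time.
# "my-bucket/my-table" is a placeholder path, not a real location.
spark = (
    SparkSession.builder
    .appName("delta-s3-example")
    .config("spark.delta.logStore.class",
            "org.apache.spark.sql.delta.storage.S3SingleDriverLogStore")
    .getOrCreate()
)

df = spark.range(5)
df.write.format("delta").mode("overwrite").save("s3a://my-bucket/my-table")
```

Without this config, writes may appear to succeed, which is consistent with the behavior described in the question; the config governs how commits to _delta_log are coordinated rather than whether a single write works.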