Is it still required to set S3SingleDriverLogStore when using Delta with S3? #324

Closed

DdMad opened this issue Feb 18, 2020 · 9 comments

DdMad commented Feb 18, 2020

According to the doc, it is required to set spark.delta.logStore.class=org.apache.spark.sql.delta.storage.S3SingleDriverLogStore.

However, when I try reading and writing on S3 without this config, it still succeeds. So I wonder whether setting the S3 log store config is no longer required, or whether reads and writes on S3 will have hidden issues if I don't set it?

PS: I'm using delta 0.5.0 (delta-core_2.11-0.5.0.jar)

Thanks.
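
For reference, a minimal sketch of how that config is typically passed when building the session; the Maven coordinate below just matches the delta-core_2.11-0.5.0 jar mentioned above and is not something prescribed by this thread:

    from pyspark.sql import SparkSession

    # Sketch: supply the S3 log store class from the docs as a Spark config.
    spark = SparkSession.builder \
        .config("spark.jars.packages", "io.delta:delta-core_2.11:0.5.0") \
        .config("spark.delta.logStore.class",
                "org.apache.spark.sql.delta.storage.S3SingleDriverLogStore") \
        .getOrCreate()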

@mukulmurthy (Collaborator)

See https://github.com/delta-io/delta#requirements-for-underlying-storage-systems - S3 is eventually consistent, so even though Delta appears to work in most cases, you'll get subtle race conditions that will probably manifest as silently dropped data.

zbstof commented Apr 26, 2021

It seems that S3 is not eventually consistent anymore:
https://aws.amazon.com/blogs/aws/amazon-s3-update-strong-read-after-write-consistency/
So do we still need to set spark.delta.logStore.class?

tdas (Contributor) commented Apr 26, 2021

Yes, we still need to. S3's read-after-write consistency only ensures that a writer will always be able to list and read all the files already written; it does not guarantee that, when multiple writers concurrently attempt to write the same file, only one of them will win. We need that to ensure mutual exclusion, that is, only one concurrent attempt to write a file in the Delta log directory must succeed, and the other attempts must fail.
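
To make the mutual-exclusion point concrete, here is a toy illustration (not Delta's actual implementation): committing version N amounts to creating _delta_log/N.json with "fail if it already exists" semantics, which a local filesystem can express but a plain S3 PUT cannot.

    import os

    def try_commit(log_dir: str, version: int, content: str) -> bool:
        """Toy sketch of a commit that must win exclusively."""
        path = os.path.join(log_dir, f"{version:020d}.json")
        try:
            # mode "x" = exclusive create; raises FileExistsError if another
            # writer already committed this version.
            with open(path, "x") as f:
                f.write(content)
            return True
        except FileExistsError:
            return False  # lost the race; retry against the next version

    # A plain S3 PUT silently overwrites, so two concurrent writers could both
    # "succeed" and one commit would be lost; hence the special log store.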

@emanuelh-cloud

If only one Spark job is writing to the Delta table, that means there are no concurrent writes to the table's _delta_log directory, correct?
Do I still need to define "spark.delta.logStore.class=org.apache.spark.sql.delta.storage.S3SingleDriverLogStore"?

chinmaychandak commented May 25, 2021

Hi guys,

I am trying to write to S3 in the Delta format, but am getting this error in PySpark: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3".

This is my PySpark snippet:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder \
        .config("spark.sql.execution.arrow.pyspark.enabled", "true") \
        .config("spark.sql.execution.arrow.pyspark.fallback.enabled", "true") \
        .config("spark.sql.execution.arrow.maxRecordsPerBatch", 10) \
        .config("spark.jars.packages", "io.delta:delta-core_2.12:1.0.0,org.apache.hadoop:hadoop-aws:3.2.0") \
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
        .config("spark.delta.logStore.class", "org.apache.spark.sql.delta.storage.S3SingleDriverLogStore") \
        .getOrCreate()

    spark_df = spark.read.format("csv").load("some_CSV.csv", header="True")

    spark_df.write.format("delta").option("mergeSchema", "true").mode("append") \
        .save("s3://demo/delta/test")

I am new to PySpark; can you please help me out? I am able to write to S3 outside of PySpark, since I'm running on an EC2 instance which has an IAM role configured, so I did not add access keys to the Hadoop config.

zsxwing (Member) commented May 25, 2021

@chinmaychandak could you try s3a://demo/delta/test instead? I remember hadoop-aws doesn't register a FileSystem for the s3 scheme by default; s3a is the one it supports out of the box.
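
In other words, a minimal sketch of the corrected call, assuming the rest of the snippet above stays the same:

    # Same write as before, but using the s3a:// scheme that hadoop-aws provides
    spark_df.write.format("delta") \
        .option("mergeSchema", "true") \
        .mode("append") \
        .save("s3a://demo/delta/test")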

chinmaychandak commented May 25, 2021

@zsxwing, thank you so much for responding! Really appreciate the help.

That worked like a charm, although I now have to explicitly specify IAM keys in the Spark config, otherwise I get Access Denied. Any workaround for not having to specify the keys?

zsxwing (Member) commented May 26, 2021

@chinmaychandak what's the error you hit? Is it in the driver or an executor? hadoop-aws should support IAM roles. By the way, it's better to use the Slack channel or mailing list to ask questions (see https://github.com/delta-io/delta#community); it's hard to track messages in a closed ticket.
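
For reference, a sketch of the two usual ways to wire up S3A credentials; the property names are standard hadoop-aws (S3A) options and the provider class comes from the bundled AWS SDK, so it lets S3A pick up the EC2 instance role without keys in the config, but check both against your hadoop-aws version since none of this is spelled out in the thread:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder \
        .config("spark.jars.packages",
                "io.delta:delta-core_2.12:1.0.0,org.apache.hadoop:hadoop-aws:3.2.0") \
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
        .config("spark.hadoop.fs.s3a.aws.credentials.provider",
                "com.amazonaws.auth.InstanceProfileCredentialsProvider") \
        .getOrCreate()

    # Alternative: explicit keys (the workaround mentioned above), e.g.
    #   .config("spark.hadoop.fs.s3a.access.key", "<ACCESS_KEY>")
    #   .config("spark.hadoop.fs.s3a.secret.key", "<SECRET_KEY>")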

@chinmaychandak

My bad, I had an incorrect IAM policy; it seems to work now.

I'll definitely keep the Slack channel in mind the next time I have a question, thanks for pointing me to the community resources, @zsxwing!
