
Conversation

@WeichenXu123 WeichenXu123 commented Jul 18, 2019

What changes were proposed in this pull request?

The CSV datasource throws com.univocity.parsers.common.TextParsingException with a very large message, which makes the log output consume a large amount of disk space. This is troublesome when we need to parse CSV files with large columns.

This PR proposes to call setErrorContentLength(1000) on the CSV parser/writer settings to limit the error message length.
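In univocity-parsers, both CsvParserSettings and CsvWriterSettings inherit setErrorContentLength from CommonSettings, so the change can be sketched as below. This is an illustrative sketch, not the exact Spark patch:

```scala
import com.univocity.parsers.csv.{CsvParserSettings, CsvWriterSettings}

// Limit how much of the offending content is echoed back in
// TextParsingException messages; a negative value (the default)
// means the entire content is included.
val parserSettings = new CsvParserSettings()
parserSettings.setErrorContentLength(1000)

val writerSettings = new CsvWriterSettings()
writerSettings.setErrorContentLength(1000)
```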

How was this patch tested?

Manually.

val s = "a" * 40 * 1000000
Seq(s).toDF.write.mode("overwrite").csv("/tmp/bogdan/es4196.csv")

spark.read.option("maxCharsPerColumn", 30000000).csv("/tmp/bogdan/es4196.csv").count

Before:
The thrown message includes error content of about 30MB: the column size exceeds the 30MB maximum, so the error content contains the entire parsed content.

After:
The thrown message includes error content like "...aaa...aa" (1024 'a' characters), i.e. the content is limited to 1024 characters.
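The truncation behavior described above can be mimicked with a small standalone helper (a hypothetical sketch for illustration; univocity's actual formatting may differ in detail):

```scala
// Hypothetical stand-in for error-content truncation: keep only the
// last `limit` characters and prefix "..." to mark the cut.
def truncateErrorContent(content: String, limit: Int): String =
  if (limit < 0 || content.length <= limit) content
  else "..." + content.substring(content.length - limit)

val huge = "a" * 30 * 1000000          // ~30M characters of parsed content
val shown = truncateErrorContent(huge, 1024)
// shown is "..." followed by the final 1024 characters, not the full 30MB
```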


@dongjoon-hyun dongjoon-hyun changed the title [SPARK-28431]Fix CSV datasource throw com.univocity.parsers.common.TextParsingException with large size message [SPARK-28431][SQL] Fix CSV datasource throw com.univocity.parsers.common.TextParsingException with large size message Jul 18, 2019
@WeichenXu123 WeichenXu123 force-pushed the limit_csv_exception_size branch from 35f8b32 to c8f1a52 Compare July 19, 2019 01:57

@HyukjinKwon HyukjinKwon left a comment


Looks fine otherwise

@HyukjinKwon HyukjinKwon changed the title [SPARK-28431][SQL] Fix CSV datasource throw com.univocity.parsers.common.TextParsingException with large size message [SPARK-28431][SQL] Set meximum error message length in CSV datasource Jul 21, 2019
@HyukjinKwon HyukjinKwon changed the title [SPARK-28431][SQL] Set meximum error message length in CSV datasource [SPARK-28431][SQL] Set meximum error message length in CSV datasource's parsing and writing Jul 21, 2019

@MaxGekk MaxGekk left a comment


Please, fix PR's title meximum -> maximum

@HyukjinKwon HyukjinKwon changed the title [SPARK-28431][SQL] Set meximum error message length in CSV datasource's parsing and writing [SPARK-28431][SQL] Set maximum error message length in CSV datasource's parsing and writing Jul 21, 2019
@WeichenXu123 WeichenXu123 force-pushed the limit_csv_exception_size branch from 634cc64 to f8c3f7f Compare July 22, 2019 04:08

@HyukjinKwon HyukjinKwon left a comment


LGTM


@WeichenXu123

Jenkins, retest this please.


SparkQA commented Jul 22, 2019

Test build #107992 has finished for PR 25184 at commit 9e8fcca.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jul 22, 2019

Test build #108002 has finished for PR 25184 at commit 9e8fcca.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon

Merged to master.



7 participants