[SPARK-25660][SQL] Fix for the backward slash as CSV fields delimiter #22654
Changes from all commits: dd16ca3, 7bf453a, 728aac2, 983315b, 1c2ac25, 20856b4
```diff
@@ -1826,4 +1826,14 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te
     val df = spark.read.option("enforceSchema", false).csv(input)
     checkAnswer(df, Row("1", "2"))
   }
+
+  test("using the backward slash as the delimiter") {
+    val input = Seq("""abc\1""").toDS()
```
**Member:**
Isn't … Or is the point that delimiting takes precedence?

**Member (Author):**
Right, if a user specified …

**Member:**
If a user specified …

**Member (Author):**
I prohibited a single backslash and throw an exception with a tip to use a double backslash.
```diff
+    val delimiter = """\\"""
+    checkAnswer(spark.read.option("delimiter", delimiter).csv(input), Row("abc", "1"))
+    checkAnswer(spark.read.option("inferSchema", true).option("delimiter", delimiter).csv(input),
+      Row("abc", 1))
+    val schema = new StructType().add("a", StringType).add("b", IntegerType)
+    checkAnswer(spark.read.schema(schema).option("delimiter", delimiter).csv(input), Row("abc", 1))
+  }
 }
```
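For readers outside the test harness (which supplies `toDS()` and `checkAnswer` via test implicits), a rough end-to-end sketch of the same behavior; the app name and data are made up, assuming Spark 2.4-era APIs:

```scala
import org.apache.spark.sql.SparkSession

object BackslashDelimiterExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("backslash-delim")
      .getOrCreate()
    import spark.implicits._

    // A dataset whose single string column holds backslash-separated records.
    val input = Seq("""abc\1""", """def\2""").toDS()

    // """\\""" is two characters (an escaped backslash), which the reader
    // resolves to the single '\' delimiter character.
    val df = spark.read.option("delimiter", """\\""").csv(input)
    df.show()
    // +---+---+
    // |_c0|_c1|
    // +---+---+
    // |abc|  1|
    // |def|  2|
    // +---+---+

    spark.stop()
  }
}
```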
**Member:**
I'm missing why we had to switch up the case statement like this. I get that we need to cover more cases, but there was duplication and now there is a bit more. What about ...
**Member (Author):**
I would prefer a more declarative way and fewer nested levels of control, but this is a personal opinion. Let's look at your example:

- You didn't check that `str` can be null.
- If it has length 2, how could `str` be `"""\u0000"""`?
- You should produce control characters, not just the second char. For example: `\t` -> `Seq('\\', 't')` -> `'\t'`.

In my approach, everything is simple. One input case is mapped to one output. There is no unnecessary complexity.
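For reference, the declarative shape being described looks roughly like the following; this is a sketch of the approach, not a verbatim copy of the PR's code, and the function name `toChar` and the exact error messages are assumptions:

```scala
// Sketch: map each delimiter string to exactly one output character,
// and fail fast with a targeted error for everything else.
def toChar(str: String): Char = (str: Seq[Char]) match {
  case Seq() =>
    throw new IllegalArgumentException("Delimiter cannot be empty string")
  case Seq('\\') =>
    // The prohibited single backslash, with a tip to escape it.
    throw new IllegalArgumentException(
      "Single backslash is prohibited. Use \"\\\\\" instead.")
  case Seq(c) => c
  case Seq('\\', 't') => '\t'
  case Seq('\\', 'r') => '\r'
  case Seq('\\', 'b') => '\b'
  case Seq('\\', 'f') => '\f'
  case Seq('\\', '\"') => '\"'
  case Seq('\\', '\'') => '\''
  case Seq('\\', '\\') => '\\'
  // The six-character literal \u0000 does not fit a two-char Seq pattern,
  // hence the separate guard.
  case _ if str == """\u0000""" => '\u0000'
  case Seq('\\', _) =>
    throw new IllegalArgumentException(s"Unsupported special character for delimiter: $str")
  case _ =>
    throw new IllegalArgumentException(s"Delimiter cannot be more than one character: $str")
}
```

Each supported input maps to exactly one output, which is the one-case-to-one-output property the comment argues for.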
**Member:**
Ah yeah, good points. This is too clever off the top of my head. I still wonder if the code here can reduce the duplication of `Seq('\\', c) => '\c'`, but I don't see a way that actually works.