Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -18,7 +18,6 @@
package org.apache.spark.sql.catalyst.json

import java.io.ByteArrayOutputStream
import java.util.Locale

import scala.collection.mutable.ArrayBuffer
import scala.util.Try
Expand Down Expand Up @@ -126,16 +125,11 @@ class JacksonParser(

case VALUE_STRING =>
// Special case handling for NaN and Infinity.
val value = parser.getText
val lowerCaseValue = value.toLowerCase(Locale.ROOT)
if (lowerCaseValue.equals("nan") ||
lowerCaseValue.equals("infinity") ||
lowerCaseValue.equals("-infinity") ||
lowerCaseValue.equals("inf") ||
lowerCaseValue.equals("-inf")) {
value.toFloat
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sounds a bug fix, because toFloat is case sensitive?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For input/output, it is not a bug. Please see #17956 (comment) and the PR description.

Copy link
Member

@gatorsmile gatorsmile May 12, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure whether we should follow it. How about our CSV parsers? How about Pandas? Are they case sensitive?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is JSON format itself related with CSV and Pandas? Please see the discussion in #9759 (comment).

Also, this does not change the existing behaviour.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If possible, we want to be consistent with the others. Could you check Pandas?

Copy link
Member Author

@HyukjinKwon HyukjinKwon May 12, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably let me check and open JIRA/PR if there are some cases we should handle when I have some time. Let's leave out that here. It sounds that does not block this PR and I don't want extra changes hold off this PR like #17901.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR changes the behavior, right? Does your test case pass without the code changes?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, this adds some cases. I am not still convinced that we should take care of every other cases. Let's add options nanValue, positiveInf and negativeInf in a separate PR like our CSV datasource.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without this PR, +INF, INF and -INF" do not work.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, I added some cases here. I will open another PR/JIRA to add those options into JSON related functionalities.

} else {
throw new RuntimeException(s"Cannot parse $value as FloatType.")
parser.getText match {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one last question is, should we be case sensitive here? can we check other systems like pandas?

Copy link
Member Author

@HyukjinKwon HyukjinKwon May 12, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Up to my knowledge, at least Python supports some case-insensitive cases. I would rather leave this working with options (discussed in #17956 (comment)) treating this as a variant. At least, I think we can say here it follows Scala + Jackson's standard.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to me that we don't really support case insensitive before. Although we check the lower case of input string, but we actually call toFloat on the original string. And "nan".toFloat will get java.lang.NumberFormatException: For input string: "nan".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems to me NaN and Infinity, INF are special and case insensitive are not really making sense.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, here, we are not trying to educate users/customers what are right or not. We are trying to improve the usability. If the other popular platforms accept some formats, we need to follow them even if they are strange to us.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should not add all the variants in the default behaviour. We could add the options for those cases. We all already have different ideas. Is it worth adding those cases right now in this PR? This is not the end of fixing this code path.

case "NaN" => Float.NaN
case "Infinity" => Float.PositiveInfinity
case "-Infinity" => Float.NegativeInfinity
case other => throw new RuntimeException(s"Cannot parse $other as FloatType.")
}
}

Expand All @@ -146,16 +140,11 @@ class JacksonParser(

case VALUE_STRING =>
// Special case handling for NaN and Infinity.
val value = parser.getText
val lowerCaseValue = value.toLowerCase(Locale.ROOT)
if (lowerCaseValue.equals("nan") ||
lowerCaseValue.equals("infinity") ||
lowerCaseValue.equals("-infinity") ||
lowerCaseValue.equals("inf") ||
lowerCaseValue.equals("-inf")) {
value.toDouble
} else {
throw new RuntimeException(s"Cannot parse $value as DoubleType.")
parser.getText match {
case "NaN" => Double.NaN
case "Infinity" => Double.PositiveInfinity
case "-Infinity" => Double.NegativeInfinity
case other => throw new RuntimeException(s"Cannot parse $other as DoubleType.")
}
}

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,7 @@ package org.apache.spark.sql.execution.datasources.json
import java.io.{File, StringWriter}
import java.nio.charset.StandardCharsets
import java.sql.{Date, Timestamp}
import java.util.Locale

import com.fasterxml.jackson.core.JsonFactory
import org.apache.hadoop.fs.{Path, PathFilter}
Expand Down Expand Up @@ -1988,4 +1989,43 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
assert(errMsg.startsWith("The field for corrupt records must be string type and nullable"))
}
}

test("SPARK-18772: Parse special floats correctly") {
val jsons = Seq(
"""{"a": "NaN"}""",
"""{"a": "Infinity"}""",
"""{"a": "-Infinity"}""")

// positive cases
val checks: Seq[Double => Boolean] = Seq(
_.isNaN,
_.isPosInfinity,
_.isNegInfinity)

Seq(FloatType, DoubleType).foreach { dt =>
jsons.zip(checks).foreach { case (json, check) =>
val ds = spark.read
.schema(StructType(Seq(StructField("a", dt))))
.json(Seq(json).toDS())
.select($"a".cast(DoubleType)).as[Double]
assert(check(ds.first()))
}
}

// negative cases
Seq(FloatType, DoubleType).foreach { dt =>
val lowerCasedJsons = jsons.map(_.toLowerCase(Locale.ROOT))
// The special floats are case-sensitive so these cases below throw exceptions.
lowerCasedJsons.foreach { lowerCasedJson =>
val e = intercept[SparkException] {
spark.read
.option("mode", "FAILFAST")
.schema(StructType(Seq(StructField("a", dt))))
.json(Seq(lowerCasedJson).toDS())
.collect()
}
assert(e.getMessage.contains("Cannot parse"))
}
}
}
}