
Handling decimal type in dataset #842

Closed

mozinrat opened this issue Sep 6, 2016 · 1 comment

Comments

mozinrat commented Sep 6, 2016

What kind of issue is this?

  • Bug report. If you’ve found a bug, please provide a code snippet or test to reproduce it below. The easier it is to track down the bug, the faster it is solved.

Issue description

Need advice on handling the decimal datatype: I would prefer it be indexed as a float or double numeric type rather than a string, and losing precision is acceptable if needed. I have to rely on index auto-creation and dynamic mapping, so I can't create an index mapping manually as a first step.

Steps to reproduce

Simply try to read a decimal column from a Parquet file using Spark and store it in Elasticsearch using ES-Hadoop.

Code:

Dataset<Row> col = spark.sql("select money from parquetFile");
EsSparkSQL.saveToEs(col, "spark/docs");

Stack trace:

org.elasticsearch.hadoop.serialization.EsHadoopSerializationException: Decimal types are not supported by Elasticsearch

Version Info

OS          :  OSX
JVM         :  JDK 1.8.0_92
Hadoop/Spark:  Spark 2.0.0
ES-Hadoop   :  elasticsearch-spark-20_2.11:5.0.0-alpha5
ES          :  2.3.5
jbaiera (Member) commented Sep 7, 2016

Hello. We prefer that questions pertaining to troubleshooting or advice be asked on the forum instead of GitHub. GitHub issues are for confirmed bugs and actionable features only. Organization is key to success, and we thank you for your understanding and cooperation.

To answer the question though, while we're here:

Decimal types are not supported by the connector because there simply is no way to serialize them into Elasticsearch without losing some precision. In practice these types are usually used to represent monetary values, and losses in precision are generally unacceptable in that case. Instead of blindly accepting the precision loss, or transforming it into a type you may not have expected, we throw an error to indicate that you must make a choice on how to proceed.
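For a concrete illustration (this snippet is not from the original reply), here is the kind of loss at stake when a wide decimal value is forced into a double:

// A minimal sketch: a double carries roughly 15-17 significant digits,
// so a wider decimal cannot round-trip through one without losing digits.
val exact = BigDecimal("1234567890123456789.123456789")
val asDouble = exact.toDouble            // squeezed into 64-bit floating point
println(BigDecimal(asDouble) == exact)   // false: the low-order digits and the
                                         // fractional part did not survive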

If precision is not important for that column/field, we advise applying a transformation that casts the column to either a string or a compatible numeric type. Casts from DecimalType to the IntegralTypes in Spark are allowed only when the decimal's precision is narrow enough that nothing would be lost; casts to StringType, on the other hand, are always allowed. When a JSON string is indexed into a double field in Elasticsearch, the precision is lost at indexing time instead of surfacing as a casting error in Spark. To wit:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.elasticsearch.spark.sql._

val data = Seq(
  Row("1", Decimal(1200.00)),
  Row("2", Decimal(1400.00))
)
val schema = StructType(Array(
  StructField("id", StringType),
  StructField("number", DecimalType(10, 2))
))
val conf = Map("es.mapping.id" -> "id")
val rdd = sc.makeRDD(data)
val df = sqc.createDataFrame(rdd, schema)
// Cast the decimal column to a string before writing; Elasticsearch
// parses the string into the mapped numeric type at index time.
df.select(df("id"), df("number").cast(StringType)).saveToEs("spark/decimalValues", conf)
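If the precision loss is acceptable up front, as in the question, the same pattern works with a direct numeric cast. This variant is a sketch rather than part of the original reply:

// Hypothetical alternative: accept the loss in Spark by casting
// straight to DoubleType instead of going through a string.
df.select(df("id"), df("number").cast(DoubleType)).saveToEs("spark/decimalValues", conf)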

As for specifying a mapping for auto-created indices: I would create an Elasticsearch index template using the template APIs before executing the Spark job. A template is registered with an index name pattern, and any index created with a name matching that pattern automatically has the template's mappings applied. If the template maps the field in question as a double, then when ES-Hadoop auto-creates a matching index, Elasticsearch applies that mapping without any manual intervention.
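For example, against an Elasticsearch 2.x cluster (the template name, index pattern, type, and field names below are illustrative placeholders, not from the original reply):

curl -XPUT 'http://localhost:9200/_template/spark_decimals' -d '{
  "template": "spark*",
  "mappings": {
    "docs": {
      "properties": {
        "number": { "type": "double" }
      }
    }
  }
}'

Any index whose name starts with "spark" (such as the "spark/docs" resource above) then picks up the double mapping for the "number" field when it is auto-created.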
