
Handling decimal type in dataset #842

Closed

mozinrat opened this issue Sep 6, 2016 · 1 comment

Comments

mozinrat commented Sep 6, 2016

What kind of issue is this?

  • Bug report. If you’ve found a bug, please provide a code snippet or test to reproduce it below. The easier it is to track down the bug, the faster it is solved.

Issue description

Need advice on handling the decimal datatype: I would prefer it be indexed as a float or double numeric type rather than a string, and losing precision is acceptable if needed. I have to rely on index auto-creation and dynamic mapping, so I can't create an index mapping manually as a first step.

Steps to reproduce

Simply try to read a decimal column from a Parquet file using Spark and store it in Elasticsearch using ES-Hadoop.

Code:

Dataset<Row> col = spark.sql("select money from parquetFile");
EsSparkSQL.saveToEs(col, "spark/docs");

Stack trace:

org.elasticsearch.hadoop.serialization.EsHadoopSerializationException: Decimal types are not supported by Elasticsearch

Version Info

OS          :  OSX
JVM         :  JDK 1.8.0_92
Hadoop/Spark:  Spark 2.0.0
ES-Hadoop   :  elasticsearch-spark-20_2.11:5.0.0-alpha5
ES          :  2.3.5
jbaiera (Member) commented Sep 7, 2016

Hello. We prefer that questions pertaining to troubleshooting or advice be asked on the forum instead of GitHub. GitHub issues are for confirmed bugs and actionable features only. Organization is key to success, and we thank you for your understanding and cooperation.

To answer the question though, while we're here:

Decimal types are not supported by the connector because there simply is no way to serialize them into Elasticsearch without losing some precision. In practice these types are usually used to represent monetary values, and losses in precision are generally unacceptable in that case. Instead of blindly accepting the precision loss, or transforming it into a type you may not have expected, we throw an error to indicate that you must make a choice on how to proceed.
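For a concrete illustration (this snippet is not from the original reply), here is the kind of loss at stake when a wide decimal value is forced into a double:

// A minimal sketch: a double carries roughly 15-17 significant digits,
// so a wider decimal cannot round-trip through one without losing digits.
val exact = BigDecimal("1234567890123456789.123456789")
val asDouble = exact.toDouble            // squeezed into 64-bit floating point
println(BigDecimal(asDouble) == exact)   // false: the low-order digits and the
                                         // fractional part did not survive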

If precision is not important for that column/field, we advise applying a transformation that casts the column to either a string or a compatible numeric type. Casts from DecimalType to the IntegralTypes in Spark are allowed only when the decimal's precision is narrow enough that nothing would be lost; casts to StringType, on the other hand, are always allowed. When a JSON string is indexed into a double field in Elasticsearch, the precision is lost at indexing time instead of surfacing as a casting error in Spark. To wit:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.elasticsearch.spark.sql._

val data = Seq(
  Row("1", Decimal(1200.00)),
  Row("2", Decimal(1400.00))
)
val schema = StructType(Array(
  StructField("id", StringType),
  StructField("number", DecimalType(10, 2))
))
val conf = Map("es.mapping.id" -> "id")
val rdd = sc.makeRDD(data)
val df = sqc.createDataFrame(rdd, schema)
// Cast the decimal column to a string before writing; Elasticsearch
// parses the string into the mapped numeric type at index time.
df.select(df("id"), df("number").cast(StringType)).saveToEs("spark/decimalValues", conf)
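If the precision loss is acceptable up front, as in the question, the same pattern works with a direct numeric cast. This variant is a sketch rather than part of the original reply:

// Hypothetical alternative: accept the loss in Spark by casting
// straight to DoubleType instead of going through a string.
df.select(df("id"), df("number").cast(DoubleType)).saveToEs("spark/decimalValues", conf)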

As for specifying a mapping for auto-created indices: I would create an Elasticsearch index template using the template APIs before executing the Spark job. A template is registered with an index name pattern, and any index created with a name matching that pattern automatically has the template's mappings applied. If the template maps the field in question as a double, then when ES-Hadoop auto-creates a matching index, Elasticsearch applies that mapping without any manual intervention.
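For example, against an Elasticsearch 2.x cluster (the template name, index pattern, type, and field names below are illustrative placeholders, not from the original reply):

curl -XPUT 'http://localhost:9200/_template/spark_decimals' -d '{
  "template": "spark*",
  "mappings": {
    "docs": {
      "properties": {
        "number": { "type": "double" }
      }
    }
  }
}'

Any index whose name starts with "spark" (such as the "spark/docs" resource above) then picks up the double mapping for the "number" field when it is auto-created.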
