Skip to content

Conversation

@HyukjinKwon
Copy link
Member

@HyukjinKwon HyukjinKwon commented Mar 15, 2016

What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-13866

This PR adds the support to infer DecimalType.
Here are the rules between IntegerType, LongType and DecimalType.

Infering Types

  1. IntegerType and then LongTypeare tried first.

    Int.MaxValue => IntegerType
    Long.MaxValue => LongType
  2. If it fails, try DecimalType.

    (Long.MaxValue + 1) => DecimalType(20, 0)

    This does not try to infer this as DecimalType when scale is less than 0.

  3. if it fails, try DoubleType

    0.1 => DoubleType // This is failed to be inferred as `DecimalType` because it has the scale, 1.

Compatible Types (Merging Types)

For merging types, this is the same with JSON data source. If DecimalType is not capable, then it becomes DoubleType

How was this patch tested?

Unit tests were used and ./dev/run_tests for code style test.

@HyukjinKwon
Copy link
Member Author

cc @rxin @falaki

@HyukjinKwon
Copy link
Member Author

There should be a conflict with #11550.

I will resolve the conflict as soon as either this one or that one is merged.

@SparkQA
Copy link

SparkQA commented Mar 15, 2016

Test build #53187 has finished for PR 11724 at commit ed1d499.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 15, 2016

Test build #53194 has finished for PR 11724 at commit c144797.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

case IntegerType => tryParseInteger(field)
case LongType => tryParseLong(field)
case DoubleType => tryParseDouble(field)
case _: DecimalType => tryParseDecimal(field)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be consistent:

case DecimalType => tryPraseDecimal(field)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the _ because DecimalType looks referencing the companion object. I tried that expression before but this emits the compilation error below.

Error:(89, 14) pattern type is incompatible with expected type;
 found   : org.apache.spark.sql.types.DecimalType.type
 required: org.apache.spark.sql.types.DataType
Note: if you intended to match against the class, try `case DecimalType(_,_)`
        case DecimalType => tryParseDecimal(field)
             ^

@SparkQA
Copy link

SparkQA commented Mar 16, 2016

Test build #53256 has finished for PR 11724 at commit 978da28.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 16, 2016

Test build #53254 has finished for PR 11724 at commit 439de08.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 16, 2016

Test build #53257 has finished for PR 11724 at commit 6ad04cc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 16, 2016

Test build #53292 has finished for PR 11724 at commit fa42c1d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member Author

@falaki Could you take a look at this please?

@SparkQA
Copy link

SparkQA commented Mar 28, 2016

Test build #54295 has finished for PR 11724 at commit 6de6e63.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member Author

@rxin I am willing to close this one if you are not sure of this one.

@SparkQA
Copy link

SparkQA commented May 2, 2016

Test build #57495 has finished for PR 11724 at commit bdaac7c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Copy link
Contributor

rxin commented May 3, 2016

@HyukjinKwon unfortunately this is too confusing. Can you precisely describe the inference rule in the pr description, and create (unit - not end to end) test cases for the rules?

@HyukjinKwon
Copy link
Member Author

HyukjinKwon commented May 3, 2016

@rxin Sure I will add a more explicit description and some more unit tests for this. Thanks.

@HyukjinKwon
Copy link
Member Author

HyukjinKwon commented May 3, 2016

@rxin I added some more commits for unit tests in CSVInferSchemaSuite. It's okay to hear that it is still confusing or looks not covered enough if it is. I can improve them more.

@SparkQA
Copy link

SparkQA commented May 3, 2016

Test build #57627 has finished for PR 11724 at commit 0fdd796.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 3, 2016

Test build #57629 has finished for PR 11724 at commit adb8747.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Copy link
Contributor

rxin commented May 3, 2016

I actually worry that we are inferring things directly as decimals for floating point numbers, because a lot of formats and tools don't necessarily handle decimals well.

It seems like the problem here is only for large ints. Is it possible to only use decimal if they are integers, and otherwise prefer floating point numbers?

@HyukjinKwon
Copy link
Member Author

@rxin I see. Thank you. Let me fix this up and change the description as well with some rules for LongType, DoubleType and DecimalType.

@SparkQA
Copy link

SparkQA commented May 4, 2016

Test build #57698 has finished for PR 11724 at commit a2bf0c7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 4, 2016

Test build #57701 has finished for PR 11724 at commit 9593ae3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member Author

@rxin Do you mind if I ask a quick look again?

@rxin
Copy link
Contributor

rxin commented May 13, 2016

cc @davies can you review this?

@davies
Copy link
Contributor

davies commented May 13, 2016

LGTM

@davies
Copy link
Contributor

davies commented May 13, 2016

Merging this into master and 2.0, thanks!

asfgit pushed a commit that referenced this pull request May 13, 2016
…source.

## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-13866

This PR adds the support to infer `DecimalType`.
Here are the rules between `IntegerType`, `LongType` and `DecimalType`.

#### Infering Types

1. `IntegerType` and then `LongType`are tried first.

  ```scala
  Int.MaxValue => IntegerType
  Long.MaxValue => LongType
  ```

2. If it fails, try `DecimalType`.

  ```scala
  (Long.MaxValue + 1) => DecimalType(20, 0)
  ```
  This does not try to infer this as `DecimalType` when scale is less than 0.

3. if it fails, try `DoubleType`
  ```scala
  0.1 => DoubleType // This is failed to be inferred as `DecimalType` because it has the scale, 1.
  ```

#### Compatible Types (Merging Types)

For merging types, this is the same with JSON data source. If `DecimalType` is not capable, then it becomes `DoubleType`

## How was this patch tested?

Unit tests were used and `./dev/run_tests` for code style test.

Author: hyukjinkwon <gurwls223@gmail.com>
Author: Hyukjin Kwon <gurwls223@gmail.com>

Closes #11724 from HyukjinKwon/SPARK-13866.

(cherry picked from commit 51841d7)
Signed-off-by: Davies Liu <davies.liu@gmail.com>
@asfgit asfgit closed this in 51841d7 May 13, 2016
@HyukjinKwon HyukjinKwon deleted the SPARK-13866 branch January 2, 2018 03:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants