[SPARK-13866][SQL] Handle decimal type in CSV inference at CSV data source. #11724
Conversation
There should be a conflict with #11550. I will resolve the conflict as soon as either this one or that one is merged.
Test build #53187 has finished for PR 11724 at commit
Test build #53194 has finished for PR 11724 at commit
case IntegerType => tryParseInteger(field)
case LongType => tryParseLong(field)
case DoubleType => tryParseDouble(field)
case _: DecimalType => tryParseDecimal(field)
To be consistent:
case DecimalType => tryParseDecimal(field)
I added the `_` because the bare name `DecimalType` refers to the companion object. I tried that expression before, but it emits the compilation error below.
Error:(89, 14) pattern type is incompatible with expected type;
found : org.apache.spark.sql.types.DecimalType.type
required: org.apache.spark.sql.types.DataType
Note: if you intended to match against the class, try `case DecimalType(_,_)`
case DecimalType => tryParseDecimal(field)
^
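For context, here is a minimal, self-contained Scala sketch (a simplified stand-in with made-up types, not Spark's actual org.apache.spark.sql.types hierarchy) showing why the bare `case DecimalType =>` pattern refers to the companion object and fails to compile, while the type pattern `case _: DecimalType =>` matches instances of the class:

```scala
// Simplified stand-in types; not Spark's actual DataType hierarchy.
sealed trait DataType
case object IntegerType extends DataType
case object LongType    extends DataType
case object DoubleType  extends DataType
// A case class gets a companion object; in a pattern, the bare name `DecimalType`
// refers to that object, whose type (DecimalType.type) is not a DataType.
case class DecimalType(precision: Int, scale: Int) extends DataType

def label(dt: DataType): String = dt match {
  case IntegerType    => "int"
  case LongType       => "long"
  case DoubleType     => "double"
  // case DecimalType => ...          // would not compile: pattern type is DecimalType.type
  case _: DecimalType => "decimal"    // matches any instance; `case DecimalType(_, _)` also works
}
```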
Test build #53256 has finished for PR 11724 at commit
Test build #53254 has finished for PR 11724 at commit
Test build #53257 has finished for PR 11724 at commit
Test build #53292 has finished for PR 11724 at commit

@falaki Could you take a look at this please?
Test build #54295 has finished for PR 11724 at commit
@rxin I am willing to close this one if you are not sure about it.
Test build #57495 has finished for PR 11724 at commit
@HyukjinKwon unfortunately this is too confusing. Can you precisely describe the inference rules in the PR description, and create (unit, not end-to-end) test cases for the rules?
@rxin Sure, I will add a more explicit description and some more unit tests for this. Thanks.
@rxin I added some more commits for unit tests in
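As a rough illustration of the unit-level checks being discussed (plain standard-library assertions; not Spark's actual CSVInferSchemaSuite, and the object name is made up), the boundary behind the inference rules can be pinned down like this:

```scala
import scala.util.Try

// Hypothetical standalone checks; structure and names are illustrative only.
object InferenceRuleChecks extends App {
  // Long.MaxValue + 1 overflows Long, so it cannot be inferred as an integral type...
  assert(Try("9223372036854775808".toLong).isFailure)
  // ...but parsed as a BigDecimal it is integral (scale 0), so it is a decimal candidate.
  assert(BigDecimal("9223372036854775808").scale == 0)
  // A fractional value has a positive scale, so it would fall back to a double instead.
  assert(BigDecimal("0.1").scale == 1)
  println("all checks passed")
}
```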
Test build #57627 has finished for PR 11724 at commit
|
|
Test build #57629 has finished for PR 11724 at commit
|
|
I actually worry that we are inferring things directly as decimals for floating point numbers, because a lot of formats and tools don't necessarily handle decimals well. It seems like the problem here is only for large ints. Is it possible to only use decimal if they are integers, and otherwise prefer floating point numbers?
@rxin I see. Thank you. Let me fix this up and change the description as well with some rules for
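As a sketch of that rule (hypothetical names such as `inferField`, `Inferred`, and `DecimalT`; not Spark's actual `CSVInferSchema`), the parsers can be tried in order so that only integral values become decimals, while anything with fractional digits is preferred as a double:

```scala
import scala.util.Try

// Hypothetical result types for this sketch; Spark would return its own DataType values.
sealed trait Inferred
case object IntegerT extends Inferred
case object LongT    extends Inferred
case class  DecimalT(precision: Int, scale: Int) extends Inferred
case object DoubleT  extends Inferred
case object StringT  extends Inferred

def inferField(field: String): Inferred = {
  val asInt  = Try(field.toInt).map(_ => IntegerT: Inferred)
  val asLong = Try(field.toLong).map(_ => LongT: Inferred)
  // Only integral values (scale <= 0, i.e. no fractional digits) become decimals;
  // fractional values fall through and end up as doubles.
  val asDecimal = Try(BigDecimal(field))
    .filter(_.scale <= 0)
    .map(d => DecimalT(d.precision, d.scale): Inferred)
  val asDouble = Try(field.toDouble).map(_ => DoubleT: Inferred)

  asInt.orElse(asLong).orElse(asDecimal).orElse(asDouble).getOrElse(StringT)
}

// inferField("12")                  // => IntegerT
// inferField("9223372036854775808") // => DecimalT(19, 0), Long.MaxValue + 1 is integral
// inferField("0.1")                 // => DoubleT, it has a fractional part
```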
Test build #57698 has finished for PR 11724 at commit
|
|
Test build #57701 has finished for PR 11724 at commit
|
|
@rxin Do you mind taking a quick look at this again?
cc @davies can you review this?

LGTM

Merging this into master and 2.0, thanks!
[SPARK-13866][SQL] Handle decimal type in CSV inference at CSV data source.

Author: hyukjinkwon <gurwls223@gmail.com>
Author: Hyukjin Kwon <gurwls223@gmail.com>

Closes #11724 from HyukjinKwon/SPARK-13866.

(cherry picked from commit 51841d7)
Signed-off-by: Davies Liu <davies.liu@gmail.com>

What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-13866

This PR adds support for inferring `DecimalType`. The rules between `IntegerType`, `LongType` and `DecimalType` are as follows.

Inferring Types

1. `IntegerType` and then `LongType` are tried first.

```scala
Int.MaxValue  => IntegerType
Long.MaxValue => LongType
```

2. If that fails, `DecimalType` is tried.

```scala
(Long.MaxValue + 1) => DecimalType(20, 0)
```

A value is not inferred as `DecimalType` when it has a positive scale (that is, fractional digits).

3. If that fails, `DoubleType` is tried.

```scala
0.1 => DoubleType // Not inferred as `DecimalType` because it has a scale of 1.
```

Compatible Types (Merging Types)

For merging types, the behavior is the same as for the JSON data source: if `DecimalType` cannot hold the merged values, it becomes `DoubleType`.

How was this patch tested?

Unit tests were added, and `./dev/run_tests` was run for the style checks.
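As a sketch of the merging rule above (a hypothetical `FieldType` representation and `merge` helper; the maximum precision of 38 is an assumption mirroring `DecimalType`, and this is not Spark's actual implementation): two decimals merge into one wide enough for both sides, and when that would exceed the maximum precision the merged type degrades to a double.

```scala
// Hypothetical type representation for this sketch; not Spark's DataType hierarchy.
sealed trait FieldType
case object IntT extends FieldType
case object LongT extends FieldType
case class  DecT(precision: Int, scale: Int) extends FieldType
case object DblT extends FieldType
case object StrT extends FieldType

val MaxPrecision = 38 // assumption: mirrors DecimalType's maximum precision

def merge(a: FieldType, b: FieldType): FieldType = (a, b) match {
  case (x, y) if x == y              => x
  case (IntT, LongT) | (LongT, IntT) => LongT
  case (DecT(p1, s1), DecT(p2, s2))  =>
    // Keep enough digits before and after the decimal point for both sides; if that
    // exceeds the maximum precision, the decimal "is not capable" and we fall back to double.
    val scale  = math.max(s1, s2)
    val digits = math.max(p1 - s1, p2 - s2) + scale
    if (digits <= MaxPrecision) DecT(digits, scale) else DblT
  case (_: DecT, DblT) | (DblT, _: DecT) => DblT
  case _                             => StrT // other combinations are omitted from this sketch
}

// merge(DecT(20, 0), DecT(10, 5)) // => DecT(25, 5)
// merge(DecT(38, 0), DecT(10, 5)) // => DblT (would need precision 43 > 38)
```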