Skip to content

Conversation

@mgaido91
Copy link
Contributor

What changes were proposed in this pull request?

SPARK-22893 tried to unify error messages about dataTypes. Unfortunately, still many places were missing the simpleString method in other to have the same representation everywhere.

The PR unified the messages using alway the simpleString representation of the dataTypes in the messages.

How was this patch tested?

existing/modified UTs

@maropu
Copy link
Member

maropu commented May 14, 2018

jsonExpressions.scala also has the same issue?

s"A type of keys and values in map() must be string, but got ${m.dataType}")

Then, you need to update the output results in SQLQueryTestSuite.

@SparkQA
Copy link

SparkQA commented May 14, 2018

Test build #90586 has finished for PR 21321 at commit ada7667.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91
Copy link
Contributor Author

thanks @maropu I missed that one. I'll update it shortly, thanks.

@SparkQA
Copy link

SparkQA commented May 15, 2018

Test build #90638 has finished for PR 21321 at commit e09026b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91
Copy link
Contributor Author

cc @gatorsmile

@mgaido91
Copy link
Contributor Author

kindly ping @gatorsmile

1 similar comment
@mgaido91
Copy link
Contributor Author

kindly ping @gatorsmile

Copy link
Contributor

@holdenk holdenk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a quick question for you from skimming this PR.

dataType.isInstanceOf[StringType] ||
dataType.isInstanceOf[BooleanType],
s"FeatureHasher requires columns to be of NumericType, BooleanType or StringType. " +
s"FeatureHasher requires columns to be of ${NumericType.simpleString}, " +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the original PR it doesn't seem like we rewrote the constant types only the dynamic ones (and this PR also doesn't seem to consistently rewrite the constant types referenced). What's the reason/how do you decide which ones you want to rewrite to ref simpleString?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this PR rewrites always constant type referenced. I am not sure why you are saying it is not. If I missed some places, then it was just because I haven't seen them.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's my bad, looking through it I saw some raw type names but those are all in the suites which makes sense.

Copy link
Contributor

@holdenk holdenk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question about if we want to change to using SchemaUtils rather than just changing the error messages.

Some places which might have been missed in this PR under ml/ which I did a quick check with grep -r "Type" ./mllib/src/main/scala/org/apache/spark/ml |grep -i require (my bash regex might not be great but you get the idea):

./mllib/src/main/scala/org/apache/spark/ml/feature/DCT.scala: require(inputType.isInstanceOf[VectorUDT], s"Input type must be VectorUDT but got $inputType.")

./mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala: require(inputType == StringType, s"Input type must be string type but got $inputType.")

StringIndexer.scala

If you have a chance to double check those that would be great!

fields.foreach { fieldSchema =>
val dataType = fieldSchema.dataType
val fieldName = fieldSchema.name
require(dataType.isInstanceOf[NumericType] ||
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another option would be to use SchemaUtils checkColumnTypes here -- what do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't do that because NumericType is an AbstractDataType and not a DataType so I couldn't use that method.

@mgaido91
Copy link
Contributor Author

Thanks for your kind and nice review @holdenk, I addressed the places I missed. Thank you!

@SparkQA
Copy link

SparkQA commented Jun 29, 2018

Test build #92470 has finished for PR 21321 at commit bd8446c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks worth giving a try since it's an error message and we can revert anytime if we see some other concerns.

One thing I am hesitant is, IIRC @hvanhovell preferred catalogString. I will leave a cc for him.

@HyukjinKwon
Copy link
Member

@mgaido91, mind rebasing this please? Let me just get this in.

@mgaido91
Copy link
Contributor Author

mgaido91 commented Jul 9, 2018

sorry for the late answer @HyukjinKwon. I just rebased it, thanks.

@SparkQA
Copy link

SparkQA commented Jul 9, 2018

Test build #92747 has finished for PR 21321 at commit 781b64d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

Merged to master.

@asfgit asfgit closed this in 1bd3d61 Jul 9, 2018
@gatorsmile
Copy link
Member

I think the fix is wrong. We should not use simpleString but catalogString, because simpleString will do the truncation.

@gatorsmile
Copy link
Member

Let me revert the changes. Please re-submit the fix.

@HyukjinKwon
Copy link
Member

That was the thing I was hesitant of. I was kind of confused recently about this because simpleString was used in some recent PRs even although I know catalogString is preferred. If this fix is "wrong", other PRs using simpleString all should be reverted though.

@mgaido91, mind opening a PR with catalogString again please?

@gatorsmile
Copy link
Member

@HyukjinKwon The whole PR is doing the wrong things. That is why I reverted it. I do not want the others to follow this PR. For the other PRs whose main objective are not to use simpleString to improve the error message, we can submit follow-up PRs to fix them; otherwise, feel free to revert them WDYT?

@HyukjinKwon
Copy link
Member

Ah, it's fine. This one is already reverted and I don't mind. Not a big deal actually. Glad that we are on the same page about catalogString anyway.

@mgaido91
Copy link
Contributor Author

@gatorsmile I used simpleString because in the previous JIRA which attempted to unify error messages simpleString was used, as well as I have always seen that method being used. Anyway, I am fine with catalogString too, but then we should change also the current places which are using simpleString, do we agree on this solution?

@HyukjinKwon
Copy link
Member

For instance, #20064 at SPARK-22893, right? Let's fix them too. Yes, since this one is reverted, we should better stick to catalogString in all such places, which I believe we all agreed upon here now.

I only haven't left some comments about this so far only because my impression was that we were not yet sure between simpleString and catalogString in error messages.

Now it looks at least including me three committer prefers catalogString in error messages.

@gatorsmile
Copy link
Member

catalogString is preferred in error messages.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants