Conversation

@alberskib

Invocation of getters for types extending AnyVal returns the default value (if the field value is null) instead of throwing an NPE. Please check the comments on the SPARK-11553 issue for more details.
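
A minimal sketch of the reported symptom (the row and its single null field are hypothetical; this shows the behavior before the fix, not a guarantee for every Spark version):

import org.apache.spark.sql.Row

// A Row whose only field is null
val row = Row.fromSeq(Seq(null))

row.isNullAt(0) // true
row.getInt(0)   // returns 0 (the Int default) instead of throwing a NullPointerException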

@rxin
Contributor

rxin commented Nov 12, 2015

This change breaks binary compatibility. I think we should just update the documentation to say that callers should use isNullAt to check for null primitive values.
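
For reference, the defensive pattern the documentation would point to - check isNullAt before touching a primitive getter (readInt below is a hypothetical helper, not part of the API):

import org.apache.spark.sql.Row

// Check for null before calling the primitive getter
def readInt(row: Row, i: Int): Option[Int] =
  if (row.isNullAt(i)) None else Some(row.getInt(i))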

@alberskib
Author

We could also change only the implementation of the methods for primitive types (getInt, getDouble, etc.) to do a null check internally, leave the code of getAs unchanged, and update the documentation as you said.
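
A rough sketch of what such a guarded primitive getter could look like inside the Row trait (getAnyValAs is a hypothetical helper name, not necessarily the shape of a final patch):

// Hypothetical guard shared by the primitive getters: throw instead of
// silently unboxing null to the type's default value
private def getAnyValAs[T <: AnyVal](i: Int): T =
  if (isNullAt(i)) throw new NullPointerException(s"Value at index $i is null")
  else getAs[T](i)

def getInt(i: Int): Int = getAnyValAs[Int](i)
def getDouble(i: Int): Double = getAnyValAs[Double](i)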

@marmbrus
Contributor

ok to test

@marmbrus
Contributor

Please change the PR title to something more descriptive: [SPARK-11553] [SQL] Primitive Row accessors should not convert null to default value

@SparkQA

SparkQA commented Nov 12, 2015

Test build #45772 has finished for PR 9642 at commit f498f4a.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

alberskib changed the title from "SPARK-11553 Fix issue related to Row from DataFrame API" to "[SPARK-11553] [SQL] Primitive Row accessors should not convert null to default value" on Nov 12, 2015
@SparkQA

SparkQA commented Nov 12, 2015

Test build #45777 has finished for PR 9642 at commit eef5778.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 13, 2015

Test build #45852 has finished for PR 9642 at commit 858b9a1.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 13, 2015

Test build #45853 has finished for PR 9642 at commit be382bc.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@alberskib
Author

The only failing test is 'Sessions of SQLContext' in org.apache.spark.sql.SQLContextSuite. Unfortunately I am not able to reproduce the problem on my machine - it passes locally. Moreover, I do not think it is related to my change.

@SparkQA

SparkQA commented Nov 13, 2015

Test build #45894 has finished for PR 9642 at commit 181b075.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@alberskib
Author

After rebasing onto the current master, all tests pass.

@hvanhovell
Contributor

UnsafeRow and SpecificMutableRow have similar problems. Shouldn't we fix those as well? For example:

import org.apache.spark.sql.types.IntegerType
import org.apache.spark.sql.catalyst.expressions.SpecificMutableRow
import org.apache.spark.sql.catalyst.expressions.UnsafeRow

val srow = new SpecificMutableRow(IntegerType :: Nil)
srow.isNullAt(0)
srow.getInt(0)

val urow = new UnsafeRow()
urow.pointTo(new Array[Byte](16), 1, 16)
urow.setNullAt(0)
urow.isNullAt(0)
urow.getInt(0)

// Result:
import org.apache.spark.sql.types.IntegerType
import org.apache.spark.sql.catalyst.expressions.SpecificMutableRow
import org.apache.spark.sql.catalyst.expressions.UnsafeRow
srow: org.apache.spark.sql.catalyst.expressions.SpecificMutableRow = [null]
res129: Boolean = true
res130: Int = 0
urow: org.apache.spark.sql.catalyst.expressions.UnsafeRow = []
res133: Boolean = true
res134: Int = 0

I'd actually rather not touch this at all. When you are using an internal API you should be more careful and expect some quirkiness.

I can currently think of only one place in which this causes some problems: UDFs with primitive parameters. The engine will pass in default values instead of nulls. Are there any other situations in which this causes problems?
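
A hypothetical illustration of that UDF case - with a primitive Int parameter the engine cannot hand the function a null, so a null input cell arrives as the default 0 and the result is silently wrong:

import org.apache.spark.sql.functions.{col, udf, when}

// Primitive parameter: a null value in the input column shows up here as 0, not null
val plusOne = udf((x: Int) => x + 1)

// One common workaround is to guard the call so the UDF never sees a null input:
// df.select(when(col("x").isNotNull, plusOne(col("x"))))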

@alberskib
Author

Hey @hvanhovell! Thanks for the comment. I agree with you that if we want to introduce this change we need to take care of UnsafeRow and SpecificMutableRow as well.
I have a couple of questions for you:

I'd actually rather not touch this at all ...

You mean discard the change and only update the documentation? Yes, that is one option - I even think it is not so bad.
On the other hand it could be misleading: many users will not notice the note in the documentation, which leads them to make the same errors. I think there is already a lot of code silently computing wrong results because of this bug.

I can currently think of only one place in which this causes some problems

You mean problems caused by not introducing this change?

I think we should analyse the pros and cons and decide how to proceed.

@hvanhovell
Contributor

You mean discard the change and only update the documentation? Yes, that is one option - I even think it is not so bad.

I would only update the documentation. Internal mutable rows are among the most performance-critical classes in Spark SQL, so I am not that keen to add (potentially unnecessary) branching to every primitive getter. Anyone using InternalRow is using a Spark SQL internal API anyway, and really should know what they are doing.

You mean problems caused by not introducing this change?

Yes, do you know any other problems caused by this?

@alberskib
Author

Yes, do you know any other problems caused by this?

So far I have not found any other potential problems.

@alberskib
Author

There is no Scaladoc for the primitive getters in UnsafeRow and SpecificMutableRow (or their superclasses) - everything there remains as is.

@hvanhovell
Contributor

So I have been making a lot of fuss about internal classes, which you are not touching. Sorry about that.

This change is much more benign, but I still wonder whether you need to start throwing NPEs. I'd prefer to update the documentation.

@alberskib
Author

No problem.
I am not sure which option is better. At first I thought an exception being thrown would be easier to notice than a change in the documentation, but now it looks like changing only the documentation would be more consistent.
It would be good if more people expressed an opinion.

@marmbrus
Contributor

The original behavior of these classes was to throw if you tried to retrieve a null primitive value, and I think that's a lot less confusing. Unfortunately it seems that the Tungsten refactoring changed this behavior, but since it was the original behavior I'm in favor of returning to it.

@marmbrus
Contributor

Regarding the internal classes, my preferred option would be to add assertions that we can elide for performance in production.
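
For illustration: Scala's built-in assert is marked @elidable, so a guard along these lines (CheckedIntRow is a hypothetical sketch, not the actual InternalRow code) costs nothing in builds compiled with -Xdisable-assertions or -Xelide-below:

// Hypothetical sketch of an assertion-guarded getter; the assert call is
// removed entirely when scalac is run with -Xdisable-assertions
class CheckedIntRow(values: Array[Any]) {
  def isNullAt(ordinal: Int): Boolean = values(ordinal) == null

  def getInt(ordinal: Int): Int = {
    assert(!isNullAt(ordinal), s"Field $ordinal is null; check isNullAt before reading a primitive")
    values(ordinal).asInstanceOf[Int]
  }
}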

@marmbrus
Contributor

Thanks, I'm going to merge this to master and 1.6.

asfgit closed this in 3129662 on Nov 16, 2015
asfgit pushed a commit that referenced this pull request Nov 16, 2015
… default value

Invocation of getters for types extending AnyVal returns the default value (if the field value is null) instead of throwing an NPE. Please check the comments on the SPARK-11553 issue for more details.

Author: Bartlomiej Alberski <bartlomiej.alberski@allegrogroup.com>

Closes #9642 from alberskib/bugfix/SPARK-11553.

(cherry picked from commit 3129662)
Signed-off-by: Michael Armbrust <michael@databricks.com>