Skip to content

Conversation

@gorros
Copy link
Collaborator

@gorros gorros commented Mar 20, 2019

#67

@MrPowers
Copy link
Collaborator

@gorros - Thanks for the PR!

I think getOrElse(null) will work better. Here are some example code snippets:

// how forall works by default
Seq("hi", "hello").forall(_.startsWith("h")) // true
Seq("hi", "hello", null).forall(_.startsWith("h")) // NullPointerException

// returns false correctly
Seq("hi", "hello", "bye").forall(_.startsWith("h"))

// incorrectly returns false
scala.util.Try(Seq("hi", "hello", "bye", null).forall(_.startsWith("h"))).toOption.getOrElse(false)

So I think we either need to throw the NullPointerException or return null. It seems like Spark always prefers to return null instead of throwing a NullPointerException. Let me know what you think!

Also, if you agree, can you please update the test? Thanks again so much for the help!!! You're motivating me to keep working on this project and make it better.

@gorros
Copy link
Collaborator Author

gorros commented Mar 20, 2019

@MrPowers Well, it is not straight forward. But on the other side forall must return Boolean. Also

scala> val a: Boolean = null.asInstanceOf[Boolean]
a: Boolean = false

That is why
scala.util.Try(Seq("hi", "hello", "bye", null).forall(_.startsWith("h"))).toOption.getOrElse(false)
returns false.

@MrPowers
Copy link
Collaborator

@gorros - good point. Here's a related question I asked in StackOverflow that helped me answer this question.

Do you have any thoughts on how to rework the forall function? I know UDFs can't return Option values....

If you don't have any ideas, I can just ask another question on Stackoverflow. Let me know. Thanks!

@gorros
Copy link
Collaborator Author

gorros commented Mar 21, 2019

@MrPowers, so by reworking do you mean to make forall return null if column is null?

@gorros
Copy link
Collaborator Author

gorros commented Mar 21, 2019

@MrPowers there is conflict with return type when I try to return null. Which is logical.

@MrPowers
Copy link
Collaborator

MrPowers commented Mar 21, 2019

@gorros - Here's how I think the function should behave (input on the left // => desired output on the right). Let me know what you think!

null // => null
Seq("hi", "hello").forall(_.startsWith("h")) // => true
Seq("hi", "hello", null).forall(_.startsWith("h")) // => true
Seq("hi", "hello", "bye").forall(_.startsWith("h")) // => false
Seq("hi", "hello", "bye", null).forall(_.startsWith("h")) // => false
Seq().forall(_.startsWith("h")) // => null
Seq(null).forall(_.startsWith("h")) // => null

So if the input is null, we return null. If the Array contains null values, we basically just ignore the null values in making the determination. Then a couple of other edge cases.

This ended up being a lot more complicated than I thought when I initially opened the issue - sorry about that!

I really like how Spark never throws NullPointerExceptions. I need to go through all the spark-daria functions again and make sure that none of them ever throw NullPointerExceptions - it's something we need to always watch out for, especially when we use UDFs!

@gorros
Copy link
Collaborator Author

gorros commented Mar 24, 2019

@MrPowers I think there is already question regarding UDFs returning null.

@gorros
Copy link
Collaborator Author

gorros commented Mar 24, 2019

@MrPowers But on the other side previously I wrote simple UDFs that return null

@gorros
Copy link
Collaborator Author

gorros commented Mar 24, 2019

@MrPowers I think main issue is tah our return type us Boolean which is not nullable.

@nvander1
Copy link
Collaborator

Related to pulling in support for higher order functions: #80 (comment)

If we do what I propose in that comment, then we can implement forall with the HOF exists and demorgan's law.

forall(array, predicate)

is equivalent to writing
NOT exists(array, NOT(predicate))

The signature of exists however is this:

def exists(array: Column, predicate: Column => Column): Column = ???
// the predicate must returna boolean valued column

So naively it doesn't work with UDFs. However, we can define an implicit conversion from UDFs to functions that operate on columns like so:

implicit def udf2columnar(f: UserDefinedFunction): Column => Column = x => f(x)

I've used the above solutions in the past, and I think we can take a similar approach like so:

implicit def scalaPredicate2udf[T](f: T => Boolean)(implicit e: TypeTag[Boolean], e2: TypeTag[T]): Column => Column = x => udf[Boolean, T](f)(e, e2)(x)

We'll need to test this however to make sure it works in practice

@nvander1 nvander1 closed this May 31, 2019
@nvander1 nvander1 deleted the forall_npe branch May 31, 2019 22:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants