[SPARK-16281][SQL] Implement parse_url SQL function #14008
Changes from all commits
def5982
a2ab582
08c20e0
1651b5d
a02ad9b
65c2a6a
441cea2
1319b37
859d143
34a10d4
f75e3ae
c2c4513
c072104
4299ac5
fd70c6a
40e1cd4
7b0bca4
013ff46
95114ef
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -17,8 +17,10 @@ | |
|
|
||
| package org.apache.spark.sql.catalyst.expressions | ||
|
|
||
| import java.net.{MalformedURLException, URL} | ||
| import java.text.{BreakIterator, DecimalFormat, DecimalFormatSymbols} | ||
| import java.util.{HashMap, Locale, Map => JMap} | ||
|
Member
Also, imports need to be sorted. |
||
| import java.util.regex.Pattern | ||
|
|
||
| import scala.collection.mutable.ArrayBuffer | ||
|
|
||
|
|
@@ -654,6 +656,154 @@ case class StringRPad(str: Expression, len: Expression, pad: Expression) | |
| override def prettyName: String = "rpad" | ||
| } | ||
|
|
||
| object ParseUrl { | ||
| private val HOST = UTF8String.fromString("HOST") | ||
| private val PATH = UTF8String.fromString("PATH") | ||
| private val QUERY = UTF8String.fromString("QUERY") | ||
| private val REF = UTF8String.fromString("REF") | ||
| private val PROTOCOL = UTF8String.fromString("PROTOCOL") | ||
| private val FILE = UTF8String.fromString("FILE") | ||
| private val AUTHORITY = UTF8String.fromString("AUTHORITY") | ||
| private val USERINFO = UTF8String.fromString("USERINFO") | ||
| private val REGEXPREFIX = "(&|^)" | ||
| private val REGEXSUBFIX = "=([^&]*)" | ||
| } | ||
|
|
||
| /** | ||
| * Extracts a part from a URL | ||
| */ | ||
| @ExpressionDescription( | ||
| usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL", | ||
| extended = """Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, USERINFO. | ||
| Key specifies which query to extract. | ||
| Examples: | ||
| > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'HOST') | ||
| 'spark.apache.org' | ||
| > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY') | ||
| 'query=1' | ||
| > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 'query') | ||
| '1'""") | ||
|
Contributor
We should probably use """...
|...
""".stripMargin. Otherwise all leading whitespace is included in the extended description string.
Contributor
Author
OK, I'll fix this, thank you.
Contributor
Author
Complication with the error |
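To illustrate the whitespace concern raised in this thread, here is a minimal standalone sketch. The object and member names are hypothetical, and the text is shortened from the description in the diff. It compares a plain triple-quoted string with one that uses `|` margins and `stripMargin`.

```scala
// Standalone sketch; `StripMarginDemo` and its members are illustrative names only.
object StripMarginDemo {
  // Without stripMargin, the indentation of the second line stays in the string.
  val raw: String =
    """Parts: HOST, PATH, QUERY
      Key specifies which query to extract."""

  // With stripMargin, everything up to and including the leading `|` is dropped.
  val stripped: String =
    """Parts: HOST, PATH, QUERY
      |Key specifies which query to extract.""".stripMargin

  // Returns the second line of a multi-line string, for easy comparison.
  def secondLine(s: String): String = s.split("\\r?\\n")(1)
}
```

Here `secondLine(raw)` still begins with the source indentation, while `secondLine(stripped)` begins directly with `Key`.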
||
| case class ParseUrl(children: Seq[Expression]) | ||
|
Contributor
Again, we should not use Seq[Expression] here. We should just have a 3-arg ctor, and then add a 2-arg ctor.
Contributor
Then we should think of a good default value for the 3rd argument. We should avoid using
Contributor
Author
As I explained before, I can hardly find a magic
Any suggestion on this?
Contributor
Well, I don't have a strong preference here,
Contributor
What if we use # as the default value and check on that? It is not a valid URL key, is it?
Contributor
Anyway, I don't have a super strong preference here either. It might be clearer not to use a hacky # value.
Contributor
Author
Yes, # is not a valid URL key. And I agree with you on not using a hacky value. |
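The constructor shape suggested in this thread can be sketched outside of Spark as follows. This is only an illustration under assumptions: `UrlPartRequest` is a hypothetical stand-in that holds plain strings rather than Catalyst `Expression`s, and it models the missing key as an `Option` instead of a sentinel such as `#`, which is one possible way to avoid a magic default.

```scala
// Hypothetical stand-in for the expression's arguments; this is not Spark's API.
final case class UrlPartRequest(url: String, partToExtract: String, key: Option[String]) {
  // Auxiliary 2-arg constructor: no key supplied.
  def this(url: String, partToExtract: String) = this(url, partToExtract, None)
}

object UrlPartRequest {
  // Companion overload so the 2-arg form also works without `new`.
  def apply(url: String, partToExtract: String): UrlPartRequest =
    new UrlPartRequest(url, partToExtract)
}
```

With this shape the arity is fixed by the constructors, so a call with one or four arguments does not compile and no runtime argument-count check is needed in the sketch.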
||
| extends Expression with ExpectsInputTypes with CodegenFallback { | ||
|
|
||
| override def nullable: Boolean = true | ||
| override def inputTypes: Seq[DataType] = Seq.fill(children.size)(StringType) | ||
| override def dataType: DataType = StringType | ||
| override def prettyName: String = "parse_url" | ||
|
|
||
| // If the url is a constant, cache the URL object so that we don't need to convert url | ||
| // from UTF8String to String to URL for every row. | ||
| @transient private lazy val cachedUrl = children(0) match { | ||
| case Literal(url: UTF8String, _) if url ne null => getUrl(url) | ||
| case _ => null | ||
| } | ||
|
|
||
| // If the key is a constant, cache the Pattern object so that we don't need to convert key | ||
| // from UTF8String to String to StringBuilder to String to Pattern for every row. | ||
| @transient private lazy val cachedPattern = children(2) match { | ||
| case Literal(key: UTF8String, _) if key ne null => getPattern(key) | ||
| case _ => null | ||
| } | ||
|
|
||
| // If the partToExtract is a constant, cache the Extract part function so that we don't need | ||
| // to check the partToExtract for every row. | ||
| @transient private lazy val cachedExtractPartFunc = children(1) match { | ||
| case Literal(part: UTF8String, _) => getExtractPartFunc(part) | ||
| case _ => null | ||
| } | ||
|
|
||
| import ParseUrl._ | ||
|
|
||
| override def checkInputDataTypes(): TypeCheckResult = { | ||
| if (children.size > 3 || children.size < 2) { | ||
| TypeCheckResult.TypeCheckFailure(s"$prettyName function requires two or three arguments") | ||
| } else { | ||
| super[ExpectsInputTypes].checkInputDataTypes() | ||
| } | ||
| } | ||
|
|
||
| private def getPattern(key: UTF8String): Pattern = { | ||
| Pattern.compile(REGEXPREFIX + key.toString + REGEXSUBFIX) | ||
| } | ||
|
|
||
| private def getUrl(url: UTF8String): URL = { | ||
| try { | ||
| new URL(url.toString) | ||
| } catch { | ||
| case e: MalformedURLException => null | ||
| } | ||
| } | ||
|
|
||
| private def getExtractPartFunc(partToExtract: UTF8String): URL => String = { | ||
| partToExtract match { | ||
| case HOST => _.getHost | ||
| case PATH => _.getPath | ||
| case QUERY => _.getQuery | ||
| case REF => _.getRef | ||
| case PROTOCOL => _.getProtocol | ||
| case FILE => _.getFile | ||
| case AUTHORITY => _.getAuthority | ||
| case USERINFO => _.getUserInfo | ||
| case _ => (url: URL) => null | ||
| } | ||
| } | ||
|
|
||
| private def extractValueFromQuery(query: UTF8String, pattern: Pattern): UTF8String = { | ||
| val m = pattern.matcher(query.toString) | ||
| if (m.find()) { | ||
| UTF8String.fromString(m.group(2)) | ||
| } else { | ||
| null | ||
| } | ||
| } | ||
|
|
||
| private def extractFromUrl(url: URL, partToExtract: UTF8String): UTF8String = { | ||
| if (cachedExtractPartFunc ne null) { | ||
| UTF8String.fromString(cachedExtractPartFunc.apply(url)) | ||
| } else { | ||
| UTF8String.fromString(getExtractPartFunc(partToExtract).apply(url)) | ||
| } | ||
| } | ||
|
|
||
| private def parseUrlWithoutKey(url: UTF8String, partToExtract: UTF8String): UTF8String = { | ||
| if (cachedUrl ne null) { | ||
| extractFromUrl(cachedUrl, partToExtract) | ||
| } else { | ||
| val currentUrl = getUrl(url) | ||
| if (currentUrl ne null) { | ||
| extractFromUrl(currentUrl, partToExtract) | ||
| } else { | ||
| null | ||
| } | ||
| } | ||
| } | ||
|
|
||
| override def eval(input: InternalRow): Any = { | ||
|
Contributor
This is somewhat convoluted with 4 levels of nesting. I think you can rewrite it this way to make it easier to follow
Contributor
Author
Makes sense, I'll fix this. |
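One way to flatten this kind of nested null checking is to chain `Option`s. The following standalone sketch is not the rewrite proposed in the review and does not use Catalyst types. `ParseUrlSketch` and `parseUrl` are hypothetical names, and plain `String`s stand in for `UTF8String`. It mirrors the behavior exercised by the tests in this PR: a null input, a malformed URL, an unknown part, or a missing key yields no result.

```scala
import java.net.{MalformedURLException, URL}
import java.util.regex.Pattern

// Standalone sketch with hypothetical names; not the code under review.
object ParseUrlSketch {
  // A malformed URL becomes None instead of an exception.
  private def toUrl(s: String): Option[URL] =
    try Some(new URL(s)) catch { case _: MalformedURLException => None }

  // Option(...) turns the nulls returned by java.net.URL getters into None.
  private def extractPart(url: URL, part: String): Option[String] = part match {
    case "HOST"      => Option(url.getHost)
    case "PATH"      => Option(url.getPath)
    case "QUERY"     => Option(url.getQuery)
    case "REF"       => Option(url.getRef)
    case "PROTOCOL"  => Option(url.getProtocol)
    case "FILE"      => Option(url.getFile)
    case "AUTHORITY" => Option(url.getAuthority)
    case "USERINFO"  => Option(url.getUserInfo)
    case _           => None
  }

  // 2-arg form: every step that can fail is one line of the for-comprehension.
  def parseUrl(url: String, part: String): Option[String] =
    for {
      u      <- Option(url)
      p      <- Option(part)
      parsed <- toUrl(u)
      value  <- extractPart(parsed, p)
    } yield value

  // 3-arg form: only meaningful for QUERY; the key is used as a raw regex fragment,
  // as in the diff, so an invalid regex key still throws PatternSyntaxException.
  def parseUrl(url: String, part: String, key: String): Option[String] =
    for {
      k     <- Option(key)
      if part == "QUERY"
      query <- parseUrl(url, part)
      m = Pattern.compile("(&|^)" + k + "=([^&]*)").matcher(query)
      if m.find()
    } yield m.group(2)
}
```

The design choice here is that each early `return null` in the nested version becomes a generator or guard, so the control flow reads top to bottom without extra indentation.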
||
| val evaluated = children.map{e => e.eval(input).asInstanceOf[UTF8String]} | ||
| if (evaluated.contains(null)) return null | ||
| if (evaluated.size == 2) { | ||
| parseUrlWithoutKey(evaluated(0), evaluated(1)) | ||
| } else { | ||
| // 3-arg, i.e. QUERY with key | ||
| assert(evaluated.size == 3) | ||
| if (evaluated(1) != QUERY) { | ||
| return null | ||
| } | ||
|
|
||
| val query = parseUrlWithoutKey(evaluated(0), evaluated(1)) | ||
| if (query eq null) { | ||
| return null | ||
| } | ||
|
|
||
| if (cachedPattern ne null) { | ||
| extractValueFromQuery(query, cachedPattern) | ||
| } else { | ||
| extractValueFromQuery(query, getPattern(evaluated(2))) | ||
| } | ||
| } | ||
| } | ||
| } | ||
|
|
||
| /** | ||
| * Returns the input formatted according to printf-style format strings | ||
| */ | ||
|
|
||
| Original file line number | Diff line number | Diff line change | ||
|---|---|---|---|---|
|
|
@@ -726,6 +726,57 @@ class StringExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper { | |||
| checkEvaluation(FindInSet(Literal("ab,"), Literal("abc,b,ab,c,def")), 0) | ||||
| } | ||||
|
|
||||
| test("ParseUrl") { | ||||
| def checkParseUrl(expected: String, urlStr: String, partToExtract: String): Unit = { | ||||
| checkEvaluation( | ||||
| ParseUrl(Seq(Literal(urlStr), Literal(partToExtract))), expected) | ||||
| } | ||||
| def checkParseUrlWithKey( | ||||
| expected: String, | ||||
| urlStr: String, | ||||
| partToExtract: String, | ||||
| key: String): Unit = { | ||||
| checkEvaluation( | ||||
| ParseUrl(Seq(Literal(urlStr), Literal(partToExtract), Literal(key))), expected) | ||||
| } | ||||
|
|
||||
| checkParseUrl("spark.apache.org", "http://spark.apache.org/path?query=1", "HOST") | ||||
| checkParseUrl("/path", "http://spark.apache.org/path?query=1", "PATH") | ||||
| checkParseUrl("query=1", "http://spark.apache.org/path?query=1", "QUERY") | ||||
| checkParseUrl("Ref", "http://spark.apache.org/path?query=1#Ref", "REF") | ||||
| checkParseUrl("http", "http://spark.apache.org/path?query=1", "PROTOCOL") | ||||
| checkParseUrl("/path?query=1", "http://spark.apache.org/path?query=1", "FILE") | ||||
| checkParseUrl("spark.apache.org:8080", "http://spark.apache.org:8080/path?query=1", "AUTHORITY") | ||||
| checkParseUrl("userinfo", "http://userinfo@spark.apache.org/path?query=1", "USERINFO") | ||||
|
Contributor
What will happen if there is no userinfo in the URL?
Contributor
Author
Then the result is |
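The question about a missing userinfo can be checked directly against `java.net.URL`, which the expression in this diff delegates to. The sketch below is a standalone illustration; `UserInfoDemo` is a hypothetical name, and wrapping the result in `Option` is an assumption made here for readability rather than something the PR does.

```scala
import java.net.URL

// Standalone sketch; `UserInfoDemo` is an illustrative name only.
object UserInfoDemo {
  // URL.getUserInfo returns null when the URL has no userinfo component;
  // Option(...) maps that null to None.
  def userInfo(url: String): Option[String] = Option(new URL(url).getUserInfo)
}
```

This matches the two `USERINFO` test cases in the suite: one URL with `userinfo@` in the authority and one without.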
||||
| checkParseUrlWithKey("1", "http://spark.apache.org/path?query=1", "QUERY", "query") | ||||
|
|
||||
| // Null checking | ||||
| checkParseUrl(null, null, "HOST") | ||||
| checkParseUrl(null, "http://spark.apache.org/path?query=1", null) | ||||
| checkParseUrl(null, null, null) | ||||
| checkParseUrl(null, "test", "HOST") | ||||
| checkParseUrl(null, "http://spark.apache.org/path?query=1", "NO") | ||||
| checkParseUrl(null, "http://spark.apache.org/path?query=1", "USERINFO") | ||||
| checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1", "HOST", "query") | ||||
| checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1", "QUERY", "quer") | ||||
| checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1", "QUERY", null) | ||||
|
Member
Could you add exceptional cases by using the following statement?
Contributor
Author
I am not sure. Is there any exceptional case?
Contributor
e.g. invalid url, invalid
Contributor
Author
Oh sorry, I missed the point.
Contributor
Author
As invalid url and invalid part just get
Member
Invalid
Member
I'm not sure how to handle this kind of malformed query. Hive produces a reasonable message.
Contributor
Author
Yes, you are right. Invalid |
||||
| checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1", "QUERY", "") | ||||
|
|
||||
| // exceptional cases | ||||
| intercept[java.util.regex.PatternSyntaxException] { | ||||
|
Member
Hi, @janplus.
Member
In case of Hive, it's also
Member
In other words, Spark with this PR goes on to execution with that problematic parameter, while Hive does not.
Contributor
Author
I'll investigate this.
Member
Thank you, @janplus.
Contributor
Author
Hi, @dongjoon-hyun
Given that, this optimization does not seem that valuable.
Member
Thank you for the nice investigation. Yes, Hive's validation seems too limited.
Contributor
Author
Yes, I can definitely do that. In fact, I have finished it. spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala Line 64 in d1e8108
Which means we can only implement this validation in
But obviously this should not be a data type mismatch. This message may confuse the users. Also the different message for Literal
Contributor
I think it's fine to throw the exception on the executor side; there is no need to handle literals specially here.
Contributor
Author
OK, @cloud-fan |
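The exception discussed in this thread comes from splicing the key into a regular expression unescaped. The standalone sketch below reproduces that with the same prefix and suffix as the diff. `QueryKeyPatternDemo` and its method names are hypothetical, and the `Pattern.quote` variant is shown only as a possible alternative for comparison; it is not what this PR does, and it would change behavior for callers that rely on regex keys.

```scala
import java.util.regex.{Pattern, PatternSyntaxException}

// Standalone sketch; all names here are illustrative only.
object QueryKeyPatternDemo {
  private val Prefix = "(&|^)"
  private val Suffix = "=([^&]*)"

  // As in the diff: the key is concatenated into the regex unescaped.
  def rawPattern(key: String): Pattern = Pattern.compile(Prefix + key + Suffix)

  // Hypothetical alternative: treat the key as a literal string.
  def quotedPattern(key: String): Pattern =
    Pattern.compile(Prefix + Pattern.quote(key) + Suffix)

  // Group 1 is the prefix ("&" or start of input); group 2 is the value.
  def extract(query: String, pattern: Pattern): Option[String] = {
    val m = pattern.matcher(query)
    if (m.find()) Some(m.group(2)) else None
  }

  // True if the unescaped key yields a valid regex.
  def compiles(key: String): Boolean =
    try { rawPattern(key); true } catch { case _: PatternSyntaxException => false }
}
```

With the raw form, a key such as `???` fails at `Pattern.compile`, which is the case the test in this diff intercepts; with the quoted form the same key is matched literally.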
||||
| evaluate(ParseUrl(Seq(Literal("http://spark.apache.org/path?"), | ||||
| Literal("QUERY"), Literal("???")))) | ||||
| } | ||||
|
|
||||
| // arguments checking | ||||
| assert(ParseUrl(Seq(Literal("1"))).checkInputDataTypes().isFailure) | ||||
| assert(ParseUrl(Seq(Literal("1"), Literal("2"), Literal("3"), Literal("4"))) | ||||
|
Contributor
Also add some cases with invalid-type parameters?
Contributor
Author
As I declare ParseUrl with
Contributor
Ah right, no need to bother here |
||||
| .checkInputDataTypes().isFailure) | ||||
| assert(ParseUrl(Seq(Literal("1"), Literal(2))).checkInputDataTypes().isFailure) | ||||
| assert(ParseUrl(Seq(Literal(1), Literal("2"))).checkInputDataTypes().isFailure) | ||||
| assert(ParseUrl(Seq(Literal("1"), Literal("2"), Literal(3))).checkInputDataTypes().isFailure) | ||||
| } | ||||
|
|
||||
| test("Sentences") { | ||||
| val nullString = Literal.create(null, StringType) | ||||
| checkEvaluation(Sentences(nullString, nullString, nullString), null) | ||||
|
|
||||
This should go before printf.
OK, thank you for the review. I'll fix this.