@AngersZhuuuu
Contributor

What changes were proposed in this pull request?

According to https://github.com/apache/spark/pull/25252/files#r738489764, if we use a wildcard pattern, it returns too many rows.

In this PR, common builtin functions are returned only once.
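
To make the row blow-up concrete, here is a minimal, self-contained Scala sketch. It is not the code in `SparkGetFunctionsOperation`. `FunctionRow`, `listPerSchema`, and `listBuiltinsOnce` are hypothetical names introduced for illustration only, and modelling a builtin as a row without a schema is an assumption, not something this PR states.

```scala
// Hypothetical model of one GetFunctions result row; not Spark's actual row type.
final case class FunctionRow(schema: Option[String], name: String)

object GetFunctionsSketch {
  // Behaviour described as problematic: every matching schema repeats every builtin.
  def listPerSchema(schemas: Seq[String], builtins: Seq[String]): Seq[FunctionRow] =
    for {
      schema <- schemas
      fn <- builtins
    } yield FunctionRow(Some(schema), fn)

  // Behaviour this PR aims for (assumed shape): each builtin is emitted exactly once.
  def listBuiltinsOnce(builtins: Seq[String]): Seq[FunctionRow] =
    builtins.distinct.map(fn => FunctionRow(None, fn))

  def main(args: Array[String]): Unit = {
    val schemas = Seq("default", "sales", "staging")
    val builtins = Seq("upper", "lower", "shiftleft")
    println(listPerSchema(schemas, builtins).size) // 9 rows: 3 schemas x 3 builtins
    println(listBuiltinsOnce(builtins).size)       // 3 rows
  }
}
```

The row count in the first variant grows with the number of schemas matched by the pattern, which is why a wildcard schema pattern is the expensive case.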

Why are the changes needed?

Improve performance

Does this PR introduce any user-facing change?

No

How was this patch tested?

WIP

@AngersZhuuuu AngersZhuuuu changed the title from "[SPARK-37173][SQL] SparkGetFunctionOperation return builtin function only once" to "[WIP][SPARK-37173][SQL] SparkGetFunctionOperation return builtin function only once" on Nov 1, 2021
@github-actions github-actions bot added the SQL label Nov 1, 2021

SparkQA commented Nov 1, 2021

Test build #144797 has finished for PR 34453 at commit 118af14.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Nov 1, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49267/

SparkQA commented Nov 1, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49267/

@AngersZhuuuu AngersZhuuuu marked this pull request as draft November 1, 2021 05:16

SparkQA commented Nov 1, 2021

Test build #144798 has finished for PR 34453 at commit 3362038.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Nov 1, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49268/

@AngersZhuuuu
Contributor Author

ping @wangyum @juliuszsompolski

SparkQA commented Nov 1, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49268/

SparkQA commented Nov 10, 2021

Test build #145051 has finished for PR 34453 at commit 3362038.

  • This patch fails from timeout after a configured wait of 500m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@juliuszsompolski
Contributor

In #25252 @dongjoon-hyun suggested putting this behind a legacy feature flag, since it changes behaviour.

@dongjoon-hyun
Member

Thank you, @juliuszsompolski

@AngersZhuuuu
Contributor Author

> In #25252 @dongjoon-hyun suggested putting this behind a legacy feature flag, since it changes behaviour.

Sure, will start work on this.

SparkQA commented Dec 15, 2021

Test build #146224 has finished for PR 34453 at commit 75131c8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class SparkConf(object):
  • class ProbabilisticClassifier(Classifier, _ProbabilisticClassifierParams, metaclass=ABCMeta):
  • class ProbabilisticClassificationModel(
  • class _JavaProbabilisticClassifier(ProbabilisticClassifier, _JavaClassifier, metaclass=ABCMeta):
  • class _JavaProbabilisticClassificationModel(
  • class _LinearSVCParams(
  • class LinearSVCModel(
  • class _LogisticRegressionParams(
  • class LogisticRegression(
  • class LogisticRegressionModel(
  • class BinaryLogisticRegressionSummary(_BinaryClassificationSummary, LogisticRegressionSummary):
  • class BinaryLogisticRegressionTrainingSummary(
  • class DecisionTreeClassifier(
  • class DecisionTreeClassificationModel(
  • class RandomForestClassifier(
  • class RandomForestClassificationModel(
  • class RandomForestClassificationTrainingSummary(
  • class BinaryRandomForestClassificationTrainingSummary(
  • class GBTClassifier(
  • class GBTClassificationModel(
  • class NaiveBayes(
  • class NaiveBayesModel(
  • class _MultilayerPerceptronParams(
  • class MultilayerPerceptronClassifier(
  • class MultilayerPerceptronClassificationModel(
  • class MultilayerPerceptronClassificationTrainingSummary(
  • class FMClassifier(
  • class FMClassificationModel(
  • class _GaussianMixtureParams(
  • class GaussianMixtureModel(
  • class _KMeansParams(
  • class KMeansModel(
  • class _BisectingKMeansParams(
  • class BisectingKMeansModel(
  • class PowerIterationClustering(
  • class BinaryClassificationEvaluator(
  • class RegressionEvaluator(
  • class MulticlassClassificationEvaluator(
  • class MultilabelClassificationEvaluator(
  • class ClusteringEvaluator(
  • class RankingEvaluator(
  • class Binarizer(
  • class BucketedRandomProjectionLSH(
  • class BucketedRandomProjectionLSHModel(
  • class Bucketizer(
  • class ElementwiseProduct(
  • class FeatureHasher(
  • class HashingTF(
  • class _OneHotEncoderParams(
  • class PolynomialExpansion(
  • class QuantileDiscretizer(
  • class _StringIndexerParams(
  • class StopWordsRemover(
  • class VectorAssembler(
  • class VectorSizeHint(
  • class VarianceThresholdSelector(
  • class VarianceThresholdSelectorModel(
  • class UnivariateFeatureSelector(
  • class UnivariateFeatureSelectorModel(
  • class _LinearRegressionParams(
  • class LinearRegressionModel(
  • class IsotonicRegression(
  • class IsotonicRegressionModel(JavaModel, _IsotonicRegressionParams, JavaMLWritable, JavaMLReadable):
  • class DecisionTreeRegressor(
  • class RandomForestRegressor(
  • class _AFTSurvivalRegressionParams(
  • class AFTSurvivalRegression(
  • class AFTSurvivalRegressionModel(
  • class _GeneralizedLinearRegressionParams(
  • class GeneralizedLinearRegression(
  • class GeneralizedLinearRegressionModel(
  • class _FactorizationMachinesParams(
  • class FMRegressionModel(
  • class CrossValidator(
  • class TrainValidationSplit(
  • + \"class name
  • class MultivariateGaussian(NamedTuple):
  • class TimedeltaOps(DataTypeOps):
  • class TimedeltaIndex(Index):
  • class MissingPandasLikeTimedeltaIndex(MissingPandasLikeIndex):
  • class PandasSQLStringFormatter(string.Formatter):
  • class PandasAPIOnSparkAdviceWarning(Warning):
  • class UDFBasicProfiler(BasicProfiler):
  • class CloudPickleSerializer(FramedSerializer):
  • class ArrowStreamUDFSerializer(ArrowStreamSerializer):
  • class SQLStringFormatter(string.Formatter):
  • class DayTimeIntervalType(AtomicType):
  • class DayTimeIntervalTypeConverter(object):
  • class ExecutorPodsPollingSnapshotSource(
  • class ExecutorPodsWatchSnapshotSource(
  • class ExecutorRollPlugin extends SparkPlugin
  • class ExecutorRollDriverPlugin extends DriverPlugin with Logging
  • class AnsiCombinedTypeCoercionRule(rules: Seq[TypeCoercionRule]) extends
  • trait ExpressionBuilder
  • case class RelationTimeTravel(
  • case class AsOfTimestamp(timestamp: Long) extends TimeTravelSpec
  • case class AsOfVersion(version: String) extends TimeTravelSpec
  • class CombinedTypeCoercionRule(rules: Seq[TypeCoercionRule]) extends TypeCoercionRule
  • case class ExpressionStats(expr: Expression)(var useCount: Int)
  • case class PrettyPythonUDF(
  • case class MapContainsKey(
  • case class TryElementAt(left: Expression, right: Expression, child: Expression)
  • case class ConvertTimezone(
  • case class AesEncrypt(
  • case class AesDecrypt(
  • trait PadExpressionBuilderBase extends ExpressionBuilder
  • case class StringLPad(str: Expression, len: Expression, pad: Expression)
  • case class BinaryLPad(str: Expression, len: Expression, pad: Expression, child: Expression)
  • case class BinaryRPad(str: Expression, len: Expression, pad: Expression, child: Expression)
  • case class UnclosedCommentProcessor(
  • case class PythonMapInArrow(
  • case class CreateTable(
  • case class DropIndex(
  • case class TableSpec(
  • public class ColumnIOUtil
  • case class OptimizeSkewedJoin(ensureRequirements: EnsureRequirements)
  • case class ParquetColumn(
  • case class DropIndexExec(
  • case class PushedDownOperators(
  • case class TableSampleInfo(
  • trait MapInBatchExec extends UnaryExecNode
  • case class PythonMapInArrowExec(
  • class RatePerMicroBatchProvider extends SimpleTableProvider with DataSourceRegister
  • class RatePerMicroBatchTable(
  • class RatePerMicroBatchStream(
  • case class RatePerMicroBatchStreamOffset(offset: Long, timestamp: Long) extends Offset
  • case class RatePerMicroBatchStreamInputPartition(
  • class RatePerMicroBatchStreamPartitionReader(
  • // When this is enabled, this class does additional lookup on write operations (put/delete) to

SparkQA commented Dec 15, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50698/

@AngersZhuuuu AngersZhuuuu marked this pull request as ready for review December 15, 2021 10:07
@github-actions github-actions bot added the DOCS label Dec 15, 2021

SparkQA commented Dec 15, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50698/

@AngersZhuuuu AngersZhuuuu changed the title from "[WIP][SPARK-37173][SQL] SparkGetFunctionOperation return builtin function only once" to "[SPARK-37173][SQL] SparkGetFunctionOperation return builtin function only once" on Dec 15, 2021
@AngersZhuuuu
Contributor Author

> legacy feature flag

For the legacy flag, must we use a spark.sql.legacy.xxx key with a default value of false?

SparkQA commented Dec 15, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50703/

SparkQA commented Dec 15, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50703/

SparkQA commented Dec 16, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50742/

SparkQA commented Dec 16, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50742/

Seq("shiftleft", "shiftright", "shiftrightunsigned"))
checkResult(metaData.getFunctions(null, "default", "upPer"), Seq("upper"))

statement.execute(s"SET ${SQLConf.THRIFTSERVER_SEPARATE_DISPLAY_SYSTEM_FUNCTION.key}=true")
Contributor

Can we add a test with two schemas and run an unfiltered getFunctions call to show that previously we'd see duplicates, whereas now the functions are unique?
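
One way such a check could be expressed is sketched below. It relies only on the standard JDBC `FUNCTION_NAME` column of a `DatabaseMetaData.getFunctions` result set; `DuplicateFunctionCheck`, `functionNames`, and `duplicates` are hypothetical helper names and are not part of this PR's test suite.

```scala
import java.sql.ResultSet
import scala.collection.mutable.ArrayBuffer

object DuplicateFunctionCheck {
  // Drains a getFunctions result set into its FUNCTION_NAME values, then closes it.
  def functionNames(rs: ResultSet): Seq[String] = {
    val names = ArrayBuffer.empty[String]
    try {
      while (rs.next()) names += rs.getString("FUNCTION_NAME")
    } finally rs.close()
    names.toSeq
  }

  // Returns every name that occurs more than once, with its number of occurrences.
  def duplicates(names: Seq[String]): Map[String, Int] =
    names.groupBy(identity).collect {
      case (name, hits) if hits.size > 1 => name -> hits.size
    }
}
```

With a `java.sql.DatabaseMetaData` named `metaData`, as in the test above, an unfiltered check might read `duplicates(functionNames(metaData.getFunctions(null, "%", "%")))`, assuming the standard JDBC `%` wildcard is accepted. With two schemas present, the expectation would be a non-empty map under the old behaviour and an empty map with the new behaviour enabled.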

Contributor Author

Done

Contributor

Great, thanks!

@juliuszsompolski
Contributor

It's up to our decision, but it's safe to keep the existing behavior first and to switch it at next version because it gives the users a chance to try this new behavior.

My 2 cents: working with various partners and vendors of BI tools, we found GetFunctions to be rarely used at all, and when we did find it used, it was in the context of the current behaviour causing trouble (running too slowly, or causing UI freezes while trying to render the humongous list of duplicated functions). I am not aware of any tool depending on the current behaviour.
Note that before GetTables / GetSchemas / GetColumns were implemented in Spark 3.0, these calls all returned wrong results or errors with Spark, and this was not escalated for a long time. Only recently have BI vendors taken a serious interest in developing connectors that take advantage of these functions, and this has been reported as unexpected and unwanted behaviour.

@AngersZhuuuu
Contributor Author

> It's up to our decision, but it's safe to keep the existing behavior first and to switch it at next version because it gives the users a chance to try this new behavior.
>
> My 2 cents: working with various partners and vendors of BI tools, we found GetFunctions to be rarely used at all, and when we did find it used, it was in the context of the current behaviour causing trouble (running too slowly, or causing UI freezes while trying to render the humongous list of duplicated functions). I am not aware of any tool depending on the current behaviour. Note that before GetTables / GetSchemas / GetColumns were implemented in Spark 3.0, these calls all returned wrong results or errors with Spark, and this was not escalated for a long time. Only recently have BI vendors taken a serious interest in developing connectors that take advantage of these functions, and this has been reported as unexpected and unwanted behaviour.

We used HUE for ad hoc queries before. To support HUE on Spark's Thrift server, we also had to change a lot.

SparkQA commented Dec 16, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50755/

Contributor

@bogdanghit bogdanghit left a comment

LGTM, thanks @AngersZhuuuu

SparkQA commented Dec 16, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50758/

SparkQA commented Dec 16, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50755/

SparkQA commented Dec 16, 2021

Test build #146267 has finished for PR 34453 at commit 479d533.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Dec 16, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50758/

SparkQA commented Dec 16, 2021

Test build #146281 has finished for PR 34453 at commit 366eed7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Dec 16, 2021

Test build #146284 has finished for PR 34453 at commit 392c5ea.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AngersZhuuuu
Contributor Author

Gentle ping @dongjoon-hyun, WDYT of the current code?

@AngersZhuuuu
Contributor Author

Any more suggestions? Also cc @wangyum

@AngersZhuuuu
Contributor Author

Any more suggestions?

@bogdanghit
Contributor

> Any more suggestions?

It looks good to me. @dongjoon-hyun WDYT?

@bogdanghit
Contributor

Gentle nudge @AngersZhuuuu @dongjoon-hyun @wangyum.

@bogdanghit
Contributor

@dongjoon-hyun what should still be done to push this through?

@srowen
Member

srowen commented Feb 15, 2022

Would the existing behavior ever be desirable? It sounds more like a bug.

@bogdanghit
Contributor

@srowen that was my initial thought as well, but there are concerns it may be a breaking change because of the different result format

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label May 29, 2022
@github-actions github-actions bot closed this May 30, 2022
6 participants