
Conversation

@sahilkumarsingh (Author) opened this pull request:

What changes were proposed in this pull request?

This PR addresses SPARK-54634.

It adds a user-friendly error message when users write SQL queries with an empty IN clause, for example: SELECT * FROM table WHERE col IN ()

Why are the changes needed?

When users write SQL with an empty IN clause, Spark currently produces a generic [PARSE_SYNTAX_ERROR], which leads users to believe their syntax is malformed, when the actual issue is the absence of values in the IN list. The current error message therefore does not point users at the real problem.

This change provides a clear, actionable error message that explains the actual problem and suggests alternatives.

Example - Before:

org.apache.spark.sql.catalyst.parser.ParseException:
[PARSE_SYNTAX_ERROR] Syntax error at or near 'IN'. SQLSTATE: 42601 (line 1, pos 33)

Example - After:

org.apache.spark.sql.catalyst.parser.ParseException:
[INVALID_SQL_SYNTAX.EMPTY_IN_PREDICATE] Invalid SQL syntax: IN predicate requires at least one value. Empty IN clauses like 'IN ()' are not allowed. Consider using 'WHERE FALSE' if you need an always-false condition, or provide at least one value in the IN list. SQLSTATE: 42000
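
As a quick illustration of the alternative the new message suggests (a spark-shell sketch, not part of the PR itself), an always-false condition can replace the empty IN list:

scala> // Instead of: spark.sql("SELECT * FROM range(10) WHERE id IN ()")
scala> spark.sql("SELECT * FROM range(10) WHERE false").show()  // always-false predicate, returns no rows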

Does this PR introduce any user-facing change?

Yes, users will now see a better error message.

Code executed: spark.sql("SELECT * FROM range(10) WHERE id IN ()").show()

Before output: (screenshot of the [PARSE_SYNTAX_ERROR] message, as in the Before example above)

After output: (screenshot of the [INVALID_SQL_SYNTAX.EMPTY_IN_PREDICATE] message, as in the After example above)

How was this patch tested?

  • Added unit tests in QueryParsingErrorsSuite.scala and SQL golden tests in predicate-functions.sql (a sketch of the test shape follows this list)
  • Also verified the change locally by running the query in spark-shell
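
The unit test follows the checkError / parseException pattern visible in the review snippet below; this sketch only illustrates that shape, with the condition name taken from the After example above (the empty parameters map is an assumption, not a copy of the merged test):

test("SPARK-54634: empty IN predicate raises EMPTY_IN_PREDICATE") {
  // Sketch only: the parameters map is a placeholder, not copied from the actual test.
  checkError(
    exception = parseException("SELECT * FROM range(10) WHERE id IN ()"),
    condition = "INVALID_SQL_SYNTAX.EMPTY_IN_PREDICATE",
    parameters = Map.empty[String, String])
}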

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude (Anthropic) - used for code assistance, test generation, and documentation.

github-actions bot added the SQL label on Dec 8, 2025
@allisonwang-db (Contributor) left a comment:

Thanks for making the error message better!

  exception = parseException(sql2),
  condition = "PARSE_SYNTAX_ERROR",
- parameters = Map("error" -> "'IN'", "hint" -> ""))
+ parameters = Map("error" -> "'INTO'", "hint" -> ""))
@allisonwang-db (Contributor) commented on this diff:

What's the error message before and after this change for this test case?

@sahilkumarsingh (Author) replied:

Hey Allison,

Here are the before and after outputs for this test case:

Before:

scala> spark.sql("SELECT * FROM S WHERE C1 IN (INSERT INTO T VALUES (2))").show()
org.apache.spark.sql.catalyst.parser.ParseException:
[PARSE_SYNTAX_ERROR] Syntax error at or near 'IN'. SQLSTATE: 42601 (line 1, pos 25)

== SQL ==
SELECT * FROM S WHERE C1 IN (INSERT INTO T VALUES (2))
-------------------------^^^

  at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(parsers.scala:285)
  at org.apache.spark.sql.catalyst.parser.AbstractParser.parse(parsers.scala:97)
  at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:54)
  at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(AbstractSqlParser.scala:93)
  at org.apache.spark.sql.classic.SparkSession.$anonfun$sql$5(SparkSession.scala:492)
  at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:148)
  at org.apache.spark.sql.classic.SparkSession.$anonfun$sql$4(SparkSession.scala:491)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:804)
  at org.apache.spark.sql.classic.SparkSession.sql(SparkSession.scala:490)
  at org.apache.spark.sql.classic.SparkSession.sql(SparkSession.scala:504)
  at org.apache.spark.sql.classic.SparkSession.sql(SparkSession.scala:513)
  at org.apache.spark.sql.classic.SparkSession.sql(SparkSession.scala:91)
  ... 42 elided

After:

scala> spark.sql("SELECT * FROM S WHERE C1 IN (INSERT INTO T VALUES (2))").show()
org.apache.spark.sql.catalyst.parser.ParseException:
[PARSE_SYNTAX_ERROR] Syntax error at or near 'INTO'. SQLSTATE: 42601 (line 1, pos 36)

== SQL ==
SELECT * FROM S WHERE C1 IN (INSERT INTO T VALUES (2))
------------------------------------^^^

  at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(parsers.scala:267)
  at org.apache.spark.sql.catalyst.parser.AbstractParser.parse(parsers.scala:78)
  at org.apache.spark.sql.execution.SparkSqlParser.super$parse(SparkSqlParser.scala:163)
  at org.apache.spark.sql.execution.SparkSqlParser.$anonfun$parseInternal$1(SparkSqlParser.scala:163)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(origin.scala:107)
  at org.apache.spark.sql.execution.SparkSqlParser.parseInternal(SparkSqlParser.scala:163)
  at org.apache.spark.sql.execution.SparkSqlParser.parseWithParameters(SparkSqlParser.scala:70)
  at org.apache.spark.sql.execution.SparkSqlParser.parsePlanWithParameters(SparkSqlParser.scala:84)
  at org.apache.spark.sql.classic.SparkSession.$anonfun$sql$6(SparkSession.scala:573)
  at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:148)
  at org.apache.spark.sql.classic.SparkSession.$anonfun$sql$4(SparkSession.scala:572)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:804)
  at org.apache.spark.sql.classic.SparkSession.sql(SparkSession.scala:563)
  at org.apache.spark.sql.classic.SparkSession.sql(SparkSession.scala:591)
  at org.apache.spark.sql.classic.SparkSession.sql(SparkSession.scala:682)
  at org.apache.spark.sql.classic.SparkSession.sql(SparkSession.scala:92)
  ... 42 elided
