Getting "column -1 out of bounds" exception when using through spark #1221

OndrejKut opened this issue Dec 20, 2024 · 1 comment


OndrejKut commented Dec 20, 2024

Describe the bug
I am using Spark (actually PySpark) with the SQLite JDBC driver for tests, and I ran into a case that causes a "java.sql.SQLException: column -1 out of bounds" exception. It only happens with sqlite-jdbc-3.39.4.0 or newer; older versions work just fine, which makes me think this might be an issue in the JDBC driver itself. Unfortunately, the issue is triggered indirectly through Spark, so I don't know exactly what it does with SQLite. Still, I tried to isolate a minimal example that reproduces the issue.

To Reproduce
Here is sample code for Python 3.10 (as mentioned, I reproduce the issue indirectly through PySpark; I don't use Java directly):

import sqlite3
import os
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

db_sql = """
CREATE TABLE [t1](
  [id] integer PRIMARY KEY AUTOINCREMENT NOT NULL, 
  [some_bool] bool, 
  [some_int] bigint NOT NULL);
  
CREATE TABLE [t2](
  [id] integer PRIMARY KEY AUTOINCREMENT NOT NULL, 
  [another_bool] bool, 
  [t1_id] bigint NOT NULL);
  

INSERT INTO t1 (id, some_bool, some_int) VALUES (1, False, 2);
INSERT INTO t2 (id, another_bool, t1_id) VALUES (1, NULL, 1);
"""

if __name__ == '__main__':
    # sqlite-jdbc-3.39.3.0.jar - working
    # sqlite-jdbc-3.39.4.0.jar - not working
    DRIVER_PATH = os.path.abspath("sqlite-jdbc-3.39.4.0.jar")
    DB_PATH = os.path.abspath("debug.db")
    if os.path.exists(DB_PATH):
        os.remove(DB_PATH)
    conn = sqlite3.connect(DB_PATH)
    cursor = conn.cursor()
    cursor.executescript(db_sql)
    conn.commit()
    conn.close()

    spark = SparkSession \
        .builder.appName("ISSUE") \
        .master("local[{}]".format(1)) \
        .config("spark.driver.extraClassPath", DRIVER_PATH)
    spark = spark.getOrCreate()

    q1 = """SELECT id, some_int, some_bool FROM t1"""
    df1 = spark.read.format('jdbc').options(driver='org.sqlite.JDBC',
                                      url=f"jdbc:sqlite:{DB_PATH}",
                                      query=q1).load()

    q2 = """SELECT another_bool, t1_id FROM t2"""
    df2 = spark.read.format('jdbc').options(driver='org.sqlite.JDBC',
                                      url=f"jdbc:sqlite:{DB_PATH}",
                                      query=q2).load()

    df = df1.join(df2, df1.id == df2.t1_id, "fullouter")
    groups = df.groupBy(['some_int'])

    agg_f = F.count(
        F.when(
            F.col("some_bool") == True,
            F.col("another_bool")
        )
    )
    agg_df = groups.agg(agg_f)
    data = agg_df.toPandas()

Expected behavior
I don't expect any exception.

Logs
Here is one such stack trace:

24/12/20 04:02:49 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
java.sql.SQLException: column -1 out of bounds [1,2]
	at org.sqlite.core.CoreResultSet.checkCol(CoreResultSet.java:98)
	at org.sqlite.core.CoreResultSet.markCol(CoreResultSet.java:112)
	at org.sqlite.jdbc3.JDBC3ResultSet.wasNull(JDBC3ResultSet.java:150)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:359)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:340)
	at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)
	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
	at org.apache.spark.scheduler.Task.run(Task.scala:141)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:840)

Environment:

  • OS: Windows 10
  • CPU architecture: x86_64
  • sqlite-jdbc version: 3.47.1.0 (issue seems to be introduced in 3.39.4.0)

Additional context

  • the issue does not happen in 100% of runs; occasionally it just works, though most of the time it throws the exception
  • if we swap the order of another_bool and t1_id in the second query, the issue goes away
  • if we use sqlite-jdbc-3.39.3.0 or older, the issue goes away

gotson (Collaborator) commented Dec 20, 2024

It's difficult to help without knowing how Spark uses the driver. It stumbles on ResultSet.wasNull(), and the documentation says:

Note that you must first call one of the getter methods on a column to try to read its value and then call the method wasNull to see if the value read was SQL NULL.

And:

Throws:
SQLException - if a database access error occurs or this method is called on a closed result set
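
In plain JDBC terms the contract is: call a getter on a column first, then ask wasNull() on the same, still-open result set. A minimal illustration (not from the issue; the class and helper name are made up):

import java.sql.ResultSet;
import java.sql.SQLException;

class NullableRead {
    // Reads a nullable BIGINT column in the documented order:
    // getter first, then wasNull() to learn whether that value was SQL NULL.
    static Long readNullableLong(ResultSet rs, String column) throws SQLException {
        long value = rs.getLong(column); // 1. read the column
        if (rs.wasNull()) {              // 2. wasNull() now refers to that read
            return null;                 // the stored value was SQL NULL
        }
        return value;                    // auto-boxed to Long
    }
}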

It may be incorrect usage of the driver by Spark, as JDBC is quite complex and its implementation is often left to interpretation, or it could be an issue in the driver.

There's nothing obvious here that would make me think the driver is at fault, as we have unit tests covering those methods.

I suggest you raise an issue on the Spark repo/community; they may find something on their side, or inquire here if our driver seems at fault.
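
To check whether the driver alone misbehaves, the second query could also be run directly against the debug.db file produced by the Python script, without Spark in between. A rough sketch (the class name is made up, and it is not confirmed to reproduce the exception):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DirectDriverCheck {
    public static void main(String[] args) throws Exception {
        // Assumes debug.db from the reproduction above and sqlite-jdbc on the classpath.
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:debug.db");
             Statement stmt = conn.createStatement();
             // Same column order as the failing query; swapping to
             // "SELECT t1_id, another_bool FROM t2" reportedly avoids the issue.
             ResultSet rs = stmt.executeQuery("SELECT another_bool, t1_id FROM t2")) {
            while (rs.next()) {
                // Documented order: getter first, then wasNull().
                boolean anotherBool = rs.getBoolean(1);
                System.out.println("another_bool=" + anotherBool + ", wasNull=" + rs.wasNull());
                long t1Id = rs.getLong(2);
                System.out.println("t1_id=" + t1Id + ", wasNull=" + rs.wasNull());
            }
        }
    }
}

If this runs cleanly while the Spark path fails, that would point at the access pattern Spark uses rather than at the queries themselves.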
