Getting "column -1 out of bounds" exception when using through spark #1221

OndrejKut opened this issue Dec 20, 2024 · 1 comment


OndrejKut commented Dec 20, 2024

Describe the bug
I am using Spark (actually PySpark) with the SQLite JDBC driver for tests, and I ran into a case that causes a "java.sql.SQLException: column -1 out of bounds" exception. It only happens with sqlite-jdbc-3.39.4.0 or newer; older versions work just fine, which makes me think this might be an issue in the JDBC driver itself. Unfortunately, the issue is triggered indirectly through Spark, so I don't know exactly what it does with SQLite. Still, I tried to isolate a minimal example that reproduces the issue.

To Reproduce
Here is sample code for Python 3.10 (as mentioned, I reproduce the issue indirectly through PySpark; I don't use Java directly):

import sqlite3
import os
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

db_sql = """
CREATE TABLE [t1](
  [id] integer PRIMARY KEY AUTOINCREMENT NOT NULL, 
  [some_bool] bool, 
  [some_int] bigint NOT NULL);
  
CREATE TABLE [t2](
  [id] integer PRIMARY KEY AUTOINCREMENT NOT NULL, 
  [another_bool] bool, 
  [t1_id] bigint NOT NULL);
  

INSERT INTO t1 (id, some_bool, some_int) VALUES (1, False, 2);
INSERT INTO t2 (id, another_bool, t1_id) VALUES (1, NULL, 1);
"""

if __name__ == '__main__':
    # sqlite-jdbc-3.39.3.0.jar - working
    # sqlite-jdbc-3.39.4.0.jar - not working
    DRIVER_PATH = os.path.abspath("sqlite-jdbc-3.39.4.0.jar")
    DB_PATH = os.path.abspath("debug.db")
    if os.path.exists(DB_PATH):
        os.remove(DB_PATH)
    conn = sqlite3.connect(DB_PATH)
    cursor = conn.cursor()
    cursor.executescript(db_sql)
    conn.commit()
    conn.close()

    spark = SparkSession \
        .builder.appName("ISSUE") \
        .master("local[{}]".format(1)) \
        .config("spark.driver.extraClassPath", DRIVER_PATH)
    spark = spark.getOrCreate()

    q1 = """SELECT id, some_int, some_bool FROM t1"""
    df1 = spark.read.format('jdbc').options(driver='org.sqlite.JDBC',
                                      url=f"jdbc:sqlite:{DB_PATH}",
                                      query=q1).load()

    q2 = """SELECT another_bool, t1_id FROM t2"""
    df2 = spark.read.format('jdbc').options(driver='org.sqlite.JDBC',
                                      url=f"jdbc:sqlite:{DB_PATH}",
                                      query=q2).load()

    df = df1.join(df2, df1.id == df2.t1_id, "fullouter")
    groups = df.groupBy(['some_int'])

    agg_f = F.count(
        F.when(
            F.col("some_bool") == True,
            F.col("another_bool")
        )
    )
    agg_df = groups.agg(agg_f)
    data = agg_df.toPandas()

Expected behavior
I don't expect any exception.

Logs
Here is one such stack trace:

24/12/20 04:02:49 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
java.sql.SQLException: column -1 out of bounds [1,2]
	at org.sqlite.core.CoreResultSet.checkCol(CoreResultSet.java:98)
	at org.sqlite.core.CoreResultSet.markCol(CoreResultSet.java:112)
	at org.sqlite.jdbc3.JDBC3ResultSet.wasNull(JDBC3ResultSet.java:150)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:359)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:340)
	at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)
	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
	at org.apache.spark.scheduler.Task.run(Task.scala:141)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:840)

Environment:

  • OS: Windows 10
  • CPU architecture: x86_64
  • sqlite-jdbc version: 3.47.1.0 (issue seems to be introduced in 3.39.4.0)

Additional context

  • the issue does not happen in 100% of runs; occasionally it just works, though most of the time it throws the exception
  • if we swap the order of another_bool and t1_id in the second query, the issue goes away
  • if we use sqlite-jdbc-3.39.3.0 or older, the issue goes away

gotson (Collaborator) commented Dec 20, 2024

It's difficult to help without knowing how Spark uses the driver. It stumbles on ResultSet.wasNull(), and the documentation says:

Note that you must first call one of the getter methods on a column to try to read its value and then call the method wasNull to see if the value read was SQL NULL.

And:

Throws:
SQLException - if a database access error occurs or this method is called on a closed result set
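
In plain JDBC terms the contract is: call a getter on a column first, then ask wasNull() on the same, still-open result set. A minimal illustration (not from the issue; the class and helper name are made up):

import java.sql.ResultSet;
import java.sql.SQLException;

class NullableRead {
    // Reads a nullable BIGINT column in the documented order:
    // getter first, then wasNull() to learn whether that value was SQL NULL.
    static Long readNullableLong(ResultSet rs, String column) throws SQLException {
        long value = rs.getLong(column); // 1. read the column
        if (rs.wasNull()) {              // 2. wasNull() now refers to that read
            return null;                 // the stored value was SQL NULL
        }
        return value;                    // auto-boxed to Long
    }
}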

It may be incorrect usage of the driver by Spark, as JDBC is quite complex and its implementation is often left to interpretation, or it could be an issue in the driver.

There's nothing obvious here that would make me think the driver is at fault, as we have unit tests covering those methods.

I suggest you raise an issue on the Spark repo/community; they may find something on their side, or inquire here if our driver seems at fault.
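
To check whether the driver alone misbehaves, the second query could also be run directly against the debug.db file produced by the Python script, without Spark in between. A rough sketch (the class name is made up, and it is not confirmed to reproduce the exception):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DirectDriverCheck {
    public static void main(String[] args) throws Exception {
        // Assumes debug.db from the reproduction above and sqlite-jdbc on the classpath.
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:debug.db");
             Statement stmt = conn.createStatement();
             // Same column order as the failing query; swapping to
             // "SELECT t1_id, another_bool FROM t2" reportedly avoids the issue.
             ResultSet rs = stmt.executeQuery("SELECT another_bool, t1_id FROM t2")) {
            while (rs.next()) {
                // Documented order: getter first, then wasNull().
                boolean anotherBool = rs.getBoolean(1);
                System.out.println("another_bool=" + anotherBool + ", wasNull=" + rs.wasNull());
                long t1Id = rs.getLong(2);
                System.out.println("t1_id=" + t1Id + ", wasNull=" + rs.wasNull());
            }
        }
    }
}

If this runs cleanly while the Spark path fails, that would point at the access pattern Spark uses rather than at the queries themselves.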
