
Out of bounds error when reading data contains null value #647

Closed
zijie0 opened this issue Jul 23, 2021 · 3 comments
zijie0 commented Jul 23, 2021

What happened:

With version 0.6.3 or below, we can't read data containing null values; it throws the following error:

IndexError: Out of bounds on buffer access (axis 0)

What you expected to happen:

fastparquet should be able to read data containing null values.

Minimal Complete Verifiable Example:

In [1]: import glob

In [2]: import pyspark

In [3]: from fastparquet import ParquetFile

In [4]: spark = pyspark.sql.SparkSession.builder.getOrCreate()

In [5]: sdf = spark.createDataFrame([[None, 1, None], ['a', None, None]], schema='a string, b int, c string')

In [6]: sdf.show()
+----+----+----+
|   a|   b|   c|
+----+----+----+
|null|   1|null|
|   a|null|null|
+----+----+----+


In [7]: path = 'spark_null'

In [8]: sdf.write.format('parquet').save(path)

In [10]: file_list = glob.glob(f'{path}/*.parquet')

In [11]: pdf = ParquetFile(file_list).to_pandas()

.../lib/python3.7/site-packages/fastparquet/speedups.pyx in fastparquet.speedups.unpack_byte_array()

IndexError: Out of bounds on buffer access (axis 0)
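For context, here is a minimal pure-Python sketch of the failure mode; this is an illustrative assumption, not fastparquet's actual Cython implementation of `unpack_byte_array`. If the decoder is asked for the full row count instead of the count of non-null (defined) values, it walks past the end of the data page buffer:

```python
import struct

def unpack_byte_array(buf: bytes, count: int):
    """Decode `count` length-prefixed byte strings from `buf`.

    Each value is a 4-byte little-endian length followed by that many
    bytes, mirroring Parquet's PLAIN encoding for BYTE_ARRAY columns.
    (Hypothetical helper for illustration only.)
    """
    out, offset = [], 0
    for _ in range(count):
        if offset + 4 > len(buf):
            raise IndexError("Out of bounds on buffer access (axis 0)")
        (n,) = struct.unpack_from("<i", buf, offset)
        offset += 4
        if offset + n > len(buf):
            raise IndexError("Out of bounds on buffer access (axis 0)")
        out.append(buf[offset:offset + n])
        offset += n
    return out

# The data page stores only the two non-null string values.
buf = struct.pack("<i", 1) + b"a" + struct.pack("<i", 1) + b"b"

print(unpack_byte_array(buf, 2))  # correct count of defined values
try:
    # Asking for 3 values (null rows not subtracted) overruns the buffer.
    unpack_byte_array(buf, 3)
except IndexError as e:
    print(e)
```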

Anything else we need to know?:

Environment:

  • Dask version: N/A
  • Python version: 3.7.9
  • Operating System: CentOS 7.6
  • Install method (conda, pip, source): pip

zijie0 commented Jul 23, 2021

It works fine in version 0.7.0, but we cannot upgrade due to issue #646.

It would be nice to have a hotfix in the 0.6.x series; otherwise, compatibility with Spark is completely broken.

@martindurant (Member)

If you agree that my comments in the other issue are sufficient, please close this issue.


zijie0 commented Jul 25, 2021

Closing this issue, as we have a workaround in #646.

zijie0 closed this as completed Jul 25, 2021