
Out of bounds error when reading data contains null value #647

Closed
zijie0 opened this issue Jul 23, 2021 · 3 comments
zijie0 commented Jul 23, 2021

What happened:

With version 0.6.3 or below, we can't read data containing null values; it throws the following error:

IndexError: Out of bounds on buffer access (axis 0)

What you expected to happen:

fastparquet should be able to read data containing null values.

Minimal Complete Verifiable Example:

In [1]: import glob

In [2]: import pyspark

In [3]: from fastparquet import ParquetFile

In [4]: spark = pyspark.sql.SparkSession.builder.getOrCreate()

In [5]: sdf = spark.createDataFrame([[None, 1, None], ['a', None, None]], schema='a string, b int, c string')

In [6]: sdf.show()
+----+----+----+
|   a|   b|   c|
+----+----+----+
|null|   1|null|
|   a|null|null|
+----+----+----+


In [7]: path = 'spark_null'

In [8]: sdf.write.format('parquet').save(path)

In [10]: file_list = glob.glob(f'{path}/*.parquet')

In [11]: pdf = ParquetFile(file_list).to_pandas()

.../lib/python3.7/site-packages/fastparquet/speedups.pyx in fastparquet.speedups.unpack_byte_array()

IndexError: Out of bounds on buffer access (axis 0)
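For context, here is a minimal pure-Python sketch of the failure mode; this is an illustrative assumption, not fastparquet's actual Cython implementation of `unpack_byte_array`. If the decoder is asked for the full row count instead of the count of non-null (defined) values, it walks past the end of the data page buffer:

```python
import struct

def unpack_byte_array(buf: bytes, count: int):
    """Decode `count` length-prefixed byte strings from `buf`.

    Each value is a 4-byte little-endian length followed by that many
    bytes, mirroring Parquet's PLAIN encoding for BYTE_ARRAY columns.
    (Hypothetical helper for illustration only.)
    """
    out, offset = [], 0
    for _ in range(count):
        if offset + 4 > len(buf):
            raise IndexError("Out of bounds on buffer access (axis 0)")
        (n,) = struct.unpack_from("<i", buf, offset)
        offset += 4
        if offset + n > len(buf):
            raise IndexError("Out of bounds on buffer access (axis 0)")
        out.append(buf[offset:offset + n])
        offset += n
    return out

# The data page stores only the two non-null string values.
buf = struct.pack("<i", 1) + b"a" + struct.pack("<i", 1) + b"b"

print(unpack_byte_array(buf, 2))  # correct count of defined values
try:
    # Asking for 3 values (null rows not subtracted) overruns the buffer.
    unpack_byte_array(buf, 3)
except IndexError as e:
    print(e)
```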

Anything else we need to know?:

Environment:

  • Dask version: N/A
  • Python version: 3.7.9
  • Operating System: CentOS 7.6
  • Install method (conda, pip, source): pip

zijie0 commented Jul 23, 2021

It works fine in version 0.7.0, but we cannot upgrade due to issue #646.

It would be nice to have a hotfix in the 0.6.x series; otherwise, compatibility with Spark is completely broken.

@martindurant (Member)

If you agree that my comments in the other issue are sufficient, please close this issue.


zijie0 commented Jul 25, 2021

Closing this issue, as we have a workaround in #646.

zijie0 closed this as completed Jul 25, 2021