IndexError: list index out of range when opening a parquet file saved by pySpark because there is no _metadata file #352

@IceS2

Description

I saved a PySpark DataFrame to a Parquet file while trying to reproduce an error I encountered with nested Parquet files: I could load them with fastparquet, but converting them to a pandas DataFrame raised a ValueError because an integer column contained a null value.

Loading the saved file with fastparquet now raises IndexError: list index out of range

+------------+
|        body|
+------------+
| [32189, 10]|
|[32196, 100]|
|[32197, 100]|
|[32198, 100]|
|[32192, 100]|
|[32193, 100]|
| [32187, 10]|
|[32191, 100]|
|    [32210,]|
|[32212, 123]|
|[32213, 100]|
|[32152, 100]|
| [32148, 10]|
| [32178, 10]|
| [32176, 10]|
| [32179, 10]|
|[32196, 100]|
|[32197, 100]|
|[32198, 100]|
| [32205, 10]|
+------------+
only showing top 20 rows

>>> df.write.parquet('test.parquet', mode='overwrite')
>>> df.printSchema()
root
 |-- body: struct (nullable = true)
 |    |-- id: integer (nullable = true)
 |    |-- candidates_requested: integer (nullable = true)

In [1]: import fastparquet as fp
   ...: pq = fp.ParquetFile('part-00000-c0834688-6375-4c0f-bc44-a2f5b3d0bc4b-c000.snappy.parquet')
   ...: df = pq.to_pandas()
   ...: 
------------------------------------------------------------------------
NotADirectoryError                     Traceback (most recent call last)
~/.virtualenvs/default/lib/python3.6/site-packages/fastparquet/api.py in __init__(self, fn, verify, open_with, root, sep)
     93                 self.fn = fn2
---> 94                 with open_with(fn2, 'rb') as f:
     95                     self._parse_header(f, verify)

~/.virtualenvs/default/lib/python3.6/site-packages/fastparquet/util.py in default_open(f, mode)
     36 def default_open(f, mode='rb'):
---> 37     return open(f, mode)
     38 

NotADirectoryError: [Errno 20] Not a directory: 'part-00000-c0834688-6375-4c0f-bc44-a2f5b3d0bc4b-c000.snappy.parquet/_metadata'

During handling of the above exception, another exception occurred:

IndexError                             Traceback (most recent call last)
<ipython-input-1-c3650c67f022> in <module>()
      1 import fastparquet as fp
----> 2 pq = fp.ParquetFile('part-00000-c0834688-6375-4c0f-bc44-a2f5b3d0bc4b-c000.snappy.parquet')
      3 df = pq.to_pandas()

~/.virtualenvs/default/lib/python3.6/site-packages/fastparquet/api.py in __init__(self, fn, verify, open_with, root, sep)
     98                 self.fn = join_path(fn)
     99                 with open_with(fn, 'rb') as f:
--> 100                     self._parse_header(f, verify)
    101         self.open = open_with
    102         self.sep = sep

~/.virtualenvs/default/lib/python3.6/site-packages/fastparquet/api.py in _parse_header(self, f, verify)
    122         self.head_size = head_size
    123         self.fmd = fmd
--> 124         self._set_attrs()
    125 
    126     def _set_attrs(self):

~/.virtualenvs/default/lib/python3.6/site-packages/fastparquet/api.py in _set_attrs(self)
    141                                            for rg in self.row_groups])
    142         self._read_partitions()
--> 143         self._dtypes()
    144 
    145     @property

~/.virtualenvs/default/lib/python3.6/site-packages/fastparquet/api.py in _dtypes(self, categories)
    458                 num_nulls = 0
    459                 for rg in self.row_groups:
--> 460                     chunk = rg.columns[i]
    461                     if chunk.meta_data.statistics is None:
    462                         num_nulls = True

IndexError: list index out of range
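
Reading the two chained tracebacks together, the NotADirectoryError appears to be only the first attempt failing: the frames at api.py lines 94 and 99 suggest that the path is first tried as a directory containing `_metadata`, and then opened as a single file. The IndexError is raised later, while the part file's own footer is being processed. Below is a minimal sketch of that open order, assuming the behaviour is as the traceback frames show; `resolve_parquet_path` is a hypothetical helper written for illustration and is not part of fastparquet.

```python
import os
import tempfile


def resolve_parquet_path(fn, open_with=open):
    """Return (path, kind) for the file whose header would be parsed.

    Hypothetical helper, not fastparquet code: it only imitates the
    try/except order visible in the traceback frames.
    """
    candidate = os.path.join(fn, "_metadata")
    try:
        # First attempt: treat fn as a dataset directory with a _metadata file.
        with open_with(candidate, "rb"):
            return candidate, "directory"
    except (IOError, OSError):
        # NotADirectoryError and FileNotFoundError are both OSError subclasses.
        # Second attempt: treat fn as a single Parquet file.
        with open_with(fn, "rb"):
            return fn, "single file"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Stand-in for a Spark part file; the content is irrelevant here.
        part_file = os.path.join(tmp, "part-00000.snappy.parquet")
        with open(part_file, "wb") as f:
            f.write(b"PAR1")
        path, kind = resolve_parquet_path(part_file)
        print(kind, os.path.basename(path))
```

If this reading is right, the missing `_metadata` is handled by the fallback, and the failure that needs investigating is the `rg.columns[i]` lookup in `_dtypes` for the nested `body` struct.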
