forked from jcrobak/parquet-python
Closed
Description
I saved a PySpark DataFrame to a Parquet file while trying to reproduce an error I had encountered with nested Parquet files. I could load those files with fastparquet, but converting them to a pandas DataFrame raised a ValueError because an integer column contained a null value.
Loading the file written by this reproduction now fails earlier, with IndexError: list index out of range.
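The ValueError described above is consistent with how NumPy-backed integer columns behave in pandas: they cannot hold missing values. The following is a minimal sketch, assuming pandas is installed. The values are made up to resemble the `candidates_requested` column and are not taken from the actual file.

```python
import pandas as pd

# Hypothetical values resembling candidates_requested, with one null.
values = [10, 100, None, 123]

# pandas represents the missing value as NaN, so the column becomes float64.
as_float = pd.Series(values)

# Casting to a NumPy integer dtype fails because NaN has no integer form.
try:
    as_float.astype("int64")
    cast_error = None
except ValueError as exc:
    cast_error = exc

# The nullable extension dtype "Int64" (capital I) keeps the null as <NA>.
as_nullable = as_float.astype("Int64")

print(as_float.dtype)
print(type(cast_error).__name__)
print(as_nullable.dtype)
```

Whether fastparquet hits exactly this cast internally is an assumption. The sketch only shows why a null in an integer column needs either a float column or a nullable integer dtype.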
+------------+
|        body|
+------------+
| [32189, 10]|
|[32196, 100]|
|[32197, 100]|
|[32198, 100]|
|[32192, 100]|
|[32193, 100]|
| [32187, 10]|
|[32191, 100]|
|    [32210,]|
|[32212, 123]|
|[32213, 100]|
|[32152, 100]|
| [32148, 10]|
| [32178, 10]|
| [32176, 10]|
| [32179, 10]|
|[32196, 100]|
|[32197, 100]|
|[32198, 100]|
| [32205, 10]|
+------------+
only showing top 20 rows
>>> df.write.parquet('test.parquet', mode='overwrite')
>>> df.printSchema()
root
 |-- body: struct (nullable = true)
 |    |-- id: integer (nullable = true)
 |    |-- candidates_requested: integer (nullable = true)
In [1]: import fastparquet as fp
   ...: pq = fp.ParquetFile('part-00000-c0834688-6375-4c0f-bc44-a2f5b3d0bc4b-c000.snappy.parquet')
   ...: df = pq.to_pandas()
   ...:
------------------------------------------------------------------------
NotADirectoryError Traceback (most recent call last)
~/.virtualenvs/default/lib/python3.6/site-packages/fastparquet/api.py in __init__(self, fn, verify, open_with, root, sep)
93 self.fn = fn2
---> 94 with open_with(fn2, 'rb') as f:
95 self._parse_header(f, verify)
~/.virtualenvs/default/lib/python3.6/site-packages/fastparquet/util.py in default_open(f, mode)
36 def default_open(f, mode='rb'):
---> 37 return open(f, mode)
38
NotADirectoryError: [Errno 20] Not a directory: 'part-00000-c0834688-6375-4c0f-bc44-a2f5b3d0bc4b-c000.snappy.parquet/_metadata'
During handling of the above exception, another exception occurred:
IndexError Traceback (most recent call last)
<ipython-input-1-c3650c67f022> in <module>()
1 import fastparquet as fp
----> 2 pq = fp.ParquetFile('part-00000-c0834688-6375-4c0f-bc44-a2f5b3d0bc4b-c000.snappy.parquet')
3 df = pq.to_pandas()
~/.virtualenvs/default/lib/python3.6/site-packages/fastparquet/api.py in __init__(self, fn, verify, open_with, root, sep)
98 self.fn = join_path(fn)
99 with open_with(fn, 'rb') as f:
--> 100 self._parse_header(f, verify)
101 self.open = open_with
102 self.sep = sep
~/.virtualenvs/default/lib/python3.6/site-packages/fastparquet/api.py in _parse_header(self, f, verify)
122 self.head_size = head_size
123 self.fmd = fmd
--> 124 self._set_attrs()
125
126 def _set_attrs(self):
~/.virtualenvs/default/lib/python3.6/site-packages/fastparquet/api.py in _set_attrs(self)
141 for rg in self.row_groups])
142 self._read_partitions()
--> 143 self._dtypes()
144
145 @property
~/.virtualenvs/default/lib/python3.6/site-packages/fastparquet/api.py in _dtypes(self, categories)
458 num_nulls = 0
459 for rg in self.row_groups:
--> 460 chunk = rg.columns[i]
461 if chunk.meta_data.statistics is None:
462 num_nulls = True
IndexError: list index out of range
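The NotADirectoryError at the top of the traceback looks like a side effect of the open logic rather than the actual failure. Judging by the frames at api.py lines 94 and 99, fastparquet first tries `<path>/_metadata` and then falls back to the path itself. Below is a minimal standard-library sketch of that pattern. `open_parquet_path` is a hypothetical helper written for illustration, not fastparquet's code, and the temporary file holds placeholder bytes rather than real Parquet data.

```python
import os
import tempfile


def open_parquet_path(fn):
    """Hypothetical helper: try '<fn>/_metadata', then fall back to fn itself."""
    try:
        # Succeeds when fn is a dataset directory containing a _metadata file.
        return open(os.path.join(fn, "_metadata"), "rb"), "directory"
    except OSError as first_error:
        # For a single file, POSIX systems raise NotADirectoryError here.
        # The error is expected and only triggers the fallback.
        note = "file (fallback after " + type(first_error).__name__ + ")"
        return open(fn, "rb"), note


# Stand-in for a single part file; the content is a placeholder.
with tempfile.NamedTemporaryFile(suffix=".parquet", delete=False) as tmp:
    tmp.write(b"PAR1")

handle, how = open_parquet_path(tmp.name)
first_bytes = handle.read()
handle.close()
os.remove(tmp.name)

print(how)
```

Under this reading, the error to investigate is the IndexError raised in `_dtypes`, since the fallback open of the part file itself succeeded.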