Incompatibility with hdfs.open #188

Closed
ephes opened this issue Jul 27, 2017 · 5 comments

ephes commented Jul 27, 2017

Hi,

I tried to read a parquet file directly from HDFS using pyarrow, but if I set open_with to hdfs.open, it doesn't seem to work:

from fastparquet import ParquetFile
pf = ParquetFile(hdfs_path, open_with=hdfs.open)

---------------------------------------------------------------------------
ArrowIOError                              Traceback (most recent call last)
~/miniconda3/envs/ro/lib/python3.5/site-packages/fastparquet/api.py in __init__(self, fn, verify, open_with, sep, root)
     92                 self.fn = fn2
---> 93                 with open_with(fn2, 'rb') as f:
     94                     self._parse_header(f, verify)

io.pxi in pyarrow.lib._HdfsClient.open()

error.pxi in pyarrow.lib.check_status()

ArrowIOError: IOError: Unable to open file /user/jwersdoerfer/amoma/queries/ymd=20170126/locale=CA/000000_0/_metadata

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
<ipython-input-26-c56d7ca809d4> in <module>()
      1 from fastparquet import ParquetFile
----> 2 pf = ParquetFile(hdfs_path, open_with=hdfs.open)

~/miniconda3/envs/ro/lib/python3.5/site-packages/fastparquet/api.py in __init__(self, fn, verify, open_with, sep, root)
     97                 self.fn = fn
     98                 with open_with(fn, 'rb') as f:
---> 99                     self._parse_header(f, verify)
    100         if not self.row_groups:
    101             self.file_scheme = 'empty'

~/miniconda3/envs/ro/lib/python3.5/site-packages/fastparquet/api.py in _parse_header(self, f, verify)
    113             if verify:
    114                 assert f.read(4) == b'PAR1'
--> 115             f.seek(-8, 2)
    116             head_size = struct.unpack('<i', f.read(4))[0]
    117             if verify:

TypeError: seek() takes exactly one argument (2 given)
martindurant (Member) commented

It seems Arrow's seek implementation is incomplete - I would raise an issue with them. That form of seek means "8 bytes before the end of the file".
You could instead try hdfs3, which is known to work.
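For example, a minimal sketch of that route (the path, namenode host, and port here are hypothetical):

from hdfs3 import HDFileSystem
from fastparquet import ParquetFile

hdfs_path = '/user/example/data.parquet'  # placeholder path

# connect to the namenode and hand its open() to fastparquet
hdfs = HDFileSystem(host='namenode', port=8020)  # hypothetical host/port
pf = ParquetFile(hdfs_path, open_with=hdfs.open)
df = pf.to_pandas()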


wesm commented Jul 27, 2017

Looks like we need to implement the second argument of file.seek in pyarrow.lib.HdfsFile.seek. I opened https://issues.apache.org/jira/browse/ARROW-1287
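For reference, here are the whence semantics fastparquet relies on, sketched with Python's built-in file objects (a standard-library illustration, not pyarrow's implementation):

import os
import struct

with open('example.parquet', 'rb') as f:  # placeholder local file
    f.seek(-8, os.SEEK_END)  # whence=2: position 8 bytes before end of file
    head_size = struct.unpack('<i', f.read(4))[0]  # parquet footer length
    assert f.read(4) == b'PAR1'  # trailing magic bytes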


ephes commented Jul 27, 2017

@martindurant hdfs3 worked, and I'll keep it in mind as a fallback. But I think I'll stick with pyarrow directly, because it was also 3 times faster on my parquet file. Thanks :).

martindurant (Member) commented

I would be interested to see your benchmark - the types of data, and any profiling you might do (if you have the time and motivation). If you are happy as things are, please close this issue.
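Something like the following rough timing sketch would do (hypothetical host, port, and path; not the original benchmark):

import time
import pyarrow as pa
import pyarrow.parquet as pq
from hdfs3 import HDFileSystem
from fastparquet import ParquetFile

hdfs_path = '/user/example/data.parquet'  # placeholder path

# fastparquet reading via hdfs3
t0 = time.time()
fs = HDFileSystem(host='namenode', port=8020)  # hypothetical host/port
df_fp = ParquetFile(hdfs_path, open_with=fs.open).to_pandas()
t_fp = time.time() - t0

# pyarrow's own parquet reader
t0 = time.time()
client = pa.HdfsClient('namenode', 8020)
with client.open(hdfs_path, 'rb') as f:
    df_pa = pq.read_table(f).to_pandas()
t_pa = time.time() - t0

print('fastparquet+hdfs3: {:.2f}s  pyarrow: {:.2f}s'.format(t_fp, t_pa))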


wesm commented Jul 28, 2017

I submitted a patch for the seek issue; I will test it out with fastparquet to make sure everything is OK.

wesm added a commit to wesm/arrow that referenced this issue Jul 29, 2017
….seek

I still need to validate this against the use case in dask/fastparquet#188

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#907 from wesm/ARROW-1287 and squashes the following commits:

933f3f6 [Wes McKinney] Add testing script for checking thirdparty library against pyarrow.HdfsClient
423ca87 [Wes McKinney] Implement whence argument for pyarrow.NativeFile.seek
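Assuming the patch lands as described, the original call from this issue should then work unchanged, e.g. (connection details hypothetical):

import pyarrow as pa
from fastparquet import ParquetFile

hdfs_path = '/user/example/data.parquet'  # placeholder path

hdfs = pa.HdfsClient('namenode', 8020)  # hypothetical host/port
pf = ParquetFile(hdfs_path, open_with=hdfs.open)  # seek(-8, 2) now supported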