Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implemented possibility to read directly from WebHDFS using smart_open #29

Merged
merged 14 commits into from
Aug 23, 2015

Conversation

ziky90
Copy link
Contributor

@ziky90 ziky90 commented Aug 18, 2015

This PR partially solves the issue #23. It implements reading from WebHDFS based on API described at https://hadoop.apache.org/docs/r1.0.4/webhdfs.html.

For reading from WebHDFS you can use:
webhdfs://host:port/path/file

NOTE:

  • It does not support kerberos authentication yet (so it only works for WebHDFS that is not secured)
  • Please do not merge yet, I'll try to implement read and seek methods in this PR soon as well.

@@ -36,6 +36,10 @@ It is well tested (using `moto <https://github.com/spulec/moto>`_), well documen
>>> for line in smart_open.smart_open('hdfs://user/hadoop/my_file.txt'):
... print line

>>> # stream from WebHDFS
>>> for line in smart_open.smart_open('webhdfs://host:port/user/hadoop/my_file.txt'):
... print line
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing 1 space

@piskvorky
Copy link
Owner

Thanks @ziky90 ! Can you add some unit tests so we can merge?

@ziky90
Copy link
Contributor Author

ziky90 commented Aug 18, 2015

Added space + replaced os.path.join.
I'll try to do some unit tests + I'll also try to directly implement read and seek methods before merge.

@piskvorky
Copy link
Owner

Cool, thanks!

@ziky90
Copy link
Contributor Author

ziky90 commented Aug 21, 2015

I have added read and seek methods implementation + unit tests.
For unit tests I need to additionally install package responses. Should this be in the setup.py or just as a note in the README.rst?

For now I have at least added requests to the .travis.yml file + I'll try to fix the Python 3 related problem with tests.

@piskvorky
Copy link
Owner

OK, thanks!
Re. responses: a note in the README (not a hard dependency).

@@ -140,9 +145,16 @@ def __init__(self, uri, default_scheme="file"):

if self.scheme == "hdfs":
self.uri_path = parsed_uri.netloc + parsed_uri.path
if self.uri_path[0] != "/":
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Safer to use lstrip('/') here? (this will fail if there's no 0th element)

Or what is this for? When does this happen? Can uri_path start with multiple leading //?

@ziky90
Copy link
Contributor Author

ziky90 commented Aug 22, 2015

Tests for Python3 fixed, though I don't understand error log for Python2.6. It seems to as travis just got stuck (same as for commit 15c7cfc)?
It seems that it gets stuck on test that I was even not modifying at all.

@ziky90 ziky90 closed this Aug 22, 2015
@ziky90 ziky90 reopened this Aug 22, 2015
@ziky90 ziky90 closed this Aug 22, 2015
@ziky90 ziky90 reopened this Aug 22, 2015
@piskvorky
Copy link
Owner

Yeah I remember seeing that too. Not sure why it gets stuck -- seems something to do with multiprocessing.

In any case, not related to your PR here. I re-ran the failing travis test and it passes.

Are we good to merge or do you plan to add more commits?

@ziky90
Copy link
Contributor Author

ziky90 commented Aug 23, 2015

Thanks, from my point of view it's ready to merge.

piskvorky added a commit that referenced this pull request Aug 23, 2015
Implemented possibility to read directly from WebHDFS using smart_open
@piskvorky piskvorky merged commit 74d09de into piskvorky:master Aug 23, 2015
@piskvorky
Copy link
Owner

Thanks, merged!

How would you describe this PR in one line (for the CHANGELOG)?

@ziky90
Copy link
Contributor Author

ziky90 commented Aug 23, 2015

probably the best description would be something like:
implemented reading from WebHDFS

@ziky90 ziky90 mentioned this pull request Sep 16, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants