webhdfs: expose kerberos and https options #6936

Merged (5 commits, Nov 7, 2021)

Changes from 2 commits
10 changes: 6 additions & 4 deletions dvc/config_schema.py
@@ -179,10 +179,12 @@ class RelPath(str):
         },
         "hdfs": {"user": str, "kerb_ticket": str, **REMOTE_COMMON},
         "webhdfs": {
-            "hdfscli_config": str,
-            "webhdfs_token": str,
             "user": str,
-            "webhdfs_alias": str,
+            "kerberos": Bool,
+            "kerberos_principal": str,
+            "proxy_to": str,
@jorgeorpinel (Contributor) commented on Nov 13, 2021:
Should it be proxy_user or superuser per https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/Superusers.html ? Or hadoop_proxy_user to be precise.

"proxy to" could refer to many things...

@gudmundur-heimisson (Contributor, Author) replied:
Yes, for this initial pull request I just copied the parameter names from fsspec and left them as a first draft to have a discussion around.

I think proxy_user would be clearer, and in the Hadoop docs I feel like I see "proxy user" more often than "superuser".

If we do something like hadoop_proxy_user, maybe we should prefix all the other options with hadoop or webhdfs as well, for consistency?

It seems the other DVC protocol config options don't do this kind of prefixing (except Google Drive), but I can see how it would make sense.

@jorgeorpinel (Contributor) replied:
Thanks @gudmundur-heimisson. It's an interesting question and I'm sure @efiop will know best what to do. I personally do like the prefixing idea.

+            "ssl_verify": Any(Bool, str),
+            "token": str,
@gudmundur-heimisson (Contributor, Author) replied to a review comment:
I think this makes sense, with the previous caveat about prefixes.

+            "use_https": Bool,
Comment on lines +185 to +187
@gudmundur-heimisson (Contributor, Author) replied to @jorgeorpinel's review comment (Nov 13, 2021):

If my understanding of the Hadoop docs is correct, the swebhdfs protocol is WebHDFS over HTTPS, but fsspec (and DVC) does not actually support the swebhdfs scheme in URL strings; instead it uses webhdfs with a use_https flag set to true.

The question is whether we want to add a new swebhdfs protocol to be consistent with the actual standard, or keep things as they are to avoid multiplying protocols on our end and just use this use_https flag.

As fsspec shows, support for swebhdfs in URL strings is rather inconsistent across the Hadoop ecosystem, so it wouldn't be crazy to stick with webhdfs and provide this flag.

Regarding the hadoop prefix, see the other comments.
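Not part of the PR, but to make the swebhdfs-vs-use_https point concrete, here is a hypothetical sketch (parse_webhdfs_url is an illustrative helper, not a real DVC or fsspec function) of how an swebhdfs:// URL could be normalized into webhdfs plus a use_https flag:

```python
from urllib.parse import urlparse


def parse_webhdfs_url(url):
    """Hypothetical helper: treat swebhdfs:// as webhdfs over HTTPS."""
    parts = urlparse(url)
    if parts.scheme not in ("webhdfs", "swebhdfs"):
        raise ValueError(f"unsupported scheme: {parts.scheme}")
    return {
        "host": parts.hostname,
        "port": parts.port,
        "user": parts.username,
        # swebhdfs is simply webhdfs with TLS, so it maps onto the flag.
        "use_https": parts.scheme == "swebhdfs",
    }


parse_webhdfs_url("swebhdfs://alice@namenode:50470")
# -> {"host": "namenode", "port": 50470, "user": "alice", "use_https": True}
```

Under this approach both spellings reach the same code path, which matches how fsspec's WebHDFS class takes use_https as a constructor argument rather than a separate protocol.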

             **REMOTE_COMMON,
         },
         "azure": {
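For reference, a .dvc/config remote section exercising the new options from the schema above might look like this (values and remote name are illustrative, not from the PR):

```ini
['remote "myhdfs"']
    url = webhdfs://alice@namenode.example.com:50070
    use_https = true
    ssl_verify = /path/to/ca-bundle.crt
    kerberos = true
    kerberos_principal = alice@EXAMPLE.COM
    proxy_to = enduser
```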
15 changes: 11 additions & 4 deletions dvc/fs/webhdfs.py
@@ -31,17 +31,24 @@ def _get_kwargs_from_urls(urlpath):
         )
 
     def _prepare_credentials(self, **config):
         if "webhdfs_token" in config:
             config["token"] = config.pop("webhdfs_token")
 
+        self._ssl_verify = (
+            config.pop("ssl_verify") if "ssl_verify" in config else True
+        )
+        config["kerb_kwargs"] = {}
+        if "kerberos_principal" in config:
+            config["kerb_kwargs"]["principal"] = config.pop(
+                "kerberos_principal"
+            )
         return config

     @wrap_prop(threading.Lock())
     @cached_property
     def fs(self):
         from fsspec.implementations.webhdfs import WebHDFS
 
-        return WebHDFS(**self.fs_args)
+        fs = WebHDFS(**self.fs_args)
+        fs.session.verify = self._ssl_verify
+        return fs

     def checksum(self, path_info):
         path = self._with_bucket(path_info)
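As a standalone illustration of the mapping _prepare_credentials performs above, here is a sketch (not the actual DVC method; it returns the ssl_verify flag instead of storing it on self):

```python
def prepare_credentials(**config):
    """Sketch of DVC's webhdfs credential mapping, per the diff above."""
    # Rename the DVC-level "webhdfs_token" key to fsspec's "token".
    if "webhdfs_token" in config:
        config["token"] = config.pop("webhdfs_token")
    # ssl_verify is not an fsspec WebHDFS argument; pop it out so it can
    # be applied to the requests session after construction (default True).
    ssl_verify = config.pop("ssl_verify") if "ssl_verify" in config else True
    # fsspec expects Kerberos details nested under a kerb_kwargs dict.
    config["kerb_kwargs"] = {}
    if "kerberos_principal" in config:
        config["kerb_kwargs"]["principal"] = config.pop("kerberos_principal")
    return ssl_verify, config


verify, args = prepare_credentials(
    webhdfs_token="t0k", kerberos_principal="alice@EXAMPLE.COM", ssl_verify=False
)
# verify is False;
# args == {"token": "t0k", "kerb_kwargs": {"principal": "alice@EXAMPLE.COM"}}
```

This also shows why the `fs` property in the diff assigns `fs.session.verify` after constructing WebHDFS: the flag has no home in the fsspec constructor arguments, so it must be applied to the underlying requests session afterwards.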
1 change: 1 addition & 0 deletions setup.cfg
@@ -112,6 +112,7 @@ ssh_gssapi = sshfs[gssapi]>=2021.8.1
 webdav = webdav4>=0.9.3
 # not to break `dvc[webhdfs]`
 webhdfs =
+    requests-kerberos==0.13.0
 terraform = tpi[ssh]>=2.1.0
 tests =
     %(terraform)s
63 changes: 48 additions & 15 deletions tests/unit/remote/test_webhdfs.py
@@ -1,21 +1,54 @@
+from unittest.mock import Mock, create_autospec
+
+import pytest
+import requests
+
 from dvc.fs.webhdfs import WebHDFSFileSystem
 
+host = "host"
+kerberos = False
+kerberos_principal = "principal"
+port = 12345
+proxy_to = "proxy"
+ssl_verify = False
+token = "token"
+use_https = True
 user = "test"
-webhdfs_token = "token"
-webhdfs_alias = "alias-name"
-hdfscli_config = "path/to/cli/config"
 
 
-def test_init(dvc):
-    url = "webhdfs://test@127.0.0.1:50070"
-    config = {
-        "host": url,
-        "webhdfs_token": webhdfs_token,
-        "webhdfs_alias": webhdfs_alias,
-        "hdfscli_config": hdfscli_config,
-        "user": user,
-    }
-
-    fs = WebHDFSFileSystem(**config)
-    assert fs.fs_args["token"] == webhdfs_token
+@pytest.fixture()
+def webhdfs_config():
+    url = f"webhdfs://{user}@{host}:{port}"
+    url_config = WebHDFSFileSystem._get_kwargs_from_urls(url)
+    return {
+        "kerberos": kerberos,
+        "kerberos_principal": kerberos_principal,
+        "proxy_to": proxy_to,
+        "ssl_verify": ssl_verify,
+        "token": token,
+        "use_https": use_https,
+        **url_config,
+    }
+
+
+def test_init(dvc, webhdfs_config):
+    fs = WebHDFSFileSystem(**webhdfs_config)
+    assert fs.fs_args["host"] == host
+    assert fs.fs_args["token"] == token
+    assert fs.fs_args["user"] == user
+    assert fs.fs_args["port"] == port
+    assert fs.fs_args["kerberos"] == kerberos
+    assert fs.fs_args["kerb_kwargs"] == {"principal": kerberos_principal}
+    assert fs.fs_args["proxy_to"] == proxy_to
+    assert fs.fs_args["use_https"] == use_https
+
+
+def test_verify_ssl(dvc, webhdfs_config, monkeypatch):
+    mock_session = create_autospec(requests.Session)
+    monkeypatch.setattr(requests, "Session", Mock(return_value=mock_session))
+    # can't have token at the same time as user or proxy_to
+    del webhdfs_config["token"]
+    fs = WebHDFSFileSystem(**webhdfs_config)
+    # ssl verify can't be set until after the file system is instantiated
+    fs.fs  # pylint: disable=pointless-statement
+    assert mock_session.verify == ssl_verify