Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

webhdfs: expose kerberos and https options #6936

Merged
merged 5 commits into from
Nov 7, 2021
Merged

webhdfs: expose kerberos and https options #6936

merged 5 commits into from
Nov 7, 2021

Conversation

gudmundur-heimisson
Copy link
Contributor

@gudmundur-heimisson gudmundur-heimisson commented Nov 7, 2021

This will re-enable some of the functionality lost after #6662.

I have not created a documentation pull request for this yet, since I expect there may be some discussion around the naming or organization of these options, since this is my first contribution to this repo and I'm not sure about all the conventions.

Once the naming is settled I will of course create a documentation PR as well.

Note that I have renamed webhdfs_token to token since that is what is indicated in the current docs.

Please let me know if you want the naming convention of these to be changed.

Fixes #6935

@gudmundur-heimisson gudmundur-heimisson requested a review from a team as a code owner November 7, 2021 16:25
@gudmundur-heimisson
Copy link
Contributor Author

I'm not sure if I linked the issue correctly but it is #6935

Copy link
Contributor

@efiop efiop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much! 🙏

@efiop
Copy link
Contributor

efiop commented Nov 7, 2021

Oh, looks like we also didn't update the docs before 🙁 We'll need to https://dvc.org/doc/command-reference/remote/modify accordingly.

@gudmundur-heimisson
Copy link
Contributor Author

I will create a pull request in the dvc.org repo and link here once it's up for the documentation.

dvc/fs/webhdfs.py Outdated Show resolved Hide resolved
Simplify config.pop.

Co-authored-by: Ruslan Kuprieiev <kupruser@gmail.com>
dvc/fs/webhdfs.py Outdated Show resolved Hide resolved
dvc/fs/webhdfs.py Outdated Show resolved Hide resolved
Co-authored-by: Ruslan Kuprieiev <kupruser@gmail.com>
@gudmundur-heimisson
Copy link
Contributor Author

Here is the documentation pull request:
iterative/dvc.org#3009

@efiop efiop added bugfix fixes bug fs: webhdfs Related to the WebHDFS filesystem labels Nov 7, 2021
@efiop efiop changed the title webhdfs: expose kerberos and https options (#6935) webhdfs: expose kerberos and https options Nov 7, 2021
@efiop efiop merged commit c5b1a73 into iterative:master Nov 7, 2021
shcheklein pushed a commit to iterative/dvc.org that referenced this pull request Nov 8, 2021
* Update WebHDFS docs pending iterative/dvc#6936

* Restyled by prettier

Co-authored-by: Restyled.io <commits@restyled.io>
"webhdfs_alias": str,
"kerberos": Bool,
"kerberos_principal": str,
"proxy_to": str,
Copy link
Contributor

@jorgeorpinel jorgeorpinel Nov 13, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should it be proxy_user or superuser per https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/Superusers.html ? Or hadoop_proxy_user to be precise.

"proxy to" could refer to many things...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes for this initial pull request I just copied the names of parameters from fsspec and left them as a first draft to have a discussion around.

I think proxy_user would be clearer, and in the hadoop docs I feel like I see proxy user more often than superuser.

If we do something like hadoop_proxy_user I feel like maybe we should prefix all the other options with hadoop or webhdfs as well to be consistent?

It seems like the other DVC protocol config options do not do this kind of prefixing except google drive, but I can see how it would make sense.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @gudmundur-heimisson . It's an interesting question and I'm sure @efiop will know best what to do. I personally do like the prefixing idea

Comment on lines +185 to +187
"ssl_verify": Any(Bool, str),
"token": str,
"use_https": Bool,
Copy link
Contributor

@jorgeorpinel jorgeorpinel Nov 13, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If my understanding of the Hadoop docs is correct, then swebhdfs protocol is webhdfs over https, but fsspec (and DVC) does not actually support swebhdfs protocol in URL strings and instead uses webhdfs with a use_https flag being true.

The question is if we want to create a new swebhdfs protocol to be consistent with the actual standard, or keep it as it is to avoid multiplying protocols on our end and just use this use_https flag.

As fsspec shows support for actually using swebhdfs in URL strings is rather inconsistent in the hadoop ecosystem it seems, so it wouldn't be crazy to just use webhdfs and then provide this flag.

Regarding the hadoop prefix see other comments.

"kerberos_principal": str,
"proxy_to": str,
"ssl_verify": Any(Bool, str),
"token": str,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this makes sense, with the previous caveat about prefixes.

@efiop
Copy link
Contributor

efiop commented Nov 17, 2021

@jorgeorpinel We've released this already, so I think renaming could wait as not critical. Also hadoop_ prefixes are a different topic for all remotes, as we've discussed during our meeting.

@skshetry
Copy link
Member

skshetry commented Dec 20, 2021

This requires system libraries for kerberos to be installed, so I'd suggest to roll back this change, or add it under a separate extras like we do for gssapi.

Strictly speaking, the requests-kerberos is not even our dependency.

@gudmundur-heimisson
Copy link
Contributor Author

You don't need the system libraries of kerberos to be installed to install requests_kerberos, you can get a ticket from another machine and use that, we do that for some of our processes that use service accounts.

As an FYI, if the kerberos capability is removed then this renders DVC unusable with secured hdfs clusters over webhdfs, which means it would not be usable in an enterprise setting.

If you decide to roll this back I would recommend rolling back all the way to before fsspec was used for webhdfs so that at least the client can be customized, since otherwise it is impossible to use with a secured cluster.

@skshetry
Copy link
Member

skshetry commented Dec 20, 2021

@gudmundur-heimisson, sorry I meant to rollback the dependency that you have added in webhdfs extras in setup.cfg, not the whole PR/functionality (was not able to comment on code changes somehow). fsspec also does not specify them directly and expects users to install them, and I think we should recommend the same here. Maybe we can print a message on the usages of those config if it's not installed?

I have been trying to install it on CI but one of it's indirect dependency krb5 needs to be built before it can be used which requires system libraries.

Also see https://github.com/iterative/dvc-webhdfs/blob/0ce2ab527eb3ab68f3e5420b32caf773330174e6/.github/workflows/tests.yaml#L31 as well.

@skshetry
Copy link
Member

@gudmundur-heimisson, does dvc[webhdfs_kerberos] work for you instead?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bugfix fixes bug fs: webhdfs Related to the WebHDFS filesystem
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow passing fsspec options to WebHDFS to enable Kerberos and https
4 participants