webhdfs: expose kerberos and https options #6936
I'm not sure if I linked the issue correctly, but it is #6935.
Thank you so much! 🙏
Oh, looks like we also didn't update the docs before 🙁 We'll need to update https://dvc.org/doc/command-reference/remote/modify accordingly.
I will create a pull request for the documentation in the dvc.org repo and link it here once it's up.
Simplify config.pop. Co-authored-by: Ruslan Kuprieiev <kupruser@gmail.com>
Co-authored-by: Ruslan Kuprieiev <kupruser@gmail.com>
Here is the documentation pull request:
* Update WebHDFS docs pending iterative/dvc#6936 * Restyled by prettier Co-authored-by: Restyled.io <commits@restyled.io>
"webhdfs_alias": str,
"kerberos": Bool,
"kerberos_principal": str,
"proxy_to": str,
Should it be proxy_user or superuser per https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/Superusers.html ? Or hadoop_proxy_user to be precise.
"proxy to" could refer to many things...
Yes, for this initial pull request I just copied the parameter names from fsspec and left them as a first draft to have a discussion around.
I think proxy_user would be clearer, and in the Hadoop docs I feel like I see proxy user more often than superuser.
If we do something like hadoop_proxy_user, I feel like maybe we should prefix all the other options with hadoop or webhdfs as well to be consistent?
It seems like the other DVC protocol config options do not do this kind of prefixing, except Google Drive, but I can see how it would make sense.
Thanks @gudmundur-heimisson. It's an interesting question and I'm sure @efiop will know best what to do. I personally do like the prefixing idea.
"ssl_verify": Any(Bool, str),
"token": str,
"use_https": Bool,
use_https - should it be hadoop_swebhdfs per https://hadoop.apache.org/docs/r3.1.0/api/org/apache/hadoop/fs/SWebHdfs.html ? Or is the point that it enables HTTPS (SWebHdfs being an implementation detail)?
ssl_verify - if it's Hadoop specific, should it be hadoop_ssl_verify?
If my understanding of the Hadoop docs is correct, the swebhdfs protocol is webhdfs over https, but fsspec (and DVC) does not actually support the swebhdfs protocol in URL strings and instead uses webhdfs with the use_https flag set to true.
The question is whether we want to create a new swebhdfs protocol to be consistent with the actual standard, or keep it as it is, to avoid multiplying protocols on our end, and just use this use_https flag.
As fsspec shows, support for actually using swebhdfs in URL strings seems rather inconsistent in the Hadoop ecosystem, so it wouldn't be crazy to just use webhdfs and provide this flag.
Regarding the hadoop prefix see other comments.
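To illustrate the "keep webhdfs and add a flag" option described above, here is a minimal sketch of how a webhdfs:// URL plus a use_https flag could resolve to an HTTP or HTTPS endpoint. The helper name `webhdfs_base_url` and the default port of 50070 are assumptions made for this example, not DVC or fsspec API; only the `/webhdfs/v1` path prefix comes from the WebHDFS REST API.

```python
from urllib.parse import urlsplit


def webhdfs_base_url(url: str, use_https: bool = False, default_port: int = 50070) -> str:
    """Resolve a webhdfs:// URL to a REST endpoint (hypothetical helper).

    The scheme in the URL stays "webhdfs"; HTTPS is selected by the flag
    rather than by a separate "swebhdfs" scheme.
    """
    parts = urlsplit(url)
    if parts.scheme != "webhdfs":
        raise ValueError(f"expected a webhdfs:// URL, got {url!r}")
    if not parts.hostname:
        raise ValueError(f"no host found in {url!r}")
    scheme = "https" if use_https else "http"
    port = parts.port or default_port  # assumed default, adjust per cluster
    return f"{scheme}://{parts.hostname}:{port}/webhdfs/v1"
```

With this shape, supporting swebhdfs:// later would only mean accepting a second scheme that implies `use_https=True`; the rest of the code path would stay the same.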
"kerberos_principal": str,
"proxy_to": str,
"ssl_verify": Any(Bool, str),
"token": str,
token - should it be hadoop_delegation_token to be explicit? Per https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/delegation_tokens.html#Background:_Hadoop_Delegation_Tokens
I think this makes sense, with the previous caveat about prefixes.
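For context on what these option names end up meaning on the wire, here is a rough sketch of how they could map onto WebHDFS REST query parameters. The parameter names `user.name`, `doas`, and `delegation` are from the WebHDFS REST API; the function `webhdfs_auth_params` is hypothetical, and the rule that a delegation token excludes the other identity options is an assumption about how a client would reasonably behave, not something confirmed in this thread.

```python
from typing import Dict, Optional


def webhdfs_auth_params(
    user: Optional[str] = None,
    proxy_to: Optional[str] = None,
    token: Optional[str] = None,
    kerberos: bool = False,
) -> Dict[str, str]:
    """Build WebHDFS query parameters from config options (illustrative only)."""
    if token is not None:
        # Assumption: a delegation token already encodes the identity,
        # so combining it with other identity options is rejected.
        if user is not None or proxy_to is not None or kerberos:
            raise ValueError("token cannot be combined with user, proxy_to or kerberos")
        return {"delegation": token}

    params: Dict[str, str] = {}
    if user is not None:
        params["user.name"] = user  # simple (non-Kerberos) authentication
    if proxy_to is not None:
        params["doas"] = proxy_to  # the Hadoop "proxy user" being impersonated
    return params
```

Seen this way, proxy_to corresponds to `doas` and token to `delegation`, which is the ambiguity the proposed hadoop_proxy_user and hadoop_delegation_token names try to remove.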
@jorgeorpinel We've released this already, so I think renaming could wait, as it is not critical. Also
This requires system libraries for kerberos to be installed, so I'd suggest rolling back this change, or adding it under a separate extras like we do for Strictly speaking, the
You don't need the Kerberos system libraries installed in order to install requests_kerberos; you can get a ticket from another machine and use that. We do this for some of our processes that use service accounts. As an FYI, if the kerberos capability is removed, DVC becomes unusable with secured hdfs clusters over webhdfs, which means it would not be usable in an enterprise setting. If you decide to roll this back, I would recommend rolling back all the way to before fsspec was used for webhdfs, so that at least the client can be customized; otherwise it is impossible to use with a secured cluster.
@gudmundur-heimisson, sorry, I meant to roll back the dependency that you have added in I have been trying to install it on CI but one of its indirect dependencies Also see https://github.com/iterative/dvc-webhdfs/blob/0ce2ab527eb3ab68f3e5420b32caf773330174e6/.github/workflows/tests.yaml#L31 as well.
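One way to reconcile the two positions above (keep Kerberos support, but do not force the dependency on every install) is to import requests_kerberos lazily, only when the kerberos option is enabled. The sketch below is an assumption about how that could look, not the code in this PR; `kerberos_auth_or_none` is a hypothetical name, while `HTTPKerberosAuth` and its `principal` argument come from the requests_kerberos package.

```python
import importlib
from typing import Any, Optional


def kerberos_auth_or_none(kerberos: bool, principal: Optional[str] = None) -> Any:
    """Return a Kerberos auth object only when requested (illustrative only)."""
    if not kerberos:
        # Nothing is imported, so installs without Kerberos keep working.
        return None
    try:
        module = importlib.import_module("requests_kerberos")
    except ImportError as exc:
        raise RuntimeError(
            "kerberos is enabled but requests_kerberos is not installed"
        ) from exc
    return module.HTTPKerberosAuth(principal=principal)
```

Users of secured clusters would then install the optional package themselves or through an extras group, and everyone else would never touch it.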
@gudmundur-heimisson, does
This will re-enable some of the functionality lost after #6662.
I have not created a documentation pull request for this yet. This is my first contribution to this repo and I'm not sure about all the conventions, so I expect there may be some discussion around the naming or organization of these options.
Once the naming is settled I will of course create a documentation PR as well.
Note that I have renamed webhdfs_token to token, since that is what is indicated in the current docs.
Please let me know if you want the naming convention of these to be changed.
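To make the option set concrete, here is a small stdlib-only sketch of the options and value types involved. The names and types are taken from the schema fragments quoted in this conversation; the validator itself (`WEBHDFS_OPTION_TYPES`, `check_webhdfs_options`) is hypothetical and uses strict `isinstance` checks, whereas DVC's real schema may coerce values.

```python
from typing import Any, Dict, Tuple

# Option name -> accepted Python types (ssl_verify may be a flag or a string).
WEBHDFS_OPTION_TYPES: Dict[str, Tuple[type, ...]] = {
    "kerberos": (bool,),
    "kerberos_principal": (str,),
    "proxy_to": (str,),
    "ssl_verify": (bool, str),
    "token": (str,),
    "use_https": (bool,),
}


def check_webhdfs_options(config: Dict[str, Any]) -> Dict[str, Any]:
    """Reject unknown option names and wrongly typed values (illustrative only)."""
    for key, value in config.items():
        allowed = WEBHDFS_OPTION_TYPES.get(key)
        if allowed is None:
            raise KeyError(f"unknown WebHDFS option: {key}")
        if not isinstance(value, allowed):
            raise TypeError(f"{key} has unsupported type {type(value).__name__}")
    return dict(config)
```

Under this sketch the old webhdfs_token name is rejected as unknown, which matches the rename to token described above.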
Fixes #6935