Commit 234d11a

Guðmundur Heimisson and restyled-commits authored
Update WebHDFS docs pending treeverse/dvc#6936 (#3009)
* Update WebHDFS docs pending treeverse/dvc#6936
* Restyled by prettier

Co-authored-by: Restyled.io <commits@restyled.io>
1 parent f2b8b8c commit 234d11a

File tree

2 files changed: +56 -54 lines changed

content/docs/command-reference/remote/add.md

Lines changed: 18 additions & 17 deletions

@@ -305,27 +305,28 @@ $ dvc remote add -d myremote hdfs://user@example.com/path
 **HDFS and WebHDFS:**

 Both remotes, HDFS and WebHDFS, allow using a Hadoop cluster as a remote
-repository. However, HDFS relies on `pyarrow` which in turn requires `libhdfs`,
-an interface to the Java Hadoop client, that must be installed separately.
-Meanwhile, WebHDFS has no need for this requirement as it communicates with the
-Hadoop cluster via a HTTP REST API using the Python libraries `HdfsCLI` and
-`requests`. The latter remote should be preferred by users who seek easier and
-more portable setups, at the expense of performance due to the added overhead of
-HTTP.
-
-One last note: WebHDFS does require enabling the HTTP REST API in the cluster by
-setting the configuration property `dfs.webhdfs.enabled` to `true` in
-`hdfs-site.xml`.
+repository. However, HDFS requires `libhdfs`, an interface to the Java Hadoop
+client, that must be installed separately. Meanwhile, WebHDFS has no need for
+this requirement as it communicates with the Hadoop cluster via a REST API.
+
+If your cluster is secured, then WebHDFS is commonly used with Kerberos and
+HTTPS; to enable these simply set `use_https` and `kerberos` to `true`. This
+will require you to run `kinit` before invoking DVC to make sure you have an
+active kerberos session.
+
+One last note: WebHDFS requires enabling the REST API in the cluster by setting
+the configuration property `dfs.webhdfs.enabled` to `true` in `hdfs-site.xml`.

 ```dvc
-$ dvc remote add -d myremote webhdfs://user@example.com/path
-$ dvc remote modify --local myremote user myuser
-$ dvc remote modify --local myremote token 'mytoken'
+$ dvc remote add -d myremote webhdfs://example.com/path
+$ dvc remote modify myremote use_https true
+$ dvc remote modify myremote kerberos true
+$ dvc remote modify --local myremote token SOME_BASE64_ENCODED_TOKEN
 ```

-> The user name and password may contain sensitive user info. Therefore, it's
-> safer to add it with the `--local` option, so it's written to a Git-ignored
-> config file. See `dvc remote modify` for a full list of WebHDFS parameters.
+> If `token` is used, it may contain sensitive user info. Therefore, it's safer
+> to add it with the `--local` option, so it's written to a Git-ignored config
+> file. See `dvc remote modify` for a full list of WebHDFS parameters.

 </details>
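The updated docs above say that WebHDFS talks to the Hadoop cluster over a REST API and that `token` is a delegation token obtained from that API. As a rough illustration of where such a token comes from, the sketch below builds the URL for the WebHDFS `GETDELEGATIONTOKEN` operation; the host, port (9870 is the Hadoop 3.x NameNode HTTP default), and renewer name are placeholders, not values from the commit.

```python
# Sketch: constructing the WebHDFS REST call that returns a delegation token.
# Host, port, and renewer are hypothetical placeholders.

def delegation_token_url(host: str, port: int = 9870, renewer: str = "myuser") -> str:
    """URL for the GETDELEGATIONTOKEN operation of the WebHDFS REST API."""
    return f"http://{host}:{port}/webhdfs/v1/?op=GETDELEGATIONTOKEN&renewer={renewer}"

print(delegation_token_url("example.com"))
```

On a real (Kerberos-authenticated) cluster, fetching this URL returns JSON whose `Token.urlString` field is the base64 URL-safe string that `dvc remote modify --local myremote token ...` expects.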

content/docs/command-reference/remote/modify.md

Lines changed: 38 additions & 37 deletions

@@ -857,69 +857,70 @@ Read more about by expanding the WebHDFS section in
 by HDFS. Read more about by expanding the WebHDFS section in
 [`dvc remote add`](/doc/command-reference/remote/add#supported-storage-types).

-- `url` - remote location:
+- `url` - remote location.

   ```dvc
   $ dvc remote modify myremote url webhdfs://user@example.com/path
   ```

-- `user` - user name to access the remote, can be empty in case of using `token`
-  or if using a `HdfsCLI` cfg file. May only be used when Hadoop security is
-  off. Defaults to current user as determined by `whoami`.
+  Only provide the `user` parameter if you are not using `kerberos` or `token`
+  authentication, since those authentication methods already contain the user
+  information.
+
+- `kerberos` - whether or not to enable kerberos authentication. Defaults to
+  `false`. Example:

   ```dvc
-  $ dvc remote modify --local myremote user myuser
+  $ dvc remote modify myremote kerberos true
   ```

-- `token` - Hadoop delegation token for WebHDFS, can be empty in case of using
-  `user` or if using a `HdfsCLI` cfg file. May be used when Hadoop security is
-  on.
+- `kerberos_principal` - kerberos principal to use. Useful if you have multiple
+  kerberos principals, for example for service accounts. If `kerberos` is
+  `false` this setting is ignored.

   ```dvc
-  $ dvc remote modify --local myremote token 'mytoken'
+  $ dvc remote modify myremote kerberos_principal some_principal_name
   ```

-- `hdfscli_config` - path to a `HdfsCLI` cfg file. WebHDFS access depends on
-  `HdfsCLI`, which allows the usage of a configuration file by default located
-  in `~/.hdfscli.cfg` (Linux). In the file, multiple aliases can be set with
-  their own connection parameters, like `url` or `user`. If using a cfg file,
-  `webhdfs_alias` can be set to specify which alias to use.
+- `proxy_to` - user to proxy as. Proxy user feature must be enabled on the
+  cluster, and the user must have the correct rights. For more information see
+  [the Hadoop documentation](https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/Superusers.html).
+  This setting is incompatible with `token`, since if a delegation token is used
+  the proxy user is embedded in the token information. If the cluster is secured
+  kerberos must be enabled for this to work.

   ```dvc
-  $ dvc remote modify --local myremote hdfscli_config \
-      `/path/to/.hdfscli.cfg`
+  $ dvc remote modify myremote proxy_to some_proxy_user
   ```

-  Sample configuration file:
+- `ssl_verify` - whether to verify SSL requests. Default is true when
+  `use_https` is enabled.

-  ```ini
-  [global]
-  default.alias = myalias
+  ```dvc
+  $ dvc remote modify myremote ssl_verify false
+  ```

-  [myalias.alias]
-  url = http://example.com/path
-  user = myuser
+- `token` - delegation token. For more information see
+  [the Hadoop documentation](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/delegation_tokens.html#Background:_Hadoop_Delegation_Tokens.).
+  This setting is incompatible with `proxy_to` or providing a `user` in the
+  `url` since that information is encoded in the token itself. This token must
+  be the base64 encoded URL safe token such as that
+  [returned by the WebHDFS API](https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/WebHDFS.html#Get_Delegation_Token).
+  On a secured cluster kerberos must be enabled for delegation tokens to be
+  used.

-  [production.alias]
-  url = http://prodexample.com/path
-  user = produser
+  ```dvc
+  $ dvc remote modify myremote token SOME_BASE64_ENCODED_TOKEN
   ```

-  See more information in the `HdfsCLI`
-  [docs](https://hdfscli.readthedocs.io/en/latest/quickstart.html#configuration).
-
-- `webhdfs_alias` - alias in a `HdfsCLI` cfg file to use. Only relevant if used
-  in conjunction with `hdfscli_config`. If not defined, `default.alias` in
-  `HdfsCLI` cfg file will be used instead.
+- `use_https` - whether to use `swebhdfs` or not. Note that DVC still expects
+  the protocol string in the `url` to be `webhdfs` and will fail if `swebhdfs`
+  is used.

   ```dvc
-  $ dvc remote modify --local myremote webhdfs_alias myalias
+  $ dvc remote modify myremote use_https true
   ```

-> The user name, token, webhdfs_alias, and hdfscli_config may contain sensitive
-> user info. Therefore, it's safer to add it with the `--local` option, so it's
-> written to a Git-ignored config file.
-
 </details>

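Each `dvc remote modify` command in the diff above persists one key into DVC's INI-style config file (`.dvc/config`, or the Git-ignored local config when `--local` is used). As a rough sketch of the resulting on-disk shape, using Python's `configparser` — the remote name and values mirror the diff's placeholders, and DVC's exact output formatting may differ:

```python
# Sketch of the INI-style section the `dvc remote modify` commands above
# end up writing. Section name and values are the diff's placeholders;
# real DVC formatting may differ slightly.
import configparser
import io

config = configparser.ConfigParser()
config['remote "myremote"'] = {
    "url": "webhdfs://example.com/path",
    "use_https": "true",
    "kerberos": "true",
    "kerberos_principal": "some_principal_name",
}

# Round-trip through text to show the on-disk shape.
buf = io.StringIO()
config.write(buf)
print(buf.getvalue())

# Reading it back recovers the same settings.
reread = configparser.ConfigParser()
reread.read_string(buf.getvalue())
```

Keeping `token` out of this file and in the `--local` config is exactly why the docs recommend the `--local` flag for it.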
<details>
