This guide provides instructions for users to operate the HDFS cluster in OpenPAI.
- Goal
- Build
- Configuration
- Deployment
- Upgrading
- Service Monitoring
- High Availability
- Access HDFS Data
- Reference
The Hadoop Distributed File System (HDFS) in OpenPAI serves as the central storage for both users' applications and data. Application logs are also stored on HDFS.
The HDFS service image can be built together with other services by running this command:
python paictl.py image build -p /path/to/configuration/
HDFS is part of the hadoop-run image, so it can be built separately with the following command:
python paictl.py image build -p /path/to/configuration/ -n hadoop-run
The HDFS name node and data node each have their own configuration files, located in the name node configuration and data node configuration directories respectively. All HDFS related properties are in the files core-site.xml and hdfs-site.xml. Please refer to core-site.xml and hdfs-site.xml for detailed property descriptions.
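If you want to check the rendered configuration that the running services actually use, one option is to read it back from the Kubernetes configuration maps named in the list below. This is only a sketch and assumes kubectl is configured against the OpenPAI cluster:
# dump the name node configuration map, which contains core-site.xml and hdfs-site.xml
kubectl get configmap hadoop-name-node-configuration -o yaml
# the data node equivalent
kubectl get configmap hadoop-data-node-configuration -o yaml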
HDFS's data storage path on a machine is configured by cluster.data-path in the file services-configuration.yaml. All HDFS related data, on both the name node and the data nodes, is stored under this path (see the example after the list below).
On the name node, the storage layout is:
- Configuration Data: its path is defined by the hadoop-name-node-configuration configuration map.
- Name Data: stored in the hdfs/name directory under the storage path.
- Temp Data: stored in the hadooptmp/namenode directory under the storage path.

On the data nodes, the storage layout is:
- Configuration Data: its path is defined by the hadoop-data-node-configuration configuration map.
- Data Storage: HDFS blocks are stored in the hdfs/data directory under the storage path.
- Host Configuration: its path is defined by the host-configuration configuration map.
- Temp Data: stored in the hadooptmp/datanode directory under the storage path.
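As a quick sanity check, this layout can be inspected directly on the hosts. The commands below are only illustrative and assume the default data path /datastorage; substitute whatever cluster.data-path is set to in your services-configuration.yaml:
# on the name node machine
ls /datastorage/hdfs/name            # name data
ls /datastorage/hadooptmp/namenode   # temp data

# on a data node machine
ls /datastorage/hdfs/data            # block storage
ls /datastorage/hadooptmp/datanode   # temp data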
HDFS can be deployed when starting the OpenPAI services with the command:
python paictl.py service start -p /service/configuration/path
The name node and data node services can also be started separately by specifying the service name in the command:
python paictl.py service start -p /service/configuration/path -n hadoop-name-node
python paictl.py service start -p /service/configuration/path -n hadoop-data-node
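After the services are started, one simple way to verify that the HDFS pods are up is to query Kubernetes directly. This assumes kubectl access to the cluster; the exact pod names may differ between releases:
# list the HDFS related pods
kubectl get pods | grep hadoop-name-node
kubectl get pods | grep hadoop-data-node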
It is recommended to back up the name node data before upgrading the cluster. Please refer to rolling upgrade for detailed instructions.
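A minimal backup sketch is shown below. It assumes you have shell access to the name node (host or container), that the hdfs client is on the PATH, and that the name data lives under the default /datastorage path described in the configuration part above; adjust the path to your cluster.data-path value:
# put HDFS into safe mode and flush the latest namespace image to disk
hdfs dfsadmin -safemode enter
hdfs dfsadmin -saveNamespace

# archive the name node metadata directory (path assumes the default data path)
tar -czf namenode-backup-$(date +%Y%m%d).tar.gz /datastorage/hdfs/name

# leave safe mode once the backup is done
hdfs dfsadmin -safemode leave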
HDFS exposes various metrics for monitoring and debugging. Please refer to HDFS Metrics for all the detailed metrics and their explanations.
The Prometheus service will collect those metrics and monitor HDFS in real time. This work is still in progress.
- Data Node: all metrics can be retrieved with the command
curl http://DATA_NODE_ADDRESS:5075/jmx
- Name Node: all metrics can be retrieved with the command
curl http://NAME_NODE_ADDRESS:5070/jmx
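The JMX endpoint also accepts a qry parameter to narrow the output to a single bean, which is convenient for scripting. The bean name below is just one example and assumes the standard NameNode bean naming:
# fetch only the FSNamesystemState bean from the name node
curl "http://NAME_NODE_ADDRESS:5070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystemState"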
Currently the OpenPAI management tool doesn't deploy HDFS in a High Availability (HA) fashion. This will be added in a future release. For a solution to enable the HA feature, please refer to HDFS High Availability.
Data on HDFS can be accessed in various ways. Users can choose the proper way according to their needs.
WebHDFS provides a set of REST APIs and is our recommended way to access data. WebHDFS REST API contains the detailed instructions for the APIs. The REST server URI is http://hdfs-name-node-address:5070. The hdfs-name-node-address is the address of the machine with the pai-master label set to true in the configuration file cluster-configuration.yaml. Following are two simple examples showing how the APIs can be used to create and delete a file.
- Create a File
Suppose we want to create the file test_file under the directory /test. The first step is to submit a request without redirection and without data:
curl -i -X PUT "http://hdfs-name-node-address:5070/webhdfs/v1/test/test_file?op=CREATE"
This command returns a redirect pointing to the data node where the file should be written; the location URI is carried in the Location header of the response.
Then run the following command with this returned URI to write the file data:
curl -i -X PUT -T file-data-to-write returned-location-uri
- Delete a File
To delete the file created in the above example, run the following command:
curl -i -X DELETE "http://hdfs-name-node-address:5070/webhdfs/v1/test/test_file?op=DELETE"
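For completeness, here is a hedged sketch of two more read-only operations with the same API; op=OPEN redirects to a data node, so curl needs -L to follow the redirect:
# read the file content back
curl -L "http://hdfs-name-node-address:5070/webhdfs/v1/test/test_file?op=OPEN"

# list the /test directory
curl "http://hdfs-name-node-address:5070/webhdfs/v1/test?op=LISTSTATUS"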
The HDFS shell commands are shipped with the Hadoop package. Please download the version you need from Hadoop Releases, then extract it on your machine by running:
tar -zxvf hadoop-package-name
All commands are located in the bin directory. Please refer to the HDFS Commands Guide for detailed command descriptions. Every file in HDFS is specified by a URI following the pattern hdfs://hdfs-name-node-address:name-node-port/parent/child. Here the name-node-port is 9000. The hdfs-name-node-address is the address of the machine with the pai-master label set to true in the configuration file cluster-configuration.yaml.
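As an illustration, a few common operations with the shell client; the local and remote paths here are hypothetical:
# list the HDFS root directory
bin/hdfs dfs -ls hdfs://hdfs-name-node-address:9000/

# upload a local file and read it back
bin/hdfs dfs -put local-file.txt hdfs://hdfs-name-node-address:9000/test/
bin/hdfs dfs -cat hdfs://hdfs-name-node-address:9000/test/local-file.txt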
Data on HDFS can be browsed by pointing a web browser to http://hdfs-name-node-address:5070/explorer.html after the cluster is ready. The hdfs-name-node-address is the address of the machine with the pai-master label set to true in the configuration file cluster-configuration.yaml. From Hadoop release 2.9.0 onwards, users can upload or delete files on the web portal; on earlier releases users can only browse the data.
The hadoop-hdfs-fuse tool can mount HDFS onto the local file system so that users can access the data with ordinary Linux commands. The tool can be installed with the following commands on an Ubuntu system:
# add the CDH5 repository
wget http://archive.cloudera.com/cdh5/one-click-install/trusty/amd64/cdh5-repository_1.0_all.deb
sudo dpkg -i cdh5-repository_1.0_all.deb
# install the hadoop-hdfs-fuse tool
sudo apt-get update
sudo apt-get install hadoop-hdfs-fuse
# mount to local system
mkdir -p your-mount-directory
sudo hadoop-fuse-dfs dfs://hdfs-name-node-address:9000 your-mount-directory
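Once mounted, the HDFS namespace behaves like a regular directory tree; for example (paths are illustrative):
# browse and copy data through the mount point with regular Linux commands
ls your-mount-directory
cp your-mount-directory/test/test_file /tmp/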
The Java APIs allow users to access data from Java programs. The detailed HDFS API interfaces can be found in the HDFS API Doc.
The C API is provided by libhdfs library and it only supports a subset of the HDFS operations. Please follow the instructions on C APIs for details.
The Python API can be installed with the command:
pip install hdfs
Please refer to HdfsCLI for details.