This guide provides instructions for users to operate the HDFS cluster in OpenPAI.
- Goal
- Build
- Configuration
- Deployment
- Upgrading
- Service Monitoring
- High Availability
- Access HDFS Data
- Reference
The Hadoop Distributed File System (HDFS) in OpenPAI serves as the central storage for both users' applications and data. Application logs are also stored on HDFS.
The HDFS service image can be built together with other services by running this command:
python paictl.py image build -p /path/to/configuration/
HDFS is part of the hadoop-run image, so it can be built separately with the following command:
python paictl.py image build -p /path/to/configuration/ -n hadoop-run
The HDFS name node and data node each have their own configuration files, located in the name node configuration and data node configuration directories respectively. All HDFS related properties are in the files core-site.xml and hdfs-site.xml. Please refer to core-site.xml and hdfs-site.xml for detailed property descriptions.
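If you want to check the rendered configuration that the running services actually use, one option is to read it back from the Kubernetes configuration maps named in the list below. This is only a sketch and assumes kubectl is configured against the OpenPAI cluster:
# dump the name node configuration map, which contains core-site.xml and hdfs-site.xml
kubectl get configmap hadoop-name-node-configuration -o yaml
# the data node equivalent
kubectl get configmap hadoop-data-node-configuration -o yaml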
HDFS's data storage path on a machine is configured by cluster.data-path in the file services-configuration.yaml. All HDFS related data, on both the name node and the data nodes, is stored under this path (see the example after the list below).
On the name node, the storage layout is:
- Configuration Data: its path is defined by the hadoop-name-node-configuration configuration map.
- Name Data: stored in the hdfs/name directory under the storage path.
- Temp Data: stored in the hadooptmp/namenode directory under the storage path.

On the data nodes, the storage layout is:
- Configuration Data: its path is defined by the hadoop-data-node-configuration configuration map.
- Data Storage: HDFS blocks are stored in the hdfs/data directory under the storage path.
- Host Configuration: its path is defined by the host-configuration configuration map.
- Temp Data: stored in the hadooptmp/datanode directory under the storage path.
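As a quick sanity check, this layout can be inspected directly on the hosts. The commands below are only illustrative and assume the default data path /datastorage; substitute whatever cluster.data-path is set to in your services-configuration.yaml:
# on the name node machine
ls /datastorage/hdfs/name            # name data
ls /datastorage/hadooptmp/namenode   # temp data

# on a data node machine
ls /datastorage/hdfs/data            # block storage
ls /datastorage/hadooptmp/datanode   # temp data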
HDFS can be deployed when starting the OpenPAI services with the command:
python paictl.py service start -p /service/configuration/path
The name node and data node services can also be started separately by specifying the service name in the command:
python paictl.py service start -p /service/configuration/path -n hadoop-name-node
python paictl.py service start -p /service/configuration/path -n hadoop-data-node
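After the services are started, one simple way to verify that the HDFS pods are up is to query Kubernetes directly. This assumes kubectl access to the cluster; the exact pod names may differ between releases:
# list the HDFS related pods
kubectl get pods | grep hadoop-name-node
kubectl get pods | grep hadoop-data-node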
It is recommended to back up the name node data before upgrading the cluster. Please refer to rolling upgrade for detailed instructions.
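A minimal backup sketch is shown below. It assumes you have shell access to the name node (host or container), that the hdfs client is on the PATH, and that the name data lives under the default /datastorage path described in the configuration part above; adjust the path to your cluster.data-path value:
# put HDFS into safe mode and flush the latest namespace image to disk
hdfs dfsadmin -safemode enter
hdfs dfsadmin -saveNamespace

# archive the name node metadata directory (path assumes the default data path)
tar -czf namenode-backup-$(date +%Y%m%d).tar.gz /datastorage/hdfs/name

# leave safe mode once the backup is done
hdfs dfsadmin -safemode leave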
HDFS exposes various metrics for monitoring and debugging. Please refer to HDFS Metrics for all the detailed metrics and their explanations.
The Prometheus service will collect those metrics and monitor HDFS in real time. This work is still in progress.
- Data Node: all metrics can be retrieved with the command
curl http://DATA_NODE_ADDRESS:5075/jmx
- Name Node: all metrics can be retrieved with the command
curl http://NAME_NODE_ADDRESS:5070/jmx
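The JMX endpoint also accepts a qry parameter to narrow the output to a single bean, which is convenient for scripting. The bean name below is just one example and assumes the standard NameNode bean naming:
# fetch only the FSNamesystemState bean from the name node
curl "http://NAME_NODE_ADDRESS:5070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystemState"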
Currently the OpenPAI management tool doesn't deploy HDFS in a High Availability (HA) fashion. This will be added in a future release. For a solution to enable the HA feature, please refer to HDFS High Availability.
Data on HDFS can be accessed in various ways. Users can choose the proper way according to their needs.
WebHDFS provides a set of REST APIs and is our recommended way to access data. WebHDFS REST API contains the detailed instructions for the APIs. The REST server URI is http://hdfs-name-node-address:5070. The hdfs-name-node-address is the address of the machine with the pai-master label set to true in the configuration file cluster-configuration.yaml. Following are two simple examples showing how the APIs can be used to create and delete a file.
- Create a File
Suppose we want to create the file test_file under the directory /test. The first step is to submit a request without redirection and without data:
curl -i -X PUT "http://hdfs-name-node-address:5070/webhdfs/v1/test/test_file?op=CREATE"
This command returns a redirect pointing to the data node where the file should be written; the location URI is carried in the Location header of the response.
Then run the following command with this returned URI to write the file data:
curl -i -X PUT -T file-data-to-write returned-location-uri
- Delete a File
To delete the file created in the above example, run the following command:
curl -i -X DELETE "http://hdfs-name-node-address:5070/webhdfs/v1/test/test_file?op=DELETE"
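For completeness, here is a hedged sketch of two more read-only operations with the same API; op=OPEN redirects to a data node, so curl needs -L to follow the redirect:
# read the file content back
curl -L "http://hdfs-name-node-address:5070/webhdfs/v1/test/test_file?op=OPEN"

# list the /test directory
curl "http://hdfs-name-node-address:5070/webhdfs/v1/test?op=LISTSTATUS"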
The HDFS shell commands are shipped with the Hadoop package. Please download the version you need from Hadoop Releases, then extract it on your machine by running:
tar -zxvf hadoop-package-name
All commands are located in the bin directory. Please refer to the HDFS Commands Guide for detailed command descriptions. Every file in HDFS is specified by a URI following the pattern hdfs://hdfs-name-node-address:name-node-port/parent/child. Here the name-node-port is 9000. The hdfs-name-node-address is the address of the machine with the pai-master label set to true in the configuration file cluster-configuration.yaml.
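As an illustration, a few common operations with the shell client; the local and remote paths here are hypothetical:
# list the HDFS root directory
bin/hdfs dfs -ls hdfs://hdfs-name-node-address:9000/

# upload a local file and read it back
bin/hdfs dfs -put local-file.txt hdfs://hdfs-name-node-address:9000/test/
bin/hdfs dfs -cat hdfs://hdfs-name-node-address:9000/test/local-file.txt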
Data on HDFS can be browsed by pointing a web browser to http://hdfs-name-node-address:5070/explorer.html after the cluster is ready. The hdfs-name-node-address is the address of the machine with the pai-master label set to true in the configuration file cluster-configuration.yaml. From Hadoop release 2.9.0 onwards, users can upload or delete files on the web portal; on earlier releases users can only browse the data.
The hadoop-hdfs-fuse tool can mount HDFS onto the local file system so that users can access the data with ordinary Linux commands. The tool can be installed with the following commands on an Ubuntu system:
# add the CDH5 repository
wget http://archive.cloudera.com/cdh5/one-click-install/trusty/amd64/cdh5-repository_1.0_all.deb
sudo dpkg -i cdh5-repository_1.0_all.deb
# install the hadoop-hdfs-fuse tool
sudo apt-get update
sudo apt-get install hadoop-hdfs-fuse
# mount to local system
mkdir -p your-mount-directory
sudo hadoop-fuse-dfs dfs://hdfs-name-node-address:9000 your-mount-directory
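Once mounted, the HDFS namespace behaves like a regular directory tree; for example (paths are illustrative):
# browse and copy data through the mount point with regular Linux commands
ls your-mount-directory
cp your-mount-directory/test/test_file /tmp/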
The Java APIs allow users to access data from Java programs. The detailed HDFS API interfaces can be found in the HDFS API Doc.
The C API is provided by libhdfs library and it only supports a subset of the HDFS operations. Please follow the instructions on C APIs for details.
The Python API can be installed with the command:
pip install hdfs
Please refer to HdfsCLI for details.