
HDFS

This guide provides instructions for operating the HDFS cluster in OpenPAI.


Goal

The Hadoop Distributed File System (HDFS) in OpenPAI serves as central storage for both users' applications and data. Application logs are also stored on HDFS.

Build

The HDFS service image can be built together with other services by running this command:

python paictl.py image build -p /path/to/configuration/

HDFS is part of the hadoop-run image, which can be built separately with the following command:

python paictl.py image build -p /path/to/configuration/ -n hadoop-run

Configuration

Properties Configuration

The HDFS name node and data node each have their own configuration files. They are located in the name node configuration and data node configuration respectively. All HDFS-related properties are in the files core-site.xml and hdfs-site.xml. Please refer to core-site.xml and hdfs-site.xml for detailed property descriptions.

Storage Path

HDFS's data storage path on a machine is configured by cluster.data-path in the file services-configuration.yaml. All HDFS-related data, on both the name node and the data node, is stored under this path.

Name Node

  • Configuration Data: Its path is defined by the hadoop-name-node-configuration configuration map.
  • Name Data: It is stored in the hdfs/name directory under the storage path.
  • Temp Data: It is stored in the hadooptmp/namenode directory under the storage path.

Data Node

  • Configuration Data: Its path is defined by the hadoop-data-node-configuration configuration map.
  • Data Storage: It is stored in the hdfs/data directory under the storage path.
  • Host Configuration: Its path is defined by the host-configuration configuration map.
  • Temp Data: It is stored in the hadooptmp/datanode directory under the storage path.

Deployment

HDFS is deployed when the OpenPAI services are started with the command:

python paictl.py service start -p /service/configuration/path

The name node and data node services can also be started separately by specifying the service name in the command:

python paictl.py service start -p /service/configuration/path -n hadoop-name-node
python paictl.py service start -p /service/configuration/path -n hadoop-data-node

Upgrading

It is recommended to back up the name node data before upgrading the cluster. Please refer to rolling upgrade for detailed instructions.

Service Monitoring

Metrics

HDFS exposes various metrics for monitoring and debugging. Please refer to HDFS Metrics for the detailed metrics and their explanations.

Monitoring

Monitoring via Prometheus

The Prometheus service collects these metrics and monitors HDFS in real time. This is still work in progress.

Monitoring via HTTP API

  • Data Node: all metrics can be retrieved with the command curl http://DATA_NODE_ADDRESS:5075/jmx
  • Name Node: all metrics can be retrieved with the command curl http://NAME_NODE_ADDRESS:5070/jmx
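
As a minimal sketch, the same JMX endpoint can also be queried from Python using only the standard library; the name node address below is a placeholder for your pai-master machine.

```python
# Minimal sketch: fetch name node metrics from the JMX endpoint shown above.
# NAME_NODE_ADDRESS is a placeholder -- substitute your pai-master machine's address.
import json
from urllib.request import urlopen

NAME_NODE_ADDRESS = "hdfs-name-node-address"

with urlopen(f"http://{NAME_NODE_ADDRESS}:5070/jmx") as response:
    beans = json.load(response)["beans"]

# Each bean groups related metrics; the remaining key/value pairs are the metric values.
for bean in beans:
    print(bean["name"])
```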

High Availability

Currently the OpenPAI management tool does not deploy HDFS in a High Availability (HA) fashion. This will be added in a future release. For a solution to enable the HA feature, please refer to HDFS High Availability.

Access HDFS Data

Data on HDFS can be accessed in various ways. Users can choose the proper way according to their needs.

WebHDFS

WebHDFS provides a set of REST APIs and is our recommended way to access data. The WebHDFS REST API documentation contains detailed instructions for the APIs. The REST server URI is http://hdfs-name-node-address:5070, where hdfs-name-node-address is the address of the machine with the pai-master label set to true in the configuration file cluster-configuration.yaml. The following are two simple examples showing how the APIs can be used to create and delete a file.

  1. Create a File
    Suppose we want to create the file test_file under the directory /test. The first step is to submit a request without redirection and without data, using the command:
curl -i -X PUT "http://hdfs-name-node-address:5070/webhdfs/v1/test/test_file?op=CREATE"

This command returns a redirect to the data node where the file should be written. The location URI will look like

http://hdfs-data-node-address:5075/webhdfs/v1/test/test_file?op=CREATE&namenoderpcaddress=hdfs-name-node-address:9000&createflag=&createparent=true&overwrite=false

Then run the following command with this URI to write the file data:

curl -i -X PUT -T file-data-to-write returned-location-uri
  2. Delete a File
    To delete the file created in the above example, run the following command:
curl -i -X DELETE "http://hdfs-name-node-address:5070/webhdfs/v1/test/test_file?op=DELETE"
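
The same two-step create and the delete can also be scripted. The following is a minimal sketch using the third-party requests package; the addresses, paths, and local file name are placeholders taken from the examples above.

```python
# Minimal sketch of the WebHDFS two-step create and delete shown above.
# Requires the 'requests' package; addresses and paths are placeholders.
import requests

NAME_NODE = "http://hdfs-name-node-address:5070"
PATH = "/webhdfs/v1/test/test_file"

# Step 1: ask the name node where to write; it answers with a redirect to a data node.
step1 = requests.put(f"{NAME_NODE}{PATH}?op=CREATE", allow_redirects=False)
data_node_url = step1.headers["Location"]

# Step 2: send the file contents to the returned data node location.
with open("file-data-to-write", "rb") as f:
    requests.put(data_node_url, data=f)

# Delete the file again, as in the second curl example.
requests.delete(f"{NAME_NODE}{PATH}?op=DELETE")
```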

HDFS Command

The commands are available in the Hadoop package. Please download the version you need from Hadoop Releases, then extract it on your machine by running:

tar -zxvf hadoop-package-name

All commands are located in the bin directory. Please refer to the HDFS Command Guide for detailed command descriptions. Files in HDFS are specified by a URI following the pattern hdfs://hdfs-name-node-address:name-node-port/parent/child, where name-node-port is 9000 and hdfs-name-node-address is the address of the machine with the pai-master label set to true in the configuration file cluster-configuration.yaml.
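
As an illustration of this URI pattern, the sketch below shells out to the extracted Hadoop distribution from Python to list the HDFS root directory. The hadoop-2.9.0 directory name and the name node address are assumptions; adjust them to the package you downloaded and to your cluster.

```python
# Minimal sketch: invoke the HDFS CLI from Python using the hdfs:// URI pattern above.
# HDFS_BIN and NAME_NODE are placeholders for your extracted package and cluster address.
import subprocess

HDFS_BIN = "./hadoop-2.9.0/bin/hdfs"      # assumed location inside the extracted package
NAME_NODE = "hdfs-name-node-address"      # pai-master machine from cluster-configuration.yaml

result = subprocess.run(
    [HDFS_BIN, "dfs", "-ls", f"hdfs://{NAME_NODE}:9000/"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)
```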

Web Portal

Data on HDFS can be accessed by pointing your web browser to http://hdfs-name-node-address:5070/explorer.html after the cluster is ready. The hdfs-name-node-address is the address of the machine with the pai-master label set to true in the configuration file cluster-configuration.yaml. From release 2.9.0 on, users can upload or delete files through the web portal. On earlier releases, users can only browse the data.

Mountable HDFS

The hadoop-hdfs-fuse tool can mount HDFS on the local file system, so users can access the data with ordinary Linux commands. The tool can be installed with the following commands on an Ubuntu system:

# add the CDH5 repository
wget http://archive.cloudera.com/cdh5/one-click-install/trusty/amd64/cdh5-repository_1.0_all.deb
sudo dpkg -i cdh5-repository_1.0_all.deb
# install the hadoop-hdfs-fuse tool
sudo apt-get update
sudo apt-get install hadoop-hdfs-fuse
# mount to local system
mkdir -p your-mount-directory
sudo hadoop-fuse-dfs dfs://hdfs-name-node-address:9000 your-mount-directory
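
Once mounted, the HDFS tree behaves like a local directory, so ordinary file APIs work against it. The following is a minimal sketch assuming the mount point was created at /mnt/hdfs and that the /test/test_file from the WebHDFS example exists.

```python
# Minimal sketch: access HDFS data through the FUSE mount with ordinary file APIs.
# /mnt/hdfs is an assumed mount point -- substitute your-mount-directory from above.
import os

MOUNT_POINT = "/mnt/hdfs"

# List the top-level HDFS directories as if they were local folders.
for entry in sorted(os.listdir(MOUNT_POINT)):
    print(entry)

# Read a file previously created in HDFS as if it were a normal local file.
with open(os.path.join(MOUNT_POINT, "test", "test_file")) as f:
    print(f.read())
```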

API

Java API

The Java APIs allow users to access data from Java programs. The detailed HDFS API interfaces can be found in the HDFS API Doc.

C API

The C API is provided by the libhdfs library and supports only a subset of the HDFS operations. Please follow the instructions in C APIs for details.

Python API

The Python API can be installed with the command:

pip install hdfs

Please refer to HdfsCLI for details.
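
The following is a minimal sketch of how the installed HdfsCLI package talks to the cluster over WebHDFS; the user name and paths are placeholders.

```python
# Minimal sketch using the HdfsCLI package installed above (pip install hdfs).
# The name node address, user name, and paths are placeholders for your cluster.
from hdfs import InsecureClient

# HdfsCLI talks to the WebHDFS endpoint exposed on the name node (port 5070 here).
client = InsecureClient("http://hdfs-name-node-address:5070", user="hadoop")

# Write a small text file, list its parent directory, then read it back.
client.write("/test/hello.txt", data="hello from HdfsCLI\n", overwrite=True)
print(client.list("/test"))
with client.read("/test/hello.txt") as reader:
    print(reader.read().decode("utf-8"))
```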

Reference

  1. Hadoop reference doc