HPC-NOW, start your HPC journey in the cloud now, with no operation workload!
A full-stack HPC solution in the cloud, for the HPC community.
Contributions are highly welcomed and respected. Please see CONTRIBUTING.
This project is sponsored by the OpenAtom Foundation.
- 1. Project Background
- 2. Core Components
- 3. How-To: Build, Install, Run, and Use
- 4. Contributing
- 5. Appendix: HPC-NOW Directories
Cloud High-Performance Computing (Cloud HPC) differs significantly from on-premise HPC. Cloud services bring high scalability and flexibility to High-Performance Computing. However, most HPC users are not familiar with building and maintaining HPC services in the cloud. The technical barrier of cloud computing is very high for researchers, engineers, and developers in different scientific and engineering domains, e.g. energy, chemistry, physics, materials, bioscience.
In order to make it super easy to start and manage HPC workloads in the cloud, we have been developing this project: HPC-NOW. NOW stands for:
- Start HPC workloads in the cloud NOW, immediately, in minutes.
- Manage HPC workloads with (almost) No Operation Workload.
Currently, the HPC-NOW platform supports 8 popular cloud platforms, listed below:
- Alibaba Cloud, HPC-NOW Internal Code: CLOUD_A
- Tencent Cloud, HPC-NOW Internal Code: CLOUD_B
- Amazon Web Services, HPC-NOW Internal Code: CLOUD_C
- Huawei Cloud, HPC-NOW Internal Code: CLOUD_D
- Baidu BCE, HPC-NOW Internal Code: CLOUD_E
- Microsoft Azure, HPC-NOW Internal Code: CLOUD_F
- Google Cloud Platform, HPC-NOW Internal Code: CLOUD_G
- Volcano Engine by ByteDance, HPC-NOW Internal Code: CLOUD_H
DEMO: How easy it is to create a cloud HPC cluster using HPC-NOW?
Demo: You can easily manage multiple clusters across multiple clouds.
Thanks to Terraform and OpenTofu for making it possible to orchestrate cloud resources in a unified and simple way.
In this project, we are developing several components:
- installer : HPC-NOW service installer. It requires temporary administrator or root privilege to run; the reasons are listed in the installer section of this document.
- hpcopr : HPC Operator. The main component that manages the HPC clusters, users, jobs, data, monitoring, usage, etc.
- now-crypto : An independent cryptography module (AES-128-ECB-PKCS#7) that encrypts and decrypts the files containing sensitive information.
- hpcmgr : A utility running in every cluster's master node to communicate with the operator.
- infra-as-code : Infrastructure codes in HCL format.
- scripts : Shell scripts to initialize the clusters, install applications, etc.
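The now-crypto entry above names AES-128-ECB with PKCS#7 padding. As a small illustration of the padding rule only (this is not the now-crypto implementation, and `pkcs7_pad_len` is a hypothetical helper), PKCS#7 always appends between 1 and 16 bytes so that the padded length is a multiple of the 16-byte AES block:

```shell
# PKCS#7 pads the input up to a full block; AES uses 16-byte blocks.
# pkcs7_pad_len is a hypothetical helper for illustration only.
pkcs7_pad_len() {
  block=16
  echo $(( block - ($1 % block) ))
}

pkcs7_pad_len 20   # a 20-byte input needs 12 padding bytes
pkcs7_pad_len 32   # an exact multiple still gets a full 16-byte block
```

Each padding byte carries the padding length as its value, which is how the decrypting side knows how many bytes to strip.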
The high-level architecture of this project is:
NOTICE: This project integrates several third-party components at execution level. Please see the NOTICE.
The HPC-NOW platform is very easy to build, run, and use. It is also cross-platform, which means you can run the HPC-NOW on Microsoft Windows, GNU/Linux (with APT, DNF or YUM), and macOS (Darwin).
Note 1: Currently, only the x86_64 platform is supported. If you are using other CPU platforms, please let us know.
Note 2: Instead of compiling/building from the source code, you can download pre-built executables/binaries from the release of this repository. In this case, the dev/build tools (git, gcc, clang, or mingw-w64) are NOT needed.
Note 3: The HPC-NOW relies on some fundamental system utilities. In most cases, these utilities have been included in the OS distros. See the list below. If you are not sure whether the utilities are installed or not, please run the commands in a terminal/command prompt window.
- Microsoft Windows: curl, tar, ssh, scp
- GNU/Linux Distros: curl, tar, unzip, ssh, scp
- macOS (Darwin): curl, tar, unzip, ssh, scp
If the utility curl is not pre-installed, please install it manually from the official site or with the package manager (e.g. yum, apt), and add its location to the PATH environment variable. Usually, tar, unzip, ssh, and scp are pre-installed.
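For UNIX-like systems, the check described above can be scripted. The following is a minimal sketch, assuming a POSIX shell; `check_utils` is a hypothetical helper name, not part of HPC-NOW:

```shell
# Report which of the given utilities are on the PATH.
# Returns non-zero if at least one is missing.
check_utils() {
  missing=0
  for util in "$@"; do
    if command -v "$util" >/dev/null 2>&1; then
      echo "found:   $util"
    else
      echo "missing: $util"
      missing=1
    fi
  done
  return "$missing"
}

# The utility list for GNU/Linux and macOS from the section above.
check_utils curl tar unzip ssh scp || echo "Please install the missing utilities first."
```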
- Use git for code management.
- Use a standard C compiler. This project requires:
  - gcc for GNU/Linux distros. You can get it easily with the package manager (e.g. yum).
  - clang for macOS (Darwin). Please install it by running the command clang. If clang is absent, macOS will offer to install it automatically.
  - mingw-w64 (with POSIX) for Microsoft Windows. POSIX support is a must. We recommend MinGW-W64 GCC-8.1.0-x86_64-posix-sjlj. Download the tarball -> Unzip it to a directory -> Add the full path of the bin subdirectory to the $PATH environment variable.
git clone https://github.com/zhenrong-wang/hpc-now
If your connectivity to GitHub is not stable, you can also try to clone from Gitee:
git clone https://gitee.com/zhenrong-wang/hpc-now
cd hpc-now
- For Microsoft Windows:
.\make_windows.bat build
- For GNU/Linux Distro :
chmod +x make_linux.sh && ./make_linux.sh build
- For macOS (Darwin) :
chmod +x make_darwin.sh && ./make_darwin.sh build
If everything goes well, the binaries will be built to the build folder.
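On UNIX-like systems, the choice between the two build scripts above can be derived from `uname -s`. This is only a convenience sketch; `build_script_for` is a hypothetical helper, and Windows users still run make_windows.bat from a Command Prompt:

```shell
# Map the kernel name reported by `uname -s` to the matching build script.
build_script_for() {
  case "$1" in
    Linux)  echo "make_linux.sh" ;;
    Darwin) echo "make_darwin.sh" ;;
    *)      echo "unsupported OS: $1" >&2; return 1 ;;
  esac
}

# Print (rather than run) the build command for the current machine.
if script=$(build_script_for "$(uname -s)"); then
  echo "chmod +x $script && ./$script build"
fi
```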
Temporary Administrator or root privilege is required to run the installer; the reasons are listed in the installer section of this document.
- For Microsoft Windows:
  Step 1. Open a Command Prompt as Administrator:
  - Type 'cmd' in the search box
  - Right click on the icon of the Command Prompt
  - Select 'Run as Administrator'
  Step 2. Change directory to the build folder. Suppose your local repo path is c:\users\public\hpc-now, then the command should be:
  cd c:\users\public\hpc-now\build
  Step 3. Run the command below. Suppose your installer version code is 0.3.2, hpcopr version code is 0.3.2, then the command should be:
  .\installer-win-0.3.2.exe install --hloc hpcopr-win-0.3.2.exe --cloc now-crypto-aes-win.exe
- For GNU/Linux Distros:
  Suppose your installer version code is 0.3.2, hpcopr version code is 0.3.2, then the command should be:
  sudo ./installer-lin-0.3.2.exe install --hloc hpcopr-lin-0.3.2.exe --cloc now-crypto-aes-lin.exe
- For macOS (Darwin):
  Suppose your installer version code is 0.3.2, hpcopr version code is 0.3.2, then the command should be:
  sudo ./installer-dwn-0.3.2.exe install --hloc hpcopr-dwn-0.3.2.exe --cloc now-crypto-aes-dwn.exe
IMPORTANT: Please replace the sample version code 0.3.2 with the real code of your own build.
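To avoid editing the version code by hand, the install command for GNU/Linux or macOS can be assembled from two variables. This sketch assumes the installer and hpcopr share one version code, as in the samples above; `installer_cmd` is a hypothetical helper that only prints the command:

```shell
# Print the install command for a given OS tag and version code.
# os_tag is "lin" or "dwn", matching the binary names produced by the build.
installer_cmd() {
  os_tag="$1"
  ver="$2"
  echo "sudo ./installer-${os_tag}-${ver}.exe install --hloc hpcopr-${os_tag}-${ver}.exe --cloc now-crypto-aes-${os_tag}.exe"
}

installer_cmd lin 0.3.2
```

Review the printed command, then paste it into the terminal opened in the build folder.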
IMPORTANT: Please keep the window open for the next step.
The hpcopr.exe is designed to be executed by the dedicated system OS user named hpc-now, which has been created by the installer in the last step.
In order to run the hpcopr.exe, you'll need to set a password and switch to that user. See the steps below:
- For Microsoft Windows:
  - Set a password for the user 'hpc-now' :
    net user hpc-now YOUR_COMPLEX_PASSWORD
  - Run a new cmd window as 'hpc-now' :
    runas /savecred /user:mymachine\hpc-now cmd
  - Run the main program 'hpcopr.exe' :
    hpcopr envcheck
- For GNU/Linux Distros:
  - Set a password for the user 'hpc-now' :
    sudo passwd hpc-now
  - Switch to the user 'hpc-now' :
    su hpc-now
  - Run the main program 'hpcopr.exe' :
    hpcopr envcheck
- For macOS (Darwin):
  - Set a password for the user 'hpc-now' :
    sudo dscl . -passwd /Users/hpc-now YOUR_COMPLEX_PASSWORD
  - Switch to the user 'hpc-now' :
    su hpc-now
  - Run the main program 'hpcopr.exe' :
    hpcopr envcheck
Several extra packages (around 500 MB) will be downloaded and installed. This process may take several minutes, depending on your internet connectivity.
NOTE 1: For UNIX-like OS, it is not necessary to set a password for 'hpc-now' and switch to it in the terminal. You can just run hpcopr.exe with the sudo -Hu hpc-now prefix, e.g.:
sudo -Hu hpc-now hpcopr envcheck
The -u option selects the user hpc-now, and -H sets HOME to that user's home directory.
This method is only valid for sudoers.
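If you use the sudo prefix often, a small shell function saves typing. This is a sketch, not something HPC-NOW ships; the function name `hpcnow` and the `HPCNOW_DRY_RUN` variable are hypothetical:

```shell
# Run hpcopr as the dedicated user hpc-now via sudo.
# With HPCNOW_DRY_RUN set, only print the command that would be run.
hpcnow() {
  if [ -n "${HPCNOW_DRY_RUN:-}" ]; then
    echo "sudo -Hu hpc-now hpcopr $*"
  else
    sudo -Hu hpc-now hpcopr "$@"
  fi
}

# Usage (after adding the function to ~/.bashrc or similar):
#   hpcnow envcheck
```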
NOTE 2: If you are using a GNU/Linux distro with a desktop environment (e.g. Debian with GNOME), after switching to the user hpc-now in a terminal, hpc-now may not be authorized to use your desktop environment by default, and the hpcopr rdp --copypass function would not work properly. Please follow the instructions below:
- Run the command source /etc/profile to authorize the user hpc-now to use the current desktop environment.
- Add source /etc/profile to the ~/.bashrc file using a text editor, e.g. gedit ~/.bashrc.
The hpcopr is the main CLI for you to run. Please see the description above.
If you'd like to update/uninstall the HPC-NOW services, you will need to run the installer with sudo (for UNIX-like OS) or as administrator (for Windows).
In order to use and manage HPC in the cloud with HPC-NOW, please follow the workflow:
- Import a cloud credential - a keypair or key file ( hpcopr new-cluster ... ) -->
- Initialize a new cluster ( hpcopr init ... ) -->
- Deploy an application ( hpcopr appman ... ) -->
- Upload your data ( hpcopr dataman ... ) -->
- Connect to your cluster ( hpcopr ssh ... OR hpcopr rdp ... ) -->
- Start your HPC work ( hpcopr jobman ... ) -->
- Wait for the job to be done - may be minutes, hours, or days ... -->
- Export your HPC data to local or other places ( hpcopr dataman ... ) -->
- Hibernate the cluster ( optional, hpcopr sleep ... ) -->
- Destroy the cloud cluster ( hpcopr destroy ... ) -->
- Remove the cloud credentials ( optional, hpcopr remove ... )
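As a quick reference, the workflow above can be printed as an ordered checklist. The sketch below only echoes subcommand names taken from the list; it does not run hpcopr, and the arguments each step needs (left as `...`) are described in the command reference later in this document. `workflow` is a hypothetical helper name:

```shell
# Print the workflow as an ordered checklist of hpcopr subcommands.
workflow() {
  for step in \
    "new-cluster:import a cloud credential" \
    "init:initialize a new cluster" \
    "appman:deploy an application" \
    "dataman:upload your data" \
    "ssh:connect to your cluster (or use rdp)" \
    "jobman:start your HPC work" \
    "dataman:export your HPC data" \
    "sleep:hibernate the cluster (optional)" \
    "destroy:destroy the cloud cluster" \
    "remove:remove the cloud credentials (optional)"
  do
    # Split each "subcommand:purpose" pair at the first colon.
    printf 'hpcopr %-12s ...  # %s\n' "${step%%:*}" "${step#*:}"
  done
}

workflow
```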
DEMO: An example of an HPC-NOW cluster running Paraview
The installer is designed to manage the installation/update/removal of the HPC-NOW services. It needs temporary administrator privilege to:
- Add/delete the dedicated system user 'hpc-now'
- Change the ownership and/or permissions of the key working directories
- Use system-level package manager to install packages such as wget, zip/unzip, in case the packages are absent
We follow the least privilege principle. Please check the source code directory of installer.
USAGE:
sudo ./installer GENERAL_OPTION(required) ADVANCED_OPTIONS(optional)    for macOS(Darwin) and GNU/Linux sudoers
.\installer GENERAL_OPTION(required) ADVANCED_OPTIONS(optional)    for Microsoft Windows Administrators
GENERAL_OPTION:
install      Install the HPC-NOW services and components.
update       Update the HPC-NOW services and components.
uninstall    Uninstall the HPC-NOW services completely. CAUTION! You must destroy/remove all the clusters managed on the current device before running this command! Otherwise, unmanaged cloud resources may be left behind!
help         Read the help doc for installer.
setpass      Set/rotate/change/update the operator's keystring.
version      Display the version of the installer, not the hpcopr.
verlist      List all the available versions provided in the public repository.
ADVANCED_OPTIONS:
--accept           Automatically accept the MIT License terms of this software.
--pass KEYSTRING   Specify the operator's crypto password.
--hloc LOCATION    The location (a URL or a valid local path) of the hpcopr CLI.
--cloc LOCATION    The location (a URL or a valid local path) of the now-crypto binary.
--hver VERSION     Specify a version of the hpcopr CLI, default: latest version.
--rdp              Recommended! Install the RDP client for GNU/Linux or macOS(Darwin).
Examples
sudo ./installer install --rdp
sudo ./installer install --hloc ./hpcopr.exe --cloc ./now-crypto.exe --accept
sudo ./installer uninstall
sudo ./installer update --hloc ./hpcopr.exe --accept
The hpcopr is a very powerful Command Line Interface (CLI) for you to use.
USAGE: hpcopr [-b] CMD_NAME CMD_FLAG ... [CMD_KEYWORD1 CMD_KEY_STRING1] ...
- -b : An optional flag to enter the batch mode and to skip all the interactions.
- CMD_NAME : see all the command names below
- CMD_FLAG : single value, such as --force, --all
- CMD_KEYWORD : key-value pair, such as -c myFirstCluster
Examples:
hpcopr new-cluster
hpcopr ssh -u user1 -c my_first_cluster
hpcopr -b rdp -u user2
CMD_NAME LIST:
envcheck       Quickly check the running environment.
new-cluster    Create a new cluster to initialize.
ls-clusters    List all the current clusters.
switch         Switch to a cluster in the registry to operate.
glance         View all the clusters or a target cluster.
refresh        Refresh a cluster without changing the resources.
export         Export a cluster to another hpcopr client. Optional params:
import         Import a cluster to the current hpcopr client.
remove         Completely remove a cluster from the OS and registry.
exit-current   Exit the current cluster.
help           Show this page and the information here.
usage          View and/or export the usage history.
monman         Get, filter, and extract cluster monitoring data.
history        View and/or export the operation log.
syserr         View and/or export the system cmd errors.
del-logs       Delete the log trashbin or archived logs of a cluster.
ssh            SSH to the master node of a cluster.
rdp            Connect to the cluster's desktop environment.
Advanced - For developers:
decrypt        VERY RISKY!!! Decrypt sensitive files of a cluster list or all.
encrypt        Encrypt sensitive files of a cluster list or all.
set-tf         Set the running configurations for OpenTofu or Terraform.
configloc      Configure the locations for the terraform binaries, providers, IaC templates and shell scripts.
showloc        Show the current configured locations.
showhash       Show the SHA-256 values of core components.
resetloc       Reset to the default locations.
rotate-key     Rotate a new keypair for an existing cluster. The new keypair should be valid and come from the same cloud vendor.
get-conf       Get the default configuration file to edit and build a customized HPC cluster later (using the 'init' command).
edit-conf      Edit and save the default configuration file before init.
rm-conf        Remove the configuration file before init.
init           Initialize a new cluster. If the configuration file is absent, the command will generate a default configuration file.
rebuild        Rebuild the nodes without destroying the cluster's storage.
vault          Check the sensitive information of the current cluster.
graph          Display the cluster map including all the nodes and status.
viewlog        View the operation log of the current cluster.
status         Check the status of SLURM service running in the cluster.
delc           Delete specified compute nodes. You must specify how many to be deleted by --nn NUM, or use --nn all.
addc           Add compute nodes to current cluster. You must specify how many to be added by --nn NUM.
shutdownc      Shutdown specified compute nodes. Similar to 'delc', you can specify to shut down all or part of the compute nodes by the parameter --nn NUM or --nn all.
turnonc        Turn on specified compute nodes. Similar to 'delc', you can specify to turn on all or part of the compute nodes by the parameter --nn NUM or --nn all.
reconfc        Reconfigure all the compute nodes.
reconfm        Reconfigure the master node.
sleep          Turn off all the nodes (management and compute) of the cluster.
wakeup         Wake up the cluster nodes.
nfsup          Increase the cluster’s NFS shared volume (in GB, only for Huaweicloud, Google Cloud Platform, and Microsoft Azure).
destroy        DESTROY the whole cluster - including all the resources & data.
payment        Switch the payment method between on-demand and monthly (not applicable for AWS, Google Cloud Platform, Microsoft Azure, or Volcengine).
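When scripting the compute-node commands above in batch mode, it can help to validate the `--nn` value before calling hpcopr. The sketch below assumes NUM means a positive integer, which the text implies but does not state; `valid_nn` is a hypothetical helper, and hpcopr's own validation may differ. Note that addc accepts only NUM, not all:

```shell
# Accept "all" or a positive integer as a value for --nn.
valid_nn() {
  case "$1" in
    all)          return 0 ;;
    ''|*[!0-9]*)  return 1 ;;   # empty, or contains a non-digit
    *)            [ "$1" -gt 0 ] ;;
  esac
}

nn="2"
if valid_nn "$nn"; then
  echo "hpcopr delc --nn $nn"   # printed only, not executed
else
  echo "invalid --nn value: $nn" >&2
fi
```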
Usage: hpcopr userman --ucmd USER_CMD [ KEY_WORD1 KEY_STRING1 ] ...
The cluster must be in running state (minimal or all).
--ucmd list List all the current cluster users.
--ucmd add Add a user to the cluster. By default, added users are enabled.
--ucmd delete Delete a user from the cluster.
--ucmd enable Enable a *disabled* user. Enabled users can run HPC workloads.
--ucmd disable Disable a user. Disabled users still can access the cluster.
--ucmd passwd Change user's password.
Usage: hpcopr dataman CMD_FLAG... [ KEY_WORD1 KEY_STRING1 ] ...
General Flags: -r, -rf, --recursive, --force, -f.
-s SOURCE_PATH Source path of the binary operations. e.g. cp
-d DEST_PATH Destination path of binary operations. e.g. cp
-t TARGET_PATH Target path of unary operations. e.g. ls
Bucket Operations
Transfer and manage data with the bucket.
--dcmd put Upload a local file or folder to the bucket path.
--dcmd get Download a bucket object(file or folder) to the local path.
--dcmd copy Copy a bucket object to another folder/path.
--dcmd list Show the object list of a specified folder/path.
--dcmd delete Delete an object (file or folder) of the bucket.
--dcmd move Move an existing object (file or folder) in the bucket.
Example: hpcopr dataman --dcmd put -s ./foo -d /foo -u user1
Direct Operations
Transfer and manage data in the cluster storage.
The cluster must be in running state (minimal or all).
--dcmd cp Remote copy between local and the cluster storage.
--dcmd mv Move the remote files/folders in the cluster storage.
--dcmd ls List the files/folders in the cluster storage.
--dcmd rm Remove the files/folders in the cluster storage.
--dcmd mkdir Make a directory in the cluster storage.
--dcmd cat Print out a remote plain text file.
--dcmd more Read a remote file.
--dcmd less Read a remote file.
--dcmd tail Stream out a remote file dynamically.
--dcmd rput Upload a *remote* file or folder to the bucket path.
--dcmd rget Download a bucket object(file or folder) to the *remote* path.
@h/ to specify the $HOME prefix of the cluster.
@d/ to specify the /hpc_data/user_data prefix.
@a/ to specify the /hpc_apps/ prefix, only for root or user1.
@p/ to specify the public folder prefix ( INSECURE !).
@R/ to specify the / prefix, only for root or user1.
@t/ to specify the /tmp prefix.
Example: hpcopr dataman --dcmd cp -s ~/foo/ -d @h/foo -r -u user1
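To make the path prefixes above concrete, the following sketch expands the prefixes whose targets are stated explicitly. It only illustrates the mapping and is not how hpcopr resolves paths internally; `expand_prefix` is a hypothetical helper. The @h/ and @p/ prefixes are left untouched because their targets depend on the cluster user and the public folder location:

```shell
# Expand the dataman prefixes with fixed targets into absolute cluster paths.
expand_prefix() {
  case "$1" in
    @d/*) echo "/hpc_data/user_data/${1#@d/}" ;;
    @a/*) echo "/hpc_apps/${1#@a/}" ;;
    @t/*) echo "/tmp/${1#@t/}" ;;
    @R/*) echo "/${1#@R/}" ;;
    *)    echo "$1" ;;   # @h/, @p/ and plain paths are returned unchanged
  esac
}

expand_prefix @d/results/run1.dat
expand_prefix @t/scratch
```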
Usage: hpcopr appman --acmd APP_CMD CMD_FLAG [ KEY_WORD1 KEY_STRING1 ] ...
The cluster must be in running state (minimal or all).
-u USERNAME
A valid user name. Use 'root' for all users. Admin or Operator role is required for root.
--acmd store List out the apps in store.
--acmd avail List out all the installed apps.
--acmd check Check whether an app is available.
--acmd install Install an app to all users or a specified user.
--acmd build Compile and build an app to all users or a specified user.
--acmd remove Remove an app from the cluster.
--acmd update-config Update the locations for scripts and package repository
--acmd show-config Display the locations for scripts and package repository
Usage: hpcopr jobman --jcmd APP_CMD [ KEY_WORD1 KEY_STRING1 ] ...
The cluster must be in running state (minimal or all).
-u USERNAME
A valid user name. The root user CANNOT submit jobs.
--jcmd submit Submit a job to the cluster.
--jcmd list List out all the jobs.
--jcmd cancel Cancel a job with specified ID
about          About this software and HPC-NOW project.
version        Display the version info.
license        Read the terms of the MIT License.
repair         Try to repair the hpcopr core components.
For more information, please refer to docs/UserManual-EN.pdf. CAUTION: This file may not be up to date.
The most detailed and updated help info can be found by the command hpcopr help. We are also considering writing a standard manual for hpcopr. If you are interested, please let us know.
Please see the contributing guide.
Also, please feel free to mailto:
The hpc-now service manages 2 top-level directories and several subdirectories on your device and OS. Here is the architecture:
- BINARY_ROOT: HPC-NOW binaries and utilities
  - Microsoft Windows: C:\hpc-now\
  - GNU/Linux: /home/hpc-now/.bin/
  - macOS(Darwin): /Users/hpc-now/.bin/
- RUNNING_ROOT: HPC-NOW running directories and files
  - Microsoft Windows: C:\ProgramData\hpc-now\
  - GNU/Linux: /usr/.hpc-now/
  - macOS(Darwin): /Applications/.hpc-now/
+- BINARY_ROOT/
   +- hpcopr                            The hpcopr executable
   +- utils/                            Including crypto, terraform/tofu and cloud utilities
      +- now-crypto-aes
      +- terraform/tofu
      +- cloud utilities
+- RUNNING_ROOT/
   +- .now_crypto_seed.lock             The hpcopr crypto string
   +- now_logs/                         Usages and Logs
      +- log_trashbin.txt               The trashbin of clusters' logs
      +- now-cluster-usage.log          The cluster usage log
      +- system_command_error.log       The system command error
      +- now-cluster-operation.log      The hpcopr command log
   +- .tmp/                             Temporary files
   +- .now-ssh/                         SSH keys for connectivity with your clusters
      +- now-cluster-login.tmp          Encrypted operator's private key
      +- now-cluster-login.pub          Operator's public key
      +- .CLUSTER_NAME/                 Each cluster has its own directory
         +- USER_PRIVATE_KEYS.tmp       Cluster users' encrypted private keys
   +- .etc/                             General configuration files
      +- .all_clusters.dat.tmp          Encrypted cluster registry
      +- .all_clusters.dat.dec.bak      Decrypted cluster registry
      +- current_cluster.dat            Current cluster indicator
      +- google_check.dat               Google connectivity indicator
      +- locations.conf                 Locations of components
      +- components.conf                Version and SHA of components
      +- tf_running.conf                TF running configuration
   +- .destroyed/                       Files of destroyed clusters
   +- workdir/                          Working directories for all clusters
      +- CLUSTER_NAME/                  Each cluster has its own directory
         +- log/                        Cluster-level running logs
         +- stack/                      TF running directory
         +- conf/                       Cluster configuration
         +- vault/                      Cluster's sensitive files
   +- mon_data/                         Monitoring data of all clusters
NOTE:
1. All the directories and files except the .now_crypto_seed.lock are set to be readable, writable, and executable only by the system user hpc-now.
2. The .now_crypto_seed.lock file is set to be readable only by root/Admin and hpc-now. It is NOT writable, even for root/Admin.
3. Manual modification of the .now_crypto_seed.lock file will break the whole HPC-NOW installation, because you may no longer be able to decrypt critical files.
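For scripts that need to locate these directories on UNIX-like systems, the two roots listed in this appendix can be looked up by kernel name. This is a minimal sketch; `hpcnow_roots` is a hypothetical helper that simply restates the paths given above:

```shell
# Print "BINARY_ROOT RUNNING_ROOT" for the given `uname -s` value.
hpcnow_roots() {
  case "$1" in
    Linux)  echo "/home/hpc-now/.bin/ /usr/.hpc-now/" ;;
    Darwin) echo "/Users/hpc-now/.bin/ /Applications/.hpc-now/" ;;
    *)      echo "unsupported OS: $1" >&2; return 1 ;;
  esac
}

# Show the roots for the current machine, if it is a supported UNIX-like OS.
hpcnow_roots "$(uname -s)" || true
```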