HPCMonitoring

Building HPC Monitoring system for HCMUT

HPC Monitoring

After using and researching several system-monitoring technologies, such as Prometheus and Zabbix, we found that each has drawbacks that make it unsuitable for the HPC system at Ho Chi Minh City University of Technology (HCMUT).

To serve the specific demands of our university's HPC system, this project provides a monitoring tool that approaches the problem in a new way: Event-Driven Architecture.

🚀 Functionalities

Using Apache Kafka, we built a monitoring system with the following functions:

  • Provide monitoring agents, one running on each node, that collect metric data for the entire node and, in particular, for each process. Collected data includes CPU usage, memory (both physical and virtual), I/O reads/writes, network bytes in/out, and disk usage. For each process, the data also includes the name, PID, PPID, UID, GID, executable path, and the command used to run the process.
  • Support a query language that allows on-demand monitoring. For example, a user can collect only the desired metrics, such as CPU and memory usage. Furthermore, the SQL-like query language can restrict collection to specific conditions, such as collecting information only about processes whose RAM usage exceeds 50% of the total.
  • Provide a web UI to manage the monitoring agents.
  • Leverage Apache Kafka to realize an event-driven architecture: all collected metrics are sent to a Kafka broker, so users do not need to worry about how metrics are received.
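
To make the on-demand idea above concrete, here is a minimal sketch in Python. It is not the project's query language or its agent code: the `ProcessSample` record, its field names, and the `run_query` helper are all hypothetical, chosen only to illustrate selecting a subset of metrics and filtering on a condition such as "RAM usage above 50%".

```python
from dataclasses import dataclass, asdict
from typing import Callable, Dict, List


@dataclass
class ProcessSample:
    # Hypothetical per-process record; the real agent's schema may differ.
    name: str
    pid: int
    cpu_percent: float
    mem_percent: float


def run_query(
    samples: List[ProcessSample],
    fields: List[str],
    condition: Callable[[ProcessSample], bool],
) -> List[Dict[str, object]]:
    """Keep samples that satisfy `condition`, returning only `fields`."""
    rows = []
    for sample in samples:
        if condition(sample):
            record = asdict(sample)
            rows.append({field: record[field] for field in fields})
    return rows


# Made-up sample data for illustration only.
samples = [
    ProcessSample("mpirun", 1201, 87.5, 62.0),
    ProcessSample("sshd", 640, 0.3, 0.4),
    ProcessSample("python3", 2210, 45.0, 51.5),
]

# Roughly: SELECT name, pid, mem_percent WHERE mem_percent > 50
heavy = run_query(
    samples,
    ["name", "pid", "mem_percent"],
    lambda s: s.mem_percent > 50.0,
)
```

In this sketch the filter runs after the samples already exist; a real agent would presumably apply the selection earlier so that unneeded data is never read.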

In addition to the monitoring agents, our team, strongly inspired by DBMS architecture, developed a module called "virtual sensor". It not only has its own query language but also implements query optimization: it collects only the metrics that are requested, which reduces the workload of reading data. To our knowledge, this feature is not available in current monitoring tools such as Zabbix and Prometheus.
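
The optimization described above, reading only what a query asks for, can be sketched as follows. This is an illustration under assumptions, not the virtual sensor's implementation: the collector names, the fixed values they return, and the `collect_on_demand` helper are hypothetical stand-ins for potentially expensive reads of system data.

```python
from typing import Callable, Dict, List


def make_collectors(read_log: List[str]) -> Dict[str, Callable[[], float]]:
    """Build hypothetical collectors that record each time they are invoked."""

    def collector(name: str, value: float) -> Callable[[], float]:
        def read() -> float:
            read_log.append(name)  # stands in for an expensive read
            return value

        return read

    return {
        "cpu": collector("cpu", 12.5),
        "memory": collector("memory", 40.0),
        "disk": collector("disk", 73.0),
        "network": collector("network", 1.2),
    }


def collect_on_demand(
    collectors: Dict[str, Callable[[], float]],
    requested: List[str],
) -> Dict[str, float]:
    """Invoke only the collectors named in the query; skip all others."""
    return {name: collectors[name]() for name in requested}


read_log: List[str] = []
collectors = make_collectors(read_log)

# Only the CPU and memory collectors run; disk and network are never read.
result = collect_on_demand(collectors, ["cpu", "memory"])
```

After this runs, `read_log` contains only the two requested collector names, which is the point of the design: the cost of a query scales with what it requests rather than with everything the agent is able to measure.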

🐧 Authors

We are a group of students from HCMUT. This organization was established to host the results of our 2023 thesis.

📚 Helpful resources

Our documentation is here. Please look around before exploring repositories.

Pinned

  1. virtual-sensor Public

    An agent that runs on each node in the HPC system

    C++ 1

  2. docs Public template

    All documents for the system go here. Please look through them before exploring our source code.

    Makefile 9

  3. sensor-manager-api Public

    Sensor manager API server

    TypeScript 3

  4. sensor-manager Public template

    Sensor management web application

    TypeScript 2

Repositories

Showing 6 of 6 repositories
  • .github Public

    Organization profile page.

    0 0 0 0 Updated Jul 4, 2023
  • virtual-sensor Public

    An agent that runs on each node in the HPC system

    C++ 1 0 0 0 Updated Jul 4, 2023
  • sensor-manager Public template

    Sensor management web application

    TypeScript 2 0 1 0 Updated May 14, 2023
  • sensor-manager-api Public

    Sensor manager API server

    TypeScript 3 0 0 0 Updated May 14, 2023
  • kafka-subscriber Public

    A sample project that reads data from a Kafka broker and writes the results to Elasticsearch

    JavaScript 0 0 0 0 Updated May 2, 2023
  • docs Public template

    All documents for the system go here. Please look through them before exploring our source code.

    Makefile 9 0 0 0 Updated Mar 27, 2023
