[Bug]: Dynamic Configuration Manager: Unable to Assign Existing Monitoring Jobs to New Node #1035

Closed
onion83 opened this issue Jul 14, 2024 · 15 comments
Labels: bug (Something isn't working), needs triage

onion83 commented Jul 14, 2024

Bug description

When using the Dynamic Configuration Manager in Netdata Cloud, I get the error message "Unknown config id given" when trying to assign existing monitoring jobs to a new node by clicking "Submit to multiple nodes".
[Screenshots: iShot_2024-07-15_00 23 15, iShot_2024-07-15_00 23 42]

Expected behavior

Submitting the job to the new node should succeed.

Steps to Reproduce

  1. Perform a Full New Install of a Node (LXC on Proxmox):

    • Complete a fresh installation of a node using LXC in Proxmox.
  2. Install Using Integrations Auto Shell Script:

    • Install Netdata with the kickstart.sh auto-install script from Integrations and wait for the node to appear as active in the Netdata Cloud dashboard.
  3. Submit Tasks to Multiple Nodes:

    • Navigate to "Manage Space / Netdata Space / Configurations".
    • Edit an existing job, such as "ping", and attempt to submit it to multiple nodes.

Installation method

kickstart.sh

System info

Linux netdata-sz-ctc 5.15.149-1-pve #1 SMP PVE 5.15.149-1 (2024-03-29T14:24Z) x86_64 x86_64 x86_64 GNU/Linux
/etc/os-release:NAME="Rocky Linux"
/etc/os-release:VERSION="9.4 (Blue Onyx)"
/etc/os-release:ID="rocky"
/etc/os-release:ID_LIKE="rhel centos fedora"
/etc/os-release:VERSION_ID="9.4"
/etc/os-release:PLATFORM_ID="platform:el9"
/etc/os-release:PRETTY_NAME="Rocky Linux 9.4 (Blue Onyx)"
/etc/os-release:ANSI_COLOR="0;32"
/etc/os-release:LOGO="fedora-logo-icon"
/etc/os-release:CPE_NAME="cpe:/o:rocky:rocky:9::baseos"
/etc/os-release:SUPPORT_END="2032-05-31"
/etc/os-release:ROCKY_SUPPORT_PRODUCT="Rocky-Linux-9"
/etc/os-release:ROCKY_SUPPORT_PRODUCT_VERSION="9.4"
/etc/os-release:REDHAT_SUPPORT_PRODUCT="Rocky Linux"
/etc/os-release:REDHAT_SUPPORT_PRODUCT_VERSION="9.4"
/etc/redhat-release:Rocky Linux release 9.4 (Blue Onyx)
/etc/rocky-release:Rocky Linux release 9.4 (Blue Onyx)
/etc/system-release:Rocky Linux release 9.4 (Blue Onyx)

Netdata build info

Packaging:
    Netdata Version ____________________________________________ : v1.46.2
    Installation Type __________________________________________ : binpkg-rpm
    Package Architecture _______________________________________ : x86_64
    Package Distro _____________________________________________ :
    Configure Options __________________________________________ : dummy-configure-command
Default Directories:
    User Configurations ________________________________________ : /etc/netdata
    Stock Configurations _______________________________________ : /usr/lib/netdata/conf.d
    Ephemeral Databases (metrics data, metadata) _______________ : /var/cache/netdata
    Permanent Databases ________________________________________ : /var/lib/netdata
    Plugins ____________________________________________________ : /usr/libexec/netdata/plugins.d
    Static Web Files ___________________________________________ : /usr/share/netdata/web
    Log Files __________________________________________________ : /var/log/netdata
    Lock Files _________________________________________________ : /var/lib/netdata/lock
    Home _______________________________________________________ : /var/lib/netdata
Operating System:
    Kernel _____________________________________________________ : Linux
    Kernel Version _____________________________________________ : 5.15.149-1-pve
    Operating System ___________________________________________ : unknown
    Operating System ID ________________________________________ : unknown
    Operating System ID Like ___________________________________ : unknown
    Operating System Version ___________________________________ : unknown
    Operating System Version ID ________________________________ : 9.4
    Detection __________________________________________________ : unknown
Hardware:
    CPU Cores __________________________________________________ : 2
    CPU Frequency ______________________________________________ : 2000000000
    RAM Bytes __________________________________________________ : 2147483648
    Disk Capacity ______________________________________________ : 375141883904
    CPU Architecture ___________________________________________ : x86_64
    Virtualization Technology __________________________________ : none
    Virtualization Detection ___________________________________ : systemd-detect-virt
Container:
    Container __________________________________________________ : lxc
    Container Detection ________________________________________ : systemd-detect-virt
    Container Orchestrator _____________________________________ : none
    Container Operating System _________________________________ : Rocky Linux
    Container Operating System ID ______________________________ : rocky
    Container Operating System ID Like _________________________ : rhel centos fedora
    Container Operating System Version _________________________ : 9.4 (Blue Onyx)
    Container Operating System Version ID ______________________ : 9.4
    Container Operating System Detection _______________________ : /etc/os-release
Features:
    Built For __________________________________________________ : Linux
    Netdata Cloud ______________________________________________ : YES
    Health (trigger alerts and send notifications) _____________ : YES
    Streaming (stream metrics to parent Netdata servers) _______ : YES
    Back-filling (of higher database tiers) ____________________ : YES
    Replication (fill the gaps of parent Netdata servers) ______ : YES
    Streaming and Replication Compression ______________________ : YES (zstd lz4 gzip)
    Contexts (index all active and archived metrics) ___________ : YES
    Tiering (multiple dbs with different metrics resolution) ___ : YES (5)
    Machine Learning ___________________________________________ : YES
Database Engines:
    dbengine (compression) _____________________________________ : YES (zstd lz4)
    alloc ______________________________________________________ : YES
    ram ________________________________________________________ : YES
    none _______________________________________________________ : YES
Connectivity Capabilities:
    ACLK (Agent-Cloud Link: MQTT over WebSockets over TLS) _____ : YES
    static (Netdata internal web server) _______________________ : YES
    h2o (web server) ___________________________________________ : YES
    WebRTC (experimental) ______________________________________ : NO
    Native HTTPS (TLS Support) _________________________________ : YES
    TLS Host Verification ______________________________________ : YES
Libraries:
    LZ4 (extremely fast lossless compression algorithm) ________ : YES
    ZSTD (fast, lossless compression algorithm) ________________ : YES
    zlib (lossless data-compression library) ___________________ : YES
    Brotli (generic-purpose lossless compression algorithm) ____ : NO
    protobuf (platform-neutral data serialization protocol) ____ : YES (system)
    OpenSSL (cryptography) _____________________________________ : YES
    libdatachannel (stand-alone WebRTC data channels) __________ : NO
    JSON-C (lightweight JSON manipulation) _____________________ : YES
    libcap (Linux capabilities system operations) ______________ : NO
    libcrypto (cryptographic functions) ________________________ : YES
    libyaml (library for parsing and emitting YAML) ____________ : YES
Plugins:
    apps (monitor processes) ___________________________________ : YES
    cgroups (monitor containers and VMs) _______________________ : YES
    cgroup-network (associate interfaces to CGROUPS) ___________ : YES
    proc (monitor Linux systems) _______________________________ : YES
    tc (monitor Linux network QoS) _____________________________ : YES
    diskspace (monitor Linux mount points) _____________________ : YES
    freebsd (monitor FreeBSD systems) __________________________ : NO
    macos (monitor MacOS systems) ______________________________ : NO
    statsd (collect custom application metrics) ________________ : YES
    timex (check system clock synchronization) _________________ : YES
    idlejitter (check system latency and jitter) _______________ : YES
    bash (support shell data collection jobs - charts.d) _______ : YES
    debugfs (kernel debugging metrics) _________________________ : YES
    cups (monitor printers and print jobs) _____________________ : YES
    ebpf (monitor system calls) ________________________________ : YES
    freeipmi (monitor enterprise server H/W) ___________________ : YES
    nfacct (gather netfilter accounting) _______________________ : NO
    perf (collect kernel performance events) ___________________ : YES
    slabinfo (monitor kernel object caching) ___________________ : YES
    Xen ________________________________________________________ : NO
    Xen VBD Error Tracking _____________________________________ : NO
    Logs Management ____________________________________________ : YES
Exporters:
    AWS Kinesis ________________________________________________ : NO
    GCP PubSub _________________________________________________ : NO
    MongoDB ____________________________________________________ : YES
    Prometheus (OpenMetrics) Exporter __________________________ : YES
    Prometheus Remote Write ____________________________________ : YES
    Graphite ___________________________________________________ : YES
    Graphite HTTP / HTTPS ______________________________________ : YES
    JSON _______________________________________________________ : YES
    JSON HTTP / HTTPS __________________________________________ : YES
    OpenTSDB ___________________________________________________ : YES
    OpenTSDB HTTP / HTTPS ______________________________________ : YES
    All Metrics API ____________________________________________ : YES
    Shell (use metrics in shell scripts) _______________________ : YES
Debug/Developer Features:
    Trace All Netdata Allocations (with charts) ________________ : NO
    Developer Mode (more runtime checks, slower) _______________ : NO

Additional info

No response

onion83 added the bug (Something isn't working) and needs triage labels Jul 14, 2024
ilyam8 transferred this issue from netdata/netdata Jul 14, 2024
kapantzak self-assigned this Jul 15, 2024
@kapantzak

Thank you @onion83 for reporting! We're investigating in order to fix this soon.

ilyam8 commented Jul 15, 2024

@onion83, hey. Can you try creating a ping job directly on netdata-sz-ctc? You will need to select the node.

[Screenshot 2024-07-15 at 10 21 45]

Unknown config id given

This may indicate that go.d.plugin (which provides the ping collector) is not running on that particular node.

onion83 commented Jul 15, 2024

As shown in the attached video, I recreated a brand-new node, netdata-sz-ctc2, and added it to the dashboard, confirming it is online. The video shows netdata.cloud on the left and the local node on the right.

  1. SSH into netdata-sz-ctc2 and confirm via the ps command that go.d.plugin is running in the system processes.
  2. Create a local task named localtest and confirm its success.
  3. In the netdata.cloud management console, attempt to sync an existing node's (netdata-cmc) apps monitoring task to netdata-sz-ctc2. This failed.
  4. In the local backend of netdata-sz-ctc2, attempt to add a task named apps with the monitoring target 1.1.1.1.
  5. In the netdata-cmc node, use the "Submit to multiple nodes" feature and select netdata-sz-ctc2 as the sync target. This time, the task succeeded.
  6. After refreshing the browser with F5 and editing the apps monitoring task on netdata-sz-ctc2, the monitoring target is now fully synchronized with netdata-cmc.

Therefore, the current bug is: after adding a new node, an empty job with the same name must first be created on the new node before a job can be synced to it from other nodes (only tested with the ping collector; other collectors not tested).

Expected:

  1. Automatically create non-existent monitoring tasks during synchronization.
  2. Feature request: when a node installed via the auto-install script joins the same room, automatically sync all monitoring jobs to it, avoiding manual configuration and improving operational efficiency.
[Video attachment: 2024-07-15.16.20.25.mp4]

ilyam8 commented Jul 15, 2024

@onion83, hey. Not related to the issue, but: ping in "privileged" mode becomes less efficient as the number of targets grows, because CPU usage scales disproportionately, increasing much faster than the number of targets. That is a bug in the upstream library we use for go.d/ping. See netdata/netdata#15410.

onion83 commented Jul 16, 2024

Hey @ilyam8, please take a look at the title, issue, and video description. This is specifically about the job distribution issue with the Dynamic Configuration Manager, not about ping values, system permissions, CPU, etc.

ilyam8 commented Jul 16, 2024

I know; that is why I started with "Not related to the issue".

@sashwathn

@ilyam8: Is this a bug on the agent side? I don't see why the user should need to create a local job (on the local Agent dashboard) before submitting it to multiple nodes.
Or @kapantzak, have you identified an issue on the FE side for this?

@kapantzak

@sashwathn I don't see any FE issue here.

ilyam8 commented Aug 13, 2024

Is this a bug at the agent side?

What is happening:

  • @onion83 uses the "update" action to sync (copy) a dyncfg item from one node to another
    • Click Edit an existing job on A.
    • Click Submit to Multiple Nodes.
    • Select Nodes to submit (B, C, ...).
  • This results in a "Dyncfg functions intercept: id is not found" error on every node other than the job source (A): the request is an "update", and Netdata rejects it because "update" can only be applied to an existing job.

We need to provide another way to copy dyncfg items from node to node, or treat "update" as "add" if there is no existing job (a rough sketch of that fallback follows).
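
A minimal sketch of that fallback idea (not Netdata's actual code; the DyncfgRequest type, the resolveAction() helper, and the id strings are illustrative):

```typescript
// Hypothetical sketch only, not Netdata's actual dyncfg code.
type DyncfgAction = "add" | "update";

interface DyncfgRequest {
  id: string;           // dyncfg item id (format is illustrative)
  action: DyncfgAction;
  payload: string;      // the submitted job configuration
}

// Fallback idea: if an "update" targets an id the node does not have,
// treat it as an "add" instead of failing with "id is not found".
function resolveAction(existingIds: Set<string>, req: DyncfgRequest): DyncfgAction {
  if (req.action === "update" && !existingIds.has(req.id)) {
    return "add";
  }
  return req.action;
}

// Node B has no "ping:apps" job yet, so an incoming "update" becomes an "add".
const idsOnNodeB = new Set(["ping:localtest"]);
console.log(resolveAction(idsOnNodeB, { id: "ping:apps", action: "update", payload: "{}" })); // "add"
```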

ilyam8 commented Aug 13, 2024

or treat "update" as "add" if there is no existing job.

I think we need this; I will discuss it with @ktsaou when he returns.

ktsaou commented Aug 13, 2024

So for the nodes that already have the job it is an "update", but for the new nodes it has to be an "add".

The solution is to convert an update to an add if the item is not already there?

ilyam8 commented Aug 13, 2024

The solution is to convert an update to an add if the item is not already there?

Yes.


cc @onion83

An alternative is to use this workflow:

  • Click Edit an existing job on A.
  • Click "copy this item and create a new one".
  • Copy/paste the name (so it appears with the same name on other nodes).
  • Select Nodes to submit (B, C, ...) and Submit.

This will result in an "add", so there are no issues.

[Video attachment: Screen.Recording.2024-08-13.at.13.01.21.mov]

ilyam8 commented Aug 16, 2024

@kapantzak hey 👋 We discussed the issue with @ktsaou and suggest the following change to the frontend (a rough sketch follows the list):

  • When doing "Submit to multiple nodes" during "Edit":
    • For the origin node, always send "update".
    • For every other node, do a "get" request first to find out whether the item exists:
      • If it exists, send "update".
      • Otherwise, send "add".
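
A minimal sketch of that flow; dyncfgItemExists() and dyncfgSubmit() are hypothetical stand-ins for the real Netdata Cloud API calls, stubbed so the example is self-contained:

```typescript
type Action = "add" | "update";

// Stub: in the real UI this would be the "get" request to the node.
async function dyncfgItemExists(nodeId: string, configId: string): Promise<boolean> {
  return false; // pretend the item does not exist on this node yet
}

// Stub: in the real UI this would submit the chosen action with the edited payload.
async function dyncfgSubmit(nodeId: string, configId: string, action: Action, payload: string): Promise<void> {
  console.log(`node=${nodeId} config=${configId} action=${action}`);
}

async function submitToNodes(originNodeId: string, nodeIds: string[], configId: string, payload: string): Promise<void> {
  for (const nodeId of nodeIds) {
    if (nodeId === originNodeId) {
      await dyncfgSubmit(nodeId, configId, "update", payload); // origin always gets "update"
      continue;
    }
    // Non-origin node: check first whether the item exists there.
    const exists = await dyncfgItemExists(nodeId, configId);
    await dyncfgSubmit(nodeId, configId, exists ? "update" : "add", payload);
  }
}

// Example: submitting the edited job from origin node A to nodes A, B, and C.
void submitToNodes("A", ["A", "B", "C"], "ping:apps", "{}");
```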

@kapantzak

Hi @onion83, we released some changes for this that hopefully fix the issue.

ilyam8 closed this as completed Aug 19, 2024
onion83 commented Aug 21, 2024

It works! Thank you.
