Add memory DIMM status in dashboard #1024

Closed
faguayot opened this issue May 11, 2022 · 15 comments

@faguayot

Is your feature request related to a problem? Please describe.
I'm missing status information for the memory DIMMs.

Describe the solution you'd like
I would like to see the status of the memory DIMMs, either added to the node dashboard or in a general hardware dashboard with a separate section for each component, for example fans, power, and memory.

Additional context
Example of a current similar situation:
[image]

@faguayot faguayot added the feature New feature or request label May 11, 2022
@cgrinds
Collaborator

cgrinds commented May 11, 2022

Hi @faguayot, not sure if you've seen the Power dashboard, but it has the fan, power, and temperature information you're looking for. Would adding memory to that dashboard work for you?

We took a look and so far we haven't been able to find a ZAPI or REST endpoint that returns DIMM information. None of the system-node ZAPIs, storage-shelf-info-get-iter or environment-sensors-get-iter returns what we need.

environment-sensors-get-iter returns something like this

{
      "discrete-sensor-state": "normal",
      "discrete-sensor-value": "NORMAL",
      "node-name": "umeng-aff300-06",
      "sensor-name": "Memory0 Hot",
      "sensor-type": "discrete",
      "threshold-sensor-state": "normal"
},

but I think this is for hot-swappable RAM, not the status of the installed memory. The response above returns only a single sensor, named Memory0 Hot, even though that node has four DIMMs:

system controller fru show -node umeng-aff300-06 -subsystem Memory
Node               FRU Name                     Subsystem          Status
------------------ ---------------------------- ------------------ -----------
umeng-aff300-06    DIMM-4                       Memory             ok
umeng-aff300-06    DIMM-1                       Memory             ok
umeng-aff300-06    DIMM-3                       Memory             ok
umeng-aff300-06    DIMM-2                       Memory             ok
4 entries were displayed.
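
If the CLI text is all that is available, the table above can be turned into records with a few lines of Python. This is only a sketch: it assumes four whitespace-separated columns with no embedded spaces, as in the sample, and `parse_fru_table` is a hypothetical helper name, not part of Harvest or ONTAP.

```python
SAMPLE = """\
Node               FRU Name                     Subsystem          Status
------------------ ---------------------------- ------------------ -----------
umeng-aff300-06    DIMM-4                       Memory             ok
umeng-aff300-06    DIMM-1                       Memory             ok
umeng-aff300-06    DIMM-3                       Memory             ok
umeng-aff300-06    DIMM-2                       Memory             ok
4 entries were displayed.
"""


def parse_fru_table(text):
    """Parse `system controller fru show` text output into a list of dicts.

    Assumption: every data row has exactly four whitespace-separated
    columns, as in the sample above.
    """
    rows = []
    in_body = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("---"):
            in_body = True  # the dashed rule separates the header from the data
            continue
        if not in_body or not stripped:
            continue
        if stripped.endswith("displayed."):
            break  # trailing "N entries were displayed." summary line
        node, fru_name, subsystem, status = stripped.split()
        rows.append(
            {"node": node, "fru_name": fru_name, "subsystem": subsystem, "status": status}
        )
    return rows


rows = parse_fru_table(SAMPLE)
print(len(rows), [row["fru_name"] for row in rows])
```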

Memory status is a good idea and seems natural to add to the Power dashboard.

The only way to expose this information to Harvest may be to use the private CLI pass-through, like so:

Similar to your query1

curl -k 'https://10.193.48.11/api/private/cli/system/controller/memory/dimm?fields=node,slotname,status,alt-cecc-dimm'

and similar to your query2

curl -k 'https://10.193.48.11/api/private/cli/system/controller/fru?subsystem=Memory&fields=node,fru-name,status'

Harvest's REST collector is still in beta, but it supports the private CLI pass-through. We already use it in a few templates to close gaps between REST and ZAPI.
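
For anyone who wants to try the same two queries from a script instead of curl, here is a minimal Python sketch. The helper names `private_cli_url` and `fetch_records` are made up for illustration, and HTTP basic auth is an assumption, since the curl examples above do not show credentials. Certificate checks are disabled only to mirror `curl -k`.

```python
import base64
import json
import ssl
import urllib.parse
import urllib.request


def private_cli_url(host, cli_path, fields, filters=None):
    """Build a private CLI pass-through URL like the curl examples above."""
    params = dict(filters or {})
    params["fields"] = ",".join(fields)
    # Keep commas unescaped so the URL matches the curl form.
    query = urllib.parse.urlencode(params, safe=",")
    return f"https://{host}/api/private/cli/{cli_path.strip('/')}?{query}"


def fetch_records(url, user, password):
    """GET the URL and return its "records" list.

    Assumptions: basic auth is accepted, and skipping certificate
    verification (like `curl -k`) is acceptable for a lab system.
    """
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})
    with urllib.request.urlopen(request, context=context) as response:
        return json.load(response).get("records", [])


dimm_url = private_cli_url(
    "10.193.48.11",
    "system/controller/memory/dimm",
    ["node", "slotname", "status", "alt-cecc-dimm"],
)
fru_url = private_cli_url(
    "10.193.48.11",
    "system/controller/fru",
    ["node", "fru-name", "status"],
    {"subsystem": "Memory"},
)
print(dimm_url)
print(fru_url)
```

Calling `fetch_records(fru_url, user, password)` would then return the list of FRU records, assuming the cluster is reachable and the account is allowed to use the private CLI endpoint.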

@faguayot
Author

Hello @cgrinds,

Yes, I saw the new dashboards in the version you released this week. I think they are great, and I am still exploring the new information; many thanks for the improvements and new features. Given that release and the memory problem we had, I thought it would be a good idea to show this information in a panel, perhaps one focused on the physical components. Adding it to the Power dashboard is fine with me, and so is a differently named panel focused on hardware.

Here is the current output from our cluster for the commands at the end of your response (query1, query2). The last record shows OK because we replaced the module **DIMM 13** in node snes1p301_01, but I still think it would be a good idea to add this to a panel.

{
  "records": [
    {
      "node": "snes1p301_01",
      "subsystem": "memory",
      "serial_number": "CE-03-1923-1227A76D",
      "fru_name": "DIMM-5",
      "status": "ok"
    },
    {
      "node": "snes1p301_01",
      "subsystem": "memory",
      "serial_number": "CE-03-1923-1227A7E0",
      "fru_name": "DIMM-1",
      "status": "ok"
    },
    {
      "node": "snes1p301_01",
      "subsystem": "memory",
      "serial_number": "CE-03-1923-1227A7E6",
      "fru_name": "DIMM-13",
      "status": "ok"
    },
    {
      "node": "snes1p301_01",
      "subsystem": "memory",
      "serial_number": "CE-03-1923-1227A8BF",
      "fru_name": "DIMM-4",
      "status": "ok"
    },
    {
      "node": "snes1p301_01",
      "subsystem": "memory",
      "serial_number": "CE-03-1923-1227A8EE",
      "fru_name": "DIMM-8",
      "status": "ok"
    },
    {
      "node": "snes1p301_01",
      "subsystem": "memory",
      "serial_number": "CE-03-1923-1227A8F2",
      "fru_name": "DIMM-6",
      "status": "ok"
    },
    {
      "node": "snes1p301_01",
      "subsystem": "memory",
      "serial_number": "CE-03-1923-1227A952",
      "fru_name": "DIMM-3",
      "status": "ok"
    },
    {
      "node": "snes1p301_01",
      "subsystem": "memory",
      "serial_number": "CE-03-1923-1227BCDD",
      "fru_name": "DIMM-14",
      "status": "ok"
    },
    {
      "node": "snes1p301_01",
      "subsystem": "memory",
      "serial_number": "CE-03-1923-1227BD79",
      "fru_name": "DIMM-12",
      "status": "ok"
    },
    {
      "node": "snes1p301_01",
      "subsystem": "memory",
      "serial_number": "CE-03-1923-1227BDF6",
      "fru_name": "DIMM-11",
      "status": "ok"
    },
    {
      "node": "snes1p301_01",
      "subsystem": "memory",
      "serial_number": "CE-03-1923-1227BDF8",
      "fru_name": "DIMM-10",
      "status": "ok"
    },
    {
      "node": "snes1p301_01",
      "subsystem": "memory",
      "serial_number": "CE-03-1923-1227C0F2",
      "fru_name": "DIMM-15",
      "status": "ok"
    },
    {
      "node": "snes1p301_01",
      "subsystem": "memory",
      "serial_number": "CE-03-1923-1227C0FA",
      "fru_name": "DIMM-7",
      "status": "ok"
    },
    {
      "node": "snes1p301_01",
      "subsystem": "memory",
      "serial_number": "CE-03-1923-1227C1CE",
      "fru_name": "DIMM-16",
      "status": "ok"
    },
    {
      "node": "snes1p301_01",
      "subsystem": "memory",
      "serial_number": "CE-03-1923-1227C1D6",
      "fru_name": "DIMM-2",
      "status": "ok"
    },
    {
      "node": "snes1p301_01",
      "subsystem": "memory",
      "serial_number": "CE-03-1923-1227C1D7",
      "fru_name": "DIMM-9",
      "status": "ok"
    },
    {
      "node": "snes1p301_02",
      "subsystem": "memory",
      "serial_number": "CE-03-1925-1252E9C6",
      "fru_name": "DIMM-8",
      "status": "ok"
    },
    {
      "node": "snes1p301_02",
      "subsystem": "memory",
      "serial_number": "CE-03-1925-1252EDC8",
      "fru_name": "DIMM-16",
      "status": "ok"
    },
    {
      "node": "snes1p301_02",
      "subsystem": "memory",
      "serial_number": "CE-03-1925-1252F6B3",
      "fru_name": "DIMM-15",
      "status": "ok"
    },
    {
      "node": "snes1p301_02",
      "subsystem": "memory",
      "serial_number": "CE-03-1925-1252F709",
      "fru_name": "DIMM-7",
      "status": "ok"
    },
    {
      "node": "snes1p301_02",
      "subsystem": "memory",
      "serial_number": "CE-03-1925-1252FF4B",
      "fru_name": "DIMM-14",
      "status": "ok"
    },
    {
      "node": "snes1p301_02",
      "subsystem": "memory",
      "serial_number": "CE-03-1925-12530118",
      "fru_name": "DIMM-5",
      "status": "ok"
    },
    {
      "node": "snes1p301_02",
      "subsystem": "memory",
      "serial_number": "CE-03-1925-12530261",
      "fru_name": "DIMM-6",
      "status": "ok"
    },
    {
      "node": "snes1p301_02",
      "subsystem": "memory",
      "serial_number": "CE-03-1925-1254E9F3",
      "fru_name": "DIMM-1",
      "status": "ok"
    },
    {
      "node": "snes1p301_02",
      "subsystem": "memory",
      "serial_number": "CE-03-1925-12551305",
      "fru_name": "DIMM-4",
      "status": "ok"
    },
    {
      "node": "snes1p301_02",
      "subsystem": "memory",
      "serial_number": "CE-03-1925-1255133C",
      "fru_name": "DIMM-2",
      "status": "ok"
    },
    {
      "node": "snes1p301_02",
      "subsystem": "memory",
      "serial_number": "CE-03-1925-12551B7F",
      "fru_name": "DIMM-9",
      "status": "ok"
    },
    {
      "node": "snes1p301_02",
      "subsystem": "memory",
      "serial_number": "CE-03-1925-12551BA9",
      "fru_name": "DIMM-11",
      "status": "ok"
    },
    {
      "node": "snes1p301_02",
      "subsystem": "memory",
      "serial_number": "CE-03-1925-12551BDF",
      "fru_name": "DIMM-10",
      "status": "ok"
    },
    {
      "node": "snes1p301_02",
      "subsystem": "memory",
      "serial_number": "CE-03-1925-12551C0A",
      "fru_name": "DIMM-12",
      "status": "ok"
    },
    {
      "node": "snes1p301_02",
      "subsystem": "memory",
      "serial_number": "CE-03-1925-12551C70",
      "fru_name": "DIMM-3",
      "status": "ok"
    },
    {
      "node": "snes1p301_02",
      "subsystem": "memory",
      "serial_number": "CE-04-2049-449DA75E",
      "fru_name": "DIMM-13",
      "status": "ok"
    }
  ],
  "num_records": 32
}
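
A response shaped like the one above is easy to reduce to what a status panel needs: a DIMM count per node and a list of anything that is not OK. The sketch below assumes the `records` layout shown above; the `"error"` status value and the `summarize_dimms` name are hypothetical and are there only to exercise the check.

```python
from collections import defaultdict


def summarize_dimms(response):
    """Count DIMM records per node and collect those whose status is not "ok"."""
    per_node = defaultdict(int)
    unhealthy = []
    for record in response.get("records", []):
        per_node[record["node"]] += 1
        status = record.get("status", "")
        if status.lower() != "ok":
            unhealthy.append((record["node"], record["fru_name"], status))
    return dict(per_node), unhealthy


sample = {
    "records": [
        {"node": "snes1p301_01", "subsystem": "memory", "fru_name": "DIMM-13", "status": "ok"},
        {"node": "snes1p301_02", "subsystem": "memory", "fru_name": "DIMM-13", "status": "ok"},
        # Hypothetical failing record, not taken from the real output.
        {"node": "snes1p301_02", "subsystem": "memory", "fru_name": "DIMM-1", "status": "error"},
    ],
    "num_records": 3,
}

counts, unhealthy = summarize_dimms(sample)
print(counts)
print(unhealthy)
```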

Thanks for your great work.

@cgrinds
Collaborator

cgrinds commented Jun 28, 2022

We'll document how to add memory panels to the cluster dashboard and check if the EMS collector is a better fit for detecting this problem too.

@rahulguptajss
Contributor

@faguayot here is the documentation on how to collect DIMM status through the REST collector.

@rahulguptajss
Contributor

@faguayot I have also added a DIMM panel to this discussion here for your reference. We do not plan to add this panel to Harvest yet. Please let us know your feedback.

You can also use the EMS collector to track DIMM-related EMS messages (already available in the nightly build, and coming in our next official release).
EMS doc here

@faguayot
Author

faguayot commented Aug 3, 2022

Hello @rahulguptajss

I've updated the configuration following the steps you shared, and I've created a new dashboard with the code you shared as well.

This is the result for the DIMM panel.

image

Why don't you plan to add this panel to Harvest yet? Don't you think it would be useful? Or are there other implications that I'm not aware of? Maybe it is because it is part of the REST collector?

Thanks.
Best regards.

@rahulguptajss
Contributor

That's great. Ideally, the shared template would have returned only the memory subsystem. We have added filtering support for the private CLI in the latest nightly build to fix that.

You are right, this is because the REST collector is currently in beta.

@faguayot
Author

faguayot commented Aug 5, 2022

@rahulguptajss I looked in this nightly build for the change that adds filtering support to the private CLI, but I couldn't find any change. My custom_dimm.yaml has the same code #1187

@rahulguptajss
Contributor

@faguayot Could you share the output of bin/harvest --version? This commit has the filtering support for the private CLI.

@faguayot
Author

faguayot commented Aug 9, 2022

I have the latest release version in my production environment.

harvest version 22.05.0-1 (commit 2bc2942) (build date 2022-05-11T07:56:11-0400) linux/amd64

@rahulguptajss
Contributor

@faguayot Private CLI filtering support came after 22.05.0-1, so it is currently available only in the nightly build. FYI, we are planning the next official Harvest release for later this month.

@faguayot
Author

faguayot commented Aug 9, 2022

I can only try the nightly build in a test environment, so until the feature ships in a release I won't have this functionality in my production environment.

I assumed you would release a new version this month, since you usually seem to release every three months. But the problem is that you don't plan to include this in the new version, right? Or am I confusing the memory information with the REST API? Sorry if I am mixing up different concepts.

@rahulguptajss
Contributor

@faguayot Sorry for the confusion. The REST private CLI support is in the Harvest backend (Go code), so you'll get that change with the next official release. The dashboard and template changes will not be in the next Harvest release, for the reasons mentioned above.

@faguayot
Author

faguayot commented Aug 9, 2022

Perfect, thanks for the clarification. In that case, we will update to the new version when you release it.

@rahulguptajss
Contributor

@faguayot The new release, 22.08, is now available. I am marking this issue closed. Thanks.
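
To check from a script whether an installation already includes the private CLI filtering discussed in this thread, the `bin/harvest --version` output can be compared against 22.08. This is a rough sketch: the regular expression is based only on the single version line quoted earlier in this thread, nightly builds may print something different, and `has_private_cli_filtering` is a made-up helper name.

```python
import re

# Matches release output such as "harvest version 22.05.0-1 (commit ...)".
VERSION_RE = re.compile(r"harvest version (\d+)\.(\d+)\.(\d+)")


def parse_harvest_version(output):
    """Return (year, month, patch) parsed from `bin/harvest --version` output."""
    match = VERSION_RE.search(output)
    if match is None:
        raise ValueError(f"unrecognized version output: {output!r}")
    return tuple(int(part) for part in match.groups())


def has_private_cli_filtering(output, first_release=(22, 8, 0)):
    """True if the version is at least 22.08, the release named in this thread."""
    return parse_harvest_version(output) >= first_release


production = (
    "harvest version 22.05.0-1 (commit 2bc2942) "
    "(build date 2022-05-11T07:56:11-0400) linux/amd64"
)
print(parse_harvest_version(production))
print(has_private_cli_filtering(production))
```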


4 participants