Can nvidiagpubeat be made to also export the process running on each card? #29

musiczhzhao · 2020-11-10T18:45:40Z

Since nvidiagpubeat is based on nvidia-smi, and nvidia-smi can list the processes currently using the GPU cards, nvidiagpubeat should in theory be able to export the process info as metrics. Please correct me if I am wrong.

I am interested to know whether there is any plan to do this. It would be very helpful for identifying the GPU resource usage of processes and the efficiency of the code.

All the best.

deepujain · 2020-11-10T19:32:41Z

@musiczhzhao Yes, it can. I had a piece of code for it. I will try to integrate it into nvidiagpubeat.

musiczhzhao · 2020-11-12T06:08:39Z

@deepujain Thank you! 👍

musiczhzhao · 2020-12-11T16:58:28Z

Hi @deepujain, how are things going? Just checking whether there is any update, and whether any help is needed. Best

deepujain · 2021-01-03T06:42:17Z

The changes are ready. I lost access to my GPU cluster, so testing the changes has become a challenge and created a dependency. Here is a sample.

With --query-gpu, nvidiagpubeat will generate the event below.

Publish event: {
  "@timestamp": "2021-01-03T07:27:16.080Z",
  "@metadata": {
    "beat": "nvidiagpubeat",
    "type": "doc",
    "version": "6.5.5"
  },
  "type": "nvidiagpubeat",
  "gpu_uuid": "GPU-b884db58-6340-7969-a79f-b937f3583884",
  "driver_version": "418.87.01",
  "index": 3,
  "gpu_serial": 3.20218176911e+11,
  "memory": {
    "used": 3256,
    "total": 16280
  },
  "name": "Tesla100-PCIE-16GB",
  "host": {
    "name": "AB-SJC-11111111"
  },
  "utilization": {
    "memory": 50,
    "gpu": 50
  },
  "beat": {
    "name": "AB-SJC-11111111",
    "hostname": "AB-SJC-11111111",
    "version": "6.5.5"
  },
  "pstate": 0,
  "gpu_bus_id": "00000000:19:00.0",
  "count": 4,
  "fan": {
    "speed": "[NotSupported]"
  },
  "gpuIndex": 3,
  "power": {
    "draw": 25.28,
    "limit": 250
  },
  "temperature": {
    "gpu": 24
  },
  "clocks": {
    "gr": 405,
    "sm": 405,
    "mem": 715
  }
}

With --query-compute-apps, nvidiagpubeat will generate the event below.

Publish event: {
  "@timestamp": "2021-01-03T07:29:53.633Z",
  "@metadata": {
    "beat": "nvidiagpubeat",
    "type": "doc",
    "version": "6.5.5"
  },
  "pid": 222414,
  "process_name": "python",
  "used_gpu_memory": 10,
  "gpu_bus_id": "00000000:19:00.0",
  "gpu_serial": 3.20218176911e+11,
  "beat": {
    "name": "AB-SJC-11111111",
    "hostname": "AB-SJC-11111111",
    "version": "6.5.5"
  },
  "gpu_name": "Tesla100-PCIE-16GB",
  "used_memory": 15,
  "gpuIndex": 3,
  "type": "nvidiagpubeat",
  "gpu_uuid": "GPU-b884db58-6340-7969-a79f-b937f3583884",
  "host": {
    "name": "LM-SJC-11004865"
  }
}

deepujain · 2021-01-03T09:44:10Z

@musiczhzhao I changed nvidiagpubeat to support process detail information and made it generic along the way. Please test it and share the results here (including a few sample events) for --query-compute-apps (details of active GPU processes).

Because it is generic, it can now support all types of queries. I have tested only --query-gpu and --query-compute-apps. If you plan to use other options, let me know and you can help me with testing.

nvidia-smi -h

  SELECTIVE QUERY OPTIONS:

    Allows the caller to pass an explicit list of properties to query.

    [one of]

    --query-gpu=                Information about GPU.
                                Call --help-query-gpu for more info.
    --query-supported-clocks=   List of supported clocks.
                                Call --help-query-supported-clocks for more info.
    --query-compute-apps=       List of currently active compute processes.
                                Call --help-query-compute-apps for more info.
    --query-accounted-apps=     List of accounted compute processes.
                                Call --help-query-accounted-apps for more info.
    --query-retired-pages=      List of device memory pages that have been retired.
                                Call --help-query-retired-pages for more info.

https://github.com/eBay/nvidiagpubeat#sample-event has details.
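
A minimal sketch of what "generic" could mean here, assuming nvidia-smi is run with --format=csv,noheader,nounits so that each output row lists values in the same order as the properties in the query. The function names fieldsFromQuery and rowToEvent are hypothetical and are not taken from the nvidiagpubeat source; the numeric conversion and the naive comma split are also assumptions.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// fieldsFromQuery extracts the property names from a query option such as
// "--query-compute-apps=pid,process_name". It returns nil if there is no list.
func fieldsFromQuery(query string) []string {
	i := strings.Index(query, "=")
	if i < 0 || i == len(query)-1 {
		return nil
	}
	return strings.Split(query[i+1:], ",")
}

// rowToEvent pairs each field name with the value at the same position in
// one CSV row. Values that parse as numbers are stored as numbers, everything
// else as a trimmed string. The split is naive: a value that itself contains
// a comma would break it.
func rowToEvent(fields []string, row string) (map[string]interface{}, error) {
	values := strings.Split(row, ",")
	if len(values) != len(fields) {
		return nil, fmt.Errorf("expected %d values, got %d", len(fields), len(values))
	}
	event := make(map[string]interface{}, len(fields))
	for i, name := range fields {
		v := strings.TrimSpace(values[i])
		if n, err := strconv.ParseInt(v, 10, 64); err == nil {
			event[name] = n
		} else if f, err := strconv.ParseFloat(v, 64); err == nil {
			event[name] = f
		} else {
			event[name] = v
		}
	}
	return event, nil
}

func main() {
	query := "--query-compute-apps=pid,process_name,used_gpu_memory"
	event, err := rowToEvent(fieldsFromQuery(query), "222414, python, 10")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(event["pid"], event["process_name"], event["used_gpu_memory"])
}
```

Because the field names come from the query string itself, the same two functions would work for any of the selective query options listed above, which is the property the comment describes.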

deepujain · 2021-01-06T05:05:27Z

@musiczhzhao

musiczhzhao · 2021-01-06T07:18:59Z

Hi @deepujain, Thank you! I will test it and get back to you ASAP. 👍

Best

musiczhzhao · 2021-01-16T01:14:42Z

Hello @deepujain,

Happy weekend!

I have briefly tested the new version and can confirm that it exports the application name and GPU memory usage of the application when --query-compute-apps is used.

One question I have is whether there is a way to enable both --query-gpu and --query-compute-apps so that both documents are exported. I tried enabling both in the configuration file, and it turned out that only the latter one takes effect.

For example, with the following in the configuration, it seems to export only the compute app metrics:

## --query-gpu will provide information about GPU.
query: "--query-gpu=name,gpu_bus_id,gpu_serial,gpu_uuid,driver_version,count,index,fan.speed,memory.total,memory.used,utilization.gpu,utilization.memory,temperature.gpu,power.draw,power.limit,clocks.gr,clocks.sm,clocks.mem,pstate"
## --query-compute-apps will list currently active compute processes.
query: "--query-compute-apps=gpu_name,gpu_bus_id,gpu_serial,gpu_uuid,pid,process_name,used_gpu_memory,used_memory"
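
A possible explanation, offered as an assumption rather than a confirmed diagnosis: both entries use the same key, query, and YAML parsers that accept a repeated mapping key typically keep only the last value. One way to support both would be to accept a list of queries and run nvidia-smi once per entry. The sketch below only builds the arguments for each run; smiArgs and queryKind are hypothetical names, and a list-valued setting is not part of nvidiagpubeat's existing configuration.

```go
package main

import (
	"fmt"
	"strings"
)

// smiArgs builds the nvidia-smi argument list for one query option.
// The CSV format flags are real nvidia-smi options; whether nvidiagpubeat
// passes exactly these is an assumption.
func smiArgs(query string) []string {
	return []string{query, "--format=csv,noheader,nounits"}
}

// queryKind derives a short label from a query option,
// e.g. "--query-gpu=name,index" gives "gpu". The label could be attached to
// each event so that documents from different queries can be told apart.
func queryKind(query string) string {
	name := strings.TrimPrefix(query, "--query-")
	if i := strings.Index(name, "="); i >= 0 {
		name = name[:i]
	}
	return name
}

func main() {
	// Hypothetical list-valued configuration: one entry per query.
	queries := []string{
		"--query-gpu=name,gpu_bus_id,index,memory.used",
		"--query-compute-apps=gpu_bus_id,pid,process_name,used_gpu_memory",
	}
	for _, q := range queries {
		fmt.Println(queryKind(q), smiArgs(q))
	}
}
```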

Another question: we would find it useful to have the full command line of the app. For example, if a Python script is launched with python, nvidia-smi currently shows the app only as python, without the actual script name and arguments. Searching online, we found that what people generally do is first get the pid of the application and then get the full command from the ps command (https://stackoverflow.com/questions/50264491/how-to-customize-nvidia-smi-s-output-to-show-pid-username). Can this be built in, so that the event carries the cmd just as metricbeat does?
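
As a rough illustration of the pid-to-command lookup, here is a sketch that reads /proc/<pid>/cmdline instead of shelling out to ps. It assumes a Linux host; parseCmdline and commandLine are hypothetical helpers, not existing nvidiagpubeat functions. It uses io/ioutil so that it also builds with older Go toolchains.

```go
package main

import (
	"bytes"
	"fmt"
	"io/ioutil"
	"os"
	"strconv"
)

// parseCmdline converts the raw content of /proc/<pid>/cmdline, where the
// arguments are separated by NUL bytes, into one space-joined string.
func parseCmdline(raw []byte) string {
	raw = bytes.TrimRight(raw, "\x00")
	return string(bytes.ReplaceAll(raw, []byte{0}, []byte{' '}))
}

// commandLine returns the full command line of a process (Linux only).
// The pid would come from the "pid" property of --query-compute-apps.
func commandLine(pid int) (string, error) {
	raw, err := ioutil.ReadFile("/proc/" + strconv.Itoa(pid) + "/cmdline")
	if err != nil {
		return "", err
	}
	return parseCmdline(raw), nil
}

func main() {
	// Demonstrate on the current process, since no GPU process is assumed.
	cmd, err := commandLine(os.Getpid())
	if err != nil {
		fmt.Println("could not read command line:", err)
		return
	}
	fmt.Println(cmd)
}
```

Joining with spaces loses the distinction between an argument containing a space and two separate arguments; keeping the arguments as a list would avoid that if it matters downstream.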

Best,
Zhao

deepujain · 2021-01-16T18:45:07Z

Hello Zhao,

Thank you for testing it out.
Please share sample events for both queries, --query-compute-apps and --query-gpu. It will help me update the documentation with real events. I can then close this issue, as the current code seems to have met the expectation of Issue #29.

Could you please raise separate GitHub issues for each new feature request:

1. "One question I have is whether there is a way to enable both --query-gpu and --query-compute-apps so that both documents are exported. I tried enabling both in the configuration file, and it turned out that only the latter one takes effect."
Please share expected sample events of a combined query "--query-compute-apps-and--query-gpu".
2. An enriched version of "--query-compute-apps" that gets additional details of the process.
"Searching online, we found that what people generally do is first get the pid of the application and then get the full command from the ps command (https://stackoverflow.com/questions/50264491/how-to-customize-nvidia-smi-s-output-to-show-pid-username)."

Cheers
Deepak

musiczhzhao · 2021-01-27T21:45:01Z

Hi @deepujain,

I did a bit more testing, which took some time.

Another issue we found is that the new version seems to assume that there is only one app running on each GPU card, in other words that nvidia-smi returns only 4 processes if there are 4 GPU cards on a machine. Otherwise it crashes with the following error message.

2021-01-26T12:00:20.226-0600 INFO runtime/panic.go:975 nvidiagpubeat stopped.
2021-01-26T12:00:20.259-0600 FATAL [nvidiagpubeat] instance/beat.go:154 Failed due to panic. {"panic": "runtime error: index out of range [4] with length 4", "stack": "github.com/ebay/nvidiagpubeat/vendor/github.com/elastic/beats/libbeat/cmd/instance.Run.func1.1\n\t/nvidiagpubeat/beats_dev/src/github.com/ebay/nvidiagpubeat/vendor/github.com/elastic/beats/libbeat/cmd/instance/beat.go:155\nruntime.gopanic\n\t/s0/Compilers/go/go1.14.6/src/runtime/panic.go:969\nruntime.goPanicIndex\n\t/s0/Compilers/go/go1.14.6/src/runtime/panic.go:88\ngithub.com/ebay/nvidiagpubeat/nvidia.Utilization.run\n\t/nvidiagpubeat/beats_dev/src/github.com/ebay/nvidiagpubeat/nvidia/gpu.go:122\ngithub.com/ebay/nvidiagpubeat/nvidia.Metrics.Get\n\t/nvidiagpubeat/beats_dev/src/github.com/ebay/nvidiagpubeat/nvidia/metrics.go:52\ngithub.com/ebay/nvidiagpubeat/beater.(*Nvidiagpubeat).Run\n\t/nvidiagpubeat/beats_dev/src/github.com/ebay/nvidiagpubeat/beater/nvidiagpubeat.go:73\ngithub.com/ebay/nvidiagpubeat/vendor/github.com/elastic/beats/libbeat/cmd/instance.
...

The code allocating the event is in line 71 of nvidia/gpu.go:
events := make([]common.MapStr, gpuCount, 2*gpuCount)
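
For context on why this allocation would panic: make([]T, n, 2*n) creates a slice of length n, and indexing at position n fails regardless of the larger capacity, which matches "index out of range [4] with length 4". A minimal sketch of an append-based alternative follows, assuming the loop currently writes events[i] by index (the loop itself is not shown in this thread). The event type stands in for common.MapStr and collectEvents is a hypothetical name.

```go
package main

import "fmt"

// event stands in for common.MapStr from libbeat in this sketch.
type event map[string]interface{}

// collectEvents builds one event per output row. The slice starts with
// length 0 and grows through append, so the number of rows may exceed
// gpuCount, which is used only as an initial capacity hint.
func collectEvents(rows []string, gpuCount int) []event {
	events := make([]event, 0, gpuCount)
	for _, row := range rows {
		events = append(events, event{"row": row})
	}
	return events
}

func main() {
	// Five process rows on a machine with four GPUs.
	rows := []string{"p1", "p2", "p3", "p4", "p5"}
	events := collectEvents(rows, 4)
	fmt.Println(len(events))
}
```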

I will attach the sample events in a separate post.

Best,
Zhao

deepujain self-assigned this Nov 10, 2020

deepujain added the enhancement New feature or request label Nov 10, 2020

deepujain mentioned this issue Jan 3, 2021

[Issue#29][Can nvidiagpubeat be made to also export the process runni… #30

Merged

deepujain added a commit that referenced this issue Jan 3, 2021

[Issue#29][Can nvidiagpubeat be made to also export the process runni…

cc929e7

…ng on each card? #29] (#30)

deepujain added a commit that referenced this issue Jan 3, 2021

[Issue#29][Can nvidiagpubeat be made to also export the process runni…

1414fcb

…ng on each card? #29]

deepujain mentioned this issue Jan 3, 2021

[Issue#29][Can nvidiagpubeat be made to also export the process runni… #31

Merged

deepujain added a commit that referenced this issue Jan 3, 2021

[Issue#29][Can nvidiagpubeat be made to also export the process runni…

6a42990

…ng on each card? #29] (#31)

deepujain added a commit that referenced this issue Jan 3, 2021

[Issue#29][Can nvidiagpubeat be made to also export the process runni…

6d7c14c

…ng on each card? #29]

deepujain added a commit that referenced this issue Jan 3, 2021

[Issue#29][Can nvidiagpubeat be made to also export the process runni…

4dce40f

…ng on each card? #29]

deepujain added the help wanted Extra attention is needed label Jan 13, 2022
