Can nvidiagpubeat be made to also export the process running on each card? #29

Open
musiczhzhao opened this issue Nov 10, 2020 · 10 comments

Comments

@musiczhzhao

Since nvidiagpubeat is based on nvidia-smi and nvidia-smi is able to list the processes that are currently using the gpu cards, in theory nvidiagpubeat should be able to export the process info as metrics. Please correct me if I am wrong.

I am interested to know whether there is any plan to do this. It would be very helpful for identifying the GPU resource usage of processes and the efficiency of the code.

All the best.

@deepujain deepujain self-assigned this Nov 10, 2020
@deepujain deepujain added the enhancement New feature or request label Nov 10, 2020
@deepujain
Contributor

@musiczhzhao Yes, it can. I had a piece of code for it. I will try to integrate it into nvidiagpubeat.

@musiczhzhao
Author

@deepujain Thank you! 👍

@musiczhzhao
Author

Hi @deepujain, how are things going? Just checking whether there is any update, and whether any help is needed. Best

@deepujain
Contributor

deepujain commented Jan 3, 2021

The changes are ready. I lost access to my GPU cluster, so testing the changes has become a challenge and created a dependency. Here are sample events.

With --query-gpu, nvidiagpubeat generates the event below.

Publish event: {
  "@timestamp": "2021-01-03T07:27:16.080Z",
  "@metadata": {
    "beat": "nvidiagpubeat",
    "type": "doc",
    "version": "6.5.5"
  },
  "type": "nvidiagpubeat",
  "gpu_uuid": "GPU-b884db58-6340-7969-a79f-b937f3583884",
  "driver_version": "418.87.01",
  "index": 3,
  "gpu_serial": 3.20218176911e+11,
  "memory": {
    "used": 3256,
    "total": 16280
  },
  "name": "Tesla100-PCIE-16GB",
  "host": {
    "name": "AB-SJC-11111111"
  },
  "utilization": {
    "memory": 50,
    "gpu": 50
  },
  "beat": {
    "name": "AB-SJC-11111111",
    "hostname": "AB-SJC-11111111",
    "version": "6.5.5"
  },
  "pstate": 0,
  "gpu_bus_id": "00000000:19:00.0",
  "count": 4,
  "fan": {
    "speed": "[NotSupported]"
  },
  "gpuIndex": 3,
  "power": {
    "draw": 25.28,
    "limit": 250
  },
  "temperature": {
    "gpu": 24
  },
  "clocks": {
    "gr": 405,
    "sm": 405,
    "mem": 715
  }
}

With --query-compute-apps, nvidiagpubeat generates the event below.

Publish event: {
  "@timestamp": "2021-01-03T07:29:53.633Z",
  "@metadata": {
    "beat": "nvidiagpubeat",
    "type": "doc",
    "version": "6.5.5"
  },
  "pid": 222414,
  "process_name": "python",
  "used_gpu_memory": 10,
  "gpu_bus_id": "00000000:19:00.0",
  "gpu_serial": 3.20218176911e+11,
  "beat": {
    "name": "AB-SJC-11111111",
    "hostname": "AB-SJC-11111111",
    "version": "6.5.5"
  },
  "gpu_name": "Tesla100-PCIE-16GB",
  "used_memory": 15,
  "gpuIndex": 3,
  "type": "nvidiagpubeat",
  "gpu_uuid": "GPU-b884db58-6340-7969-a79f-b937f3583884",
  "host": {
    "name": "LM-SJC-11004865"
  }
}
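For readers comparing the two events with the query strings, here is a minimal Go sketch of how one comma-separated row of nvidia-smi output could be paired with the queried field names to form such an event. It is an illustration only, not the code in nvidiagpubeat: the function name toEvent and the sample row are hypothetical, and it assumes dotted field names such as memory.used are nested (as the "memory" object above suggests) and that values which parse as numbers are stored as numbers.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// toEvent pairs query field names with one CSV row and nests dotted names,
// so "memory.used" ends up as event["memory"]["used"].
// Hypothetical helper for illustration; not taken from nvidiagpubeat.
func toEvent(fields []string, csvRow string) map[string]interface{} {
	values := strings.Split(csvRow, ",")
	event := map[string]interface{}{}
	for i, field := range fields {
		if i >= len(values) {
			break // fewer values than fields: ignore the rest
		}
		raw := strings.TrimSpace(values[i])
		// Keep the raw string unless the value parses as a number.
		var value interface{} = raw
		if n, err := strconv.ParseFloat(raw, 64); err == nil {
			value = n
		}
		parts := strings.SplitN(field, ".", 2)
		if len(parts) == 1 {
			event[field] = value
			continue
		}
		// Reuse the nested object if an earlier field already created it.
		group, ok := event[parts[0]].(map[string]interface{})
		if !ok {
			group = map[string]interface{}{}
			event[parts[0]] = group
		}
		group[parts[1]] = value
	}
	return event
}

func main() {
	// Sample row assembled by hand from values in the event above.
	fields := []string{"name", "memory.used", "memory.total", "fan.speed"}
	row := "Tesla100-PCIE-16GB, 3256, 16280, [NotSupported]"
	fmt.Println(toEvent(fields, row))
}
```

Splitting on commas is enough for a sketch; a real implementation would more likely use a CSV reader and decide per field whether a numeric conversion is wanted.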

@deepujain
Contributor

deepujain commented Jan 3, 2021

@musiczhzhao I changed nvidiagpubeat to support process details and made it generic along the way. Please test it and share the results here (including a few sample events) for --query-compute-apps (active GPU process details).

Because it is generic, it can now support all types of queries. I have tested only --query-gpu and --query-compute-apps. If you plan to use other options, let me know and you can help me with testing.

nvidia-smi -h

  SELECTIVE QUERY OPTIONS:

    Allows the caller to pass an explicit list of properties to query.

    [one of]

    --query-gpu=                Information about GPU.
                                Call --help-query-gpu for more info.
    --query-supported-clocks=   List of supported clocks.
                                Call --help-query-supported-clocks for more info.
    --query-compute-apps=       List of currently active compute processes.
                                Call --help-query-compute-apps for more info.
    --query-accounted-apps=     List of accounted compute processes.
                                Call --help-query-accounted-apps for more info.
    --query-retired-pages=      List of device memory pages that have been retired.
                                Call --help-query-retired-pages for more info.

https://github.com/eBay/nvidiagpubeat#sample-event has details.

@deepujain
Contributor

@musiczhzhao

@musiczhzhao
Author

Hi @deepujain, Thank you! I will test it and get back to you ASAP. 👍

Best

@musiczhzhao
Author

Hello @deepujain,

Happy weekend!

I have briefly tested the new version and confirm it can export the application name and gpu memory usage of the application when --query-compute-apps is used.

One question I have is whether there is a way to enable both --query-gpu and --query-compute-apps so that both documents are exported. I tried enabling both in the configuration file, and only the later one took effect.

For example, with the following in the configuration, it seems to export only the compute app metrics:


## --query-gpu will provide information about GPU.
query: "--query-gpu=name,gpu_bus_id,gpu_serial,gpu_uuid,driver_version,count,index,fan.speed,memory.total,memory.used,utilization.gpu,utilization.memory,temperature.gpu,power.draw,power.limit,clocks.gr,clocks.sm,clocks.mem,pstate"
## --query-compute-apps will list currently active compute processes.
query: "--query-compute-apps=gpu_name,gpu_bus_id,gpu_serial,gpu_uuid,pid,process_name,used_gpu_memory,used_memory"


Another question: we would find it useful to have the full command line of the app. For example, if a Python script is launched with python, nvidia-smi currently shows the app just as python, without the script name and arguments. Searching online, we found that people generally first get the pid of the application and then get the full command from the ps command (https://stackoverflow.com/questions/50264491/how-to-customize-nvidia-smi-s-output-to-show-pid-username). Can this be built in, so that the event carries the cmd just as metricbeat does?

Best,
Zhao
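
To make the request concrete, here is a minimal Go sketch of the pid-to-command-line lookup. It is not part of nvidiagpubeat. Instead of calling ps, it reads /proc/<pid>/cmdline, which assumes a Linux host; the function names parseCmdline and cmdlineForPID are hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"strings"
)

// parseCmdline converts the raw contents of /proc/<pid>/cmdline, where the
// arguments are separated by NUL bytes, into a space-joined command line.
func parseCmdline(raw []byte) string {
	raw = bytes.TrimRight(raw, "\x00")
	if len(raw) == 0 {
		return "" // kernel threads and exited processes have no cmdline
	}
	return strings.Join(strings.Split(string(raw), "\x00"), " ")
}

// cmdlineForPID looks up the full command line of a process (Linux only).
func cmdlineForPID(pid int) (string, error) {
	raw, err := os.ReadFile(fmt.Sprintf("/proc/%d/cmdline", pid))
	if err != nil {
		return "", err
	}
	return parseCmdline(raw), nil
}

func main() {
	// Demonstrate with this program's own pid; a beat would pass the pid
	// reported by --query-compute-apps instead.
	cmd, err := cmdlineForPID(os.Getpid())
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Println(cmd)
}
```

The lookup can fail if the process exits between the nvidia-smi call and the read, so a caller would probably treat an error as "command line unknown" rather than drop the event.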

@deepujain
Contributor

Hello Zhao,

Thank you for testing it out.
Please share sample events for both queries, --query-compute-apps and --query-gpu. They will help me update the documentation with real events. I can then close this issue, as the current code seems to have met the expectation of Issue #29.

Could you please raise separate GitHub issues for each new feature request?

  1. One question I have is whether there is a way to enable both --query-gpu and --query-compute-apps so that both documents are exported. I tried enabling both in the configuration file, and only the later one took effect.
    Please share the expected sample events for a combined query "--query-compute-apps-and--query-gpu"

  2. Enriched version of "--query-compute-apps" to get additional process details.
    Searching online, we found that people generally first get the pid of the application and then get the full command from the ps command. (https://stackoverflow.com/questions/50264491/how-to-customize-nvidia-smi-s-output-to-show-pid-username)

Cheers
Deepak

@musiczhzhao
Author

Hi @deepujain,

I did a bit more testing, which took some time.

Another issue we found is that the new version seems to assume there is only one app running on each GPU card, or that nvidia-smi returns only 4 processes if there are 4 GPU cards on a machine. Otherwise it crashes with the following error message.

2021-01-26T12:00:20.226-0600 INFO runtime/panic.go:975 nvidiagpubeat stopped.
2021-01-26T12:00:20.259-0600 FATAL [nvidiagpubeat] instance/beat.go:154 Failed due to panic. {"panic": "runtime error: index out of range [4] with length 4", "stack": "github.com/ebay/nvidiagpubeat/vendor/github.com/elastic/beats/libbeat/cmd/instance.Run.func1.1\n\t/nvidiagpubeat/beats_dev/src/github.com/ebay/nvidiagpubeat/vendor/github.com/elastic/beats/libbeat/cmd/instance/beat.go:155\nruntime.gopanic\n\t/s0/Compilers/go/go1.14.6/src/runtime/panic.go:969\nruntime.goPanicIndex\n\t/s0/Compilers/go/go1.14.6/src/runtime/panic.go:88\ngithub.com/ebay/nvidiagpubeat/nvidia.Utilization.run\n\t/nvidiagpubeat/beats_dev/src/github.com/ebay/nvidiagpubeat/nvidia/gpu.go:122\ngithub.com/ebay/nvidiagpubeat/nvidia.Metrics.Get\n\t/nvidiagpubeat/beats_dev/src/github.com/ebay/nvidiagpubeat/nvidia/metrics.go:52\ngithub.com/ebay/nvidiagpubeat/beater.(*Nvidiagpubeat).Run\n\t/nvidiagpubeat/beats_dev/src/github.com/ebay/nvidiagpubeat/beater/nvidiagpubeat.go:73\ngithub.com/ebay/nvidiagpubeat/vendor/github.com/elastic/beats/libbeat/cmd/instance.
...

The code allocating the events is at line 71 of nvidia/gpu.go:
events := make([]common.MapStr, gpuCount, 2*gpuCount)

I will attach the sample events in a separate post.

Best,
Zhao
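
The panic message fits how Go slices work: make([]T, n, 2*n) yields a slice of length n, and indexing at n or beyond panics even though capacity is left. Assuming the loop at gpu.go:122 writes events[i] once per output row (that code is not shown in this thread, so this is a guess), a fifth process on a 4-GPU machine would trigger exactly this error. Below is a minimal sketch of one possible fix that grows the slice with append. The function name buildEvents is hypothetical, and a plain map stands in for common.MapStr.

```go
package main

import "fmt"

// buildEvents creates one event per parsed row. Starting at length 0 and
// using append lets the slice grow to however many rows nvidia-smi returns.
func buildEvents(headers []string, rows [][]string) []map[string]interface{} {
	events := make([]map[string]interface{}, 0, len(rows))
	for _, row := range rows {
		event := make(map[string]interface{}, len(headers))
		for i, h := range headers {
			if i < len(row) {
				event[h] = row[i]
			}
		}
		events = append(events, event)
	}
	return events
}

func main() {
	gpuCount := 4
	fixed := make([]map[string]interface{}, gpuCount, 2*gpuCount)
	// Length is 4 and capacity is 8: fixed[4] would still panic with
	// "index out of range [4] with length 4".
	fmt.Println(len(fixed), cap(fixed))

	headers := []string{"pid", "process_name"}
	rows := [][]string{
		{"101", "python"}, {"102", "python"}, {"103", "python"},
		{"104", "python"}, {"105", "python"}, // five processes on four GPUs
	}
	fmt.Println(len(buildEvents(headers, rows)))
}
```

Sizing by the number of rows rather than by gpuCount also covers the case of fewer processes than GPUs, where a fixed-length slice would otherwise carry empty entries.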

@deepujain deepujain added the help wanted Extra attention is needed label Jan 13, 2022