wrong GPU id (enumeration) when different models are used #72

Closed
aleqx opened this issue Aug 31, 2020 · 12 comments

@aleqx

aleqx commented Aug 31, 2020

T-Rex enumerates GPUs in the wrong order when the machine has more than one model. For example, this machine has four GTX 1070s and five GTX 1080 Tis. T-Rex reports:

20200830 20:11:31 GPU #0: Gigabyte GTX 1080 Ti
20200830 20:11:31 GPU #1: EVGA GTX 1080 Ti
20200830 20:11:31 GPU #2: MSI GTX 1080 Ti
20200830 20:11:31 GPU #3: Gigabyte GTX 1080 Ti
20200830 20:11:31 GPU #4: Gigabyte GTX 1080 Ti
20200830 20:11:31 GPU #5: ASUS GTX 1070
20200830 20:11:31 GPU #6: EVGA GTX 1070
20200830 20:11:31 GPU #7: ASUS GTX 1070
20200830 20:11:31 GPU #8: EVGA GTX 1070

Whereas the actual ordering by PCI address is:

02:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
04:00.0 VGA compatible controller: NVIDIA Corporation GP104 [GeForce GTX 1070] (rev a1)
05:00.0 VGA compatible controller: NVIDIA Corporation GP104 [GeForce GTX 1070] (rev a1)
06:00.0 VGA compatible controller: NVIDIA Corporation GP104 [GeForce GTX 1070] (rev a1)
0a:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
0b:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
0d:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
0e:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
0f:00.0 VGA compatible controller: NVIDIA Corporation GP104 [GeForce GTX 1070] (rev a1)
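
To make the mismatch concrete, here is a minimal Python sketch that numbers the lspci lines above from 0 in ascending PCI address order, which is the numbering one would expect under PCI-bus ordering. The function name `pci_ordered_models` is made up for illustration and is not part of T-Rex or any NVIDIA tool.

```python
import re

# lspci lines copied from the listing above.
LSPCI_OUTPUT = """\
02:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
04:00.0 VGA compatible controller: NVIDIA Corporation GP104 [GeForce GTX 1070] (rev a1)
05:00.0 VGA compatible controller: NVIDIA Corporation GP104 [GeForce GTX 1070] (rev a1)
06:00.0 VGA compatible controller: NVIDIA Corporation GP104 [GeForce GTX 1070] (rev a1)
0a:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
0b:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
0d:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
0e:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
0f:00.0 VGA compatible controller: NVIDIA Corporation GP104 [GeForce GTX 1070] (rev a1)
"""

# Matches "bus:device.function" at the start and the bracketed model name.
LINE_RE = re.compile(r"^([0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f])\s.*\[(.+?)\]")


def pci_ordered_models(lspci_text):
    """Return (index, pci_address, model) tuples, index 0 = lowest PCI address."""
    devices = []
    for line in lspci_text.splitlines():
        match = LINE_RE.match(line)
        if match:
            devices.append((match.group(1), match.group(2)))
    # Compare addresses numerically (hex fields), not as raw strings.
    devices.sort(key=lambda d: tuple(int(p, 16) for p in re.split(r"[:.]", d[0])))
    return [(i, addr, model) for i, (addr, model) in enumerate(devices)]


for idx, addr, model in pci_ordered_models(LSPCI_OUTPUT):
    print(f"GPU #{idx}: {addr} {model}")
```

Under this numbering, indices 1 to 3 are GTX 1070s, whereas the T-Rex log above shows GTX 1080 Tis at those positions.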

@trexminer
Owner

Could you please try the --ab-indexing parameter and report back?

@aleqx
Author

aleqx commented Sep 1, 2020

That would make device numbering start from 1 instead of 0, and I want it to start from 0. Why order them differently from the driver (whose order is the same as the PCI bus) anyway?

@trexminer
Owner

The default order matches the device numbering used by the CUDA function cudaGetDeviceProperties; we don't do any re-ordering ourselves.
--ab-indexing was added to match the Afterburner order, which ascends by PCI ID and starts from 1.

@aleqx
Author

aleqx commented Sep 3, 2020

Well, cudaGetDeviceProperties only fetches properties of a single GPU. Even then, the minor field returned by cudaGetDeviceProperties matches the ordering of the PCI bus, and is the same as the minor field exposed by the driver in /proc/driver/nvidia/gpus/<PCI_BUS_ID>/information. Here's proof:

# grep Minor /proc/driver/nvidia/gpus/*/information

/proc/driver/nvidia/gpus/0000:02:00.0/information:Device Minor:          0
/proc/driver/nvidia/gpus/0000:04:00.0/information:Device Minor:          1
/proc/driver/nvidia/gpus/0000:05:00.0/information:Device Minor:          2
/proc/driver/nvidia/gpus/0000:06:00.0/information:Device Minor:          3
/proc/driver/nvidia/gpus/0000:0a:00.0/information:Device Minor:          4
/proc/driver/nvidia/gpus/0000:0b:00.0/information:Device Minor:          5
/proc/driver/nvidia/gpus/0000:0d:00.0/information:Device Minor:          6
/proc/driver/nvidia/gpus/0000:0e:00.0/information:Device Minor:          7
/proc/driver/nvidia/gpus/0000:0f:00.0/information:Device Minor:          8

And here are the models in the same order:

# grep Model /proc/driver/nvidia/gpus/*/information

/proc/driver/nvidia/gpus/0000:02:00.0/information:Model:                 GeForce GTX 1080 Ti
/proc/driver/nvidia/gpus/0000:04:00.0/information:Model:                 GeForce GTX 1070
/proc/driver/nvidia/gpus/0000:05:00.0/information:Model:                 GeForce GTX 1070
/proc/driver/nvidia/gpus/0000:06:00.0/information:Model:                 GeForce GTX 1070
/proc/driver/nvidia/gpus/0000:0a:00.0/information:Model:                 GeForce GTX 1080 Ti
/proc/driver/nvidia/gpus/0000:0b:00.0/information:Model:                 GeForce GTX 1080 Ti
/proc/driver/nvidia/gpus/0000:0d:00.0/information:Model:                 GeForce GTX 1080 Ti
/proc/driver/nvidia/gpus/0000:0e:00.0/information:Model:                 GeForce GTX 1080 Ti
/proc/driver/nvidia/gpus/0000:0f:00.0/information:Model:                 GeForce GTX 1070
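
The two grep listings can be joined into one table keyed by PCI bus ID. The following is only a sketch: it parses text in the `path:Key: value` shape shown above (a three-line sample is embedded so it runs without an NVIDIA driver), and the helper names `parse_grep` and `minor_table` are invented for illustration.

```python
# Sample lines in the same shape as the grep output above.
MINOR_LINES = """\
/proc/driver/nvidia/gpus/0000:02:00.0/information:Device Minor:          0
/proc/driver/nvidia/gpus/0000:04:00.0/information:Device Minor:          1
/proc/driver/nvidia/gpus/0000:0a:00.0/information:Device Minor:          4
"""

MODEL_LINES = """\
/proc/driver/nvidia/gpus/0000:02:00.0/information:Model:                 GeForce GTX 1080 Ti
/proc/driver/nvidia/gpus/0000:04:00.0/information:Model:                 GeForce GTX 1070
/proc/driver/nvidia/gpus/0000:0a:00.0/information:Model:                 GeForce GTX 1080 Ti
"""


def parse_grep(text):
    """Map PCI bus ID -> field value for lines like '<dir>/<bus>/information:Key: value'."""
    values = {}
    for line in text.splitlines():
        path, sep, field = line.partition("/information:")
        if not sep:
            continue  # not a line from an 'information' file
        bus_id = path.rsplit("/", 1)[-1]
        _key, _, value = field.partition(":")
        values[bus_id] = value.strip()
    return values


def minor_table(minor_text, model_text):
    """Return (minor, bus_id, model) rows sorted by device minor."""
    minors = parse_grep(minor_text)
    models = parse_grep(model_text)
    rows = [(int(minors[bus]), bus, models.get(bus, "unknown")) for bus in minors]
    return sorted(rows)


for minor, bus, model in minor_table(MINOR_LINES, MODEL_LINES):
    print(f"minor {minor}: {bus} {model}")
```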

Your code outputs this ordering:

GPU #0: Gigabyte GTX 1080 Ti
GPU #1: EVGA GTX 1080 Ti
GPU #2: MSI GTX 1080 Ti
GPU #3: Gigabyte GTX 1080 Ti
GPU #4: Gigabyte GTX 1080 Ti
GPU #5: ASUS GTX 1070
GPU #6: EVGA GTX 1070
GPU #7: ASUS GTX 1070
GPU #8: EVGA GTX 1070

Which is clearly wrong. It seems you are grouping models together, 1080 Ti and 1070 (why?).

If your code is using the minor field from cudaGetDeviceProperties, then your code is doing something wrong. If your code is not using the minor field, then it should.

You are also trying to match the manufacturer (MSI, Gigabyte, ASUS, etc -- afaik the driver isn't exposing this info, are you matching the gpu UUID against an external database or is it from the gpu bios?). That part of your code may be responsible for this buggy behavior of bad ordering. Regardless, it's wrong :)

@trexminer
Owner

trexminer commented Sep 3, 2020

We are not grouping or trying to match anything.

cudaGetDeviceProperties is defined as follows:

__host__ cudaError_t cudaGetDeviceProperties(cudaDeviceProp* prop, int device)

Returns information about the compute-device.

Parameters

  • prop: Properties for the specified device
  • device: Device number to get properties for

"Device number" is what we use to order the GPUs.
--ab-indexing will get you PCI bus order if that's what you're after. Just add/subtract 1 from the indices.
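
As a sketch of the "add/subtract 1" conversion, assuming (as stated above) that --ab-indexing lists devices in ascending PCI order starting from 1. The function names are hypothetical and are not part of T-Rex.

```python
def ab_to_zero_based(ab_index):
    """Convert a 1-based --ab-indexing style index to a 0-based PCI-order index."""
    if ab_index < 1:
        raise ValueError(f"1-based index must be >= 1, got {ab_index}")
    return ab_index - 1


def zero_based_to_ab(index):
    """Convert a 0-based PCI-order index to its 1-based equivalent."""
    if index < 0:
        raise ValueError(f"0-based index must be >= 0, got {index}")
    return index + 1


# The first device is 1 in the 1-based scheme and 0 in the 0-based one.
print(ab_to_zero_based(1), zero_based_to_ab(0))
```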

@aleqx
Author

aleqx commented Sep 3, 2020

http://developer.download.nvidia.com/compute/cuda/4_0/toolkit/docs/online/group__CUDART__DEVICE_g5aa4f47938af8276f08074d09b7d520c.html

device is an input. For proper ordering of devices you should use the device minor, which is a property of the device. This is the standard way, as used by the NVIDIA driver itself when listing devices.

@aleqx
Author

aleqx commented Sep 3, 2020

Actually, I forgot that I have set the global env var CUDA_DEVICE_ORDER=PCI_BUS_ID (see https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars)

All miner software except T-Rex orders the devices by PCI bus ID, so it seems T-Rex is ignoring that env var. Could you please honor it?

The default is FASTEST_FIRST, which is undesirable for mining (CUDA tries to determine which GPU is fastest and makes that one device 0, and leaves all other devices in an unspecified order).

This is better than using --ab-indexing, especially when using the --devices option (I don't even know what indexing the --devices option expects, since the order of the GPUs would be unspecified).

@aleqx
Author

aleqx commented Sep 3, 2020

My bad. You do actually honor the CUDA_DEVICE_ORDER env var. It was a bug on my side: I was running the miner as a different user with sudo -u and forgot to preserve env vars (sudo -E).

All works as intended now. No need for --ab-indexing :)

Sorry for wasting your time with this, I should have caught it sooner.
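
The failure mode described here, a variable set in the parent environment but dropped before the child process starts, can be reproduced without sudo or a GPU. This is a rough sketch: the child is a one-line Python process standing in for the miner, and `child_sees` is an invented helper name.

```python
import os
import subprocess
import sys

# Stand-in for the miner: a child process that reports what it sees.
CHILD_CODE = "import os; print(os.environ.get('CUDA_DEVICE_ORDER', 'unset'))"


def child_sees(env):
    """Run the stand-in child with the given environment and return its report."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# Environment with the variable set, as intended.
with_var = dict(os.environ, CUDA_DEVICE_ORDER="PCI_BUS_ID")

# Same environment with the variable dropped, which is the effect of
# launching through sudo -u without preserving the environment.
without_var = {k: v for k, v in with_var.items() if k != "CUDA_DEVICE_ORDER"}

print("preserved:", child_sees(with_var))
print("dropped:  ", child_sees(without_var))
```

When the variable is missing, the child has no way to know it was ever set, so CUDA would fall back to its default ordering.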

@trexminer
Owner

No worries. Thanks for the input.
In the next version we'll add a new parameter, --pci-indexing, which will do the same for those who can't or don't want to set env variables.

@aleqx
Author

aleqx commented Sep 3, 2020

You do have a small bug though: if I use --ab-indexing, then --devices doesn't work as expected, e.g. specifying 0 crashes the miner. I also wonder about the actual ordering, since by default CUDA has an unspecified order: is the device ordering for --devices the same as the device ordering for --ab-indexing? There may be another bug there if you first select devices using --devices and then reorder using --ab-indexing, which would give unspecified behavior.

CUDA_DEVICE_ORDER=PCI_BUS_ID is the safest option. I still encourage you to use it by default (as most other miner software does).
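
To illustrate why the ordering matters for device selection, here is a small sketch using the two orderings reported earlier in this thread (models only). It does not reflect how T-Rex implements --devices; `select` is a hypothetical helper that just picks list positions.

```python
# Order logged by the miner at startup (models only).
DEFAULT_ORDER = ["1080 Ti"] * 5 + ["1070"] * 4

# Order by PCI address, from the lspci listing.
PCI_ORDER = [
    "1080 Ti", "1070", "1070", "1070",
    "1080 Ti", "1080 Ti", "1080 Ti", "1080 Ti", "1070",
]


def select(order, indices):
    """Return the models found at the given 0-based positions of an ordering."""
    picked = []
    for i in indices:
        if not 0 <= i < len(order):
            raise IndexError(f"device index {i} out of range 0..{len(order) - 1}")
        picked.append(order[i])
    return picked


wanted = [1, 2, 3]
print("default order:", select(DEFAULT_ORDER, wanted))
print("PCI order:    ", select(PCI_ORDER, wanted))
```

The same three indices pick three 1080 Tis under one ordering and three 1070s under the other, so a device list is only meaningful once the ordering is fixed.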

@trexminer
Owner

Added --pci-indexing in 0.17.2

@aleqx
Author

aleqx commented Sep 18, 2020

> Added --pci-indexing in 0.17.2

Nice. I see you also added ethash ... is it any good? :P
