
Wrong gradients on Windows-GPU #20471

Open
matteosal opened this issue Jul 27, 2021 · 17 comments · May be fixed by #20814

Comments

@matteosal
Contributor

sym.zip
I only see this on Windows. Download the symbol file and run this script:

import mxnet as mx

json_path = 'sym.json'
sym = mx.sym.load(json_path)

def run_example(ctx, reqs):
	ex = sym._bind(
		ctx,
		{
			'.Inputs.Input': mx.ndarray.array([[1, 2, 3]], ctx=ctx),
			'.Inputs.Target': mx.ndarray.array([[4, 5, 6]], ctx=ctx),
			'seq_715248120': mx.ndarray.array([3], ctx=ctx)
		},
		args_grad={
			'.Inputs.Input': mx.ndarray.zeros([1, 3], ctx=ctx),
			'.Inputs.Target': mx.ndarray.zeros([1, 3], ctx=ctx),
			'seq_715248120': mx.ndarray.zeros([1], ctx=ctx)
		},
		grad_req=dict(zip(['.Inputs.Input', '.Inputs.Target', 'seq_715248120'], reqs))
	)

	ex.forward()
	ex.backward(out_grads=[mx.ndarray.array([1], ctx=ctx), mx.ndarray.array([1], ctx=ctx)])

	print(ex.grad_dict)

print('Input + Target gradient, CPU (OK):')
run_example(mx.cpu(), ['write', 'write', 'null'])
print('\n')
print('Input + Target gradient, GPU (OK):')
run_example(mx.gpu(), ['write', 'write', 'null'])
print('\n')
print('Target gradient only, CPU (OK):')
run_example(mx.cpu(), ['null', 'write', 'null'])
print('\n')
print('Target gradient only, GPU (WRONG):')
run_example(mx.gpu(), ['null', 'write', 'null'])

Output is:

Input + Target gradient, CPU (OK):
{'.Inputs.Input':
[[-0.33333334 -0.33333334 -0.33333334]]
<NDArray 1x3 @cpu(0)>, '.Inputs.Target':
[[0.33333334 0.33333334 0.33333334]]
<NDArray 1x3 @cpu(0)>, 'seq_715248120': None}


Input + Target gradient, GPU (OK):
{'.Inputs.Input':
[[-0.33333334 -0.33333334 -0.33333334]]
<NDArray 1x3 @gpu(0)>, '.Inputs.Target':
[[0.33333334 0.33333334 0.33333334]]
<NDArray 1x3 @gpu(0)>, 'seq_715248120': None}


Target gradient only, CPU (OK):
{'.Inputs.Input': None, '.Inputs.Target':
[[0.33333334 0.33333334 0.33333334]]
<NDArray 1x3 @cpu(0)>, 'seq_715248120': None}


Target gradient only, GPU (WRONG):
{'.Inputs.Input': None, '.Inputs.Target':
[[-0.33333334 -0.33333334 -0.33333334]]
<NDArray 1x3 @gpu(0)>, 'seq_715248120': None}

The Target gradient has the sign flipped in the last example.
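
The same pattern (a gradient requested only for the second argument of a subtraction) can also be written as a small standalone autograd sketch. This is only illustrative: it assumes the graph essentially reduces to an element-wise subtraction, and mx.ndarray.elemwise_sub is a guess for the operator involved rather than something read off sym.json. On a correct build it should print all -1 for b.grad:

import mxnet as mx

ctx = mx.gpu()  # mx.cpu() gives the expected sign
a = mx.ndarray.array([[1, 2, 3]], ctx=ctx)
b = mx.ndarray.array([[4, 5, 6]], ctx=ctx)

# Request a gradient only for b, mirroring the ['null', 'write', 'null'] case above
b.attach_grad()
with mx.autograd.record():
    y = mx.ndarray.elemwise_sub(a, b)
y.backward(mx.ndarray.ones_like(y))

print(b.grad)  # d(a - b)/db is -1 everywhere; positive values would match the sign flip

Whether this minimal version actually triggers the flip on Windows may depend on which backward kernel gets selected, so the attached symbol remains the reliable repro.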

@matteosal
Contributor Author

I see the same sign flip with this other symbol, which can be fed to the same script above:
sym2.zip

And with this one:
sym3.zip
It goes with this script:

import numpy as np
import mxnet as mx

json_path = 'sym3.json'
sym = mx.sym.load(json_path)

input_1 = np.random.rand(1, 2, 3, 4).tolist()
input_2 = np.random.rand(1, 2, 4).tolist()
input_3 = np.random.rand(1, 2).tolist()

def run_example(ctx, reqs):
	ex = sym._bind(
		ctx,
		{
			'.Inputs.Input1': mx.ndarray.array(input_1, ctx=ctx),
			'.Inputs.Input2': mx.ndarray.array(input_2, ctx=ctx),
			'.Inputs.Input3': mx.ndarray.array(input_3, ctx=ctx)
		},
		args_grad={
			'.Inputs.Input1': mx.ndarray.zeros([1, 2, 3, 4], ctx=ctx),
			'.Inputs.Input2': mx.ndarray.zeros([1, 2, 4], ctx=ctx),
			'.Inputs.Input3': mx.ndarray.zeros([1, 2], ctx=ctx)
		},
		grad_req=dict(zip(['.Inputs.Input1', '.Inputs.Input2', '.Inputs.Input3'], reqs))
	)

	ex.forward()
	ex.backward(out_grads=[mx.ndarray.ones([1, 2, 3, 4], ctx=ctx)])

	print(ex.grad_dict['.Inputs.Input2'])

print('Input1 + Input2 gradient, CPU (OK):')
run_example(mx.cpu(), ['write', 'write', 'null'])
print('\n')
print('Input1 + Input2 gradient, GPU (OK):')
run_example(mx.gpu(), ['write', 'write', 'null'])
print('\n')
print('Input2 gradient only, CPU (OK):')
run_example(mx.cpu(), ['null', 'write', 'null'])
print('\n')
print('Input2 gradient only, GPU (WRONG):')
run_example(mx.gpu(), ['null', 'write', 'null'])

Output is:

Input1 + Input2 gradient, CPU (OK):

[[[-3. -2. -3. -2.]
  [ 0. -2. -2. -3.]]]
<NDArray 1x2x4 @cpu(0)>


Input1 + Input2 gradient, GPU (OK):

[[[-3. -2. -3. -2.]
  [ 0. -2. -2. -3.]]]
<NDArray 1x2x4 @gpu(0)>


Input2 gradient only, CPU (OK):

[[[-3. -2. -3. -2.]
  [ 0. -2. -2. -3.]]]
<NDArray 1x2x4 @cpu(0)>


Input2 gradient only, GPU (WRONG):

[[[3. 2. 3. 2.]
  [0. 2. 2. 3.]]]
<NDArray 1x2x4 @gpu(0)>

@TristonC
Contributor

@matteosal Which version of MXNet did you use?

@TristonC
Contributor

TristonC commented Jul 27, 2021

With your sym3 example, here is what I got with MXNet 1.9 on Linux. I'm not sure whether this issue only occurs on Windows. @matteosal, did you try it on Linux?

Input1 + Input2 gradient, CPU (OK):
{'.Inputs.Input1': 'write', '.Inputs.Input2': 'write', '.Inputs.Input3': 'null'}
[23:37:54] ../src/storage/storage.cc:199: Using Pooled (Naive) StorageManager for CPU

[[[-3. -2. -3. -3.]
  [ 0. -3. -1. -3.]]]
<NDArray 1x2x4 @cpu(0)>


Input1 + Input2 gradient, GPU (OK):
{'.Inputs.Input1': 'write', '.Inputs.Input2': 'write', '.Inputs.Input3': 'null'}
[23:38:01] ../src/storage/storage.cc:199: Using Pooled (Naive) StorageManager for GPU

[[[-3. -2. -3. -3.]
  [ 0. -3. -1. -3.]]]
<NDArray 1x2x4 @gpu(0)>


Input2 gradient only, CPU (OK):
{'.Inputs.Input1': 'null', '.Inputs.Input2': 'write', '.Inputs.Input3': 'null'}

[[[-3. -2. -3. -3.]
  [ 0. -3. -1. -3.]]]
<NDArray 1x2x4 @cpu(0)>


Input2 gradient only, GPU (WRONG):
{'.Inputs.Input1': 'null', '.Inputs.Input2': 'write', '.Inputs.Input3': 'null'}

[[[-3. -2. -3. -3.]
  [ 0. -3. -1. -3.]]]
<NDArray 1x2x4 @gpu(0)>

@matteosal
Contributor Author

I am using version 2.0, built from source at commit fabcd14.
I have tried the same example on Linux (built from the same commit) and the results are correct there. This issue only affects Windows.

@TristonC
Contributor

TristonC commented Aug 1, 2021

@matteosal Thanks for the update. @leezu, do you have a Windows platform to help triage the problem?

@leezu
Contributor

leezu commented Aug 2, 2021

I'm not a Windows user, so it's very hard for me to get MXNet running on Windows. @yajiedesign is a Windows expert; maybe he can help.

@chinakook
Contributor

I've tested with a 2.0 version that I modified myself on Windows, and it's OK.

Input + Target gradient, CPU (OK):
{'.Inputs.Input': 
[[-0.33333334 -0.33333334 -0.33333334]]
<NDArray 1x3 @cpu(0)>, '.Inputs.Target':
[[0.33333334 0.33333334 0.33333334]]
<NDArray 1x3 @cpu(0)>, 'seq_715248120': None}


Input + Target gradient, GPU (OK):
{'.Inputs.Input': 
[[-0.33333334 -0.33333334 -0.33333334]]
<NDArray 1x3 @gpu(0)>, '.Inputs.Target':
[[0.33333334 0.33333334 0.33333334]]
<NDArray 1x3 @gpu(0)>, 'seq_715248120': None}


Target gradient only, CPU (OK):
{'.Inputs.Input': None, '.Inputs.Target':
[[0.33333334 0.33333334 0.33333334]]
<NDArray 1x3 @cpu(0)>, 'seq_715248120': None}


Target gradient only, GPU (WRONG):
{'.Inputs.Input': None, '.Inputs.Target':
[[0.33333334 0.33333334 0.33333334]]
<NDArray 1x3 @gpu(0)>, 'seq_715248120': None}

@TristonC
Contributor

@chinakook What did you modify? Is it related to this gradient issue? Could you share it with @matteosal?

@matteosal
Contributor Author

A ping on this.
@chinakook, what modification are you talking about? Can you reproduce the problem on a plain v2.0 build?

@matteosal
Contributor Author

matteosal commented Oct 13, 2021

A ping on this. Can anyone please investigate?

@matteosal
Contributor Author

@szha @leezu another ping on this :)

@barry-jin
Contributor

@matteosal What build settings should we use to reproduce this issue?

@matteosal
Contributor Author

@barry-jin here they are:

cmake -G"Visual Studio 15 2017 Win64" -T host=x64 ^
 %= GENERAL FLAGS =% ^
 -DCMAKE_INSTALL_PREFIX=%output_dir% ^
 -DCMAKE_BUILD_TYPE=Release ^
 -DCMAKE_SKIP_BUILD_RPATH=On ^
 -DUSE_OPENCV=OFF ^
 -DUSE_F16C=Off %= float16 support =%^
 -DUSE_INT64_TENSOR_SIZE=ON ^
 -DCMAKE_C_FLAGS="-D_WIN32" ^
 -DCMAKE_CXX_FLAGS="-D_WIN32" ^
 -DCMAKE_C_FLAGS_RELEASE="/MT -DNDEBUG" ^
 -DCMAKE_CXX_FLAGS_RELEASE="/MT -DNDEBUG" ^
 -DMXNET_FORCE_SHARED_CRT=OFF %= link statically to C runtime =%^
 -DCMAKE_SHARED_LINKER_FLAGS="/DELAYLOAD:nvcuda.dll delayimp.lib" ^
 -DUSE_MXNET_LIB_NAMING=OFF ^
 %= MATH BACKENDS =% ^
 -DBLAS=MKL ^
 -DUSE_LAPACK=OFF ^
 -DUSE_ONEDNN=OFF ^
 -DBLA_VENDOR="Intel10_64ilp" ^
 -DBLA_STATIC=OFF ^
 -DMKL_USE_SINGLE_DYNAMIC_LIBRARY=OFF ^
 -DMKL_INCLUDE_DIR=%mkl_dir% ^
 -DBLAS_LIBRARIES="%mkl_dir%/libiomp5md.lib;%mkl_dir%/mkl_core_dll.lib;%mkl_dir%/mkl_intel_ilp64_dll.lib;%mkl_dir%/mkl_intel_thread_dll.lib" ^
 %= OPENMP =% ^
 -DUSE_OPENMP=ON ^
 -DOpenMP_C_FLAGS="-I%mkl_dir%" ^
 -DOpenMP_C_LIB_NAMES="libiomp5" ^
 -DOpenMP_CXX_FLAGS="-I%mkl_dir%" ^
 -DOpenMP_CXX_LIB_NAMES="libiomp5" ^
 -DOpenMP_libiomp5_LIBRARY="%mkl_dir%/libiomp5md.lib" ^
 %= CUDA =% ^
 -DUSE_CUDA=ON ^
 -DUSE_CUDNN=ON ^
 -DCUDNN_LIBRARY=%home_dir:\=/%cuDNN/lib/cudnn64_8.lib ^
 -DCUDNN_INCLUDE=%home_dir:\=/%cuDNN/include ^
 -DUSE_NCCL=OFF ^
 -DUSE_NVML=OFF ^
 -DCUDNN_ROOT=%home_dir:\=/%cuDNN ^
 -DMXNET_CUDA_ARCH="3.7"\;"5.0"\;"6.0"\;"7.0"\;"8.0+PTX" %= see Readme =%^
 -DCUDAToolkit_ROOT=%cuda_dir% ^
 -DCMAKE_CUDA_COMPILER="%cuda_dir%/bin/nvcc.exe" -I"%cuda_dir%/include" -L"%cuda_dir%/lib/x64"  ^
 -DUSE_SPLIT_ARCH_DLL=OFF ^
 %mxnet_dir%

The MKL version is 2019.4 and the CUDA version is 11.4.0.

@matteosal
Contributor Author

@barry-jin any news on this? I have rebuilt with VC2019 in order to fix this issue, but I still see the problem here.

@barry-jin
Contributor

Sorry, I'm still triaging this issue. I built with the settings in build_windows.py and can also reproduce it.

@barry-jin
Contributor

@matteosal The current workaround is to replace 'elemwise_sub' with '_npi_subtract'. There are probably some issues in the legacy subtract operator.
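
For anyone hitting this before a fix lands, one way to try the workaround without regenerating the graph is to patch the operator name in the serialized JSON before loading it. This is just a sketch and assumes the two operators take the same inputs and attributes in this particular graph:

import mxnet as mx

# Load the serialized graph, swap the legacy operator for its numpy-namespace
# counterpart, and rebuild the symbol from the patched JSON string
with open('sym.json') as f:
    graph_json = f.read()

patched_json = graph_json.replace('"elemwise_sub"', '"_npi_subtract"')
sym = mx.sym.load_json(patched_json)

The rest of the repro script can then be run unchanged against the patched symbol.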

barry-jin added a commit to barry-jin/incubator-mxnet that referenced this issue on Jan 10, 2022
barry-jin linked a pull request on Jan 10, 2022 that will close this issue
@matteosal
Contributor Author

@barry-jin thank you, I have verified that swapping the operator fixes the problem.
