Wrong gradients on Windows-GPU #20471

matteosal · 2021-07-27T12:36:39Z

sym.zip
I only see this on Windows. Download the symbol file and run this script:

import mxnet as mx

json_path = 'sym.json'
sym = mx.sym.load(json_path)

def run_example(ctx, reqs):
	ex = sym._bind(
		ctx,
		{
			'.Inputs.Input': mx.ndarray.array([[1, 2, 3]], ctx=ctx),
			'.Inputs.Target': mx.ndarray.array([[4, 5, 6]], ctx=ctx),
			'seq_715248120': mx.ndarray.array([3], ctx=ctx)
		},
		args_grad={
			'.Inputs.Input': mx.ndarray.zeros([1, 3], ctx=ctx),
			'.Inputs.Target': mx.ndarray.zeros([1, 3], ctx=ctx),
			'seq_715248120': mx.ndarray.zeros([1], ctx=ctx)
		},
		grad_req=dict(zip(['.Inputs.Input', '.Inputs.Target', 'seq_715248120'], reqs))
	)

	ex.forward()
	ex.backward(out_grads=[mx.ndarray.array([1], ctx=ctx), mx.ndarray.array([1], ctx=ctx)])

	print(ex.grad_dict)

print('Input + Target gradient, CPU (OK):')
run_example(mx.cpu(), ['write', 'write', 'null'])
print('\n')
print('Input + Target gradient, GPU (OK):')
run_example(mx.gpu(), ['write', 'write', 'null'])
print('\n')
print('Target gradient only, CPU (OK):')
run_example(mx.cpu(), ['null', 'write', 'null'])
print('\n')
print('Target gradient only, GPU (WRONG):')
run_example(mx.gpu(), ['null', 'write', 'null'])

Output is:

Input + Target gradient, CPU (OK):
{'.Inputs.Input':
[[-0.33333334 -0.33333334 -0.33333334]]
<NDArray 1x3 @cpu(0)>, '.Inputs.Target':
[[0.33333334 0.33333334 0.33333334]]
<NDArray 1x3 @cpu(0)>, 'seq_715248120': None}


Input + Target gradient, GPU (OK):
{'.Inputs.Input':
[[-0.33333334 -0.33333334 -0.33333334]]
<NDArray 1x3 @gpu(0)>, '.Inputs.Target':
[[0.33333334 0.33333334 0.33333334]]
<NDArray 1x3 @gpu(0)>, 'seq_715248120': None}


Target gradient only, CPU (OK):
{'.Inputs.Input': None, '.Inputs.Target':
[[0.33333334 0.33333334 0.33333334]]
<NDArray 1x3 @cpu(0)>, 'seq_715248120': None}


Target gradient only, GPU (WRONG):
{'.Inputs.Input': None, '.Inputs.Target':
[[-0.33333334 -0.33333334 -0.33333334]]
<NDArray 1x3 @gpu(0)>, 'seq_715248120': None}

The Target gradient has the sign flipped in the last example.
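
For anyone comparing their own runs, the flip can also be checked programmatically instead of by eye. The following is a minimal sketch in plain Python; `flatten` and `is_sign_flipped` are hypothetical helper names (not part of MXNet), and the values are copied from the output above.

```python
import math

def flatten(values):
    """Flatten arbitrarily nested lists of numbers into a flat list of floats."""
    if isinstance(values, (int, float)):
        return [float(values)]
    flat = []
    for item in values:
        flat.extend(flatten(item))
    return flat

def is_sign_flipped(reference, candidate, tol=1e-6):
    """Return True when candidate equals -reference elementwise.

    An all-zero reference is reported as not flipped, since a zero
    gradient has no sign to flip.
    """
    ref, cand = flatten(reference), flatten(candidate)
    if len(ref) != len(cand):
        raise ValueError("shape mismatch")
    if all(abs(r) <= tol for r in ref):
        return False
    return all(math.isclose(c, -r, abs_tol=tol) for r, c in zip(ref, cand))

# Target gradients copied from the CPU and GPU "Target gradient only" runs.
cpu_target_grad = [[0.33333334, 0.33333334, 0.33333334]]
gpu_target_grad = [[-0.33333334, -0.33333334, -0.33333334]]

print(is_sign_flipped(cpu_target_grad, gpu_target_grad))
print(is_sign_flipped(cpu_target_grad, cpu_target_grad))
```

With real executors, the nested lists could be obtained from the entries of `ex.grad_dict` via `.asnumpy().tolist()`.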


matteosal · 2021-07-27T15:12:24Z

I see the same sign flip with this other symbol (which can be fed to the same script above):
sym2.zip

And with this one
sym3.zip
Which goes with this script:

import numpy as np
import mxnet as mx

json_path = 'sym3.json'
sym = mx.sym.load(json_path)

input_1 = np.random.rand(1, 2, 3, 4).tolist()
input_2 = np.random.rand(1, 2, 4).tolist()
input_3 = np.random.rand(1, 2).tolist()

def run_example(ctx, reqs):
	ex = sym._bind(
		ctx,
		{
			'.Inputs.Input1': mx.ndarray.array(input_1, ctx=ctx),
			'.Inputs.Input2': mx.ndarray.array(input_2, ctx=ctx),
			'.Inputs.Input3': mx.ndarray.array(input_3, ctx=ctx)
		},
		args_grad={
			'.Inputs.Input1': mx.ndarray.zeros([1, 2, 3, 4], ctx=ctx),
			'.Inputs.Input2': mx.ndarray.zeros([1, 2, 4], ctx=ctx),
			'.Inputs.Input3': mx.ndarray.zeros([1, 2], ctx=ctx)
		},
		grad_req=dict(zip(['.Inputs.Input1', '.Inputs.Input2', '.Inputs.Input3'], reqs))
	)

	ex.forward()
	ex.backward(out_grads=[mx.ndarray.ones([1, 2, 3, 4], ctx=ctx)])

	print(ex.grad_dict['.Inputs.Input2'])

print('Input1 + Input2 gradient, CPU (OK):')
run_example(mx.cpu(), ['write', 'write', 'null'])
print('\n')
print('Input1 + Input2 gradient, GPU (OK):')
run_example(mx.gpu(), ['write', 'write', 'null'])
print('\n')
print('Input2 gradient only, CPU (OK):')
run_example(mx.cpu(), ['null', 'write', 'null'])
print('\n')
print('Input2 gradient only, GPU (WRONG):')
run_example(mx.gpu(), ['null', 'write', 'null'])

Output is

Input1 + Input2 gradient, CPU (OK):

[[[-3. -2. -3. -2.]
  [ 0. -2. -2. -3.]]]
<NDArray 1x2x4 @cpu(0)>


Input1 + Input2 gradient, GPU (OK):

[[[-3. -2. -3. -2.]
  [ 0. -2. -2. -3.]]]
<NDArray 1x2x4 @gpu(0)>


Input2 gradient only, CPU (OK):

[[[-3. -2. -3. -2.]
  [ 0. -2. -2. -3.]]]
<NDArray 1x2x4 @cpu(0)>


Input2 gradient only, GPU (WRONG):

[[[3. 2. 3. 2.]
  [0. 2. 2. 3.]]]
<NDArray 1x2x4 @gpu(0)>

TristonC · 2021-07-27T23:08:02Z

@matteosal Which version of MXNet did you use?

TristonC · 2021-07-27T23:39:29Z

With your sym3 example, here is what I got with MXNet 1.9 on Linux. I'm not sure whether this issue only occurs on Windows. @matteosal Did you try it on Linux?

Input1 + Input2 gradient, CPU (OK):
{'.Inputs.Input1': 'write', '.Inputs.Input2': 'write', '.Inputs.Input3': 'null'}
[23:37:54] ../src/storage/storage.cc:199: Using Pooled (Naive) StorageManager for CPU

[[[-3. -2. -3. -3.]
  [ 0. -3. -1. -3.]]]
<NDArray 1x2x4 @cpu(0)>


Input1 + Input2 gradient, GPU (OK):
{'.Inputs.Input1': 'write', '.Inputs.Input2': 'write', '.Inputs.Input3': 'null'}
[23:38:01] ../src/storage/storage.cc:199: Using Pooled (Naive) StorageManager for GPU

[[[-3. -2. -3. -3.]
  [ 0. -3. -1. -3.]]]
<NDArray 1x2x4 @gpu(0)>


Input2 gradient only, CPU (OK):
{'.Inputs.Input1': 'null', '.Inputs.Input2': 'write', '.Inputs.Input3': 'null'}

[[[-3. -2. -3. -3.]
  [ 0. -3. -1. -3.]]]
<NDArray 1x2x4 @cpu(0)>


Input2 gradient only, GPU (WRONG):
{'.Inputs.Input1': 'null', '.Inputs.Input2': 'write', '.Inputs.Input3': 'null'}

[[[-3. -2. -3. -3.]
  [ 0. -3. -1. -3.]]]
<NDArray 1x2x4 @gpu(0)>

matteosal · 2021-07-30T09:39:57Z

I am using version 2.0, built from source at commit fabcd14
I have tried the same example on Linux (building from the same commit) and the results are good there. This issue only affects Windows.
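
Since the behaviour differs by platform and build, it may help to attach an environment summary to each set of results. This is a small sketch, assuming only that `mxnet` exposes `__version__`; the `environment_report` helper is hypothetical. For a source build the commit hash is not reported here and has to be noted by hand.

```python
import platform
import sys

def environment_report():
    """Collect the OS, Python version and MXNet version (if importable)."""
    report = {
        "platform": platform.system(),
        "platform_release": platform.release(),
        "python": sys.version.split()[0],
    }
    try:
        import mxnet as mx
        report["mxnet"] = mx.__version__
    except (ImportError, OSError):
        # MXNet is missing or its native library failed to load.
        report["mxnet"] = "not available"
    return report

for key, value in environment_report().items():
    print(f"{key}: {value}")
```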

TristonC · 2021-08-01T18:59:38Z

@matteosal Thanks for the update. @leezu Do you have a Windows platform to help triage the problem?

leezu · 2021-08-02T20:59:40Z

I'm not a Windows user, so it's very hard for me to get MXNet running on Windows. @yajiedesign is a Windows expert, maybe he can help.

chinakook · 2021-08-10T07:19:52Z

I've tested with a 2.0 version that I modified myself on Windows, and it's OK.

Input + Target gradient, CPU (OK):
{'.Inputs.Input': 
[[-0.33333334 -0.33333334 -0.33333334]]
<NDArray 1x3 @cpu(0)>, '.Inputs.Target':
[[0.33333334 0.33333334 0.33333334]]
<NDArray 1x3 @cpu(0)>, 'seq_715248120': None}


Input + Target gradient, GPU (OK):
{'.Inputs.Input': 
[[-0.33333334 -0.33333334 -0.33333334]]
<NDArray 1x3 @gpu(0)>, '.Inputs.Target':
[[0.33333334 0.33333334 0.33333334]]
<NDArray 1x3 @gpu(0)>, 'seq_715248120': None}


Target gradient only, CPU (OK):
{'.Inputs.Input': None, '.Inputs.Target':
[[0.33333334 0.33333334 0.33333334]]
<NDArray 1x3 @cpu(0)>, 'seq_715248120': None}


Target gradient only, GPU (WRONG):
{'.Inputs.Input': None, '.Inputs.Target':
[[0.33333334 0.33333334 0.33333334]]
<NDArray 1x3 @gpu(0)>, 'seq_715248120': None}

TristonC · 2021-08-19T00:14:00Z

@chinakook What did you modify? Is it related to this gradient issue? Could you share it with @matteosal?

matteosal · 2021-08-30T16:39:20Z

A ping on this
@chinakook what modification are you talking about? Can you reproduce the problem on a plain v2.0 build?

matteosal · 2021-10-13T16:05:57Z

A ping on this. Can anyone please investigate?

matteosal · 2021-11-15T17:56:19Z

@szha @leezu another ping on this :)

barry-jin · 2021-11-16T17:13:06Z

@matteosal What build settings should we use to reproduce this issue?

matteosal · 2021-11-18T14:18:07Z

@barry-jin here they are:

cmake -G"Visual Studio 15 2017 Win64" -T host=x64 ^
 %= GENERAL FLAGS =% ^
 -DCMAKE_INSTALL_PREFIX=%output_dir% ^
 -DCMAKE_BUILD_TYPE=Release ^
 -DCMAKE_SKIP_BUILD_RPATH=On ^
 -DUSE_OPENCV=OFF ^
 -DUSE_F16C=Off %= float16 support =%^
 -DUSE_INT64_TENSOR_SIZE=ON ^
 -DCMAKE_C_FLAGS="-D_WIN32" ^
 -DCMAKE_CXX_FLAGS="-D_WIN32" ^
 -DCMAKE_C_FLAGS_RELEASE="/MT -DNDEBUG" ^
 -DCMAKE_CXX_FLAGS_RELEASE="/MT -DNDEBUG" ^
 -DMXNET_FORCE_SHARED_CRT=OFF %= link statically to C runtime =%^
 -DCMAKE_SHARED_LINKER_FLAGS="/DELAYLOAD:nvcuda.dll delayimp.lib" ^
 -DUSE_MXNET_LIB_NAMING=OFF ^
 %= MATH BACKENDS =% ^
 -DBLAS=MKL ^
 -DUSE_LAPACK=OFF ^
 -DUSE_ONEDNN=OFF ^
 -DBLA_VENDOR="Intel10_64ilp" ^
 -DBLA_STATIC=OFF ^
 -DMKL_USE_SINGLE_DYNAMIC_LIBRARY=OFF ^
 -DMKL_INCLUDE_DIR=%mkl_dir% ^
 -DBLAS_LIBRARIES="%mkl_dir%/libiomp5md.lib;%mkl_dir%/mkl_core_dll.lib;%mkl_dir%/mkl_intel_ilp64_dll.lib;%mkl_dir%/mkl_intel_thread_dll.lib" ^
 %= OPENMP =% ^
 -DUSE_OPENMP=ON ^
 -DOpenMP_C_FLAGS="-I%mkl_dir%" ^
 -DOpenMP_C_LIB_NAMES="libiomp5" ^
 -DOpenMP_CXX_FLAGS="-I%mkl_dir%" ^
 -DOpenMP_CXX_LIB_NAMES="libiomp5" ^
 -DOpenMP_libiomp5_LIBRARY="%mkl_dir%/libiomp5md.lib" ^
 %= CUDA =% ^
 -DUSE_CUDA=ON ^
 -DUSE_CUDNN=ON ^
 -DCUDNN_LIBRARY=%home_dir:\=/%cuDNN/lib/cudnn64_8.lib ^
 -DCUDNN_INCLUDE=%home_dir:\=/%cuDNN/include ^
 -DUSE_NCCL=OFF ^
 -DUSE_NVML=OFF ^
 -DCUDNN_ROOT=%home_dir:\=/%cuDNN ^
 -DMXNET_CUDA_ARCH="3.7"\;"5.0"\;"6.0"\;"7.0"\;"8.0+PTX" %= see Readme =%^
 -DCUDAToolkit_ROOT=%cuda_dir% ^
 -DCMAKE_CUDA_COMPILER="%cuda_dir%/bin/nvcc.exe" -I"%cuda_dir%/include" -L"%cuda_dir%/lib/x64"  ^
 -DUSE_SPLIT_ARCH_DLL=OFF ^
 %mxnet_dir%

MKL version is 2019.4 and CUDA version is 11.4.0

matteosal · 2022-01-06T18:05:50Z

@barry-jin any news on this? I have rebuilt with VC2019 in order to fix this issue but I still see this problem here

barry-jin · 2022-01-10T05:32:40Z

Sorry, I'm still triaging this issue. I built with settings in build_window.py and can also reproduce this issue.

barry-jin · 2022-01-10T21:38:51Z

@matteosal The current workaround is to replace 'elemwise_sub' with '_npi_subtract'. There are probably some issues in the legacy subtract operator.
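
One way to apply this swap to an already exported symbol file is to patch the operator names in the JSON directly. The sketch below is an illustration and not necessarily how the workaround was applied in this thread. It assumes the symbol JSON has a top-level "nodes" list whose entries carry an "op" field, and that '_npi_subtract' accepts the same inputs as 'elemwise_sub' in the graph at hand. `replace_op` is a hypothetical helper and the toy graph is made up for the example.

```python
import json

def replace_op(symbol_json, old_op="elemwise_sub", new_op="_npi_subtract"):
    """Return (patched JSON string, number of nodes whose op was replaced)."""
    graph = json.loads(symbol_json)
    replaced = 0
    for node in graph.get("nodes", []):
        if node.get("op") == old_op:
            node["op"] = new_op
            replaced += 1
    return json.dumps(graph, indent=2), replaced

# Toy graph standing in for the contents of a file such as sym.json.
toy = json.dumps({
    "nodes": [
        {"op": "null", "name": "a", "inputs": []},
        {"op": "null", "name": "b", "inputs": []},
        {"op": "elemwise_sub", "name": "diff", "inputs": [[0, 0, 0], [1, 0, 0]]},
    ],
    "arg_nodes": [0, 1],
    "heads": [[2, 0, 0]],
})

patched, count = replace_op(toy)
print(count)
print([node["op"] for node in json.loads(patched)["nodes"]])
```

The patched string can be written back to a file and loaded with `mx.sym.load` as in the scripts above; the gradients should then be rechecked on both CPU and GPU.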

matteosal · 2022-01-12T11:19:21Z

@barry-jin thank you, I have verified that swapping the operator fixes the problem

matteosal added the Bug and needs triage labels on Jul 27, 2021

barry-jin added a commit to barry-jin/incubator-mxnet that referenced this issue Jan 10, 2022

[BUGFIX] Fix apache#20471

02b50bf

barry-jin linked a pull request Jan 10, 2022 that will close this issue

[BUGFIX] Fix #20471 #20814
