feat: add support for train on windows #37

Open · Wang-zipeng wants to merge 5 commits into master
Conversation

Wang-zipeng

Implements training on Windows.
Compile steps (Visual Studio 2017 required):

  1. Set up the compile environment:
     run "C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Auxiliary\Build\vcvars64.bat" in a Windows cmd prompt.
  2. Run "set DISTUTILS_USE_SDK=1".
  3. Compile:
     enter the code folder and run "python setup.py develop".

Train steps:

  1. Enter the "playground\centernet.res18.coco.512size" folder and run "python train_net.py", or run it from an IDE such as PyCharm. If you use PyCharm, make sure python.exe's directory is in the PATH environment variable.
  2. To train resnet50/101, just copy train_net.py to "playground\centernet.res50.coco.512size" or "playground\centernet.res101.coco.512size".

Other notes:

  1. "os.statvfs" is unavailable on Windows, so I removed the disk check; I think it doesn't matter (a portable alternative is sketched below).
  2. "os.getuid()" is unavailable on Windows, so I used a constant named "User_name"; I think it is impossible to train on a Windows cluster anyway.
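For what it's worth, the disk check need not be dropped entirely: `shutil.disk_usage` from the standard library is a cross-platform replacement for `os.statvfs`. A minimal sketch (the GB conversion mirrors the `free_space_Gb` naming quoted later in this thread):

```python
import shutil

def free_space_gb(path: str = ".") -> float:
    # shutil.disk_usage works on Windows, Linux, and macOS,
    # unlike os.statvfs, which is POSIX-only.
    return shutil.disk_usage(path).free / 1024 ** 3
```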

```diff
@@ -66,7 +66,7 @@ def default_argument_parser():
     # PyTorch still may leave orphan processes in multi-gpu training.
     # Therefore we use a deterministic way to obtain port,
     # so that users are aware of orphan processes by seeing the port occupied.
-    port = 2 ** 15 + 2 ** 14 + hash(os.getuid()) % 2 ** 14
+    port = 2 ** 15 + 2 ** 14 + hash("User_name") % 2 ** 14
```
Owner

hash("User_name") is a fix value, please don't do that.

Author

I know it's a fixed value, but I think it is impossible to train on an 8-GPU Windows machine. I will find a way to get a uid on Windows.
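For reference, `getpass.getuser()` is the usual portable way to get the user name; a later commit in this PR adopts it:

```python
import getpass

# getpass.getuser() works on both POSIX and Windows: it checks the
# LOGNAME, USER, LNAME, and USERNAME environment variables, then
# falls back to the pwd database on POSIX.
port = 2 ** 15 + 2 ** 14 + hash(getpass.getuser()) % 2 ** 14
```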

```diff
@@ -334,7 +338,7 @@ at::Tensor ROIAlign_forward_cuda(
   auto output_size = num_rois * pooled_height * pooled_width * channels;
   cudaStream_t stream = at::cuda::getCurrentCUDAStream();

-  dim3 grid(std::min(at::cuda::ATenCeilDiv(output_size, 512L), 4096L));
+  dim3 grid(std::min(ceil_div((int)output_size, 512), 4096));
```
Owner

at::cuda::ATenCeilDiv works on all platforms; the real reason this fails on Windows is the 'L' suffix: long is only 32 bits under MSVC, so the 512L/4096L literals don't match the 64-bit output_size and the template argument can't be deduced.

Author

I will change it and try to recompile.

Author

If I remove the "L", will this function still run correctly on Linux? Can I simply drop the "L"?

```diff
@@ -390,7 +394,7 @@ at::Tensor ROIAlign_backward_cuda(

   cudaStream_t stream = at::cuda::getCurrentCUDAStream();

-  dim3 grid(std::min(at::cuda::ATenCeilDiv(grad.numel(), 512L), 4096L));
+  dim3 grid(std::min(ceil_div((int)grad.numel(), 512), 4096));
```
Owner

ditto

Author

Same as last one.

```diff
@@ -52,7 +52,7 @@
     SOLVER=dict(
         OPTIMIZER=dict(
             NAME="SGD",
-            BASE_LR=0.02,
+            BASE_LR=0.002,
```
Owner

please do not change this, thanks.

Author

0.02 is too big for one GPU; I will change it back.
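Context for the disagreement: under the linear scaling rule, a learning rate tuned for multi-GPU training is divided by the GPU-count ratio rather than hard-coded. A hypothetical sketch (the 8-GPU reference is an assumption based on the 8-GPU machine mentioned above; the names are illustrative, not from the repo):

```python
# Linear scaling rule: LR scales with total batch size, i.e. with
# the number of GPUs when the per-GPU batch size is fixed.
REFERENCE_GPUS = 8                           # assumed setup the 0.02 LR was tuned for
num_gpus = 1                                 # single-GPU Windows training
base_lr = 0.02 * num_gpus / REFERENCE_GPUS   # -> 0.0025
```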

```diff
@@ -0,0 +1,126 @@
+# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved
```
Owner

Such a file duplicates tools/train_net.py; you should consider combining them.

Author

OK, I will try to use the same training entry point as on Linux.
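One hypothetical way to avoid the duplication, assuming tools/train_net.py exposes a main() entry point (not verified against the repo):

```python
# playground/.../train_net.py: a thin wrapper instead of a full copy.
from tools.train_net import main  # assumed entry point, not verified

if __name__ == "__main__":
    main()
```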

Owner

FateScript left a comment

@Wang-zipeng
Author

> PTAL @Wang-zipeng

I just searched Google for what PTAL means ("please take a look").

```diff
@@ -66,7 +67,7 @@ def default_argument_parser():
     # PyTorch still may leave orphan processes in multi-gpu training.
     # Therefore we use a deterministic way to obtain port,
     # so that users are aware of orphan processes by seeing the port occupied.
-    port = 2 ** 15 + 2 ** 14 + hash(os.getuid()) % 2 ** 14
+    port = 2 ** 15 + 2 ** 14 + hash(getuser()) % 2 ** 14
```
Owner

Suggested change:

```diff
-    port = 2 ** 15 + 2 ** 14 + hash(getuser()) % 2 ** 14
+    port = 2 ** 15 + 2 ** 14 + hash(os.getuid() if sys.platform != "win32" else 1) % 2 ** 14
```
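One reason to prefer the integer fallback over hash(getuser()): since Python 3.3, str hashes are randomized per process (PYTHONHASHSEED), so a string-derived port is not actually deterministic across runs, whereas small non-negative ints hash to themselves. A sketch of the suggestion as a standalone helper (the function name is mine, not the repo's):

```python
import os
import sys

def default_port() -> int:
    # os.getuid() is POSIX-only; fall back to a constant uid on Windows.
    # hash(n) == n for small non-negative ints, so the port is stable
    # across runs, unlike hash() of a string.
    uid = os.getuid() if sys.platform != "win32" else 1
    return 2 ** 15 + 2 ** 14 + hash(uid) % 2 ** 14
```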

```diff
@@ -334,7 +338,7 @@ at::Tensor ROIAlign_forward_cuda(
   auto output_size = num_rois * pooled_height * pooled_width * channels;
   cudaStream_t stream = at::cuda::getCurrentCUDAStream();

-  dim3 grid(std::min(at::cuda::ATenCeilDiv(output_size, 512L), 4096L));
+  dim3 grid(std::min(at::cuda::ATenCeilDiv(static_cast<int64_t>(output_size), static_cast<int64_t>(512)), static_cast<int64_t>(4096)));
```
Owner

It's better to break this long line of code.

```diff
@@ -390,7 +394,7 @@ at::Tensor ROIAlign_backward_cuda(

   cudaStream_t stream = at::cuda::getCurrentCUDAStream();

-  dim3 grid(std::min(at::cuda::ATenCeilDiv(grad.numel(), 512L), 4096L));
+  dim3 grid(std::min(at::cuda::ATenCeilDiv(static_cast<int64_t>(grad.numel()), static_cast<int64_t>(512)), static_cast<int64_t>(4096)));
```
Owner

ditto.

setup.py (outdated)

```diff
@@ -39,6 +41,8 @@ def get_extensions():
         "-D__CUDA_NO_HALF_CONVERSIONS__",
         "-D__CUDA_NO_HALF2_OPERATORS__",
     ]
+    if "Windows" == os_name:
```
Owner

Is sys.platform suitable for your case?
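For context, the two checks differ slightly; a minimal sketch of the idiom suggested elsewhere in this review (the branch body is hypothetical):

```python
import sys

# sys.platform is a constant set when the interpreter is built:
# "win32" on Windows (even 64-bit), "linux" on Linux, "darwin" on macOS.
# platform.system() queries the OS at runtime and returns "Windows"/"Linux"/...
if sys.platform == "win32":
    extra_compile_args = []  # hypothetical: MSVC-specific flags would go here
```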

```python
        if eval_space_Gb > free_space_Gb:
            logger.warning(f"{Fore.RED}Remaining space({free_space_Gb}GB) "
                           f"is less than ({eval_space_Gb}GB){Style.RESET_ALL}")
        if "Linux" == platform.system():
```
Owner

Suggested change:

```diff
-        if "Linux" == platform.system():
+        if sys.platform == "linux":
```

Owner

FateScript left a comment

Remember that Python is not C++; code like `if a = 1` is invalid, since plain `=` assignment is a statement, not an expression, in Python.
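A quick illustration of the distinction (the walrus form requires Python 3.8+):

```python
a = 1
if a == 1:           # comparison: valid
    print("equal")
# if a = 1:          # SyntaxError: plain assignment is a statement
if (b := a) == 1:    # assignment expression (Python 3.8+): valid
    print(b)
```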
