When I train with one GPU there is no problem, but when I use two or four GPUs the error below occurs. The log output is:
terminate called after throwing an instance of 'caffe2::EnforceNotMet'
what(): [enforce fail at context_gpu.h:170] . Encountered CUDA error: an illegal memory access was encountered Error from operator:
input: "gpu_0/rpn_cls_logits_fpn2_w_grad" input: "gpu_1/rpn_cls_logits_fpn2_w_grad" output: "gpu_0/rpn_cls_logits_fpn2_w_grad" name: "" type: "Add" device_option { device_type: 1 cuda_gpu_id: 0 }
*** Aborted at 1516866180 (unix time) try "date -d @1516866180" if you are using GNU date ***
terminate called recursively
terminate called recursively
terminate called recursively
PC: @ 0x7ff67559f428 gsignal
terminate called recursively
terminate called recursively
E0125 07:43:00.745853 55683 pybind_state.h:422] Exception encountered running PythonOp function: RuntimeError: [enforce fail at context_gpu.h:307] error == cudaSuccess. 77 vs 0. Error at: /mnt/hzhida/project/caffe2/caffe2/core/context_gpu.h:307: an illegal memory access was encountered
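For reference, the operator that fails in the log is Caffe2's cross-GPU gradient reduction: an `Add` pinned to GPU 0 whose second input (`gpu_1/rpn_cls_logits_fpn2_w_grad`) lives on GPU 1, so it relies on a working cross-device memory access path. Below is a minimal standalone sketch of that same pattern (not taken from my run; the blob shape is arbitrary and only for illustration). Running it on the same machine can show whether a plain cross-GPU `Add` succeeds outside of training:

```python
import numpy as np
from caffe2.proto import caffe2_pb2
from caffe2.python import core, workspace

# Put one copy of the gradient blob on each GPU (shape chosen arbitrarily).
for gpu_id in (0, 1):
    with core.DeviceScope(core.DeviceOption(caffe2_pb2.CUDA, gpu_id)):
        workspace.FeedBlob(
            "gpu_{}/rpn_cls_logits_fpn2_w_grad".format(gpu_id),
            np.random.randn(3, 256, 1, 1).astype(np.float32),
        )

# The reduction step from the log: Add runs on GPU 0 but one of its inputs
# is a blob on GPU 1, mirroring the failing operator.
add_op = core.CreateOperator(
    "Add",
    ["gpu_0/rpn_cls_logits_fpn2_w_grad", "gpu_1/rpn_cls_logits_fpn2_w_grad"],
    ["gpu_0/rpn_cls_logits_fpn2_w_grad"],
    device_option=core.DeviceOption(caffe2_pb2.CUDA, 0),
)
workspace.RunOperatorOnce(add_op)
print(workspace.FetchBlob("gpu_0/rpn_cls_logits_fpn2_w_grad").shape)
```

If this small case also raises an illegal memory access, the problem is in cross-GPU access on the machine rather than in the training script itself.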