Tensor.to("cuda:1") core dumped #7095

Closed
Flowingsun007 opened this issue Dec 23, 2021 · 2 comments · Fixed by #7159

Flowingsun007 commented Dec 23, 2021

Summary

When I try to move a tensor from device 0 to device 1 and then print it, the process crashes.
tensor.to("cuda") works as expected, but tensor.to("cuda:1") (or any "cuda:x") crashes,
and the crash only happens in print(tensor); print(tensor.numpy()) works fine.

Code to reproduce bug

>>> import torch
>>> x = torch.tensor([[1., 2.], [3., 4.]])
>>> x
tensor([[1., 2.],
        [3., 4.]])
>>> x.to("cuda:1")
tensor([[1., 2.],
        [3., 4.]], device='cuda:1')
>>> import oneflow as flow
>>> x = flow.tensor([[1., 2.], [3., 4.]])
>>> x.to("cuda")
tensor([[1., 2.],
        [3., 4.]], device='cuda:0', dtype=oneflow.float32)
>>> x.to("cuda:1")
F1226 16:15:23.055996 876333 memcpy.cpp:38] Check failed: cudaMemcpyAsync(dst, src, count, cudaMemcpyDefault, cuda_stream->cuda_stream()) : an illegal memory access was encountered (700) 
*** Check failure stack trace: ***
    @     0x7f263d8929a0  google::LogMessage::Fail()
    @     0x7f263d8928db  google::LogMessage::SendToLog()
    @     0x7f263d89220c  google::LogMessage::Flush()
    @     0x7f263d89581a  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f263463cb14  oneflow::ep::primitive::(anonymous namespace)::MemcpyImpl::Launch()
    @     0x7f263579c855  oneflow::AutoMemcpy()
    @     0x7f263579ce29  oneflow::SyncAutoMemcpy()
    @     0x7f26412c6644  oneflow::OfBlob::AutoMemCopyTo<>()
    @     0x7f26412ae29e  oneflow::BlobBufferCopyUtil<>::To()
    @     0x7f26412aa4df  oneflow::BlobNumpyCopyUtil<>::To()
    @     0x7f26412b9298  _ZZZN7oneflow3one33CopyBetweenMirroredTensorAndNumpyIiEENS_5MaybeIvvEERKSt10shared_ptrINS0_6TensorEEP7_objectPFS3_mRKNS_13NumPyArrayPtrEERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEbENKUlmE_clEmENKUlPKcE_clESQ_
    @     0x7f26412b91ff  _ZZN7oneflow3one33CopyBetweenMirroredTensorAndNumpyIiEENS_5MaybeIvvEERKSt10shared_ptrINS0_6TensorEEP7_objectPFS3_mRKNS_13NumPyArrayPtrEERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEbENKUlmE_clEm
    @     0x7f26412f1dfd  _ZNSt17_Function_handlerIFvmEZN7oneflow3one33CopyBetweenMirroredTensorAndNumpyIiEENS1_5MaybeIvvEERKSt10shared_ptrINS2_6TensorEEP7_objectPFS5_mRKNS1_13NumPyArrayPtrEERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEbEUlmE_E9_M_invokeERKSt9_Any_dataOm
    @     0x7f26332b2019  std::function<>::operator()()
    @     0x7f263493eae9  _ZZN7oneflow19InstructionsBuilder24SyncAccessBlobByCallbackISt10shared_ptrINS_3one14MirroredTensorEEEENS_5MaybeIvvEET_RKS2_INS_11SpinCounterEES2_ISt8functionIFvmEEERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEENKUlmE_clEm
    @     0x7f263495ae04  _ZNSt17_Function_handlerIFvmEZN7oneflow19InstructionsBuilder24SyncAccessBlobByCallbackISt10shared_ptrINS1_3one14MirroredTensorEEEENS1_5MaybeIvvEET_RKS4_INS1_11SpinCounterEES4_ISt8functionIS0_EERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEEUlmE_E9_M_invokeERKSt9_Any_dataOm
    @     0x7f26332b2019  std::function<>::operator()()
    @     0x7f26332ad04d  oneflow::vm::AccessBlobByCallbackInstructionType::Compute()
    @     0x7f2636a715d3  oneflow::vm::CudaStreamType::Compute()
    @     0x7f26332adf08  oneflow::vm::StreamType::Compute()
    @     0x7f2636a9e3ff  oneflow::vm::StreamType::Run()
    @     0x7f2636ac7bad  oneflow::vm::VirtualMachineEngine::DispatchInstruction()
    @     0x7f2636ac7801  oneflow::vm::VirtualMachineEngine::DispatchAndPrescheduleInstructions()
    @     0x7f2636ac9818  oneflow::vm::VirtualMachineEngine::Schedule()
    @     0x7f2636ab3fa4  oneflow::VirtualMachine::Loop()
    @     0x7f2636ac253f  _ZSt13__invoke_implIvMN7oneflow14VirtualMachineEFvRKSt8functionIFvvEEEPS1_JS4_EET_St21__invoke_memfun_derefOT0_OT1_DpOT2_
    @     0x7f2636ac22f5  _ZSt8__invokeIMN7oneflow14VirtualMachineEFvRKSt8functionIFvvEEEJPS1_S4_EENSt15__invoke_resultIT_JDpT0_EE4typeEOSB_DpOSC_
    @     0x7f2636ac2101  _ZNSt6thread8_InvokerISt5tupleIJMN7oneflow14VirtualMachineEFvRKSt8functionIFvvEEEPS3_S6_EEE9_M_invokeIJLm0ELm1ELm2EEEEvSt12_Index_tupleIJXspT_EEE
    @     0x7f2636ac2019  _ZNSt6thread8_InvokerISt5tupleIJMN7oneflow14VirtualMachineEFvRKSt8functionIFvvEEEPS3_S6_EEEclEv
    @     0x7f2636ac1fa8  _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJMN7oneflow14VirtualMachineEFvRKSt8functionIFvvEEEPS4_S7_EEEEE6_M_runEv
    @     0x7f261abb7de4  (unknown)
    @     0x7f2642dad609  start_thread
Aborted (core dumped)

After trying this commit: #5783, the bug most likely lies in the tensor print function.

The following code also reproduces it:

import oneflow as flow

x1 = flow.tensor([[1., 2.], [3., 4.]], device="cpu")
print("x1 >>>>>>>>>>>> \n", x1)
x2 = x1.to("cuda:1")
print("x2.numpy() >>>>>>>> \n", x2.numpy())
print("x2 >>>>>>>>>>> \n", x2)

Output

x1 >>>>>>>>>>>> 
 tensor([[1., 2.],
        [3., 4.]], dtype=oneflow.float32)
x2.numpy() >>>>>>>> 
 [[1. 2.]
 [3. 4.]]
x2 >>>>>>>>>>>
F1226 15:53:55.720588 2838954 new_kernel_util.cu:24] Check failed: cudaMemcpyAsync(dst, src, sz, cudaMemcpyDefault, ctx->cuda_stream()) : an illegal memory access was encountered (700) 
*** Check failure stack trace: ***
    @     0x7fd092f00c76  google::LogMessage::Fail()
    @     0x7fd092f00bb1  google::LogMessage::SendToLog()
    @     0x7fd092f004e2  google::LogMessage::Flush()
    @     0x7fd092f03ad0  google::LogMessageFatal::~LogMessageFatal()
    @     0x7fd08b3b3344  oneflow::Memcpy<>()
    @     0x7fd08d526127  oneflow::AutoMemcpy()
    @     0x7fd08d526184  oneflow::SyncAutoMemcpy()
    @     0x7fd08b18aa82  oneflow::OfBlob::AutoMemCopyTo<>()
    @     0x7fd08b1842ed  _ZZN7oneflow19OfBlob_CopyToBufferIfEEvmN8pybind117array_tIT_Li16EEEENKUlvE_clEv
    @     0x7fd08b194442  _ZNSt17_Function_handlerIFvvEZN7oneflow19OfBlob_CopyToBufferIfEEvmN8pybind117array_tIT_Li16EEEEUlvE_E9_M_invokeERKSt9_Any_data
    @     0x7fd08afc34e8  std::function<>::operator()()
    @     0x7fd08b22c7dc  oneflow::GILForeignLockHelper::WithScopedAcquire()
    @     0x7fd08b1843b0  oneflow::OfBlob_CopyToBuffer<>()
    @     0x7fd08b16e595  _ZZZN7oneflow3one12_GLOBAL__N_133CopyBetweenMirroredTensorAndNumpyIfEENS_5MaybeIvvEERKSt10shared_ptrINS0_6TensorEEN8pybind117array_tIT_Li16EEEPFvmSD_ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEENKUlPNS_19InstructionsBuilderEE_clESP_ENKUlmE_clEm
    @     0x7fd08b17999b  _ZNSt17_Function_handlerIFvmEZZN7oneflow3one12_GLOBAL__N_133CopyBetweenMirroredTensorAndNumpyIfEENS1_5MaybeIvvEERKSt10shared_ptrINS2_6TensorEEN8pybind117array_tIT_Li16EEEPFvmSF_ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEENKUlPNS1_19InstructionsBuilderEE_clESR_EUlmE_E9_M_invokeERKSt9_Any_dataOm
    @     0x7fd08c7fdfb1  std::function<>::operator()()
    @     0x7fd08c7f9d8a  oneflow::vm::AccessBlobByCallbackInstructionType::Compute()
    @     0x7fd08d8756ad  oneflow::vm::CudaStreamType::Compute()
    @     0x7fd08c7faee0  oneflow::vm::StreamType::Compute()
    @     0x7fd08d8b4f6d  oneflow::vm::StreamType::Run()
    @     0x7fd08d8c9d55  oneflow::vm::VirtualMachine::DispatchAndPrescheduleInstructions()
    @     0x7fd08d8caee9  oneflow::vm::VirtualMachine::Schedule()
    @     0x7fd08d89b615  oneflow::OneflowVM::Loop()
    @     0x7fd08d8a75ba  _ZSt13__invoke_implIvMN7oneflow9OneflowVMEFvvEPS1_JEET_St21__invoke_memfun_derefOT0_OT1_DpOT2_
    @     0x7fd08d8a7407  _ZSt8__invokeIMN7oneflow9OneflowVMEFvvEJPS1_EENSt15__invoke_resultIT_JDpT0_EE4typeEOS6_DpOS7_
    @     0x7fd08d8a7293  _ZNSt6thread8_InvokerISt5tupleIJMN7oneflow9OneflowVMEFvvEPS3_EEE9_M_invokeIJLm0ELm1EEEEvSt12_Index_tupleIJXspT_EEE
    @     0x7fd08d8a71e5  _ZNSt6thread8_InvokerISt5tupleIJMN7oneflow9OneflowVMEFvvEPS3_EEEclEv
    @     0x7fd08d8a7174  _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJMN7oneflow9OneflowVMEFvvEPS4_EEEEE6_M_runEv
    @     0x7fd02490fde4  (unknown)
    @     0x7fd09a38b609  start_thread
    @     0x7fd09a2b2293  clone
    @              (nil)  (unknown)
Aborted (core dumped)

System Information

  • What is your OneFlow installation (pip, source, dockerhub):
  • OS: ubuntu
  • OneFlow version (run python3 -m oneflow --doctor):
version: 0.6.0+cu112.git.b4da856819
git_commit: b4da856819
cmake_build_type: Debug
rdma: False
mlir: False
  • Python version: 3.8.8
  • CUDA driver version: Driver Version: 495.44
liufengwei0103 commented Dec 27, 2021

The slice method fails after a tensor is moved to "cuda:1" by the to method. tensor_str needs the slice method, so the failure appears to come from tensor_str.
It can be reproduced with the code below.

import oneflow as flow
x = flow.tensor([1, 2], dtype=flow.int32)
y = x.to("cuda:1")
y[0]
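
A minimal sketch that ties the observations together (based only on what is reported above; whether y.numpy() succeeds for an int32 tensor on cuda:1 is my assumption by analogy with the float case in the original report):

import oneflow as flow

x = flow.tensor([1, 2], dtype=flow.int32)
y = x.to("cuda:1")

# Copying the whole tensor back to host memory works, as in the original report.
print(y.numpy())

# Indexing/slicing, which tensor_str relies on for printing, crashes.
y[0]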


liufengwei0103 commented Dec 29, 2021

import oneflow as flow
x = flow.tensor([1, 2], dtype=flow.int64)
y = x.to(device="cuda:1", dtype=flow.int32, copy=False)

Running the code above, the to method itself fails. It looks like the to method has a bug.
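
A small diagnostic sketch (my assumption, untested) that splits the combined call into separate steps, to narrow down whether the dtype cast, the device move, or only their combination triggers the failure:

import oneflow as flow

x = flow.tensor([1, 2], dtype=flow.int64)

a = x.to(dtype=flow.int32)   # dtype cast only, stays on CPU
b = x.to("cuda:1")           # device move only, keeps int64; does not crash by itself per the original report
c = x.to(device="cuda:1", dtype=flow.int32, copy=False)   # combined call, fails as reported above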
