Hi, I want to use this great work in torch-based distributed training to speed things up. It works well when only a single GPU is used, but with more than one GPU it crashes with the following error:
terminate called after throwing an instance of 'pybind11::error_already_set' what(): ValueError: Hold resources failed: cudaErrorInvalidResourceHandle: invalid resource handle
I printed some debug info; everything is fine in rank_0, but cvcuda crashes in rank_1.
The main code is shown below:
```python
# Define the CUDA device, context and streams.
cuda_device = cuda.Device(self.rank)
cuda_ctx = cuda_device.retain_primary_context()
cuda_ctx.push()
cvcuda_stream = cvcuda.Stream().current
torch_stream = torch.cuda.default_stream(device=cuda_device)
print(f'rank_{self.rank} start train, cvcuda stream: {cvcuda_stream}, torch_stream: {torch_stream}')
self.data_preprocessor = PreprocessorCvcuda(
    self.rank,
    cuda_ctx,
    cvcuda_stream,
)
# Do everything in streams.
with cvcuda_stream, torch.cuda.stream(torch_stream):
    self.train(train_dataloaders, test_dataloaders, iterations=iterations)
cuda_ctx.pop()
```
```python
class ImageBatchDecoder:
    def __init__(
        self,
        device_id,
        cuda_ctx,
        cuda_stream,
        cvcuda_perf=None,
    ):
        self.device_id = device_id
        self.cuda_ctx = cuda_ctx
        self.cuda_stream = cuda_stream
        self.cvcuda_perf = cvcuda_perf
        self.decoder = nvimgcodec.Decoder(device_id=device_id)

    def __call__(self, batch: list, aug_params: dict):
        # args:
        #   batch: batch of undecoded image bytes
        if self.cvcuda_perf is not None:
            self.cvcuda_perf.push_range("decoder.nvimagecodec")
        data_batch = [img for frame in batch for img in frame]
        tensor_list = []
        print(f'rank_{self.device_id} start decode, stream: {self.cuda_stream}...', flush=True)
        image_list = self.decoder.decode(data_batch, cuda_stream=self.cuda_stream)
        print(f'rank_{self.device_id} end decode...', flush=True)
        resize = aug_params['resize'].view(-1, 2).cpu().numpy()
        crop = aug_params['crop'].view(-1, 4).cpu().numpy()
        rotate = aug_params['rotate'].view(-1).cpu().numpy()
        rotate_rad = rotate * 3.1415926535897932384626433832795 / 180
        sin_r = np.sin(rotate_rad)
        cor_r = np.cos(rotate_rad)
        # Convert the decoded images to nvcv tensors in a list.
        for i in range(len(image_list)):
            print(f'rank_{self.device_id} start resize_crop_convert_reformat...', flush=True)
            aug_img = cvcuda.resize_crop_convert_reformat(
                cvcuda.as_tensor(image_list[i], "HWC"),
                (resize[i, 0], resize[i, 1]),
                cvcuda.Interp.LINEAR,
                cvcuda.RectI(
                    crop[i, 0],
                    crop[i, 1],
                    round(crop[i, 2] - crop[i, 0]),
                    round(crop[i, 3] - crop[i, 1])),
                layout="HWC",
                data_type=nvcv.Type.U8,
                # manip=cvcuda.ChannelManip.REVERSE,
                # scale=1. / 255,
                stream=self.cuda_stream,
            )
            print(f'rank_{self.device_id} start rotate...', flush=True)
            aug_img = cvcuda.rotate(
                aug_img,
                rotate[i],
                [0.5 * (aug_img.shape[1] - aug_img.shape[1] * cor_r[i] - aug_img.shape[0] * sin_r[i]),
                 0.5 * (aug_img.shape[0] + aug_img.shape[1] * sin_r[i] - aug_img.shape[0] * cor_r[i])],
                cvcuda.Interp.LINEAR,
                stream=self.cuda_stream,
            )
            tensor_list.append(aug_img)
        # Stack the list of tensors to a single NHWC tensor and convert to NCHW.
        print(f'rank_{self.device_id} start reformat...', flush=True)
        cvcuda_decoded_tensor = cvcuda.reformat(cvcuda.stack(tensor_list), "NCHW", stream=self.cuda_stream)
        if self.cvcuda_perf is not None:
            self.cvcuda_perf.pop_range()
        print(f'rank_{self.device_id} end of ImageBatchDecoder...', flush=True)
        return cvcuda_decoded_tensor
```
While CV-CUDA and nvImageCodec work great for inference, they may not be well suited to the multi-process data loading approach PyTorch uses for training.
Have you heard of and considered using DALI for that purpose? It provides seamless integration with PyTorch.
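For reference, a minimal sketch of what a sharded, GPU-decoding DALI pipeline wired into PyTorch could look like (the file_root path, resize shape, batch size, and reader name below are placeholders, and it assumes torch.distributed is already initialized):

```python
import torch.distributed as dist
from nvidia.dali import pipeline_def, fn, types
from nvidia.dali.plugin.pytorch import DALIGenericIterator

rank, world_size = dist.get_rank(), dist.get_world_size()

@pipeline_def(batch_size=32, num_threads=4)
def train_pipeline(shard_id, num_shards):
    # Each rank reads only its own shard of the dataset.
    jpegs, labels = fn.readers.file(
        file_root="/data/train",  # placeholder path
        shard_id=shard_id,
        num_shards=num_shards,
        random_shuffle=True,
        name="Reader",
    )
    # Decode on the GPU ("mixed" = hybrid CPU/GPU decoder), then resize.
    images = fn.decoders.image(jpegs, device="mixed", output_type=types.RGB)
    images = fn.resize(images, resize_x=224, resize_y=224)
    return images, labels.gpu()

# One pipeline per rank, bound to that rank's GPU.
pipe = train_pipeline(shard_id=rank, num_shards=world_size, device_id=rank)
pipe.build()
loader = DALIGenericIterator(pipe, ["data", "label"], reader_name="Reader")

for batch in loader:
    images, labels = batch[0]["data"], batch[0]["label"]
    # ... forward/backward as usual ...
```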
@JanuszL Thank you for your reply! Yes, I have considered using DALI, but I think it is a bit complicated; I would have to refactor a lot of dataset code, so I used CV-CUDA instead. In fact, I don't use CV-CUDA inside the dataset. I use it at the beginning of each iteration: the dataloader only reads image bytes, and before the model forward pass I use nvImageCodec and CV-CUDA to do decoding and augmentation on the GPU. I think the problem may be that cvcuda.Stream().current doesn't specify the device_id, but I haven't found any code to do that...
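One possible way to test that guess is to make each rank's GPU current before any stream is created, and to share one explicitly created stream between CV-CUDA and torch instead of relying on cvcuda.Stream().current. Below is a minimal, untested sketch; setup_rank_streams is a hypothetical helper, it assumes `cuda` is pycuda.driver as in the snippet above, and it assumes the CV-CUDA stream exposes its raw CUDA handle via .handle as in the CV-CUDA Python samples:

```python
import torch
import cvcuda
import pycuda.driver as cuda  # assumption: this is the `cuda` used in the snippet above


def setup_rank_streams(rank):
    # Hypothetical helper: make this rank's GPU current *before* creating any stream.
    torch.cuda.set_device(rank)
    cuda_ctx = cuda.Device(rank).retain_primary_context()
    cuda_ctx.push()

    # Create a dedicated CV-CUDA stream on the now-current device instead of
    # grabbing cvcuda.Stream().current, and hand the same underlying CUDA
    # stream to PyTorch so both libraries queue work on it.
    cvcuda_stream = cvcuda.Stream()
    torch_stream = torch.cuda.ExternalStream(cvcuda_stream.handle)
    return cuda_ctx, cvcuda_stream, torch_stream
```

The returned streams would then be used in the same `with cvcuda_stream, torch.cuda.stream(torch_stream):` block as in the original snippet, with `cuda_ctx.pop()` at the end of training.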