Bug Description
Forcing to_copy to insert an ICast Layer reduces performance by about 10% on UNet.
It's not necessary to insert a Cast Layer if the dtype doesn't change, e.g., from DataType.HALF to DataType.HALF:
Forced Cast ITensor [NORMALIZATION]-[aten_ops.native_group_norm.default]-[model.1.submodule.1.submodule.conv.unit0.adn.N/native_group_norm_4]_output from DataType.HALF to DataType.HALF - [aten_ops.torch.ops.aten.clone.default]-[model.1.submodule.1.submodule.conv.unit0.adn.D/clone_4], type: LayerType.CAST, inputs: 1, outputs: 1
Currently, all copy-related ops insert a Cast Layer, and TensorRT doesn't remove them for us during optimization. We need to think carefully about when inserting a Cast Layer is actually required.
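
As a rough illustration of the check this issue is asking for, here is a minimal sketch that inserts a cast only when the source and target dtypes differ. It does not use the real converter code: `FakeTensor`, `FakeNetwork`, and `cast_if_needed` are hypothetical stand-ins made up for this example, and the `DataType` enum here only mimics the names seen in the log above. It also deliberately ignores any case where a layer has to exist for reasons other than a dtype change; those would need separate handling.

```python
from dataclasses import dataclass
from enum import Enum


class DataType(Enum):
    # Stand-in for the dtype enum seen in the log; only two members are modeled.
    FLOAT = "float"
    HALF = "half"


@dataclass
class FakeTensor:
    # Hypothetical stand-in for a network tensor: just a name and a dtype.
    name: str
    dtype: DataType


class FakeNetwork:
    """Hypothetical stand-in for a network definition that counts cast layers."""

    def __init__(self):
        self.cast_layers = 0

    def add_cast(self, tensor, to_type):
        # Record the inserted layer and return a new tensor with the target dtype.
        self.cast_layers += 1
        return FakeTensor(f"{tensor.name}_cast", to_type)


def cast_if_needed(network, tensor, target_dtype):
    """Insert a cast layer only when the dtype actually changes."""
    if tensor.dtype == target_dtype:
        # HALF -> HALF (or FLOAT -> FLOAT): reuse the input, add no layer.
        return tensor
    return network.add_cast(tensor, target_dtype)


if __name__ == "__main__":
    net = FakeNetwork()
    x = FakeTensor("native_group_norm_4_output", DataType.HALF)

    same = cast_if_needed(net, x, DataType.HALF)   # no layer added
    other = cast_if_needed(net, x, DataType.FLOAT)  # one layer added
    print(same is x, other.dtype, net.cast_layers)
```

With a guard like this, the HALF to HALF case from the log would return the input tensor unchanged instead of adding a LayerType.CAST layer, while genuine dtype changes would still get a cast.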