[Feature]: [Perf] Optimize `reshape_and_cache` CUDA Kernel

### 🚀 The feature, motivation and pitch

Similar to https://github.com/vllm-project/vllm/pull/22036, we can optimize the `reshape_and_cache` CUDA kernel.

Pick it up if you are interested.