Speed up the training process when the reward_function is CPU-intensive

Generation and reward computation both run in a single async context, so together they use only one CPU core. This can be very slow, often 2-4x slower than other frameworks that are built on Ray tasks. As a point of comparison, replacing the reward function with a placeholder function cuts the rollout time by 50%.
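
One possible direction, sketched below, is to keep the async loop for generation but hand each reward call to a process pool so the CPU-bound work spreads across cores. This is only an illustration using the Python standard library, not this framework's actual API: `cpu_heavy_reward` and `score_rollouts` are hypothetical names, and the hashing loop is just a stand-in for a real CPU-intensive reward function.

```python
import asyncio
import hashlib
from concurrent.futures import Executor, ProcessPoolExecutor


def cpu_heavy_reward(completion: str) -> float:
    """Hypothetical stand-in for a CPU-bound reward function."""
    digest = completion.encode()
    # Repeated hashing simulates expensive, pure-CPU scoring work.
    for _ in range(20_000):
        digest = hashlib.sha256(digest).digest()
    return digest[0] / 255.0


async def score_rollouts(completions: list[str], executor: Executor) -> list[float]:
    """Score all completions on the executor without blocking the event loop."""
    loop = asyncio.get_running_loop()
    # Each call runs on an executor worker; the event loop stays free
    # to keep driving generation while rewards are being computed.
    futures = [
        loop.run_in_executor(executor, cpu_heavy_reward, completion)
        for completion in completions
    ]
    # gather preserves input order, so rewards line up with completions.
    return await asyncio.gather(*futures)


if __name__ == "__main__":
    # With no arguments, the pool sizes itself to the available CPUs.
    with ProcessPoolExecutor() as pool:
        rewards = asyncio.run(score_rollouts(["a", "b", "c", "d"], pool))
    print(rewards)
```

A process pool (rather than a thread pool) is used here because pure-Python CPU work is serialized by the GIL across threads. The reward function and its arguments must be picklable for this to work, and the speedup depends on how much of the rollout time is actually spent in the reward function.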