Implement LLM check for evaluation

Currently, our evaluation logic follows the [Qwen2.5-Math](https://github.com/QwenLM/Qwen2.5-Math) GitHub repository for parsing answers. However, this produces many false negatives: the responses are mostly correct but are parsed incorrectly and marked wrong.

We should use an LLM to check whether each response is correct.
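
One possible shape for this, as a rough sketch rather than a settled design: keep the existing rule-based check and only ask the LLM when that check says "wrong", since the problem here is false negatives. Everything named below (`is_correct`, `parse_verdict`, `JUDGE_PROMPT`, and the `rule_check` / `llm_judge` callables) is hypothetical. `rule_check` stands in for the current Qwen2.5-Math-style parser, and `llm_judge` is whatever function sends a prompt to a model and returns its text reply.

```python
from typing import Callable

# Hypothetical prompt template; the exact wording is an assumption.
JUDGE_PROMPT = (
    "You are grading a math answer.\n"
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Model response: {response}\n"
    "Is the final answer in the response mathematically equivalent to the "
    "reference? Reply with exactly one word: CORRECT or INCORRECT."
)


def parse_verdict(text: str) -> bool:
    """Map the judge's reply to a boolean; anything but CORRECT counts as wrong."""
    return text.strip().upper().startswith("CORRECT")


def is_correct(
    question: str,
    reference: str,
    response: str,
    rule_check: Callable[[str, str], bool],  # stand-in for the existing parser
    llm_judge: Callable[[str], str],         # stand-in for a call to some LLM
) -> bool:
    # Trust rule-based positives; they are cheap and not the failure mode here.
    if rule_check(response, reference):
        return True
    # Rule-based negative: fall back to the LLM to catch false negatives.
    prompt = JUDGE_PROMPT.format(
        question=question, reference=reference, response=response
    )
    return parse_verdict(llm_judge(prompt))
```

Calling the LLM only on rule-based negatives keeps cost down and leaves existing positives unchanged, but it is a design choice; judging every response with the LLM would also be an option.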