The evaluation library currently includes evaluators that score AI responses on metrics such as Coherence and Fluency, as well as on metrics defined by custom IEvaluator implementations.
It would be great to also capture stats in the report such as the number of tokens in the response, the latency of the response, and whether the response was fetched from the cache vs. via an LLM call.
Additionally, it would be a good idea to include similar cache hit, latency, and token count measurements for each evaluation that is performed using an LLM. A rough sketch of the kind of data this could capture is included below.
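To make the ask a bit more concrete, here is an illustrative shape for the stats that could be recorded once for the AI response being evaluated and once per LLM-based evaluation. The type and property names below are purely hypothetical and are not part of the library today.

```csharp
using System;

// Hypothetical sketch only; none of these names exist in the library.
// The same shape could apply to the scored response and to each
// evaluation that is performed using an LLM.
public sealed record UsageStats(
    int? InputTokenCount,   // tokens sent to the LLM (if known)
    int? OutputTokenCount,  // tokens in the generated response (if known)
    TimeSpan Latency,       // wall-clock time for the call
    bool ServedFromCache);  // true if served from the cache instead of an LLM call
```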
FYI @peterwald