There seems to be a de-facto standard for how models answer with code: in markdown code fences. We have code that parses such responses (and cleans up lots of edge cases). The parsing code is at https://github.com/symflower/eval-dev-quality/blob/main/model/llm/prompt/parse.go and some examples are at https://github.com/symflower/eval-dev-quality/blob/main/model/llm/prompt/parse_test.go

Looking at the v0.5.0 evaluation run results (I am currently writing a deep-dive for those), ~65% of responses are such code responses. Some of the remaining responses might be code responses that we do not parse correctly, or they might just be nonsense. In a future release we might simply switch to forced structured output.

For the assessment: when a model answers with code that we can parse (no matter if there is extra text), it receives one point, and if there is only code (and no additional chatter), it receives another point. However, I think no-chatter should receive a bigger weight.
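To make the discussion concrete, here is a minimal sketch of what fence parsing plus the two-point scoring could look like. It is illustrative only: `parseResponse` and the scoring logic are made up for this example, and the real parse.go linked above handles many more edge cases (multiple blocks, missing closing fences, indentation, etc.).

````go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// fenceRE matches the first fenced markdown code block, allowing an
// optional language tag after the opening backticks.
var fenceRE = regexp.MustCompile("(?s)```[a-zA-Z]*\\n(.*?)```")

// parseResponse extracts the first fenced code block from a model response
// and reports whether there was extra text ("chatter") around it.
// This is a simplified sketch, not the actual parse.go implementation.
func parseResponse(response string) (code string, hasChatter bool, ok bool) {
	m := fenceRE.FindStringSubmatchIndex(response)
	if m == nil {
		return "", false, false
	}
	code = response[m[2]:m[3]]
	// Everything outside the matched fence counts as chatter.
	chatter := response[:m[0]] + response[m[1]:]
	return code, strings.TrimSpace(chatter) != "", true
}

func main() {
	response := "Sure, here you go:\n```go\npackage main\n```\nLet me know if you need more!"
	code, hasChatter, ok := parseResponse(response)
	if !ok {
		fmt.Println("no parseable code: 0 points")
		return
	}
	score := 1 // one point: the code could be parsed at all
	if !hasChatter {
		score++ // a second point: the response was code only
	}
	fmt.Printf("code=%q chatter=%v score=%d\n", code, hasChatter, score)
}
````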
---
I just ran the evaluation once and have not looked at the code in detail yet.

What is the strategy when a model does not answer with straight code but wraps it in some text? Is there an attempt to parse the code out of the markdown that is typically returned? IIRC I saw a prompt asking the model to return only the code, but not all models follow such "formatting" rules properly.