
[ERROR] Prompt xxx failed to parse output:The output parser failed to parse the output including retries. #2022


Open
674717172 opened this issue Apr 22, 2025 · 1 comment
Labels: bug (Something isn't working) · module-metrics (this is part of metrics module) · question (Further information is requested)

Comments

@674717172

[ ] I have checked the documentation and related resources and couldn't resolve my bug.

Describe the bug
I have the same problem. When the reference and response are very long (response 665 characters, reference 2745 characters), llm_context_precision_with_reference, context_recall, and factual_correctness all come back as NaN, and the error message is: [ERROR] Prompt xxx failed to parse output: The output parser failed to parse the output including retries.

Ragas version:
Python version: 3.11.0

Code to Reproduce

def count_score(self, result) -> dict:
    scores = asdict(result).get("scores", [])
    print(f'result==score calculation==>{result}')
    if not scores:
        return {}  # if the "scores" list is empty, return an empty dict immediately
    score = {}
    for case in scores:
        for metrics, value in case.items():
            if math.isnan(value):
                value = 0  # replace NaN with 0
            if metrics not in score:
                score[metrics] = value
            else:
                score[metrics] += value
    mean_score = {metrics: round(value / len(scores), 2) for metrics, value in score.items()}
    return mean_score

async def evaluation(self) -> Data:
    result = Data(text="")
    if not self.user_input or not self.response or not self.retrieved_contexts:
        self.status = result
        return None
    if not self.referenceContent:
        if self.ContextPrecision: result.data["ContextPrecision"] = "Disabled (reference required)"
        if self.ContextRecall: result.data["ContextRecall"] = "Disabled (reference required)"
        if self.FactualCorrectness: result.data["FactualCorrectness"] = "Disabled (reference required)"
        if self.SemanticSimilarity: result.data["SemanticSimilarity"] = "Disabled (reference required)"
        # self.ContextPrecision = self.ContextRecall = self.FactualCorrectness = self.SemanticSimilarity = False
        # self.reference = ""
    
    _retrive_contexts = []
    for context in self.retrieved_contexts:
        _retrive_contexts.append(context.data["Generate"])
    print(f'{"-"*100}\nretrieved_contexts+++>{_retrive_contexts}\n{"-"*100}')
    
    _referenceContent = self.referenceContent
    
    sample = [SingleTurnSample(
        user_input = self.user_input,
        response = self.response,
        retrieved_contexts = _retrive_contexts,
        reference=_referenceContent,
        reference_contexts =[ _referenceContent ]
    )]
    print(f'{"-"*30}  self.referenceContent==>{type(self.referenceContent)}= self.referenceContent===>{self.referenceContent} \nsample ==>{sample}\n{"-"*30}')

    default_url = "http://0.0.0.0:11435"
    # llm = self.model
    llm = ChatOllama(model="qwen2.5:14b-instruct-fp16", temperature=0,  base_url=default_url) 
    # evaluator_llm = LlamaIndexLLMWrapper(llm)
    evaluator_llm = LangchainLLMWrapper(llm)
    evaluator_embeddings = LangchainEmbeddingsWrapper(self.embedding)
    eval_dataset = EvaluationDataset(sample)
    
    metrics = []
    if self.ContextPrecision: metrics.append(LLMContextPrecisionWithReference(llm=evaluator_llm))
    if self.ContextRecall: metrics.append(LLMContextRecall(llm=evaluator_llm))
    if self.FactualCorrectness: metrics.append(FactualCorrectness(llm=evaluator_llm))
    if self.Faithfulness: metrics.append(Faithfulness(llm=evaluator_llm))
    if self.ResponseRelevancy: metrics.append(ResponseRelevancy(llm=evaluator_llm, embeddings=evaluator_embeddings))
    if self.SemanticSimilarity: metrics.append(SemanticSimilarity(embeddings=evaluator_embeddings))

    evaluation_result = evaluate(dataset=eval_dataset, metrics=metrics, run_config=RunConfig(max_workers=4, timeout=3200000))
    score = self.count_score(evaluation_result)
    
    # context_precision = LLMContextPrecisionWithReference(llm=evaluator_llm)
    # sample2 = SingleTurnSample(
    #     user_input = self.user_input,
    #     retrieved_contexts = _retrive_contexts,
    #     reference = self.referenceContent
    # )
    # print(f'sample2222=>{sample2}')
    # resss =  await context_precision.single_turn_ascore(sample2)
    # print(f'test LLMContextPrecisionWithReference ===>{resss}')
    # for metrics, value in score.items():
    #     result += f"{metrics}: {value}\n"
    result_docs = self.retrieved_contexts if isinstance(self.retrieved_contexts, list) else [self.retrieved_contexts]
    print(f'score==>{score},===>{result_docs}')
    source = []
    content = []
    for item in result_docs:
        if item.data["source"]:
            source.append(item.data["source"])
        if item.text:
            content.append(item.text)
    resultData = Data(data=dict(score, question=self.user_input, answer=self.response, sources='\n'.join(source), retrieval_content='\n---------------------------- divider ----------------------------\n'.join(content)))
    self.status = resultData
    return resultData

Error trace
[ERROR] Prompt xxx failed to parse output:The output parser failed to parse the output including retries.

Expected behavior
I expect the three metrics llm_context_precision_with_reference, context_recall, and factual_correctness to be calculated instead of returning NaN.


674717172 added the bug (Something isn't working) label on Apr 22, 2025
dosubot bot added the module-metrics (this is part of metrics module) and question (Further information is requested) labels on Apr 22, 2025

dosubot bot commented Apr 22, 2025

The issue you're encountering with NaN values for llm_context_precision_with_reference, context_recall, and factual_correctness metrics, especially with long reference and response lengths, might be related to the limitations of the evaluation framework or the specific configuration of the language model and embeddings.

Here are a few suggestions to address the problem:

  1. Upgrade Ragas Version: Ensure you are using Ragas version 0.2.9 or later, as this version addresses issues related to NaN values in metrics by providing better handling for undetermined states [1].

  2. Batch Processing: Consider splitting your dataset into smaller batches and evaluating them independently; a minimal sketch follows this list. This can help manage memory usage and processing time, which might be contributing to the NaN values.

  3. Check LLM and Embeddings Configuration: Ensure that the LangchainLLMWrapper and LangchainEmbeddingsWrapper are correctly configured and compatible with the metrics you are using. These wrappers are crucial for providing the necessary LLM and embeddings for evaluation [2][3].

  4. Handle NaN Values: In your count_score function, you are already replacing NaN values with 0. This is a good practice to ensure that your scoring logic can proceed without errors. However, it's important to understand why NaN values are occurring to address the root cause.

  5. Review Metric Requirements: Ensure that all required inputs for the metrics are provided. For example, some metrics might require specific reference content or context to function correctly.
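For suggestion 2, here is a minimal sketch of batch evaluation, reusing the same evaluate/EvaluationDataset calls as the reproduction code above. The evaluate_in_batches helper, its batch_size default, and the NaN-skipping merge are illustrative assumptions, not part of Ragas itself; it also assumes result.scores is the same list of per-sample dicts that count_score already reads via asdict(result).

import math
from ragas import EvaluationDataset, evaluate

def evaluate_in_batches(samples, metrics, run_config, batch_size=8):
    """Evaluate samples in small batches and average per-metric scores."""
    totals = {}   # metric name -> running sum of scores
    counts = {}   # metric name -> number of non-NaN scores seen
    for start in range(0, len(samples), batch_size):
        batch = samples[start:start + batch_size]
        result = evaluate(
            dataset=EvaluationDataset(samples=batch),
            metrics=metrics,
            run_config=run_config,
        )
        for row in result.scores:              # one {metric: value} dict per sample
            for name, value in row.items():
                if not math.isnan(value):      # skip failed parses instead of averaging them as 0
                    totals[name] = totals.get(name, 0.0) + value
                    counts[name] = counts.get(name, 0) + 1
    return {name: round(totals[name] / counts[name], 2) for name in totals}

Calling evaluate_in_batches(sample_list, metrics, run_config) in place of the single evaluate call keeps the rest of the flow unchanged, and the per-batch results make it easier to spot which samples trigger the parser error.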

By following these steps, you should be able to mitigate the issue and obtain meaningful evaluation results. If the problem persists, consider reviewing the specific configurations and inputs used in your evaluation setup.
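Since the error message says the retries were already exhausted, it can also help to give the parser more room to retry. A minimal sketch, assuming your installed Ragas version's RunConfig exposes max_retries and max_wait (check ragas.run_config for your release); the concrete numbers are illustrative:

from ragas import RunConfig

run_config = RunConfig(
    max_workers=4,    # same concurrency as the reproduction code
    timeout=180,      # per-call timeout; RunConfig treats this as seconds in current releases
    max_retries=10,   # more attempts before a metric is reported as NaN
    max_wait=60,      # upper bound on the backoff between retries
)

Passing this run_config to evaluate() as in the code above leaves everything else unchanged.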



