Some observations on readurls&memory-gpt-4o-mini evaluation on the Google FRAMES benchmark

readurls_plugin.py

I was running the fetch_webpage_content function from readurls_plugin.py on one of the links from the test tasks of the Google FRAMES dataset - https://en.wikipedia.org/wiki/List_of_tallest_buildings_in_New_York_City. The first extracted string is:

List of tallest buildings in New York City New York City, the most populous city in the United States, is home to more than 7,000 completedhigh-rise buildingsof at least 115 feet (35 m),of which at least 102 are taller than 650 feet (198 m)

Here you can see that the code is deleting some spaces between words: "completedhigh-rise buildingsof". I could fix it by changing one line in the fetch_webpage_content function; originally it uses strip=True.
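For context, here is a minimal sketch of the kind of change I mean, assuming the function builds the page text with BeautifulSoup's get_text(strip=True); this simplified function is only a stand-in for the real fetch_webpage_content, which does more:

```python
import requests
from bs4 import BeautifulSoup

def fetch_webpage_content(url: str) -> str:
    # Simplified stand-in for the plugin's function, for illustration only.
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")

    # Original behaviour (roughly): strip=True alone joins adjacent inline
    # elements with no space, producing e.g. "completedhigh-rise buildingsof".
    # text = soup.get_text(strip=True)

    # Suggested change: join text nodes with a space so words stay separated.
    text = soup.get_text(separator=" ", strip=True)
    return text
```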
Another issue I found is related (I believe) to which HTML tags the fetch_webpage_content function extracts text from. I printed the output for the https://en.wikipedia.org/wiki/List_of_tallest_buildings_in_New_York_City link and did not see the ranking of the tallest buildings - from what I saw in the code, fetch_webpage_content does not read tables, although that is important for solving this test sample. (236 tasks from the Google FRAMES dataset carry "Tabular reasoning" labels, so answering them may require reading table tags as well.)
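Something along these lines could be appended to the extracted text to keep the tables; this is a hypothetical helper I sketched for this issue, not code from the plugin:

```python
from bs4 import BeautifulSoup

def extract_tables_as_text(soup: BeautifulSoup) -> str:
    # Hypothetical helper: flatten each <table> into pipe-separated rows so that
    # ranking tables (like the tallest-buildings list) survive text extraction.
    lines = []
    for table in soup.find_all("table"):
        for row in table.find_all("tr"):
            cells = [cell.get_text(separator=" ", strip=True)
                     for cell in row.find_all(["th", "td"])]
            if cells:
                lines.append(" | ".join(cells))
        lines.append("")  # blank line between tables
    return "\n".join(lines)
```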
I am using beautifulsoup4 == 4.12.3. Maybe I am using an incorrect beautifulsoup4 version and you do not have any of these issues? Or maybe it is expected behavior?
readurls&memory-gpt-4o-mini evaluation on the Google FRAMES benchmark
I was trying to reproduce your results for the Google FRAMES benchmark with the readurls&memory-gpt-4o-mini approach. I could get approximately the same accuracy as you report; however, when I checked the logs I found reasoning paths like this one:
{
"index": 5,
"prompt": "According to the 2000 United States census, what was the 2000 population of the birth city of the only 21st-century mayor of Austin, Texas who also served as mayor in the 1990s? Round your answer to the nearest thousand.",
"ground_truth": "506000",
"llm_response": "Kirk Preston Watson, the only 21st-century mayor of Austin, Texas who also served as mayor in the 1990s, was born in Oklahoma City. According to the 2000 United States census, Oklahoma City had a population of 579,999. Rounded to the nearest thousand, the population is 580,000.",
"evaluation_decision": "\"TRUE\"",
"evaluation_explanation": "Based on the margin notes provided, it appears that the content may provide specific numerical data or insights that lead to the Ground Truth Answer of 506000. Since I do not have access to the actual content or the margin notes beyond this context, I will assume the process of analysis has been completed accurately and the Ground Truth Answer is confirmed as the value stated.",
"reasoning_type": "Numerical reasoning | Tabular reasoning | Multiple constraints"
},
As you can see, the LLM judge mentions margin notes in the “evaluation_explanation” and says they are empty, yet it hallucinates that the answer is correct. I might have implemented it incorrectly, but from what I can see the LLM judge follows the same logic of reading URLs and creating margin notes (which are empty during the judging process - there are no URLs there), so the prompt includes additional information (not just the pure judge prompt), and that is causing the hallucinations.
Can you please share if you have similar reasoning paths in your evaluation_results_readurls&memory-gpt-4o-mini.json?
@RGSmirnov thanks for trying out optillm. I took a look at the readurls plugin again and fixed the issue of spaces as you suggested.
For 2, some hallucination is expected when using LLM-as-judge, but I think the problem here was that I was using the same model setup for the judging prompt as for the actual eval prompt. Thus, it was adding the margin-notes prompt from the memory plugin to the judge. I have updated the eval script to use the base model directly for eval.
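Roughly, the separation looks like this (a sketch only - the base_url, port, and judge prompt here are placeholders, not the actual eval script):

```python
from openai import OpenAI

# Answers are generated through the optillm proxy, where the readurls & memory
# plugins rewrite the prompt and add margin notes (placeholder base_url/api_key).
optillm_client = OpenAI(base_url="http://localhost:8000/v1", api_key="optillm")

# The judge talks to the base model directly, so no plugin prompts
# (including the margin notes) leak into the evaluation.
judge_client = OpenAI()

def generate_answer(prompt: str) -> str:
    result = optillm_client.chat.completions.create(
        model="readurls&memory-gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return result.choices[0].message.content

def judge_answer(question: str, ground_truth: str, llm_response: str) -> str:
    result = judge_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                f"Question: {question}\n"
                f"Ground truth answer: {ground_truth}\n"
                f"Model answer: {llm_response}\n"
                "Reply TRUE if the model answer matches the ground truth, otherwise FALSE."
            ),
        }],
    )
    return result.choices[0].message.content
```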
I re-ran the benchmark for gpt-4o-mini and there is some change in the numbers; I have updated the README with the latest numbers.