
Some observations and questions on Google FRAMES Benchmark readurls&memory-gpt-4o-mini method evaluation #106

Open
RGSmirnov opened this issue Nov 27, 2024 · 1 comment
Labels: bug (Something isn't working)

Comments

@RGSmirnov

1. Some observations on the ReadURLs plugin - readurls_plugin.py

I ran the fetch_webpage_content function from readurls_plugin.py on one of the links from the test tasks of the Google FRAMES dataset, https://en.wikipedia.org/wiki/List_of_tallest_buildings_in_New_York_City:

```python
a = fetch_webpage_content("https://en.wikipedia.org/wiki/List_of_tallest_buildings_in_New_York_City")
a.split(". ")
```

The first extracted string is: "List of tallest buildings in New York City New York City, the most populous city in the United States, is home to more than 7,000 completedhigh-rise buildingsof at least 115 feet (35 m),of which at least 102 are taller than 650 feet (198 m)". Here you can see that the code is deleting some spaces between words: "completedhigh-rise buildingsof".

I could fix it by changing this line in the fetch_webpage_content function:

```python
text = ' '.join(element.get_text(strip=False) for element in text_elements)
```

Originally it is strip=True.
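
For illustration, here is a minimal snippet (not from the plugin, just mimicking Wikipedia-style markup with inline links) that reproduces the space-dropping behaviour with beautifulsoup4 4.12.3:

```python
from bs4 import BeautifulSoup

# Inline <a> tags keep their surrounding whitespace in the neighbouring text
# nodes, so stripping each string before concatenation removes the spaces
# around the links.
html = ('<p>more than 7,000 completed <a href="#">high-rise buildings</a> '
        'of at least 115 feet (35 m)</p>')
p = BeautifulSoup(html, "html.parser").find("p")

print(p.get_text(strip=True))   # ...completedhigh-rise buildingsof...
print(p.get_text(strip=False))  # ...completed high-rise buildings of...
```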

Another issue I got is related (I believe) to the HTML tags that fetch_webpage_content extracts text from. I printed the output for the https://en.wikipedia.org/wiki/List_of_tallest_buildings_in_New_York_City link and did not see the ranking of the tallest buildings. From what I saw in the code, fetch_webpage_content does not read tables, while that is important for solving this test sample. (236 tasks from the Google FRAMES dataset carry the "Tabular reasoning" label, so solving them may require reading table HTML tags; see the sketch below.)
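
As a rough illustration of what I mean, something along these lines would pull the ranking rows out of that page (just a sketch, not the plugin's code; it assumes Wikipedia's wikitable class):

```python
import requests
from bs4 import BeautifulSoup

# Flatten each row of the ranking tables into one line of text so that the
# downstream model can actually see the "tallest buildings" ranking.
url = "https://en.wikipedia.org/wiki/List_of_tallest_buildings_in_New_York_City"
soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")

rows = []
for table in soup.find_all("table", class_="wikitable"):
    for tr in table.find_all("tr"):
        cells = [c.get_text(" ", strip=True) for c in tr.find_all(["th", "td"])]
        if cells:
            rows.append(" | ".join(cells))

print("\n".join(rows[:5]))  # header plus the first few ranked buildings
```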

I am using beautifulsoup4 == 4.12.3. Maybe I am using an incorrect beautifulsoup4 version and you do not see any of these issues? Or maybe this is the expected behavior?

2. Some observations on the readurls&memory-gpt-4o-mini evaluation on the Google FRAMES benchmark

I was trying to reproduce your results on the Google FRAMES benchmark with the readurls&memory-gpt-4o-mini approach. I got approximately the same accuracy as you report; however, when I checked the logs I found reasoning paths like this one:

```json
{
  "index": 5,
  "prompt": "According to the 2000 United States census, what was the 2000 population of the birth city of the only 21st-century mayor of Austin, Texas who also served as mayor in the 1990s? Round your answer to the nearest thousand.",
  "ground_truth": "506000",
  "llm_response": "Kirk Preston Watson, the only 21st-century mayor of Austin, Texas who also served as mayor in the 1990s, was born in Oklahoma City. According to the 2000 United States census, Oklahoma City had a population of 579,999. Rounded to the nearest thousand, the population is 580,000.",
  "evaluation_decision": "\"TRUE\"",
  "evaluation_explanation": "Based on the margin notes provided, it appears that the content may provide specific numerical data or insights that lead to the Ground Truth Answer of 506000. Since I do not have access to the actual content or the margin notes beyond this context, I will assume the process of analysis has been completed accurately and the Ground Truth Answer is confirmed as the value stated.",
  "reasoning_type": "Numerical reasoning | Tabular reasoning | Multiple constraints"
},
```

As you can see, the LLM judge mentions margin notes in the "evaluation_explanation" and admits it has no access to them, yet still hallucinates that the answer is correct. I might have implemented it incorrectly, but from what I see the LLM judge is following the same logic of reading URLs and creating margin notes (which are empty during the judging process, since there are no URLs there), so the prompt includes additional information (not just the pure judge prompt) that is causing hallucinations.

Can you please share whether you see similar reasoning paths in your evaluation_results_readurls&memory-gpt-4o-mini.json?
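
In case it helps, this is roughly how I spot such cases in the results file (field names are taken from the snippet above; it assumes the file is a JSON list of entries like that one, and the path is whatever your eval script writes):

```python
import json

# Count TRUE decisions whose explanation talks about margin notes instead of
# actually comparing the answer with the ground truth.
with open("evaluation_results_readurls&memory-gpt-4o-mini.json") as f:
    results = json.load(f)

suspicious = [
    r for r in results
    if "TRUE" in r.get("evaluation_decision", "")
    and "margin notes" in r.get("evaluation_explanation", "").lower()
]
print(f"{len(suspicious)} of {len(results)} entries look like judge hallucinations")
```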

@codelion codelion added the bug Something isn't working label Nov 28, 2024
@codelion
Owner

@RGSmirnov thanks for trying out optillm. I took a look at the readurls plugin again and fixed the issue of spaces as you suggested.

For 2, I think even though some hallucination is expected when using LLM-as-a-Judge, here the problem was that I was using the same model for the judging prompt as for the actual eval prompt, so it was adding the margins prompt from the memory plugin. I have updated the eval script to use the base model directly for eval.
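
For reference, the idea is roughly this (a sketch, not the exact eval script; the proxy URL and prompt strings are placeholders):

```python
from openai import OpenAI

# Answers still go through the optillm proxy with the approach prefix, but the
# judge call hits the base model directly so no plugin prompts (e.g. memory
# margin notes) leak into the judging step.
optillm_client = OpenAI(base_url="http://localhost:8000/v1")  # optillm proxy (assumed port)
judge_client = OpenAI()                                       # base OpenAI endpoint

question = "..."       # FRAMES prompt
ground_truth = "..."   # FRAMES ground-truth answer

answer = optillm_client.chat.completions.create(
    model="readurls&memory-gpt-4o-mini",
    messages=[{"role": "user", "content": question}],
).choices[0].message.content

judge_prompt = (
    f"Question: {question}\nGround truth: {ground_truth}\n"
    f"Model answer: {answer}\nReply TRUE or FALSE."
)
verdict = judge_client.chat.completions.create(
    model="gpt-4o-mini",  # base model, no optillm approach prefix
    messages=[{"role": "user", "content": judge_prompt}],
).choices[0].message.content
print(verdict)
```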

I re-ran the benchmark for gpt-4o-mini and there is some change in the numbers; I have updated the README with the latest numbers:

| Model | Accuracy |
| --- | --- |
| readurls&memory-gpt-4o-mini | 61.29 |
| gpt-4o-mini | 50.61 |

I have made all the changes in this PR - #107

I looked at the results file and reviewed a few examples, and it doesn't seem to have the hallucination problem from before.

I have attached my eval file here: evaluation_results_gpt-4o-mini.json

Let me know if you still run into any issues; the other option may be to use another, better model as the judge.
