Large Language Models (LLMs) are typically poor at performing calculations. This limitation arises from their design, but does it have to be this way?
While LLMs struggle to carry out calculations themselves, they excel at writing code, including code that performs those calculations.
Project Cloudberry aims to create a system that combines the strengths of LLMs with the ability to perform accurate calculations.
The approach is straightforward:
- The user asks a question.
- A first LLM instance writes Python code to answer the question; the code is then executed to obtain a computed answer.
- In parallel, the same question is sent to a second LLM instance with no special instructions.
- A third LLM instance is shown the question and both candidate answers. If the question involves counting or calculation, it selects the answer produced by the executed Python code; for general questions, it chooses the second instance's response (see the sketch after this list).
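
A minimal sketch of this pipeline is shown below. It assumes a generic `call_llm()` helper standing in for whatever model backend is used; the function names and prompts are illustrative, not the project's actual code.

```python
import subprocess
import sys
import tempfile


def call_llm(prompt: str) -> str:
    """Placeholder for the LLM backend; replace with a real API call."""
    raise NotImplementedError


def answer(question: str) -> str:
    # 1. A first LLM instance writes Python code that answers the question.
    code = call_llm(
        "Write a Python program that prints the answer to the question below. "
        "Respond with code only.\n\n" + question
    )

    # 2. Execute the generated code and capture its output as the computed answer.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        script_path = f.name
    result = subprocess.run(
        [sys.executable, script_path], capture_output=True, text=True, timeout=30
    )
    computed_answer = result.stdout.strip()

    # 3. A second LLM instance gets the same question with no special instructions.
    plain_answer = call_llm(question)

    # 4. A third LLM instance picks between the two candidate answers.
    choice = call_llm(
        f"Question: {question}\n"
        f"Answer A (from executed Python code): {computed_answer}\n"
        f"Answer B (direct LLM answer): {plain_answer}\n"
        "If the question involves counting or calculation, reply A; "
        "otherwise reply B. Reply with a single letter."
    )
    return computed_answer if choice.strip().upper().startswith("A") else plain_answer
```

Routing at the answer level (step 4), rather than rewriting the question, keeps the selector simple: it only has to decide which of two finished answers is more trustworthy for the given question.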
In testing, the system performs very well on tasks that require counting or calculation. Modern state-of-the-art (SOTA) LLMs struggle to give consistently correct answers to questions like:
- "How many 'l's are there in the sentence 'llama lives an alluring life at Lollapalooza'?"
- "How much is 2387 x 9045?"
- "What is the meaning of life?"
- "Which weighs more: a pound of water, two pounds of bricks, a pound of feathers, or three pounds of air?"
- "I’m in London and facing west, is Edinburgh to my left or my right?"
- "How many 'r's are there in the word 'strawberry'?"
- "Give me a list of ingredients for an omelette and tell me how many different ingredients I need."
- "How much do 3.486 liters of milk weigh?"
The system does have limitations:
- It struggles with riddles that require complex logical reasoning.
- Project Cloudberry lacks a mechanism for output verification, so it cannot produce correct answers to requests like "Write a sentence containing exactly 11 words."
The project is built entirely using free infrastructure: