Head over to alxndrtl.github.io/ARC if you want to visualize the ARC tasks solved (or not) by LLMs!
This repo contains the code and results for the evaluation of some famous LLMs on the Abstraction and Reasoning Corpus.
In the `responses` folder, there is one subfolder per model, containing its completions for all the tasks it was evaluated on (as long as it answered with a valid response, i.e. one that can be parsed into a numpy array). The file `results.txt` shows on which specific tasks the different models succeeded.
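For reference, here is a minimal sketch of what "a valid response" means here, i.e. a completion that can be parsed into a rectangular integer grid (the helper names are mine, not the repo's exact code):

```python
# Minimal sketch (not the repo's exact code): turn a completion such as
# "0,0,1\n0,8,0" into a numpy array, or reject it as invalid.
import numpy as np

def parse_grid(completion: str):
    try:
        rows = [[int(tok) for tok in line.split(",")]
                for line in completion.strip().splitlines() if line.strip()]
    except ValueError:                # a non-integer token -> invalid response
        return None
    if not rows or len({len(r) for r in rows}) != 1:
        return None                   # empty or non-rectangular grid -> invalid
    return np.array(rows)

def is_correct(completion: str, expected_grid) -> bool:
    """A task counts as solved if the parsed grid exactly matches the expected output."""
    grid = parse_grid(completion)
    expected = np.array(expected_grid)
    return grid is not None and grid.shape == expected.shape and bool((grid == expected).all())
```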
This document presents the results, then elaborates on why they matter and why they are interesting for present and future research, and finally makes the point that, while the ARC tasks were seen during training, multiple experiments give little evidence that the models simply memorized them.
The code used to get the results is also provided as a notebook. Note that it is only compatible with version 0.27 of the Python `openai` library. Please see the OpenAI documentation if you want to work with newer versions (the changes required are small).
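For reference, the difference between the two API styles is roughly the following (a sketch, not the notebook's exact code):

```python
import openai  # version 0.27.x, as used by the notebook

openai.api_key = "YOUR_API_KEY"
prompt = "<task prompt here>"  # placeholder for an ARC task prompt

# openai == 0.27.x (module-level interface)
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
)
text = response["choices"][0]["message"]["content"]

# openai >= 1.0 (client-based interface)
# from openai import OpenAI
# client = OpenAI()
# response = client.chat.completions.create(
#     model="gpt-4",
#     messages=[{"role": "user", "content": prompt}],
#     temperature=0,
# )
# text = response.choices[0].message.content
```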
The direct evaluation of ARC tasks on LLMs yields the following results:
model | success rate (pass@1) |
---|---|
gpt-4 | 21% |
gpt-4-turbo | 18% |
text-davinci-003 | 14% |
gpt-3.5-turbo-instruct | 10.5% |
gpt-3.5-turbo (4k) | 11% |
text-davinci-002 | 10.5% |
llama2-70b | 4% |
llama2-70b-chat | 0% |
Note: these are pass@1 results. The ARC paper suggests giving the human/model 3 tries per task; here, only one try is allowed.
All the models were evaluated on the same subset of 100 tasks chosen randomly from the 400 training tasks, except for:
- text-davinci-002, evaluated on 200 of these 400 training tasks (chosen randomly).
- text-davinci-003 and gpt-3.5-turbo-instruct, evaluated on all 400 training tasks.

The default temperature (=1) was used, except for gpt-3.5-turbo-instruct, gpt-3.5-turbo and gpt-4-turbo, where a temperature of 0 was used.
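For concreteness, here is a sketch of that setup (the data path, random seed, and sampling code are assumptions, not necessarily those behind the reported numbers):

```python
# Sketch of the evaluation setup described above. The data path and random
# seed are assumptions; the temperatures are the ones stated in the text.
import os
import random

TRAINING_DIR = "ARC/data/training"            # the 400 training tasks, one JSON file each
task_files = sorted(os.listdir(TRAINING_DIR))

random.seed(0)                                # the actual seed is not documented
subset_100 = random.sample(task_files, 100)   # shared subset used for most models
subset_200 = random.sample(task_files, 200)   # text-davinci-002
full_400 = task_files                         # text-davinci-003, gpt-3.5-turbo-instruct

TEMPERATURE = {                               # per-model sampling temperature
    "gpt-4": 1.0,
    "gpt-4-turbo": 0.0,
    "gpt-3.5-turbo": 0.0,
    "gpt-3.5-turbo-instruct": 0.0,
    "text-davinci-003": 1.0,
    "text-davinci-002": 1.0,
    "llama2-70b": 1.0,
    "llama2-70b-chat": 1.0,
}
```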
Note: some tasks were too long to fit in the context of some models (namely text-davinci-003/002 and the llamas), but the percentages shown here don't take this into account (a task is simply counted as failed if it is too big to fit in the context). This doesn't change the picture much, as these models fail most of the long tasks anyway (more on this below).
model | mean length of solved tasks | context length |
---|---|---|
gpt-4 | 1066 | 8k |
gpt-3.5-turbo-instruct | 600 | 4k |
gpt-3.5-turbo (4k) | 553 | 4k |
The mean length of the tasks the models were evaluated on is 2242. So the LLMs mostly succeed on short tasks. Is it because these tasks are easier, or because their in-context learning ability is limited to short contexts? A little bit of both, I would say. The gap between the 3.5 and 4 families is impressive.
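One way to reproduce this kind of length analysis (an assumption on my part, not necessarily how the numbers above were computed) is to count prompt tokens with tiktoken and compare them to the model's context window:

```python
# Hedged sketch: measure a task prompt's length in tokens and check whether it
# fits in a given context window, leaving some room for the answer.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

def prompt_length(prompt: str) -> int:
    return len(enc.encode(prompt))

def fits_in_context(prompt: str, context_size: int = 8192,
                    reserved_for_answer: int = 512) -> bool:
    return prompt_length(prompt) + reserved_for_answer <= context_size
```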
I think this benchmark is very interesting for the fine-tuning of LLMs. We see that turning an LLM into a chatbot with RLHF makes the success rate go down by a few points. Conversely, couldn't we fine-tune an LLM in another way and have it perform better on ARC? Of course, the final goal isn't to have a model that is good at ARC; as I said, I believe training directly on the ARC tasks is beside the point. But a chatbot is just one possibility among many others when one chooses what to do with a base LLM. See Grounding Large Language Models in Interactive Environments with Online Reinforcement Learning for an example of such fine-tuning.
One reasonable claim to make about these results is that the models simply learned the tasks (which are present on the Internet). The evidence supporting this claim is the performance of these LLMs on the private ARC tasks that only @fchollet has access to. While no detailed results or studies are available, @fchollet has implied that all the LLMs achieve a <5% success rate on these private tasks.
However, there are results which are hard to explain if we accept this claim: when the tokens of the tasks are replaced by random ones and the grids are transposed, the success rates are barely affected. Every ARC task is encoded with numbers, each number corresponding to one color: blue=0, red=1, etc. The results shown above were obtained by simply giving the LLMs the tasks with these standard tokens, so a task looks like: 0,0,0,1,8,0\n0,1,0,9,0,0,0 .... What I tried is changing the tokens to which the colors are mapped, using random tokens (as in Large Language Models as General Pattern Machines), e.g. blue="am", red="sure", etc. Additionally, I transposed the grids. The success rates are affected by only 2 or 3%.
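To make the experiment concrete, here is a minimal sketch of such an encoding (the word list and helper name are illustrative, not the exact ones behind the reported numbers):

```python
# Sketch of the token-remapping + transposition experiment: serialize a grid
# either with the standard digit tokens or with arbitrary word tokens.
import numpy as np

RANDOM_TOKENS = {0: "am", 1: "sure", 2: "river", 3: "cold", 4: "seven",
                 5: "glass", 6: "paper", 7: "wind", 8: "iron", 9: "lamp"}

def encode_grid(grid, token_map=None, transpose=False):
    """Serialize a grid as it is shown to the model: one line per row, cells
    separated by commas."""
    arr = np.array(grid)
    if transpose:
        arr = arr.T
    to_str = (lambda v: token_map[v]) if token_map else str
    return "\n".join(",".join(to_str(int(v)) for v in row) for row in arr)

print(encode_grid([[0, 0, 1], [0, 8, 0]]))
# -> 0,0,1
#    0,8,0
print(encode_grid([[0, 0, 1], [0, 8, 0]], token_map=RANDOM_TOKENS, transpose=True))
# -> am,am
#    am,iron
#    sure,am
```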
If the claim that the LLMs simply learned the tasks is true, then we have shown that they learned them in a very subtle way: not by just memorizing the order of a fixed set of tokens. If so, this work demonstrates a special kind of "learning power" of LLMs.
Possible next steps: focus the evaluation on smaller models and measure the impact of different training/fine-tuning choices. For example, compare llama-7b with its code variants, or try phi-1.5 (first results aren't good, around 1%).