
“Emergent” abilities in LLMs actually develop gradually and predictably – study | Hacker News #882

ShellLM opened this issue Aug 14, 2024 · 1 comment
Labels

  • forum-discussion: Quotes clipped from forums
  • llm: Large Language Models
  • llm-evaluation: Evaluating Large Language Models performance and behavior through human-written evaluation sets
  • Papers: Research papers

Comments


ShellLM commented Aug 14, 2024

"Emergent" abilities in LLMs actually develop gradually and predictably – study

Snippet

"Emergent" abilities in LLMs actually develop gradually and predictably – study (quantamagazine.org)

255 points by Anon84 4 months ago | 201 comments

Comments

a_wild_dandan

There are several issues with the study:

  1. Replacing pass/fail accuracy with smoother alternatives (e.g., token edit distance) could be a terrible proxy for skill, depending on the task.
  2. Even by the authors' metrics, they still find a few potentially emergent abilities.
  3. Hindsight is 20/20. Yes, we can revisit the data and fiddle until we find transforms that erase emergence from aptitude plots. The fact is, folks used commonplace test-accuracy measurements, and the results were unpredictable and surprising. That's the truly notable phenomenon.

I think there's value in the paper. Just...don't take its conclusions too far.
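
To make point 1 concrete, here is a toy sketch (my own construction, not from the paper or the thread): suppose a hypothetical model gets each digit of an 8-digit answer right independently with probability p. A smooth partial-credit score tracks p almost linearly, while strict exact-match accuracy sits near zero and then jumps, which is exactly the kind of curve that gets called "emergent".

```python
# Toy simulation: smooth partial credit vs. strict exact-match on an
# 8-digit answer, as a hypothetical per-digit skill p improves.
import random

random.seed(0)
NUM_DIGITS = 8      # length of the target answer
TRIALS = 10_000     # simulated test items per skill level

def simulate(p):
    exact, partial = 0, 0.0
    for _ in range(TRIALS):
        correct = [random.random() < p for _ in range(NUM_DIGITS)]
        exact += all(correct)                    # pass/fail scoring
        partial += sum(correct) / NUM_DIGITS     # edit-distance-like partial credit
    return exact / TRIALS, partial / TRIALS

for p in (0.5, 0.7, 0.8, 0.9, 0.95, 0.99):
    em, pc = simulate(p)
    print(f"per-digit skill {p:.2f}: exact-match {em:.3f}, partial credit {pc:.3f}")
```

Running this gives exact-match roughly 0.004 at p = 0.5 but about 0.92 at p = 0.99, while partial credit moves smoothly from about 0.5 to 0.99.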

Gisbitus

As mentioned later in the article: it doesn't really matter if you get an addition mostly right; you either get it right or you don't. I still appreciate their effort, though, because even after altering the grading system there were still some emergent abilities.

arka2147483647

Assume we have a child, and we test him regularly:

  • Test 1: First he can just draw squiggles on the math test
  • Test 2: Then he can do arithmetic correctly
  • Test 3: He fails only on the final details of the algebraic calculation.

Now, even though he fails all the tests, any reasonable parent would see that he is improving nicely and will be able to work in his chosen field in a year or so.

Alternatively, in the case of AI, we can treat the test as a threshold: if the results are continuously trending upwards, we can expect the curve to breach that threshold at some point in the future.

That is: measuring improvement, instead of pass/fail, allows one to predict when we might be able to use the AI for something.
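
A minimal sketch of that idea, with made-up checkpoint scores: fit a simple trend to a continuous metric and estimate when it will cross a pass/fail threshold.

```python
# Hypothetical continuous scores over successive checkpoints; extrapolate
# a linear trend to estimate when the pass/fail threshold will be crossed.
import numpy as np

checkpoints = np.array([1, 2, 3, 4, 5])            # e.g. training runs or test dates
scores = np.array([0.10, 0.25, 0.42, 0.55, 0.71])  # made-up continuous scores
threshold = 0.90                                   # the pass/fail cut-off

slope, intercept = np.polyfit(checkpoints, scores, 1)  # simple linear trend
crossing = (threshold - intercept) / slope
print(f"trend: score ~ {slope:.2f} * checkpoint + {intercept:.2f}")
print(f"expected to cross {threshold} around checkpoint {crossing:.1f}")
```

With these numbers the crossing lands at roughly checkpoint 6, i.e. one or two checkpoints into the future.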

londons_explore

With AI you can do millions of tests. Some tests are easy by chance (e.g. "Please multiply this list of numbers by zero"). Some tests are answered correctly by pure chance, whether easy or hard.

When you actually do these millions of tests, I don't think it really matters what the exact success metric is: an AI that is 'closer to correct, but still wrong' on one test will still get more tests correct overall across the millions of tests in the dataset.
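
A toy simulation of that claim under one simple error model (the error model and numbers are my assumptions, not data from any benchmark): a model whose answers are closer on average also scores higher on strict exact-match, so over enough items the two metrics rank models the same way.

```python
# Compare exact-match and a smooth "closeness" credit for three models that
# differ only in how large their errors are on a simple numeric task.
import random

random.seed(0)
N = 100_000  # scale up toward "millions" if you like

def run(noise):
    exact = closeness = 0.0
    for _ in range(N):
        target = random.randint(0, 99)
        answer = target + round(random.gauss(0, noise))   # noisy model answer
        exact += (answer == target)                       # strict scoring
        closeness += 1.0 / (1.0 + abs(answer - target))   # smooth near-miss credit
    return exact / N, closeness / N

for noise in (2.0, 1.0, 0.5):
    em, cl = run(noise)
    print(f"error scale {noise}: exact-match {em:.3f}, closeness {cl:.3f}")
```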

raincole

Human beings do arithmetic problems wrong all the time, so I'm not sure "doing addition 100% right" is a measure of intelligence.

I'm not saying LLMs will achieve AGI (I don't know if they will, or whether we'll even know when they do). But somehow people seem to be judging AI's intelligence with this simple procedure:

  1. Find a task that AI can't do perfectly.
  2. Gotcha! AI isn't intelligent.

It just makes me question humans' intelligence if anything.

Jensson

Arithmetic is extremely easy for a neural network to perform and learn perfectly. That LLMs fail to learn it, even though it is so easy, is strong evidence that LLMs have a very limited capability to learn logical structures that can't be represented as grammar.

"Human beings do arithmetic problems wrong all the time"

Humans built cars and planes and massive ships before we had calculators; that requires a massive number of calculations, all of which have to be correct. Humans aren't bad at getting calculations right, they are just a bit slow. Today humans are bad at it because we don't practice, not because we can't. LLMs can't do that today, and "can learn" versus "can't" is a massive difference.
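
As a toy illustration of the first claim (a sketch under strong assumptions, and about raw numeric inputs rather than the tokenized text an LLM sees): a single linear layer with no hidden units recovers a + b exactly from the two numbers.

```python
# Fit a single linear layer to addition of two integers given directly
# as numeric inputs; the exact solution is weights [1, 1].
import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 10_000, size=(1_000, 2)).astype(float)  # training pairs (a, b)
y = X.sum(axis=1)                                           # targets a + b

w, *_ = np.linalg.lstsq(X, y, rcond=None)  # solves X @ w ~ y

X_test = rng.integers(0, 10_000, size=(100, 2)).astype(float)
print("learned weights:", np.round(w, 6))                       # ~ [1, 1]
print("max test error:", np.abs(X_test @ w - X_test.sum(axis=1)).max())
```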

pedrovhb

My intuition is that a significant challenge for LLMs' ability to do arithmetic has to do with tokenization. For instance, 1654+73225, per the OpenAI tokenizer tool, breaks down into 165•4•+•732•25, meaning the LLM is incapable of considering digits individually; that is, "165" is a single "word", and its relationship to "4", and in fact to every other token representing a numerical value, has to be learned. It can't do simple carry operations (or other arithmetic abstractions humans have access to) in the vast majority of cases because its internal representation of text is not designed for this. Arithmetic is easy to do in base 10 or 2 or 16, but it's a whole lot harder in base ~100k, where 99% of the "digits" are words like "cat" or "///////////".
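
You can inspect this split yourself with the tiktoken library; cl100k_base is assumed here, and the exact grouping can differ between encodings and model versions.

```python
# Show how a tokenizer splits numeric strings into multi-digit pieces.
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # assumed encoding; splits vary by model
for text in ("1654+73225", "cat", "///////////"):
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]   # decode each token id on its own
    print(f"{text!r} -> {pieces}")
```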

Compare that to understanding arbitrary base64-encoded strings; that's much harder for humans to do without tools. Tokenization still isn't the greatest fit for it, but it's a lot more tractable, and LLMs can do it no problem. Even understanding ASCII art is impressive, given they have no innate idea of what any letter looks like, and they "see" fragments of each letter on each line.

So I'm not sure if I agree or disagree with you here. I'd say LLMs in fact have very impressive capabilities to learn logical structures. Whether grammar is the problem isn't clear to me, but their internal representation format obviously and enormously influences how much harder seemingly trivial tasks become. Perhaps some efforts in hand-tuning vocabularies could improve performance in some tasks, perhaps something different altogether is necessary, but I don't think it's an impossible hurdle to overcome.

Closi

I don't think that's really how it works. Sure, this is true at the first layer of a neural network, but in deep networks, after the first few layers, the LLM shouldn't be 'thinking' in tokens anymore.
The tokens are just the input; the internal representation can be totally different (and that format isn't tokens).

Der_Einzige

Please don't act like you "know how it works" when you obviously don't.
The issue is not whether the model "thinks or doesn't think in tokens". The model is forced, at the final sampling/decoding step, to convert its latent back into tokens, one token at a time.
The models are fully capable of understanding the premise that they should "output a 5-7-5 syllable haiku", but from the perspective of a model trying to count its own syllables, this is not possible: its vocabulary is tokenized in such a way that not only does the model lack direct phonetic information in the dataset, it has no analogue for how humans count syllables (counting how many times the mouth drops). Models can't reason about the number of characters, or even tokens, used in a reply for the same exact reason.
The person you're replying to is broadly right, and you are broadly wrong. The internal format does not matter when the final decoding step forces a return to tokenization. Please actually use these systems rather than pontificating about them online.
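
A schematic of that decoding step (not any particular model's code; the vocabulary here is a made-up stand-in): whatever the internal latent looks like, the last step collapses it to a probability distribution over the vocabulary and emits exactly one token id per step.

```python
# Minimal sketch of the final sampling step: logits -> softmax -> one token id.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["165", "4", "+", "732", "25", " cat", "///////////"]  # stand-in vocabulary
logits = rng.normal(size=len(vocab))        # pretend output of the final layer

probs = np.exp(logits - logits.max())
probs /= probs.sum()                        # softmax over the whole vocabulary
next_id = rng.choice(len(vocab), p=probs)   # exactly one token id comes out
print("emitted token:", repr(vocab[next_id]))
# Counting characters or syllables of the eventual text would require reasoning
# about these opaque pieces, which is the difficulty described above.
```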

Suggested labels

None

ShellLM commented Aug 14, 2024

Related content

#856 similarity score: 0.92
#810 similarity score: 0.88
#855 similarity score: 0.87
#652 similarity score: 0.86
#686 similarity score: 0.86
#14 similarity score: 0.85
