| description |
| --- |
| Notes from reading a good introductory article to Generative AI |
Title: Generative AI exists because of the transformer
Link: https://ig.ft.com/generative-ai/
Author(s): Visual Storytelling Team and Madhumita Murgia
Published: 9/11/2023 London
Referred from: HN post - How Transformers Work.
- LLMs are a giant leap forward in our quest to build intelligence.
- Generative AI - software that can create plausible and sophisticated text, images and code at a level that mimics human ability.
- LLMs are underpinned by the transformer architecture, proposed by Google researchers in 2017.
How does an LLM generate text?
First, it translates words into a format that it can understand.
- a set of words is broken into tokens - tokenization. Tokens are usually sub-words. Example: `We go to work by train`. One of the tokens is `work`.
- in order to understand the meaning of `work`, the LLM observes that word in the context of other occurrences of the word - using enormous sets of training data created from the internet.
- after the training, there will be a few words that appear next to `work` - `are`, `her`, `friend`, `admirable`, `streamlined` - and other words that don't - `dove`, `polka`.
- the LLM processes this data and produces a vector - a word embedding - a list of numbers based on each word's proximity to the word `work`.
- the word embedding of `work` could be `[.35, .21, .07, .25, .33, ...]`.
- each of these embeddings contains hundreds of values, each of which represents a linguistic feature of the word.
- we don't know exactly what each of these values represents, but words with similar embeddings are usually used in similar (comparable) contexts.
- Example: `football` and `soccer` are not identical, but have similar meanings, so their embeddings quantify that closeness (see the sketch after this list). If we take just two of these characteristics and project them onto a 2-D plane, we can see the distance between the words and identify clusters of similar words.
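A minimal sketch of that closeness, using made-up 4-dimensional embeddings (real embeddings have hundreds of dimensions; all numbers here are invented for illustration):

```python
import math

# Hypothetical, hand-made embeddings - real ones are learned during training.
EMB = {
    "football": [0.91, 0.10, 0.45, 0.80],
    "soccer":   [0.89, 0.12, 0.50, 0.78],
    "dove":     [0.05, 0.93, 0.02, 0.11],
}

def cosine(a, b):
    # Cosine similarity: close to 1 means the vectors point the same way.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine(EMB["football"], EMB["soccer"]))  # high: used in similar contexts
print(cosine(EMB["football"], EMB["dove"]))    # low: used in different contexts
```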
The transformer architecture processes a sequence of words all at once, instead of each word individually. This gives LLMs better context and pattern recognition, and thus produces more accurate text. The process also runs faster because it can be parallelized. The architecture was first published by a Google Research team in 2017 - Transformer: A Novel Neural Network Architecture for Language Understanding.
A key concept of the transformer architecture is self-attention, which allows LLMs to understand the relationships between words. It looks at each token in a body of text and decides which other words are most important for understanding its meaning.
Prior to transformers, the standard method for language translation was the RNN (Recurrent Neural Network), which scanned each word sequentially - in the forward direction only. With self-attention, transformers process all the words at the same time.
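A minimal numpy sketch of single-head scaled dot-product self-attention - the core computation - using random toy weights rather than trained ones:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # Project each token embedding into a query, key and value vector.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Score how relevant every token is to every other token.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Softmax turns each row of scores into attention weights that sum to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output is a weighted mix of all value vectors: context for the token.
    return weights @ V

rng = np.random.default_rng(0)
d = 8                                 # toy embedding size
X = rng.normal(size=(5, d))           # 5 token embeddings, e.g. "I have no interest in"
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)   # (5, 8): every token now carries context
```

Because the attention weights for all tokens come out of a few matrix multiplications, the whole sequence is processed in parallel - the property that sets transformers apart from RNNs.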
Example - take the word `interest`. In the sentence `I have no interest in politics`, the word `interest` is used as a noun to indicate the subject's affiliation to politics. In the sentence `The bank's interest rates continue to rise`, the same word is used in the financial sense. Even when we combine the two usages - `I have no interest in hearing about the rising interest rate of the bank` - the model is able to recognize the meaning of the word in each context. In the first use of `interest`, `no` and `in` get the highest attention. For the second usage, it is `rate` and `bank`. Self-attention also lets the model use other words in place of `interest` at the right spot - for example, `I have no enthusiasm in hearing about the rising...`. This is particularly useful when summarizing content.
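As a sketch of how one might inspect real attention weights, assuming the Hugging Face `transformers` library and a public BERT model (a convenient choice - not one of the models from the article; exact weights vary by model and layer):

```python
# pip install transformers torch
from transformers import AutoModel, AutoTokenizer

name = "bert-base-uncased"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, output_attentions=True)

inputs = tok("The bank's interest rates continue to rise", return_tensors="pt")
outputs = model(**inputs)

# outputs.attentions: one tensor per layer, shape (batch, heads, tokens, tokens).
last = outputs.attentions[-1][0].mean(dim=0)     # average the heads of the last layer
tokens = tok.convert_ids_to_tokens(inputs["input_ids"][0])
i = tokens.index("interest")
for t, w in sorted(zip(tokens, last[i].tolist()), key=lambda x: -x[1])[:5]:
    print(f"{t:>10}: {w:.3f}")                   # which tokens "interest" attends to
```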
Another example - `The dog chewed the bone because it was hungry`. Here, `it` refers to the dog. `The dog chewed the bone because it was delicious`. Here, `it` refers to the bone, not the dog.
This self-attention helps LLMs gather context from a broad area - well beyond sentence boundaries - which is what lets the approach scale up.
Popular LLMs built on the transformer architecture:
- OpenAI's GPT-4
- Google's PaLM, which powers its Bard chatbot (and now the newer Gemini model)
- Anthropic Claude
- Meta's LLaMA
- Cohere's Command
- Mistral
LLMs are trained on a huge corpus of text from the internet. They identify patterns and context in this data to build word embeddings, positional encodings and self-attention.
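The notes above mention positional encoding without detail. For reference, a sketch of the sinusoidal scheme from the original 2017 paper - one common choice; many newer models use learned or rotary encodings instead:

```python
import math

def positional_encoding(pos, d_model):
    # Sinusoidal positional encoding from "Attention Is All You Need" (2017):
    # even dimensions use sine, odd dimensions use cosine, at varying frequencies.
    pe = []
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        pe.append(math.sin(angle))
        pe.append(math.cos(angle))
    return pe[:d_model]

# Each position gets a unique, smoothly varying vector that is added to the
# token's word embedding, so the model knows where each token sits.
print([round(v, 3) for v in positional_encoding(pos=0, d_model=6)])
print([round(v, 3) for v in positional_encoding(pos=1, d_model=6)])
```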
When the user gives a prompt, the model tokenizes and encodes the prompt into a machine-understandable representation that captures the meaning, positions and relationships of the words. Then it predicts the next word, and the next, until the output is complete.
These predictions come out as tokens, and the model assigns a probability score to each candidate token to indicate the likelihood that it is the best next word.
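A minimal sketch of that scoring step: raw scores (logits) are turned into probabilities with a softmax. The vocabulary and numbers below are invented for illustration:

```python
import math

def softmax(logits):
    # Exponentiate and normalize so the scores form a probability distribution.
    m = max(logits.values())                     # subtract max for numerical stability
    exps = {tok: math.exp(s - m) for tok, s in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Hypothetical scores for the next token after "We go to work by":
logits = {"train": 3.1, "bus": 2.4, "car": 2.2, "dove": -1.5}
for tok, p in sorted(softmax(logits).items(), key=lambda kv: -kv[1]):
    print(f"{tok:>6}: {p:.2f}")
```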
There are two ways to select these tokens:
- greedy search - the model picks the single most likely word at each step, in isolation. But this can make the whole phrase incoherent, even when each individual token looks like a good choice.
- beam search - the model scores larger sequences of tokens instead of each token individually, keeping several candidate routes and picking the best overall option.
Beam search produces more accurate results and more human-like text (see the sketch below).
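A toy, self-contained sketch of the difference, driven by a hand-made next-word table (all probabilities invented for illustration):

```python
# Toy next-word model: each token maps to candidate next tokens with probabilities.
PROBS = {
    "<s>":  {"nice": 0.5, "dog": 0.4, "car": 0.1},
    "nice": {"woman": 0.4, "house": 0.3, "guy": 0.3},
    "dog":  {"has": 0.9, "runs": 0.05, "barks": 0.05},
    "car":  {"is": 1.0},
}

def greedy_search(start, steps):
    # Commit to the single most likely token at every step.
    seq, prob = [start], 1.0
    for _ in range(steps):
        options = PROBS.get(seq[-1])
        if not options:
            break
        token, p = max(options.items(), key=lambda kv: kv[1])
        seq.append(token)
        prob *= p
    return seq, prob

def beam_search(start, steps, width=2):
    # Keep the `width` most likely partial sequences (beams) at every step.
    beams = [([start], 1.0)]
    for _ in range(steps):
        candidates = []
        for seq, prob in beams:
            options = PROBS.get(seq[-1])
            if not options:
                candidates.append((seq, prob))
                continue
            for token, p in options.items():
                candidates.append((seq + [token], prob * p))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:width]
    return beams[0]

print(greedy_search("<s>", 2))   # (['<s>', 'nice', 'woman'], 0.2)
print(beam_search("<s>", 2))     # (['<s>', 'dog', 'has'], 0.36)
```

Greedy search commits to `nice` because it looks best in isolation; beam search keeps `dog` alive and finds the higher-probability phrase overall.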
LLMs are not search engines that look up facts. They are pattern-spotting engines that guess the next best option in a sequence. Their output may seem plausible and coherent, but it is not always factually correct. LLMs can fabricate information in a process called hallucination - making up references to articles that don't exist, attributing papers to the wrong authors, and so on.
Companies are trying to limit the extent of this hallucination in a few ways:
- putting humans in the loop to give feedback and fill in gaps in information - RLHF (Reinforcement Learning from Human Feedback)
- a method called grounding - cross-checking the LLM's output against web search results and giving citations so that people can verify it
The power of LLMs goes far beyond text. Transformer models can recognize any repeating patterns - pixels in images, code, notes in music, sequences in DNA and proteins.
For decades, AI research had produced specialized models to summarize, translate, search and retrieve. Transformers unified them all into a single structure capable of doing multiple tasks.