This repository was archived by the owner on Jul 12, 2021. It is now read-only.

Commit ae34182

** corrections to language.
1 parent b589ab5 commit ae34182

8 files changed: +105 / -91 lines changed


README.md

Lines changed: 2 additions & 2 deletions
@@ -58,11 +58,11 @@ This repository has the following features:
 
 After reviewing these models, the world's your oyster in terms of other models to explore:
 
-ELMO, XLNET, all the other BERTs, BART, Performer, T5, etc....
+Char-RNN, BERT, ELMO, XLNET, all the other BERTs, BART, Performer, T5, etc....
 
 ## Roadmap
 
-Future models:
+Future models to implement:
 
 - [ ] Char-RNN (Karpathy)
 - [ ] BERT

notebooks/cnn/README.md

Lines changed: 18 additions & 13 deletions
@@ -81,11 +81,12 @@ of those features, say either making a feature possess a bell-curve distribution
 In terms of feature engineering/representations, a breakthrough in this regard was employing unsupervised training to derive semantic [embeddings](../word2vec/README.md) as features
 (link is to the implementation of Skipgram).
 Many paper results have shown a significant improvement in model performance when using these pre-trained representations. It's
-surprising to see that one could input these into a linear model and outperform TF-IDF in many situations. One downside of using embeddings
-in a linear model is that you need to pool across the embeddings in order to collapse the dimensionality, so a max/average pooling technique
+surprising to see that one could input these into a linear regression and outperform TF-IDF in many situations. One downside of using a linear model
+is that you need to pool across the embeddings in order to collapse the dimensionality, so a max/average pooling technique
 is often required.
 
-Following this logical thread, is there a way to take these embeddings and extract further information from them?
+Following this logical thread, how can we further capture the inter-dependency between words when trying to predict
+a target? Is there a way to take these embeddings and extract further information from them?
 *How can we get at automated feature extraction?*
 
 This is where deep learning, and in particular, convolutional neural networks (CNNs), come into play.
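
To make the pooling step above concrete, here is a minimal sketch (not code from this repository) of mean/max pooling a variable-length sequence of word embeddings into one fixed-length feature vector that a linear model could consume. The vocabulary and embedding matrix below are illustrative stand-ins; in practice the embeddings would come from a pre-trained model such as the Skipgram implementation linked above.

```python
import numpy as np

# Illustrative stand-ins: a tiny vocabulary and a random embedding matrix.
rng = np.random.default_rng(0)
vocab = {"the": 0, "movie": 1, "was": 2, "great": 3}
embeddings = rng.normal(size=(len(vocab), 50))  # (vocab_size, embed_dim)

def pool_features(tokens, mode="mean"):
    # Collapse (seq_len, embed_dim) down to a single (embed_dim,) vector.
    vecs = np.stack([embeddings[vocab[t]] for t in tokens])
    return vecs.mean(axis=0) if mode == "mean" else vecs.max(axis=0)

x = pool_features(["the", "movie", "was", "great"])  # (50,) feature vector
# x can now be fed to a linear model, e.g. a logistic regression classifier.
```
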
@@ -110,26 +111,30 @@ convolution filters applied to the pixels of a picture of the Taj Mahal. One is
 the colors.
 
 Convolution filters and pooling layers form the bedrock of the CNN architecture.
-These are neural network architectures that *derive* the values of a series of convolution filters
-in order to extract useful features from a series of inputs.
+These are neural network architectures that **automatically derive convolution filters**
+in order to boost the model's ability to learn a target.
 
 ### Advantages of CNNs
 
-Convolution layers have many advantages in that one can vary
-the number of filters simultaneously running over a set of inputs, as well as the properties such as:
-(1) *size of the filters* (i.e. the window size of the filter as it moves over the set of inputs); (2) *the stride or pace
-of the filters* (i.e. if it skips over the volume of inputs ); etc. One benefit of using convolution layers is that they
-may be stacked on top of each other in a series of layers. *Each layer of convolution filters is thought to derive a different
-level of feature extraction*, from the most rudimentary at the deepest levels to the finer details at the shallowest levels.
+CNNs are highly flexible. One has several knobs available when selecting these layers (see the sketch below):
+1. *the number of simultaneous filters* (i.e., how many different simultaneous feature derivations to make from an input)
+2. *size of the filters* (i.e. the window size of the filter as it moves over the set of inputs)
+3. *the stride or pace of the filters* (i.e. whether it skips over parts of the input volume); etc.
+
+Another benefit of using convolution layers is that they
+may be stacked on top of each other in a series of layers. Each layer of convolution filters is thought to derive a different
+level of feature extraction, from the most rudimentary at the deepest levels to the finer details at the shallowest levels.
+
 Pooling layers are interspersed between convolution layers in order to summarize (i.e. reduce the dimensionality of)
 the information from a set of feature maps via sub-sampling.
 
 A final note is that CNNs are typically considered very fast to train compared to other typical deep
-architectures (like say the RNN) as they process things in a simultaneous manner.
+architectures (like, say, the RNN) as they process a batch of data simultaneously.
 
 ### CNNs Work Well for Classification/Identification/Detection
 
-Both pooling and convolution operations are **locally invariant**, which means that their ability to detect a feature
+Both pooling and convolution operations have the highly useful property that they are **locally invariant**,
+which means that their ability to detect a feature
 is independent of the location in the set of inputs. This lends itself very well to classification tasks.
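
The following sketch (not the repository's implementation) shows the knobs listed above on a toy batch of embedded token sequences: the number of filters, the kernel size, and the stride of a 1-D convolution, followed by a pooling layer that summarizes each feature map. All sizes are illustrative.

```python
import torch
import torch.nn as nn

# A toy batch of embedded token sequences; Conv1d expects (batch, channels, length).
batch, seq_len, embed_dim = 8, 20, 50
x = torch.randn(batch, embed_dim, seq_len)

conv = nn.Conv1d(
    in_channels=embed_dim,   # each embedding dimension is an input channel
    out_channels=16,         # knob 1: number of simultaneous filters
    kernel_size=3,           # knob 2: window size of each filter
    stride=1,                # knob 3: how far the filter moves at each step
)
pool = nn.AdaptiveMaxPool1d(1)             # summarize each feature map to one value

feature_maps = torch.relu(conv(x))         # (batch, 16, seq_len - 2)
features = pool(feature_maps).squeeze(-1)  # (batch, 16) -> ready for a classifier
```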
 
 ## Model-Specifics

notebooks/gpt/README.md

Lines changed: 29 additions & 22 deletions
@@ -70,20 +70,21 @@ trainer.run()
 GPT, which stands for "Generative Pre-trained Transformer", is a part of the realm of "sequence models" (sequence-to-sequence or "seq2seq"),
 models that attempt to map an input (source) sequence to an output (target) sequence.
 Sequence models encompass a wide range of representations, from long-standing, classical probabilistic approaches such as
-Hidden Markov Models (HMMs), Bayesian networks, et.c to more recent "deep learning" models such as recurrent neural networks (RNNs).
+Hidden Markov Models (HMMs), Conditional Random Fields (CRFs), etc. to more recent "deep learning" models such as recurrent neural networks (RNNs).
 
 GPT belongs to a newer class of models known as Transformers, which we touch upon [here](../transformer/README.md).
 
 ### GPT is a Language Model Transformer
 
 Unlike the original Transformer, which is posed as a Machine Translation model
-(i.e. translate one sequence into another sequence), the GPT is a Language Model (LM). LMs ask the natural question:
+(i.e. translate one sequence into another sequence), the GPT is a **Language Model**. LMs ask the natural question:
 given a sequence of words, what is most likely to follow? Put mathematically, LMs are concerned with predicting
-the next term(s) in a sequence conditional on all the previous points in the sequence:
+the next term(s) in a sequence conditional on a neighboring window of words. In the GPT model, this context is
+all the previous points in the sequence:
 ```python
-p(u_i|u_i-1,u_i-2,...,u_i-block_size)
+p(u_i | context) := p(u_i | u_i-1, u_i-2, ..., u_i-block_size)
 ```
-In this sense, LMs are -auto-regressive- models.
+In this sense, we see that GPT is an **auto-regressive** model.
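
As a supplement to the formula above, here is a minimal sketch (not this repository's code) of the auto-regressive constraint in a decoder-only setting: a causal mask so that position `i` can only attend to positions up to `i`, i.e., the previous `block_size` tokens.

```python
import torch

# Causal (lower-triangular) mask: position i may only see positions <= i,
# so the model predicts u_i from u_{i-1}, ..., u_{i-block_size}.
block_size = 8
causal_mask = torch.tril(torch.ones(block_size, block_size, dtype=torch.bool))

scores = torch.randn(block_size, block_size)               # raw attention scores (illustrative)
scores = scores.masked_fill(~causal_mask, float("-inf"))   # hide "future" positions
attn = torch.softmax(scores, dim=-1)                       # each row sums to 1 over the visible past
print(attn[0])   # the first token can only attend to itself
```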
 
 To fit the Transformer architecture to an LM problem, we can take the encoder-decoder architecture of the OG Transformer
 and discard the encoder. The decoder, you will notice, is inherently concerned with predicting the next item in a sequence
@@ -102,6 +103,9 @@ ability to help learn useful things (see notes on word2vec embeddings [here](../
 Motivated by this, they hypothesized LMs as the ultimate self-supervised models that could ultimately be applied as
 transfer learners.
 
+Unlike embeddings, language models have the ability to capture the *contextual* meaning of a word. For instance,
+a language model can differentiate between the meaning of "bear" in "the right to bear arms" and "I saw a black bear".
+
 ### The GPT Approach
 
 Using their own decoder Transformer architecture, they establish a "semi-supervised" approach using a combination
@@ -142,36 +146,39 @@ Image source: Radford et al. (2018)
 
 ## GPT-Model-Details
 
-Here are some small notes from the each paper. I highly recommend you take a crack at reading them further better
-information.
+Here are some small notes from each paper. Note that my observations are not comprehensive and I am still wrapping my head
+around GPT-3.
+
+I highly recommend reading the original papers.
 
 ### GPT-1
 
 Most of the above was written based on GPT-1 observations.
 Here are some other notes:
 
 * pre-training:
-    - on BooksCorpus dataset. Didn't like Word Benchmark because it
+    >- on BooksCorpus dataset. Didn't like Word Benchmark because it
       shuffled sentences, breaking up dependencies.
 * 12-layer decoder-only transformer.
-    - dim_model = 768
-    - num_heads = 12 (of self-attention model)
+    >- dim_model = 768
+    >- num_heads = 12 (of self-attention model)
     - Optimization:
-        - Max learning rate of 2.5e-4
-        - 2000 updates linear increase, annealed to 0 using Cosine Scheduler
-          (Note: I stuck with NoamOptimizer to make life simple)
-        - Weight initialization of N(0,0.02)
-        - BPE with 40k merges (Note: stuck with usual word-encoding)
-        - Activation functions used were GELU instead of RELU
-        - Learned positions instead of original fixed Sinusoidal
-        - Spacy tokenizer (Note: I used pre-built torchtext)
+        >- Max learning rate of 2.5e-4
+        >- 2000 updates linear increase, annealed to 0 using Cosine Scheduler
+        >  (Note: I stuck with NoamOptimizer to make life simple; a sketch of this schedule follows below)
+        >- Weight initialization of N(0,0.02)
+        >- BPE with 40k merges (Note: I stuck with usual word-encoding, not BPE)
+        >- Activation functions used were GELU instead of RELU
+        >- Learned positions instead of original fixed Sinusoidal
+        >- Spacy tokenizer (Note: I used my own tokenizer)
 * fine-tuning:
-    - Same hyper-parameters as pre-training + drop-out.
-    - Found 3 epochs of training was sufficient.
+    >- Same hyper-parameters as pre-training + drop-out.
+    >- Found 3 epochs of training was sufficient.
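
As referenced in the optimization notes above, here is a minimal sketch of the GPT-1 learning-rate schedule (linear warm-up over 2000 updates to a max of 2.5e-4, then cosine annealing toward 0). The `total_steps` value and the placeholder parameters are assumptions for illustration; the repository itself sticks with a NoamOptimizer.

```python
import math
import torch

# Assumed values: warm-up and max LR from the GPT-1 notes, total_steps is illustrative.
warmup_steps, total_steps, max_lr = 2000, 100_000, 2.5e-4
params = [torch.nn.Parameter(torch.zeros(1))]          # placeholder parameters
optimizer = torch.optim.Adam(params, lr=max_lr)

def lr_lambda(step: int) -> float:
    if step < warmup_steps:
        return step / max(1, warmup_steps)                          # linear increase
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))               # anneal toward 0

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
for step in range(3):
    optimizer.step()        # real gradients would be applied here
    scheduler.step()
    print(step, scheduler.get_last_lr())
```
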
 
 ### GPT-2
 
-tl;dr GPT-1 with larger pre-training and **multi-task learning**.
+*tl;dr* GPT-1 with larger pre-training and **multi-task learning**.
+
 
 The largest model achieves SOTA on 7/8 benchmarks in the zero-shot (i.e., no fine-tuning) setting.

@@ -203,7 +210,7 @@ Here is a list of some model specifics that changed since GPT-1:
 
 This paper I'm still working through. The model is now at 175B parameters with
 96 layers, 96 heads, and dim_model = 12,288. I gather that the
-attention mechanism is different.
+attention mechanism may also be different from the canonical self-attention.
 
 ## Features
 
notebooks/gpt/gpt.ipynb

Lines changed: 4 additions & 4 deletions
@@ -122,7 +122,7 @@
 "id": "vH8GgVrAKHfW"
 },
 "source": [
-"Ahould be True. If not, debug (Note: version of pytorch I used is not capatible with CUDA drivers on colab. Follow these instructions here explicitly)."
+"Should be True. If not, debug (Note: the version of pytorch I used is not compatible with the CUDA drivers on colab. Follow these instructions here explicitly)."
 ]
 },
 {
@@ -253,7 +253,7 @@
 "source": [
 "## Language Model: WikiText2\n",
 "\n",
-"We will try to train our transformer model to learn how to predict the next word in torchtext WikiText2 database."
+"We will try to train our GPT model to learn how to predict the next word in WikiText2 data."
 ]
 },
 {
@@ -328,7 +328,7 @@
 "# Self-supervised training\n",
 "\n",
 "\n",
-"This is an unsupervised (more aptly described as \"self-supervised\") loss. After this model is trained,\n",
+"This is the self-supervised pre-training. After this model is trained,\n",
 "we can then continue it onto another problem (we can freeze layers to only continue training the top layers)."
 ]
 },
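
The cell above mentions continuing onto another problem with frozen layers. Here is a minimal sketch of that idea; `TinyLM` and its attribute names are hypothetical stand-ins, not the notebook's model.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a pre-trained GPT-style model (names are illustrative).
class TinyLM(nn.Module):
    def __init__(self, vocab=100, dim=32, n_blocks=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.blocks = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_blocks))
        self.head = nn.Linear(dim, vocab)

    def forward(self, x):
        h = self.embed(x)
        for block in self.blocks:
            h = torch.relu(block(h))
        return self.head(h)

model = TinyLM()
for p in model.parameters():                 # freeze everything...
    p.requires_grad = False
for p in list(model.blocks[-1].parameters()) + list(model.head.parameters()):
    p.requires_grad = True                   # ...except the top block and the head

# Fine-tune only the parameters that still require gradients.
optimizer = torch.optim.Adam((p for p in model.parameters() if p.requires_grad), lr=1e-4)
```
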
@@ -591,7 +591,7 @@
 "Well, as expected... this doesn't make any sense really. Pockets of words make sense, but overall it does not.\n",
 "\n",
 "A couple of considerations for further work: (1) training for longer and (2) larger models/different hyperparameters.\n",
-"There is a third option, (3), which is to attempt a character-level language model.\n"
+"There is a third option, (3), which is to attempt a character-level language model, as well.\n"
 ]
 }
 ]

notebooks/transformer/README.md

Lines changed: 15 additions & 21 deletions
@@ -65,12 +65,12 @@ trainer.run()
 
 ## Background
 
-### Sequence models
+### Sequence Models
 
 The Transformer is a part of the realm of "sequence models",
 models that attempt to map an input (source) sequence to an output (target) sequence.
 Sequence models encompass a wide range of representations, from long-standing, classical probabilistic approaches such as
-Hidden Markov Models (HMMs), Bayesian networks, et.c to more recent "deep learning" models.
+Hidden Markov Models (HMMs), Conditional Random Fields (CRFs), etc. to more recent "deep learning" models.
 
 Sequence models come in many different varieties of problems as seen below:
 
@@ -79,26 +79,23 @@ Sequence models come in many different varieties of problems as seen below:
 
 Image source: Karpathy's article on RNNs [here](http://karpathy.github.io/2015/05/21/rnn-effectiveness/).
 
-### The OG Transformer is posed as a Machine Translation model
+### The OG Transformer is Posed as a Machine Translation Model
 
 In the original Transformer, the task being trained is that of machine translation, i.e., taking one sentence in one
 language and learning to translate it into another language. This is a "many-to-many" sequence problem.
 
-![Machine translation](/media/machine_translation.png)
-
-
-Image source: DeepAi.org [here](https://deepai.org/machine-learning-glossary-and-terms/neural-machine-translation)
-
-### The predecessor to Transformer: RNN
+### The Predecessor to Transformer: RNN
 
 
 Prior to the Transformer, the dominant architecture found in "deep" sequence models was the
 recurrent network (i.e. RNN). While the convolutional network shares parameters across space,
-the recurrent model shares parameters across the time dimension (left to right in a sequence). At each time step,
+the recurrent model shares parameters across the time dimension (for instance, left to right in a sequence). At each time step,
 a new hidden state is computed using the previous hidden state and the current sequence value. These hidden states
 serve the function of "memory" within the model. The model hopes to encode useful enough information into these
 states such that it can derive contextual relationships between a given word and any previous words in a sequence.
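
Here is a minimal sketch (not the repository's code) of the recurrence just described: the same weights are applied at every time step, and each new hidden state is computed from the previous hidden state and the current input. All sizes are illustrative.

```python
import torch

# Illustrative sizes; the weights are shared across all time steps.
embed_dim, hidden_dim, seq_len = 8, 16, 5
W_x = torch.randn(hidden_dim, embed_dim)
W_h = torch.randn(hidden_dim, hidden_dim)
b = torch.zeros(hidden_dim)

h = torch.zeros(hidden_dim)                    # initial "memory"
for x_t in torch.randn(seq_len, embed_dim):    # one step per sequence element
    h = torch.tanh(W_x @ x_t + W_h @ h + b)    # same parameters every step
# `h` now summarizes the whole sequence; an encoder would hand it to the decoder.
```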
 
+### Encoder-Decoder Architectures
+
 These RNN cells form the basis of an "encoder-decoder" architecture.
 
 ![Encoder decoder](/media/encoder-decoder.png)
@@ -107,13 +104,13 @@ These RNN cells form the basis of an "encoder-decoder" architecture.
 Image source: Figure 9.7.1. in this illustrated guide [here](https://d2l.ai/chapter_recurrent-modern/seq2seq.html).
 
 The goal of the encoder-decoder is to take a source sequence
-and predict a target sequence (sequence-to-sequence or seq2seq). A common example of a seq2seq task is machine translation of one language to another.
+and predict a target sequence (sequence-to-sequence or "seq2seq"). A common example of a seq2seq task is machine translation of one language to another.
 An encoder maps the source sequence into a hidden state that is then passed to a decoder. The decoder then attempts to predict the next word in a target sequence using the encoder's hidden state(s) and
 the prior decoder hidden state(s).
 
 2 different challenges confront the RNN class of models.
 
-### RNN and learning complex context
+### RNN and Learning Complex Context
 
 First, there is a challenge of specifying an RNN architecture capable of learning enough context to aid in
 predicting longer and more complex sequences. This has been an area of continual innovation. The first breakthrough was to
@@ -131,16 +128,12 @@ Then for each time step in the decoder block, a "context" state would be derived
 decoder could determine which words (via their hidden state) to "pay attention to" in the source sequence in order to predict
 the next word. This breakthrough was shown to extend the prediction power of RNNs in longer sequences.
 
-### Sequential computation difficult to parallelize
+### Sequential Computation Difficult to Parallelize
 
 Second, due to the sequential nature of how RNNs are computed, RNNs can be slow to train at scale.
 
-### RNNs aren't dominantly worse btw
-
 For a positive perspective on RNNs,
 see Andrej Karpathy's blog post on RNNs [here](http://karpathy.github.io/2015/05/21/rnn-effectiveness/).
-One thing, for example, to realize about the transformer models (in particular, the language model versions such as GPT)
-is that they have a finite context window while RNNs have theoretically an infinite context window.
 
 ## Transformer
 
@@ -165,16 +158,17 @@ encoding via a series of `sin(pos,wave_number)` and `cos(pos,wave_number)` funct
 In this plot you can see the different wave functions along the sequence length.
 
 These fixed positional encodings are added to word embeddings of the same dimension such that these tensors capture both
-relative _semantic_ (note: this is open to interpretation. neural nets do better with dense representations)
+relative _semantic_ (note: this is open to interpretation)
 and _positional_ relationships between words. These representations are then passed downstream into the
 encoder and decoder stacks.
 
+Note that the parameters in this layer are fixed.
+
 ### Attention (Self and Encoder-Decoder Attention)
 
 The positional embeddings described above are then passed to the encoder-decoder stacks where the attention mechanism
-is used to identify the contextual relationship between words. Attention can be thought of a mechanism that scales
-values along the input sequence by values computing using "query" and "key" pairs. While attention mechanisms are an active
-area of research, the authors used a scaled dot-product attention calculation.
+is used to identify the inter-relationship between words in the translation task. Note that attention mechanisms are an active
+area of research and that the authors used a scaled dot-product attention calculation.
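
As a supplement, here is a minimal sketch (not the repository's implementation) of the scaled dot-product attention calculation mentioned above, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, with illustrative tensor sizes and a single head.

```python
import math
import torch

# Illustrative sizes for a single attention head over one sequence.
seq_len, d_k = 6, 16
Q = torch.randn(seq_len, d_k)   # queries
K = torch.randn(seq_len, d_k)   # keys
V = torch.randn(seq_len, d_k)   # values

scores = Q @ K.T / math.sqrt(d_k)          # similarity of each query to each key, scaled
weights = torch.softmax(scores, dim=-1)    # how much each position attends to the others
output = weights @ V                       # weighted sum of the values
print(output.shape)                        # torch.Size([6, 16])
```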
 
 As the model trains these parameters, this mechanism
 emphasizes the importance of different terms in learning the context within a sequence as well as across the source and target

notebooks/transformer/transformer.ipynb

Lines changed: 5 additions & 12 deletions
@@ -122,7 +122,7 @@
 "id": "hLYfAP4LiFca"
 },
 "source": [
-"Ahould be True. If not, debug (Note: version of pytorch I used is not capatible with CUDA drivers on colab. Follow these instructions here explicitly)."
+"Should be True. If not, debug (Note: the version of pytorch I used is not compatible with the CUDA drivers on colab. Follow these instructions here explicitly)."
 ]
 },
 {
@@ -251,7 +251,7 @@
 "source": [
 "## Language Translation: German to English\n",
 "\n",
-"We will try to train our transformer model to learn how to translate German -> English using the torchtext::Multi30k data."
+"We will try to train our transformer model to learn how to translate German -> English using the Multi30k data."
 ]
 },
 {
@@ -264,7 +264,7 @@
 "### Hyper-parameters\n",
 "\n",
 "These are the data processing and model training hyper-parameters for this run. Note that we are running a smaller model\n",
-"than cited in the paper for fewer iterations...on a CPU. This is meant merely to demonstrate it works."
+"than cited in the paper for fewer iterations."
 ]
 },
 {
@@ -761,16 +761,9 @@
 },
 "source": [
 "As the model picks up on more signal, we would expect more distinct patterns as the attention layers learn the relationship\n",
-"between different tokens both within a sequence and across encoder-decoder. While the pattern in the encoder self-attention\n",
-"layers appear noisy, the decoder self-attention and encoder-decoder attention layers appear to have a more discernible pattern.\n",
+"between different tokens both within a sequence and across encoder-decoder.\n",
 "\n",
-"(Note that I didn't run the code long enough to get a good enough model. I also set num_layers = 2 instead of 6 and ran for 15\n",
-"epochs on a CPU.)\n",
-"\n",
-"Attention-based architectures are a very active area of research. Note that one of the drawbacks of this attention mechanism\n",
-"is that it scales quadratic in time and memory with the size of sequences. As a result, there is research into sparsity-based\n",
-"attention mechanisms as one potential solution (i.e., Google's Performer model). Refer to the README for a more in-depth\n",
-"overview of the Transformer models."
+"Refer to the README for a more in-depth overview of the attention patterns."
 ]
 }
 ]
