notebooks/cnn/README.md (+18 −13)
@@ -81,11 +81,12 @@ of those features, say either making a feature possess a bell-curve distribution
 In terms of feature engineering/representations, a breakthrough in this regard was employing unsupervised training to derive semantic [embeddings](../word2vec/README.md) as features
 (link is to an implementation of Skipgram).
 Many papers have shown a significant improvement in model performance when using these pre-trained representations. It's
-surprising to see that one could input these into a linear model and outperform TF-IDF in many situations. One downside of using embeddings
-in a linear model is that you need to pool across the embeddings in order to collapse the dimensionality, so a max/average pooling technique
+surprising to see that one could input these into a linear regression and outperform TF-IDF in many situations. One downside of using a linear model
+is that you need to pool across the embeddings in order to collapse the dimensionality, so a max/average pooling technique
 is often required.

-Following this logical thread, is there a way to take these embeddings and extract further information from them?
+Following this logical thread, how can we further capture the inter-dependency between words when trying to predict
+a target? Is there a way to take these embeddings and extract further information from them?

 *How can we get at automated feature extraction?*

 This is where deep learning, and in particular, convolutional neural networks (CNNs) come into play.
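To make the pooled-embedding baseline described in this hunk concrete, here is a minimal sketch (not code from the repo's notebooks; the embedding matrix, dimensions, and class count are illustrative assumptions): average-pool pre-trained word embeddings over a sentence and feed the result to a linear classifier.

```python
import torch
import torch.nn as nn

class PooledEmbeddingClassifier(nn.Module):
    """Average pre-trained word embeddings over a sequence, then apply a linear layer."""
    def __init__(self, pretrained: torch.Tensor, num_classes: int):
        super().__init__()
        # freeze=True keeps the pre-trained (e.g., Skipgram) vectors fixed
        self.embedding = nn.Embedding.from_pretrained(pretrained, freeze=True)
        self.linear = nn.Linear(pretrained.size(1), num_classes)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        emb = self.embedding(token_ids)   # (batch, seq_len, emb_dim)
        pooled = emb.mean(dim=1)          # average pooling collapses the seq_len dimension
        return self.linear(pooled)        # (batch, num_classes)

# Illustrative usage: vocabulary of 10k words, 100-dim vectors, 2 classes.
pretrained_vectors = torch.randn(10_000, 100)
model = PooledEmbeddingClassifier(pretrained_vectors, num_classes=2)
logits = model(torch.randint(0, 10_000, (4, 20)))  # batch of 4 sequences, length 20
```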
@@ -110,26 +111,30 @@ convolution filters applied to the pixels of a picture of the Taj Mahal. One is
 the colors.


 Convolution filters and pooling layers form the bedrock of the CNN architecture.
-These are neural network architectures that *derive* the values of a series of convolution filters
-in order to extract useful features from a series of inputs.
+These are neural network architectures that **automatically derive convolution filters**
+in order to boost the model's ability to learn a target.

 ### Advantages of CNNs

-Convolution layers have many advantages in that one can vary
-the number of filters simultaneously running over a set of inputs, as well as the properties such as:
-(1) *size of the filters* (i.e. the window size of the filter as it moves over the set of inputs); (2) *the stride or pace
-of the filters* (i.e. if it skips over the volume of inputs ); etc. One benefit of using convolution layers is that they
-may be stacked on top of each other in a series of layers. *Each layer of convolution filters is thought to derive a different
-level of feature extraction*, from the most rudimentary at the deepest levels to the finer details at the shallowest levels.
+CNNs are highly flexible. One has several knobs available when selecting these layers:
+1. *the number of simultaneous filters* (i.e., how many different simultaneous feature derivations to make from an input)
+2. *the size of the filters* (i.e., the window size of the filter as it moves over the set of inputs)
+3. *the stride or pace of the filters* (i.e., whether it skips over parts of the input volume); etc.
+
+Another benefit of using convolution layers is that they
+may be stacked on top of each other in a series of layers. Each layer of convolution filters is thought to derive a different
+level of feature extraction, from the most rudimentary at the deepest levels to the finer details at the shallowest levels.
+
 Pooling layers are interspersed between convolution layers in order to summarize (i.e. reduce the dimensionality of)
 the information from a set of feature maps via sub-sampling.

 A final note is that CNNs are typically considered very fast to train compared to other typical deep
-architectures (like say the RNN) as they process things in a simultaneous manner.
+architectures (like, say, the RNN) as they process a batch of data simultaneously.

 ### CNNs Work Well for Classification/Identification/Detection

-Both pooling and convolution operations are **locally invariant**, which means that their ability to detect a feature
+Both pooling and convolution operations have the highly useful property that they are **locally invariant**,
+which means that their ability to detect a feature
 is independent of the location in the set of inputs. This lends itself very well to classification tasks.
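As an illustration of the knobs listed in the hunk above (number of filters, filter size, stride) and of pooling over the resulting feature maps, here is a minimal sketch of a 1D convolution block over a sequence of word embeddings; the dimensions are assumptions chosen for illustration, not values from the notebooks.

```python
import torch
import torch.nn as nn

# Illustrative sizes (assumptions, not taken from the repo).
batch_size, seq_len, emb_dim = 4, 20, 100

embedded = torch.randn(batch_size, seq_len, emb_dim)  # (batch, seq_len, emb_dim)
x = embedded.transpose(1, 2)                          # Conv1d expects (batch, channels, seq_len)

conv = nn.Conv1d(
    in_channels=emb_dim,  # each embedding dimension is an input channel
    out_channels=64,      # number of simultaneous filters / feature maps
    kernel_size=3,        # window size: each filter looks at 3 consecutive tokens
    stride=1,             # pace: move the window one token at a time
)
feature_maps = torch.relu(conv(x))              # (batch, 64, seq_len - 2)

# Pooling summarizes each feature map down to a single value (max over time).
pooled = torch.max(feature_maps, dim=2).values  # (batch, 64)
print(pooled.shape)
```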
notebooks/gpt/gpt.ipynb (+4 −4)
@@ -122,7 +122,7 @@
 "id": "vH8GgVrAKHfW"
 },
 "source": [
-"Ahould be True. If not, debug (Note: version of pytorch I used is not capatible with CUDA drivers on colab. Follow these instructions here explicitly)."
+"Should be True. If not, debug (Note: the version of PyTorch I used is not compatible with the CUDA drivers on Colab. Follow these instructions here explicitly)."
 ]
 },
 {
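The cell this markdown refers to ("Should be True") is presumably the standard PyTorch GPU-availability check; a minimal sketch in case it is unfamiliar (the exact cell contents are not shown in this diff):

```python
import torch

# Expect True on a GPU runtime with a matching torch/CUDA build;
# if False, check the installed torch wheel against the runtime's CUDA drivers.
print(torch.cuda.is_available())
```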
@@ -253,7 +253,7 @@
 "source": [
 "## Language Model: WikiText2\n",
 "\n",
-"We will try to train our transformer model to learn how to predict the next word in torchtext WikiText2 database."
+"We will try to train our GPT model to learn how to predict the next word in the WikiText2 data."
 ]
 },
 {
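For readers unfamiliar with how next-word prediction is set up, here is a minimal sketch of the usual input/target construction for a language model; it is a generic illustration with assumed tensor shapes, not code taken from gpt.ipynb.

```python
import torch
import torch.nn.functional as F

# A toy batch of token ids, shape (batch, seq_len); in practice these come
# from tokenizing WikiText2 text and numericalizing with a vocabulary.
tokens = torch.randint(0, 1000, (2, 9))

# Next-word prediction: the target is the input shifted one position left.
inputs = tokens[:, :-1]   # model sees positions 0..n-2
targets = tokens[:, 1:]   # and must predict positions 1..n-1

# Given model logits of shape (batch, seq_len-1, vocab_size), the loss is
# ordinary cross-entropy against the shifted targets.
vocab_size = 1000
logits = torch.randn(2, 8, vocab_size)  # stand-in for the model's output
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
print(loss.item())
```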
@@ -328,7 +328,7 @@
 "# Self-supervised training\n",
 "\n",
 "\n",
-"This is an unsupervised (more aptly described as \"self-supervised\") loss. After this model is trained,\n",
+"This is the self-supervised pre-training. After this model is trained,\n",
 "we can then continue it onto another problem (we can freeze layers to only continue training the top layers)."
 ]
 },
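As a sketch of the "freeze layers, train only the top" idea mentioned in the hunk above (illustrative only; the `head` attribute name is an assumption about the model class, not taken from the notebook):

```python
import torch.nn as nn

def freeze_body_train_head(model: nn.Module) -> None:
    """Freeze the pre-trained body and leave only the output head trainable."""
    for param in model.parameters():
        param.requires_grad = False        # freeze everything first
    for param in model.head.parameters():  # assumed name of the final projection layer
        param.requires_grad = True         # un-freeze just the top layer(s)

# The optimizer should then only receive the trainable parameters, e.g.:
# optimizer = torch.optim.Adam(p for p in model.parameters() if p.requires_grad)
```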
@@ -591,7 +591,7 @@
 "Well, as expected... this doesn't really make sense. Pockets of words make sense, but overall it does not.\n",
 "\n",
 "A couple of considerations for further work: (1) training for longer and (2) larger models/different hyperparameters.\n",
-"There is a third option, (3), which is to attempt a character-level language model.\n"
+"There is a third option, (3), which is to attempt a character-level language model as well.\n"
 Prior to the Transformer, the dominant architecture found in "deep" sequence models was the
 recurrent network (i.e. RNN). While the convolutional network shares parameters across space,
-the recurrent model shares parameters across the time dimension (left to right in a sequence). At each time step,
+the recurrent model shares parameters across the time dimension (for instance, left to right in a sequence). At each time step,
 a new hidden state is computed using the previous hidden state and the current sequence value. These hidden states
 serve the function of "memory" within the model. The model hopes to encode useful enough information into these
 states such that it can derive contextual relationships between a given word and any previous words in a sequence.

+### Encoder-Decoder Architectures
+
 These RNN cells form the basis of an "encoder-decoder" architecture.
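To ground the recurrence just described (a new hidden state computed from the previous hidden state and the current input), here is a minimal sketch of a single vanilla RNN step; the weight names and sizes are illustrative assumptions, not taken from the repo.

```python
import torch

# Illustrative sizes: 100-dim word embeddings, 128-dim hidden state.
emb_dim, hidden_dim = 100, 128
W_x = torch.randn(hidden_dim, emb_dim) * 0.01     # input-to-hidden weights
W_h = torch.randn(hidden_dim, hidden_dim) * 0.01  # hidden-to-hidden weights (shared across time)
b = torch.zeros(hidden_dim)

def rnn_step(x_t: torch.Tensor, h_prev: torch.Tensor) -> torch.Tensor:
    """One time step: combine the current input with the previous hidden state."""
    return torch.tanh(W_x @ x_t + W_h @ h_prev + b)

# Unroll over a toy sequence of 5 word embeddings, reusing the same weights each step.
h = torch.zeros(hidden_dim)
for x_t in torch.randn(5, emb_dim):
    h = rnn_step(x_t, h)  # h acts as the model's "memory"
```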

@@ -107,13 +104,13 @@ These RNN cells form the basis of an "encoder-decoder" architecture.
 Image source: Figure 9.7.1 in this illustrated guide [here](https://d2l.ai/chapter_recurrent-modern/seq2seq.html).

 The goal of the encoder-decoder is to take a source sequence
-and predict a target sequence (sequence-to-sequence or seq2seq). A common example of a seq2seq task is machine translation of one language to another.
+and predict a target sequence (sequence-to-sequence or "seq2seq"). A common example of a seq2seq task is machine translation of one language to another.
 An encoder maps the source sequence into a hidden state that is then passed to a decoder. The decoder then attempts to predict the next word in a target sequence using the encoder's hidden state(s) and
 the prior decoder hidden state(s).

 Two different challenges confront the RNN class of models.

-### RNN and learning complex context
+### RNN and Learning Complex Context

 First, there is a challenge of specifying an RNN architecture capable of learning enough context to aid in
 predicting longer and more complex sequences. This has been an area of continual innovation. The first breakthrough was to
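For readers who want to see the encoder-decoder pattern above in code, here is a compact GRU-based seq2seq skeleton; it is an illustrative sketch with assumed sizes, not the notebook's transformer implementation.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)

    def forward(self, src):
        _, hidden = self.rnn(self.embed(src))  # hidden: (1, batch, hidden_dim)
        return hidden                          # summary of the source sequence

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tgt, hidden):
        output, hidden = self.rnn(self.embed(tgt), hidden)
        return self.out(output), hidden        # next-word logits at each step

# Toy usage: encode a source batch, then decode conditioned on its hidden state.
src = torch.randint(0, 1000, (2, 7))
tgt = torch.randint(0, 1000, (2, 5))
enc, dec = Encoder(1000), Decoder(1000)
logits, _ = dec(tgt, enc(src))                 # (2, 5, 1000)
```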
@@ -131,16 +128,12 @@ Then for each time step in the decoder block, a "context" state would be derived
 decoder could determine which words (via their hidden state) to "pay attention to" in the source sequence in order to predict
 the next word. This breakthrough was shown to extend the prediction power of RNNs in longer sequences.

-### Sequential computation difficult to parallelize
+### Sequential Computation Difficult to Parallelize

 Second, due to the sequential nature of how RNNs are computed, RNNs can be slow to train at scale.

-### RNNs aren't dominantly worse btw
-
 For a positive perspective on RNNs,
 see Andrej Karpathy's blog post on RNNs [here](http://karpathy.github.io/2015/05/21/rnn-effectiveness/).
-One thing, for example, to realize about the transformer models (in particular, the language model versions such as GPT)
-is that they have a finite context window while RNNs have theoretically an infinite context window.

 ## Transformer

@@ -165,16 +158,17 @@ encoding via a series of `sin(pos,wave_number)` and `cos(pos,wave_number)` functions
 In this plot you can see the different wave functions along the sequence length.

 These fixed positional encodings are added to word embeddings of the same dimension such that these tensors capture both
-relative _semantic_ (note: this is open to interpretation. neural nets do better with dense representations)
+relative _semantic_ (note: this is open to interpretation)
 and _positional_ relationships between words. These representations are then passed downstream into the
 encoder and decoder stacks.

+Note that the parameters in this layer are fixed.
+
 ### Attention (Self and Encoder-Decoder Attention)

 The positional embeddings described above are then passed to the encoder-decoder stacks, where the attention mechanism
-is used to identify the contextual relationship between words. Attention can be thought of a mechanism that scales
-values along the input sequence by values computing using "query" and "key" pairs. While attention mechanisms are an active
-area of research, the authors used a scaled dot-product attention calculation.
+is used to identify the inter-relationship between words in the translation task. Note that attention mechanisms are an active
+area of research and that the authors used a scaled dot-product attention calculation.

 As the model trains these parameters, this mechanism
 emphasizes the importance of different terms in learning the context within a sequence as well as across the source and target
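Since the changed lines above reference scaled dot-product attention, here is a minimal sketch of that calculation as given in "Attention Is All You Need"; the tensor shapes are illustrative assumptions rather than the notebook's configuration.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """softmax(Q K^T / sqrt(d_k)) V -- weights each value by query/key similarity."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # (batch, seq_q, seq_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)                # attention distribution
    return weights @ v, weights

# Toy example: batch of 2, sequence length 5, model dimension 64.
q = k = v = torch.randn(2, 5, 64)  # self-attention: q, k, v come from the same sequence
out, attn = scaled_dot_product_attention(q, k, v)
print(out.shape, attn.shape)       # torch.Size([2, 5, 64]) torch.Size([2, 5, 5])
```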
notebooks/transformer/transformer.ipynb (+5 −12)
@@ -122,7 +122,7 @@
 "id": "hLYfAP4LiFca"
 },
 "source": [
-"Ahould be True. If not, debug (Note: version of pytorch I used is not capatible with CUDA drivers on colab. Follow these instructions here explicitly)."
+"Should be True. If not, debug (Note: the version of PyTorch I used is not compatible with the CUDA drivers on Colab. Follow these instructions here explicitly)."
 ]
 },
 {
@@ -251,7 +251,7 @@
 "source": [
 "## Language Translation: German to English\n",
 "\n",
-"We will try to train our transformer model to learn how to translate German -> English using the torchtext::Multi30k data."
+"We will try to train our transformer model to learn how to translate German -> English using the Multi30k data."
 ]
 },
 {
@@ -264,7 +264,7 @@
 "### Hyper-parameters\n",
 "\n",
 "These are the data processing and model training hyper-parameters for this run. Note that we are running a smaller model\n",
-"than cited in the paper for fewer iterations...on a CPU. This is meant merely to demonstrate it works."
+"than cited in the paper for fewer iterations."
 ]
 },
 {
@@ -761,16 +761,9 @@
 },
 "source": [
 "As the model picks up on more signal, we would expect more distinct patterns as the attention layers learn the relationship\n",
-"between different tokens both within a sequence and across encoder-decoder. While the pattern in the encoder self-attention\n",
-"layers appear noisy, the decoder self-attention and encoder-decoder attention layers appear to have a more discernible pattern.\n",
+"between different tokens both within a sequence and across the encoder and decoder.\n",
 "\n",
-"(Note that I didn't run the code long enough to get a good enough model. I also set num_layers = 2 instead of 6 and ran for 15\n",
-"epochs on a CPU.)\n",
-"\n",
-"Attention-based architectures are a very active area of research. Note that one of the drawbacks of this attention mechanism\n",
-"is that it scales quadratic in time and memory with the size of sequences. As a result, there is research into sparsity-based\n",
-"attention mechanisms as one potential solution (i.e., Google's Performer model). Refer to the README for a more in-depth\n",
-"overview of the Transformer models.",
+"Refer to the README for a more in-depth overview of the attention patterns."
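As a closing illustration of how one might inspect the attention patterns this last hunk discusses, here is a minimal matplotlib sketch for plotting one attention weight matrix as a heatmap; the weights tensor below is random stand-in data, whereas in the notebook it would come from the trained model's attention layers.

```python
import torch
import matplotlib.pyplot as plt

# Stand-in for a (target_len, source_len) attention weight matrix taken from one
# head of one layer; in practice this comes from the model's forward pass.
attn_weights = torch.softmax(torch.randn(5, 7), dim=-1)

fig, ax = plt.subplots()
im = ax.imshow(attn_weights.numpy(), aspect="auto")  # rows: target tokens, cols: source tokens
ax.set_xlabel("source position")
ax.set_ylabel("target position")
fig.colorbar(im, ax=ax)
plt.show()
```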