This repository was archived by the owner on Jul 12, 2021. It is now read-only.

Commit ae34182

** corrections to language.
1 parent b589ab5 commit ae34182

8 files changed: +105 / -91 lines changed


README.md

Lines changed: 2 additions & 2 deletions
@@ -58,11 +58,11 @@ This repository has the following features:
 
 After reviewing these models, the world's your oyster in terms of other models to explore:
 
-ELMO, XLNET, all the other BERTs, BART, Performer, T5, etc....
+Char-RNN, BERT, ELMO, XLNET, all the other BERTs, BART, Performer, T5, etc....
 
 ## Roadmap
 
-Future models:
+Future models to implement:
 
 - [ ] Char-RNN (Karpathy)
 - [ ] BERT

notebooks/cnn/README.md

Lines changed: 18 additions & 13 deletions
@@ -81,11 +81,12 @@ of those features, say either making a feature possess a bell-curve distribution
 In terms of feature engineering/representations, a breakthrough in this regard was employing unsupervised training to derive semantic [embeddings](../word2vec/README.md) as features
 (link is to the implementation of Skipgram).
 Many paper results have shown a significant improvement in model performance when using these pre-trained representations. It's
-surprising to see that one could input these into a linear model and outperform TF-IDF in many situations. One downside of using embeddings
-in a linear model is that you need to pool across the embeddings in order to collapse the dimensionality, so a max/average pooling technique
+surprising to see that one could input these into a linear regression and outperform TF-IDF in many situations. One downside of using a linear model
+is that you need to pool across the embeddings in order to collapse the dimensionality, so a max/average pooling technique
 is often required.
 
-Following this logical thread, is there a way to take these embeddings and extract further information from them?
+Following this logical thread, how can we further capture the inter-dependency between words when trying to predict
+a target? Is there a way to take these embeddings and extract further information from them?
 *How can we get at automated feature extraction?*
 
 This is where deep learning, and in particular, convolutional neural networks (CNNs), come into play.
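
To make the pooling step above concrete, here is a minimal sketch (not code from this repository) of mean/max pooling a variable-length sequence of word embeddings into one fixed-length feature vector that a linear model could consume. The vocabulary and embedding matrix below are illustrative stand-ins; in practice the embeddings would come from a pre-trained model such as the Skipgram implementation linked above.

```python
import numpy as np

# Illustrative stand-ins: a tiny vocabulary and a random embedding matrix.
rng = np.random.default_rng(0)
vocab = {"the": 0, "movie": 1, "was": 2, "great": 3}
embeddings = rng.normal(size=(len(vocab), 50))  # (vocab_size, embed_dim)

def pool_features(tokens, mode="mean"):
    # Collapse (seq_len, embed_dim) down to a single (embed_dim,) vector.
    vecs = np.stack([embeddings[vocab[t]] for t in tokens])
    return vecs.mean(axis=0) if mode == "mean" else vecs.max(axis=0)

x = pool_features(["the", "movie", "was", "great"])  # (50,) feature vector
# x can now be fed to a linear model, e.g. a logistic regression classifier.
```
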
@@ -110,26 +111,30 @@ convolution filters applied to the pixels of a picture of the Taj Mahal. One is
 the colors.
 
 Convolution filters and pooling layers form the bedrock of the CNN architecture.
-These are neural network architectures that *derive* the values of a series of convolution filters
-in order to extract useful features from a series of inputs.
+These are neural network architectures that **automatically derive convolution filters**
+in order to boost the model's ability to learn a target.
 
 ### Advantages of CNNs
 
-Convolution layers have many advantages in that one can vary
-the number of filters simultaneously running over a set of inputs, as well as the properties such as:
-(1) *size of the filters* (i.e. the window size of the filter as it moves over the set of inputs); (2) *the stride or pace
-of the filters* (i.e. if it skips over the volume of inputs ); etc. One benefit of using convolution layers is that they
-may be stacked on top of each other in a series of layers. *Each layer of convolution filters is thought to derive a different
-level of feature extraction*, from the most rudimentary at the deepest levels to the finer details at the shallowest levels.
+CNNs are highly flexible. One has several knobs available when selecting these layers (see the sketch below):
+1. *the number of simultaneous filters* (i.e., how many different simultaneous feature derivations to make from an input)
+2. *size of the filters* (i.e. the window size of the filter as it moves over the set of inputs)
+3. *the stride or pace of the filters* (i.e. whether it skips over parts of the input volume); etc.
+
+Another benefit of using convolution layers is that they
+may be stacked on top of each other in a series of layers. Each layer of convolution filters is thought to derive a different
+level of feature extraction, from the most rudimentary at the deepest levels to the finer details at the shallowest levels.
+
 Pooling layers are interspersed between convolution layers in order to summarize (i.e. reduce the dimensionality of)
 the information from a set of feature maps via sub-sampling.
 
 A final note is that CNNs are typically considered very fast to train compared to other typical deep
-architectures (like say the RNN) as they process things in a simultaneous manner.
+architectures (like, say, the RNN) as they process a batch of data simultaneously.
 
 ### CNNs Work Well for Classification/Identification/Detection
 
-Both pooling and convolution operations are **locally invariant**, which means that their ability to detect a feature
+Both pooling and convolution operations have the highly useful property that they are **locally invariant**,
+which means that their ability to detect a feature
 is independent of the location in the set of inputs. This lends itself very well to classification tasks.
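
The following sketch (not the repository's implementation) shows the knobs listed above on a toy batch of embedded token sequences: the number of filters, the kernel size, and the stride of a 1-D convolution, followed by a pooling layer that summarizes each feature map. All sizes are illustrative.

```python
import torch
import torch.nn as nn

# A toy batch of embedded token sequences; Conv1d expects (batch, channels, length).
batch, seq_len, embed_dim = 8, 20, 50
x = torch.randn(batch, embed_dim, seq_len)

conv = nn.Conv1d(
    in_channels=embed_dim,   # each embedding dimension is an input channel
    out_channels=16,         # knob 1: number of simultaneous filters
    kernel_size=3,           # knob 2: window size of each filter
    stride=1,                # knob 3: how far the filter moves at each step
)
pool = nn.AdaptiveMaxPool1d(1)             # summarize each feature map to one value

feature_maps = torch.relu(conv(x))         # (batch, 16, seq_len - 2)
features = pool(feature_maps).squeeze(-1)  # (batch, 16) -> ready for a classifier
```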
 
 ## Model-Specifics

notebooks/gpt/README.md

Lines changed: 29 additions & 22 deletions
@@ -70,20 +70,21 @@ trainer.run()
 GPT, which stands for "Generative Pre-trained Transformer", is a part of the realm of "sequence models" (sequence-to-sequence or "seq2seq"),
 models that attempt to map an input (source) sequence to an output (target) sequence.
 Sequence models encompass a wide range of representations, from long-standing, classical probabilistic approaches such as
-Hidden Markov Models (HMMs), Bayesian networks, et.c to more recent "deep learning" models such as recurrent neural networks (RNNs).
+Hidden Markov Models (HMMs), Conditional Random Fields (CRFs), etc. to more recent "deep learning" models such as recurrent neural networks (RNNs).
 
 GPT belongs to a newer class of models known as Transformers, which we touch upon [here](../transformer/README.md).
 
 ### GPT is a Language Model Transformer
 
 Unlike the original Transformer, which is posed as a Machine Translation model
-(i.e. translate one sequence into another sequence), the GPT is a Language Model (LM). LMs ask the natural question:
+(i.e. translate one sequence into another sequence), the GPT is a **Language Model**. LMs ask the natural question:
 given a sequence of words, what is most likely to follow? Put mathematically, LMs are concerned with predicting
-the next term(s) in a sequence conditional on all the previous points in the sequence:
+the next term(s) in a sequence conditional on a neighboring window of words. In the GPT model, this context is
+all the previous points in the sequence:
 ```python
-p(u_i|u_i-1,u_i-2,...,u_i-block_size)
+p(u_i | context) := p(u_i | u_i-1, u_i-2, ..., u_i-block_size)
 ```
-In this sense, LMs are -auto-regressive- models.
+In this sense, we see that GPT is an **auto-regressive** model.
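
As a supplement to the formula above, here is a minimal sketch (not this repository's code) of the auto-regressive constraint in a decoder-only setting: a causal mask so that position `i` can only attend to positions up to `i`, i.e., the previous `block_size` tokens.

```python
import torch

# Causal (lower-triangular) mask: position i may only see positions <= i,
# so the model predicts u_i from u_{i-1}, ..., u_{i-block_size}.
block_size = 8
causal_mask = torch.tril(torch.ones(block_size, block_size, dtype=torch.bool))

scores = torch.randn(block_size, block_size)               # raw attention scores (illustrative)
scores = scores.masked_fill(~causal_mask, float("-inf"))   # hide "future" positions
attn = torch.softmax(scores, dim=-1)                       # each row sums to 1 over the visible past
print(attn[0])   # the first token can only attend to itself
```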
 
 To fit the Transformer architecture to an LM problem, we can take the encoder-decoder architecture of the OG Transformer
 and discard the encoder. The decoder, you will notice, is inherently concerned with predicting the next item in a sequence
@@ -102,6 +103,9 @@ ability to help learn useful things (see notes on word2vec embeddings [here](../
 Motivated by this, they hypothesized LMs as the ultimate self-supervised models that could ultimately be applied as
 transfer learners.
 
+Unlike embeddings, language models have the ability to capture the *contextual* meaning of a word. For instance,
+a language model can differentiate between the meaning of "bear" in "the right to bear arms" and "I saw a black bear".
+
 ### The GPT Approach
 
 Using their own decoder Transformer architecture, they establish a "semi-supervised" approach using a combination
@@ -142,36 +146,39 @@ Image source: Radford et al. (2018)
 
 ## GPT-Model-Details
 
-Here are some small notes from the each paper. I highly recommend you take a crack at reading them further better
-information.
+Here are some small notes from each paper. Note that my observations are not comprehensive and I am still wrapping my head
+around GPT-3.
+
+I highly recommend reading the original papers.
 
 ### GPT-1
 
 Most of the above was written based on GPT-1 observations.
 Here are some other notes:
 
 * pre-training:
-    - on BooksCorpus dataset. Didn't like Word Benchmark because it
+    >- on BooksCorpus dataset. Didn't like Word Benchmark because it
       shuffled sentences, breaking up dependencies.
 * 12-layer decoder-only transformer.
-    - dim_model = 768
-    - num_heads = 12 (of self-attention model)
+    >- dim_model = 768
+    >- num_heads = 12 (of self-attention model)
     - Optimization:
-        - Max learning rate of 2.5e-4
-        - 2000 updates linear increase, annealed to 0 using Cosine Scheduler
-          (Note: I stuck with NoamOptimizer to make life simple)
-        - Weight initialization of N(0,0.02)
-        - BPE with 40k merges (Note: stuck with usual word-encoding)
-        - Activation functions used were GELU instead of RELU
-        - Learned positions instead of original fixed Sinusoidal
-        - Spacy tokenizer (Note: I used pre-built torchtext)
+        >- Max learning rate of 2.5e-4
+        >- 2000 updates linear increase, annealed to 0 using Cosine Scheduler
+        >  (Note: I stuck with NoamOptimizer to make life simple; a sketch of this schedule follows below)
+        >- Weight initialization of N(0,0.02)
+        >- BPE with 40k merges (Note: I stuck with usual word-encoding, not BPE)
+        >- Activation functions used were GELU instead of RELU
+        >- Learned positions instead of original fixed Sinusoidal
+        >- Spacy tokenizer (Note: I used my own tokenizer)
 * fine-tuning:
-    - Same hyper-parameters as pre-training + drop-out.
-    - Found 3 epochs of training was sufficient.
+    >- Same hyper-parameters as pre-training + drop-out.
+    >- Found 3 epochs of training was sufficient.
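
As referenced in the optimization notes above, here is a minimal sketch of the GPT-1 learning-rate schedule (linear warm-up over 2000 updates to a max of 2.5e-4, then cosine annealing toward 0). The `total_steps` value and the placeholder parameters are assumptions for illustration; the repository itself sticks with a NoamOptimizer.

```python
import math
import torch

# Assumed values: warm-up and max LR from the GPT-1 notes, total_steps is illustrative.
warmup_steps, total_steps, max_lr = 2000, 100_000, 2.5e-4
params = [torch.nn.Parameter(torch.zeros(1))]          # placeholder parameters
optimizer = torch.optim.Adam(params, lr=max_lr)

def lr_lambda(step: int) -> float:
    if step < warmup_steps:
        return step / max(1, warmup_steps)                          # linear increase
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))               # anneal toward 0

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
for step in range(3):
    optimizer.step()        # real gradients would be applied here
    scheduler.step()
    print(step, scheduler.get_last_lr())
```
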
 
 ### GPT-2
 
-tl;dr GPT-1 with larger pre-training and **multi-task learning**.
+*tl;dr* GPT-1 with larger pre-training and **multi-task learning**.
+
 
 The largest model achieves SOTA on 7/8 benchmarks in the zero-shot (i.e., no fine-tuning) setting.

@@ -203,7 +210,7 @@ Here is a list of some model specifics that changed since GPT-1:
 
 This paper I'm still working through. The model is now at 175B parameters with
 96 layers, 96 heads, and dim_model = 12,288. I gather that the
-attention mechanism is different.
+attention mechanism may also be different from the canonical self-attention.
 
 ## Features
 
notebooks/gpt/gpt.ipynb

Lines changed: 4 additions & 4 deletions
@@ -122,7 +122,7 @@
 "id": "vH8GgVrAKHfW"
 },
 "source": [
-"Ahould be True. If not, debug (Note: version of pytorch I used is not capatible with CUDA drivers on colab. Follow these instructions here explicitly)."
+"Should be True. If not, debug (Note: the version of pytorch I used is not compatible with the CUDA drivers on colab. Follow these instructions here explicitly)."
 ]
 },
 {
@@ -253,7 +253,7 @@
 "source": [
 "## Language Model: WikiText2\n",
 "\n",
-"We will try to train our transformer model to learn how to predict the next word in torchtext WikiText2 database."
+"We will try to train our GPT model to learn how to predict the next word in WikiText2 data."
 ]
 },
 {
@@ -328,7 +328,7 @@
 "# Self-supervised training\n",
 "\n",
 "\n",
-"This is an unsupervised (more aptly described as \"self-supervised\") loss. After this model is trained,\n",
+"This is the self-supervised pre-training. After this model is trained,\n",
 "we can then continue it onto another problem (we can freeze layers to only continue training the top layers)."
 ]
 },
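
The cell above mentions continuing onto another problem with frozen layers. Here is a minimal sketch of that idea; `TinyLM` and its attribute names are hypothetical stand-ins, not the notebook's model.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a pre-trained GPT-style model (names are illustrative).
class TinyLM(nn.Module):
    def __init__(self, vocab=100, dim=32, n_blocks=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.blocks = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_blocks))
        self.head = nn.Linear(dim, vocab)

    def forward(self, x):
        h = self.embed(x)
        for block in self.blocks:
            h = torch.relu(block(h))
        return self.head(h)

model = TinyLM()
for p in model.parameters():                 # freeze everything...
    p.requires_grad = False
for p in list(model.blocks[-1].parameters()) + list(model.head.parameters()):
    p.requires_grad = True                   # ...except the top block and the head

# Fine-tune only the parameters that still require gradients.
optimizer = torch.optim.Adam((p for p in model.parameters() if p.requires_grad), lr=1e-4)
```
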
@@ -591,7 +591,7 @@
 "Well, as expected... this doesn't make any sense really. Pockets of words make sense, but overall it does not.\n",
 "\n",
 "A couple of considerations for further work: (1) training for longer and (2) larger models/different hyperparameters.\n",
-"There is a third option, (3), which is to attempt a character-level language model.\n"
+"There is a third option, (3), which is to attempt a character-level language model, as well.\n"
 ]
 }
 ]

notebooks/transformer/README.md

Lines changed: 15 additions & 21 deletions
@@ -65,12 +65,12 @@ trainer.run()
 
 ## Background
 
-### Sequence models
+### Sequence Models
 
 The Transformer is a part of the realm of "sequence models",
 models that attempt to map an input (source) sequence to an output (target) sequence.
 Sequence models encompass a wide range of representations, from long-standing, classical probabilistic approaches such as
-Hidden Markov Models (HMMs), Bayesian networks, et.c to more recent "deep learning" models.
+Hidden Markov Models (HMMs), Conditional Random Fields (CRFs), etc. to more recent "deep learning" models.
 
 Sequence models come in many different varieties of problems as seen below:
 
@@ -79,26 +79,23 @@ Sequence models come in many different varieties of problems as seen below:
 
 Image source: Karpathy's article on RNNs [here](http://karpathy.github.io/2015/05/21/rnn-effectiveness/).
 
-### The OG Transformer is posed as a Machine Translation model
+### The OG Transformer is Posed as a Machine Translation Model
 
 In the original Transformer, the task being trained is that of machine translation, i.e., taking one sentence in one
 language and learning to translate it into another language. This is a "many-to-many" sequence problem.
 
-![Machine translation](/media/machine_translation.png)
-
-
-Image source: DeepAi.org [here](https://deepai.org/machine-learning-glossary-and-terms/neural-machine-translation)
-
-### The predecessor to Transformer: RNN
+### The Predecessor to Transformer: RNN
 
 
 Prior to the Transformer, the dominant architecture found in "deep" sequence models was the
 recurrent network (i.e. RNN). While the convolutional network shares parameters across space,
-the recurrent model shares parameters across the time dimension (left to right in a sequence). At each time step,
+the recurrent model shares parameters across the time dimension (for instance, left to right in a sequence). At each time step,
 a new hidden state is computed using the previous hidden state and the current sequence value. These hidden states
 serve the function of "memory" within the model. The model hopes to encode useful enough information into these
 states such that it can derive contextual relationships between a given word and any previous words in a sequence.
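
Here is a minimal sketch (not the repository's code) of the recurrence just described: the same weights are applied at every time step, and each new hidden state is computed from the previous hidden state and the current input. All sizes are illustrative.

```python
import torch

# Illustrative sizes; the weights are shared across all time steps.
embed_dim, hidden_dim, seq_len = 8, 16, 5
W_x = torch.randn(hidden_dim, embed_dim)
W_h = torch.randn(hidden_dim, hidden_dim)
b = torch.zeros(hidden_dim)

h = torch.zeros(hidden_dim)                    # initial "memory"
for x_t in torch.randn(seq_len, embed_dim):    # one step per sequence element
    h = torch.tanh(W_x @ x_t + W_h @ h + b)    # same parameters every step
# `h` now summarizes the whole sequence; an encoder would hand it to the decoder.
```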
 
+### Encoder-Decoder Architectures
+
 These RNN cells form the basis of an "encoder-decoder" architecture.
 
 ![Encoder decoder](/media/encoder-decoder.png)
@@ -107,13 +104,13 @@ These RNN cells form the basis of an "encoder-decoder" architecture.
 Image source: Figure 9.7.1. in this illustrated guide [here](https://d2l.ai/chapter_recurrent-modern/seq2seq.html).
 
 The goal of the encoder-decoder is to take a source sequence
-and predict a target sequence (sequence-to-sequence or seq2seq). A common example of a seq2seq task is machine translation of one language to another.
+and predict a target sequence (sequence-to-sequence or "seq2seq"). A common example of a seq2seq task is machine translation of one language to another.
 An encoder maps the source sequence into a hidden state that is then passed to a decoder. The decoder then attempts to predict the next word in a target sequence using the encoder's hidden state(s) and
 the prior decoder hidden state(s).
 
 2 different challenges confront the RNN class of models.
 
-### RNN and learning complex context
+### RNN and Learning Complex Context
 
 First, there is a challenge of specifying an RNN architecture capable of learning enough context to aid in
 predicting longer and more complex sequences. This has been an area of continual innovation. The first breakthrough was to
@@ -131,16 +128,12 @@ Then for each time step in the decoder block, a "context" state would be derived
 decoder could determine which words (via their hidden state) to "pay attention to" in the source sequence in order to predict
 the next word. This breakthrough was shown to extend the prediction power of RNNs in longer sequences.
 
-### Sequential computation difficult to parallelize
+### Sequential Computation Difficult to Parallelize
 
 Second, due to the sequential nature of how RNNs are computed, RNNs can be slow to train at scale.
 
-### RNNs aren't dominantly worse btw
-
 For a positive perspective on RNNs,
 see Andrej Karpathy's blog post on RNNs [here](http://karpathy.github.io/2015/05/21/rnn-effectiveness/).
-One thing, for example, to realize about the transformer models (in particular, the language model versions such as GPT)
-is that they have a finite context window while RNNs have theoretically an infinite context window.
 
 ## Transformer
 
@@ -165,16 +158,17 @@ encoding via a series of `sin(pos,wave_number)` and `cos(pos,wave_number)` funct
 In this plot you can see the different wave functions along the sequence length.
 
 These fixed positional encodings are added to word embeddings of the same dimension such that these tensors capture both
-relative _semantic_ (note: this is open to interpretation. neural nets do better with dense representations)
+relative _semantic_ (note: this is open to interpretation)
 and _positional_ relationships between words. These representations are then passed downstream into the
 encoder and decoder stacks.
 
+Note that the parameters in this layer are fixed.
+
 ### Attention (Self and Encoder-Decoder Attention)
 
 The positional embeddings described above are then passed to the encoder-decoder stacks where the attention mechanism
-is used to identify the contextual relationship between words. Attention can be thought of a mechanism that scales
-values along the input sequence by values computing using "query" and "key" pairs. While attention mechanisms are an active
-area of research, the authors used a scaled dot-product attention calculation.
+is used to identify the inter-relationship between words in the translation task. Note that attention mechanisms are an active
+area of research and that the authors used a scaled dot-product attention calculation.
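
As a supplement, here is a minimal sketch (not the repository's implementation) of the scaled dot-product attention calculation mentioned above, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, with illustrative tensor sizes and a single head.

```python
import math
import torch

# Illustrative sizes for a single attention head over one sequence.
seq_len, d_k = 6, 16
Q = torch.randn(seq_len, d_k)   # queries
K = torch.randn(seq_len, d_k)   # keys
V = torch.randn(seq_len, d_k)   # values

scores = Q @ K.T / math.sqrt(d_k)          # similarity of each query to each key, scaled
weights = torch.softmax(scores, dim=-1)    # how much each position attends to the others
output = weights @ V                       # weighted sum of the values
print(output.shape)                        # torch.Size([6, 16])
```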
 
 As the model trains these parameters, this mechanism
 emphasizes the importance of different terms in learning the context within a sequence as well as across the source and target

notebooks/transformer/transformer.ipynb

Lines changed: 5 additions & 12 deletions
@@ -122,7 +122,7 @@
 "id": "hLYfAP4LiFca"
 },
 "source": [
-"Ahould be True. If not, debug (Note: version of pytorch I used is not capatible with CUDA drivers on colab. Follow these instructions here explicitly)."
+"Should be True. If not, debug (Note: the version of pytorch I used is not compatible with the CUDA drivers on colab. Follow these instructions here explicitly)."
 ]
 },
 {
@@ -251,7 +251,7 @@
 "source": [
 "## Language Translation: German to English\n",
 "\n",
-"We will try to train our transformer model to learn how to translate German -> English using the torchtext::Multi30k data."
+"We will try to train our transformer model to learn how to translate German -> English using the Multi30k data."
 ]
 },
 {
@@ -264,7 +264,7 @@
 "### Hyper-parameters\n",
 "\n",
 "These are the data processing and model training hyper-parameters for this run. Note that we are running a smaller model\n",
-"than cited in the paper for fewer iterations...on a CPU. This is meant merely to demonstrate it works."
+"than cited in the paper for fewer iterations."
 ]
 },
 {
@@ -761,16 +761,9 @@
 },
 "source": [
 "As the model picks up on more signal, we would expect more distinct patterns as the attention layers learn the relationship\n",
-"between different tokens both within a sequence and across encoder-decoder. While the pattern in the encoder self-attention\n",
-"layers appear noisy, the decoder self-attention and encoder-decoder attention layers appear to have a more discernible pattern.\n",
+"between different tokens both within a sequence and across encoder-decoder.\n",
 "\n",
-"(Note that I didn't run the code long enough to get a good enough model. I also set num_layers = 2 instead of 6 and ran for 15\n",
-"epochs on a CPU.)\n",
-"\n",
-"Attention-based architectures are a very active area of research. Note that one of the drawbacks of this attention mechanism\n",
-"is that it scales quadratic in time and memory with the size of sequences. As a result, there is research into sparsity-based\n",
-"attention mechanisms as one potential solution (i.e., Google's Performer model). Refer to the README for a more in-depth\n",
-"overview of the Transformer models."
+"Refer to the README for a more in-depth overview of the attention patterns."
 ]
 }
 ]
