https://www.youtube.com/watch?v=aircAruvnKk&t=107s
by 3Blue1Brown
'Hello world' of neural networks: a neural network that recognizes handwritten digits between 0..9
first layer of neurons: each neuron holds a number between 0..1 - it lights up according to how much light its pixel of the input image gets
last layer of neurons: ten of them, neuron #n holds a value that reflects the certainty that digit n has been recognized,
if the third neuron is the brightest then the digit two is recognized.
In between there are the 'hidden layers'.
Each layer m gets input from the previous layer m-1 and produces output for the next layer m+1
They hope that the neurons in the intermediate layers represent some of the features of the digits that the model wants to recognize
example: with two intermediate layers: maybe the second layer recognizes edges, the third layer whole visual subcomponents - and these combine to recognize the whole digit in the output layer
a=(a1....an) - input layer vector
b=(b1....bm) - output layer vector
So there is a matrix of weights A, where w[j,i] is the weight from input neuron j to output neuron i:
A = ( w[0,0] .... w[n,0]
      ....
      w[0,m] .... w[n,m] )
where the next layer is computed by matrix multiplication:
b = sigmoid( A * a + bias_vector )
Now sigmoid(x) = 1 / (1 + e^(-x)) is a function with output range 0...1 - it is applied to each value of the result vector to normalize the values.
bias_vector = (b[0]......b[m]) - constant per-neuron offsets; each one shifts the threshold at which the output of the next-layer neuron counts as 'activated'
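A minimal sketch of this forward pass for one layer, assuming NumPy and made-up layer sizes (784 input pixels, 16 hidden neurons); the random weights and biases only stand in for learned values:

import numpy as np

def sigmoid(x):
    # squashes any real number into the range 0..1
    return 1.0 / (1.0 + np.exp(-x))

def layer_forward(A, a, bias_vector):
    # A: (m, n) weight matrix, a: (n,) activations of the previous layer,
    # bias_vector: (m,) -> returns the (m,) activations of the next layer
    return sigmoid(A @ a + bias_vector)

# toy dimensions: 784 input pixels -> 16 hidden neurons
rng = np.random.default_rng(0)
a = rng.random(784)                  # flattened 28x28 image, values in 0..1
A = rng.normal(size=(16, 784))       # stand-in for learned weights
bias_vector = rng.normal(size=16)    # stand-in for learned biases
print(layer_forward(A, a, bias_vector))   # 16 values, each between 0 and 1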
https://www.youtube.com/watch?v=IHZwWFHWa-w
Learning:
find the values of the weight matrix and bias vector for each of the layers
first step: init all the weights and biases of each of the matrices to random values.
cost_for_one_training_example(result_vector) = sum over i of (result_vector[i] - desired_result[i])^2 - the squared difference between the actual and the desired output
the average cost over all training examples is a measure of how good (or bad) the network is.
goal of supervised machine learning - minimize this cost function - by changing the weights and bias vectors of the model !!!
- when looking at the gradient vector: the absolute value of gradient[i] tells us how important weight (or bias) i is
The learning process does gradient descent to find a minimum (you don't know if that's the global or a local minimum)
- the vector space being all of the weight values and biases of all of the layers
- algorithm for computing the gradient is called backpropagation
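A minimal sketch of gradient descent on a cost function, assuming NumPy; the toy two-parameter cost and the finite-difference gradient are stand-ins (a real network has many thousands of weights and biases and computes the gradient with backpropagation):

import numpy as np

def cost(params):
    # stand-in for the network's average cost over all training examples;
    # a toy quadratic with its minimum at (3, -1)
    return (params[0] - 3.0) ** 2 + (params[1] + 1.0) ** 2

def numerical_gradient(f, params, eps=1e-6):
    # finite-difference approximation; backpropagation computes this analytically
    grad = np.zeros_like(params)
    for i in range(len(params)):
        step = np.zeros_like(params)
        step[i] = eps
        grad[i] = (f(params + step) - f(params - step)) / (2 * eps)
    return grad

params = np.array([0.0, 0.0])    # start from arbitrary (here: zero) parameters
learning_rate = 0.1
for _ in range(100):
    params -= learning_rate * numerical_gradient(cost, params)
print(params)                    # close to [3, -1], the minimum of the cost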
Problems:
- feeding in random pixels still gives a confident answer that this is some digit!
- can you visualize what is going on in the model? Does it really identify edges in the first layer and components in the second layer? No, not really.
Recommends this book on machine learning:
http://neuralnetworksanddeeplearning.com/
Michael Nielsen: Neural Networks and Deep Learning
(has lots of examples and code; free book)
https://www.youtube.com/watch?v=Ilg3gGewQ5U
Backpropagation
- For a given training example, if the error at the output is large, the weights of the current layer are changed more strongly (for a small error does it mainly adjust the previous layers?)
- no, you can't make a step that considers all training examples; instead, on each iteration they take a subset (mini-batch) of training examples chosen at random
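A small sketch of that random mini-batch loop, assuming NumPy; the data, batch size and the 'take one descent step' placeholder are made up:

import numpy as np

def minibatches(inputs, targets, batch_size, rng):
    # shuffle once per pass over the data, then yield successive mini-batches
    order = rng.permutation(len(inputs))
    for start in range(0, len(inputs), batch_size):
        idx = order[start:start + batch_size]
        yield inputs[idx], targets[idx]

rng = np.random.default_rng(0)
X = rng.random((1000, 784))            # toy training images
y = rng.integers(0, 10, size=1000)     # toy labels
for batch_X, batch_y in minibatches(X, y, batch_size=32, rng=rng):
    pass   # compute the gradient on this batch only, then take one descent step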
https://www.youtube.com/watch?v=tIeHLnjs5U8
========
Word embeddings - how words are represented in ML models
https://www.youtube.com/watch?v=gQddtTdmG_8
- you want some vector representing the word, you want similar words to have similar (close) vector values.
- word2vec - represent each word by an n-dimensional vector (n large); words that often appear in the same context get vectors with a small distance between them
https://turbomaze.github.io/word2vecjson/ - word2vec demo
- now you can do vector arithmetic, sometimes with surprising results.
woman + (king - man) = queen
germany + (london - england) = berlin
france + (london - england) = paris
australia + (london - england) = sydney
japan + (london - england) = tokyo
(doesn't work with usa)
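A toy sketch of this vector arithmetic with invented 3-dimensional 'embeddings' (real word2vec vectors are learned and have a few hundred dimensions); closeness is measured with cosine similarity:

import numpy as np

# made-up low-dimensional vectors, only to show the arithmetic
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.8, 0.1, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "queen": np.array([0.2, 0.8, 0.9]),
    "apple": np.array([0.5, 0.0, 0.0]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# woman + (king - man) should land closest to queen
target = emb["woman"] + (emb["king"] - emb["man"])
best = max((w for w in emb if w not in ("woman", "king", "man")),
           key=lambda w: cosine(emb[w], target))
print(best)   # queen (with these toy numbers)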
https://www.youtube.com/watch?v=ERibwqs9p38
Traditionally you would deal with word meaning in NLP by assigning categories,
like WordNet - get the synonyms and antonyms of a word. But you lose nuance (synonyms don't have exactly the same meaning, and there is no means to measure how close word_a is to word_b)
Vector representations:
One-hot representation: words numbered 0....n
each word gets a vector where its own index is one - all other indexes are zero.
problem: no notion of similarity between words.
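A tiny sketch of the one-hot representation and its problem, with a made-up three-word vocabulary:

# each word gets a vector with a one at its own index, zeros everywhere else
vocab = ["cat", "dog", "fish"]
one_hot = {w: [1 if j == i else 0 for j in range(len(vocab))]
           for i, w in enumerate(vocab)}
print(one_hot["dog"])   # [0, 1, 0]

# the dot product of any two different one-hot vectors is always 0,
# so this representation has no notion of similarity between words
print(sum(a * b for a, b in zip(one_hot["cat"], one_hot["dog"])))   # 0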
Distributional similarity :=
"you shall know a word by the company it keeps" -- JF Firth ;
Wittgenstein: use theory of meaning; meaning of word defined by predicting which context a word appears in.
so the goal is to have a big vector, where dot product (closeness) can define word similarity. / that's a distributed model of representation /
For a given word w[t] they want a model that predicts the probability of each word appearing in its context (the surrounding words in a sliding window of fixed size)
A model that computes:
P(context|w[t])
So they need to minimize the following:
1 - P(w[-t] | w[t]) where w[-t] is 'all the other words in the context, except the word w[t]'
Word2Vec
two ways, given a sliding context window of N words:
Skip-gram = given a center word, predict all the other words occurring in its context (around it in the sliding window)
Continuous bag of words = given all the other words in the context window, predict the center word
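A small sketch of how skip-gram vs. CBOW training pairs could be built from a sliding window; the example sentence and window size are made up:

sentence = "the quick brown fox jumps over the lazy dog".split()
window = 2   # N words of context on each side of the center word

skipgram_pairs = []   # (center word, one context word) per pair
cbow_pairs = []       # (all context words, center word)
for i, center in enumerate(sentence):
    lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
    context = [sentence[j] for j in range(lo, hi) if j != i]
    skipgram_pairs.extend((center, c) for c in context)
    cbow_pairs.append((context, center))

print(skipgram_pairs[:4])   # [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown')]
print(cbow_pairs[0])        # (['quick', 'brown'], 'the')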
---
Softmax
- given a set of values:
- take each value, apply the exponential function to it, and divide by the sum of the exponentials of all values.
from math import exp

def softmax(vec, idx):
    return exp(vec[idx]) / sum(map(exp, vec))
- the sum of the results is one - so softmax turns all of the vector values into probabilities.
- what is good about it?
- the exponential grows quickly, so the maximum-value element gets a very high probability.
- the exponential function is also differentiable, so they can do gradient descent on it.
Trying out softmax:

from math import exp

def notsoftmax(vec, i):
    return vec[i] / sum(vec)

def softmax(vec, idx):
    return exp(vec[idx]) / sum(map(exp, vec))

def softmaxall(vec):
    all_val = sum(map(exp, vec))
    return [exp(elm) / all_val for elm in vec]

def notsoftmaxall(vec):
    all_val = sum(vec)
    return [elm / all_val for elm in vec]

a = [80, 50, 2, 7, 9, 1]
res = softmaxall(a)
nres = notsoftmaxall(a)

print(f"input: {a}")
print(f"softmax: {list(res)}")
print(f"sumsoftmax: {sum(res)}")
print(f"notsoftmax: {list(nres)}")
print(f"sumnotsoftmax: {sum(nres)}")
Got:
input: [80, 50, 2, 7, 9, 1]
softmax: [0.9999999999999064, 9.357622968839299e-14, 1.3336148155021367e-34, 1.9792598779467192e-32, 1.4624862272510943e-31, 4.906094730648821e-35]
sumsoftmax: 1.0
notsoftmax: [0.5369127516778524, 0.33557046979865773, 0.013422818791946308, 0.04697986577181208, 0.06040268456375839, 0.006711409395973154]
sumnotsoftmax: 1.0000000000000002
--
Open LLM leaderboard https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
a benchmark for evaluating LLMs (the openly available ones)
Here they have some benchmarks for testing LLMs
HellaSwag "HellaSwag is a challenge dataset for evaluating commonsense NLI (natural language inference ) that is specially hard for state-of-the-art models, though its questions are trivial for humans (>95% accuracy). " https://paperswithcode.com/dataset/hellaswag
TruthfulQA - question file with expected answers "We crafted questions that some humans would answer falsely due to a false belief or misconception. To perform well, models must avoid generating false answers learned from imitating human texts"
https://arxiv.org/abs/2109.07958
https://github.com/sylinrl/TruthfulQA
https://github.com/sylinrl/TruthfulQA/blob/main/TruthfulQA.csv - the questions
ARC - on the leaderboard this is the AI2 Reasoning Challenge (grade-school science exam questions); not to be confused with Chollet's Abstraction and Reasoning Corpus:
https://arxiv.org/pdf/1911.01547.pdf
visual puzzles: "Test-takers are given a set of demonstration grid pairs. The task involves determining the size of the output grid for the test and correctly filling each cell of the grid with the appropriate color or number."
MMLU - Massive Multitask Language Understanding https://paperswithcode.com/sota/multi-task-language-understanding-on-mmlu
https://paperswithcode.com/dataset/mmlu
https://huggingface.co/datasets/Stevross/mmlu
----
DLVVU
https://www.youtube.com/watch?v=KmAISyVvE1Y
Series of videos on transformers - Lennart Svensson
https://www.youtube.com/watch?v=0SmNEp4zTpc&list=PLDw5cZwIToCvXLVY2bSqt7F2gu8y-Rqje
Transformer = input sentence -> encoder -> decoder -> output sentence
encoder: translates words to feature vectors/embeddings (a vector like what you get from word2vec)
decoder: lots of hidden layers that predict the next word (and then the next word, until the end-of-sequence token) + translation back into output text
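A toy sketch of that word-by-word decoding loop; predict_next_token is only a placeholder for the real decoder stack, and the canned output is invented:

END = "<eos>"   # end-of-sequence token

def predict_next_token(encoded_input, generated_so_far):
    # placeholder: a real decoder would run its attention layers here
    canned = ["ich", "mag", "katzen", END]
    return canned[len(generated_so_far)]

def greedy_decode(encoded_input, max_len=10):
    out = []
    while len(out) < max_len:
        token = predict_next_token(encoded_input, out)
        if token == END:
            break
        out.append(token)
    return out

print(greedy_decode(encoded_input=None))   # ['ich', 'mag', 'katzen']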
encoder and decoder use self-attention:
The 'Attention' in 'attention is all you need': https://www.youtube.com/watch?v=KmAISyVvE1Y
Simple self attention
take the feature vectors v[1] ... v[n] that stand in for all words of the input window.
Compute the matrix W, where W[i,j] = v[i]^T * v[j]   # cell W[i,j] is the dot product of the vectors v[i] and v[j]
Then apply softmax to each row of W (so that the sum of the row values is equal to one - and they behave a bit like probabilities)
Now transform each vector v[i]: out[i] = sum over j of W[i,j] * v[j]   (i.e. out = W * V)
- each of the neighboring vectors has some influence on the transformed vector.
(v^T * v is a high value - it's the sum of the squares of each dimension; v^T * w is also high if w is close to v (in terms of word2vec); if the value is high, then that neighbor has a strong influence on the output vector)
scaled self attention: W[i,j] = (v[i]^T * v[j]) / sqrt(k) ; k = input dimension. Purpose: keep the weights within a lower range, so as not to suffer from a 'vanishing gradient' upon SOFTMAX
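A minimal sketch of this simple/scaled self-attention, assuming NumPy; the word count and embedding size are toy values:

import numpy as np

def softmax_rows(W):
    # row-wise softmax: each row sums to one
    e = np.exp(W - W.max(axis=1, keepdims=True))   # subtract the row max for numerical stability
    return e / e.sum(axis=1, keepdims=True)

def simple_self_attention(V, scaled=True):
    # V: (n, k) matrix with one k-dimensional feature vector per word
    n, k = V.shape
    W = V @ V.T                  # W[i,j] = v[i]^T * v[j]
    if scaled:
        W = W / np.sqrt(k)       # keep the weights in a lower range before the softmax
    W = softmax_rows(W)
    return W @ V                 # out[i] = sum over j of W[i,j] * v[j]

rng = np.random.default_rng(0)
V = rng.normal(size=(5, 8))      # 5 words, 8-dimensional feature vectors (toy sizes)
print(simple_self_attention(V).shape)   # (5, 8)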
--
Problem with this: it doesn't take care of the sentence structure: sentences have head words and dependents - these are not linear structures.
'This restaurant is not too terrible'
not -> too -> terrible
restaurant -------------->
'Emma hates games but she is a great friend'
she -------> friend : we would like the transformed vector for 'friend' to be influenced by 'she' (and 'Emma')
The fix: more complicated computation!
compute query vectors for each word in the sentence (W-query and W-keys are learned matrices shared across sentences; each word gets its own query and key vector):
q[i] = W-query v[i]
compute key vectors
k[i] = W-keys v[i]
Multiply resulting query vectors by key vectors to get the weights
w[i,j] = q[i]^T * k[j]
This gets us, for each word i, a weight vector
W[i] = [ w[i,0] .... w[i,n] ]
Then apply SOFTMAX over W[i] | NOTE: each word in the sentence now gets its own weight vector W[i] !!!
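A minimal sketch of this query/key weighting, assuming NumPy; W_query and W_key are randomly initialized stand-ins for learned parameters, and value projections / multiple heads are left out:

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def query_key_attention(V, W_query, W_key):
    # V: (n, k) word vectors; W_query, W_key: (d, k) learned projection matrices
    Q = V @ W_query.T                # q[i] = W_query * v[i]
    K = V @ W_key.T                  # k[i] = W_key  * v[i]
    n, d = Q.shape
    out = np.zeros_like(V)
    for i in range(n):
        w = np.array([Q[i] @ K[j] for j in range(n)])   # w[i,j] = q[i]^T * k[j]
        w = softmax(w / np.sqrt(d))                      # each word gets its own weight vector W[i]
        out[i] = w @ V                                   # weighted mix of the word vectors
    return out

rng = np.random.default_rng(0)
V = rng.normal(size=(6, 8))              # 6 words, 8-dimensional feature vectors (toy sizes)
W_query = rng.normal(size=(4, 8))        # stand-ins for learned matrices
W_key = rng.normal(size=(4, 8))
print(query_key_attention(V, W_query, W_key).shape)   # (6, 8)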