\documentclass[12pt]{amsart}
\usepackage{geometry} % see geometry.pdf on how to lay out the page. There's lots.
\usepackage{epsfig}
\usepackage{graphicx}
\usepackage{subfig}
\geometry{a4paper} % or letter or a5paper or ... etc
% \geometry{landscape} % rotated page geometry
% See the ``Article customise'' template for some common customisations
\title{ISYE 6663 Project Report}
\author{Brant Callaway, John Rogers}
\date{} % delete this line to display the current date
%%% BEGIN DOCUMENT
\begin{document}
\maketitle
%\tableofcontents
\section{Project Implementation}
For our project, we implemented the Gradient Descent method, two versions of the Conjugate Gradient method (Fletcher-Reeves and Polak-Ribiere), the quasi-Newton BFGS method, and the limited memory BFGS (L-BFGS) method. In order to test our algorithms, we implemented four functions with known properties: the Rosenbrock function, the Trigonometric function, the Wood function, and a large quadratic function. Exact line search was used with the quadratic problem, and a strong Wolfe-Powell inexact line search was used on all of the problems.
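For reference, the strong Wolfe-Powell conditions that the inexact line search enforces on the step length $\alpha_k$ along a search direction $p_k$ are sketched below in standard notation; the constants $c_1$ and $c_2$ are our notation here, and values such as $c_1 = 10^{-4}$ are illustrative rather than a record of the exact parameters used in our code:
\begin{align*}
f(x_k + \alpha_k p_k) &\le f(x_k) + c_1 \alpha_k \nabla f(x_k)^T p_k,\\
\left| \nabla f(x_k + \alpha_k p_k)^T p_k \right| &\le c_2 \left| \nabla f(x_k)^T p_k \right|,
\end{align*}
with $0 < c_1 < c_2 < 1$. The first condition guarantees sufficient decrease, while the second keeps the slope at the new point small in magnitude, which is what keeps the quasi-Newton curvature updates well defined.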
\subsection{Optimization Routine Implementation}
Both of the conjugate gradient methods (CGFR and CGPR) often did not converge to sufficient precision on some of the harder nonlinear problems. This is due to round-off errors from finite-precision computer arithmetic, as well as the fact that, for nonlinear problems, the local quadratic approximation of the function changes as we move from one step to the next. Each of the search directions explored by CG is Q-conjugate to the prior search directions, but since we use an inexact line search procedure, a single search along a direction may not fully minimize the function along it. To combat this shortcoming, we reset to a gradient descent step whenever two consecutive gradients are not sufficiently orthogonal, as suggested in the textbook. This effectively restarts the conjugate gradient method and allows it to minimize along all search directions again.
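To make the two variants and the restart test concrete, the update formulas are sketched here in standard notation with $g_k = \nabla f(x_k)$; the restart threshold $\nu$ is our notation, and the commonly quoted value $\nu = 0.1$ is only illustrative:
\begin{equation*}
\beta_{k+1}^{FR} = \frac{g_{k+1}^T g_{k+1}}{g_k^T g_k}, \qquad
\beta_{k+1}^{PR} = \frac{g_{k+1}^T \left(g_{k+1} - g_k\right)}{g_k^T g_k}, \qquad
p_{k+1} = -g_{k+1} + \beta_{k+1} p_k,
\end{equation*}
with a restart $p_{k+1} = -g_{k+1}$ whenever consecutive gradients fail the orthogonality test, that is, whenever $|g_{k+1}^T g_k| \ge \nu \, \|g_{k+1}\|^2$.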
The L-BFGS quasi-Newton optimization algorithm was implemented with a variable amount of memory storage. Our original implementation had a fixed history of 5, but we chose to make the history size variable in order to analyze how it affects the algorithm's performance. Our expectation is that the optimal history size is problem dependent: generally, the number of iterations required to find a solution decreases with increasing history size, while the amount of storage (and hence the cost of each iteration) increases. Whichever of these effects is stronger determines the optimal history size, and this depends on the function being minimized.
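As a sketch (in standard notation, not a transcription of our code) of what the history size controls: full BFGS maintains an approximation $H_k$ to the inverse Hessian through the update
\begin{equation*}
H_{k+1} = \left(I - \rho_k s_k y_k^T\right) H_k \left(I - \rho_k y_k s_k^T\right) + \rho_k s_k s_k^T,
\qquad \rho_k = \frac{1}{y_k^T s_k},
\end{equation*}
where $s_k = x_{k+1} - x_k$ and $y_k = g_{k+1} - g_k$. L-BFGS never forms $H_k$ explicitly; it stores only the last $m$ pairs $(s_i, y_i)$ and reconstructs the product of $H_k$ with the gradient via the two-loop recursion, so storage and per-iteration work grow roughly like $O(mn)$ rather than the $O(n^2)$ of full BFGS. This is the cost side of the tradeoff described above.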
\subsection{Test Function Implementation}
Another issue that we had was in testing the Trigonometric function. We had serious concerns that we had implemented the function incorrectly or made an error in the gradient. Eventually, we found that between the starting point given in the notes ($x_o = (1/n, 1/n,\hdots,1/n)$) and the point the notes give as the correct solution ($x^* = (0,0,\hdots,0)$), there is a local minimum. To illustrate the point, the trigonometric function is graphed with $n=1$ in Figure~\ref{fig:2d-trig} and with $n=2$ in Figure~\ref{fig:3d-trig}. For $n=1$, there is a local minimum near $x = 0.927$, which is where each of our algorithms terminates (and notice that this point lies between $x^* = 0$ and $x_o = 1/1 = 1$). Similarly, for $n=2$, although it is slightly more difficult to see in the graph, we find a local minimum at $x = (0.243, 0.613)$ rather than at $x^* = (0,0)$. In our results section, we present the convergence to this local minimum as well as the convergence to the true global minimum from a different starting point.
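For reference, the form of the Trigonometric function we implemented is sketched below, assuming the standard Mor\'e, Garbow, and Hillstrom definition (which we believe is the form given on the test problems sheet):
\begin{equation*}
f(x) = \sum_{i=1}^{n} f_i(x)^2, \qquad
f_i(x) = n - \sum_{j=1}^{n} \cos x_j + i \left(1 - \cos x_i\right) - \sin x_i,
\end{equation*}
which evaluates to zero at $x^* = (0,0,\hdots,0)$.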
We also implemented the Wood function from the test problems sheet, and we ran on the Rosenbrock function that was provided. In addition, we prepared a large quadratic function, primarily to compare exact line search performance against inexact line search. The Rosenbrock function was tested at multiple sizes up to 20 dimensions, as was the Trigonometric function.
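For completeness, sketches of the other two nonlinear test functions follow, written in their standard forms; the Rosenbrock code was provided to us, so the extended (pairwise) form shown here is our assumption about what it computes:
\begin{equation*}
f_{\mathrm{Rosen}}(x) = \sum_{i=1}^{n/2} \left[ 100 \left(x_{2i} - x_{2i-1}^2\right)^2 + \left(1 - x_{2i-1}\right)^2 \right],
\end{equation*}
\begin{align*}
f_{\mathrm{Wood}}(x) ={} & 100\left(x_2 - x_1^2\right)^2 + \left(1 - x_1\right)^2 + 90\left(x_4 - x_3^2\right)^2 + \left(1 - x_3\right)^2 \\
& + 10.1\left[\left(x_2 - 1\right)^2 + \left(x_4 - 1\right)^2\right] + 19.8\left(x_2 - 1\right)\left(x_4 - 1\right).
\end{align*}
Both functions attain their global minimum value of zero at $x^* = (1,1,\hdots,1)$.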
\begin{figure}[thpb]
\centering
\includegraphics[scale=0.60]{images/2dtrig.jpg}
\caption{The trigonometric function, shown in 1 dimension. The optimization routine gets stuck in a local minimum.}
\label{fig:2d-trig}
\end{figure}
\begin{figure}[thpb]
\centering
\includegraphics[scale=0.60]{images/3dtrig.jpg}
\caption{The trigonometric function, shown with 2 dimensions. The optimization routine gets stuck in a local minimum.}
\label{fig:3d-trig}
\end{figure}
\section{Results}
The number of iterations and the CPU time (in seconds) are compared between the tested algorithms on each of the test problems. In addition, the number of restarts is reported for the CG methods. The gradient descent algorithm is usually only able to reach a lower precision than the other routines because of its inefficiency on poorly conditioned problems. All graphs are presented on a logarithmic scale to better illustrate the progress of the algorithms.
\subsection{Quadratic Results}
The optimization routines were first tried on a quadratic problem of the form $\min_x \; x^T A x - b^T x$. The matrix $A$ was generated with the rand function in Matlab and made positive definite by computing $A^T A$; the vector $b$ was also generated with rand. The results for exact line search can be seen in Table \ref{table_exact_quad}. Comparing these results to Table \ref{table_inexact_quad}, we see that the gradient descent routine reaches the desired precision in exactly the same number of iterations with either line search, although the exact line search is much quicker in CPU time. Further inspection led us to realize that the inexact line search routine was generating exactly the same $\alpha_k$ as the exact line search routine. The difference in CPU time is due to the simplicity of the exact line search; gradient descent spends a lot of CPU cycles on the inexact line search. A few other notable differences are that the conjugate gradient methods terminate in about 20 iterations and require no restarts when using exact line search, whereas with inexact line search more iterations are necessary and restarts occur multiple times. Our implementation of the strong Wolfe-Powell search appears to reproduce the exact line search result when the search direction is parallel to the gradient; since the CG methods do not use gradient directions, the two line search routines generate different $\alpha_k$ for them. The BFGS routine in figure~\ref{fig:BFGS-exact} terminates in 21 iterations, which makes sense because after the first 20 iterations the Hessian approximation is exact and the final step is a Newton step; on a quadratic function this reaches the solution in one more iteration. This is slightly faster than the result with inexact line search in figure~\ref{fig:BFGS-inexact}.
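As a sketch of why the two line searches can coincide for gradient descent: since the constructed $A$ is symmetric, the objective can be written as $f(x) = \tfrac{1}{2} x^T Q x - b^T x$ with $Q = 2A$, and the exact minimizer of $f$ along a descent direction $p_k$ from $x_k$ has the closed form
\begin{equation*}
\alpha_k = -\frac{\nabla f(x_k)^T p_k}{p_k^T Q p_k},
\end{equation*}
which for gradient descent ($p_k = -g_k$) reduces to $\alpha_k = g_k^T g_k / (g_k^T Q g_k)$. Whenever the Wolfe-Powell routine happens to return this same step, the two searches produce identical iterates, which matches what we observed for gradient descent.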
\begin{figure}[thpb]
\centering
\includegraphics[scale=0.80,clip=true,viewport=1in 3in 8in 8in]{images/quad-exact/BFGS.pdf}
\caption{BFGS with exact line search, completes in 21 iterations.}
\label{fig:BFGS-exact}
\end{figure}
\begin{figure}[thpb]
\centering
\includegraphics[scale=0.80,clip=true,viewport=1in 3in 8in 8in]{images/quad-inexact/BFGS.pdf}
\caption{BFGS with inexact line search, completes in 24 iterations.}
\label{fig:BFGS-inexact}
\end{figure}
\begin{table}
\caption{Experimental Results - Quadratic Function, exact line search, 20 dimensional, $\eta=1e-4$}
\label{table_exact_quad}
\begin{center}
\begin{tabular}{|c||c||c||c||c|}
\hline
Algorithm & Iterations & CPU-Time & Restarts & Notes\\
\hline
GD & 61368 & 15.776 & 0 & \\
\hline
CG-FR & 26 & 0.110 & 0 & \\
\hline
CG-PR & 26 & 0.078 & 0 & \\
\hline
BFGS & 21 & 0.081& 0 & \\
\hline
L-BFGS-5 & 496 & 0.427 & 0 &\\
\hline
L-BFGS-7 & 392 & 0.537 & 0 &\\
\hline
L-BFGS-9 & 284 & 0.482 & 0 &\\
\hline
L-BFGS-11 & 321 & 0.804 & 0 &\\
\hline
\end{tabular}
\end{center}
\end{table}
\begin{table}
\caption{Experimental Results - Quadratic Function, inexact line search, 20 dimensional, $\eta=1e-4$}
\label{table_inexact_quad}
\begin{center}
\begin{tabular}{|c||c||c||c||c|}
\hline
Algorithm & Iterations & CPU-Time & Restarts & Notes\\
\hline
GD & 61368 & 41.357 & 0 & \\
\hline
CG-FR & 77 & 0.157 & 5 & \\
\hline
CG-PR & 202 & 0.238 & 14 & \\
\hline
BFGS & 24 & 0.140& 0 & \\
\hline
L-BFGS-5 & 310 & 0.490 & 0 &\\
\hline
L-BFGS-7 & 330 & 0.648 & 0 &\\
\hline
L-BFGS-9 & 413 & 0.899 & 0 &\\
\hline
L-BFGS-11 & 268 & 0.771& 0 &\\
\hline
\end{tabular}
\end{center}
\end{table}
\subsection{Trigonometric Results}
The Trigonometric function from the test problems sheet presented us with more difficulty than the other functions. From the starting point specified on the problem sheet, all of the optimization routines become stuck at a suboptimal local minimum, as shown in Figures \ref{fig:2d-trig} and \ref{fig:3d-trig}. With $\eta=1e-8$, the algorithms become stuck there fairly quickly, as can be seen in Table \ref{results_trig_stuck}. Using a different starting point closer to the global optimum, we achieved the rapid convergence shown in Table \ref{results_trig_done}.
\begin{table}
\caption{Experimental Results - Trigonometric Function 10 dimensional, 1/n starting point, $\eta=1e-8$}
\label{results_trig_stuck}
\begin{center}
\begin{tabular}{|c||c||c||c||c|}
\hline
Algorithm & Iterations & CPU-Time & Restarts & Notes\\
\hline
GD & 167 & 0.087 & 0 & Stuck\\
\hline
CG-FR & 54 & 0.150 & 15 & Stuck\\
\hline
CG-PR & 49 & 0.152 & 17 & Stuck\\
\hline
BFGS & 24 & 0.139 & 0 & Stuck\\
\hline
L-BFGS-5 & 59 & 0.181 & 0 & Stuck\\
\hline
L-BFGS-7 & 66 & 0.196 & 0 & Stuck\\
\hline
L-BFGS-9 & 77 & 0.24 & 0 & Stuck \\
\hline
L-BFGS-11 & 80 & 0.274 & 0 & Stuck\\
\hline
\end{tabular}
\end{center}
\end{table}
\begin{table}
\caption{Experimental Results - Trigonometric Function, 40 dimensional, starting point 1/(10n), $\eta=1e-12$}
\label{results_trig_done}
\begin{center}
\begin{tabular}{|c||c||c||c||c|}
\hline
Algorithm & Iterations & CPU-Time & Restarts & Notes\\
\hline
GD & 6 & 0.0144 & 0 & \\
\hline
CG-FR & 5 & 0.0492 & 3 & \\
\hline
CG-PR & 5 & 0.048 & 3 & \\
\hline
BFGS & 11 & 0.682 & 0 & \\
\hline
L-BFGS-5 & 14 & 0.0896 & 0 &\\
\hline
L-BFGS-7 & 12 & 0.089 & 0 &\\
\hline
L-BFGS-9 & 13 & 0.117 & 0 &\\
\hline
L-BFGS-11 & 13 & 0.129 & 0 &\\
\hline
\end{tabular}
\end{center}
\end{table}
\subsection{Wood Results}
The Wood function was a good test for the Gradient Descent algorithm because it was able to converge to the same $\eta$ value as the more advanced optimization routines in a reasonable amount of time. Table \ref{results_table_wood} shows the results: BFGS is clearly superior, and the limited memory L-BFGS also works well. Both of the CG routines must restart very often due to the low dimensionality of this problem.
\begin{table}
\caption{Experimental Results - Wood Function, $\eta=1e-10$}
\label{results_table_wood}
\begin{center}
\begin{tabular}{|c||c||c||c||c|}
\hline
Algorithm & Iterations & CPU-Time & Restarts & Notes\\
\hline
GD & 6707 & 1.64 & 0 & \\
\hline
CG-FR & 115 & 0.163 & 43 & \\
\hline
CG-PR & 139 & 0.172 & 55 & \\
\hline
BFGS & 62 & 0.165 & 0 & \\
\hline
L-BFGS-5 & 111 & 0.202 & 0 &\\
\hline
L-BFGS-7 & 141 & 0.232 & 0 &\\
\hline
L-BFGS-9 & 152 & 0.271 & 0 &\\
\hline
L-BFGS-11 & 170 & 0.312 & 0 &\\
\hline
\end{tabular}
\end{center}
\end{table}
\subsection{Rosenbrock Results}
The Rosenbrock function was tested at multiple dimensions up to 20 to measure the performance of the optimization routines as the problem grows large. The gradient descent algorithm had a lot of trouble with this problem and could only achieve a gradient norm of $1e-2$ in a reasonable amount of time. The BFGS algorithm is consistently superior to the other routines, and L-BFGS with a small memory is also very good. The CG routines restart approximately every $n$ iterations, where $n$ is the dimensionality of the problem. Table \ref{results_rosenbrock_6} shows the results for the 6 dimensional problem, Table \ref{results_rosenbrock_8} for 8 dimensions, Table \ref{results_rosenbrock_10} for 10 dimensions, and Table \ref{results_rosenbrock_20} for 20 dimensions.
Graphical results showing the logarithm of the gradient norm at each iteration for the 20 dimensional Rosenbrock function can be seen in Figures~\ref{fig:GD-rosen}, \ref{fig:CGFR-rosen}, \ref{fig:CGPR-rosen}, \ref{fig:BGFS-rosen}, \ref{fig:LBFGS5-rosen}, \ref{fig:LBFGS7-rosen}, \ref{fig:LBFGS9-rosen}, and~\ref{fig:LBFGS11-rosen}.
\begin{figure}[thpb]
\centering
\includegraphics[scale=0.60,clip=true,viewport=1in 3in 8in 8in]{images/rosenbrock-20/GD.pdf}
\caption{Gradient Descent}
\label{fig:GD-rosen}
\end{figure}
\begin{figure}[thpb]
\centering
\includegraphics[scale=0.60,clip=true,viewport=1in 3in 8in 8in]{images/rosenbrock-20/CGFR.pdf}
\caption{Conjugate Gradient (Fletcher-Reeves)}
\label{fig:CGFR-rosen}
\end{figure}
\begin{figure}[thpb]
\centering
\includegraphics[scale=0.60,clip=true,viewport=1in 3in 8in 8in]{images/rosenbrock-20/CGPR.pdf}
\caption{Conjugate Gradient (Polak-Ribiere)}
\label{fig:CGPR-rosen}
\end{figure}
\begin{figure}[thpb]
\centering
\includegraphics[scale=0.80,clip=true,viewport=1in 3in 8in 8in]{images/rosenbrock-20/BFGS.pdf}
\caption{BFGS}
\label{fig:BGFS-rosen}
\end{figure}
\begin{figure}[thpb]
\centering
\includegraphics[scale=0.60]{images/rosenbrock-20/LBFGS5.pdf}
\caption{L-BFGS-5}
\label{fig:LBFGS5-rosen}
\end{figure}
\begin{figure}[thpb]
\centering
\includegraphics[scale=0.60]{images/rosenbrock-20/LBFGS7.pdf}
\caption{L-BFGS-7}
\label{fig:LBFGS7-rosen}
\end{figure}
\begin{figure}[thpb]
\centering
\includegraphics[scale=0.60,clip=true,viewport=1in 3in 8in 8in]{images/rosenbrock-20/LBFGS9.pdf}
\caption{L-BFGS-9}
\label{fig:LBFGS9-rosen}
\end{figure}
\begin{figure}[thpb]
\centering
\includegraphics[scale=0.60,clip=true,viewport=1in 3in 8in 8in]{images/rosenbrock-20/LBFGS11.pdf}
\caption{L-BFGS-11}
\label{fig:LBFGS11-rosen}
\end{figure}
\begin{table}
\caption{Experimental Results - Rosenbrock Function 6 dimensional, $\eta=1e-10$}
\label{results_rosenbrock_6}
\begin{center}
\begin{tabular}{|c||c||c||c||c|}
\hline
Algorithm & Iterations & CPU-Time & Restarts & Notes\\
\hline
GD & $>14718$ & $>4$ & 0 & $\eta=1e-2$\\
\hline
CG-FR & 371 & 0.157 & 65 & \\
\hline
CG-PR & 387 & 0.231 & 60 & \\
\hline
BFGS & 116 & 0.172 & 0 & \\
\hline
L-BFGS-5 & 266 & 0.287 & 0 &\\
\hline
L-BFGS-7 & 327 & 0.343 & 0 &\\
\hline
L-BFGS-9 & 372 & 0.413 & 0 &\\
\hline
L-BFGS-11 & 449 & 0.561 & 0 &\\
\hline
\end{tabular}
\end{center}
\end{table}
\begin{table}
\caption{Experimental Results - Rosenbrock Function 8 dimensional, $\eta=1e-10$}
\label{results_rosenbrock_8}
\begin{center}
\begin{tabular}{|c||c||c||c||c|}
\hline
Algorithm & Iterations & CPU-Time & Restarts & Notes\\
\hline
GD & $>14508$ & $>3.86$& 0 & $\eta=1e-2$ \\
\hline
CG-FR & 1120 & 0.353 & 146 & \\
\hline
CG-PR & 1186 & 0.465 & 150 & \\
\hline
BFGS & 258 & 0.245 & 0 & \\
\hline
L-BFGS-5 & 572 & 0.444 & 0 &\\
\hline
L-BFGS-7 & 762 & 0.675 & 0 &\\
\hline
L-BFGS-9 & 992 & 0.987 & 0 &\\
\hline
L-BFGS-11 & 1062 & 1.24 & 0 &\\
\hline
\end{tabular}
\end{center}
\end{table}
\begin{table}
\caption{Experimental Results - Rosenbrock Function 10 dimensional, $\eta=1e-10$}
\label{results_rosenbrock_10}
\begin{center}
\begin{tabular}{|c||c||c||c||c|}
\hline
Algorithm & Iterations & CPU-Time & Restarts & Notes\\
\hline
GD & $>14508$ & $>3.875$ & 0 & $\eta=1e-2$ \\
\hline
CG-FR & 3543 & 1.031 & 358 & \\
\hline
CG-PR & 3465 & 1.105 & 384 & \\
\hline
BFGS & 589 & 0.373 & 0 & \\
\hline
L-BFGS-5 & 1286 & 0.922 & 0 &\\
\hline
L-BFGS-7 & 1888 & 1.65 & 0 &\\
\hline
L-BFGS-9 & 2250 & 2.458 & 0 &\\
\hline
L-BFGS-11 & 3383 & 4.59 & 0 &\\
\hline
\end{tabular}
\end{center}
\end{table}
\begin{table}
\caption{Experimental Results - Rosenbrock Function 20 dimensional, $\eta=1e-5$}
\label{results_rosenbrock_20}
\begin{center}
\begin{tabular}{|c||c||c||c||c|}
\hline
Algorithm & Iterations & CPU-Time & Restarts & Notes\\
\hline
GD & $>14508$ & $>3.9$ & 0 & $\eta=1e-2$\\
\hline
CG-FR & 2522 & 0.818 & 243 & \\
\hline
CG-PR & 2713 & 0.887 & 262 & \\
\hline
BFGS & 2184 & 1.18 & 0 & \\
\hline
L-BFGS-5 & 10124 & 8.42 & 0 &\\
\hline
L-BFGS-7 & 10625 & 12.8 & 0 &\\
\hline
L-BFGS-9 & 14011 & 23.4 & 0 &\\
\hline
L-BFGS-11 & 14030 & 32.6 & 0 &\\
\hline
\end{tabular}
\end{center}
\end{table}
\section{Conclusion}
The main thing that we learned in working on this project is that there are tradeoffs between almost any pair of the methods. Several examples follow.
The Gradient Descent method was the easiest of the methods to implement and is intuitively easy to understand. However, as discussed frequently in class, the method suffers from severe performance issues on ill-conditioned problems. In fact, the Gradient Descent algorithm was usually too slow to terminate at all at the same convergence threshold as the more sophisticated optimization routines that we implemented. By comparison, on the 20 dimensional Rosenbrock function, the BFGS routine was able to achieve eight more decimal places of accuracy than GD with far less computation.
Similarly, before we implemented restarts in the CG methods, the Polak-Ribiere Conjugate Gradient method performed much better than the Fletcher-Reeves method, which we expected. When we implemented restarts, both methods improved; however, CGFR improved so much more that it now appears to be more effective than CGPR on our test problems.
Finally, we found that the BFGS method took the fewest iterations to converge. However, it was not always the fastest: the limited memory BFGS method took more iterations to complete, but was sometimes faster than BFGS in CPU time. Further, the performance of the limited memory BFGS method depended strongly on the number of vectors it remembers. Although we expected the number of iterations to decrease as the history size increases, this was not the case for most of the functions we tested. The L-BFGS routine typically performs best with around 5 vectors, except on the quadratic function, where up to 20 vectors would probably be beneficial. In fact, for the Wood and Rosenbrock functions the smallest history ($m=5$) took the fewest iterations of all the memory sizes we tried, and for all of the functions the smallest history ($m=5$) had the best performance in terms of CPU time. Thus, it seems that, at least at the problem sizes we were able to test, the cost of a larger memory had a stronger effect on performance than the benefit of a longer vector history. We attribute this to the changing landscape of the optimization problem and the diminishing usefulness of old curvature information in hard nonlinear optimization problems.
\end{document}