<!DOCTYPE html>
<html lang="en">
<link href='https://fonts.googleapis.com/css?family=Raleway' rel='stylesheet'>
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Bayesian Optimization, Part 1: Key Ideas, Gaussian Processes</title>
<meta name="description" content="">
<link href="https://fonts.googleapis.com/css?family=EB+Garamond" rel="stylesheet">
<link href="https://fonts.googleapis.com/css?family=Alegreya" rel="stylesheet">
<link href="https://fonts.googleapis.com/css?family=Libre+Baskerville" rel="stylesheet">
<link href="https://fonts.googleapis.com/css?family=Raleway:300" rel="stylesheet" type='text/css'>
<link href="https://fonts.googleapis.com/css?family=PT+Serif|Work+Sans:300" rel="stylesheet">
<link rel="preconnect" href="https://fonts.googleapis.com">
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
<link href="https://fonts.googleapis.com/css2?family=EB+Garamond:ital,wght@0,400;0,500;0,800;1,800&family=Raleway:wght@100&display=swap" rel="stylesheet">
<link href="https://fonts.googleapis.com/css2?family=Libre+Baskerville&display=swap" rel="stylesheet">
<link rel="preconnect" href="https://fonts.googleapis.com">
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
<link href="https://fonts.googleapis.com/css2?family=Zilla+Slab:wght@400&display=swap" rel="stylesheet">
<link rel="stylesheet" href="//maxcdn.bootstrapcdn.com/font-awesome/4.3.0/css/font-awesome.min.css">
<link rel="stylesheet" href="/assets/main.css">
<link rel="canonical" href="https://blog.quipu-strands.com/bayesopt_1_key_ideas_GPs">
<link rel="alternate" type="application/rss+xml" title="A Not-So Primordial Soup" href="/feed.xml">
<script data-goatcounter="https://abhgh.goatcounter.com/count"
async src="//gc.zgo.at/count.js"></script>
</head>
<body>
<header class="site-header" role="banner">
<div class="wrapper">
<a class="site-title" href="/">A Not-So Primordial Soup</a>
<nav class="site-nav">
<input type="checkbox" id="nav-trigger" class="nav-trigger" />
<label for="nav-trigger">
<span class="menu-icon">
<svg viewBox="0 0 18 15" width="18px" height="15px">
<path fill="#424242" d="M18,1.484c0,0.82-0.665,1.484-1.484,1.484H1.484C0.665,2.969,0,2.304,0,1.484l0,0C0,0.665,0.665,0,1.484,0 h15.031C17.335,0,18,0.665,18,1.484L18,1.484z"/>
<path fill="#424242" d="M18,7.516C18,8.335,17.335,9,16.516,9H1.484C0.665,9,0,8.335,0,7.516l0,0c0-0.82,0.665-1.484,1.484-1.484 h15.031C17.335,6.031,18,6.696,18,7.516L18,7.516z"/>
<path fill="#424242" d="M18,13.516C18,14.335,17.335,15,16.516,15H1.484C0.665,15,0,14.335,0,13.516l0,0 c0-0.82,0.665-1.484,1.484-1.484h15.031C17.335,12.031,18,12.696,18,13.516L18,13.516z"/>
</svg>
</span>
</label>
<div class="trigger">
<a class="page-link" href="/about/">About</a>
<a class="page-link" href="/Terms/">Terms</a>
</div>
</nav>
</div>
</header>
<main class="page-content" aria-label="Content">
<div class="wrapper">
<article class="post" itemscope itemtype="http://schema.org/BlogPosting">
<header class="post-header">
<h1 class="post-title" itemprop="name headline">Bayesian Optimization, Part 1: Key Ideas, Gaussian Processes</h1>
<p class="post-meta">
<time datetime="2023-11-18T11:00:00-08:00" itemprop="datePublished">
Created: Nov 18, 2023. Last major update:
Jul 1, 2024.
</time>
</p>
</header>
<div class="post-content" itemprop="articleBody">
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
"HTML-CSS": { scale: 100, linebreaks: { automatic: true } },
SVG: { linebreaks: { automatic:true } },
displayAlign: "center" });
</script>
<script type="text/javascript" async="" src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML">
</script>
<!-- _includes/image.html -->
<div class="image-wrapper">
<img src="/assets/bayesopt/bayesopt_cover.png" alt="test" />
<p class="image-caption">The real reason I like Bayesian Optimization: lots of pretty pictures!</p>
</div>
<p>If I wanted to sell you on the idea of <em>Bayesian Optimization (BayesOpt)</em>, I’d just list some of its applications:</p>
<ul>
<li>Hyperparameter Optimization (HPO) <a class="citation" href="#bayesopt_is_superior">(Turner et al., 2021)</a>.</li>
<li>Neural Architecture Search (NAS) <a class="citation" href="#white2019bananas">(White et al., 2021)</a>.</li>
<li>Molecule discovery <a class="citation" href="#LSBO_paper">(Gómez-Bombarelli et al., 2018)</a>.</li>
<li>Liquid chromatography <a class="citation" href="#BOELRIJK2023340789">(Boelrijk et al., 2023)</a>.</li>
<li>Creating low-carbon concrete <a class="citation" href="#ament2023sustainable">(Ament et al., 2023)</a>.</li>
<li>Plasma control in nuclear fusion <a class="citation" href="#mehta2022an">(Mehta et al., 2022)</a>.</li>
<li>Parameter tuning for lasers <a class="citation" href="#20.500.11850/385955">(Kirschner et al., 2019-11)</a></li>
</ul>
<p>My own usage has been tamer, but also varied: I’ve used it to (a) learn distributions <a class="citation" href="#frontiers_density_tree_nopdf">(Ghose & Ravindran, 2020; Ghose & Ravindran, 2019)</a>, and (b) tune SHAP model explanations to identify informative training instances <a class="citation" href="#XAI_human_in_the_loop">(Nguyen & Ghose, 2023)</a>.</p>
<p>Why is BayesOpt everywhere? Because the problems it solves are everywhere. It provides tools to optimize a <em>noisy and expensive black-box</em> function: given an input, you can only query your function for an output, and you don’t have any other information such as its gradients. Sometimes you mightn’t even have a mathematical form for it. Since it’s expensive, you can’t query it a lot. Plus, function evaluations might be noisy, i.e., the same input might produce slightly different outputs. The lack of function information is in contrast to the plethora of Gradient Descent variants that seem to be fueling the Deep Learning revolution. But if you think about it, this is a very practical problem. For example, say you want to optimize various small and large processes in a factory to increase throughput - what does a gradient even mean here?</p>
<p>I encountered the area in 2018, when, as part of my Ph.D., I’d run out of ways to maximize a certain function. And the area has only grown since then. If you’re interested, I’ve an introductory write-up of BayesOpt in my <a href="https://drive.google.com/file/d/1zf_MIWyLY7nxEr5UioUQ7KhOQ1_clYYl/view?usp=drive_link">dissertation</a> (Section 2.1) too, and Section 2.3 there talks about the general problem I wanted to solve. Much of the material here though comes from a talk I gave a while ago (<a href="https://drive.google.com/file/d/1AFs1IqJT6qUi50yZ5XO_uHfgUVEBn2pz/view?usp=drive_link">slides</a>).</p>
<p>OK, let’s get started. This is <strong>part-1 of a two-part series</strong>, where I intend to provide a detailed introduction to BayesOpt. I also introduce <em>Gaussian Processes</em>, and if you’re here just for that you can jump straight to <a href="#gaussian_processes">this section</a>. Here’s a layout for this part:</p>
<ul id="markdown-toc">
<li><a href="#the-problem" id="markdown-toc-the-problem">The Problem</a></li>
<li><a href="#key-ideas" id="markdown-toc-key-ideas">Key Ideas</a></li>
<li><a href="#comparison-to-gradient-descent-gd" id="markdown-toc-comparison-to-gradient-descent-gd">Comparison to Gradient Descent (GD)</a></li>
<li><a href="#gaussian_processes" id="markdown-toc-gaussian_processes">Gaussian Processes (GP)</a> <ul>
<li><a href="#intuition" id="markdown-toc-intuition">Intuition</a></li>
<li><a href="#deep-dive" id="markdown-toc-deep-dive">Deep Dive</a></li>
<li><a href="#connection-with-bayesian-linear-regression" id="markdown-toc-connection-with-bayesian-linear-regression">Connection with Bayesian Linear Regression</a></li>
<li><a href="#minimal-code-for-gp" id="markdown-toc-minimal-code-for-gp">Minimal Code for GP</a></li>
<li><a href="#summary-and-challenges" id="markdown-toc-summary-and-challenges">Summary and Challenges</a></li>
</ul>
</li>
<li><a href="#references" id="markdown-toc-references">References</a></li>
</ul>
<p>In part-2, I’ll talk about <em>acquisition functions</em>, learning resources, etc.
As you can probably tell, this is going to be a long-ish series; settle in with ample amounts of coffee!</p>
<h2 id="the-problem">The Problem</h2>
<p>At its heart, the family of BayesOpt techniques still strives to solve an optimization problem:</p>
\[\require{color}
\begin{aligned}
\arg \max_{x \in \mathcal{X}} f(x)
\end{aligned}\]
<p>Here, we are maximizing (or minimizing) wrt \(x\) whose values range over a domain \(\mathcal{X}\). Reiterating what we said before, BayesOpt is useful when \(f(x)\) possesses one or more of these properties:</p>
<ol>
<li>Its mathematical expression is unavailable to us or it is not differentiable.</li>
<li>Evaluations are noisy, i.e., for the same input \(x\) you might obtain slightly different values of \(f(x)\).</li>
<li>It is expensive to evaluate.</li>
</ol>
<p>These are very practical challenges. To ground them, I’ll use the (personally appealing) example of baking a perfect chocolate chip cookie <a class="citation" href="#bayesopt_cookie">(Solnik et al., 2017)</a>.</p>
<!-- _includes/image.html -->
<div class="image-wrapper">
<img src="/assets/bayesopt/chocolate_chip_cookie.png" alt="test" />
<p class="image-caption">This is a good chocolate chip cookie! Barefoot, Campbell, CA if you're in the area!</p>
</div>
<p>The problem is: <em>find a recipe, i.e., amount of flour, sugar, baking temperature and time, etc.,</em> for baking cookies that is <em>optimal wrt taste</em>. Taste assessment is made by multiple human raters. The objective function \(f\) is the assessment (averaged, if there are multiple human raters) given the input vector \(x\) of various cookie ingredients, e.g., quantity of flour, amount of chocolate chips, and cooking conditions, e.g., oven temperature, time to bake. In other words, \(x\) encodes the recipe.</p>
<p>Why is this a good use case for BayesOpt?</p>
<ol>
<li>\(f(x)\) is not available for us to inspect (forget differentiability) - it exists in the heads of the human raters. But it can be invoked: ask people to rate cookies.</li>
<li>Does a person rate the same cookie in the same way in a blind test? Mostly yes, but sometimes, maybe not. Also, external factors such as humidity might affect the taste for the same \(x\). It’s safe to assume \(f(x)\) is noisy.</li>
<li>To evaluate \(f(x)\), you need cookies baked according to the recipe \(x\). Baking a batch of cookies and distributing them for assessment consumes time (order of hours or days) and effort, making it expensive to evaluate \(f(x)\). The only upside is an unending supply of cookies would probably make your raters happy!</li>
</ol>
<p>One way to optimize \(f(x)\) is to create the most promising recipe based on a thorough examination of the history of recipes and assessments. Loosely speaking, you tease out parts of past recipes that worked or didn’t, and use that knowledge to assemble a new recipe. Bake a batch of cookies based on it and have it evaluated by the human raters. Add the current recipe and assessment to your history, and repeat. BayesOpt formalizes this idea.</p>
<p>Of course, as you might suspect, baking cookies is not all this is good for ;-).</p>
<p><strong>Hyperparameter Optimization (HPO)</strong> is probably the flagship use-case of BayesOpt <a class="citation" href="#bayesopt_is_superior">(Turner et al., 2021)</a>. We want to increase the predictive accuracy of a model by tuning its hyperparams; here \(f\) maps a hyperparam setting to the resulting accuracy. <!-- *Cross-validation* is common but it sacrifices coverage of the hyperparam space by looking at a predetermined grid of settings. If we want to be relatively thorough, BayesOpt is a good fit. --> Note the apt prerequisites:</p>
<ol>
<li>\(f\) is effectively black-box if we want our solution to be generic, i.e., it should be usable as much for Decision Trees as for Neural Networks.</li>
<li>It is computationally expensive to evaluate (since evaluating \(f\) for a hyperparam setting would require us to retrain the model).</li>
<li>It can also be noisy, e.g., when training involves random initialization or stochastic methods like gradient descent.</li>
</ol>
<p><a href="http://hyperopt.github.io/hyperopt/">Hyperopt</a> is a popular library that implements this approach, using a specific form of BayesOpt known as the <em>Tree-structured Parzen Estimator (TPE)</em> <a class="citation" href="#Bergstra:2011:AHO:2986459.2986743">(Bergstra et al., 2011)</a>. There are others such as <a href="https://botorch.org/">BoTorch</a>, <a href="https://docs.ray.io/en/latest/tune/examples/bayesopt_example.html">Ray Tune</a> and <a href="https://github.com/bayesian-optimization/BayesianOptimization">BayesianOptimization</a>.</p>
<p>These kinds of problems are ubiquitous - and the preamble to this article listed diverse settings where BayesOpt crops up. Applications aside, BayesOpt itself is evolving, e.g., <em>InfoBAX</em> <a class="citation" href="#infobax">(Neiswanger et al., 2021)</a> uses BayesOpt-like ideas to compute arbitrary function properties (as opposed to just extrema) in a sample-efficient manner, while <em>grey-box optimization</em> <a class="citation" href="#astudillo2022thinking">(Astudillo & Frazier, 2022)</a> relaxes the black-box setting to utilize additional information like gradients.</p>
<!-- _includes/image.html -->
<div class="image-wrapper">
<img src="/assets/section_start.png" alt="test" />
<p class="image-caption"></p>
</div>
<h2 id="key-ideas">Key Ideas</h2>
<p>So what makes BayesOpt tick? Recall that BayesOpt is an iterative process, i.e., at each iteration a specific \(x\) is proposed and \(f(x)\) is evaluated. Essentially, BayesOpt repeats these two steps:</p>
<ol>
<li>At a given iteration \(t\), you’ve the function evaluations \(f(x_1), f(x_2), ..., f(x_{t-1})\). You can fit a regression model to this <em>history</em> \(\mathcal{H}=\{(x_i, f(x_i))\}_{i=1}^{i=t-1}\) of \(t-1\) evaluated pairs \((x_i, f(x_i))\), to approximate the actual objective function over the entire input space \(x\in \mathcal{X}\). We will refer to this approximation as our <strong>surrogate model</strong> \(\hat{f}\).
<br />Would any regression model do, e.g., Decision Trees, Neural Networks? Almost - but there is an additional property we require of our regressor, as we’ll shortly see.</li>
<li>Use \(\hat{f}\) as the function to optimize (instead of \(f\)) and propose a candidate optimum \(x_t\). Evaluate \(f(x_t)\), and if it isn’t satisfactory or you haven’t exhausted your optimization budget, add \((x_t, f(x_t))\) to the history \(\mathcal{H}\), refit \(\hat{f}\) on these \(t\) evaluations, and repeat these steps.</li>
</ol>
<p>This is a simplified picture, and there are at least two <strong>important</strong> nuances:</p>
<ul>
<li>First, the surrogate model needs to have the property that for \(x\in \mathcal{X}\) where the exact value is not known, i.e., \(f(x)\) is not available in history \(\mathcal{H}\), we need to be aware of <em>how uncertain</em> our guess \(\hat{f}(x)\) is.</li>
<li>And second, in Step 2 above, we don’t always guess the optimum based on the current \(\hat{f}\). Sometimes, we might choose to propose an \(x_t\) in the interest of exploration, i.e., in a region of \(\mathcal{X}\) where we haven’t evaluated \(f\) enough. This is the famous <strong>explore-exploit</strong> tradeoff you might’ve heard of in the context of <em>Reinforcement Learning</em>; it applies to BayesOpt too. Given the current surrogate model \(\hat{f}\), fit on history \(\mathcal{H}\), the next point \(x_t\) at which we wish to <em>acquire</em> \(f(x_t)\) is provided by an <strong>acquisition function</strong>. <br /> In fact, it’s the acquisition function that needs to know how uncertain \(\hat{f}\) is at various inputs, so it can judiciously make the explore-exploit trade-off.</li>
</ul>
<p>Let’s visualize these ideas for maximization of a univariate function, i.e., \(\mathcal{X}\) is a line.</p>
<!-- _includes/image.html -->
<div class="image-wrapper">
<img src="/assets/bayesopt/BO_iters_EI_1_iters.png" alt="test" />
<p class="image-caption">At an iteration of BayesOpt. The bottom plot shows the acquisition function values over the input space.</p>
</div>
<p>The image above shows two plots stacked vertically. Start with the top plot, and note the legend there. This shows what things look like with BayesOpt at its \(4^{th}\) iteration. \(f\) is shown with a black solid line and red dots mark the \(3\) prior evaluations of \(f\). A surrogate model \(\hat{f}\) fitted to these dots is shown as a blue dashed line. Outside of the red dots, we’re uncertain of the precise values, and this is shown as an envelope of somewhat likely values, i.e., those that fall within <em>one standard deviation</em> (denoted as \(1\sigma\)) of our <em>mean</em> guess (the dashed blue line). We’ll discuss how to obtain these estimates in the next section, using <em>Gaussian Processes</em>. At the red dots, the envelopes disappear because there is no guesswork involved. The envelope balloons up as we move into regions away from the red dots. The current discovered maximum is shown by the horizontal red line.</p>
<p>The bottom plot shows what the acquisition function looks like over \(\mathcal{X}\). We’ll see exactly how it’s calculated later (spoiler: there are many ways), but the red dot in this plot denotes the greatest acquisition value.</p>
<p>Let’s move on to the \(5^{th}\) iteration, as shown below. Now we have evaluated \(f\) at the \(x\) the acquisition function had suggested, and this recent evaluation is shown with a square box. This leads to a higher estimated maximum - the horizontal red line. The other steps are repeated: \(\hat{f}\) is refit on this data with 4 points, which means its uncertainty envelopes are recomputed as well. Note how the envelope nearly disappears in the interval \(\sim [11.5, 12.25]\) because of two evaluations, i.e., red dots, in close proximity. We also re-calculate the acquisition values over \(\mathcal{X}\). Note that this time though, the acquisition function has a single hump - it seems reasonably sure about where the maximum lies.</p>
<!-- _includes/image.html -->
<div class="image-wrapper">
<img src="/assets/bayesopt/BO_iters_EI_2_iters.png" alt="test" />
<p class="image-caption">The next iteration. The square box shows the new evaluation based on the previous iteration.</p>
</div>
<p>OK, now let’s look at iterations 4, 5, 6 and 7 together - shown below.</p>
<!-- _includes/image.html -->
<div class="image-wrapper">
<img src="/assets/bayesopt/BO_iters_EI_4_iters.jpg" alt="test" />
<p class="image-caption">Multiple iterations. Note the multiple high-density regions in the final acquisition value plot.</p>
</div>
<p>Notice something interesting? In the last iteration the acquisition function has grown <em>less</em> sure about where valuable points are located. This “jumping around” is characteristic of BayesOpt: as newer evaluations become available, the acquisition function revises its notion of valuable locations. In fact, the envelopes might grow “fatter” in places as well, when regions of surprising behavior are discovered. Uncertainty decreases asymptotically but occasional fluctuations are to be expected.</p>
<p>For this example, I’d argue this is a good thing because we still haven’t discovered the actual global maximum. More exploration is great if we have the budget for it.</p>
<p>Unfortunately, this also makes it challenging to have a simple <em>stopping criterion</em> for BayesOpt (there are some, e.g., <a class="citation" href="#makarova2022automatic">(Makarova et al., 2022)</a>), relative to, say, Gradient Descent (GD), where you might stop when the changes in the estimated extremum become vanishingly small. Typically, you decide on an <em>optimization budget</em>, e.g., a number of iterations \(T\), run it all the way through, and pick the best value (again, this is different from GD, where we might pick the last value).</p>
<p>Here’s the high-level BayesOpt algo summarized:</p>
<!-- _includes/image.html -->
<div class="image-wrapper">
<img src="/assets/bayesopt/bayesopt_algo-upscaled.png" alt="test" />
<p class="image-caption">BayesOpt pseudocode.</p>
</div>
<p>The first point, i.e., at \(t=1\), is picked via random sampling here: \(\mathcal{U}(\mathcal{X})\) denotes the <em>uniform</em> distribution over the domain \(\mathcal{X}\). You can begin with known prior evaluations too - this is helpful if you already know something about the function, e.g., where the maximum is likely to occur. Even prior known evaluations at arbitrary points help, as they save some exploration.</p>
<p>Look at Line 7 - we need to maximize the acquisition value, and that also requires an optimizer. Here’s a question you might ask: did we just translate one optimization problem into another? What did we gain?</p>
<p>The answer is yes, there is another optimization problem we need to solve <em>per iteration</em> of BayesOpt, using what we refer to as the <strong>auxiliary optimizer</strong>. What we gain is that this optimizer works with a tractable function - the surrogate model - instead of the original real-life function, which may or may not have a known form (like in the cookie example). The auxiliary optimization can be performed much more quickly, relatively speaking, because the surrogate model exists on your computer, and you’re free to make as many calls to it as you like. It still needs to be fast, since this optimization occurs per iteration of BayesOpt.</p>
<p>Common choices for this optimizer are <a href="https://websites.umich.edu/~mdolaboratory/pdf/Jones2020a.pdf"><em>DIRECT</em></a> and <a href="https://en.wikipedia.org/wiki/CMA-ES"><em>CMA-ES</em></a>. We won’t go into details here, but this <a href="https://stats.stackexchange.com/a/199268">StackOverflow answer</a> provides some information. There has been work to get rid of the auxiliary optimization <a class="citation" href="#pmlr-v33-wang14d">(Wang et al., 2014)</a>.</p>
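<p>To make the pseudocode concrete, here is a minimal, runnable sketch of the loop. It assumes scikit-learn’s <code>GaussianProcessRegressor</code> as the surrogate (we cover GPs in the next section), a simple Upper Confidence Bound acquisition, and a dense grid standing in for the auxiliary optimizer - a toy setup for illustration, not the exact one used for the plots above.</p>
<pre><code class="language-python">import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)

def f(x):
    # Cheap stand-in for the expensive, noisy black-box objective.
    return -(x - 7.0) ** 2 / 10.0 + np.sin(x) + 0.05 * rng.normal()

domain = np.linspace(0.0, 10.0, 500).reshape(-1, 1)  # discretized version of the domain
X_hist = rng.uniform(0.0, 10.0, size=(3, 1))         # a few initial points ~ U(domain)
y_hist = np.array([f(x[0]) for x in X_hist])

for t in range(20):                                   # optimization budget T
    # Step 1: fit the surrogate model to the history H.
    surrogate = GaussianProcessRegressor(kernel=RBF(length_scale=1.0),
                                         alpha=1e-3,  # models observation noise
                                         normalize_y=True).fit(X_hist, y_hist)
    # Step 2: maximize the acquisition function over the domain (the
    # "auxiliary optimization"; here just a grid search for simplicity).
    mean, std = surrogate.predict(domain, return_std=True)
    acq = mean + 2.0 * std                            # UCB: favour high mean or high uncertainty
    x_next = domain[[np.argmax(acq)]]                 # shape (1, 1)
    # Evaluate f, append to the history, repeat.
    X_hist = np.vstack([X_hist, x_next])
    y_hist = np.append(y_hist, f(x_next[0, 0]))

print("best x:", X_hist[np.argmax(y_hist), 0], "best f(x):", y_hist.max())
</code></pre>
<p>Swapping in a different surrogate, acquisition function or auxiliary optimizer only changes the corresponding lines; the overall loop stays the same.</p>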
<p><br /></p>
<hr style="height:1px;border-width:0;background-color:#EE2967" />
<p><span style="color:#EE2967"><strong>NOTE</strong></span>:</p>
<p>This gives us another perspective on BayesOpt. There already exist black-box optimizers like DIRECT and CMA-ES. This aspect is not new. What’s different here is that evaluating the objective function is expensive, and hence, inputs must be strategically picked so as to discover the optimum in the minimum number of evaluations. In this view, BayesOpt provides a way to make other black-box optimizers <em>sample efficient</em>.</p>
<hr style="height:1px;border-width:0;background-color:#EE2967" />
<p><br /></p>
<!-- _includes/image.html -->
<div class="image-wrapper">
<img src="/assets/section_start.png" alt="test" />
<p class="image-caption"></p>
</div>
<h2 id="comparison-to-gradient-descent-gd">Comparison to Gradient Descent (GD)</h2>
<p>Speaking of GD, let’s compare BayesOpt with it, to better situate BayesOpt conceptually - on a minimization problem this time.</p>
<p>In the plots below, the top row shows BayesOpt and vanilla GD minimizing the same function, with the same starting point of \(0.3\), with a budget of \(T=100\) iterations. Points recently visited are both larger and darker. We use up all of \(T\) for BayesOpt but exit early for GD when the minima value doesn’t change appreciably.</p>
<p>The lower row shows <em>Kernel Density Plots (KDE)</em> visualizing where each optimizer (in the same column) spent its time.</p>
<!-- _includes/image.html -->
<div class="image-wrapper">
<img src="/assets/bayesopt/bo_comp.png" alt="test" />
<p class="image-caption">BayesOpt compared with Gradient Descent.</p>
</div>
<p>We notice some interesting trends:</p>
<ul>
<li>GD (right column) spends most of its time getting to the closest local minimum. The KDE plot shows this focus. The minimum found is \(f(-0.04)=-0.1\). All recent points visited are concentrated around this minimum, and the “shape” of the path is like a comet’s tail.</li>
<li>BayesOpt (left column) is more “global” because of its exploration, and manages to find the global minimum \(f(-2)=-9.81\). It doesn’t always find the global minimum - it just does here. The KDE plot shows that while BayesOpt also focuses on the global minimum, its attention is fairly diffuse (look at the y-axes of both KDE plots). You can also see this from the scattered placement of the large red circles.</li>
<li>As a follow-up to the previous point, the fact that BayesOpt spends much of its function evaluation budget closer to the minimum is characteristic behavior of these methods. That is to say, while it tries to model the whole input space, it allocates a larger budget to regions closer to the minimum. This is the region that’s important to us. This is what you would expect from a smart search algorithm.</li>
</ul>
<p>To be fair, you wouldn’t typically employ GD in this naive fashion - you would probably use momentum, use some kind of adaptive learning rate (this was fixed here) and perform multiple starts, but this example is more illustrative than practical.</p>
<p>At this point we have a view of BayesOpt at the, uh, cocktail party level.
Hopefully, all of this has whetted your appetite to learn more about the two critical ingredients of BayesOpt: <strong>surrogate models</strong> and <strong>acquisition functions</strong>. We begin by looking at an example of the former.</p>
<!-- _includes/image.html -->
<div class="image-wrapper">
<img src="/assets/section_start.png" alt="test" />
<p class="image-caption"></p>
</div>
<h2 id="gaussian_processes">Gaussian Processes (GP)</h2>
<p>Recall we said that a good surrogate model needs to be able to quantify uncertainty well - in addition to being a good regressor. In my opinion, <em>Gaussian Processes (GP)</em> are almost the gold standard here. This is the reason why you tend to encounter GPs in conjunction with BayesOpt a lot. You could do BayesOpt without GPs, e.g., you can use Decision Trees <a class="citation" href="#pmlr-v151-kim22b">(Kim & Choi, 2022)</a> or Neural Networks <a class="citation" href="#pmlr-v37-snoek15">(Snoek et al., 2015)</a>, but GPs and BayesOpt are a match made in heaven.</p>
<p>The topic of GPs is interesting and vast enough to merit its own blog post, but we’ll try to cover the relevant bits here.</p>
<hr style="height: 50px; border: 0; box-shadow: inset 0 12px 12px -12px rgba(0, 0, 0, 0.5);" />
<h3 id="intuition">Intuition</h3>
<p>OK, let’s consider the regression angle. You have a bunch of points \((x_i, y_i), i=1...N\) where \(y_i \in \mathbb{R}\), to which you want to fit a function. But when you <em>predict</em>, now that we want to explicitly talk about uncertainties, you don’t want to predict <em>one value</em> for an input.</p>
<p>See (a) below - which is how we typically think about prediction - one input leads to one output only. A line through such outputs represents our model for the function. In (b), we show what predictions might look like when we are not yet sure of what we’ve learned: many values per input. Of course, we might not think all points are equally likely - shown in (c). Having many outputs also implies that we are effectively predicting <em>many functions</em> - every line that connects some output for each input is a possible function. We can also assign a probability to these function predictions by considering the confidences of the individual points they connect - see (d).</p>
<!-- _includes/image.html -->
<div class="image-wrapper">
<img src="/assets/bayesopt/gp_abstract.png" alt="test" />
<p class="image-caption"></p>
</div>
<p>That GPs enable (c) and (d) is a good mental model to have. Let’s start getting into details.</p>
<p>Normally, you would posit a function \(f(x;\theta)\) with parameters \(\theta\), e.g., coefficients in Logistic Regression or weights in a Neural Network, and learn \(\theta\) that best explains the data. This is a <em>parametric</em> model, since \(f\)’s behavior depends on the parameters \(\theta\). Once you have learned \(\theta\), you can discard your training data \(\{(x_i, y_i)\}_{i=1}^N\). Importantly, these parameters decide the <em>overall shape</em> of your function.</p>
<p>GPs, though, ask you to assume a certain <em>smoothness</em> in the function, which, roughly speaking, is how similar you want \(y_i\) and \(y_j\) to be given similar \(x_i\) and \(x_j\). So instead of parameterizing the shape of the overall function, you try to define how input similarities dictate output similarities. This is a very different philosophy (something that <em>Support Vector Machines</em> also use), but quite natural. Consider the following function values:</p>
<ul>
<li>\(\;\) \(f(1) = 10\)</li>
<li>\(\;\) \(f(2) = 40\)</li>
<li>\(\;\) \(f(3) = 90\)</li>
<li>\(\;\) \(f(4) = 160\)</li>
</ul>
<p>If I now wonder about the value at \(3.1\), wouldn’t it make sense to guess that \(f(3.1) \approx 90\)? This is our natural sense of smoothness at work. In a way, GPs assemble the overall function from smaller function pieces where each piece honors some notion of local similarity that we define.</p>
<p>How do we operationalize this? Let’s step back for a bit, recall that we also want good uncertainties, and see if we can get both these properties in one shot. This might seem out of left field, but consider the Gaussian distribution \(\mathcal{N}(\mu, \Sigma)\) for a moment (it’ll make sense soon, promise). As a refresher, for a distribution over two variables with the following covariance matrix:</p>
\[\Sigma=\begin{bmatrix}
\Sigma_{11} & \Sigma_{12}\\
\Sigma_{21} & \Sigma_{22}
\end{bmatrix}\]
<p>we have:</p>
<ul>
<li>\(\Sigma_{11}\) and \(\Sigma_{22}\) are the individual variances of the random variables (which I’ll refer to as \(Y_1\) and \(Y_2\) - for reasons to be clear soon).</li>
<li>\(\Sigma_{12}\) is the covariance between the two variables and signifies what values we might expect of one given the other. A positive value indicates high \(Y_1\) values signal high \(Y_2\) values; the higher the \(\Sigma_{12}\), the greater this correlation. And vice versa for a negative value. This is a symmetric quantity, i.e., \(\Sigma_{12}=\Sigma_{21}\).</li>
</ul>
<p>The plots below show some examples, with differing covariances (shown in the legend).</p>
<!-- _includes/image.html -->
<div class="image-wrapper">
<img src="/assets/bayesopt/gaussian_covs.png" alt="test" />
<p class="image-caption">Effect of the covariance matrix.</p>
</div>
<p>Consider the first plot. The “forward slant” of the Gaussian exists because \(\Sigma_{12}=\Sigma_{21}=2\), and thus \(Y_1\) and \(Y_2\) are positively correlated. As possible values for \(Y_1\) increase (going from left to right), possible values of \(Y_2\) also increase (going from bottom to top).</p>
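<p>If you want to see this coupling numerically rather than just visually, here is a tiny NumPy sketch; the covariance values below are illustrative, not the ones used for the plots above.</p>
<pre><code class="language-python">import numpy as np

rng = np.random.default_rng(0)
Sigma = np.array([[3.0, 2.0],
                  [2.0, 3.0]])   # Sigma_12 = Sigma_21 = 2, i.e., positive coupling
samples = rng.multivariate_normal(mean=np.zeros(2), cov=Sigma, size=10_000)

# The positive off-diagonal entry shows up as a positive correlation: high Y_1
# values tend to co-occur with high Y_2 values.
print(np.corrcoef(samples[:, 0], samples[:, 1])[0, 1])   # roughly 2/3
</code></pre>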
<p>Now, here’s an outlandish idea: what if we let the <em>Gaussian random variable</em> \(Y_i\) represent all possible outputs for \(x_i\)? Wait, why is there more than one output? Because we are explicitly modeling uncertainty in our knowledge of the function, which is to say, \(Y_i\) can take on a range of different values given our current state of knowledge. But the situation is not entirely hopeless: we have some idea about the relative likelihood of these values of \(Y_i\). So we might be able to say things like “Given input \(x_i\), \(Y_i=2\) is more likely than \(Y_i=3\), but not as likely as \(Y_i=1\).” This may be expressed as a distribution. Here’s what that might look like for \(Y_1\) (the rugplot, i.e., the vertical bars, shows the values, and the blue line shows their density):</p>
<!-- _includes/image.html -->
<div class="image-wrapper">
<img src="/assets/bayesopt/var_coupling_1D_1var.png" alt="test" />
<p class="image-caption">Two possible distributions of output for the same input.</p>
</div>
<p>We will <em>assume</em> that our uncertainty distribution may be expressed as a Gaussian - like in the second plot above. This is not unreasonable - you probably have a certain value (or a set of them in a small neighborhood) that you think is most likely - this is the region around the mean of the Gaussian. The values for any \(i\) may be expressed similarly (of course, with different Gaussians).</p>
<p>OK, now consider two outputs \(Y_1\) and \(Y_2\). They each have their Gaussians, right? But if \(x_1\) and \(x_2\) are close together, the output values cannot differ that much (remember: smoothness), even accounting for their uncertainties. What might these, somewhat <em>coupled</em>, values look like? I am going to place the values of \(Y_1\) and \(Y_2\) on the two axes of a 2D plot. And we get something like this:</p>
<!-- _includes/image.html -->
<div class="image-wrapper">
<img src="/assets/bayesopt/var_coupling_2D_2var_no_contour.png" alt="test" />
<p class="image-caption">Visualization of two output distributions.</p>
</div>
<p>You might now think of the two outputs \(Y_1\) and \(Y_2\) as <em>jointly</em> following a <em>bivariate</em>, i.e., two-dimensional, distribution. Do you see where this is headed? No? Let me add some contours.</p>
<!-- _includes/image.html -->
<div class="image-wrapper">
<img src="/assets/bayesopt/var_coupling_2D_2var_contour.png" alt="test" />
<p class="image-caption">Visualization of two output distributions, with contours.</p>
</div>
<p>A 2D Gaussian! We have seen these before! It looks like if we express uncertainties in outputs for a given input as a 1D Gaussian, the joint distribution across outputs turns out to be a high-dimensional Gaussian. Actually, as we shall see in the next section, we just assume the latter and the former is a happy consequence of that.</p>
<p>And, since the coupling in the joint Gaussian distribution is controlled by the covariance matrix, we let similarities in inputs decide its entries. Specifically, if \(k(x_i, x_j)\) denotes the similarity between \(x_i\) and \(x_j\), we let \(\Sigma_{ij} = k(x_i, x_j)\). Controlling the covariance matrix is a great way to regulate the variation in one output wrt another, and letting input similarities decide entries in this matrix works out perfectly for us: it gives us a precise way to enforce smoothness.</p>
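<p>Here is a small NumPy sketch of that recipe. The inputs, the similarity function (a Gaussian-shaped kernel we’ll meet formally later) and the length scale are all arbitrary choices for illustration.</p>
<pre><code class="language-python">import numpy as np

def k(z_i, z_j, L=1.0):
    # One concrete notion of similarity: close inputs score near 1, far ones near 0.
    return np.exp(-((z_i - z_j) ** 2) / (2 * L ** 2))

x = np.array([1.0, 1.2, 5.0])                      # x_1 and x_2 are close, x_3 is far away
K = np.array([[k(a, b) for b in x] for a in x])    # Sigma_ij = k(x_i, x_j)

rng = np.random.default_rng(0)
Y = rng.multivariate_normal(mean=np.zeros(len(x)), cov=K, size=5)

print(np.round(K, 3))   # K_12 is close to 1; K_13 and K_23 are close to 0
print(np.round(Y, 2))   # within each sample, Y_1 and Y_2 agree closely; Y_3 is free to differ
</code></pre>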
<p>Can we visualize more than two output distributions? This is doable with a different visualization (I saw this <a href="https://www.youtube.com/watch?v=92-98SYOdlY">here</a> first): we’ll stack lines representing \(Y_i\)s next to each other, and for every set of values sampled from their joint distribution, we’ll draw a line through them. Let’s see what the above plot might look like in this form:</p>
<!-- _includes/image.html -->
<div class="image-wrapper">
<img src="/assets/bayesopt/2D_variable_coupling.gif" alt="test" />
<p class="image-caption">A different representation of the joint distribution.</p>
</div>
<p>The line currently shown corresponds to the current sample from the joint 2D Gaussian. Note the positive slope of the animated line; this is because of the positive correlation between the \(Y_i\)s in the joint Gaussian.</p>
<p>This gives us a way to visualize the behavior of more than just two \(Y_i\)s. And we’ll do exactly that next. But with a new twist: we’ll allow the gap between the vertical bars to be decided by the corresponding \(x_i\)s. This will give us an idea of the similarity of the inputs. Shown below for three \(Y_i\)s.</p>
<!-- _includes/image.html -->
<div class="image-wrapper">
<img src="/assets/bayesopt/multiple_y_anim_3.gif" alt="test" />
<p class="image-caption">Visualization of three output distributions.</p>
</div>
<p>Note how \(Y_1\) and \(Y_2\) seem to take on similar values, while \(Y_1\) (or \(Y_2\)) and \(Y_3\) can end up with very different values. This is because of the large distance of \(x_3\) from \(x_1\) or \(x_2\).</p>
<p>Just because now we can, let’s go ahead and look at <em>ten</em> output distributions.</p>
<!-- _includes/image.html -->
<div class="image-wrapper">
<img src="/assets/bayesopt/multiple_y_anim_10.gif" alt="test" />
<p class="image-caption">Visualization of ten output distributions.</p>
</div>
<p>This almost looks like a full-fledged function! Again, note how \(x_i\)s that are close together take on similar values. This makes the function appear smooth. What you are seeing here is a <em>distribution of functions</em> as modeled by a GP. Sounds complex, but what we are really saying is that each line through \(Y_1-Y_{10}\) is a function, and since we are seeing multiple functions above - because of the uncertainty of the \(Y_i\)s - we are effectively looking at a distribution of functions.</p>
<p>Sticking with this visual metaphor, what does regression mean here? A training instance \((x_i, y_i)\) implies the corresponding variable \(Y_i\) is now fixed at the value of \(y_i\). Training data effectively <em>clamps</em> down the function at observations, leaving it to vary elsewhere. In the next plot, the clamped down points are shown in red.</p>
<!-- _includes/image.html -->
<div class="image-wrapper">
<img src="/assets/bayesopt/multiple_y_anim_10_clamped.gif" alt="test" />
<p class="image-caption">Training data clamps down the y values - shown with red circles.</p>
</div>
<p>How much the function sways “elsewhere” - we’ll think of these as test locations - depends on (a) the degree/kind of smoothness we have enforced, and (b) how far a test input is from a training instance. In general, you might imagine that the effect of clamping reduces as you move away from the clamped points. There are exceptions to this, like if you wanted your smoothness to have some kind of periodicity (<a href="https://peterroelants.github.io/posts/gaussian-process-kernels/#Local-periodic-kernel">example</a>) - but this is a good mental model for now.</p>
<p>The benefit of all this is that when we are predicting the output for an input, we end up predicting a <em>distribution</em> instead of a single value. This is not as scary as it sounds: consider any vertical \(Y_i\) bar in the above plot, and think about the animated line tracing values on it. Shown below.</p>
<!-- _includes/image.html -->
<div class="image-wrapper">
<img src="/assets/bayesopt/multiple_y_anim_10_clamped_trace.gif" alt="test" />
<p class="image-caption">The multiple values produced for each input are shown.</p>
</div>
<p>The distributions of these traced values are what we are looking for. And since we know them to be Gaussian, all we are looking for are their means and variances. In fact, we can visually see that these <em>might</em> be Gaussians (of course, we will confirm this in the next section). Here I have drawn a KDE plot over the various \(Y_i\) values; it begins to get plotted once we’ve had three iterations.</p>
<!-- _includes/image.html -->
<div class="image-wrapper">
<img src="/assets/bayesopt/multiple_y_anim_10_clamped_trace_kde.gif" alt="test" />
<p class="image-caption">KDE plots over the various outputs - they start to approximate Gaussians as values accumulate. Not shown for the clamped points, since they are fixed values.</p>
</div>
<p>Yes, all of this is unusual, but also (I think) awesome. Intuitively, the few plots above convey the essence of GPs. What remains is to make these ideas rigorous, specifically: (a) which are the core assumptions and which are their consequences, and (b) how do we effectively compute the quantities of interest?</p>
<p><br /></p>
<hr style="height:1px;border-width:0;background-color:#EE2967" />
<p><span style="color:#EE2967"><strong>IMPORTANT</strong></span>: we aren’t saying the outputs <em>across inputs</em> form an univariate Gaussian. Mathematically, this is what we <strong>ARE NOT</strong> saying:</p>
\[Y \sim \mathcal{N}(\mu, \sigma^2)\]
<p>Here, \(Y\) is a random variable that takes on output values across <em>different</em> inputs, and we assume exactly one “best-guess” output per input, i.e., for an \(x_i\), we just have one specific \(y_i\), and \(Y \in \{y_i\}_{i=1}^N\). Such a \(y_i\) is known as a <a href="https://en.wikipedia.org/wiki/Point_estimation">point estimate</a>. This is also why the \(y_i\)s are in lower case - they are not random variables. This is how regression models are commonly used: you plug in an input and receive an output. Instead of a covariance matrix \(\Sigma\), I am using the standard notation for variance \(\sigma^2\), since this is a 1D Gaussian.</p>
<p>So, what <em>are</em> we saying? We have one Gaussian random variable dedicated to the possible outputs for one given input. Jointly, these output distributions form a multivariate Gaussian, which describes our current knowledge (or uncertainty) of the function. The extent of coupling between individual distributions is decided by the similarity in their inputs. Mathematically, this is what we <strong>ARE</strong> saying:</p>
\[p(Y_1, Y_2, Y_3, ...) \sim \mathcal{N}(\mu, \Sigma)\]
<p>The precise coupling between \(Y_i\) and \(Y_j\) is enforced by the similarity in the respective inputs, \(\Sigma_{ij}=k(x_i, x_j)\).
<!--Here, $$Y_i$$ is a random variable that represents possible outputs *just for* $$x_i$$. A specific observed $$y_i$$ is understood to be the value of the corresponding random variable $$Y_i$$ given our data. In other words, $$Y_i$$ can take other values too, according to some univariate distribution (spoiler: as we shall soon see, this is *also* a Gaussian) - its just that, as per our data, $$Y_i=y_i$$. Also, now we've a covariance matrix $$\Sigma$$, which is defined as described earlier: based on input similarities.--></p>
<hr style="height:1px;border-width:0;background-color:#EE2967" />
<p><br /></p>
<hr style="height: 50px; border: 0; box-shadow: inset 0 12px 12px -12px rgba(0, 0, 0, 0.5);" />
<h3 id="deep-dive">Deep Dive</h3>
<p>Let’s formalize all of this stuff. But fear not - there will be pictures!</p>
<p>We’ll talk about the following sets of points:</p>
<ol>
<li>Training data \((x_i, y_i)_{i=1}^N\).</li>
<li>Test data or inputs where we want to evaluate the learned function: \(x_i^*, i=1 ... M\).</li>
<li>All other points in the input space \(\tilde{x}_k\).</li>
</ol>
<p>The corresponding uppercase letters would indicate a set of points, e.g., \(X\), \(X^*\), \(\tilde{X}\). This notation would be overloaded to imply matrices at times, e.g., \(X\in \mathbb{R}^{N\times d}\) where \(d\) is the number of dimensions of the data. The number of elements in \(\tilde{X}\) is <em>potentially infinite</em>. Also, as mentioned earlier, we want to talk about a range of outcomes for a particular input, and hence the outcomes would be denoted by random variables \(Y_i\), \(Y_i^*\) and \(\tilde{Y}_i\). We can obviously talk about specific values these random variables can take, e.g., \(Y_i=y_i\).</p>
<p>The model we’re proposing above is:</p>
\[p(Y_1, ..., Y_N, Y_1^*, ..., Y_M^*, \tilde{Y}_1, ...) \sim \mathcal{N}(\mu, \Sigma)\]
<p>where \(\Sigma_{ij}=k(z_i, z_j)\), and each \(z\) comes from one of \(X\), \(X^*\) or \(\tilde{X}\), based on the positions \(i\) and \(j\).</p>
<p><br /></p>
<hr style="height:1px;border-width:0;background-color:#EE2967" />
<p><span style="color:#EE2967"><strong>NOTE</strong></span>:</p>
<ul>
<li>While the uppercase letters for inputs, e.g., \(X\), denote collections, uppercase <em>indexed</em> letters for the output denote individual random variables, e.g., \(Y_i\). <em>Non-indexed</em> uppercase letters denote a collection of random variables (analogous to the input), e.g., \(Y\), \(Y^*\), \(\tilde{Y}\).</li>
<li>Many references explicitly mention the input as conditioning variables: \(p(Y_1, ..., Y_N, Y_1^*, ..., Y_M^*, \tilde{Y}_1, ...\vert {\color{WildStrawberry}{X}},{\color{WildStrawberry}{X^*}},{\color{WildStrawberry}{\tilde{X}}})\). But I’ll avoid it here to avoid clutter.</li>
</ul>
<hr style="height:1px;border-width:0;background-color:#EE2967" />
<p><br /></p>
<p>At this point, I can almost see this conversation between us:</p>
<ul>
<li><strong>You</strong>: Whoa! Infinite points in \(\tilde{X}\)? How do we compute an infinite-dimensional distribution?</li>
<li>
<p><strong>Me</strong>: We are only going to be dealing with \(X\) (training data) and \(X^*\) (test data) which are finite. I mentioned \(\tilde{X}\) to write down the overall model.</p>
</li>
<li><strong>You</strong>: Hmm, and what distribution do the finite parts follow?</li>
<li><strong>Me</strong>: Gaussian. Here’s why. Gaussian distributions have this interesting property that if you were to look at the variations in only a subset of the variables, then you’d see they also follow a Gaussian distribution (known as <a href="https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Marginal_distributions">marginalization</a>). By this property the finite parts of the original infinite-dimensional Gaussian are also a Gaussian. The parameters for this <em>marginalized</em> Gaussian are easy to obtain: only retain the subset of the original parameters that involve the current subset of variables. For example, if the random variables \(t_1, t_2, t_3, t_4\) follow a Gaussian distribution, then \([t_1, t_3]\) is a Gaussian with the parameters in <span style="color:#EE2967">this</span> color.</li>
</ul>
\[[{\color{WildStrawberry}{t_1}}, t_2, {\color{WildStrawberry}{t_3}}, t_4] \sim \mathcal{N}([{\color{WildStrawberry}{\mu_1}}, \mu_2, {\color{WildStrawberry}{\mu_3}}, \mu_4],
\begin{bmatrix} \color{WildStrawberry}{\Sigma_{11}} & \Sigma_{12} & \color{WildStrawberry}{\Sigma_{13}} & \Sigma_{14}\\
\Sigma_{21} & \Sigma_{22} & \Sigma_{23} & \Sigma_{24}\\
\color{WildStrawberry}{\Sigma_{31}} & \Sigma_{32} & \color{WildStrawberry}{\Sigma_{33}} & \Sigma_{34} \\
\Sigma_{41} & \Sigma_{42} & \Sigma_{43} &\Sigma_{44} \\
\end{bmatrix})\]
<ul>
<li><strong>You</strong>: This helps us how?</li>
<li>
<p><strong>Me</strong>: We can talk only about the train and test data as coming from a Gaussian. Yes, this can be seen as a marginalization of the “larger” infinite-dimensional Gaussian that conceptually exists. So, what we’ll really focus on is this distribution:
\(p(Y_1, ..., Y_N, Y_1^*, ..., Y_M^*) \sim \mathcal{N}(\mu, \Sigma)\). You don’t see the inputs in this expression, but remember they’re hiding within the \(\Sigma\).</p>
</li>
<li><strong>You</strong>: I see. So we worry only about the train and test data. But how do we derive the test outcomes given the training data? This regression problem is why we’re here after all.</li>
<li>
<p><strong>Me</strong>: If you look at the expression above, we’re asking for the <em>conditional</em> distribution given training outcomes, i.e., \(p(Y_1^*, ..., Y_M^* \vert {\color{WildStrawberry}{Y_1=y_1}}, {\color{WildStrawberry}{Y_2=y_2}}, ..., {\color{WildStrawberry}{Y_N=y_N}})\). Guess what this conditional distribution is … yes, also a Gaussian. This has nothing to do with GPs; it’s a property of the Gaussian distribution that its <a href="https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Conditional_distributions">conditionals are Gaussian</a> as well. And its parameters are expressible in terms of \({\color{WildStrawberry}{X}}\), \({\color{WildStrawberry}{Y}}\), \({\color{WildStrawberry}{X^*}}\), \({\color{WildStrawberry}{\mu}}\) and \({\color{WildStrawberry}{\Sigma}}\). This is it - <strong>this is the big reveal</strong>. The random variables corresponding to our test outputs are jointly Gaussian, whose parameters are calculable in closed form.</p>
<p>You can also calculate the distribution of just one \(Y_i^*\). No points for guessing what this distribution is: by the marginalization property this is a (univariate) Gaussian since \(p(Y_1^*,..., Y_M^*)\) is a Gaussian. With a distribution for \(Y_i^*\) we can now do the following:</p>
<ul>
<li>Calculate the <em>mean value</em> for \(Y_i^*\) - this serves as the regression value for input \(x_i^*\).</li>
<li>Assign an <em>uncertainty</em> score to \(Y_i^*\), based on the flatness of its Gaussian. <br /></li>
</ul>
</li>
</ul>
<p><br /></p>
<hr style="height:1px;border-width:0;background-color:#EE2967" />
<p><span style="color:#EE2967"><strong>NOTE</strong></span>:</p>
<p>That’s a lot of Gaussians! <a href="https://en.wikipedia.org/wiki/Normal_distribution#Operations_and_functions_of_normal_variables">Quite a few operations</a> on Gaussians result in Gaussians; which is why they’re considered to be mathematically convenient. You can develop tools and strategies for Gaussians alone, and that will take you far. But be careful here in distinguishing between these operations:</p>
<ul>
<li>We used marginalization to say that it’s OK to think about a finite slice of the overall model as a Gaussian. The convenient form of marginalization helped here: we could just parameterize this finite distribution in a self-contained manner, while assuming that these parameters were plucked out of an infinitely larger parameter space.</li>
<li>Then, we used conditioning wrt training data, on this marginal finite Gaussian distribution, to describe the test outputs.</li>
</ul>
<p>So, really, the central assumption here is that the output distributions are jointly Gaussian, with a covariance matrix that encodes input similarities. All other properties are consequences.</p>
<hr style="height:1px;border-width:0;background-color:#EE2967" />
<p><br /></p>
<p>Hopefully, it’s starting to look like things are coming together. A few more details should seal the deal:</p>
<ol>
<li>We set \({\color{WildStrawberry}{\mu=0}}\). This is a convenient <em>assumption</em> but works well in practice because (we shall see soon) the conditional distribution we want is strongly influenced by the actual training data. If you’re interested, <a href="https://math.stackexchange.com/questions/1193772/why-the-mean-value-of-a-gaussian-process-is-usually-set-to-zero">this</a> discussion on StackExchange discusses the validity of such an assumption.</li>
<li>Let’s start referring to the covariance matrix as the <em>kernel matrix</em> \(K\) to emphasize that it consists of similarity values. \(k()\) itself is known as the <em>kernel function</em>. To rephrase something we saw earlier, \({\color{WildStrawberry}{K_{ij}}}=k(z_i, z_j)\), where \(z\) comes from one of \(X\), \(X^*\) or \(\tilde{X}\), based on the positions \(i\) and \(j\).</li>
<li>
<p>What is an example of a kernel function? A very common example is the <em>Radial Basis Function (RBF)</em> kernel \({\color{WildStrawberry}{k(z_i, z_j) = e^{-\vert\vert z_i-z_j \vert\vert^2/(2L^2)}}}\). If \(z_i\) and \(z_j\) are close, say, \({\color{WildStrawberry}{\vert\vert z_i-z_j \vert\vert \approx 0}}\), then we’ve \({\color{WildStrawberry}{k(z_i, z_j) \approx 1}}\) (for a given \(L\)). On the other hand, if \({\color{WildStrawberry}{\vert\vert z_i-z_j \vert\vert \to \infty}}\), then \({\color{WildStrawberry}{k(z_i, z_j) \to 0}}\). This is how we would expect similarities to behave. Here \({\color{WildStrawberry}{L}}\) is known as the <em>length scale</em>, and we’ll explain its role soon. Of course, there are many other kernels, but we’ll use the RBF kernel in our examples here. For others, <a href="https://www.cs.toronto.edu/~duvenaud/cookbook/">this is a good reference</a>.</p>
<p>Can any notion of similarity be written up as a kernel function? Good news bad news kind of situation. Unfortunately not - the matrix \(K\) needs to be <a href="https://en.wikipedia.org/wiki/Definite_matrix">positive semi-definite</a>. This is a property we won’t delve into here. This is not particularly restrictive for these reasons: (a) there are many useful similarity functions that follow this property, and (b) you can use such valid kernel functions to compose new ones in <a href="https://mlweb.loria.fr/book/en/constructingkernels.html">certain ways</a> that guarantee that the new kernel also possesses this property.</p>
</li>
<li>We’ll think of \(K\) as being partitioned into four sections based on the data among which similarities are computed. For example, the partition \(K_{XX^*}\) implies this submatrix contains similarities between instances in \(X\) and \(X^*\). This is shown below.</li>
</ol>
<!-- _includes/image.html -->
<div class="image-wrapper">
<img src="/assets/bayesopt/kernel_partition.png" alt="test" />
<p class="image-caption">Partitioning of the kernel matrix.</p>
</div>
<p>Now we’re ready to write down the exact form of the test distributions:
\(p(Y^*_1, ..., Y^*_M \vert Y_1, ..., Y_N) = \mathcal{N}({\color{WildStrawberry}{K_{X^*X} K^{-1}_{XX}y}}, {\color{PineGreen}{K_{X^*X^*} - K_{X^*X} K^{-1}_{XX} K_{XX^*} }} )\)</p>
<p>Here,</p>
<ul>
<li>The above expression is just the <a href="https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Conditional_distributions">standard formula</a> for the conditional distribution for a Gaussian.</li>
<li>\(y\) is the vector of training outputs \([y_1, y_2, ..., y_N]\).</li>
<li>The new mean is \({\color{WildStrawberry}{K_{X^*X} K^{-1}_{XX}y}}\). This is an \(M\)-dimensional vector, where the \(i^{th}\) value is the mean for \(Y^*_i\). The different matrix sizes are visualized below.</li>
</ul>
<!-- _includes/image.html -->
<div class="image-wrapper">
<img src="/assets/bayesopt/gp_mean.png" alt="test" />
<p class="image-caption"></p>
</div>
<ul>
<li>The new covariance matrix is \({\color{PineGreen}{K_{X^*X^*} - K_{X^*X} K^{-1}_{XX} K_{XX^*} }}\).</li>
</ul>
<!-- _includes/image.html -->
<div class="image-wrapper">
<img src="/assets/bayesopt/gp_cov.png" alt="test" />
<p class="image-caption"></p>
</div>
<p>These look complicated, but the important thing here is that \(p(Y^*_1, ..., Y^*_M \vert Y_1, ..., Y_N)\) can be <strong>computed in closed form</strong> - with just that one equation. Think of what we’re getting: an entire joint distribution over the test outputs \(Y^*_1, Y^*_2, ..., Y^*_M\).</p>
<p>Let’s visualize this in the style we saw in the previous section. We sample once from the distribution \(p(Y^*_1, ..., Y^*_M \vert Y_1, ..., Y_N)\) to obtain <em>one realization</em> of the learned function. You can sample multiple times to obtain different such realizations.</p>
<p>The image below shows this (zooming in recommended), given some <span style="border-width:0px;background-color:#e38fa6;">training data</span> and <span style="border-width:0px;background-color:#abdff8;">test input locations</span>. It shows <span style="color:#1b3bff">three realizations</span> of \(p(Y^*_1, ..., Y^*_M \vert Y_1, ..., Y_N)\). It also shows, in gory detail, what the kernel matrix looks like - note that it’s color-coded to show interactions between \(X\) and \(X^*\).</p>
<!-- _includes/image.html -->
<div class="image-wrapper">
<img src="/assets/bayesopt/GP_show_sampling.jpg" alt="test" />
<p class="image-caption">The big picture: given evaluated (red) points and points where we want to plot (blue), what the overall
matrix K looks like. Note the color coding in K: values created using the red and blue values are in purple.</p>
</div>
<p>Since we’re effectively sampling realizations of the learned function, we conventionally say we’re “sampling functions”, and that the GP provides a <em>distribution over functions</em> (a phrasing we encountered earlier). There is nothing stopping us from using a lot of closely-spaced test inputs, and plotting against them lends credence to that terminology. The image below shows this - we have many test inputs in \([0, 10]\), which are not explicitly marked to avoid clutter. The training data is shown as <span style="color:#c44e52;">red dots</span>. This time I have also plotted the mean output value for each \(Y^*_i\) and connected them with a <span style="color:#0047ab;">dashed line</span> - this acts like a standard regression curve.</p>
<!-- _includes/image.html -->
<div class="image-wrapper">
<img src="/assets/bayesopt/GP_lines.jpg" alt="test" />
<p class="image-caption">Multiple sampled functions. Evaluated locations act like constrictions. We noted this earlier in the animations in the previous section.</p>
</div>
<p>Can we control how smooth we want the functions to be? One way to do it would be to pick an appropriate kernel. But even with an RBF kernel, you can exercise control using the length scale parameter \(L\). Think about the role it plays in \(k(z_i, z_j) = e^{-\vert\vert z_i-z_j \vert\vert^2/(2{\color{WildStrawberry}{L}}^2)}\). A large value suppresses the difference \(\vert\vert z_i-z_j \vert\vert^2\). This is to say, points that are close and points that are somewhat farther apart both end up with similarity scores \(\approx 1\). Since this means the corresponding outputs should also be similar, such functions would seem less “bendy”, i.e., we won’t see drastic changes over short spans. The following images show the effect of changing the length scale (see the titles), and a small code sketch follows them.</p>
<!-- _includes/image.html -->
<div class="image-wrapper">
<img src="/assets/bayesopt/GP_length_scales.jpg" alt="test" />
<p class="image-caption">Effect of the length scale - shown increasing from top to bottom. Lower length scales lead to functions that can 'bend' more.</p>
</div>
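<p>If you want to see this effect yourself, here is a minimal sketch that samples zero-mean functions from RBF kernels with a few different length scales. Note that <code class="language-plaintext highlighter-rouge">sklearn.metrics.pairwise.rbf_kernel</code> takes a <code class="language-plaintext highlighter-rouge">gamma</code> parameter rather than a length scale; for the formula above, \(\gamma = 1/(2L^2)\). The particular length scale values and the jitter term are arbitrary choices for illustration:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import numpy as np
from matplotlib import pyplot as plt
from scipy.stats import multivariate_normal as mvn
from sklearn.metrics.pairwise import rbf_kernel as rbf

X = np.linspace(0, 10, 200).reshape(-1, 1)
for L in [0.5, 1.0, 2.0]:
    # sklearn's rbf_kernel uses gamma = 1 / (2 * L^2) for a length scale L
    K = rbf(X, X, gamma=1.0 / (2 * L ** 2))
    # a small diagonal "jitter" keeps the covariance numerically positive definite
    samples = mvn.rvs(mean=np.zeros(len(X)), cov=K + 1e-6 * np.eye(len(X)), size=3)
    for s in samples:
        plt.plot(X.ravel(), s, linewidth=0.75)
    plt.title(f"length scale L = {L}")
    plt.show()</code></pre></figure>
<p>Smaller values of \(L\) should produce visibly “bendier” samples, mirroring the figure above.</p>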
<p>We noted above that the distribution for a single output random variable \(Y^*_i\) is also a Gaussian, due to marginalization. Let’s plot KDEs to see this, similar to an exercise we did earlier. I have done this for a couple of locations below.</p>
<!-- _includes/image.html -->
<div class="image-wrapper">
<img src="/assets/bayesopt/GP_y_dist_kde.jpg" alt="test" />
<p class="image-caption">The empirical distribution of y shown as a KDE plot.</p>
</div>
<p>Since we know the parameters of these Gaussians, an alternative way to visualize them is to mark the points that are one standard deviation away from the mean, and connect these for different \(Y^*_i\). If we denote the distribution parameters at a given \(x^*_i\) as \(\mu_{x^*_i}\) and \(\sigma_{x^*_i}\), then we’re drawing curves through (1) \(\mu_{x^*_i}\), (2) \(\mu_{x^*_i} + \sigma_{x^*_i}\) and (3) \(\mu_{x^*_i} - \sigma_{x^*_i}\), as shown below. Indeed, this is the standard way of visualizing these distributions.</p>
<!-- _includes/image.html -->
<div class="image-wrapper">
<img src="/assets/bayesopt/GP_y_dist_gaussian.png" alt="test" />
<p class="image-caption">How the bands are drawn.</p>
</div>
<p>Of course, you can draw these bands for other multipliers of the standard deviation, e.g., \(\mu_{x^*_i} + 2\sigma_{x^*_i}\) and \(\mu_{x^*_i} - 2\sigma_{x^*_i}\). This is shown below.</p>
<!-- _includes/image.html -->
<div class="image-wrapper">
<img src="/assets/bayesopt/GP_bands.png" alt="test" />
<p class="image-caption">Bands representing values at different standard deviations are shown.</p>
</div>
<p>This wraps up the overview of GPs. We’ll dwell on it just a bit more before we move on.</p>
<hr style="height: 50px; border: 0; box-shadow: inset 0 12px 12px -12px rgba(0, 0, 0, 0.5);" />
<h3 id="connection-with-bayesian-linear-regression">Connection with Bayesian Linear Regression</h3>
<p>In the previous section we motivated GPs by tying our requirements to properties of the multivariate Gaussian. That might’ve seemed a bit out of left field. But there is another way to look at GPs that is closer to familiar territory.</p>
<p>Consider <em>Linear Regression</em> for the case where we want to learn the model parameters in a Bayesian fashion.<br />
As before, let the test data be \(X^* \in \mathbb{R}^{M \times d}\), \(Y^* \in \mathbb{R}^{M \times 1}\), and let the training data be denoted by \(D\). We assume a linear model where \(y^*_i \sim \mathcal{N}(W_{[1...d]}x^*_i+W_0, \sigma^2)\). This says that \(y^*_i\) is almost linear in \(x^*_i\), i.e., \(y^*_i \approx W_{[1...d]}x^*_i+W_0\), but exhibits some perturbation or noise in the real world. This noisy linear relationship is concisely represented using a Normal distribution whose mean depends linearly on \(x^*_i\) and whose variance is \(\sigma^2\). Effectively, every \(y^*_i\) gets its own distribution centered at \(W_{[1...d]}x^*_i+W_0\), but the level of noise - captured by \(\sigma\) - is assumed fixed for the data. We also assume the prior over weights \(p(W)\) is Gaussian. Here \(W\) is the vector of all weights, i.e., \(W=[W_0, W_1, ..., W_{d}]\).</p>
<p>If we write down the expression for \(p(Y^* \vert X^*, D)\), we’ll notice that it involves a bunch of Gaussians that combine together to finally simplify into a Gaussian. This is shown below.</p>
\[\begin{aligned}
p(Y^*|X^*, D)& = \int_W p(Y^*|X^*, W) p(W|D) dW \\
&= \underbrace{\int_W \overbrace{p(Y^*|X^*, W)}^{prod. \; of\; Gauss.} \overbrace{\frac{\overbrace{p(D|W)}^{prod. \; of\; Gauss.\leftarrow i.i.d.} \times \overbrace{p(W)}^{Gaussian\; prior}}{Z}}^{Gaussian \leftarrow prod. \; of\; Gauss.} dW}_{Gaussian \leftarrow marginalization \; of\; a\; Gaussian}
\end{aligned}\]
<p>Here:</p>
<ul>
<li>
<p>\(p(D \vert W)\) is a product of Gaussians because we assume the data to be <em>independent and identically distributed (iid)</em>. Which is to say \(p(D \vert W) = \prod^N_{i=1} p(y_i \vert x_i, W)\). This results in a joint Gaussian distribution over \(Y_1, Y_2, ..., Y_N\). For the same reason, \(p(Y^* \vert X^*, W)\) is a Gaussian.</p>
</li>
<li>
<p>\(Z\) is a normalizing factor. We don’t require its exact value here.</p>
</li>
</ul>
<p>What we’re eventually saying above is that the output distribution is jointly Gaussian, much like for the GP. And, very similar to an earlier plot, we can visually guess this might be true. In the following plot, I draw a bunch of lines whose parameters - slope \({\color{WildStrawberry}{m}}\;(=W_1)\) and intercept \({\color{WildStrawberry}{c}}\;(=W_0)\) - are sampled from a Gaussian. For each of these lines, I determine the value of \(y^*_i=mx^*_i+c\) (this is equivalent to sampling from \(\mathcal{N}(mx^*_i+c, \sigma^2)\), where we have set \(\sigma=0\) for convenience), and show the spread of these values using a KDE plot (shown only for some arbitrarily-picked locations). This is what we get:</p>
<!-- _includes/image.html -->
<div class="image-wrapper">
<img src="/assets/bayesopt/lin_reg_gaussian.gif" alt="test" />
<p class="image-caption">Distribution of outputs at given locations for various lines. The slopes and intercepts come from a bivariate Gaussian - shown in the inset image. A white dot that appears corresponds to the parameters - slope and intercept - of the current line.</p>
</div>
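<p>Here is a minimal sketch of the kind of sampling behind the animation above; the particular Gaussian over the slope and intercept, and the chosen locations, are arbitrary choices for illustration:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import numpy as np
from matplotlib import pyplot as plt

rng = np.random.default_rng(0)
# sample (slope, intercept) pairs from a bivariate Gaussian prior
params = rng.multivariate_normal(mean=[1.0, 0.0],
                                 cov=[[0.25, 0.0], [0.0, 0.25]], size=200)
x_locs = np.array([2.0, 5.0, 8.0])     # locations at which we inspect the outputs
x_plot = np.linspace(0, 10, 50)

# draw a few of the sampled lines
for m, c in params[:30]:
    plt.plot(x_plot, m * x_plot + c, linewidth=0.3, alpha=0.5)

# outputs at the chosen locations across all sampled lines (sigma = 0, as in the text)
y_at_locs = params[:, 0:1] * x_locs + params[:, 1:2]   # shape: (200, 3)
plt.scatter(np.repeat(x_locs, len(params)), y_at_locs.T.ravel(), s=2)
plt.show()</code></pre></figure>
<p>Running a KDE over each column of <code class="language-plaintext highlighter-rouge">y_at_locs</code> should reproduce the per-location Gaussian shapes seen in the animation.</p>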
<p>It looks like the marginal output distributions per \(i\) are Gaussians, which is what we would expect from the joint being Gaussian. We have seen this before! Let’s bring up that plot.</p>
<!-- _includes/image.html -->
<div class="image-wrapper">
<img src="/assets/bayesopt/multiple_y_anim_10_clamped_trace_kde.gif" alt="test" />
<p class="image-caption">A plot we have seen earlier. Use of GPs to learn a general function. The outputs for a given input are Gaussian.</p>
</div>
<p>See how close this is to the Linear Regression case? But this is better because the function doesn’t have to be linear.
In a way, GPs “cut out the middleman” of the linear model, replacing whatever covariances the linear model leads to with an arbitrarily expressive kernel matrix.</p>
<hr style="height: 50px; border: 0; box-shadow: inset 0 12px 12px -12px rgba(0, 0, 0, 0.5);" />
<h3 id="minimal-code-for-gp">Minimal Code for GP</h3>
<p>GPs are conceptually unusual, but getting some code up and running is surprisingly easy. This is because the output distribution comes from just one equation. I’ll use the RBF kernel implementation in <code class="language-plaintext highlighter-rouge">sklearn.metrics.pairwise.rbf_kernel</code>. If we don’t worry about learning the length scale, and use the default, the following code is all you need for a functional GP.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">from</span> <span class="nn">sklearn.metrics.pairwise</span> <span class="kn">import</span> <span class="n">rbf_kernel</span> <span class="k">as</span> <span class="n">rbf</span>
<span class="kn">from</span> <span class="nn">scipy.stats</span> <span class="kn">import</span> <span class="n">multivariate_normal</span> <span class="k">as</span> <span class="n">mvn</span>
<span class="k">def</span> <span class="nf">gp_predict</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">X_test</span><span class="p">,</span> <span class="n">num_sample_functions</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span>
<span class="n">return_params</span><span class="o">=</span><span class="bp">False</span><span class="p">):</span>
<span class="c1"># calculate the various partitions
</span> <span class="n">K_x_x</span> <span class="o">=</span> <span class="n">rbf</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">X_train</span><span class="p">)</span>
<span class="n">K_x_x_star</span> <span class="o">=</span> <span class="n">rbf</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">X_test</span><span class="p">)</span>
<span class="n">K_x_star_x</span> <span class="o">=</span> <span class="n">K_x_x_star</span><span class="p">.</span><span class="n">T</span>
<span class="n">K_x_star_x_star</span> <span class="o">=</span> <span class="n">rbf</span><span class="p">(</span><span class="n">X_test</span><span class="p">,</span> <span class="n">X_test</span><span class="p">)</span>
<span class="c1"># calculate the mean and cov. for the joint output distribution
</span> <span class="n">mean</span> <span class="o">=</span> <span class="n">K_x_star_x</span> <span class="o">@</span> <span class="n">np</span><span class="p">.</span><span class="n">linalg</span><span class="p">.</span><span class="n">inv</span><span class="p">(</span><span class="n">K_x_x</span><span class="p">)</span> <span class="o">@</span> <span class="n">y_train</span><span class="p">.</span><span class="n">reshape</span><span class="p">((</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">))</span>
<span class="n">mean</span> <span class="o">=</span> <span class="n">mean</span><span class="p">.</span><span class="n">flatten</span><span class="p">()</span>
<span class="n">cov</span> <span class="o">=</span> <span class="n">K_x_star_x_star</span> <span class="o">-</span> <span class="n">K_x_star_x</span> <span class="o">@</span> <span class="n">np</span><span class="p">.</span><span class="n">linalg</span><span class="p">.</span><span class="n">inv</span><span class="p">(</span><span class="n">K_x_x</span><span class="p">)</span> <span class="o">@</span> <span class="n">K_x_x_star</span>
<span class="c1"># sample some functions
</span> <span class="c1"># y_star is an array of size num_sample_functions x M
</span> <span class="n">y_star</span> <span class="o">=</span> <span class="n">mvn</span><span class="p">.</span><span class="n">rvs</span><span class="p">(</span><span class="n">mean</span><span class="o">=</span><span class="n">mean</span><span class="p">,</span> <span class="n">cov</span><span class="o">=</span><span class="n">cov</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="n">num_sample_functions</span><span class="p">)</span>
<span class="c1"># That's it folks!
</span> <span class="k">if</span> <span class="n">return_params</span><span class="p">:</span>
<span class="k">return</span> <span class="n">y_star</span><span class="p">,</span> <span class="n">mean</span><span class="p">,</span> <span class="n">cov</span>
<span class="k">else</span><span class="p">:</span>
<span class="k">return</span> <span class="n">y_star</span></code></pre></figure>
<p>This is it. Really. The above code is training, prediction and prediction probabilities all rolled into one.</p>
<p>Something to note here: because of practical issues with numerical stability - such as those that might arise when calculating the covariance matrix - an alternative to sampling from this multivariate Normal directly is to sample from the <em>standard Normal</em> (one with zero mean and unit variance) and transform the output afterwards (for an example, see how the <a href="https://en.wikipedia.org/wiki/Cholesky_decomposition">Cholesky decomposition</a> is used <a href="https://youtu.be/4vGiHC35j9s?si=w4iDRlLPSsZJ1jhy&t=3809">here</a>).</p>
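<p>As a rough sketch of that trick: if \(z \sim \mathcal{N}(0, I)\) and \(LL^T\) is the Cholesky factorization of the covariance, then \(\mu + Lz \sim \mathcal{N}(\mu, LL^T)\). Reusing the <code class="language-plaintext highlighter-rouge">mean</code>, <code class="language-plaintext highlighter-rouge">cov</code> and <code class="language-plaintext highlighter-rouge">num_sample_functions</code> variables from <code class="language-plaintext highlighter-rouge">gp_predict()</code> above, the sampling line could be replaced by something like this (the jitter term is an assumption added for stability):</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"># if z ~ N(0, I), then mean + L_chol @ z ~ N(mean, L_chol @ L_chol.T) = N(mean, cov)
jitter = 1e-8 * np.eye(len(mean))
L_chol = np.linalg.cholesky(cov + jitter)
z = np.random.standard_normal((num_sample_functions, len(mean)))
y_star = mean + z @ L_chol.T     # each row is one sampled function</code></pre></figure>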
<p>Do we want to take it out for a spin? Of course we do! Let’s generate some data on a sine curve, and sample some functions.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c1"># NOTE: please ensure previous imports and the gp_predict() function
# are available to this code!
</span>
<span class="kn">from</span> <span class="nn">matplotlib</span> <span class="kn">import</span> <span class="n">pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="n">num_train</span><span class="p">,</span> <span class="n">num_test</span> <span class="o">=</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">100</span>
<span class="n">X_train</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linspace</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">2</span><span class="o">*</span><span class="n">np</span><span class="p">.</span><span class="n">pi</span><span class="p">,</span> <span class="n">num_train</span><span class="p">)</span>
<span class="n">y_train</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">sin</span><span class="p">(</span><span class="n">X_train</span><span class="p">)</span>
<span class="n">X_test</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linspace</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">2</span><span class="o">*</span><span class="n">np</span><span class="p">.</span><span class="n">pi</span><span class="p">,</span> <span class="n">num_test</span><span class="p">)</span>
<span class="c1"># we need to reshape the arrays since 2D inputs are expected
</span><span class="n">multiple_y_star</span> <span class="o">=</span> <span class="n">gp_predict</span><span class="p">(</span><span class="n">X_train</span><span class="p">.</span><span class="n">reshape</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="n">y_train</span><span class="p">,</span>
<span class="n">X_test</span><span class="p">.</span><span class="n">reshape</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span>
<span class="n">num_sample_functions</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
<span class="c1"># let's plot the sampled outputs
</span><span class="k">for</span> <span class="n">idx</span><span class="p">,</span> <span class="n">y</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">multiple_y_star</span><span class="p">):</span>
<span class="n">plt</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">X_test</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">linewidth</span><span class="o">=</span><span class="mf">0.75</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="sa">f</span><span class="s">"f_</span><span class="si">{</span><span class="n">idx</span><span class="o">+</span><span class="mi">1</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="c1"># let's also show the training data using a scatter plot
</span><span class="n">plt</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">c</span><span class="o">=</span><span class="s">'k'</span><span class="p">,</span> <span class="n">s</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'train'</span><span class="p">)</span>
<span class="c1"># let's now plot the true test function
</span><span class="n">plt</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">X_test</span><span class="p">,</span> <span class="n">np</span><span class="p">.</span><span class="n">sin</span><span class="p">(</span><span class="n">X_test</span><span class="p">),</span> <span class="n">linewidth</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">c</span><span class="o">=</span><span class="s">'k'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">"true fn"</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">legend</span><span class="p">()</span>
<span class="n">plt</span><span class="p">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">'x'</span><span class="p">);</span> <span class="n">plt</span><span class="p">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'y'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span></code></pre></figure>
<p>This is what we obtain:</p>
<!-- _includes/image.html -->
<div class="image-wrapper">
<img src="/assets/bayesopt/minimal_gp.png" alt="test" />
<p class="image-caption">Output from the above codes.</p>
</div>
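<p>As a small follow-up, the <code class="language-plaintext highlighter-rouge">return_params</code> flag lets you draw the mean and standard-deviation bands we discussed earlier. Here is a minimal sketch, reusing the <code class="language-plaintext highlighter-rouge">X_train</code>, <code class="language-plaintext highlighter-rouge">y_train</code> and <code class="language-plaintext highlighter-rouge">X_test</code> arrays from the snippet above:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">_, mean, cov = gp_predict(X_train.reshape(-1, 1), y_train,
                          X_test.reshape(-1, 1), return_params=True)
# per-point standard deviations come from the diagonal of the covariance matrix;
# the clip guards against tiny negative values caused by numerical error
std = np.sqrt(np.clip(np.diag(cov), 0, None))
plt.plot(X_test, mean, linestyle='--', label='mean')
plt.fill_between(X_test, mean - 2 * std, mean + 2 * std, alpha=0.2, label='2 std. dev. band')
plt.scatter(X_train, y_train, c='k', s=100, label='train')
plt.legend(); plt.show()</code></pre></figure>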
<p>Try to think of another model that’s this concise but affords us the same expressivity as a GP - both in terms of the complexity of functions it can fit and in uncertainty quantification. I can think of <em>k-Nearest Neighbors</em> and <em>Linear Regression</em> as candidates for conciseness, but they are limited in their expressivity.</p>
<hr style="height: 50px; border: 0; box-shadow: inset 0 12px 12px -12px rgba(0, 0, 0, 0.5);" />
<h3 id="summary-and-challenges">Summary and Challenges</h3>
<p>We are finally at the end of our whirlwind tour of GPs. I think of them as a neat way to fit complex models and obtain uncertainty quantification. They can also model noise in the observed data - something we haven’t covered here. And they’re concise. So, why haven’t they taken over the world? A big problem is their runtime.</p>
<p>Note that GPs don’t really have a training phase (at least that’s what we’ll assume here; in practice, we do need to learn parameters such as the length scale) and keep around all the training data for test-time predictions. Unfortunately, the volume of this data shows up as expensive matrix operations at test time. Recall these final parameters:</p>
<ul>
<li>Mean: \(K_{X^*X} {\color{WildStrawberry}{K^{-1}_{XX}}}y\)</li>
<li>Covariance: \(K_{X^*X^*} - K_{X^*X} {\color{WildStrawberry}{K^{-1}_{XX}}} K_{XX^*}\)</li>
</ul>
<p>The inversion is probably the biggest culprit, since its runtime complexity is \(O(n^3)\). You could make this somewhat faster and more numerically stable using the Cholesky decomposition (Algorithm 2.1 <a href="https://gaussianprocess.org/gpml/chapters/RW.pdf">here</a>); this reduces the constant factors in the runtime, but it is still \(O(n^3)\). You might precompute \(K^{-1}_{XX}\) - but you’ll still need to keep the training data around (for \(K_{X^*X}\), so this is memory intensive) and, while the compute would then be lower, it’s now \(O(n^2)\), which is still expensive for large training datasets.</p>
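<p>For concreteness, here is a minimal sketch of what the Cholesky-based route might look like in place of the explicit inverse in <code class="language-plaintext highlighter-rouge">gp_predict()</code> (variable names follow the earlier code; the jitter term is an assumption added for stability):</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">from scipy.linalg import cho_factor, cho_solve

# factor K_XX once and reuse the factor for both the mean and the covariance
factor = cho_factor(K_x_x + 1e-9 * np.eye(len(K_x_x)))
alpha = cho_solve(factor, y_train.reshape(-1, 1))      # solves K_XX @ alpha = y
mean = (K_x_star_x @ alpha).flatten()
cov = K_x_star_x_star - K_x_star_x @ cho_solve(factor, K_x_x_star)</code></pre></figure>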
<p>As an aside, this is why GPs are <em>non-parametric</em> models. Contrast this with parametric models like Logistic Regression, where we distill (and then discard) the training data into a few parameters for use at test-time.</p>
<p>There has been lots of work on speeding up GPs, e.g., by using <em>inducing points</em>, which are a small set of training points that produce approximately the same results as the entire data <a class="citation" href="#kissgp">(Wilson & Nickisch, 2015; Snelson & Ghahramani, 2005)</a>. On a more pragmatic note though, the notion of what’s a “large” model wrt memory and compute has been challenged in the past few years, by Deep Learning, and specifically by generative models like LLMs. It would make sense to evaluate GPs for a task given the actual practical constraints at hand.</p>
<p>I’ve listed some learning resources in the sequel post <a href="/bayesopt_2_acq_fns#summary-a-pet-peeve-and-learning-resources">here</a>.</p>
<hr style="height: 50px; border: 0; box-shadow: inset 0 12px 12px -12px rgba(0, 0, 0, 0.5);" />
<p>We looked at one critical ingredient of BayesOpt here - surrogate models. We’ll discuss the other critical component - acquisition functions - in the next post. That also contains a list of learning resources at the end. See you there!</p>
<h2 id="references">References</h2>
<ol class="bibliography"><li><span id="bayesopt_is_superior">Turner, R., Eriksson, D., McCourt, M., Kiili, J., Laaksonen, E., Xu, Z., & Guyon, I. (2021). Bayesian Optimization is Superior to Random Search for Machine Learning Hyperparameter Tuning: Analysis of the Black-Box Optimization Challenge 2020. In H. J. Escalante & K. Hofmann (Eds.), <i>Proceedings of the NeurIPS 2020 Competition and Demonstration Track</i> (Vol. 133, pp. 3–26). PMLR. https://proceedings.mlr.press/v133/turner21a.html</span>
<div class="dropDownAbstract" id="bayesopt_is_superior-abstract">
</div>
</li>
<li><span id="white2019bananas">White, C., Neiswanger, W., & Savani, Y. (2021). BANANAS: Bayesian Optimization with Neural Architectures for Neural Architecture Search. <i>Proceedings of the AAAI Conference on Artificial Intelligence</i>.</span>
<div class="dropDownAbstract" id="white2019bananas-abstract">
</div>
</li>
<li><span id="LSBO_paper">Gómez-Bombarelli, R., Wei, J. N., Duvenaud, D., Hernández-Lobato, J. M., Sánchez-Lengeling, B., Sheberla, D., Aguilera-Iparraguirre, J., Hirzel, T. D., Adams, R. P., & Aspuru-Guzik, A. (2018). Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules. <i>ACS Central Science</i>, <i>4</i>(2), 268–276. https://doi.org/10.1021/acscentsci.7b00572</span>
<div class="dropDownAbstract" id="LSBO_paper-abstract">