<!DOCTYPE html>
<html>
<head>
<title>Data Mining</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<style type="text/css">
@import url(http://fonts.googleapis.com/css?family=Droid+Serif);
@import url(http://fonts.googleapis.com/css?family=Yanone+Kaffeesatz);
body {
font-family: 'Droid Serif';
font-size: 25px;
}
.remark-slide-content {
padding: 1em 2em 1em 2em;
}
h1, h2, h3 {
font-family: 'Yanone Kaffeesatz';
font-weight: 400;
margin-top: 0;
margin-bottom: 0;
}
h1 { font-size: 3em; }
h2 { font-size: 1.8em; }
h3 { font-size: 1.4em; }
.footnote {
position: absolute;
bottom: 3em;
}
ul { margin: 8px;}
li p { line-height: 1.25em; }
.red { color: #fa0000; }
.large { font-size: 2em; }
a, a > code {
color: rgb(249, 38, 114);
text-decoration: none;
}
code {
-moz-border-radius: 3px;
-webkit-border-radius: 3px;
background: #e7e8e2;
color: black;
border-radius: 3px;
}
.tight-code {
font-size: 20px;
}
.white-background {
background-color: white;
padding: 10px;
display: block;
margin-left: auto;
margin-right: auto;
}
.limit-size img {
height: auto;
width: auto;
max-width: 1000px;
max-height: 500px;
}
em { color: #80cafa; }
.pull-left {
float: left;
width: 47%;
}
.pull-right {
float: right;
width: 47%;
}
.pull-right ~ p {
clear: both;
}
#slideshow .slide .content code {
font-size: 1.6em;
}
#slideshow .slide .content pre code {
font-size: 1.6em;
padding: 15px;
}
.inverse {
background: #272822;
color: #e3e3e3;
text-shadow: 0 0 20px #333;
}
.inverse h1, .inverse h2 {
color: #f3f3f3;
line-height: 1.6em;
}
/* Slide-specific styling */
#slide-inverse .footnote {
bottom: 12px;
left: 20px;
}
#slide-how .slides {
font-size: 1.6em;
position: absolute;
top: 151px;
right: 140px;
}
#slide-how .slides h3 {
margin-top: 0.2em;
}
#slide-how .slides .first, #slide-how .slides .second {
padding: 1px 20px;
height: 90px;
width: 120px;
-moz-box-shadow: 0 0 10px #777;
-webkit-box-shadow: 0 0 10px #777;
box-shadow: 0 0 10px #777;
}
#slide-how .slides .first {
background: #fff;
position: absolute;
top: 20%;
left: 20%;
z-index: 1;
}
#slide-how .slides .second {
position: relative;
background: #fff;
z-index: 0;
}
.center {
text-align: center;
}
/* Two-column layout */
.left-column {
width: 48%;
float: left;
}
.right-column {
width: 48%;
float: right;
}
.right-column img {
max-width: 120%;
max-height: 120%;
}
/* Tables */
table {
border-collapse: collapse;
margin: 0px;
}
table, th, td {
border: 1px solid white;
}
th, td {
padding: 7px;
}
</style>
</head>
<body>
<textarea id="source">
name: inverse
layout: true
class: left, top, inverse
---
# Preprocessing
---
## Real World is Dirty
### Incomplete
missing timestamps for actions
### Noisy
salary = -10
### Inconsistent
age: 42, birthday: 1997-03-07
???
## Types of dirty
### Incomplete
lacking some attribute values, containing only aggregate data.
e.g., we often regret not including timestamps on individual actions
like UFC'ing, and instead tracking only total votes (aggregation)
### Noisy
Containing errors, like impossible salary data, or decimals in the
wrong place
### Inconsistent
If two fields depend on each other in a large dataset, you'll find them
disagreeing. Errors often come from failures, e.g., a process dying halfway
through an update
---
## Causes of Problems
+ Humans
+ Software
+ Hardware
???
## Problems
+ Berkeley experiment to measure temperature across campus
+ Turned out average on campus much warmer than external weather services
predicted
+ But sample data looked in line with predictions
+ Problem: one monitoring station right next to air conditioning unit!
+ Hardware failure is rare, but with large numbers of machines, probable. e.g.,
RAM can suffer ~1 bit error/hour/gigabyte (ECC can help)
---
## Inconsistent Different Sources
+ Great value in combining data sources
+ Challenge is merging them together, removing duplicates
+ Example: Business names
???
## Business names
+ Starbucks vs. Starbucks Coffee Shop
+ Buck's vs Bucks
+ Trying to use address? Stackbucks vs. Starbucks across the street
+ Best strategy here is to use DM/ML techniques on the *combination* of
features to determine likelihood of match. We'll discuss specific
algorithms later in the course
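
As a rough illustration of just one such feature (not the combined DM/ML approach mentioned above), plain string similarity already separates some of these pairs; a minimal Python sketch:

```python
from difflib import SequenceMatcher

# Rough name-similarity scores between candidate duplicate businesses.
# A real deduper would combine this with address, phone, category, etc.
pairs = [
    ("Starbucks", "Starbucks Coffee Shop"),
    ("Buck's", "Bucks"),
    ("Starbucks", "Peet's Coffee"),
]

for a, b in pairs:
    score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    print(f"{a} vs {b}: {score:.2f}")
```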
---
## Preprocessing
### Cleaning
fill missing values, smooth noisy data, identify or remove
outliers, resolve inconsistencies
### Integration
merging data from multiple sources
### Reduction
obtain a smaller data set that can sufficiently answer
important questions
### Transformation
change data to a form that is easier to mine or analyze
???
## Flu Trend Problems (Questions)
+ We have millisecond search resolution, but will only be plotting on a per day basis
+ We have the exact text of each query, but just care if it is about the flu or not
+ In Flu Trends, we sometimes see out-of-control search bots doing 100,000s of searches per day
+ Mobile phone searches and web searches hit different machines, software, logs
+ We have IPs in the logs, but will be plotting against geographical areas
---
## Missing Values
.left-column[
| Person | Height |
|--------|--------|
| Bob | 6'0 |
| Ashley | - |
| Sam | 5'11 |
| Alice | 5'9 |
| Kate | - |
]
.right-column[
<img src="img/tallest-shortest-man.jpg" width=100% />
]
???
## What to do?
+ (Heights are made up)
+ We want to get an average class height
+ Q: What to do with missing rows?
+ ignore, fill, constant, average, average wrt gender
---
## Fill Missing Values
???
## Details
+ Trade-offs
+ core to engineering
---
## Fill Missing Values
+ Ignore the record
???
## Details
+ Ignore
+ simply drop the record from the data set, and hope there are not too many to affect the
answer. Drawbacks? When the missing values are all from the same class, this skews the data
---
## Fill Missing Values
+ Ignore the record
+ Find value manually
???
## Details
+ Find value manually
+ Even for a small class, might be difficult. Get
ruler, measure them. For historical data, impossible.
---
## Fill Missing Values
+ Ignore the record
+ Find value manually
+ Global constant
???
## Details
+ Global constant
+ replace with "N/A" or "6 foot". Can skew data, or cause
data to pop in other analysis (all grouped together)
---
## Fill Missing Values
+ Ignore the record
+ Find value manually
+ Global constant
+ Average
???
## Details
+ Average
+ Mean or median. Either one has potential problems.
---
## Fill Missing Values
+ Ignore the record
+ Find value manually
+ Global constant
+ Average
+ Average with respect to class
???
## Details
+ Average with respect to class
+ gender. Average female/male height to fill
in values
---
## Fill Missing Values
+ Ignore the record
+ Find value manually
+ Global constant
+ Average
+ Average with respect to class
+ "Most probable"
???
## Details
+ "Most probable"
+ Think of it as another step from avg -> class avg. Now
throw in other details: age, family history, shoe size. Then weight them
depending on how much those factors are correlated. Pretty soon you have a
regression or Bayesian model, which we'll cover later (a quick pandas sketch
of the simpler strategies follows)
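
A minimal pandas sketch of a few of these strategies (the roster and genders below are invented to match the earlier table):

```python
import pandas as pd

# Hypothetical roster with missing heights (values in inches, made up).
df = pd.DataFrame({
    "person": ["Bob", "Ashley", "Sam", "Alice", "Kate"],
    "gender": ["M", "F", "M", "F", "F"],
    "height": [72.0, None, 71.0, 69.0, None],
})

dropped   = df.dropna(subset=["height"])                # ignore the record
constant  = df["height"].fillna(72.0)                   # global constant
mean_fill = df["height"].fillna(df["height"].mean())    # overall average
by_class  = df["height"].fillna(                        # average w.r.t. class
    df.groupby("gender")["height"].transform("mean"))
```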
---
## Normalization
+ Type of data transformation to make reasoning and comparison easier
+ Is 6' tall?
+ Makes coefficients on attributes in regressions easier to interpret
???
## Context, Comparison
+ 6' Might be tall for this class, but not on a basketball team
+ How to know when a data point is "average" or towards the top of a range?
+ For our housing model, we wanted to use sq. footage and # of bedrooms. But
the sq. footage numbers are huge compared to bedroom counts. If we didn't
normalize, a formula for determining house price might seem to indicate that #
of bedrooms was way more important
---
## Min-max
.white-background[
<img src="img/min-max.gif">
]
???
## New Range
+ Typically the new range is
+ [0, 1] (thought of as a %)
+ [-1, 1] (thought of as bad -> good)
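
In code, min-max rescaling is a single linear map; a small NumPy sketch (the range arguments and sample heights are just illustrative):

```python
import numpy as np

def min_max(x, new_min=0.0, new_max=1.0):
    """Linearly rescale x so its min maps to new_min and its max to new_max."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min()) * (new_max - new_min) + new_min

heights = [72, 71, 69, 75, 66]
print(min_max(heights))          # values in [0, 1]
print(min_max(heights, -1, 1))   # values in [-1, 1]
```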
---
## Z-score
.white-background[
<img src="img/z-score.gif" width=100% />
]
???
## Uses
+ When you want a relative measure of deviation
+ When you have a distribution estimate, but are unsure of absolute min-max
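
And a corresponding z-score sketch: each value becomes its distance from the mean, measured in standard deviations (same made-up heights as above):

```python
import numpy as np

def z_score(x):
    """Standardize x: 0 mean, unit standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

heights = [72, 71, 69, 75, 66]
print(z_score(heights))   # e.g., 75 is ~+1.5 SD, 66 is ~-1.5 SD
```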
---
## Comparison
<img src="img/outliers.png" width=100% />
<img src="img/outliers-minmax-zscore.png" width=100% />
???
## Min-max vs Z-score
+ Min-max: Known range
+ Z-score: more expressive range
+ Min-max: requires knowing min-max
+ Z-score: can estimate with sampling or informed guess
---
## Removing Noise
### Binning
create B bins << N data samples, use aggregate statistic of bin
for value
### Regression
fit data to a function, use function value
### Outlier analysis
find outlying points, understand and/or ignore them
???
## Monitoring Problem
+ For the problem encountered in temperature monitoring, which makes the most
sense?
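
A small pandas sketch of equal-width binning, smoothing each reading to its bin mean (the readings and bin count are made up):

```python
import pandas as pd

# Temperature readings with one suspiciously high value.
s = pd.Series([21.0, 21.5, 22.0, 22.3, 35.0, 21.8, 22.1])

bins = pd.cut(s, bins=3)                      # B equal-width bins, B << N
smoothed = s.groupby(bins).transform("mean")  # replace each value by its bin mean
print(smoothed)                               # the 35.0 ends up alone in its bin
```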
---
## Trade-offs
### Binning
Simple way to remove outliers, but difficult to pick buckets
correctly
### Regression
If one metric is a direct function of another, what extra
information does the value provide?
### Outlier analysis
Manual process of understanding outliers, ignoring them
can obscure some analysis (e.g., income disparity)
???
## Trade-offs again
+ Remember: this class is exposing you to potential tools, it's up to you
to be asking the right questions, selecting the appropriate algorithms,
interpreting results
---
## Data integration
+ Merging two data sources
+ Problem: uniquely identify a concept in both sources
+ Find data points that are very "close" to each other, call them the same
with some probability
+ Example: [Yelp Menu Data](http://www.yelp.com/menu/tartine-bakery-san-francisco)
???
## Yelp Menu Data
+ Launched menu data in 2012
+ Takes data about the restaurant menu, finds reviews & pictures referring to
each menu item
+ Joins them together
+ Many different metrics for "close": remember them?
---
## Other measures of "close"
Are ```A``` and ```B``` close?
| A | B |
|----|-----|
| 2 | 60 |
| 5 | 150 |
| 6 | 180 |
| 10 | 300 |
| 13 | 390 |
???
## Correlation
+ Imagine ```A``` and ```B``` have several different dimensions, maybe things like
length, height, width, radius
+ Are they similar?
+ On one hand no: clearly different order of magnitude
+ Another way to think about similarity is correlation
+ All of ```B```'s values are 30x those of ```A```
+ Maybe just using different units!
+ If I plotted ```A``` and ```B``` as x,y, what would the result look like?
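
A quick numeric check of that intuition (```A``` and ```B``` taken from the table above):

```python
import numpy as np

A = np.array([2, 5, 6, 10, 13])
B = np.array([60, 150, 180, 300, 390])   # exactly 30x A

print(np.corrcoef(A, B)[0, 1])           # 1.0: perfectly linearly related
```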
---
## χ<sup>2</sup> Correlation Test
<img src="img/correlation.png" width=100%/>
<img src="img/chiequation.jpg" width=100%/>
???
## Motivation
+ Answer: a straight line
+ So a correlation coefficient gives a sense of how closely *linearly*
related two data sets are
+ Note: besides being positive or negative, the slope does not affect the correlation
score, just how well the data fit the line
+ Also note I said linear: data may still exhibit patterns that are not
linearly related, and correlation will miss those (the 30x example *is* linear,
so it correlates perfectly)
+ Details of test are in book, you are expected to understand it
+ Motivation: how different are the observed values from the expected?
+ Expected counts are calculated using probability under the assumption that the sets
are *independent*
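
For nominal attributes the test compares observed counts against the counts expected if the attributes were independent; a minimal SciPy sketch (the contingency table is invented for illustration):

```python
from scipy.stats import chi2_contingency

# Invented 2x2 contingency table: rows = attribute A yes/no,
# columns = attribute B yes/no.
observed = [[250, 200],
            [ 50, 1000]]

chi2, p, dof, expected = chi2_contingency(observed)
print(chi2, p)    # large chi2 / tiny p-value -> evidence the attributes are related
print(expected)   # counts we would expect under independence
```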
---
## Covariance & Correlation
+ Correlation is "normalized" covariance
+ Covariance describes the degree to which two data sets track each other in
units of the two data sets
+ Correlation describes the degree of similarity without units
???
## Use in industry
+ χ<sup>2</sup> used most commonly, handy to have an expected [0-1] range
+ "Correlation does not imply causation"
+ A->B, B->A, C->A,B, A->B->A..., coincidence
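
In NumPy terms: correlation is the covariance divided by the product of the two standard deviations (toy numbers below):

```python
import numpy as np

x = np.array([2.0, 5.0, 6.0, 10.0, 13.0])
y = np.array([3.0, 4.0, 7.0, 11.0, 12.0])

cov = np.cov(x, y)[0, 1]                              # carries the units of x times y
corr = cov / (np.std(x, ddof=1) * np.std(y, ddof=1))  # unitless, in [-1, 1]
print(cov, corr, np.corrcoef(x, y)[0, 1])             # the last two values match
```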
---
## Data Reduction
### Dimensionality
remove attributes that are the same or similar to other attributes
### Numerosity
represent or aggregate the data, sometimes with precision loss
### Compression
generalized techniques to decrease the number of bytes needed
to store data
???
## Deep Dive
+ We're only going to cover selected topics in these areas.
+ When reading, make sure to understand the intuition behind the other
techniques, but if we don't cover it in lecture, you won't need to
calculate it on the midterm
+ Ask questions about the concepts you don't understand! That's what
separates this class from a book :)
+ But still potentially useful for your projects!
---
## Subset Selection
+ Too many attributes?
+ *Ignore some*
+ Tricky part: which to ignore?
+ height x width = area
???
## Simple to Sophisticated
+ Ignore the ones that are not helpful
+ Ignore an attribute highly correlated with another (cm, in); see the sketch below
+ Ignore an attribute that can be built from others
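
A pandas sketch of the "highly correlated" heuristic: compute pairwise correlations and drop one attribute of any pair above a threshold (the data and the 0.95 cutoff are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "height_cm": [170, 180, 165, 190],
    "height_in": [66.9, 70.9, 65.0, 74.8],   # same information, different units
    "weight_kg": [80, 62, 75, 70],
})

corr = df.corr().abs()
# Keep only the upper triangle so each pair is considered once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
reduced = df.drop(columns=to_drop)
print(to_drop)          # ['height_in']
```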
---
## Principal Component Analysis
+ Map data to a location along a few vectors
<img src="img/GaussianScatterPCA.png"/>
???
## Higher dimensions
+ Remember, 2 dimensions might not make much sense here, but this becomes useful with a
higher number of dimensions
+ These points described by two attributes, <x,y>
+ What if we wanted to describe them in just 1 dimension?
+ Pick some good vectors (in our case 1)
+ Describe where a point is located using only those vectors
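
A small scikit-learn sketch: describe 2-D points with a single principal component, then map back to see how much structure one direction keeps (the data here is synthetic):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x = rng.normal(size=200)
# Correlated 2-D cloud, roughly like the scatter plot on the slide.
points = np.column_stack([x, 0.8 * x + rng.normal(scale=0.3, size=200)])

pca = PCA(n_components=1)
coords_1d = pca.fit_transform(points)       # each point as a single number
approx = pca.inverse_transform(coords_1d)   # back to 2-D along that component
print(pca.explained_variance_ratio_)        # fraction of variance retained
```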
---
## Netflix and PCA
+ A user may have many preferences: Mission Impossible, Love Actually, Man
from Nowhere, ...
+ Instead of keeping track of every preference, we can summarize
+ Action, RomCom, Foreign
???
## Summarize in discovered dimensions
+ With 3 or more "categories", we can reconstruct the user's likely
preferences
+ Dimensions don't necessarily fit into human notions: there probably is not a
"foreign" dimension, but rather a subtle combination of other aspects
</textarea>
<script src="production/remark-0.5.9.min.js" type="text/javascript">
</script>
<script type="text/javascript">
var slideshow = remark.create();
</script>
</body>
</html>