---
title: "Classification Methods"
subtitle: "FEF3001 Yapay zekaya giriş - Ders5"
author: "Alper Yılmaz"
date: 2024-07-19
format:
  revealjs:
    chalkboard: true
    css: custom.css
    smaller: true
    scrollable: true
    controls: true
    touch: true
    history: false
    progress: true
    slide-number: c/t
typora-copy-images-to: images
---
## Contents
:::: {.columns}
::: {.column width="60%"}
* Introduction to supervised learning
* Definition and applications of classification
* Preparing the data
    * Feature selection and preprocessing ✅
* Methods
    * Decision Trees
    * Random Forest
    * Support Vector Machines (SVM)
    * Logistic Regression
    * K-nearest neighbor
    * Naive Bayes
    * Artificial Neural Networks ✅
:::
::: {.column width="40%"}
* Ensemble methods
* Evaluation ✅
    * Confusion Matrix ✅
    * Accuracy, precision, recall, F1-score ✅
    * ROC curves ✅
* Overfitting and underfitting ✅
* Cross-validation
:::
::::
##
![](attachments/11AW_2jPV1YoXlt7wUWf-lA.webp)
[Source](https://medium.com/@metehankozan/supervised-and-unsupervised-learning-an-intuitive-approach-cd8f8f64b644)
##
![](attachments/1PaGnLVKs6lCuL3r9p1c2LQ.webp)
[Source](https://medium.com/@metehankozan/supervised-and-unsupervised-learning-an-intuitive-approach-cd8f8f64b644)
## Definition of classification
* A type of supervised learning
* **Goal**: Categorize input data into predefined classes or categories
* The model learns to draw decision boundaries between classes
* Output is a discrete class label (unlike regression, which predicts continuous values)
## Applications
:::: {.columns}
::: {.column width="50%"}
1. Text Classification
    * Spam detection in emails
    * Sentiment analysis of product reviews
    * News article categorization
2. Image Classification
    * Medical imaging for disease detection
    * Facial recognition systems
    * Plant or animal species identification
3. Financial Applications
    * Credit scoring (approve/deny loan applications)
    * Fraud detection in transactions
:::
::: {.column width="50%"}
4. Healthcare
    * Disease diagnosis based on symptoms and test results
    * Predicting patient readmission risk
5. Environmental Science
    * Climate pattern classification
    * Species habitat prediction
6. Literature and Linguistics
    * Authorship attribution
    * Genre classification of texts
    * Language identification
:::
::::
## Your Turn
In the Zoom chat window, please write down your department and an example of a classification task related to your domain.
## Your Turn
Pick one example and discuss the data it would need.
Visit Kaggle and find a related dataset.
## Decision Trees
Decision Trees are a classification method that uses a tree-like model of decisions and their possible consequences. The algorithm learns a series of if-then-else decision rules that split the data based on feature values, creating a structure that resembles a flowchart. Each internal node represents a "test" on an attribute, each branch represents the outcome of the test, and each leaf node represents a class label or decision.
**branch**, **test**, **leaf**
## Example
<!--
flowchart TD
A["Hours<br>studied?"]
B["Previous<br>score?"]
C["Attended<br>review?"]
D[Fail]
E[Pass]
F[Pass]
G[Pass]
A ==>|"< 5"| B
A ==>|"≥ 5"| C
B ==>|"< 70"| D
B ==>|"≥ 70"| E
C ==>|No| F
C ==>|Yes| G
classDef default fill:#f9f9f9,stroke:#333,stroke-width:2px;
classDef decision fill:#e1ecf7,stroke:#333,stroke-width:2px;
classDef outcome fill:#f0f0f0,stroke:#333,stroke-width:2px;
class A,B,C decision;
class D,E,F,G outcome;
style A color:#333,font-weight:bold
style B color:#333,font-weight:bold
style C color:#333,font-weight:bold
style D color:#d9534f,font-weight:bold
style E color:#5cb85c,font-weight:bold
style F color:#5cb85c,font-weight:bold
style G color:#5cb85c,font-weight:bold
-->
![](images/decision-tree-example.png){fig-align="center"}
. . .
| Hours studied | Previous Score | Attended Review | Pass? |
|:---------------:|:----------------:|:-----------------:|:------------:|
| 3 | 60 | No | ? |
| 4 | 75 | No | ? |
| 7 | 80 | Yes | ? |
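
The tree in the figure can be read as nested if/else rules. A minimal Python sketch, assuming the figure uses the thresholds 5 hours and 70 points and that both outcomes of "Attended review?" lead to a "Pass" leaf; the function name `predict_pass` is hypothetical:

```python
def predict_pass(hours_studied, previous_score, attended_review):
    """Walk the example tree from the root to a leaf and return the class label."""
    if hours_studied < 5:
        # Left branch: the next test is on the previous score
        return "Pass" if previous_score >= 70 else "Fail"
    # Right branch: the next test is "Attended review?", but in this
    # example tree both answers end in the same leaf, so the value
    # of attended_review does not change the result
    return "Pass"


# Inputs chosen for illustration (not the rows of the table above)
print(predict_pass(2, 50, False))
print(predict_pass(6, 40, True))
```

Each call follows exactly one path from the root to a leaf, which is why tree predictions are easy to explain.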
## Key Metrics for Decision Tree Construction
*Questions*: Which feature is the first branch? At what value do we create a branch (5 hours, 70 points, etc.)?
. . .
* Entropy
* Information Gain
* Gini Impurity
## Entropy
* Entropy is a measure of impurity or uncertainty in a set of examples. In the context of decision trees, it quantifies the disorder in the class labels of a dataset.
Formula: $H(S) = -\sum_{i=1}^{c} p_i \log_2(p_i)$
Where $S$ is the dataset, $c$ is the number of classes, and $p_i$ is the proportion of examples belonging to class $i$.
* Ranges from 0 (completely pure, all examples belong to one class) to $\log_2(c)$ (completely impure, equal distribution across all classes).
* Used to calculate information gain.
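
To make the formula concrete, here is a minimal Python sketch that computes entropy from a list of class labels. The function name and the toy labels are made up for illustration. It uses the equivalent form $\sum_i p_i \log_2(1/p_i)$, which is the same quantity as $-\sum_i p_i \log_2(p_i)$.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    # p * log2(1/p) equals -p * log2(p); written this way to avoid printing -0.0
    return sum((n / total) * log2(total / n) for n in counts.values())


print(entropy(["pass", "pass", "pass", "pass"]))  # pure set -> 0.0
print(entropy(["pass", "fail", "pass", "fail"]))  # even split of 2 classes -> 1.0
```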
## Information Gain
* Information gain measures the reduction in entropy achieved by splitting the data on a particular feature. It helps determine which feature to split on at each node of the decision tree.
Formula: $IG(S, A) = H(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} H(S_v)$
Where $S$ is the dataset, $A$ is the feature being considered for splitting, $Values(A)$ are the possible values of feature $A$, and $S_v$ is the subset of $S$ where feature $A$ has value $v$.
* Higher information gain indicates a more useful feature for classification.
* The feature with the highest information gain is typically chosen for splitting at each node.
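
A minimal Python sketch of the information gain formula, assuming a categorical feature. The helper names and the toy data are made up for illustration.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum((n / total) * log2(total / n) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """IG(S, A): entropy of S minus the weighted entropy of the subsets S_v."""
    subsets = defaultdict(list)
    for value, label in zip(feature_values, labels):
        subsets[value].append(label)  # group labels by feature value v
    total = len(labels)
    weighted = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted


# Toy data (hypothetical): one informative and one uninformative feature
passed = ["pass", "pass", "fail", "fail"]
attended = ["yes", "yes", "no", "no"]      # separates the classes perfectly
wore_hat = ["yes", "no", "yes", "no"]      # tells us nothing about the class

print(information_gain(attended, passed))  # 1.0
print(information_gain(wore_hat, passed))  # 0.0
```

A tree builder would compute this value for every candidate feature and split on the one with the largest gain.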
## Gini Impurity
* Gini impurity is an alternative to entropy for measuring the impurity of a set of examples. It represents the probability of incorrectly classifying a randomly chosen element if it were randomly labeled according to the distribution of labels in the subset.
Formula: $Gini(S) = 1 - \sum_{i=1}^{c} (p_i)^2$
Where $S$ is the dataset, $c$ is the number of classes, and $p_i$ is the proportion of examples belonging to class $i$.
* Ranges from 0 (completely pure) to $1 - \frac{1}{c}$ (completely impure).
* Often used in algorithms like CART (Classification and Regression Trees).
The choice between using entropy (with information gain) or Gini impurity often depends on the specific implementation of the decision tree algorithm. In practice, they often yield similar results.
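
As a small illustration of the Gini formula, here is a minimal Python sketch; the function name and labels are made up for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum of squared proportions."""
    total = len(labels)
    return 1 - sum((n / total) ** 2 for n in Counter(labels).values())


print(gini(["pass", "pass", "pass"]))          # pure set -> 0.0
print(gini(["pass", "fail", "pass", "fail"]))  # even split of 2 classes -> 0.5
print(gini(["a", "b", "c"]))                   # even split of 3 classes, close to 1 - 1/3
```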
## Algorithms for decision tree construction
* ID3
* CART
Please visit [this link](id3-decision-tree.qmd) for details about the algorithms.
## An example with R
https://www.dataspoof.info/post/decision-tree-classification-in-r/
https://forum.posit.co/t/decision-tree-in-r/5561/5
##
Advantages of Decision Trees:
1. **Interpretability**: Easy to understand and explain, even for non-experts. The decision-making process can be visually represented.
2. **Little or no data preprocessing required**: Can handle both numerical and categorical data without the need for normalization or scaling.
3. **Computationally efficient**: Generally fast to train and make predictions, especially with small to medium-sized datasets.
Disadvantages:
1. **Overfitting**: Prone to overfitting, especially with deep trees, leading to poor generalization on new data.
2. **Instability**: Small changes in the data can result in a completely different tree being generated.
3. **Difficulty with high-dimensional data**: Can become computationally expensive and prone to overfitting with many features.
## Quiz time
## Random Forest
Random forest is a commonly-used machine learning algorithm, trademarked by Leo Breiman and Adele Cutler, that combines the output of multiple decision trees to reach a single result.
A random forest is an ensemble learning method for classification, regression, and other tasks that constructs a multitude of decision trees at training time. For classification tasks, the output of the random forest is the class selected by most trees.
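
The voting step can be sketched in a few lines of Python. This only illustrates the majority vote: the three "trees" below are hand-written hypothetical rules, whereas a real random forest would learn each tree from a random sample of the training data.

```python
from collections import Counter

# Three hypothetical "trees", each a simple rule on (hours studied, previous score)
trees = [
    lambda hours, score: "Pass" if hours >= 5 else "Fail",
    lambda hours, score: "Pass" if score >= 70 else "Fail",
    lambda hours, score: "Pass" if hours + score / 10 >= 10 else "Fail",
]


def forest_predict(trees, hours, score):
    """Collect one vote per tree and return the most common class."""
    votes = [tree(hours, score) for tree in trees]
    return Counter(votes).most_common(1)[0][0]


print(forest_predict(trees, 4, 80))  # votes: Fail, Pass, Pass -> Pass
print(forest_predict(trees, 3, 60))  # votes: Fail, Fail, Fail -> Fail
```

Using an odd number of trees avoids ties in a two-class problem.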
## Kaggle example
Please visit: https://www.kaggle.com/code/lara311/diabetes-prediction-using-machine-learning
## Support Vector Machines
**The Basic Idea**
Imagine you're trying to separate different types of objects, like apples and oranges, based on their characteristics, such as color, shape, and size. You want to find a way to draw a line (or a hyperplane in higher dimensions) that separates the two types of objects as accurately as possible.
## The SVM Method
A Support Vector Machine is a type of supervised learning algorithm that aims to find the best hyperplane that separates the data into different classes. Here's how it works:
1. **Data Preparation**: Collect a dataset of objects (e.g., apples and oranges) with their corresponding characteristics (features) and labels (e.g., "apple" or "orange").
2. **Plotting the Data**: Plot the data points in a feature space, where each axis represents a feature (e.g., color, shape, size).
3. **Finding the Hyperplane**: The goal is to find a hyperplane that separates the data points into different classes. A hyperplane is a line (in 2D) or a plane (in 3D) that divides the feature space into two regions.
4. **Maximizing the Margin**: The SVM algorithm tries to find the hyperplane that maximizes the margin between the two classes. The margin is the distance between the hyperplane and the nearest data points (called support vectors) on either side of the hyperplane.
5. **Support Vectors**: The support vectors are the data points that lie closest to the hyperplane and have the most influence on its position. They are the "support" that helps define the hyperplane.
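
The margin idea can be checked numerically. The Python sketch below does not train an SVM; it only measures the margin of two hand-picked candidate hyperplanes on made-up 2D data, to show why one is preferred. All names and values are assumptions for illustration.

```python
from math import sqrt


def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w.x + b = 0."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    norm = sqrt(sum(wi * wi for wi in w))
    return (dot + b) / norm


def classify(w, b, x):
    """Points on the positive side are 'orange', the rest 'apple'."""
    return "orange" if signed_distance(w, b, x) >= 0 else "apple"


def margin(w, b, points):
    """Distance from the hyperplane to the nearest data point."""
    return min(abs(signed_distance(w, b, x)) for x in points)


apples = [(1, 2), (2, 1)]
oranges = [(4, 2), (5, 3)]
points = apples + oranges

# Two vertical lines that both separate the classes: x1 = 3 and x1 = 2.5
print(margin((1, 0), -3, points))    # 1.0
print(margin((1, 0), -2.5, points))  # 0.5
```

Both lines classify every point correctly, but the line at `x1 = 3` has the larger margin, so it is the one an SVM would prefer. The points `(2, 1)` and `(4, 2)` sit closest to it and play the role of support vectors.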
## Support Vector Machines
**Key Concepts**
* **Hyperplane**: A line (in 2D) or a plane (in 3D) that separates the data into different classes.
* **Margin**: The distance between the hyperplane and the nearest data points (support vectors) on either side of the hyperplane.
* **Support Vectors**: The data points that lie closest to the hyperplane and have the most influence on its position.
## Why SVMs are Useful
SVMs are powerful because they:
* Can handle high-dimensional data
* Are fairly robust to noise and outliers when a soft margin is used
* Can be used for both classification and regression tasks
* Provide a clear geometric interpretation of the decision boundary
##
![](images/Svm_separating_hyperplanes.png)
H1 does not separate the classes. H2 does, but only with a small margin. H3 separates them with the maximal margin. [Source](https://en.wikipedia.org/wiki/Support_vector_machine)
##
![](images/741px-SVM_margin.png)
Maximum-margin hyperplane and margins for an SVM trained with samples from two classes. Samples on the margin are called the support vectors. [Source](https://en.wikipedia.org/wiki/Support_vector_machine)
##
Please visit the [SVM demo](https://greitemann.dev/svm-demo) site for an interactive online demo of SVMs.
## Logistic Regression
**The Basic Idea**
Logistic regression is a supervised machine learning algorithm for binary classification: it predicts the probability of an outcome, event, or observation. The prediction is limited to two possible outcomes, such as yes/no, 0/1, or true/false.
## The Logistic Regression Method
Logistic Regression is a type of supervised learning algorithm that models the probability of an event occurring (e.g., passing an exam) based on a set of input variables (e.g., scores). Here's how it works:
1. **Data Preparation**: Collect a dataset of input variables (e.g., scores) and output variables (e.g., pass/fail).
2. **Logistic Function**: The logistic function, also known as the sigmoid function, is used to model the probability of the event occurring. It maps a weighted sum of the input variables to a probability between 0 and 1.
3. **Log-Odds**: The model treats the log-odds of the event as a linear function of the input variables. The log-odds is the logarithm of the ratio of the probability of the event occurring to the probability of it not occurring.
4. **Coefficients**: The algorithm learns the coefficients (weights) for each input variable, which determine the importance of each variable in predicting the output.
5. **Decision Boundary**: The algorithm uses the coefficients and the logistic function to create a decision boundary, which separates the input space into two regions: one for each class (e.g., pass and fail).
6. **Prediction**: For a new input, the algorithm calculates the probability of the event occurring using the logistic function and the learned coefficients. If the probability is above a certain threshold (e.g., 0.5), the algorithm predicts the event will occur (e.g., the student will pass).
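
The prediction step can be sketched in Python as follows. The coefficients and intercept here are made-up numbers, not values learned from data, and the function names are hypothetical.

```python
from math import exp


def sigmoid(z):
    """Logistic function: maps any real number to a value between 0 and 1."""
    return 1 / (1 + exp(-z))


def predict_proba(coefficients, intercept, features):
    """Probability of the event: sigmoid of the weighted sum of the inputs."""
    z = intercept + sum(c * x for c, x in zip(coefficients, features))
    return sigmoid(z)


def predict(coefficients, intercept, features, threshold=0.5):
    """Turn the probability into a class label using the threshold."""
    p = predict_proba(coefficients, intercept, features)
    return "pass" if p >= threshold else "fail"


# Hypothetical model: features are (hours studied, previous score)
coefficients = [0.5, 0.03]
intercept = -4.0

print(predict(coefficients, intercept, [4, 70]))  # weighted sum is about 0.1 -> pass
print(predict(coefficients, intercept, [2, 50]))  # weighted sum is about -1.5 -> fail
```

Raising the threshold above 0.5 makes the model more conservative about predicting "pass".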
## Key Concepts
* **Logistic Function**: A mathematical function that maps any real number (here, a weighted sum of the input variables) to a probability between 0 and 1.
* **Log-Odds**: The logarithm of the ratio of the probability of the event occurring to the probability of the event not occurring.
* **Coefficients**: The weights learned by the algorithm for each input variable, which determine their importance in predicting the output.
* **Decision Boundary**: The boundary that separates the input space into two regions, one for each class.
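
These concepts can be written compactly. The notation is an assumption not used elsewhere in these slides: $\beta_0$ is the intercept, $\beta_i$ are the coefficients, $x_i$ are the input variables, and $p$ is the predicted probability.

$$p = \sigma(z) = \frac{1}{1 + e^{-z}}, \qquad z = \beta_0 + \sum_{i=1}^{n} \beta_i x_i$$

$$\log\frac{p}{1 - p} = z$$

With a threshold of 0.5, the decision boundary is the set of inputs where $z = 0$, because $\sigma(0) = 0.5$.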
## Why Logistic Regression is Useful
Logistic Regression is a popular algorithm because it:
* Is easy to implement and interpret
* Can handle multiple input variables
* Provides a probability estimate for each prediction
* Is widely used in many fields, such as medicine, finance, and marketing
##