Predicting each customer's yearly amount spent with Linear Regression
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
data = pd.read_csv("data/ecommerce.csv")
data.head()
| | Email | Address | Avatar | Avg. Session Length | Time on App | Time on Website | Length of Membership | Yearly Amount Spent |
---|---|---|---|---|---|---|---|---|
0 | mstephenson@fernandez.com | 835 Frank Tunnel\nWrightmouth, MI 82180-9605 | Violet | 34.497268 | 12.655651 | 39.577668 | 4.082621 | 587.951054 |
1 | hduke@hotmail.com | 4547 Archer Common\nDiazchester, CA 06566-8576 | DarkGreen | 31.926272 | 11.109461 | 37.268959 | 2.664034 | 392.204933 |
2 | pallen@yahoo.com | 24645 Valerie Unions Suite 582\nCobbborough, D... | Bisque | 33.000915 | 11.330278 | 37.110597 | 4.104543 | 487.547505 |
3 | riverarebecca@gmail.com | 1414 David Throughway\nPort Jason, OH 22070-1220 | SaddleBrown | 34.305557 | 13.717514 | 36.721283 | 3.120179 | 581.852344 |
4 | mstephens@davidson-herman.com | 14023 Rodriguez Passage\nPort Jacobville, PR 3... | MediumAquaMarine | 33.330673 | 12.795189 | 37.536653 | 4.446308 | 599.406092 |
data.head(10)
| | Email | Address | Avatar | Avg. Session Length | Time on App | Time on Website | Length of Membership | Yearly Amount Spent |
---|---|---|---|---|---|---|---|---|
0 | mstephenson@fernandez.com | 835 Frank Tunnel\nWrightmouth, MI 82180-9605 | Violet | 34.497268 | 12.655651 | 39.577668 | 4.082621 | 587.951054 |
1 | hduke@hotmail.com | 4547 Archer Common\nDiazchester, CA 06566-8576 | DarkGreen | 31.926272 | 11.109461 | 37.268959 | 2.664034 | 392.204933 |
2 | pallen@yahoo.com | 24645 Valerie Unions Suite 582\nCobbborough, D... | Bisque | 33.000915 | 11.330278 | 37.110597 | 4.104543 | 487.547505 |
3 | riverarebecca@gmail.com | 1414 David Throughway\nPort Jason, OH 22070-1220 | SaddleBrown | 34.305557 | 13.717514 | 36.721283 | 3.120179 | 581.852344 |
4 | mstephens@davidson-herman.com | 14023 Rodriguez Passage\nPort Jacobville, PR 3... | MediumAquaMarine | 33.330673 | 12.795189 | 37.536653 | 4.446308 | 599.406092 |
5 | alvareznancy@lucas.biz | 645 Martha Park Apt. 611\nJeffreychester, MN 6... | FloralWhite | 33.871038 | 12.026925 | 34.476878 | 5.493507 | 637.102448 |
6 | katherine20@yahoo.com | 68388 Reyes Lights Suite 692\nJosephbury, WV 9... | DarkSlateBlue | 32.021596 | 11.366348 | 36.683776 | 4.685017 | 521.572175 |
7 | awatkins@yahoo.com | Unit 6538 Box 8980\nDPO AP 09026-4941 | Aqua | 32.739143 | 12.351959 | 37.373359 | 4.434273 | 549.904146 |
8 | vchurch@walter-martinez.com | 860 Lee Key\nWest Debra, SD 97450-0495 | Salmon | 33.987773 | 13.386235 | 37.534497 | 3.273434 | 570.200409 |
9 | bonnie69@lin.biz | PSC 2734, Box 5255\nAPO AA 98456-7482 | Brown | 31.936549 | 11.814128 | 37.145168 | 3.202806 | 427.199385 |
data.tail()
| | Email | Address | Avatar | Avg. Session Length | Time on App | Time on Website | Length of Membership | Yearly Amount Spent |
---|---|---|---|---|---|---|---|---|
495 | lewisjessica@craig-evans.com | 4483 Jones Motorway Suite 872\nLake Jamiefurt,... | Tan | 33.237660 | 13.566160 | 36.417985 | 3.746573 | 573.847438 |
496 | katrina56@gmail.com | 172 Owen Divide Suite 497\nWest Richard, CA 19320 | PaleVioletRed | 34.702529 | 11.695736 | 37.190268 | 3.576526 | 529.049004 |
497 | dale88@hotmail.com | 0787 Andrews Ranch Apt. 633\nSouth Chadburgh, ... | Cornsilk | 32.646777 | 11.499409 | 38.332576 | 4.958264 | 551.620145 |
498 | cwilson@hotmail.com | 680 Jennifer Lodge Apt. 808\nBrendachester, TX... | Teal | 33.322501 | 12.391423 | 36.840086 | 2.336485 | 456.469510 |
499 | hannahwilson@davidson.com | 49791 Rachel Heights Apt. 898\nEast Drewboroug... | DarkMagenta | 33.715981 | 12.418808 | 35.771016 | 2.735160 | 497.778642 |
data.info()
RangeIndex: 500 entries, 0 to 499
Data columns (total 8 columns):
| # | Column | Non-Null Count | Dtype |
---|---|---|---|
0 | Email | 500 non-null | object |
1 | Address | 500 non-null | object |
2 | Avatar | 500 non-null | object |
3 | Avg. Session Length | 500 non-null | float64 |
4 | Time on App | 500 non-null | float64 |
5 | Time on Website | 500 non-null | float64 |
6 | Length of Membership | 500 non-null | float64 |
7 | Yearly Amount Spent | 500 non-null | float64 |
dtypes: float64(5), object(3)
memory usage: 31.4+ KB
Every column's non-null count equals the total number of rows (500), so there are no missing values.
data.describe()
| | Avg. Session Length | Time on App | Time on Website | Length of Membership | Yearly Amount Spent |
---|---|---|---|---|---|
count | 500.000000 | 500.000000 | 500.000000 | 500.000000 | 500.000000 |
mean | 33.053194 | 12.052488 | 37.060445 | 3.533462 | 499.314038 |
std | 0.992563 | 0.994216 | 1.010489 | 0.999278 | 79.314782 |
min | 29.532429 | 8.508152 | 33.913847 | 0.269901 | 256.670582 |
25% | 32.341822 | 11.388153 | 36.349257 | 2.930450 | 445.038277 |
50% | 33.082008 | 11.983231 | 37.069367 | 3.533975 | 498.887875 |
75% | 33.711985 | 12.753850 | 37.716432 | 4.126502 | 549.313828 |
max | 36.139662 | 15.126994 | 40.005182 | 6.922689 | 765.518462 |
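describe() shows the scale of each column.
Scale: the range from min to max.
It can also hint at outliers.
If max lies far above the 75% value, relative to the gaps between the other quartiles, the column may contain outliers.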
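The check described above can be sketched as follows. The toy values and the 1.5 × IQR cutoff are assumptions for illustration (a common rule of thumb), not something derived from the dataset in this notebook.

```python
import pandas as pd

# Hypothetical toy column: mostly similar values plus one extreme value.
s = pd.Series([10, 11, 12, 12, 13, 14, 15, 60])

q1 = s.quantile(0.25)
q3 = s.quantile(0.75)
iqr = q3 - q1

# Rule-of-thumb upper fence: values above it are flagged as possible outliers.
upper_fence = q3 + 1.5 * iqr
outliers = s[s > upper_fence]

print(s.describe())   # compare the 75% and max rows
print(outliers)       # values beyond the upper fence
```

A flagged value is only a candidate; whether it is a real outlier still depends on what the column means.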
data['Length of Membership']
0 4.082621
1 2.664034
2 4.104543
3 3.120179
4 4.446308
...
495 3.746573
496 3.576526
497 4.958264
498 2.336485
499 2.735160
Name: Length of Membership, Length: 500, dtype: float64
To select two or more columns, wrap the list of column names in one more pair of brackets.
data[['Length of Membership', 'Yearly Amount Spent']]
| | Length of Membership | Yearly Amount Spent |
---|---|---|
0 | 4.082621 | 587.951054 |
1 | 2.664034 | 392.204933 |
2 | 4.104543 | 487.547505 |
3 | 3.120179 | 581.852344 |
4 | 4.446308 | 599.406092 |
... | ... | ... |
495 | 3.746573 | 573.847438 |
496 | 3.576526 | 529.049004 |
497 | 4.958264 | 551.620145 |
498 | 2.336485 | 456.469510 |
499 | 2.735160 | 497.778642 |
500 rows × 2 columns
data = data[['Avg. Session Length', 'Time on App', 'Time on Website', 'Length of Membership', 'Yearly Amount Spent']]
data.head()
| | Avg. Session Length | Time on App | Time on Website | Length of Membership | Yearly Amount Spent |
---|---|---|---|---|---|
0 | 34.497268 | 12.655651 | 39.577668 | 4.082621 | 587.951054 |
1 | 31.926272 | 11.109461 | 37.268959 | 2.664034 | 392.204933 |
2 | 33.000915 | 11.330278 | 37.110597 | 4.104543 | 487.547505 |
3 | 34.305557 | 13.717514 | 36.721283 | 3.120179 | 581.852344 |
4 | 33.330673 | 12.795189 | 37.536653 | 4.446308 | 599.406092 |
Definition: splitting the data into a training set and a test set.
Idea: we need to be able to check whether the model handles data it did not see during training.
from sklearn.model_selection import train_test_split
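X holds the independent variables and y the dependent variable. The general call pattern looks like this (X and y themselves are defined in the next step):
X_train, X_test, y_train, y_test = train_test_split(X, y)
Define X and y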
X = data[['Avg. Session Length', 'Time on App', 'Time on Website', 'Length of Membership']]
y = data['Yearly Amount Spent']
Split the data
The split ratio can be specified; by default, 25% of the rows go to the test set.
X_train, X_test, y_train, y_test = train_test_split(X, y)
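When choosing the ratio, check that enough training data remains.
test_size is the share of rows assigned to the test set, and random_state is the random seed that makes the split reproducible.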
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=100)
X_train
| | Avg. Session Length | Time on App | Time on Website | Length of Membership |
---|---|---|---|---|
205 | 34.967610 | 13.919494 | 37.952013 | 5.066697 |
404 | 32.278443 | 12.527472 | 36.688367 | 3.531402 |
337 | 31.827979 | 12.461147 | 37.428997 | 2.974737 |
440 | 33.200616 | 11.965980 | 36.831536 | 3.549036 |
55 | 33.925297 | 11.588655 | 35.252242 | 3.392050 |
... | ... | ... | ... | ... |
343 | 32.302748 | 12.815393 | 37.957810 | 4.615426 |
359 | 32.054262 | 13.149670 | 37.650400 | 4.195614 |
323 | 32.762456 | 10.952353 | 37.646292 | 4.019470 |
280 | 32.271848 | 13.485009 | 37.550880 | 3.086337 |
8 | 33.987773 | 13.386235 | 37.534497 | 3.273434 |
400 rows × 4 columns
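The effect of test_size and random_state can be checked on a small example. This is a minimal sketch on randomly generated data; the names X_demo, y_demo and the feature names f1 to f4 are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 500 rows, 4 features, 1 target.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(500, 4)), columns=["f1", "f2", "f3", "f4"])
y_demo = pd.Series(rng.normal(size=500), name="target")

# test_size=0.2 keeps 20% of the rows for testing.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=100)
print(X_tr.shape, X_te.shape)

# The same random_state reproduces the same split.
X_tr2, X_te2, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=100)
print(X_tr.index.equals(X_tr2.index))

# Without test_size (and train_size), 25% of the rows go to the test set.
_, X_te_default, _, _ = train_test_split(X_demo, y_demo, random_state=100)
print(len(X_te_default))
```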
import statsmodels.api as sm
model = sm.OLS(y_train, X_train)
model = model.fit()
model.summary()
Dep. Variable: | Yearly Amount Spent | R-squared (uncentered): | 0.998 |
---|---|---|---|
Model: | OLS | Adj. R-squared (uncentered): | 0.998 |
Method: | Least Squares | F-statistic: | 4.798e+04 |
Date: | Wed, 23 Mar 2022 | Prob (F-statistic): | 0.00 |
Time: | 18:18:14 | Log-Likelihood: | -1820.0 |
No. Observations: | 400 | AIC: | 3648. |
Df Residuals: | 396 | BIC: | 3664. |
Df Model: | 4 | ||
Covariance Type: | nonrobust |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
---|---|---|---|---|---|---|
Avg. Session Length | 11.9059 | 0.869 | 13.696 | 0.000 | 10.197 | 13.615 |
Time on App | 34.3257 | 1.121 | 30.610 | 0.000 | 32.121 | 36.530 |
Time on Website | -14.1405 | 0.812 | -17.405 | 0.000 | -15.738 | -12.543 |
Length of Membership | 61.0149 | 1.144 | 53.318 | 0.000 | 58.765 | 63.265 |
Omnibus: | 0.490 | Durbin-Watson: | 1.987 |
---|---|---|---|
Prob(Omnibus): | 0.783 | Jarque-Bera (JB): | 0.606 |
Skew: | -0.022 | Prob(JB): | 0.739 |
Kurtosis: | 2.814 | Cond. No. | 55.4 |
Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
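R-squared
- The larger the value, the better the model fits.
- Adjusted R-squared (Adj.) is the more appropriate criterion.
- Plain R-squared cannot decrease as more independent variables are added.
- Adj. R-squared applies a penalty for the number of independent variables.
- The value reported above is the uncentered version, because the model has no constant term, so it is not directly comparable with the usual R-squared.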
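The penalty can be made concrete with the usual textbook formula for a model that includes an intercept. This is a sketch with made-up numbers; the helper name adjusted_r2 is hypothetical, and the uncentered summary above (no constant) is computed differently.

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R-squared for a model with an intercept, n rows and p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Hypothetical numbers: the same R-squared is penalized more as predictors are added.
few = adjusted_r2(0.90, n=100, p=2)
many = adjusted_r2(0.90, n=100, p=20)
print(few, many)
```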
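Coef
- The influence of each variable (strength and direction).
- The slope: how much y changes when the variable increases by 1, with the other variables held fixed.
- When variables are on different scales, the coefficients cannot be compared fairly (e.g. 1 won of salary vs. 1 year of time).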
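One common way around the scale problem is to standardize each feature (z-score) before fitting, so that every coefficient is expressed per standard deviation. The sketch below uses made-up salary and years columns and a small helper fit built on NumPy least squares; none of these names come from the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# Hypothetical features on very different scales.
salary = rng.normal(50_000, 10_000, n)   # e.g. salary in some currency unit
years = rng.normal(5, 2, n)              # e.g. time in years
y = 0.001 * salary + 3.0 * years + rng.normal(0, 1, n)

def fit(X, y):
    """Least-squares coefficients (intercept fitted but not returned)."""
    A = np.column_stack([np.ones(len(X)), X])
    return np.linalg.lstsq(A, y, rcond=None)[0][1:]

X = np.column_stack([salary, years])
X_std = (X - X.mean(axis=0)) / X.std(axis=0)   # z-score each column

raw_coef = fit(X, y)       # per-unit effects: not comparable across units
std_coef = fit(X_std, y)   # per-standard-deviation effects: comparable
print(raw_coef)
print(std_coef)
```

In this toy setup the raw coefficients make years look far more important, while the standardized ones show that salary moves y more across its typical range.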
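P-value
- A measure of whether a coefficient estimate can be trusted.
- Ranges from 0 to 1.
- A value below 0.05 is conventionally treated as statistically significant.
- At 0.05 or above, the variable's effect cannot be reliably distinguished from zero.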
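SST: sum of squared differences between the actual values and their mean
SSE: sum of squared differences between the actual values and the predicted values
SSR: sum of squared differences between the predicted values and the mean
R-Squared = SSR/SST = 1 - SSE/SST (for least squares with an intercept, where SST = SSR + SSE)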
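These three sums can be computed directly to see how they relate. This is a minimal sketch on made-up data, fitting a line with an intercept using NumPy least squares rather than the statsmodels model above.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 1, 50)   # hypothetical data

# Ordinary least squares with an intercept.
A = np.column_stack([np.ones_like(x), x])
beta = np.linalg.lstsq(A, y, rcond=None)[0]
y_hat = A @ beta

sst = np.sum((y - y.mean()) ** 2)       # actual vs. mean
sse = np.sum((y - y_hat) ** 2)          # actual vs. predicted
ssr = np.sum((y_hat - y.mean()) ** 2)   # predicted vs. mean

print(sst, ssr + sse)             # equal up to rounding
print(ssr / sst, 1 - sse / sst)   # two equivalent forms of R-squared
```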
pred = model.predict(X_test)
plt.figure(figsize=(10, 10))
sns.scatterplot(x=y_test, y=pred)
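MSE (Mean Squared Error): the mean of the squared differences between the predicted values and the actual test values
Problem
- If the raw errors are simply summed and averaged, positive and negative errors cancel out, so the result understates the true error.
Solutions
- Take the absolute value
- Square the error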
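The cancellation problem is easy to reproduce. In this sketch with made-up numbers, every prediction is off by 10, yet the plain mean error is zero.

```python
import numpy as np

# Hypothetical actual values and predictions: every prediction is off by 10.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 310.0, 390.0])

errors = y_pred - y_true

mean_error = errors.mean()     # signed errors cancel out
mae = np.abs(errors).mean()    # absolute value removes the sign
mse = (errors ** 2).mean()     # squaring removes the sign too

print(mean_error, mae, mse)
```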
from sklearn.metrics import mean_squared_error
mean_squared_error(y_test, pred)
482.28901390889246
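This number alone does not tell us whether the model is good or bad.
It becomes meaningful when compared with the same metric from another model.
Because MSE squares the errors, the result is on a much larger scale than the original y values.
Taking the square root (RMSE) brings it back to the scale of y.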
np.sqrt(mean_squared_error(y_test, pred))
21.961079525125637
As with MSE, this value alone is not enough to judge whether the model is good or bad.
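One simple reference point is a baseline that always predicts the mean. The sketch below uses made-up targets and predictions; the helper rmse and the baseline are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

# Hypothetical test targets and model predictions.
y_true = np.array([500.0, 450.0, 620.0, 390.0, 540.0])
model_pred = np.array([510.0, 440.0, 600.0, 400.0, 550.0])

# Baseline: always predict the mean (here the mean of y_true, for simplicity).
baseline_pred = np.full_like(y_true, y_true.mean())

model_rmse = rmse(y_true, model_pred)
baseline_rmse = rmse(y_true, baseline_pred)
print(model_rmse, baseline_rmse)
```

An RMSE well below the baseline suggests the model has learned something beyond the average.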
NumPy arrays are fast to compute with.
NumPy and pandas are interoperable.
a = [1, 2, 3]
b = [4, 5, 6]
np.array(a)
array([1, 2, 3])
np.array([a, b])
array([[1, 2, 3], [4, 5, 6]])
pd.DataFrame([a, b])
| | 0 | 1 | 2 |
---|---|---|---|
0 | 1 | 2 | 3 |
1 | 4 | 5 | 6 |
Column names and row labels can also be set.
pd.DataFrame([a, b], columns=['a','b','c'], index=['x','y'])
| | a | b | c |
---|---|---|---|
x | 1 | 2 | 3 |
y | 4 | 5 | 6 |
A Series is one-dimensional.
pd.Series(a)
0 1
1 2
2 3
dtype: int64
data = pd.read_csv('data/eCommerce.csv')
Selecting a single column from a DataFrame returns a Series.
type(data)
pandas.core.frame.DataFrame
type(data['Yearly Amount Spent'])
pandas.core.series.Series
A Series can be converted to a DataFrame.
pd.DataFrame(data['Yearly Amount Spent'])
| | Yearly Amount Spent |
---|---|
0 | 587.951054 |
1 | 392.204933 |
2 | 487.547505 |
3 | 581.852344 |
4 | 599.406092 |
... | ... |
495 | 573.847438 |
496 | 529.049004 |
497 | 551.620145 |
498 | 456.469510 |
499 | 497.778642 |
500 rows × 1 columns
Interoperability between pandas and NumPy
pd.DataFrame(np.array([a, b]))
| | 0 | 1 | 2 |
---|---|---|---|
0 | 1 | 2 | 3 |
1 | 4 | 5 | 6 |
np.array(data)
array([['mstephenson@fernandez.com', '835 Frank Tunnel\nWrightmouth, MI 82180-9605', 'Violet', ..., 39.57766801952616, 4.0826206329529615, 587.9510539684005], ['hduke@hotmail.com', '4547 Archer Common\nDiazchester, CA 06566-8576', 'DarkGreen', ..., 37.268958868297744, 2.66403418213262, 392.2049334443264], ['pallen@yahoo.com', '24645 Valerie Unions Suite 582\nCobbborough, DC 99414-7564', 'Bisque', ..., 37.110597442120856, 4.104543202376424, 487.54750486747207], ..., ['dale88@hotmail.com', '0787 Andrews Ranch Apt. 633\nSouth Chadburgh, TN 56128', 'Cornsilk', ..., 38.33257633196044, 4.958264472618699, 551.6201454762477], ['cwilson@hotmail.com', '680 Jennifer Lodge Apt. 808\nBrendachester, TX 05000-5873', 'Teal', ..., 36.84008572976701, 2.336484668112853, 456.469510066298], ['hannahwilson@davidson.com', '49791 Rachel Heights Apt. 898\nEast Drewborough, OR 55919-9528', 'DarkMagenta', ..., 35.771016191612965, 2.7351595670822757, 497.7786422156802]], dtype=object)
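The dtype=object in the output above comes from mixing text and numeric columns in one array. A possible sketch, using a small made-up frame, is to keep only the numeric columns before converting:

```python
import numpy as np
import pandas as pd

# Hypothetical small frame with mixed column types.
df_demo = pd.DataFrame({
    "Email": ["a@example.com", "b@example.com"],
    "Time on App": [12.6, 11.1],
    "Time on Website": [39.5, 37.2],
})

mixed = df_demo.to_numpy()                                    # falls back to dtype=object
numeric = df_demo.select_dtypes(include="number").to_numpy()  # stays float64

print(mixed.dtype, numeric.dtype)
```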
data['Yearly Amount Spent']
0 587.951054
1 392.204933
2 487.547505
3 581.852344
4 599.406092
...
495 573.847438
496 529.049004
497 551.620145
498 456.469510
499 497.778642
Name: Yearly Amount Spent, Length: 500, dtype: float64
To select two or more columns, [] is used once more because the column names are passed as a single list object.
data[['Time on App', 'Time on Website']]
| | Time on App | Time on Website |
---|---|---|
0 | 12.655651 | 39.577668 |
1 | 11.109461 | 37.268959 |
2 | 11.330278 | 37.110597 |
3 | 13.717514 | 36.721283 |
4 | 12.795189 | 37.536653 |
... | ... | ... |
495 | 13.566160 | 36.417985 |
496 | 11.695736 | 37.190268 |
497 | 11.499409 | 38.332576 |
498 | 12.391423 | 36.840086 |
499 | 12.418808 | 35.771016 |
500 rows × 2 columns
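The difference between passing one label and passing a list can be seen on a tiny made-up frame (df_demo and cols are illustrative names):

```python
import pandas as pd

df_demo = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})  # hypothetical data

cols = ["a", "c"]             # an ordinary Python list of column names
one_label = df_demo["a"]      # one label -> Series
one_in_list = df_demo[["a"]]  # list with one label -> DataFrame
by_list = df_demo[cols]       # the list can be built beforehand

print(type(one_label), type(one_in_list))
print(by_list.equals(df_demo[["a", "c"]]))
```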
By default, drop looks for the given label among the rows and removes it.
To remove a column, pass axis=1.
data.drop('Yearly Amount Spent', axis=1)
| | Email | Address | Avatar | Avg. Session Length | Time on App | Time on Website | Length of Membership |
---|---|---|---|---|---|---|---|
0 | mstephenson@fernandez.com | 835 Frank Tunnel\nWrightmouth, MI 82180-9605 | Violet | 34.497268 | 12.655651 | 39.577668 | 4.082621 |
1 | hduke@hotmail.com | 4547 Archer Common\nDiazchester, CA 06566-8576 | DarkGreen | 31.926272 | 11.109461 | 37.268959 | 2.664034 |
2 | pallen@yahoo.com | 24645 Valerie Unions Suite 582\nCobbborough, D... | Bisque | 33.000915 | 11.330278 | 37.110597 | 4.104543 |
3 | riverarebecca@gmail.com | 1414 David Throughway\nPort Jason, OH 22070-1220 | SaddleBrown | 34.305557 | 13.717514 | 36.721283 | 3.120179 |
4 | mstephens@davidson-herman.com | 14023 Rodriguez Passage\nPort Jacobville, PR 3... | MediumAquaMarine | 33.330673 | 12.795189 | 37.536653 | 4.446308 |
... | ... | ... | ... | ... | ... | ... | ... |
495 | lewisjessica@craig-evans.com | 4483 Jones Motorway Suite 872\nLake Jamiefurt,... | Tan | 33.237660 | 13.566160 | 36.417985 | 3.746573 |
496 | katrina56@gmail.com | 172 Owen Divide Suite 497\nWest Richard, CA 19320 | PaleVioletRed | 34.702529 | 11.695736 | 37.190268 | 3.576526 |
497 | dale88@hotmail.com | 0787 Andrews Ranch Apt. 633\nSouth Chadburgh, ... | Cornsilk | 32.646777 | 11.499409 | 38.332576 | 4.958264 |
498 | cwilson@hotmail.com | 680 Jennifer Lodge Apt. 808\nBrendachester, TX... | Teal | 33.322501 | 12.391423 | 36.840086 | 2.336485 |
499 | hannahwilson@davidson.com | 49791 Rachel Heights Apt. 898\nEast Drewboroug... | DarkMagenta | 33.715981 | 12.418808 | 35.771016 | 2.735160 |
500 rows × 7 columns
Likewise, to drop two or more columns, pass them as a list in [].
data.drop(['Email', 'Yearly Amount Spent'], axis=1)
| | Address | Avatar | Avg. Session Length | Time on App | Time on Website | Length of Membership |
---|---|---|---|---|---|---|
0 | 835 Frank Tunnel\nWrightmouth, MI 82180-9605 | Violet | 34.497268 | 12.655651 | 39.577668 | 4.082621 |
1 | 4547 Archer Common\nDiazchester, CA 06566-8576 | DarkGreen | 31.926272 | 11.109461 | 37.268959 | 2.664034 |
2 | 24645 Valerie Unions Suite 582\nCobbborough, D... | Bisque | 33.000915 | 11.330278 | 37.110597 | 4.104543 |
3 | 1414 David Throughway\nPort Jason, OH 22070-1220 | SaddleBrown | 34.305557 | 13.717514 | 36.721283 | 3.120179 |
4 | 14023 Rodriguez Passage\nPort Jacobville, PR 3... | MediumAquaMarine | 33.330673 | 12.795189 | 37.536653 | 4.446308 |
... | ... | ... | ... | ... | ... | ... |
495 | 4483 Jones Motorway Suite 872\nLake Jamiefurt,... | Tan | 33.237660 | 13.566160 | 36.417985 | 3.746573 |
496 | 172 Owen Divide Suite 497\nWest Richard, CA 19320 | PaleVioletRed | 34.702529 | 11.695736 | 37.190268 | 3.576526 |
497 | 0787 Andrews Ranch Apt. 633\nSouth Chadburgh, ... | Cornsilk | 32.646777 | 11.499409 | 38.332576 | 4.958264 |
498 | 680 Jennifer Lodge Apt. 808\nBrendachester, TX... | Teal | 33.322501 | 12.391423 | 36.840086 | 2.336485 |
499 | 49791 Rachel Heights Apt. 898\nEast Drewboroug... | DarkMagenta | 33.715981 | 12.418808 | 35.771016 | 2.735160 |
500 rows × 6 columns
drop does not modify the original data.
data
| | Email | Address | Avatar | Avg. Session Length | Time on App | Time on Website | Length of Membership | Yearly Amount Spent |
---|---|---|---|---|---|---|---|---|
0 | mstephenson@fernandez.com | 835 Frank Tunnel\nWrightmouth, MI 82180-9605 | Violet | 34.497268 | 12.655651 | 39.577668 | 4.082621 | 587.951054 |
1 | hduke@hotmail.com | 4547 Archer Common\nDiazchester, CA 06566-8576 | DarkGreen | 31.926272 | 11.109461 | 37.268959 | 2.664034 | 392.204933 |
2 | pallen@yahoo.com | 24645 Valerie Unions Suite 582\nCobbborough, D... | Bisque | 33.000915 | 11.330278 | 37.110597 | 4.104543 | 487.547505 |
3 | riverarebecca@gmail.com | 1414 David Throughway\nPort Jason, OH 22070-1220 | SaddleBrown | 34.305557 | 13.717514 | 36.721283 | 3.120179 | 581.852344 |
4 | mstephens@davidson-herman.com | 14023 Rodriguez Passage\nPort Jacobville, PR 3... | MediumAquaMarine | 33.330673 | 12.795189 | 37.536653 | 4.446308 | 599.406092 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
495 | lewisjessica@craig-evans.com | 4483 Jones Motorway Suite 872\nLake Jamiefurt,... | Tan | 33.237660 | 13.566160 | 36.417985 | 3.746573 | 573.847438 |
496 | katrina56@gmail.com | 172 Owen Divide Suite 497\nWest Richard, CA 19320 | PaleVioletRed | 34.702529 | 11.695736 | 37.190268 | 3.576526 | 529.049004 |
497 | dale88@hotmail.com | 0787 Andrews Ranch Apt. 633\nSouth Chadburgh, ... | Cornsilk | 32.646777 | 11.499409 | 38.332576 | 4.958264 | 551.620145 |
498 | cwilson@hotmail.com | 680 Jennifer Lodge Apt. 808\nBrendachester, TX... | Teal | 33.322501 | 12.391423 | 36.840086 | 2.336485 | 456.469510 |
499 | hannahwilson@davidson.com | 49791 Rachel Heights Apt. 898\nEast Drewboroug... | DarkMagenta | 33.715981 | 12.418808 | 35.771016 | 2.735160 | 497.778642 |
500 rows × 8 columns
Modifying the original with drop
- Assign the result back to data
- Add the inplace=True option
data.drop(['Email', 'Yearly Amount Spent'], axis=1, inplace=True)
data
| | Address | Avatar | Avg. Session Length | Time on App | Time on Website | Length of Membership |
---|---|---|---|---|---|---|
0 | 835 Frank Tunnel\nWrightmouth, MI 82180-9605 | Violet | 34.497268 | 12.655651 | 39.577668 | 4.082621 |
1 | 4547 Archer Common\nDiazchester, CA 06566-8576 | DarkGreen | 31.926272 | 11.109461 | 37.268959 | 2.664034 |
2 | 24645 Valerie Unions Suite 582\nCobbborough, D... | Bisque | 33.000915 | 11.330278 | 37.110597 | 4.104543 |
3 | 1414 David Throughway\nPort Jason, OH 22070-1220 | SaddleBrown | 34.305557 | 13.717514 | 36.721283 | 3.120179 |
4 | 14023 Rodriguez Passage\nPort Jacobville, PR 3... | MediumAquaMarine | 33.330673 | 12.795189 | 37.536653 | 4.446308 |
... | ... | ... | ... | ... | ... | ... |
495 | 4483 Jones Motorway Suite 872\nLake Jamiefurt,... | Tan | 33.237660 | 13.566160 | 36.417985 | 3.746573 |
496 | 172 Owen Divide Suite 497\nWest Richard, CA 19320 | PaleVioletRed | 34.702529 | 11.695736 | 37.190268 | 3.576526 |
497 | 0787 Andrews Ranch Apt. 633\nSouth Chadburgh, ... | Cornsilk | 32.646777 | 11.499409 | 38.332576 | 4.958264 |
498 | 680 Jennifer Lodge Apt. 808\nBrendachester, TX... | Teal | 33.322501 | 12.391423 | 36.840086 | 2.336485 |
499 | 49791 Rachel Heights Apt. 898\nEast Drewboroug... | DarkMagenta | 33.715981 | 12.418808 | 35.771016 | 2.735160 |
500 rows × 6 columns
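Both ways of keeping the result of drop can be compared on a small made-up frame. The columns= keyword used here is an alternative spelling of axis=1.

```python
import pandas as pd

df_demo = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})  # hypothetical data

dropped = df_demo.drop(["a", "b"], axis=1)   # returns a new DataFrame
print(list(df_demo.columns))                 # original still has a, b, c

# Option 1: assign the result back.
df_demo = df_demo.drop(columns=["a"])        # columns= is equivalent to axis=1
# Option 2: modify in place (the call itself returns None).
df_demo.drop("b", axis=1, inplace=True)
print(list(df_demo.columns))
```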
df = data.head(10)
df.index = ['a', 'b','c','d','e','f','g','h','i','j']
df
| | Address | Avatar | Avg. Session Length | Time on App | Time on Website | Length of Membership |
---|---|---|---|---|---|---|
a | 835 Frank Tunnel\nWrightmouth, MI 82180-9605 | Violet | 34.497268 | 12.655651 | 39.577668 | 4.082621 |
b | 4547 Archer Common\nDiazchester, CA 06566-8576 | DarkGreen | 31.926272 | 11.109461 | 37.268959 | 2.664034 |
c | 24645 Valerie Unions Suite 582\nCobbborough, D... | Bisque | 33.000915 | 11.330278 | 37.110597 | 4.104543 |
d | 1414 David Throughway\nPort Jason, OH 22070-1220 | SaddleBrown | 34.305557 | 13.717514 | 36.721283 | 3.120179 |
e | 14023 Rodriguez Passage\nPort Jacobville, PR 3... | MediumAquaMarine | 33.330673 | 12.795189 | 37.536653 | 4.446308 |
f | 645 Martha Park Apt. 611\nJeffreychester, MN 6... | FloralWhite | 33.871038 | 12.026925 | 34.476878 | 5.493507 |
g | 68388 Reyes Lights Suite 692\nJosephbury, WV 9... | DarkSlateBlue | 32.021596 | 11.366348 | 36.683776 | 4.685017 |
h | Unit 6538 Box 8980\nDPO AP 09026-4941 | Aqua | 32.739143 | 12.351959 | 37.373359 | 4.434273 |
i | 860 Lee Key\nWest Debra, SD 97450-0495 | Salmon | 33.987773 | 13.386235 | 37.534497 | 3.273434 |
j | PSC 2734, Box 5255\nAPO AA 98456-7482 | Brown | 31.936549 | 11.814128 | 37.145168 | 3.202806 |
Use loc to select a row by label.
df.loc['d']
Address 1414 David Throughway\nPort Jason, OH 22070-1220
Avatar SaddleBrown
Avg. Session Length 34.305557
Time on App 13.717514
Time on Website 36.721283
Length of Membership 3.120179
Name: d, dtype: object
To select multiple rows, use a slice; with loc, both endpoints are included.
df.loc['d':'h']
| | Address | Avatar | Avg. Session Length | Time on App | Time on Website | Length of Membership |
---|---|---|---|---|---|---|
d | 1414 David Throughway\nPort Jason, OH 22070-1220 | SaddleBrown | 34.305557 | 13.717514 | 36.721283 | 3.120179 |
e | 14023 Rodriguez Passage\nPort Jacobville, PR 3... | MediumAquaMarine | 33.330673 | 12.795189 | 37.536653 | 4.446308 |
f | 645 Martha Park Apt. 611\nJeffreychester, MN 6... | FloralWhite | 33.871038 | 12.026925 | 34.476878 | 5.493507 |
g | 68388 Reyes Lights Suite 692\nJosephbury, WV 9... | DarkSlateBlue | 32.021596 | 11.366348 | 36.683776 | 4.685017 |
h | Unit 6538 Box 8980\nDPO AP 09026-4941 | Aqua | 32.739143 | 12.351959 | 37.373359 | 4.434273 |
Use iloc to select by integer position.
The end position is not included.
df.iloc[3:7]
| | Address | Avatar | Avg. Session Length | Time on App | Time on Website | Length of Membership |
---|---|---|---|---|---|---|
d | 1414 David Throughway\nPort Jason, OH 22070-1220 | SaddleBrown | 34.305557 | 13.717514 | 36.721283 | 3.120179 |
e | 14023 Rodriguez Passage\nPort Jacobville, PR 3... | MediumAquaMarine | 33.330673 | 12.795189 | 37.536653 | 4.446308 |
f | 645 Martha Park Apt. 611\nJeffreychester, MN 6... | FloralWhite | 33.871038 | 12.026925 | 34.476878 | 5.493507 |
g | 68388 Reyes Lights Suite 692\nJosephbury, WV 9... | DarkSlateBlue | 32.021596 | 11.366348 | 36.683776 | 4.685017 |
df.iloc[3:]
| | Address | Avatar | Avg. Session Length | Time on App | Time on Website | Length of Membership |
---|---|---|---|---|---|---|
d | 1414 David Throughway\nPort Jason, OH 22070-1220 | SaddleBrown | 34.305557 | 13.717514 | 36.721283 | 3.120179 |
e | 14023 Rodriguez Passage\nPort Jacobville, PR 3... | MediumAquaMarine | 33.330673 | 12.795189 | 37.536653 | 4.446308 |
f | 645 Martha Park Apt. 611\nJeffreychester, MN 6... | FloralWhite | 33.871038 | 12.026925 | 34.476878 | 5.493507 |
g | 68388 Reyes Lights Suite 692\nJosephbury, WV 9... | DarkSlateBlue | 32.021596 | 11.366348 | 36.683776 | 4.685017 |
h | Unit 6538 Box 8980\nDPO AP 09026-4941 | Aqua | 32.739143 | 12.351959 | 37.373359 | 4.434273 |
i | 860 Lee Key\nWest Debra, SD 97450-0495 | Salmon | 33.987773 | 13.386235 | 37.534497 | 3.273434 |
j | PSC 2734, Box 5255\nAPO AA 98456-7482 | Brown | 31.936549 | 11.814128 | 37.145168 | 3.202806 |
df.iloc[:3]
| | Address | Avatar | Avg. Session Length | Time on App | Time on Website | Length of Membership |
---|---|---|---|---|---|---|
a | 835 Frank Tunnel\nWrightmouth, MI 82180-9605 | Violet | 34.497268 | 12.655651 | 39.577668 | 4.082621 |
b | 4547 Archer Common\nDiazchester, CA 06566-8576 | DarkGreen | 31.926272 | 11.109461 | 37.268959 | 2.664034 |
c | 24645 Valerie Unions Suite 582\nCobbborough, D... | Bisque | 33.000915 | 11.330278 | 37.110597 | 4.104543 |
iloc can also index columns.
df.iloc[1:4, 0:3]
| | Address | Avatar | Avg. Session Length |
---|---|---|---|
b | 4547 Archer Common\nDiazchester, CA 06566-8576 | DarkGreen | 31.926272 |
c | 24645 Valerie Unions Suite 582\nCobbborough, D... | Bisque | 33.000915 |
d | 1414 David Throughway\nPort Jason, OH 22070-1220 | SaddleBrown | 34.305557 |
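The difference between label slices and position slices is summarized in the sketch below, using a made-up ten-row frame with the same a to j labels.

```python
import pandas as pd

df_demo = pd.DataFrame({"v": range(10)}, index=list("abcdefghij"))  # hypothetical data

by_label = df_demo.loc["d":"h"]    # label slice: both ends included
by_position = df_demo.iloc[3:7]    # position slice: end excluded

print(list(by_label.index))
print(list(by_position.index))
```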
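Many different lines can be drawn through the data; Linear Regression finds the one that explains the data best.
That line is the one with the smallest mean squared error between the predicted and actual values.
Gradient Descent is one way to search for that line step by step.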
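A minimal sketch of gradient descent for a single-feature line follows. The data, learning rate and step count are assumptions chosen for this toy problem; np.polyfit is used only as a direct least-squares reference to compare against.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100)
y = 3.0 * x + 2.0 + rng.normal(0, 0.1, 100)   # hypothetical data

w, b = 0.0, 0.0     # slope and intercept, starting from zero
lr = 0.5            # learning rate (chosen by hand for this toy problem)

for _ in range(2000):
    error = (w * x + b) - y
    # Gradients of the mean squared error with respect to w and b.
    grad_w = 2 * np.mean(error * x)
    grad_b = 2 * np.mean(error)
    w -= lr * grad_w
    b -= lr * grad_b

# Direct least-squares fit for comparison.
slope, intercept = np.polyfit(x, y, 1)
print(w, b)
print(slope, intercept)
```

Both approaches should land on practically the same line; gradient descent just gets there iteratively.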
model.summary()
Dep. Variable: | Yearly Amount Spent | R-squared (uncentered): | 0.998 |
---|---|---|---|
Model: | OLS | Adj. R-squared (uncentered): | 0.998 |
Method: | Least Squares | F-statistic: | 4.798e+04 |
Date: | Wed, 23 Mar 2022 | Prob (F-statistic): | 0.00 |
Time: | 20:11:49 | Log-Likelihood: | -1820.0 |
No. Observations: | 400 | AIC: | 3648. |
Df Residuals: | 396 | BIC: | 3664. |
Df Model: | 4 | ||
Covariance Type: | nonrobust |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
---|---|---|---|---|---|---|
Avg. Session Length | 11.9059 | 0.869 | 13.696 | 0.000 | 10.197 | 13.615 |
Time on App | 34.3257 | 1.121 | 30.610 | 0.000 | 32.121 | 36.530 |
Time on Website | -14.1405 | 0.812 | -17.405 | 0.000 | -15.738 | -12.543 |
Length of Membership | 61.0149 | 1.144 | 53.318 | 0.000 | 58.765 | 63.265 |
Omnibus: | 0.490 | Durbin-Watson: | 1.987 |
---|---|---|---|
Prob(Omnibus): | 0.783 | Jarque-Bera (JB): | 0.606 |
Skew: | -0.022 | Prob(JB): | 0.739 |
Kurtosis: | 2.814 | Cond. No. | 55.4 |
Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
y = sum(coef * column)
y = 11.9059 * Avg. Session Length + 34.3257 * Time on App + ...
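As a worked instance of the formula, the sketch below plugs the coefficients from the summary table into the first row shown by data.head(). The variable names are made up for illustration, and the model has no intercept, so the prediction is a plain dot product.

```python
import numpy as np

# Coefficients copied from the summary table above, in column order.
coef = np.array([11.9059, 34.3257, -14.1405, 61.0149])
# Feature values of the first row (index 0) shown in data.head().
row0 = np.array([34.497268, 12.655651, 39.577668, 4.082621])

pred0 = np.dot(coef, row0)   # y = sum(coef * column); no intercept in this model
print(pred0)
```

The result can be compared with that row's actual Yearly Amount Spent of 587.951054.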
Predicting each customer's ad response with Logistic Regression
Logistic Regression is used for binary classification.
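The core idea can be sketched in a few lines: a linear score is squeezed into a probability by the sigmoid function, and the probability is then cut at a threshold. The scores and the 0.5 threshold below are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear scores (coef * features summed) for four customers.
scores = np.array([-3.0, -0.5, 0.0, 2.0])
probs = sigmoid(scores)              # predicted probability of class 1
labels = (probs >= 0.5).astype(int)  # 0.5 threshold is a common default

print(probs)
print(labels)
```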
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
data = pd.read_csv('data/advertising.csv')
data.head()
| | Daily Time Spent on Site | Age | Area Income | Daily Internet Usage | Ad Topic Line | City | Male | Country | Timestamp | Clicked on Ad |
---|---|---|---|---|---|---|---|---|---|---|
0 | 68.95 | NaN | 61833.90 | 256.09 | Cloned 5thgeneration orchestration | Wrightburgh | 0 | Tunisia | 3/27/2016 0:53 | 0 |
1 | 80.23 | 31.0 | 68441.85 | 193.77 | Monitored national standardization | West Jodi | 1 | Nauru | 4/4/2016 1:39 | 0 |
2 | 69.47 | 26.0 | 59785.94 | 236.50 | Organic bottom-line service-desk | Davidton | 0 | San Marino | 3/13/2016 20:35 | 0 |
3 | 74.15 | 29.0 | 54806.18 | 245.89 | Triple-buffered reciprocal time-frame | West Terrifurt | 1 | Italy | 1/10/2016 2:31 | 0 |
4 | 68.37 | 35.0 | 73889.99 | 225.58 | Robust logistical utilization | South Manuel | 0 | Iceland | 6/3/2016 3:36 | 0 |
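Column meanings
- Daily Time Spent on Site: time spent on the site
- Age: age
- Area Income: average income of the user's area, used because individual income cannot be identified
- Daily Internet Usage: daily internet usage time
- Ad Topic Line: description of the ad
- City: the user's city
- Male: gender (0 for female, 1 for male)
- Country: the user's country
- Timestamp: time-related information
- Clicked on Ad: 0 if the ad was not clicked, 1 if it was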
NaN is a missing (empty) value.
data.tail()
| | Daily Time Spent on Site | Age | Area Income | Daily Internet Usage | Ad Topic Line | City | Male | Country | Timestamp | Clicked on Ad |
---|---|---|---|---|---|---|---|---|---|---|
995 | 72.97 | 30.0 | 71384.57 | 208.58 | Fundamental modular algorithm | Duffystad | 1 | Lebanon | 2/11/2016 21:49 | 1 |
996 | 51.30 | 45.0 | 67782.17 | 134.42 | Grass-roots cohesive monitoring | New Darlene | 1 | Bosnia and Herzegovina | 4/22/2016 2:07 | 1 |
997 | 51.63 | 51.0 | 42415.72 | 120.37 | Expanded intangible solution | South Jessica | 1 | Mongolia | 2/1/2016 17:24 | 1 |
998 | 55.55 | 19.0 | 41920.79 | 187.95 | Proactive bandwidth-monitored policy | West Steven | 0 | Guatemala | 3/24/2016 2:35 | 0 |
999 | 45.01 | 26.0 | 29875.80 | 178.35 | Virtual 5thgeneration emulation | Ronniemouth | 0 | Brazil | 6/3/2016 21:43 | 1 |
data.info()
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
| # | Column | Non-Null Count | Dtype |
---|---|---|---|
0 | Daily Time Spent on Site | 1000 non-null | float64 |
1 | Age | 916 non-null | float64 |
2 | Area Income | 1000 non-null | float64 |
3 | Daily Internet Usage | 1000 non-null | float64 |
4 | Ad Topic Line | 1000 non-null | object |
5 | City | 1000 non-null | object |
6 | Male | 1000 non-null | int64 |
7 | Country | 1000 non-null | object |
8 | Timestamp | 1000 non-null | object |
9 | Clicked on Ad | 1000 non-null | int64 |
dtypes: float64(4), int64(2), object(4)
memory usage: 78.2+ KB
Clicked on Ad is the dependent variable, the value we want to predict.
Only Age has 916 non-null values, which means it contains missing values.
data.describe()
| | Daily Time Spent on Site | Age | Area Income | Daily Internet Usage | Male | Clicked on Ad |
---|---|---|---|---|---|---|
count | 1000.000000 | 916.000000 | 1000.000000 | 1000.000000 | 1000.000000 | 1000.00000 |
mean | 65.000200 | 36.128821 | 55000.000080 | 180.000100 | 0.481000 | 0.50000 |
std | 15.853615 | 9.018548 | 13414.634022 | 43.902339 | 0.499889 | 0.50025 |
min | 32.600000 | 19.000000 | 13996.500000 | 104.780000 | 0.000000 | 0.00000 |
25% | 51.360000 | 29.000000 | 47031.802500 | 138.830000 | 0.000000 | 0.00000 |
50% | 68.215000 | 35.000000 | 57012.300000 | 183.130000 | 0.000000 | 0.50000 |
75% | 78.547500 | 42.000000 | 65470.635000 | 218.792500 | 1.000000 | 1.00000 |
max | 91.430000 | 61.000000 | 79484.800000 | 269.960000 | 1.000000 | 1.00000 |
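For Area Income, the gap between min and 25% is much larger than the gaps between the other quartiles, which suggests a long tail on the left side of the distribution.
The mean of Male is 0.481, so about 48% of the users are male.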
sns.displot(data['Area Income'])
sns.displot(data['Age'])
Check the number of unique values
data['Country'].nunique()
237
data['City'].nunique()
969
data['Ad Topic Line'].nunique()
1000
Missing Value
- na
- NaN
- Null
data.isna()
| | Daily Time Spent on Site | Age | Area Income | Daily Internet Usage | Ad Topic Line | City | Male | Country | Timestamp | Clicked on Ad |
---|---|---|---|---|---|---|---|---|---|---|
0 | False | True | False | False | False | False | False | False | False | False |
1 | False | False | False | False | False | False | False | False | False | False |
2 | False | False | False | False | False | False | False | False | False | False |
3 | False | False | False | False | False | False | False | False | False | False |
4 | False | False | False | False | False | False | False | False | False | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
995 | False | False | False | False | False | False | False | False | False | False |
996 | False | False | False | False | False | False | False | False | False | False |
997 | False | False | False | False | False | False | False | False | False | False |
998 | False | False | False | False | False | False | False | False | False | False |
999 | False | False | False | False | False | False | False | False | False | False |
1000 rows × 10 columns
data.isna().sum()
Daily Time Spent on Site 0
Age 84
Area Income 0
Daily Internet Usage 0
Ad Topic Line 0
City 0
Male 0
Country 0
Timestamp 0
Clicked on Ad 0
dtype: int64
data.isna().sum() / len(data)
Daily Time Spent on Site 0.000
Age 0.084
Area Income 0.000
Daily Internet Usage 0.000
Ad Topic Line 0.000
City 0.000
Male 0.000
Country 0.000
Timestamp 0.000
Clicked on Ad 0.000
dtype: float64
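The missing-value ratio can also be obtained in one step, because the mean of a True/False column is the share of True values. A small sketch on a made-up frame:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing Age value out of four rows.
df_demo = pd.DataFrame({"Age": [31.0, np.nan, 26.0, 29.0], "Male": [0, 1, 0, 1]})

ratio_a = df_demo.isna().sum() / len(df_demo)
ratio_b = df_demo.isna().mean()   # mean of True/False is the share of True

print(ratio_b)
```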