import numpy as np
import pandas as pd
df_csv = pd.read_csv(r"C:\Users\zhoukaiwei\Desktop\CSV.csv")
df_csv
| | Unnamed: 0 | clum1 | clum2 | clum3 | time |
---|---|---|---|---|---|
0 | 0 | a | A | 1 | 2020.1.1 |
1 | 1 | b | B | 2 | 2020.1.1 |
2 | 2 | c | C | 3 | 2020.1.1 |
3 | 3 | d | D | 4 | 2020.1.1 |
4 | 4 | e | E | 5 | 2020.1.1 |
5 | 5 | f | F | 6 | 2020.1.1 |
df_excel = pd.read_excel(r"C:\Users\zhoukaiwei\Desktop\my_excel.xlsx")
df_excel
| | Unnamed: 0 | clum1 | clum2 | clum3 | time |
---|---|---|---|---|---|
0 | 0 | a | A | 1 | 2020.1.1 |
1 | 1 | b | B | 2 | 2020.1.1 |
2 | 2 | c | C | 3 | 2020.1.1 |
3 | 3 | d | D | 4 | 2020.1.1 |
4 | 4 | e | E | 5 | 2020.1.1 |
5 | 5 | f | F | 6 | 2020.1.1 |
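Several common parameters are worth noting: header=None means the first row is not used as column names; index_col sets one or more columns as the index (indexes are covered in detail in Chapter 3); usecols is the set of columns to read, with all columns read by default; parse_dates lists the columns to convert to datetimes (time series are covered in Chapter 10); and nrows is the number of data rows to read. These parameters work with read_csv and read_excel shown above, as well as with read_table.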
df_excel = pd.read_excel(r"C:\Users\zhoukaiwei\Desktop\my_excel.xlsx",header = None)
df_excel
| | 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|
0 | NaN | clum1 | clum2 | clum3 | time |
1 | 0.0 | a | A | 1 | 2020.1.1 |
2 | 1.0 | b | B | 2 | 2020.1.1 |
3 | 2.0 | c | C | 3 | 2020.1.1 |
4 | 3.0 | d | D | 4 | 2020.1.1 |
5 | 4.0 | e | E | 5 | 2020.1.1 |
6 | 5.0 | f | F | 6 | 2020.1.1 |
df_excel = pd.read_excel(r"C:\Users\zhoukaiwei\Desktop\my_excel.xlsx",usecols=['clum3'])
df_excel
| | clum3 |
---|---|
0 | 1 |
1 | 2 |
2 | 3 |
3 | 4 |
4 | 5 |
5 | 6 |
df_excel = pd.read_excel(r"C:\Users\zhoukaiwei\Desktop\my_excel.xlsx",parse_dates=['time'])
df_excel
| | Unnamed: 0 | clum1 | clum2 | clum3 | time |
---|---|---|---|---|---|
0 | 0 | a | A | 1 | 2020-01-01 |
1 | 1 | b | B | 2 | 2020-01-01 |
2 | 2 | c | C | 3 | 2020-01-01 |
3 | 3 | d | D | 4 | 2020-01-01 |
4 | 4 | e | E | 5 | 2020-01-01 |
5 | 5 | f | F | 6 | 2020-01-01 |
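The parameters index_col and nrows are described above but not demonstrated. The following is a minimal sketch that reads from an in-memory string instead of a file; the contents of csv_text and all variable names are illustrative assumptions, not the data used in this section.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for a file on disk.
csv_text = """clum1,clum2,clum3,time
a,A,1,2020-01-01
b,B,2,2020-01-02
c,C,3,2020-01-03
d,D,4,2020-01-04
"""

# index_col: use the clum1 column as the row index.
df_idx = pd.read_csv(io.StringIO(csv_text), index_col="clum1")

# nrows: read only the first two data rows.
df_head = pd.read_csv(io.StringIO(csv_text), nrows=2)

# parse_dates: convert the time column to a datetime dtype.
df_dates = pd.read_csv(io.StringIO(csv_text), parse_dates=["time"])

# header=None: the first line is treated as data, so columns are numbered.
df_raw = pd.read_csv(io.StringIO(csv_text), header=None)

print(df_idx.index.name, len(df_head), df_dates["time"].dtype, df_raw.shape)
```

With header=None the original header line becomes the first data row, which is why df_raw has one more row than the other frames.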
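pandas has two basic data structures: Series, which stores one-dimensional values, and DataFrame, which stores two-dimensional values.
A Series generally consists of four parts: the values data, the index index, the storage type dtype, and the name of the series name. The index can also be given a name of its own, which is empty by default.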
A = pd.Series(data = [100,'A',{'index':5}],
index = pd.Index(['id1','id2','id3'],name = 'index'),
dtype = 'object',name = 'my_name')
A
index
id1 100
id2 A
id3 {'index': 5}
Name: my_name, dtype: object
Accessing attributes:
A.values
array([100, 'A', {'index': 5}], dtype=object)
A.index
Index(['id1', 'id2', 'id3'], dtype='object', name='index')
A.dtype
dtype('O')
A.shape
(3,)
A['id2']
'A'
A DataFrame adds a column index on top of a Series. A DataFrame can be constructed from two-dimensional data together with row and column indexes:
import pandas as pd
data = [[1,'a',1.2],[2,'b',2.2],[3,'c',3.2]]
df = pd.DataFrame(data = data,index = ['a_%d'%i for i in range(3)],columns = ['b_%d'%i for i in range(3)])
df
| | b_0 | b_1 | b_2 |
---|---|---|---|
a_0 | 1 | a | 1.2 |
a_1 | 2 | b | 2.2 |
a_2 | 3 | c | 3.2 |
A DataFrame can also be constructed from a mapping of column names to data, with a row index added:
data = pd.DataFrame(data = {'col_0':[1,2,3],'col_1':list('abc'),'col_2':[1.2,2.2,3.2]},
index = ['row_%d'%i for i in range(3)])
data
| | col_0 | col_1 | col_2 |
---|---|---|---|
row_0 | 1 | a | 1.2 |
row_1 | 2 | b | 2.2 |
row_2 | 3 | c | 3.2 |
data['col_1']
row_0 a
row_1 b
row_2 c
Name: col_1, dtype: object
data.values
array([[1, 'a', 1.2],
[2, 'b', 2.2],
[3, 'c', 3.2]], dtype=object)
data.shape  # shape as (rows, columns)
(3, 3)
data.T  # transpose
| | row_0 | row_1 | row_2 |
---|---|---|---|
col_0 | 1 | 2 | 3 |
col_1 | a | b | c |
col_2 | 1.2 | 2.2 | 3.2 |
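Selecting one column by name returns a Series, while selecting a list of columns returns a DataFrame. The sketch below rebuilds the same small table under the hypothetical name frame to illustrate the difference.

```python
import pandas as pd

# Same toy table as above, rebuilt here so the example is self-contained.
frame = pd.DataFrame(
    data={'col_0': [1, 2, 3], 'col_1': list('abc'), 'col_2': [1.2, 2.2, 3.2]},
    index=['row_%d' % i for i in range(3)],
)

one_col = frame['col_1']             # a single label gives a Series
two_cols = frame[['col_0', 'col_2']]  # a list of labels gives a DataFrame
transposed = frame.T                  # rows and columns swap roles

print(type(one_col).__name__, type(two_cols).__name__, transposed.shape)
```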
import pandas as pd
df = pd.read_csv(r'C:\Users\zhoukaiwei\Desktop\joyful-pandas\data\learn_pandas.csv')
df.columns
Index(['School', 'Grade', 'Name', 'Gender', 'Height', 'Weight', 'Transfer',
'Test_Number', 'Test_Date', 'Time_Record'],
dtype='object')
head and tail return the first n and the last n rows of a table or series, respectively, where n defaults to 5:
df = df[df.columns[:7]]
df.head(3)
| | School | Grade | Name | Gender | Height | Weight | Transfer |
---|---|---|---|---|---|---|---|
0 | Shanghai Jiao Tong University | Freshman | Gaopeng Yang | Female | 158.9 | 46.0 | N |
1 | Peking University | Freshman | Changqiang You | Male | 166.5 | 70.0 | N |
2 | Shanghai Jiao Tong University | Senior | Mei Sun | Male | 188.9 | 89.0 | N |
df.tail(5)
df
| | School | Grade | Name | Gender | Height | Weight | Transfer |
---|---|---|---|---|---|---|---|
0 | Shanghai Jiao Tong University | Freshman | Gaopeng Yang | Female | 158.9 | 46.0 | N |
1 | Peking University | Freshman | Changqiang You | Male | 166.5 | 70.0 | N |
2 | Shanghai Jiao Tong University | Senior | Mei Sun | Male | 188.9 | 89.0 | N |
3 | Fudan University | Sophomore | Xiaojuan Sun | Female | NaN | 41.0 | N |
4 | Fudan University | Sophomore | Gaojuan You | Male | 174.0 | 74.0 | N |
... | ... | ... | ... | ... | ... | ... | ... |
195 | Fudan University | Junior | Xiaojuan Sun | Female | 153.9 | 46.0 | N |
196 | Tsinghua University | Senior | Li Zhao | Female | 160.9 | 50.0 | N |
197 | Shanghai Jiao Tong University | Senior | Chengqiang Chu | Female | 153.9 | 45.0 | N |
198 | Shanghai Jiao Tong University | Senior | Chengmei Shen | Male | 175.3 | 71.0 | N |
199 | Tsinghua University | Sophomore | Chunpeng Lv | Male | 155.7 | 51.0 | N |
200 rows × 7 columns
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 School 200 non-null object
1 Grade 200 non-null object
2 Name 200 non-null object
3 Gender 200 non-null object
4 Height 183 non-null float64
5 Weight 189 non-null float64
6 Transfer 188 non-null object
dtypes: float64(2), object(5)
memory usage: 11.1+ KB
df.describe()
| | Height | Weight |
---|---|---|
count | 183.000000 | 189.000000 |
mean | 163.218033 | 55.015873 |
std | 8.608879 | 12.824294 |
min | 145.400000 | 34.000000 |
25% | 157.150000 | 46.000000 |
50% | 161.900000 | 51.000000 |
75% | 167.500000 | 65.000000 |
max | 193.900000 | 89.000000 |
df['School'].unique()  # array of the unique values
array(['Shanghai Jiao Tong University', 'Peking University',
'Fudan University', 'Tsinghua University'], dtype=object)
df['School'].nunique()  # number of unique values
4
df['School'].value_counts()  # unique values and their frequencies
Tsinghua University 69
Shanghai Jiao Tong University 57
Fudan University 40
Peking University 34
Name: School, dtype: int64
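drop_duplicates returns the unique combinations across several columns. Its key parameter is keep: the default 'first' keeps the row where each combination first appears, 'last' keeps the row where it last appears, and False drops every row whose combination is duplicated.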
df_A = df[['School','Transfer','Name']]
df_A
| | School | Transfer | Name |
---|---|---|---|
0 | Shanghai Jiao Tong University | N | Gaopeng Yang |
1 | Peking University | N | Changqiang You |
2 | Shanghai Jiao Tong University | N | Mei Sun |
3 | Fudan University | N | Xiaojuan Sun |
4 | Fudan University | N | Gaojuan You |
... | ... | ... | ... |
195 | Fudan University | N | Xiaojuan Sun |
196 | Tsinghua University | N | Li Zhao |
197 | Shanghai Jiao Tong University | N | Chengqiang Chu |
198 | Shanghai Jiao Tong University | N | Chengmei Shen |
199 | Tsinghua University | N | Chunpeng Lv |
200 rows × 3 columns
df_A.drop_duplicates(['School','Transfer'])
| | School | Transfer | Name |
---|---|---|---|
0 | Shanghai Jiao Tong University | N | Gaopeng Yang |
1 | Peking University | N | Changqiang You |
3 | Fudan University | N | Xiaojuan Sun |
5 | Tsinghua University | N | Xiaoli Qian |
12 | Shanghai Jiao Tong University | NaN | Peng You |
36 | Peking University | Y | Xiaojuan Qin |
43 | Tsinghua University | Y | Gaoli Feng |
69 | Tsinghua University | NaN | Chunquan Xu |
84 | Fudan University | NaN | Yanjuan Lv |
102 | Peking University | NaN | Chengli Zhao |
131 | Fudan University | Y | Chengpeng Qian |
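The example above uses only the default keep='first'. The following sketch compares all three keep options on a small made-up table; the frame toy and its values are assumptions for illustration, not rows from learn_pandas.csv.

```python
import pandas as pd

# Hypothetical toy data: (A, N) and (C, N) each appear twice.
toy = pd.DataFrame({
    'School':   ['A', 'A', 'B', 'C', 'C', 'C'],
    'Transfer': ['N', 'N', 'Y', 'N', 'N', 'Y'],
    'Name':     ['n0', 'n1', 'n2', 'n3', 'n4', 'n5'],
})

cols = ['School', 'Transfer']
kept_first = toy.drop_duplicates(cols)               # first row of each combination
kept_last = toy.drop_duplicates(cols, keep='last')   # last row of each combination
kept_none = toy.drop_duplicates(cols, keep=False)    # only combinations seen once

print(list(kept_first.index), list(kept_last.index), list(kept_none.index))
```

Note that the original row labels are preserved, which makes it easy to see which occurrence survived.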
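Replacement is usually applied to a single column, so the examples below all use a Series. The replacement functions in pandas fall into three groups: mapping replacement, logical replacement, and numerical replacement. Mapping replacement covers the replace method, the str.replace method, and the cat.codes method; only replace is introduced here. replace accepts either a dictionary or two lists: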
#df = pd.read_csv(r'C:\Users\zhoukaiwei\Desktop\joyful-pandas\data\learn_pandas.csv')
#df = df[df.columns[:7]]
df['Gender'].replace({'Female':0, 'Male':1}).head()
0 0
1 1
2 1
3 0
4 1
Name: Gender, dtype: int64
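The two-list form of replace is not shown in the example above. A minimal sketch on a hypothetical Series follows; string targets are used here only to keep the result dtype unchanged.

```python
import pandas as pd

# Hypothetical gender column.
gender = pd.Series(['Female', 'Male', 'Male', 'Female'])

# Dictionary form: keys are old values, values are new values.
by_dict = gender.replace({'Female': 'F', 'Male': 'M'})

# Two-list form: the i-th old value maps to the i-th new value.
by_lists = gender.replace(['Female', 'Male'], ['F', 'M'])

print(by_dict.tolist(), by_lists.tolist())
```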
There are two ways to sort: by value and by index, using sort_values and sort_index respectively. The parameter ascending=True, which is the default, sorts in ascending order:
df_A = df[['Grade', 'Name', 'Height',
           'Weight']].set_index(['Grade','Name'])
df_A.sort_values('Height').head()  # ascending order
Grade | Name | Height | Weight |
---|---|---|---|
Junior | Xiaoli Chu | 145.4 | 34.0 |
Senior | Gaomei Lv | 147.3 | 34.0 |
Sophomore | Peng Han | 147.8 | 34.0 |
Senior | Changli Lv | 148.7 | 41.0 |
Sophomore | Changjuan You | 150.5 | 40.0 |
df_A.sort_values('Height', ascending=False).head()  # descending order
Grade | Name | Height | Weight |
---|---|---|---|
Senior | Xiaoqiang Qin | 193.9 | 79.0 |
Senior | Mei Sun | 188.9 | 89.0 |
Senior | Gaoli Zhao | 186.5 | 83.0 |
Freshman | Qiang Han | 185.3 | 87.0 |
Senior | Qiang Zheng | 183.9 | 87.0 |
df_A.sort_values(['Weight','Height'],ascending=[True,False]).head(10)
Grade | Name | Height | Weight |
---|---|---|---|
Sophomore | Peng Han | 147.8 | 34.0 |
Senior | Gaomei Lv | 147.3 | 34.0 |
Junior | Xiaoli Chu | 145.4 | 34.0 |
Sophomore | Qiang Zhou | 150.5 | 36.0 |
Freshman | Yanqiang Xu | 152.4 | 38.0 |
Freshman | Qiang Han | 151.8 | 38.0 |
Senior | Chengpeng Zheng | 151.7 | 38.0 |
Sophomore | Mei Xu | 154.2 | 39.0 |
Freshman | Xiaoquan Sun | 154.6 | 40.0 |
Sophomore | Qiang Sun | 154.3 | 40.0 |
df_A.sort_index(level = ['Grade','Name'],ascending=[True,False]).head(10)
Grade | Name | Height | Weight |
---|---|---|---|
Freshman | Yanquan Wang | 163.5 | 55.0 |
Freshman | Yanqiang Xu | 152.4 | 38.0 |
Freshman | Yanqiang Feng | 162.3 | 51.0 |
Freshman | Yanpeng Lv | NaN | 65.0 |
Freshman | Yanli Zhang | 165.1 | 52.0 |
Freshman | Yanjuan Zhao | NaN | 53.0 |
Freshman | Yanjuan Han | 163.7 | 49.0 |
Freshman | Xiaoquan Sun | 154.6 | 40.0 |
Freshman | Xiaopeng Zhou | 174.1 | 74.0 |
Freshman | Xiaopeng Zhao | 161.0 | 53.0 |
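To make the effect of mixed sort directions easier to follow, here is a small sketch on four made-up rows; the frame toy and its values are assumptions chosen so the resulting order can be checked by hand.

```python
import pandas as pd

# Hypothetical data with a (Grade, Name) MultiIndex, as in df_A above.
toy = pd.DataFrame({
    'Grade':  ['Senior', 'Freshman', 'Senior', 'Freshman'],
    'Name':   ['B', 'A', 'A', 'C'],
    'Height': [170.0, 160.0, 180.0, 160.0],
    'Weight': [60.0, 50.0, 60.0, 45.0],
}).set_index(['Grade', 'Name'])

# Weight ascending; ties in Weight are broken by Height descending.
by_value = toy.sort_values(['Weight', 'Height'], ascending=[True, False])

# Grade ascending; within each Grade, Name descending.
by_index = toy.sort_index(level=['Grade', 'Name'], ascending=[True, False])

print(list(by_value.index))
print(list(by_index.index))
```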
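apply is commonly used to iterate over the rows or columns of a DataFrame. Its axis argument has the same meaning as in the statistical aggregation functions of Section 2, and the function passed to apply usually takes a series as input. For example, .mean() can be written with apply as follows: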
#df = pd.read_csv(r'C:\Users\zhoukaiwei\Desktop\joyful-pandas\data\learn_pandas.csv')
df_A = df[['Height','Weight']]
df_A
def A_mean(x):
    res = x.mean()
    return res
df_A.apply(A_mean)
Height 163.218033
Weight 55.015873
dtype: float64
df_A.apply(lambda x: x.mean())  # the same with a lambda expression
Height 163.218033
Weight 55.015873
dtype: float64
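The examples above iterate over columns, which is the default axis=0. The sketch below adds the row-wise case with axis=1 on a hypothetical three-row table; the frame df_hw and its numbers are made up for illustration.

```python
import pandas as pd

# Hypothetical height/weight table.
df_hw = pd.DataFrame({
    'Height': [160.0, 170.0, 180.0],
    'Weight': [50.0, 60.0, 70.0],
})

# axis=0 (default): the function receives each column as a Series.
col_mean = df_hw.apply(lambda x: x.mean())

# axis=1: the function receives each row as a Series.
row_mean = df_hw.apply(lambda x: x.mean(), axis=1)

print(col_mean.to_dict(), row_mean.tolist())
```

For built-in aggregations such as mean, calling df_hw.mean(axis=1) directly is usually faster; apply is mainly useful for custom functions.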
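pandas has three kinds of windows: the rolling window rolling, the expanding window expanding, and the exponentially weighted window ewm.
To use a rolling-window function, first call .rolling on a series to obtain a rolling object.
Its most important parameter is the window size window. For example: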
s = pd.Series([1,2,3,4,5])
A = s.rolling(window = 3)
A
Rolling [window=3,center=False,axis=0]
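Once the rolling object is available, the usual aggregation functions can be applied to it. Note that the window includes the element in the current row: for example, the mean at the fourth position is (2+3+4)/3,
not (1+2+3)/3: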
A.mean()
0 NaN
1 NaN
2 2.0
3 3.0
4 4.0
dtype: float64
A.sum()
0 NaN
1 NaN
2 6.0
3 9.0
4 12.0
dtype: float64
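The claim that the window ends at the current element can be checked by slicing by hand. The sketch below does so, and also shows min_periods, a rolling parameter not discussed in the text, which allows results for the incomplete windows at the start.

```python
import numpy as np
import pandas as pd

s_roll = pd.Series([1, 2, 3, 4, 5])
rolled = s_roll.rolling(window=3).mean()

# Manual version: the window at position i covers positions i-2 .. i inclusive.
manual = pd.Series([
    np.nan if i < 2 else s_roll.iloc[i - 2:i + 1].mean()
    for i in range(len(s_roll))
])

# min_periods=1 (not covered above) computes on whatever elements are available.
partial = s_roll.rolling(window=3, min_periods=1).mean()

print(rolled.tolist(), partial.tolist())
```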
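# Rolling covariance and correlation. A still wraps the earlier series [1,2,3,4,5]; s is rebound to a second series here.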
s = pd.Series([1,2,6,16,30])
A.cov(s)
0 NaN
1 NaN
2 2.5
3 7.0
4 12.0
dtype: float64
A.corr(s)
0 NaN
1 NaN
2 0.944911
3 0.970725
4 0.995402
dtype: float64
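shift, diff, and pct_change form a group of rolling-like functions. They share the parameter periods=n, which defaults to 1. Respectively, they take the value n elements back, take the difference from the element n positions back (unlike diff in NumPy, where n denotes the n-th order difference), and compute the growth rate relative to the element n positions back. n may be negative, which performs the same operation in the opposite direction.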
s = pd.Series([1,3,6,10,15])
s.shift(1)
0 NaN
1 1.0
2 3.0
3 6.0
4 10.0
dtype: float64
s.diff(2)
0 NaN
1 NaN
2 5.0
3 7.0
4 9.0
dtype: float64
s.pct_change()
0 NaN
1 2.000000
2 1.000000
3 0.666667
4 0.500000
dtype: float64
s.shift(-1)
0 3.0
1 6.0
2 10.0
3 15.0
4 NaN
dtype: float64
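diff and pct_change can both be expressed through shift, and the contrast with NumPy's n-th order difference is easy to verify. A short sketch on the same numbers follows; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

s_shift = pd.Series([1, 3, 6, 10, 15])

# pandas: difference from the element 2 positions back.
lag_diff = s_shift.diff(2)
lag_diff_manual = s_shift - s_shift.shift(2)

# pandas: growth rate relative to the previous element.
growth = s_shift.pct_change()
growth_manual = s_shift / s_shift.shift(1) - 1

# NumPy: n=2 means the difference applied twice (second-order difference).
second_order = np.diff(s_shift.values, n=2)

print(lag_diff.tolist(), second_order.tolist())
```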
s = pd.Series([1, 3, 6, 10])
s.expanding().mean()
0 1.000000
1 2.000000
2 3.333333
3 5.000000
dtype: float64
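An expanding window grows from the first element up to the current one, so the expanding mean is simply the cumulative sum divided by the number of elements seen so far. The sketch below checks this on the same series.

```python
import numpy as np
import pandas as pd

s_exp = pd.Series([1, 3, 6, 10])

expanding_mean = s_exp.expanding().mean()

# Cumulative sum divided by the running count 1, 2, 3, ...
manual_mean = s_exp.cumsum() / np.arange(1, len(s_exp) + 1)

print(expanding_mean.tolist())
print(manual_mean.tolist())
```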
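Here is a Pokémon dataset. Some background:
# is the national Pokédex number; rows that share the same number are different forms of the same Pokémon.
A Pokémon has either one type or two; for single-type Pokémon, Type 2 is missing.
Total, HP, Attack, Defense, Sp. Atk, Sp. Def, and Speed are the base stat total, hit points, physical attack, defense, special attack, special defense, and speed; Total is the sum of the other six.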
df = pd.read_csv(r'C:\Users\zhoukaiwei\Desktop\joyful-pandas\data\pokemon.csv')
df.head()
| | # | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Bulbasaur | Grass | Poison | 318 | 45 | 49 | 49 | 65 | 65 | 45 |
1 | 2 | Ivysaur | Grass | Poison | 405 | 60 | 62 | 63 | 80 | 80 | 60 |
2 | 3 | Venusaur | Grass | Poison | 525 | 80 | 82 | 83 | 100 | 100 | 80 |
3 | 3 | VenusaurMega Venusaur | Grass | Poison | 625 | 80 | 100 | 123 | 122 | 120 | 80 |
4 | 4 | Charmander | Fire | NaN | 309 | 39 | 52 | 43 | 60 | 50 | 65 |
1. Sum HP, Attack, Defense, Sp. Atk, Sp. Def, and Speed, and verify that the result equals Total.
A = (df[['HP','Attack','Defense','Sp. Atk','Sp. Def','Speed']]).sum(1)
A
0 318
1 405
2 525
3 625
4 309
...
795 600
796 700
797 600
798 680
799 600
Length: 800, dtype: int64
x= (A != df['Total']).mean()
x
0.0
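For Pokémon with a repeated #, keep only the first record, then solve the following:
Find the number of distinct first types and the three most frequent first types.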
df_A = df.drop_duplicates('#',keep='first')
df_A.head()
| | # | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Bulbasaur | Grass | Poison | 318 | 45 | 49 | 49 | 65 | 65 | 45 |
1 | 2 | Ivysaur | Grass | Poison | 405 | 60 | 62 | 63 | 80 | 80 | 60 |
2 | 3 | Venusaur | Grass | Poison | 525 | 80 | 82 | 83 | 100 | 100 | 80 |
4 | 4 | Charmander | Fire | NaN | 309 | 39 | 52 | 43 | 60 | 50 | 65 |
5 | 5 | Charmeleon | Fire | NaN | 405 | 58 | 64 | 58 | 80 | 65 | 80 |
df_A['Type 1'].nunique()
18
df_A['Type 1'].value_counts().head(3)
Water 105
Normal 93
Grass 66
Name: Type 1, dtype: int64
Find the number of distinct combinations of first and second type.
df_B = df_A.drop_duplicates(['Type 1','Type 2'])
df_B.shape[0]
143
Find the type combinations that have not yet appeared.
import numpy as np
L_full = [i+' '+j for i in df['Type 1'].unique() for j in (
df['Type 1'].unique().tolist() + [''])]
L_part = [i+' '+j for i, j in zip(df['Type 1'], df['Type 2'
].replace(np.nan, ''))]
res = set(L_full).difference(set(L_part))
len(res)
188
Construct Series according to the following requirements:
Take Attack; replace values above 120 with high, values below 50 with low, and set the rest to mid.
res = df['Attack'].mask(df['Attack'] > 120, 'high').mask(df['Attack']<50, 'low'
).mask((50<=df['Attack'])&(df['Attack']<=120), 'mid')
res
0 low
1 mid
2 mid
3 mid
4 mid
...
795 mid
796 high
797 mid
798 high
799 mid
Name: Attack, Length: 800, dtype: object
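The chained mask calls above can also be written as a single call to numpy.select. This is an alternative sketch, not the approach used in the solution; the Series attack below is hypothetical and chosen to hit the boundary values 50 and 120.

```python
import numpy as np
import pandas as pd

# Hypothetical Attack values around the two thresholds.
attack = pd.Series([49, 50, 120, 121], name='Attack')

# Conditions are checked in order; rows matching none of them get the default.
labels = pd.Series(
    np.select([attack > 120, attack < 50], ['high', 'low'], default='mid'),
    index=attack.index,
    name=attack.name,
)

print(labels.tolist())
```

Because 50 and 120 satisfy neither condition, they fall through to mid, matching the closed interval used in the mask version.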
Take the first type and convert all letters to uppercase, once with replace and once with apply.
df['Type 1'].replace({i:str.upper(i) for i in df['Type 1']})
0 GRASS
1 GRASS
2 GRASS
3 GRASS
4 FIRE
...
795 ROCK
796 ROCK
797 PSYCHIC
798 PSYCHIC
799 FIRE
Name: Type 1, Length: 800, dtype: object
df['Type 1'].apply(lambda x:str.upper(x))
0 GRASS
1 GRASS
2 GRASS
3 GRASS
4 FIRE
...
795 ROCK
796 ROCK
797 PSYCHIC
798 PSYCHIC
799 FIRE
Name: Type 1, Length: 800, dtype: object
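Outside the constraints of the exercise, the vectorized string accessor gives the same result more directly. The following sketch compares it with apply on a hypothetical Series.

```python
import pandas as pd

# Hypothetical first-type column.
types = pd.Series(['Grass', 'Fire', 'Water', 'Fire'], name='Type 1')

upper_str = types.str.upper()                    # vectorized string method
upper_apply = types.apply(lambda x: x.upper())   # element-wise Python call

print(upper_str.tolist())
```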
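For each Pokémon, compute the deviation of its six stats, defined as the largest absolute difference between any stat and the median of the six; add it to df and sort in descending order.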
df['Deviation'] = df[['HP', 'Attack', 'Defense', 'Sp. Atk',
'Sp. Def', 'Speed']].apply(lambda x:np.max(
(x-x.median()).abs()), 1)
df.sort_values('Deviation', ascending=False).head()
| | # | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Deviation |
---|---|---|---|---|---|---|---|---|---|---|---|---|
230 | 213 | Shuckle | Bug | Rock | 505 | 20 | 10 | 230 | 10 | 230 | 5 | 215.0 |
121 | 113 | Chansey | Normal | NaN | 450 | 250 | 5 | 5 | 35 | 105 | 50 | 207.5 |
261 | 242 | Blissey | Normal | NaN | 540 | 255 | 10 | 10 | 75 | 135 | 55 | 190.0 |
333 | 306 | AggronMega Aggron | Steel | NaN | 630 | 70 | 140 | 230 | 60 | 80 | 50 | 155.0 |
224 | 208 | SteelixMega Steelix | Steel | Ground | 610 | 75 | 125 | 230 | 55 | 95 | 30 | 145.0 |
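As a sanity check, the deviation can be recomputed for the Shuckle and Chansey rows shown in the table, here with a vectorized expression instead of a row-wise apply. The frame stats is typed in by hand from those two rows; the vectorized form is an alternative sketch, not the solution's method.

```python
import pandas as pd

# Stats copied from the Shuckle and Chansey rows of the table above.
stats = pd.DataFrame(
    {
        'HP': [20, 250], 'Attack': [10, 5], 'Defense': [230, 5],
        'Sp. Atk': [10, 35], 'Sp. Def': [230, 105], 'Speed': [5, 50],
    },
    index=['Shuckle', 'Chansey'],
)

# Subtract each row's median, take absolute values, keep the row maximum.
row_median = stats.median(axis=1)
deviation = stats.sub(row_median, axis=0).abs().max(axis=1)

print(deviation.to_dict())
```

For Shuckle the median of the six stats is 15, and the farthest stat is 230, which gives 215.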
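In an expanding window, the built-in statistical functions can compute cumulative historical metrics, but they generally give every element in the window the same weight. Different weights can be assigned instead, and the exponentially weighted window is one such special expanding window. Its most important parameter is alpha, which by default sets the window weights to w_i = (1 − α)^i, i ∈ {0, 1, ..., t}, where i = 0 corresponds to the current element x_t and i = t to the first element of the series, x_0. The weight formula shows that the farther an element is from the current one, the smaller its weight. Writing the original series as x and the updated current element as y_t, normalizing the weighted sum gives: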
\begin{split}y_t &=\frac{\sum_{i=0}^{t} w_i x_{t-i}}{\sum_{i=0}^{t} w_i} \\ &=\frac{x_t + (1 - \alpha)x_{t-1} + (1 - \alpha)^2 x_{t-2} + \cdots + (1 - \alpha)^{t} x_{0}}{1 + (1 - \alpha) + (1 - \alpha)^2 + \cdots + (1 - \alpha)^{t}}\end{split}
np.random.seed(0)
s = pd.Series(np.random.randint(-1,2,30).cumsum())
s.head()
0 -1
1 -1
2 -2
3 -2
4 -2
dtype: int32
s.ewm(alpha=0.2).mean().head()
0 -1.000000
1 -1.000000
2 -1.409836
3 -1.609756
4 -1.725845
dtype: float64
Implement the same calculation with an expanding window.
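One possible approach, offered as a sketch rather than a reference solution, is to pass a custom function to expanding().apply that builds the weights (1 − α)^i for the current window. The helper name ewm_func is hypothetical, and the series is regenerated with the same seed under the name s_ewm so the block runs on its own.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
s_ewm = pd.Series(np.random.randint(-1, 2, 30).cumsum())

def ewm_func(x, alpha=0.2):
    # x is the expanding window as an ndarray, oldest element first,
    # so the oldest element gets the largest exponent and the newest gets 0.
    weights = (1 - alpha) ** np.arange(x.shape[0])[::-1]
    return (weights * x).sum() / weights.sum()

by_expanding = s_ewm.expanding().apply(ewm_func, raw=True)
by_ewm = s_ewm.ewm(alpha=0.2).mean()

print(by_expanding.head())
```

The comparison with ewm relies on its default adjust=True, which corresponds to the normalized formula given earlier.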
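As question 1 shows, ewm, being a special case of the expanding window, can only weight from the first element of the series onward. Suppose instead that a window size n is given, and only the most recent n elements, including the current one, form the window for sliding weighted smoothing. Using rolling-window functions, give the new update formulas for w_i and y_t, and implement this with a rolling window.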
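One way to read the requirement is to keep the same weights but truncate them to the window: w_i = (1 − α)^i for i ∈ {0, 1, ..., n−1}, and y_t = (Σ_{i=0}^{n−1} w_i x_{t−i}) / (Σ_{i=0}^{n−1} w_i). The sketch below implements this interpretation with rolling().apply; the helper name ewm_window, the window size n = 4, and the series name s_win are assumptions for illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
s_win = pd.Series(np.random.randint(-1, 2, 30).cumsum())

def ewm_window(x, alpha=0.2):
    # x holds the n most recent elements, oldest first; the newest gets weight 1.
    weights = (1 - alpha) ** np.arange(x.shape[0])[::-1]
    return (weights * x).sum() / weights.sum()

n = 4  # hypothetical window size
by_rolling = s_win.rolling(window=n).apply(ewm_window, raw=True)

print(by_rolling.head(6))
```

The first n − 1 results are NaN because the window is not yet full; at position n − 1 the window covers the whole history, so the value coincides with the ewm result there.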