This article uses the statistical techniques of the method of moments (MOM) and the generalized method of moments (GMM) to estimate database statistics from past data. Both MOM and GMM are parametric, unsupervised forms of learning. I restrict the examples here to Oracle Database, but the approach extends to any database system.
We first select a distribution to model our random variable. In this setting, the random variables are database statistics, i.e. any metric collected in an AWR report. To start the analysis, we collected AWR reports over several months, scraped them, and stored the metrics in CSV format. Some CSV files corresponding to these statistics are available on my github page.
Let's say we want to use the method of moments to estimate the statistics in the AWR Load Profile section, shown below.
%matplotlib inline
from IPython.display import Image
import pandas as pd
import random
import matplotlib.pyplot as plt
from scipy.stats.kde import gaussian_kde
from plotly import tools
import scipy.stats as scs
from scipy.stats import *
from scipy import stats
import plotly.plotly as py
from plotly.graph_objs import *
import numpy as np
pd.set_option('display.max_columns', 10)
Image(filename='images/load_profile.png')
Reading our historic load profile data into a dataframe:
df=pd.read_csv('data/TMP_AWR_LOAD_PROFILE_AGG.csv', sep='|',parse_dates=True)
df.head()
| | site_id | config_id | run_id | stat_id | stat_name | stat_per_sec | stat_per_txn | stat_per_exec | stat_per_call | start_time |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 37 | 1 | 1 | 265418 | Global Cache blocks received | 4467.21 | 16.36 | NaN | NaN | 03-OCT-14 08.00.00.000000 AM |
| 1 | 37 | 1 | 1 | 265427 | Transactions | 274.49 | NaN | NaN | NaN | 03-OCT-14 08.00.00.000000 AM |
| 2 | 37 | 1 | 1 | 265410 | Logical read (blocks) | 291880.19 | 1064.87 | NaN | NaN | 03-OCT-14 08.00.00.000000 AM |
| 3 | 37 | 1 | 1 | 265419 | Global Cache blocks served | 4467.02 | 16.34 | NaN | NaN | 03-OCT-14 08.00.00.000000 AM |
| 4 | 37 | 1 | 1 | 265421 | Parses (SQL) | 1131.04 | 4.14 | NaN | NaN | 03-OCT-14 08.00.00.000000 AM |
As we can see, we have data for several statistics (Global Cache blocks received, Transactions, Logical read (blocks), Global Cache blocks served, etc.) from the AWR load profile, spanning 03-OCT-14 through 02-NOV-15. I am interested in forecasting only the stat_per_sec values from this load profile, so using the pandas pivot_table function we get the following reshaped dataframe.
df=df.pivot_table(index='start_time', columns='stat_name', values='stat_per_sec')
df.head()
| start_time | Background CPU(s) | Block changes | DB CPU(s) | DB Time(s) | Executes (SQL) | ... | Session Logical Read IM | Transactions | User calls | Write IO (MB) | Write IO requests |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 01-APR-15 08.00.00.000000 AM | 2.34 | 7171.66 | 8.34 | 21.33 | 7295.85 | ... | NaN | 342.13 | 3172.36 | 7.38 | 791.23 |
| 01-APR-15 10.00.00.000000 PM | 1.88 | 5829.02 | 5.75 | 14.98 | 4943.24 | ... | NaN | 243.01 | 2288.04 | 5.59 | 603.88 |
| 01-JAN-15 08.00.00.000000 AM | NaN | 1746.34 | 1.80 | 4.07 | 2277.13 | ... | NaN | 139.59 | 970.47 | 1.61 | 162.10 |
| 01-JUL-15 08.00.00.000000 AM | 2.07 | 6530.64 | 6.92 | 18.46 | 6485.98 | ... | NaN | 334.05 | 3064.53 | 6.76 | 721.93 |
| 01-JUL-15 10.00.00.000000 PM | 1.80 | 5853.14 | 5.19 | 14.20 | 4787.97 | ... | NaN | 259.02 | 2370.28 | 5.70 | 606.48 |

5 rows × 24 columns
print (df.columns)
Index([u'Background CPU(s)', u'Block changes', u'DB CPU(s)', u'DB Time(s)',
u'Executes (SQL)', u'Global Cache blocks received',
u'Global Cache blocks served', u'Hard parses (SQL)', u'IM scan rows',
u'Logical read (blocks)', u'Logons', u'Parses (SQL)',
u'Physical read (blocks)', u'Physical write (blocks)', u'Read IO (MB)',
u'Read IO requests', u'Redo size (bytes)', u'Rollbacks',
u'SQL Work Area (MB)', u'Session Logical Read IM', u'Transactions',
u'User calls', u'Write IO (MB)', u'Write IO requests'],
dtype='object', name=u'stat_name')
All of these database statistics are continuous random variables, so we model them with continuous distributions. scipy offers many continuous distributions for this purpose, e.g. Gamma, Normal, Anglit, Arcsine, Beta, Cauchy, etc.
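As a quick aside (not part of the original analysis), one way to see which continuous distributions scipy.stats exposes is to look for rv_continuous instances:
# Illustrative only: list the continuous distributions available in scipy.stats
continuous_dists = [name for name in dir(scs)
                    if isinstance(getattr(scs, name), scs.rv_continuous)]
print(len(continuous_dists), continuous_dists[:6])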
Once we have an estimated model, we can use the Kolmogorov-Smirnov test (scipy.stats.kstest) to evaluate the fit; we apply this check after fitting the gamma parameters below. Let's start with the popular gamma distribution.
Say we are interested in the Background CPU(s) statistic; we compute the sample mean and variance and from them the gamma distribution parameters alpha and beta.
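For a gamma distribution with shape $\alpha$ and rate $\beta$, the first two moments are $E[X] = \alpha/\beta$ and $\mathrm{Var}(X) = \alpha/\beta^{2}$. Matching these to the sample mean $\bar{x}$ and sample variance $s^{2}$ gives the method-of-moments estimates

$$\hat{\alpha} = \frac{\bar{x}^{2}}{s^{2}}, \qquad \hat{\beta} = \frac{\bar{x}}{s^{2}},$$

which is exactly what the cells below compute (scipy's gamma is parameterized by a scale, so we pass scale = 1/beta).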
sample_mean=df['Background CPU(s)'].mean()
sample_var=np.var(df['Background CPU(s)'], ddof=1)
(sample_mean,sample_var)
(2.157353689567431, 0.23047410162538273)
Computing alpha and beta for the gamma distribution:
###############
#### Gamma ####
###############
alpha = sample_mean**2 / sample_var
beta = sample_mean / sample_var
alpha,beta
(20.19391727342622, 9.360503737092497)
Using the estimated parameters, we plot the fitted gamma distribution on top of the data.
gamma_rv = scs.gamma(a=alpha, scale=1/beta)
# Get the probability for each value in the data
x_vals = np.linspace(df['Background CPU(s)'].min(), df['Background CPU(s)'].max())
gamma_p = gamma_rv.pdf(x_vals)
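As mentioned above, we can sanity-check this fit with scipy.stats.kstest. A minimal sketch using the frozen gamma_rv defined above (this check is not part of the original notebook output):
# Kolmogorov-Smirnov goodness-of-fit check for the MOM-fitted gamma
# (a very small p-value would suggest the gamma model fits the data poorly)
sample = df['Background CPU(s)'].dropna()
ks_stat, p_value = scs.kstest(sample, gamma_rv.cdf)
print('KS statistic = %.4f, p-value = %.4f' % (ks_stat, p_value))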
# Generate random colormap
def rand_cmap(nlabels, type='bright', first_color_black=True, last_color_black=False, verbose=True):
    """
    Creates a random colormap to be used together with matplotlib. Useful for segmentation tasks
    :param nlabels: Number of labels (size of colormap)
    :param type: 'bright' for strong colors, 'soft' for pastel colors
    :param first_color_black: Option to use first color as black, True or False
    :param last_color_black: Option to use last color as black, True or False
    :param verbose: Prints the number of labels and shows the colormap. True or False
    :return: colormap for matplotlib
    """
    from matplotlib.colors import LinearSegmentedColormap
    import colorsys
    import numpy as np

    if type not in ('bright', 'soft'):
        print('Please choose "bright" or "soft" for type')
        return

    if verbose:
        print('Number of labels: ' + str(nlabels))

    # Generate color map for bright colors, based on hsv
    if type == 'bright':
        randHSVcolors = [(np.random.uniform(low=0.0, high=1),
                          np.random.uniform(low=0.2, high=1),
                          np.random.uniform(low=0.9, high=1)) for i in range(nlabels)]

        # Convert HSV list to RGB
        randRGBcolors = []
        for HSVcolor in randHSVcolors:
            randRGBcolors.append(colorsys.hsv_to_rgb(HSVcolor[0], HSVcolor[1], HSVcolor[2]))

        if first_color_black:
            randRGBcolors[0] = [0, 0, 0]
        if last_color_black:
            randRGBcolors[-1] = [0, 0, 0]

        random_colormap = LinearSegmentedColormap.from_list('new_map', randRGBcolors, N=nlabels)

    # Generate soft pastel colors, by limiting the RGB spectrum
    if type == 'soft':
        low = 0.6
        high = 0.95
        randRGBcolors = [(np.random.uniform(low=low, high=high),
                          np.random.uniform(low=low, high=high),
                          np.random.uniform(low=low, high=high)) for i in range(nlabels)]

        if first_color_black:
            randRGBcolors[0] = [0, 0, 0]
        if last_color_black:
            randRGBcolors[-1] = [0, 0, 0]

        random_colormap = LinearSegmentedColormap.from_list('new_map', randRGBcolors, N=nlabels)

    # Display colorbar
    if verbose:
        from matplotlib import colors, colorbar
        from matplotlib import pyplot as plt
        fig, ax = plt.subplots(1, 1, figsize=(15, 0.5))
        bounds = np.linspace(0, nlabels, nlabels + 1)
        norm = colors.BoundaryNorm(bounds, nlabels)
        cb = colorbar.ColorbarBase(ax, cmap=random_colormap, norm=norm, spacing='proportional', ticks=None,
                                   boundaries=bounds, format='%1i', orientation='horizontal')

    return random_colormap
bright_cmap = rand_cmap(50, type='bright', first_color_black=False, last_color_black=True, verbose=True)
Number of labels: 50
soft_cmap = rand_cmap(50, type='soft', first_color_black=False, last_color_black=True, verbose=True)
Number of labels: 50
'''Use the estimated parameters and plot the distribution on top of data'''
fig, ax = plt.subplots()
scol=soft_cmap(random.choice(range(50)))
bcol1=bright_cmap(random.choice(range(50)))
bcol2=bright_cmap(random.choice(range(50)))
# Plot those values on top of the real data
ax = df['Background CPU(s)'].hist(bins=20, density=True, color=bcol1, figsize=(10, 7))
ax.set_xlabel('Statistics')
ax.set_ylabel('Probability Density')
ax.set_title('Background CPU(s)')
scol2=bright_cmap(random.choice(range(50)))
ax.plot(x_vals, gamma_p, c=bcol2, label='Gamma')
ax.set_facecolor(scol)
ax.legend()
#py.iplot_mpl(fig, filename='s6_log-scales')
<matplotlib.legend.Legend at 0x11184a3d0>
# Let's put everything into a function
# Define a function that fits a gamma distribution by MOM to one statistic and plots it over the data
def plot_mom(df, col):
    sample_mean = df[col].mean()
    sample_var = np.var(df[col], ddof=1)
    alpha = sample_mean**2 / sample_var
    beta = sample_mean / sample_var
    gamma_rv = scs.gamma(a=alpha, scale=1/beta)
    # Evaluate the fitted density over the observed range of the data
    x_vals = np.linspace(df[col].min(), df[col].max())
    gamma_p = gamma_rv.pdf(x_vals)
    fig, ax = plt.subplots()
    # Plot those values on top of the real data
    scol = soft_cmap(random.choice(range(50)))
    bcol1 = bright_cmap(random.choice(range(50)))
    bcol2 = bright_cmap(random.choice(range(50)))
    ax = df[col].hist(bins=20, density=True, color=bcol1, figsize=(10, 7))
    ax.set_facecolor(scol)
    ax.set_xlabel('Statistics')
    ax.set_ylabel('Probability Density')
    ax.set_title(col)
    ax.plot(x_vals, gamma_p, color=bcol2, label='Predicted Values')
    ax.legend()
    # Return the figure so callers can save it to disk
    return fig
df.columns
Index([u'Background CPU(s)', u'Block changes', u'DB CPU(s)', u'DB Time(s)',
u'Executes (SQL)', u'Global Cache blocks received',
u'Global Cache blocks served', u'Hard parses (SQL)', u'IM scan rows',
u'Logical read (blocks)', u'Logons', u'Parses (SQL)',
u'Physical read (blocks)', u'Physical write (blocks)', u'Read IO (MB)',
u'Read IO requests', u'Redo size (bytes)', u'Rollbacks',
u'SQL Work Area (MB)', u'Session Logical Read IM', u'Transactions',
u'User calls', u'Write IO (MB)', u'Write IO requests'],
dtype='object', name=u'stat_name')
Let's just pick the following important statistics to estimate:

* 'Background CPU(s)'
* 'DB CPU(s)'
* 'Executes (SQL)'
* 'User calls'
* 'Hard parses (SQL)'
* 'Logical read (blocks)'
* 'Parses (SQL)'
* 'Physical read (blocks)'
* 'Physical write (blocks)'
columns_of_interest=['Background CPU(s)', 'DB CPU(s)', 'Executes (SQL)', 'User calls', 'Hard parses (SQL)', 'Logical read (blocks)', 'Parses (SQL)', 'Physical read (blocks)', 'Physical write (blocks)']
for col in columns_of_interest:
    fig = plot_mom(df, col)
    fig.savefig('images/' + col)
The above function generates the MOM estimate for any database statistic we want. For example, to estimate "Global Cache blocks received", we just call the function:
fig = plot_mom(df, 'Global Cache blocks received')
fig.savefig('images/cache_blocks.png')
For the "Rollbacks" statistic, the estimate is:
fig = plot_mom(df, 'Rollbacks')
fig.savefig('images/rollbacks.png')
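One possible use of these fitted models (not covered in the original analysis) is deriving planning thresholds, e.g. a 95th-percentile value from the gamma fitted to Background CPU(s) earlier:
# Hypothetical capacity-planning use of the MOM-fitted gamma from the earlier cells
p95 = gamma_rv.ppf(0.95)
print('Estimated 95th percentile of Background CPU(s): %.2f' % p95)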
As we can see, our estimated distributions (the overlaid density curves) track the observed values quite well. This concludes our discussion of MOM estimators for database statistics.