MMA is a rapidly growing sport in which participants, referred to as fighters, utilize various martial arts such as jiu-jitsu, Muay Thai, kickboxing, wrestling, judo, karate, taekwondo, and many more in an attempt to beat their opponent by decision (unanimous, split, or majority; UD/SD/MD), knockout/technical knockout (KO/TKO), or submission (SUB). A fight goes to a decision when no finish (KO/TKO or SUB) occurs within the scheduled length of the fight, in which case the winner is chosen based on the scorecards of a panel of three judges. A unanimous decision, as the name suggests, is when all three judges agree on the winner; a split decision is when two of the three judges agree and the third scores the fight for the losing fighter; and a majority decision is when two judges agree on a winner and the third scores it a draw. Although rare, a draw can also occur, for example when all three judges score the fight evenly, or when two judges score the fight for opposing fighters and the third scores it a draw (a split draw).
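To make the scoring rules above concrete, here is a minimal sketch that classifies the outcome from three judges' picks. The function name `classify_decision` and the labels `'A'`, `'B'`, and `'draw'` are my own illustrative choices, not anything from UFCStats, and every kind of draw is lumped under a single label.

```python
from collections import Counter

def classify_decision(cards):
    """Classify a decision from three judges' picks.

    Each entry in `cards` is 'A', 'B', or 'draw' (hypothetical labels).
    Returns (winner, decision_type); winner is None when the fight is a draw.
    """
    if len(cards) != 3:
        raise ValueError("expected exactly three scorecards")
    votes = Counter(cards)
    a, b, d = votes['A'], votes['B'], votes['draw']
    for fighter, mine, theirs in (('A', a, b), ('B', b, a)):
        if mine == 3:
            return fighter, 'unanimous decision'
        if mine == 2 and theirs == 1:
            return fighter, 'split decision'
        if mine == 2 and d == 1:
            return fighter, 'majority decision'
    # No fighter has two or more cards, so the bout is a draw of some kind.
    return None, 'draw'

print(classify_decision(['A', 'A', 'A']))
print(classify_decision(['A', 'B', 'A']))
print(classify_decision(['B', 'draw', 'B']))
print(classify_decision(['A', 'B', 'draw']))
```

The loop checks both fighters with the same three rules, so the logic stays symmetric and any scorecard combination without a two-card majority falls through to the draw case.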
The UFC is the most popular mixed martial arts (MMA) promotion and has gone from a taboo cage-fighting organization to a world-renowned and respected sports-entertainment corporation. As in any other sport, statistical analysis and data science play a huge role not only in determining betting odds, props, and moneylines for events, but also in the UFC matchmakers' job of setting up fights.
The aim of this tutorial is to organize and analyze UFC statistics in order to see what qualities or attributes may contribute and correlate to winning fights the most.
Below are all the imports used throughout this tutorial. Documentation for the third-party libraries can be found on their respective websites: pandas, numpy, scipy, matplotlib, scikit-learn, xgboost, and seaborn. In addition, the popular matplotlib style "ggplot", which mimics the ggplot style in R, is used for aesthetic sugar.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from scipy.stats import f
import seaborn as sns
from sklearn import model_selection
from sklearn import linear_model
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split, KFold, RandomizedSearchCV
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier as xgb
from IPython.display import display, FileLink
from datetime import datetime
import random
import re
import warnings
warnings.filterwarnings('ignore')
plt.style.use('ggplot')
The data I'll be utilizing was obtained by scraping the statistics found on ufcstats.com. Pre-assembled data sets can be found across the internet, but many of them aren't completely up-to-date, whereas scraping the data myself let me make sure it was current. Using scrapy and a couple of custom spider scripts, I was able to get two data sets, one representing fighter stats by individual and the other representing fight stats by fight card (also referred to as event). Both of these CSV files, as well as the zipped folders containing the scrapy scripts, instructions, etc., can be found below. Since the instructions for scraping the data are in the README files of the spiders, they won't be covered separately in this tutorial.
display(FileLink('fighters.csv', result_html_prefix="Fighter Data: "))
display(FileLink('fightCards.csv', result_html_prefix="Fight Card Data: "))
display(FileLink('fighterSpider.zip', result_html_prefix="Fighter Spider: "))
display(FileLink('fightcardsSpider.zip', result_html_prefix="Fight Card Spider: "))
Fighter Data: fighters.csv
Fight Card Data: fightCards.csv
Fighter Spider: fighterSpider.zip
Fight Card Spider: fightcardsSpider.zip
We first start with preprocessing the data, which includes cleaning and reorganizing the data as a pandas dataframe so that it's more readable, accessible, and so that it only contains the most relevant information we are going to utilize. Below you can see the first five rows of each dataframe along with the column headers.
df_fighters = pd.read_csv('fighters.csv')
df_cards = pd.read_csv('fightCards.csv')
display(df_fighters.head())
display(df_cards.head())
| | DoB | SApM | SLpM | height | name | reach | record | stance | strAcc | strDef | subAvg | tdAcc | tdAvg | tdDef | weight |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Nov 24 1985 | 3.72 | 1.65 | 5' 9" | John Gunther | 72.0 | 5-1-0 | Orthodox | 37% | 46% | 0.0 | 42% | 7.08 | 0% | 155 |
1 | Jul 05 1995 | 2.80 | 1.93 | 6' 0" | Joe Giannetti | 74.0 | 6-1-0 | Southpaw | 38% | 40% | 0.0 | 16% | 1.00 | 0% | 155 |
2 | Aug 25 1974 | 0.92 | 0.92 | 5' 8" | Allen Berube | NaN | 4-3-0 | Orthodox | 80% | 33% | 3.4 | 100% | 6.87 | 0% | 155 |
3 | Nov 27 1991 | 4.49 | 3.80 | 5' 11" | Daichi Abe | 71.0 | 6-2-0 | Orthodox | 33% | 56% | 0.0 | 50% | 0.33 | 0% | 170 |
4 | Jun 26 1996 | 6.18 | 6.43 | 5' 7" | Diana Belbita | 68.0 | 14-7-0 | Orthodox | 42% | 50% | 0.0 | 50% | 0.63 | 68% | 115 |
| | card_name | f1 | f1_sig_strike_per | f1_sig_strike_total | f1_td_attempt | f1_td_succeed | f2 | f2_sig_strike_per | f2_sig_strike_total | f2_td_attempt | f2_td_succeed | fight_date | fights_location | round_format | round_fought | weight_class | winner | winning_method |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | UFC Fight Night: Gane vs. Volkov | Charles Rosa | 28% | 182 | 2 | 2 | Justin Jaynes | 47% | 92 | 2 | 2 | June 26 2021 | Las Vegas, Nevada, USA | 3 | 3 | Featherweight | Charles Rosa | S-DEC |
1 | UFC Fight Night: Gane vs. Volkov | Damir Hadzovic | 47% | 219 | 2 | 2 | Yancy Medeiros | 51% | 237 | 3 | 2 | June 26 2021 | Las Vegas, Nevada, USA | 3 | 3 | Lightweight | Damir Hadzovic | U-DEC |
2 | UFC Fight Night: Font vs. Garbrandt | Damir Ismagulov | 47% | 63 | 1 | 0 | Rafael Alves | 44% | 126 | 3 | 2 | May 22 2021 | Las Vegas, Nevada, USA | 3 | 3 | Lightweight | Damir Ismagulov | U-DEC |
3 | UFC Fight Night: Gane vs. Volkov | Julia Avila | 52% | 91 | 4 | 1 | Julija Stoliarenko | 42% | 94 | 3 | 1 | June 26 2021 | Las Vegas, Nevada, USA | 3 | 3 | Women's Bantamweight | Julia Avila | SUB |
4 | UFC Fight Night: Hall vs. Strickland | Jinh Yu Frey | 47% | 185 | 1 | 0 | Ashley Yoder | 38% | 236 | 8 | 0 | July 31 2021 | Las Vegas, Nevada, USA | 3 | 3 | Women's Strawweight | Jinh Yu Frey | U-DEC |
Let's go over what each of these column headers means:
Fighter Data | Description | Fight Card Data | Description |
---|---|---|---|
DoB | Date of birth | card_name | Name of the card/event |
SApM | Significant strikes absorbed per minute | f1 | Fighter 1 |
SLpM | Significant strikes landed per minute | f1_sig_strike_per | Fighter 1's significant strike percentage |
height | Height of fighter | f1_sig_strike_total | Fighter 1's total significant strikes |
name | Name of fighter | f1_td_attempt | # of takedown attempts from fighter 1 |
reach | Wingspan (inches) | f1_td_succeed | # of successful takedowns from fighter 1 |
record | Professional fight record | f2 | Fighter 2 |
stance | Fighter's preferred stance | f2_sig_strike_per | Fighter 2's significant strike percentage |
strAcc | Significant striking accuracy | f2_sig_strike_total | Fighter 2's total significant strikes |
strDef | Significant strike defence | f2_td_attempt | # of takedown attempts from fighter 2 |
subAvg | Average submissions attempted per 15 minutes | f2_td_succeed | # of successful takedowns from fighter 2 |
tdAcc | Takedown accuracy | fight_date | Date of fight |
tdAvg | Average takedowns landed per 15 minutes | fights_location | Location of fight |
tdDef | Takedown defence (% of opponent TD attempts that did not succeed) | round_format | Max # of rounds to be fought |
weight | Most recently fought weight class (lbs) | round_fought | # of rounds fought |
| | | weight_class | Weight class of bout |
| | | winner | Winning fighter |
| | | winning_method | Method of victory |
Now let's see if there's any missing data we need to take care of.
df_fighters.isnull().sum()
DoB 0
SApM 0
SLpM 0
height 0
name 0
reach 1919
record 0
stance 819
strAcc 0
strDef 0
subAvg 0
tdAcc 0
tdAvg 0
tdDef 0
weight 0
dtype: int64
As you can see, there are a ton of missing values for reach and stance. We will get to this in a moment, but let's check the other dataframe also.
df_cards.isnull().sum()
card_name 0
f1 0
f1_sig_strike_per 0
f1_sig_strike_total 0
f1_td_attempt 0
f1_td_succeed 0
f2 0
f2_sig_strike_per 0
f2_sig_strike_total 0
f2_td_attempt 0
f2_td_succeed 0
fight_date 0
fights_location 0
round_format 0
round_fought 0
weight_class 0
winner 0
winning_method 0
dtype: int64
Perfect! There aren't any missing values we have to tend to in the fight cards dataframe. Finally, let's check to see if there are any fighters with the same name that we should differentiate between.
df_fighters[df_fighters.duplicated(subset='name', keep=False)]
| | DoB | SApM | SLpM | height | name | reach | record | stance | strAcc | strDef | subAvg | tdAcc | tdAvg | tdDef | weight |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
281 | Feb 06 1965 | 0.40 | 0.00 | 5' 11" | Michael McDonald | NaN | 1-1-0 | Orthodox | 0% | 50% | 0.0 | 0% | 0.00 | 0% | 205 |
615 | Jan 15 1991 | 2.76 | 2.69 | 5' 9" | Michael McDonald | 70.0 | 17-4-0 | Orthodox | 42% | 57% | 1.4 | 66% | 1.09 | 52% | 135 |
1384 | -- | 4.73 | 2.00 | 6' 1" | Tony Johnson | NaN | 11-3-0 | NaN | 53% | 31% | 0.0 | 22% | 2.00 | 0% | 265 |
1389 | May 02 1983 | 3.67 | 4.00 | 6' 2" | Tony Johnson | 76.0 | 7-2-0 | Orthodox | 92% | 22% | 0.0 | 0% | 0.00 | 90% | 205 |
2276 | Oct 07 1992 | 6.20 | 5.83 | 6' 0" | Mike Davis | 72.0 | 10-2-0 | Orthodox | 52% | 56% | 0.2 | 53% | 3.04 | 69% | 155 |
2393 | -- | 0.00 | 0.00 | -- | Mike Davis | NaN | 2-0-0 | NaN | 0% | 0% | 0.0 | 0% | 0.00 | 0% | -- |
2720 | Aug 29 1989 | 3.33 | 3.73 | 5' 10" | Joey Gomez | 71.0 | 7-1-0 | Orthodox | 49% | 50% | 0.0 | 28% | 2.00 | 0% | 155 |
2881 | Jul 21 1986 | 4.46 | 2.44 | 5' 10" | Joey Gomez | 73.0 | 6-2-0 | Orthodox | 28% | 55% | 0.0 | 100% | 0.62 | 50% | 135 |
3239 | Mar 16 1990 | 3.23 | 2.98 | 5' 4" | Bruno Silva | 65.0 | 12-5-2 (1 NC) | Orthodox | 46% | 58% | 0.0 | 31% | 2.89 | 64% | 125 |
3364 | Jul 13 1989 | 4.58 | 4.31 | 6' 0" | Bruno Silva | 74.0 | 22-8-0 | Orthodox | 48% | 44% | 0.0 | 18% | 0.66 | 71% | 185 |
As you can see, there are five different names across these ten fighters. Let's fix this issue first. We'll add the weight class to one of every two of these fighters to differentiate them and avoid duplicates. Since the Mike Davis entry with a record of 2-0-0 is missing all of the associated data, we'll just drop him.
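# Match on name and weight instead of hard-coded row positions
for name, weight in [('Michael McDonald', '135'), ('Tony Johnson', '265'), ('Joey Gomez', '155'), ('Bruno Silva', '185')]:
    mask = (df_fighters['name'] == name) & (df_fighters['weight'].astype(str) == weight)
    df_fighters.loc[mask, 'name'] = name + ' ' + weight
# Drop the Mike Davis entry that has no data (its weight is "--")
empty_davis = (df_fighters['name'] == 'Mike Davis') & (df_fighters['weight'].astype(str) == '--')
df_fighters.drop(df_fighters[empty_davis].index, inplace=True)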
To make things arithmetically easier, we'll convert all the percentages (string objects) to decimal values.
def per2dec(df, columns):
    for col in columns:
        df[col] = df[col].str.strip('%')
        df[col] = pd.to_numeric(df[col]) / 100
per2dec(df_fighters, ['strAcc', 'strDef', 'tdAcc', 'tdDef'])
df_fighters.head()
| | DoB | SApM | SLpM | height | name | reach | record | stance | strAcc | strDef | subAvg | tdAcc | tdAvg | tdDef | weight |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Nov 24 1985 | 3.72 | 1.65 | 5' 9" | John Gunther | 72.0 | 5-1-0 | Orthodox | 0.37 | 0.46 | 0.0 | 0.42 | 7.08 | 0.00 | 155 |
1 | Jul 05 1995 | 2.80 | 1.93 | 6' 0" | Joe Giannetti | 74.0 | 6-1-0 | Southpaw | 0.38 | 0.40 | 0.0 | 0.16 | 1.00 | 0.00 | 155 |
2 | Aug 25 1974 | 0.92 | 0.92 | 5' 8" | Allen Berube | NaN | 4-3-0 | Orthodox | 0.80 | 0.33 | 3.4 | 1.00 | 6.87 | 0.00 | 155 |
3 | Nov 27 1991 | 4.49 | 3.80 | 5' 11" | Daichi Abe | 71.0 | 6-2-0 | Orthodox | 0.33 | 0.56 | 0.0 | 0.50 | 0.33 | 0.00 | 170 |
4 | Jun 26 1996 | 6.18 | 6.43 | 5' 7" | Diana Belbita | 68.0 | 14-7-0 | Orthodox | 0.42 | 0.50 | 0.0 | 0.50 | 0.63 | 0.68 | 115 |
All of our fighter stat percentages are now decimals! Next, for all the fighters missing a substantial amount of data, which we will define as having 0 strDef, tdAvg, tdAcc, tdDef, and subAvg, we will simply remove them.
df_fighters_clean = df_fighters.loc[~(
(df_fighters["strDef"] == 0) &
(df_fighters["tdAvg"] == 0) &
(df_fighters["tdAcc"] == 0) &
(df_fighters["tdDef"] == 0) &
(df_fighters["subAvg"] == 0))].copy()
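The chained `&` conditions above can also be written more compactly. Below is a small sketch on a toy dataframe (the values are made up for illustration) that builds the same "every stat is zero" mask with `.all(axis=1)`.

```python
import pandas as pd

# Toy data (hypothetical values): only fighter "B" has zeros in all five columns.
toy = pd.DataFrame({
    'name':   ['A', 'B', 'C'],
    'strDef': [0.46, 0.0, 0.0],
    'tdAvg':  [7.08, 0.0, 1.0],
    'tdAcc':  [0.42, 0.0, 0.0],
    'tdDef':  [0.00, 0.0, 0.0],
    'subAvg': [0.0,  0.0, 0.0],
})

stat_cols = ['strDef', 'tdAvg', 'tdAcc', 'tdDef', 'subAvg']
# True only for rows where every listed stat equals 0.
all_zero = (toy[stat_cols] == 0).all(axis=1)
toy_clean = toy.loc[~all_zero].copy()
print(toy_clean['name'].tolist())
```

Fighter "C" survives because one stat is non-zero, which is the same behavior as the `&` chain: a row is only dropped when all five conditions hold at once.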
Something to be mindful of is noise in your data. Oftentimes, a fighter with no DoB on their statistics page has only fought once in the UFC and is no longer with the organization, and/or fought in the very early days of the promotion. The sport has changed drastically since the '90s and early 2000s, so keeping these fighters in the dataframe may add more noise than benefit. While we're at it, we'll reduce the birth date entries to just the year the fighter was born, since the exact month and day aren't significant.
df_fighters_clean = df_fighters_clean[~(df_fighters_clean['DoB'] == '--')].copy()
def get_birth_year(dob):
    return datetime.strptime(dob, '%b %d %Y').year
df_fighters_clean['birth_year'] = df_fighters_clean['DoB'].apply(get_birth_year)
df_fighters_clean.drop(['DoB'], inplace=True, axis=1)
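As an alternative to parsing row by row, pandas can parse the whole column at once. This is only a sketch on three sample strings, and it assumes you would rather turn unparseable entries such as "--" into missing values than filter them out beforehand.

```python
import pandas as pd

# Sample values; '--' stands in for a missing date of birth.
dob = pd.Series(['Nov 24 1985', 'Jul 05 1995', '--'])

# errors='coerce' turns unparseable entries into NaT instead of raising.
parsed = pd.to_datetime(dob, format='%b %d %Y', errors='coerce')

# The year comes back as a float column here because the NaT row becomes NaN.
birth_year = parsed.dt.year
print(birth_year.tolist())
```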
Another minor tweak we'll make is setting the index of the dataframe as the name column.
df_fighters_clean.set_index('name', inplace=True)
df_fighters_clean.head()
name | SApM | SLpM | height | reach | record | stance | strAcc | strDef | subAvg | tdAcc | tdAvg | tdDef | weight | birth_year |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
John Gunther | 3.72 | 1.65 | 5' 9" | 72.0 | 5-1-0 | Orthodox | 0.37 | 0.46 | 0.0 | 0.42 | 7.08 | 0.00 | 155 | 1985 |
Joe Giannetti | 2.80 | 1.93 | 6' 0" | 74.0 | 6-1-0 | Southpaw | 0.38 | 0.40 | 0.0 | 0.16 | 1.00 | 0.00 | 155 | 1995 |
Allen Berube | 0.92 | 0.92 | 5' 8" | NaN | 4-3-0 | Orthodox | 0.80 | 0.33 | 3.4 | 1.00 | 6.87 | 0.00 | 155 | 1974 |
Daichi Abe | 4.49 | 3.80 | 5' 11" | 71.0 | 6-2-0 | Orthodox | 0.33 | 0.56 | 0.0 | 0.50 | 0.33 | 0.00 | 170 | 1991 |
Diana Belbita | 6.18 | 6.43 | 5' 7" | 68.0 | 14-7-0 | Orthodox | 0.42 | 0.50 | 0.0 | 0.50 | 0.63 | 0.68 | 115 | 1996 |
Now finally we'll revisit the problem of missing values we had before. Reach is a pretty crucial physical trait that can have a big effect on the outcome of a fight. We need to figure out a way to deal with these missing values.
As with other physical characteristics like foot size, hand size, inseam, etc., wingspan has a lot to do with the height of the individual. While not perfect, we can estimate a reach value that's as statistically probable as possible by calculating the median reaches of every height (5'0", 5'1", ...) and attributing that median to the reach value of each fighter that's missing said value.
median_wingspans = df_fighters_clean.groupby('height')['reach'].median()
display(median_wingspans)
df_fighters_clean['reach'] = df_fighters_clean['reach'].fillna(df_fighters_clean['height'].map(median_wingspans))
print("There are still {} missing reach values".format(df_fighters_clean['reach'].isna().sum()))
height
-- 70.0
5' 0" 61.5
5' 1" 62.0
5' 10" 72.0
5' 11" 73.0
5' 2" 63.0
5' 3" 64.0
5' 4" 65.0
5' 5" 66.0
5' 6" 67.0
5' 7" 69.0
5' 8" 70.0
5' 9" 71.0
6' 0" 74.0
6' 1" 75.0
6' 10" NaN
6' 11" 84.0
6' 2" 75.0
6' 3" 77.0
6' 4" 78.0
6' 5" 79.0
6' 6" 79.0
6' 7" 80.0
6' 8" 80.0
7' 2" NaN
7' 5" NaN
Name: reach, dtype: float64
There are still 5 missing reach values
As seen from the output above, we still have 5 missing reach values, a substantial decrease from the 1919 we started with. Since 5 is negligible relative to the total number of entries we have, we can just drop these entries.
df_fighters_clean.dropna(subset=['reach'], inplace=True)
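The same imputation can be done in one step with `groupby(...).transform('median')`, which returns a value per row instead of a value per group. Here is a sketch on toy data (made-up reaches) that also shows why a few values stay missing: a height group with no known reach has no median to borrow.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'height': ['5\' 9"', '5\' 9"', '5\' 9"', '6\' 0"', '6\' 0"', '7\' 2"'],
    'reach':  [70.0, 72.0, np.nan, 74.0, np.nan, np.nan],
})

# transform('median') returns one value per row, aligned to the original index.
group_median = toy.groupby('height')['reach'].transform('median')
toy['reach_filled'] = toy['reach'].fillna(group_median)
print(toy)
```

The lone 7' 2" row stays NaN after the fill, which mirrors the handful of leftover missing values seen above.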
As for the missing stance values, stance carries less weight in determining the outcome of a fight and is normally just a matter of fighter preference. We'll replace these by first computing each stance's share of the total.
display(df_fighters_clean.groupby('stance')['stance'].count())
stance_total = df_fighters_clean.groupby('stance')['stance'].count().sum()
print("There are {} total stance entries".format(stance_total))
display(df_fighters_clean.groupby('stance')['stance'].count() / stance_total)
stance
Open Stance 5
Orthodox 2060
Sideways 1
Southpaw 477
Switch 154
Name: stance, dtype: int64
There are 2697 total stance entries
stance
Open Stance 0.001854
Orthodox 0.763812
Sideways 0.000371
Southpaw 0.176863
Switch 0.057100
Name: stance, dtype: float64
We've calculated the decimal values that represent the percentage of each stance's frequency relative to the total. Using those as weights, we'll replace the missing stance values with one of the existing stances based on those decimal percentages.
As a demo, you can see a list of 50 random choices. Unsurprisingly, orthodox comes up the most frequently, with some southpaw, switch, and the occasional open stance if RNG permits.
stance_list = ["Open Stance", "Orthodox", "Sideways", "Southpaw", "Switch"]
weight_list = [0.001854, 0.763812, 0.000371, 0.176863, 0.057100]
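# Draw one stance per missing row so the fills follow the observed frequencies
missing_stance = df_fighters_clean['stance'].isna()
df_fighters_clean.loc[missing_stance, 'stance'] = random.choices(stance_list, weights=weight_list, k=int(missing_stance.sum()))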
print(random.choices(stance_list, weights=weight_list, k = 50))
['Southpaw', 'Orthodox', 'Southpaw', 'Switch', 'Southpaw', 'Orthodox', 'Orthodox', 'Southpaw', 'Switch', 'Orthodox', 'Orthodox', 'Southpaw', 'Orthodox', 'Orthodox', 'Orthodox', 'Orthodox', 'Orthodox', 'Orthodox', 'Orthodox', 'Southpaw', 'Orthodox', 'Orthodox', 'Orthodox', 'Orthodox', 'Southpaw', 'Southpaw', 'Southpaw', 'Orthodox', 'Orthodox', 'Southpaw', 'Orthodox', 'Orthodox', 'Orthodox', 'Orthodox', 'Orthodox', 'Orthodox', 'Orthodox', 'Southpaw', 'Orthodox', 'Orthodox', 'Southpaw', 'Southpaw', 'Orthodox', 'Orthodox', 'Orthodox', 'Orthodox', 'Orthodox', 'Orthodox', 'Orthodox', 'Orthodox']
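Here is a self-contained sketch of frequency-weighted imputation on a toy column. It assumes you want an independent draw for every missing row and a fixed seed for repeatability; the seed value and the use of numpy's `Generator.choice` in place of `random.choices` are my own choices for the sketch.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed (an assumption) so the fill is repeatable

stance = pd.Series(['Orthodox', 'Southpaw', None, 'Orthodox', None, 'Switch', None])

# Observed frequencies among the non-missing values; they sum to 1.
freq = stance.value_counts(normalize=True)
missing = stance.isna()

# One independent weighted draw per missing row.
draws = rng.choice(freq.index.to_numpy(), size=int(missing.sum()), p=freq.to_numpy())
stance_filled = stance.copy()
stance_filled[missing] = draws
print(stance_filled.tolist())
```

Computing the weights from `value_counts(normalize=True)` avoids copying the decimals by hand, so the weights always match the data they are drawn from.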
Wonderful! We have no more missing values, right? Let's check to be sure.
display(df_fighters_clean.isnull().sum())
df_fighters_clean[df_fighters_clean['height'] == '--']
SApM 0
SLpM 0
height 0
reach 0
record 0
stance 0
strAcc 0
strDef 0
subAvg 0
tdAcc 0
tdAvg 0
tdDef 0
weight 0
birth_year 0
dtype: int64
name | SApM | SLpM | height | reach | record | stance | strAcc | strDef | subAvg | tdAcc | tdAvg | tdDef | weight | birth_year |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Amador Ramirez | 2.07 | 4.93 | -- | 70.0 | 5-4-0 | Orthodox | 0.51 | 0.69 | 0.0 | 0.33 | 1.00 | 0.00 | 135 | 1990 |
Matt Ricehouse | 4.80 | 3.70 | -- | 70.0 | 6-1-0 | Orthodox | 0.44 | 0.47 | 0.0 | 0.22 | 1.00 | 0.81 | 155 | 1987 |
Logan Nail | 2.27 | 1.93 | -- | 70.0 | 1-1-0 | Orthodox | 0.51 | 0.39 | 0.0 | 0.00 | 0.00 | 0.37 | 185 | 1989 |
Lee Higgins | 3.68 | 1.02 | -- | 70.0 | 2-1-0 | Orthodox | 0.26 | 0.40 | 0.0 | 0.00 | 0.00 | 0.00 | 155 | 1980 |
Hiroshi Izumi | 2.65 | 1.95 | -- | 70.0 | 4-2-0 | Orthodox | 0.37 | 0.66 | 0.5 | 0.70 | 3.35 | 1.00 | 205 | 1982 |
Neal Ewing | 1.93 | 2.27 | -- | 70.0 | 6-0-0 | Orthodox | 0.60 | 0.48 | 0.0 | 0.62 | 5.00 | 0.00 | 185 | 1985 |
TJ Cook | 3.18 | 2.30 | -- | 70.0 | 13-5-0 | Orthodox | 0.47 | 0.54 | 0.0 | 0.50 | 1.01 | 0.00 | 205 | 1982 |
Joe Duarte | 4.00 | 2.27 | -- | 70.0 | 10-4-0 | Orthodox | 0.38 | 0.53 | 1.0 | 0.50 | 3.00 | 0.69 | 155 | 1977 |
Billy Goff | 4.15 | 9.95 | -- | 70.0 | 8-2-0 | Switch | 0.45 | 0.59 | 0.0 | 0.50 | 8.29 | 1.00 | 170 | 1998 |
Edward Faaloloto | 6.25 | 2.28 | -- | 70.0 | 2-5-0 | Orthodox | 0.32 | 0.44 | 0.0 | 0.25 | 1.01 | 0.33 | 155 | 1984 |
Bryan Travers | 3.93 | 2.33 | -- | 70.0 | 15-4-0 | Orthodox | 0.48 | 0.55 | 0.0 | 0.28 | 2.00 | 0.63 | 155 | 1983 |
Maka Watson | 1.60 | 0.93 | -- | 70.0 | 4-2-0 | Orthodox | 0.37 | 0.22 | 0.0 | 1.00 | 2.00 | 0.33 | 155 | 1984 |
It turns out we still have some missing height data that wasn't caught before because it's replaced with "--" strings. After some research, most of these fighters can be found to have a height of 5'7", so we'll simply use that number to replace the few missing values we have. We'll then convert height from feet and inches to centimeters since that's a more convenient metric to mathematically work with.
df_fighters_clean['height'] = df_fighters_clean['height'].replace({"--": "5' 7\""})
# Convert a height string such as 5' 9" to centimeters
def convert_to_cm(height):
    if pd.isna(height):
        return height
    elif len(height.split("'")) == 2:
        feet = float(height.split("'")[0])
        inches = int(height.split("'")[1].replace(' ', '').replace('"',''))
        return (feet * 30.48) + (inches * 2.54)
    else:
        return float(height.replace('"','')) * 2.54
df_fighters_clean['height'] = df_fighters_clean['height'].apply(convert_to_cm)
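As a quick worked check of the conversion, here is a regex-based sketch of the same arithmetic. The helper name `height_to_cm` is hypothetical, and unlike the function above it returns `None` for anything that doesn't look like feet and inches.

```python
import re

# Matches strings such as 5' 9" (feet, apostrophe, inches, double quote).
HEIGHT_RE = re.compile(r"""^\s*(\d+)'\s*(\d+)"\s*$""")

def height_to_cm(height):
    """Convert a string like 5' 9" to centimeters; return None if it does not match."""
    match = HEIGHT_RE.match(height)
    if match is None:
        return None
    feet, inches = int(match.group(1)), int(match.group(2))
    # 12 inches per foot, 2.54 cm per inch.
    return round((feet * 12 + inches) * 2.54, 2)

print(height_to_cm("5' 9\""))   # 175.26
print(height_to_cm("6' 0\""))   # 182.88
print(height_to_cm("--"))       # None
```

For 5' 9" that is 69 inches × 2.54 = 175.26 cm.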
Since we had disguised missing values in the height column, let's check to see if there are any in the weight column.
df_fighters_clean[df_fighters_clean['weight'] == '--']
name | SApM | SLpM | height | reach | record | stance | strAcc | strDef | subAvg | tdAcc | tdAvg | tdDef | weight | birth_year |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Great, the dataframe is empty, which means there are no missing weight values.
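Rather than checking one column at a time, disguised missing values can be counted across the whole dataframe in a single pass. The helper below is a sketch with a hypothetical name (`count_placeholders`), shown on toy data.

```python
import pandas as pd

def count_placeholders(df, placeholder='--'):
    """Count the cells equal to `placeholder` in every column."""
    # Numeric cells never equal a string, so those columns simply report 0.
    return df.eq(placeholder).sum()

toy = pd.DataFrame({
    'height': ["5' 9\"", '--', "6' 0\""],
    'weight': ['155', '170', '--'],
    'reach':  [72.0, 70.0, 74.0],
})
counts = count_placeholders(toy)
print(counts)
```

Running something like this right after `isnull().sum()` would surface "--" entries that `isnull()` cannot see.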
Next we'll split up the record so it's more integer-friendly. The no-contest count needs its own handling because, unlike the win-loss-draw numbers, it is enclosed in parentheses. After creating the appropriate columns, we'll drop the now-redundant record column.
df_fighters_clean['record'] = df_fighters_clean['record'].str.replace(r' \(', '-(', regex=True)
df_fighters_clean[['win', 'lose', 'draw', 'nc']] = df_fighters_clean['record'].str.split('-', expand=True)
def split_nc(nc):
    # Pull the number out of a string such as "(1 NC)"
    return re.findall(r"\d+", nc)[0]
df_fighters_clean['nc'] = df_fighters_clean['nc'].apply(lambda x: split_nc(x) if pd.notna(x) else 0)
df_fighters_clean.drop(['record'], axis=1, inplace=True)
df_fighters_clean.head()
name | SApM | SLpM | height | reach | stance | strAcc | strDef | subAvg | tdAcc | tdAvg | tdDef | weight | birth_year | win | lose | draw | nc |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
John Gunther | 3.72 | 1.65 | 175.26 | 72.0 | Orthodox | 0.37 | 0.46 | 0.0 | 0.42 | 7.08 | 0.00 | 155 | 1985 | 5 | 1 | 0 | 0 |
Joe Giannetti | 2.80 | 1.93 | 182.88 | 74.0 | Southpaw | 0.38 | 0.40 | 0.0 | 0.16 | 1.00 | 0.00 | 155 | 1995 | 6 | 1 | 0 | 0 |
Allen Berube | 0.92 | 0.92 | 172.72 | 70.0 | Orthodox | 0.80 | 0.33 | 3.4 | 1.00 | 6.87 | 0.00 | 155 | 1974 | 4 | 3 | 0 | 0 |
Daichi Abe | 4.49 | 3.80 | 180.34 | 71.0 | Orthodox | 0.33 | 0.56 | 0.0 | 0.50 | 0.33 | 0.00 | 170 | 1991 | 6 | 2 | 0 | 0 |
Diana Belbita | 6.18 | 6.43 | 170.18 | 68.0 | Orthodox | 0.42 | 0.50 | 0.0 | 0.50 | 0.63 | 0.68 | 115 | 1996 | 14 | 7 | 0 | 0 |
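The record format can also be handled with a single regular expression that treats the "(n NC)" part as optional. This is a sketch, with `parse_record` as a hypothetical helper name, using the two record shapes that appear in the data above.

```python
import re

# wins-losses-draws, optionally followed by "(n NC)"
RECORD_RE = re.compile(r'^(\d+)-(\d+)-(\d+)(?:\s*\((\d+)\s*NC\))?$')

def parse_record(record):
    """Return (wins, losses, draws, no_contests) as integers."""
    match = RECORD_RE.match(record.strip())
    if match is None:
        raise ValueError(f"unrecognized record format: {record!r}")
    win, lose, draw, nc = match.groups()
    # The NC group is None when the record has no parenthesized part.
    return int(win), int(lose), int(draw), int(nc or 0)

print(parse_record('5-1-0'))          # (5, 1, 0, 0)
print(parse_record('12-5-2 (1 NC)'))  # (12, 5, 2, 1)
```

Because the values come back as integers, this route would also skip the later string-to-int conversion for the four record columns.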
Finally, let's make sure our columns are all of the proper types.
df_fighters_clean.dtypes
SApM float64
SLpM float64
height float64
reach float64
stance object
strAcc float64
strDef float64
subAvg float64
tdAcc float64
tdAvg float64
tdDef float64
weight object
birth_year int64
win object
lose object
draw object
nc object
dtype: object
It seems we have a few columns that are string objects when they should be integers. To amend this, we'll create a simple function that converts strings to int for every appropriate column in the dataframe. The only one that should remain as a string is the stance column. The properly converted data types can be seen below.
def str2int(df, columns):
    for column in columns:
        df[column] = df[column].astype(int)
str2int(df_fighters_clean, ['win', 'lose', 'draw', 'nc', 'weight'])
df_fighters_clean.dtypes
SApM float64
SLpM float64
height float64
reach float64
stance object
strAcc float64
strDef float64
subAvg float64
tdAcc float64
tdAvg float64
tdDef float64
weight int32
birth_year int64
win int32
lose int32
draw int32
nc int32
dtype: object
Now that our fighter data is completely cleaned up, we can move on to the fight card dataframe. Luckily, we've already written functions that perform most of the cleaning we'll do on the fight card data. We'll start by reusing our per2dec function from before to convert the two significant strike percentage columns to decimal values.
per2dec(df_cards, ['f1_sig_strike_per', 'f2_sig_strike_per'])
We can again reuse the get_birth_year function's structure to get just the year of each fight.
def get_fight_year(dof):
    return datetime.strptime(dof, '%B %d %Y').year
df_cards['fight_year'] = df_cards['fight_date'].apply(get_fight_year)
df_cards.drop(['fight_date'], axis=1, inplace=True)
Another tricky characteristic we have to be wary of here is that UFCStats always lists the winner as "Fighter 1" (f1). To fix this, we'll randomly swap f1 and f2 for half of the dataset so that about 50% of the winners are f1 and 50% are f2. We'll then encode the winner column as 1 when f1 won and 0 when f2 won, and check the balance below.
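swap_indices = np.random.choice(len(df_cards), size=len(df_cards) // 2, replace=False)
swap_rows = df_cards.index[swap_indices]
# Swap the names and every per-fight stat column so each stat stays with its fighter
f1_cols = ['f1'] + [c for c in df_cards.columns if c.startswith('f1_')]
for c1 in f1_cols:
    c2 = c1.replace('f1', 'f2', 1)
    df_cards.loc[swap_rows, [c1, c2]] = df_cards.loc[swap_rows, [c2, c1]].values
df_cards["winner"] = (df_cards["winner"] == df_cards["f1"]).astype(int)
df_cards["winner"].value_counts()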
1 3398
0 3397
Name: winner, dtype: int64
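When swapping corners, it's worth confirming that each fighter's per-fight numbers travel with their name. The sketch below does that on a toy frame with a single stat column; the data is made up, and the fixed seed is an assumption added so the swap can be reproduced.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed (an assumption) for a repeatable swap

toy = pd.DataFrame({
    'f1': ['A', 'C', 'E', 'G'], 'f1_td_succeed': [1, 2, 3, 4],
    'f2': ['B', 'D', 'F', 'H'], 'f2_td_succeed': [5, 6, 7, 8],
    'winner': ['A', 'C', 'E', 'G'],  # the winner is always listed as f1 before the swap
})

def stats_by_fighter(df):
    """Map each fighter name to the takedown count stored next to it."""
    return {**dict(zip(df['f1'], df['f1_td_succeed'])),
            **dict(zip(df['f2'], df['f2_td_succeed']))}

before = stats_by_fighter(toy)

# Pick half of the rows and swap each f1/f2 column pair on those rows.
rows = toy.index[rng.choice(len(toy), size=len(toy) // 2, replace=False)]
for c1, c2 in [('f1', 'f2'), ('f1_td_succeed', 'f2_td_succeed')]:
    toy.loc[rows, [c1, c2]] = toy.loc[rows, [c2, c1]].to_numpy()

toy['winner'] = (toy['winner'] == toy['f1']).astype(int)
after = stats_by_fighter(toy)
print(after == before, toy['winner'].tolist())
```

Swapping one column pair at a time keeps each pair's dtype intact, and comparing the name-to-stat mapping before and after is a cheap way to catch a swap that moved the names but not the numbers.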
Since we had to change some names of fighters earlier due to duplication, we'll repeat the same process here.
df_cards_clean = df_cards.copy()
for col in ['f1', 'f2']:
    df_cards_clean.loc[(df_cards_clean[col] == 'Michael McDonald') &
                       (df_cards_clean['weight_class'] == 'Bantamweight'), col] = "Michael McDonald 135"
    df_cards_clean.loc[(df_cards_clean[col] == 'Tony Johnson') &
                       (df_cards_clean['weight_class'] == 'Heavyweight'), col] = "Tony Johnson 265"
    df_cards_clean.loc[(df_cards_clean[col] == 'Joey Gomez') &
                       (df_cards_clean['weight_class'] == 'Welterweight'), col] = "Joey Gomez 155"
    df_cards_clean.loc[(df_cards_clean[col] == 'Bruno Silva') &
                       (df_cards_clean['weight_class'] == 'Light Heavyweight'), col] = "Bruno Silva 185"
Next we'll compile a list of all the fighters in the df_fighters_clean dataframe. Because we removed some fighters during cleaning, we'll drop any fight in which either fighter is missing from that dataframe.
all_fighters = df_fighters_clean.index.tolist()
df_cards_clean = df_cards_clean.loc[(df_cards_clean["f1"].isin(all_fighters)) & (df_cards_clean["f2"].isin(all_fighters))]
df_cards_clean.reset_index(inplace=True, drop=True)
print("We had {} cards initially. After clean up we have {} cards".format(len(df_cards), len(df_cards_clean)))
We had 6795 cards initially. After clean up we have 6590 cards
We'll now create two new dataframes holding the stats of fighter 1 and fighter 2, looked up from the df_fighters_clean dataframe. We'll then concatenate them column-wise with the df_cards_clean dataframe to get a single, final dataframe we can examine.
# Split
df_f1 = df_fighters_clean.loc[df_cards_clean['f1']]
df_f1 = df_f1.add_suffix('_f1')
df_f2 = df_fighters_clean.loc[df_cards_clean['f2']]
df_f2 = df_f2.add_suffix('_f2')
# Join
df_f1.reset_index(inplace=True, drop=True)
df_f2.reset_index(inplace=True, drop=True)
df_final = pd.concat([df_cards_clean, df_f1, df_f2], axis=1, sort=False)
# Compute each fighter's age at the time of the fight
df_final['f1_age_when_fight'] = df_final['fight_year'] - df_final['birth_year_f1']
df_final['f2_age_when_fight'] = df_final['fight_year'] - df_final['birth_year_f2']
df_final.head()
| | card_name | f1 | f1_sig_strike_per | f1_sig_strike_total | f1_td_attempt | f1_td_succeed | f2 | f2_sig_strike_per | f2_sig_strike_total | f2_td_attempt | ... | tdAvg_f2 | tdDef_f2 | weight_f2 | birth_year_f2 | win_f2 | lose_f2 | draw_f2 | nc_f2 | f1_age_when_fight | f2_age_when_fight |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | UFC Fight Night: Gane vs. Volkov | Justin Jaynes | 0.28 | 182.0 | 2.0 | 2.0 | Charles Rosa | 0.47 | 92.0 | 2.0 | ... | 1.67 | 0.38 | 145 | 1986 | 14 | 8 | 0 | 0 | 32.0 | 35.0 |
1 | UFC Fight Night: Gane vs. Volkov | Damir Hadzovic | 0.47 | 219.0 | 2.0 | 2.0 | Yancy Medeiros | 0.51 | 237.0 | 3.0 | ... | 0.33 | 0.73 | 155 | 1987 | 15 | 8 | 0 | 1 | 35.0 | 34.0 |
2 | UFC Fight Night: Font vs. Garbrandt | Damir Ismagulov | 0.47 | 63.0 | 1.0 | 0.0 | Rafael Alves | 0.44 | 126.0 | 3.0 | ... | 0.41 | 0.60 | 155 | 1990 | 20 | 11 | 0 | 0 | 30.0 | 31.0 |
3 | UFC Fight Night: Gane vs. Volkov | Julija Stoliarenko | 0.52 | 91.0 | 4.0 | 1.0 | Julia Avila | 0.42 | 94.0 | 3.0 | ... | 0.67 | 0.61 | 135 | 1988 | 9 | 2 | 0 | 0 | 28.0 | 33.0 |
4 | UFC Fight Night: Hall vs. Strickland | Ashley Yoder | 0.47 | 185.0 | 1.0 | 0.0 | Jinh Yu Frey | 0.38 | 236.0 | 8.0 | ... | 0.61 | 0.88 | 115 | 1985 | 11 | 8 | 0 | 0 | 34.0 | 36.0 |
5 rows × 54 columns
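The same result can be reached with two `merge` calls, which match rows by fighter name instead of relying on three dataframes sharing the same row order. This is a sketch on toy data with made-up values; only the column names are borrowed from the tutorial.

```python
import pandas as pd

fighters = pd.DataFrame(
    {'reach': [72.0, 74.0, 70.0], 'birth_year': [1985, 1995, 1990]},
    index=pd.Index(['A', 'B', 'C'], name='name'),
)
cards = pd.DataFrame({'f1': ['A', 'C'], 'f2': ['B', 'A'], 'fight_year': [2021, 2020]})

# Attach each corner's stats by name; add_suffix keeps the two sets of columns apart.
merged = (
    cards
    .merge(fighters.add_suffix('_f1'), left_on='f1', right_index=True, how='left')
    .merge(fighters.add_suffix('_f2'), left_on='f2', right_index=True, how='left')
)
merged['f1_age_when_fight'] = merged['fight_year'] - merged['birth_year_f1']
print(merged)
```

With `how='left'`, every fight is kept even if a fighter is missing from the stats table; those rows simply get NaN, which makes unmatched names easy to spot.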
For record's sake, we'll output this final dataframe to a CSV.
df_final.to_csv('cleaned_ufc_stats.csv', index=False)
display(FileLink('cleaned_ufc_stats.csv', result_html_prefix="Cleaned UFC Stats: "))
Cleaned UFC Stats: cleaned_ufc_stats.csv
Now that our data is all cleaned up, we can finally analyze it. Let's start by looking at statistics surrounding the age of fighters. The first pair of histograms displays the distribution of fighters' ages throughout the UFC. The second pair of bar charts displays the number of wins at each age, in decreasing order of wins.
fig, ax = plt.subplots(1, 2, figsize=(10,8))
sns.histplot(df_final['f1_age_when_fight'], ax=ax[0], kde=True)
ax[0].set_title('Fighter 1 Ages')
ax[0].set_xlabel('Age')
ax[0].set_ylabel('Count')
sns.histplot(df_final['f2_age_when_fight'], ax=ax[1], kde=True)
ax[1].set_title('Fighter 2 Ages')
ax[1].set_xlabel('Age')
ax[1].set_ylabel('Count')
plt.show()
We see that the majority of fighters are in the 25-35 range. This should be unsurprising, because most fighters spend a few years fighting as amateurs in their early twenties, and many start to retire after 35. The career span of athletes is relatively short to begin with, almost always less than twenty years. The career span of combat sports athletes is even shorter because of the physical strain on their bodies and the damage they take in the cage/ring.
fig, ax = plt.subplots(1, 2, figsize=(12,8))
# winner == 1 means fighter 1 won; winner == 0 means fighter 2 won
df_final[df_final['winner'] == 1]['f1_age_when_fight'].value_counts().plot.bar(ax=ax[0])
ax[0].set_title('Fighter 1 Wins by Age')
ax[0].set_xlabel('Age')
ax[0].set_ylabel('Wins')
df_final[df_final['winner'] == 0]['f2_age_when_fight'].value_counts().plot.bar(ax=ax[1])
ax[1].set_title('Fighter 2 Wins by Age')
ax[1].set_xlabel('Age')
ax[1].set_ylabel('Wins')
plt.show()
Consistent with the age distribution, the ages with the most wins, which sit at the leftmost side of the x-axis because the bars are sorted by count, fall within the prime range of 25-35.
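Raw win counts largely mirror how many fighters there are at each age, so a win rate per age can be a useful complement. The sketch below uses toy fights with made-up values and assumes the encoding from earlier, where winner == 1 means fighter 1 won.

```python
import pandas as pd

# Toy fights (hypothetical values); winner == 1 means fighter 1 won.
toy = pd.DataFrame({
    'f1_age_when_fight': [25, 25, 30, 30, 30, 38],
    'f2_age_when_fight': [30, 38, 25, 25, 38, 25],
    'winner':            [1,  1,  0,  1,  1,  0],
})

# Stack both corners into one long table: one row per (fighter age, won?) pair.
long = pd.concat([
    pd.DataFrame({'age': toy['f1_age_when_fight'], 'won': toy['winner'] == 1}),
    pd.DataFrame({'age': toy['f2_age_when_fight'], 'won': toy['winner'] == 0}),
], ignore_index=True)

# mean of a boolean column is the share of wins; count is the number of appearances.
win_rate = long.groupby('age')['won'].agg(['mean', 'count'])
print(win_rate)
```

Keeping the count next to the rate matters, since an age with only a handful of fights can show an extreme win rate by chance.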
Moving on to physical attributes, let's see what results from analyzing height.
fig, ax = plt.subplots(1, 2, figsize=(10,8))
sns.histplot(df_final['height_f1'], ax=ax[0], kde=True)
ax[0].set_title('Fighter 1 Heights')
ax[0].set_xlabel('Height')
ax[0].set_ylabel('Count')
sns.histplot(df_final['height_f2'], ax=ax[1], kde=True)
ax[1].set_title('Fighter 2 Heights')
ax[1].set_xlabel('Height')
ax[1].set_ylabel('Count')
plt.show()
The heights still appear to be roughly normally distributed, though not as smoothly as the ages. The majority of the data is contained within the 170-190cm range. For reference, 170cm is about 5'7" and 190cm is about 6'3". While this distribution sits higher than that of the general population, it makes sense for professional athletes because additional height brings many physical advantages.
fig, ax = plt.subplots(1, 2, figsize=(12,8))
df_final[df_final['winner'] == 0]['height_f1'].value_counts().plot.bar(ax=ax[0])
ax[0].set_title('Fighter 1 Wins by Height')
ax[0].set_xlabel('Height')
ax[0].set_ylabel('Wins')
bar = df_final[df_final['winner'] ==1]['height_f2'].value_counts().plot.bar(ax=ax[1])
ax[1].set_title('Fighter 2 Wins by Height')
ax[1].set_xlabel('Height')
ax[1].set_ylabel('Wins')
plt.show()
Once again, since most of the data is in the 170-190cm range, that's where most of the wins are as well. What's interesting is that 167cm ranks by win count in between a few of the values in that range, namely 170, 188, and 190.
The bar graph below displays the total number of fighters by division in UFC history.
plt.figure(figsize=(12, 8))
sns.countplot(y=df_final['weight_class'])
plt.title('Number of Fighters by Division in UFC History')
plt.xlabel('Count')
plt.ylabel('Weight Division')
plt.show()
For anyone familiar with UFC weight classes, it shouldn't be surprising that lightweight (155 lbs limit) and welterweight (170 lbs limit) are the most populated divisions. To be as big as possible in their weight class, many fighters cut twenty, twenty-five, or even more pounds from their normal walking weight to compete. For example, a typical lightweight will weigh in at 155 lbs or less (156 for non-championship bouts) on the day of the weigh-ins, but in the weeks or months after the fight will probably walk around at 170-180 lbs. A welterweight could be anywhere from 185-200 lbs or even more. Obviously this isn't true for every fighter, but one could imagine that much of the general population (in the US at least) weighs somewhere in that 170-200 lbs range.
Next let's look at wins by winning method.
x = df_final['winning_method'].value_counts()
y = x.index
plt.figure(figsize=(12, 8))
sns.barplot(x=x, y=y)
plt.title('UFC Fight Outcomes')
plt.xlabel('Count')
plt.ylabel('Outcome')
plt.show()
This may shock a lot of people, UFC fans and non-fans alike. Many may guess that unanimous decisions are the most common winning method, but I don't think many would realize how close KO/TKO is behind it. To expand on the winning methods, "Overturned" includes situations where fighter A won the bout, then for one reason or another, fighter B was awarded the win post-fight. One example of this could be if fighter A was found to be on performance-enhancing drugs and the UFC decided to award the win to fighter B. A "CNC" or "Could Not Continue" is essentially the MMA equivalent of a boxer's coach throwing in the towel. Between rounds, a coach may try to protect the health of their fighter and declare a CNC. You're probably familiar with a "DQ" or "Disqualification," in which one fighter does something illegal like knee the head of a grounded opponent or strike the opponent with 12-6 elbows. Finally, "Other" encompasses things like the split draws I mentioned at the beginning of the tutorial and no contests, which are fights that have neither a winner nor a loser due to extenuating circumstances like an accidental headbutt or eye poke.
It would be interesting to examine this data further by weight class.
bar = df_final.groupby(['weight_class', 'winning_method']).size().reset_index().pivot(columns='winning_method', index='weight_class', values=0)
bar.plot(kind='bar',stacked=False, figsize=(15,8))
plt.legend()
plt.title('UFC Fight Outcome by Division')
plt.xlabel('Weight Class')
plt.ylabel('Count')
plt.show()
If you take a look at middleweight (185 lbs), light heavyweight (205 lbs), and heavyweight (265 lbs), you'll notice that KO/TKO has overtaken unanimous decision as the most common outcome. This is likely because with more weight comes more strength and power. While heavier fighters can usually take harder hits and stay standing compared to lighter fighters, that resilience probably doesn't scale at the same rate as the power that comes with added weight.
Another point of interest is the women's flyweight (125 lbs) and women's strawweight (115 lbs) divisions. The second most common method behind U-DEC is no longer KO/TKO but submissions. This is again probably due to the previous point, where those lighter women don't have as much knockout power but they can still have crisp submission technique that allows them to submit opponents without needing as much strength as the heavier divisions.
Other than those weight classes mentioned, every other one follows the same ranking of outcomes.
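Raw counts make divisions with few fights hard to compare against divisions with many. One way to compare them is to look at each outcome's share within a division instead. This is a minimal sketch using pd.crosstab with normalize='index', run on hypothetical toy data rather than df_final:

```python
import pandas as pd

# Toy data for illustration only (not taken from the real dataset)
toy = pd.DataFrame({
    'weight_class': ['Heavyweight'] * 4 + ['Flyweight'] * 4,
    'winning_method': ['KO/TKO', 'KO/TKO', 'KO/TKO', 'U-DEC',
                       'U-DEC', 'U-DEC', 'SUB', 'KO/TKO'],
})

# normalize='index' turns each row (division) into proportions that sum to 1
shares = pd.crosstab(toy['weight_class'], toy['winning_method'], normalize='index')
print(shares.round(2))
```

On the real data, the same call on df_final['weight_class'] and df_final['winning_method'] should make it easier to see where KO/TKO or SUB take a larger share, regardless of how many fights each division has had.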
Finally, let's see if the distribution of outcomes has changed over the years at all.
bar = df_final.groupby(['fight_year', 'winning_method']).size().reset_index().pivot(columns='winning_method', index='fight_year', values=0)
bar.plot(kind='barh', stacked=True, figsize=(15,8))
plt.legend()
plt.title('UFC Fight Outcomes over the Years')
plt.xlabel('Count')
plt.ylabel('Year')
plt.show()
While we see some of these outcomes bounce back and forth, for the most part they've remained surprisingly stable in terms of frequency. In the earlier years of the UFC, there were definitely more knockouts happening, but that quickly changed around 2006. It seems like the values taper off once 2014 comes around.
Our null hypothesis states that none of the fields we've covered in our dataframe has any correlation with winning fights. The alternative hypothesis asserts that there are indeed qualities that correlate with winning fights. Let's test this.
# Drop irrelevant columns
df = df_final.drop(['fights_location', 'card_name'], axis=1)
# Encode inputs of type object
encoder = LabelEncoder()
encoded_1 = df['weight_class']
encoded_1 = encoder.fit_transform(encoded_1)
encoded_2 = df['stance_f1']
encoded_2 = encoder.fit_transform(encoded_2)
encoded_3 = df['stance_f2']
encoded_3 = encoder.fit_transform(encoded_3)
encoded_1 = pd.DataFrame(encoded_1, columns=['weight_class'])
encoded_2 = pd.DataFrame(encoded_2, columns=['stance_f1'])
encoded_3 = pd.DataFrame(encoded_3, columns=['stance_f2'])
df[['weight_class']] = encoded_1[['weight_class']]
df[['stance_f1']] = encoded_2[['stance_f1']]
df[['stance_f2']] = encoded_3[['stance_f2']]
df = pd.concat([df,pd.get_dummies(df['winning_method'], prefix='winning_method')],axis=1)
df.drop(['winning_method'], axis=1, inplace=True)
display(df.head())
encode = df[['f1', 'f2', 'weight_class']].apply(encoder.fit_transform)
df[['f1', 'f2', 'weight_class']] = encode[['f1', 'f2', 'weight_class']]
| | f1 | f1_sig_strike_per | f1_sig_strike_total | f1_td_attempt | f1_td_succeed | f2 | f2_sig_strike_per | f2_sig_strike_total | f2_td_attempt | f2_td_succeed | ... | f2_age_when_fight | winning_method_CNC | winning_method_DQ | winning_method_KO/TKO | winning_method_M-DEC | winning_method_Other | winning_method_Overturned | winning_method_S-DEC | winning_method_SUB | winning_method_U-DEC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Justin Jaynes | 0.28 | 182.0 | 2.0 | 2.0 | Charles Rosa | 0.47 | 92.0 | 2.0 | 2.0 | ... | 35.0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | Damir Hadzovic | 0.47 | 219.0 | 2.0 | 2.0 | Yancy Medeiros | 0.51 | 237.0 | 3.0 | 2.0 | ... | 34.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | Damir Ismagulov | 0.47 | 63.0 | 1.0 | 0.0 | Rafael Alves | 0.44 | 126.0 | 3.0 | 2.0 | ... | 31.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | Julija Stoliarenko | 0.52 | 91.0 | 4.0 | 1.0 | Julia Avila | 0.42 | 94.0 | 3.0 | 1.0 | ... | 33.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | Ashley Yoder | 0.47 | 185.0 | 1.0 | 0.0 | Jinh Yu Frey | 0.38 | 236.0 | 8.0 | 0.0 | ... | 36.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
5 rows × 60 columns
To walk through the above code: we first drop the columns we won't need. Then we label-encode the object-typed inputs weight_class, stance_f1, and stance_f2, build a new dataframe out of each set of encoded values, and write them back into the cleaned dataframe. After concatenating df with the one-hot (dummy) columns for "winning_method," we drop the original "winning_method" column as it's no longer needed. The head of the resulting dataframe is displayed above. Finally, we apply fit_transform to the remaining object-typed columns, f1 and f2 (weight_class is passed through the encoder once more as well).
Below we see the correlation factor between "winner" and every field in our dataframe.
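The two encodings used above can be seen side by side on a tiny example. This is a minimal sketch on hypothetical toy data, assuming scikit-learn's LabelEncoder as imported for this tutorial. Here the encoded array is assigned straight back to the column, which is a simplification of the intermediate-dataframe approach used above rather than a copy of it.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data for illustration only (not taken from the real dataset)
toy = pd.DataFrame({
    'stance_f1': ['Orthodox', 'Southpaw', 'Orthodox', 'Switch'],
    'winning_method': ['U-DEC', 'KO/TKO', 'SUB', 'U-DEC'],
})

# Label encoding: each distinct category becomes an integer
# (classes are numbered in sorted order)
encoder = LabelEncoder()
toy['stance_f1'] = encoder.fit_transform(toy['stance_f1'])

# One-hot encoding: one indicator column per outcome
dummies = pd.get_dummies(toy['winning_method'], prefix='winning_method')
encoded = pd.concat([toy.drop(columns='winning_method'), dummies], axis=1)
print(encoded)
```

Assigning the encoded array directly sidesteps index alignment, which matters if the dataframe's index is not a clean 0 to n-1 range after earlier filtering.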
plt.figure(figsize=(10,15))
sns.heatmap(df.corr()[['winner']].sort_values(by='winner', ascending=False), annot=True)
plt.title('Correlation Factors of Dataframe Fields')
plt.xlabel('Correlation Factor')
plt.ylabel('Dataframe Field')
plt.show()
As we can see, the only value that isn't in the deep purple color is "winner," which is 1 simply because any field is perfectly correlated with itself.
Since every correlation factor has an absolute value of at most 0.031 on a scale of -1 to 1, we fail to reject the null hypothesis: we don't have enough evidence to assert that any of these fields correlates with winning fights.
As one would expect, the reach, weight, and height qualities are found at the top, but what's most interesting is that the field with the highest correlation factor is Fighter 1's average takedowns per 15 minutes. This hints that the number of successful takedowns you get in a bout may help you win more than anything else measured here, although with a coefficient this small the effect is weak at best, and its sign has to be read with the coding of "winner" in mind (0 when Fighter 1 wins, as in the filters used for the earlier plots). This coincides with the information on fight outcomes we previously looked at. Unanimous decisions are the most frequent outcome of fights, which means going to the judges' scorecards. This might mean that judges are swayed more by successful takedowns than by any other measured metric (judges may be swayed more by something like visible damage, blood, or cuts, but that isn't something that can be quantified here).
Funnily enough, Fighter 1's stance affects the outcome of a fight the least, which is consistent with my earlier assertion that a fighter's stance is somewhat irrelevant.
Although none of these fields seem to correlate much with winning fights in aggregate across the UFC, this doesn't rule out the possibility of stronger correlations if we take a deeper dive into individual fighters. For example, to predict the outcome of a single fight, you can replicate some of these data analysis steps with the statistics of the two specific fighters involved in the bout, which will probably give much more pronounced results and may give you a statistical advantage in predicting winners.
Sports is an extremely broad category, so if you're not that interested in the UFC or MMA, this tutorial can be applied to other sports you follow. If you're not interested in sports at all, you can still apply it to other games such as video games, chess, etc.
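Eyeballing the size of a correlation is not quite the same as testing it. One way to attach a p-value to a single correlation is a permutation test. This is a swapped-in technique for illustration, not something this tutorial ran on the real data. The function name and the toy arrays below are hypothetical.

```python
import numpy as np

def permutation_pvalue(x, y, n_permutations=1000, seed=0):
    """Two-sided permutation p-value for the Pearson correlation of x and y."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = abs(np.corrcoef(x, y)[0, 1])
    hits = 0
    for _ in range(n_permutations):
        # Shuffling y breaks any real link between x and y
        shuffled = rng.permutation(y)
        if abs(np.corrcoef(x, shuffled)[0, 1]) >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from exactly zero
    return (hits + 1) / (n_permutations + 1)

# Toy data for illustration only
rng = np.random.default_rng(42)
feature = rng.normal(size=200)
unrelated = rng.integers(0, 2, size=200)           # coin-flip "winner" column
related = feature + rng.normal(scale=0.1, size=200)

p_unrelated = permutation_pvalue(feature, unrelated)
p_related = permutation_pvalue(feature, related)
print(p_unrelated, p_related)
```

A small p-value would suggest the observed correlation is unlikely under pure chance. Applied to a column of df and df['winner'], this gives a more formal footing for the "fail to reject" conclusion, keeping in mind that testing many fields at once calls for some caution about multiple comparisons.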
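As a starting point for that kind of per-fighter analysis, you could pull one fighter's record out of the fight table. This is a minimal sketch with a hypothetical helper, assuming the f1, f2, and winner columns hold fighter names and the 0/1 coding used in the earlier plots (so it would need to run before the names are label-encoded). How draws and no contests are coded is not covered here.

```python
import pandas as pd

def fighter_record(df, name):
    """Return (wins, fights) for one fighter, assuming winner is 0 for f1 and 1 for f2."""
    as_f1 = df[df['f1'] == name]
    as_f2 = df[df['f2'] == name]
    wins = int((as_f1['winner'] == 0).sum() + (as_f2['winner'] == 1).sum())
    return wins, len(as_f1) + len(as_f2)

# Toy data for illustration only (fighter names are placeholders)
toy = pd.DataFrame({
    'f1': ['Fighter A', 'Fighter B', 'Fighter C', 'Fighter A'],
    'f2': ['Fighter B', 'Fighter C', 'Fighter A', 'Fighter C'],
    'winner': [0, 1, 1, 1],
})

wins, fights = fighter_record(toy, 'Fighter A')
print(f"Fighter A: {wins} wins in {fights} fights")
```

The same filtering could be extended to compare the striking or takedown statistics of the two fighters in an upcoming bout.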
I hope this helped you experience a little taste of working in data science.