Exploratory Data Analysis of IPL Dataset

Problem Statement

The task is to analyze ball by ball data from all the way 2008 to 2019. Using this we need to come up with analysis to form our own dream team for IPL.
For year 2016, 2017 and 2018

I.
1. Find out most valuable player - explain why
2. Find out most consistent batsman - explain why
3. Find out most consistent bowler - explain why
II.
1. Find out, worst player - explain why
2. Find out worst batsmen - explain why
3. Find out worst bowler - explain why
III. Rank top 25 players for the year 2016 to 2019
IV. Identify most improved player from 2018 to 2019
V. Find out which stadium had the most runs and which scored the least
VI. Find out where (which stadium or which bowler) bowlers perform better
VII. Create a dream team satisfying all the rules of IPL team making.

About Data

Indian Premier League (IPL) is a Twenty20 cricket format league in India. It is usually played in April and May every year. As of 2019, the title sponsor of the game is Vivo. The league was founded by Board of Control for Cricket India (BCCI) in 2008.

The data contains result of each match and each delivery for 11 seasons.

Ah, here we go again!

Launching in T minus 3

Import the necessary packages needed for EDA


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Read csv(s)


xxxxxxxxxx
Deliveries=pd.read_csv(r'innings_data.csv')
Matches=pd.read_csv(r'match_data.csv')

Displaying the first 5 records of the Deliveries dataframe.


xxxxxxxxxx
Deliveries.head()

Image 1

Checking out [No pun intended] the first 5 records of Matches


xxxxxxxxxx
Matches.head()

IMage2

We can pretty much understand what the data is all about just at a single glance.

Now we might or might not have missing values in our dataset. But better safe than sorry let's check and remove the anomalies.

Data Cleaning


xxxxxxxxxx
Deliveries.isnull().sum()*100/Deliveries.shape[0]

The shape[0] is just to get a percentage.

IMage 3

replacements? (duh!)


xxxxxxxxxx
Matches.isnull().sum()*100/Matches.shape[0]

image 4

eliminator, check!

dl_applied, check!

result, check!

umpire3, double check!

Now let's just get into deliveries' details.


xxxxxxxxxx
print('Deliveries Details')
print('Shape: ',Deliveries.shape)
print('Size: ',Deliveries.size)
print('Dimensions: ',Deliveries.ndim)


xxxxxxxxxx
Deliveries Details
Shape:  (176573, 20)
Size:  3531460
Dimensions:  2

So we have 176573 rows with 20 columns each.

And the matches' details are as follows:


xxxxxxxxxx
print('Matches Details')
print('Shape: ',Matches.shape)
print('Size: ',Matches.size)
print('Dimensions: ',Matches.ndim)


xxxxxxxxxx
Matches Details
Shape:  (746, 24)
Size:  17904
Dimensions:  2

Let's sort index of Deliveries and that of matches


xxxxxxxxxx
Deliveries=Deliveries.sort_index()


xxxxxxxxxx
Matches=Matches.sort_values('id')


xxxxxxxxxx
Deliveries.describe()

IMage 5


xxxxxxxxxx
Matches.describe()

IMage 6


xxxxxxxxxx
Deliveries.dtypes


xxxxxxxxxx
id                      int64
season                  int64
batsman                object
bowler                 object
innings                object
non_striker            object
replacements           object
bowled_over           float64
batsman_team           object
player_out             object
fielder_caught_out     object
type_out               object
extras_wides            int64
extras_legbyes          int64
extras_noballs          int64
extras_byes             int64
extras_penalty          int64
total_extras_runs       int64
batsman_runs            int64
total_runs              int64
dtype: object


xxxxxxxxxx
Matches.dtypes


xxxxxxxxxx
id                    int64
season                int64
city                 object
date                 object
team1                object
team2                object
toss_winner          object
toss_decision        object
winner               object
eliminator           object
dl_applied           object
win_by_runs         float64
win_by_wickets      float64
result               object
overs                 int64
player_of_match      object
venue                object
umpire1              object
umpire2              object
umpire3              object
first_bat_team       object
first_bowl_team      object
first_bat_score     float64
second_bat_score    float64
dtype: object

Removing unnecessary stuff.

Plus replacing the value by 1 if anyone is caught out or bowled.


xxxxxxxxxx
Deliveries.drop(['replacements','id'],inplace=True,axis=1)
Deliveries['type_out'].replace('[a-zA-z \s]+','1',regex=True,inplace=True)
Deliveries['type_out']=Deliveries['type_out'].astype(int)


xxxxxxxxxx
print(Deliveries.isnull().sum())
Deliveries.head()


xxxxxxxxxx
season                0
batsman               0
bowler                0
innings               0
non_striker           0
bowled_over           0
batsman_team          0
player_out            0
fielder_caught_out    0
type_out              0
extras_wides          0
extras_legbyes        0
extras_noballs        0
extras_byes           0
extras_penalty        0
total_extras_runs     0
batsman_runs          0
total_runs            0
dtype: int64

Cool!

Image 6

Now we shall remove the unnecessary values in Matches.csv and fill unknown values.


xxxxxxxxxx
Matches.drop(['id','eliminator','result','umpire3','dl_applied'],inplace=True,axis=1)
Matches['city'].fillna('unknown',inplace=True)
Matches['winner'].fillna('No Result',inplace=True)
Matches['player_of_match'].fillna('unknown',inplace=True)
Matches['umpire1'].fillna('unknown',inplace=True)
Matches['umpire2'].fillna('unknown',inplace=True)
Matches['second_bat_score'].fillna(0,inplace=True)
Matches['win_by_runs'].fillna(0.0,inplace=True)
Matches['win_by_wickets'].fillna(0.0,inplace=True)


xxxxxxxxxx
Matches.isnull().sum()


xxxxxxxxxx
season              0
city                0
date                0
team1               0
team2               0
toss_winner         0
toss_decision       0
winner              0
win_by_runs         0
win_by_wickets      0
overs               0
player_of_match     0
venue               0
umpire1             0
umpire2             0
first_bat_team      0
first_bowl_team     0
first_bat_score     0
second_bat_score    0
dtype: int64

Almost done, just a lil changes here and there.


xxxxxxxxxx
Matches['win_by_runs']=Matches['win_by_runs'].astype(int)
Matches['win_by_wickets']=Matches['win_by_wickets'].astype(int)


xxxxxxxxxx
Matches.head()

Image 7

Okay. All anamolies removed.

What we basically did was

Got an overview of the data.
Checked for anomalies.
Removed them if they couldn't be fixed.
Filled the missing values for others.

Note: It is better to remove rows or columns with high missing values. We could opt for filling it because it brings down the "real-ness" of the dataset.

Data Analysis


xxxxxxxxxx
df=Deliveries.loc[Deliveries['season']>=2016]
df=df.groupby('batsman').sum()
df.reset_index(level=0, inplace=True)
df=df[['batsman','batsman_runs']]
df.sort_values(by=['batsman_runs'],ascending=False,inplace=True)
df.head(10)

Now we are grouping the data by batsman and sorting them based on the number of runs they scored.


x
.dataframe tbody tr th {
    vertical-align: top;
}
.dataframe thead th {
    text-align: right;
}

	batsman	batsman_runs
49	DA Warner	1871
232	V Kohli	1826
190	S Dhawan	1805
16	AM Rahane	1626
202	SK Raina	1569
7	AB de Villiers	1498
142	MS Dhoni	1389
108	KL Rahul	1342
181	RG Sharma	1270
22	AT Rayudu	1267

	bowler	type_out
69	JJ Bumrah	67
168	SP Narine	66
173	Sandeep Sharma	66
192	YS Chahal	63
185	UT Yadav	61
21	B Kumar	56
146	Rashid Khan	53
103	MJ McClenaghan	49
11	AR Patel	48
68	JD Unadkat	45

We can see that DA Warner scored the highest runs followed by Kohli and so on.

We have come up with our own function to calculate the points for a player after referring some official websites of BCCI and IPL.

The function we came up with is valid and usable because it isn't bias and giving points to bowlers and batsman based on good credibility.

	player	season	batsman_runs	mvp_batsman	mvp_fielder
114	DA Warner	2016	755	275.5	50.0
524	V Kohli	2016	661	212.0	57.5
253	KS Williamson	2018	657	241.0	47.5
116	DA Warner	2019	617	211.0	55.0
423	S Dhawan	2016	560	190.0	30.0

Exploratory Data Analysis of IPL Dataset

Problem Statement

About Data

Data Cleaning

Data Analysis

Calculating MVP Values for Players

Creating Data frames for MVPs in Batting, Bowling and Overall

Dividing MVP table wrt seasons

Finding MVP for each season

Finding Most Consistent Batsman and Bowler

Worst Player, Batsman and Bowler