The task is to analyze ball by ball data from all the way 2008 to 2019. Using this we need to come up with analysis to form our own dream team for IPL.
For year 2016, 2017 and 2018
I.
II.
III. Rank top 25 players for the year 2016 to 2019
IV. Identify most improved player from 2018 to 2019
V. Find out which stadium had the most runs and which scored the least
VI. Find out where (which stadium or which bowler) bowlers perform better
VII. Create a dream team satisfying all the rules of IPL team making.
Indian Premier League (IPL) is a Twenty20 cricket format league in India. It is usually played in April and May every year. As of 2019, the title sponsor of the game is Vivo. The league was founded by Board of Control for Cricket India (BCCI) in 2008.
The data contains result of each match and each delivery for 11 seasons.
Ah, here we go again!
Launching in T minus 3
3
2
1
Import the necessary packages needed for EDA
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Read csv(s)
xxxxxxxxxx
Deliveries=pd.read_csv(r'innings_data.csv')
Matches=pd.read_csv(r'match_data.csv')
Displaying the first 5 records of the Deliveries dataframe.
xxxxxxxxxx
Deliveries.head()
Image 1
Checking out [No pun intended] the first 5 records of Matches
xxxxxxxxxx
Matches.head()
IMage2
We can pretty much understand what the data is all about just at a single glance.
Now we might or might not have missing values in our dataset. But better safe than sorry let's check and remove the anomalies.
xxxxxxxxxx
Deliveries.isnull().sum()*100/Deliveries.shape[0]
The shape[0] is just to get a percentage.
IMage 3
replacements? (duh!)
xxxxxxxxxx
Matches.isnull().sum()*100/Matches.shape[0]
image 4
eliminator, check!
dl_applied, check!
result, check!
umpire3, double check!
Now let's just get into deliveries' details.
xxxxxxxxxx
print('Deliveries Details')
print('Shape: ',Deliveries.shape)
print('Size: ',Deliveries.size)
print('Dimensions: ',Deliveries.ndim)
xxxxxxxxxx
Deliveries Details
Shape: (176573, 20)
Size: 3531460
Dimensions: 2
So we have 176573 rows with 20 columns each.
And the matches' details are as follows:
xxxxxxxxxx
print('Matches Details')
print('Shape: ',Matches.shape)
print('Size: ',Matches.size)
print('Dimensions: ',Matches.ndim)
xxxxxxxxxx
Matches Details
Shape: (746, 24)
Size: 17904
Dimensions: 2
Let's sort index of Deliveries and that of matches
xxxxxxxxxx
Deliveries=Deliveries.sort_index()
xxxxxxxxxx
Matches=Matches.sort_values('id')
xxxxxxxxxx
Deliveries.describe()
IMage 5
xxxxxxxxxx
Matches.describe()
IMage 6
xxxxxxxxxx
Deliveries.dtypes
xxxxxxxxxx
id int64
season int64
batsman object
bowler object
innings object
non_striker object
replacements object
bowled_over float64
batsman_team object
player_out object
fielder_caught_out object
type_out object
extras_wides int64
extras_legbyes int64
extras_noballs int64
extras_byes int64
extras_penalty int64
total_extras_runs int64
batsman_runs int64
total_runs int64
dtype: object
xxxxxxxxxx
Matches.dtypes
xxxxxxxxxx
id int64
season int64
city object
date object
team1 object
team2 object
toss_winner object
toss_decision object
winner object
eliminator object
dl_applied object
win_by_runs float64
win_by_wickets float64
result object
overs int64
player_of_match object
venue object
umpire1 object
umpire2 object
umpire3 object
first_bat_team object
first_bowl_team object
first_bat_score float64
second_bat_score float64
dtype: object
Removing unnecessary stuff.
Plus replacing the value by 1 if anyone is caught out or bowled.
xxxxxxxxxx
Deliveries.drop(['replacements','id'],inplace=True,axis=1)
Deliveries['type_out'].replace('[a-zA-z \s]+','1',regex=True,inplace=True)
Deliveries['type_out']=Deliveries['type_out'].astype(int)
xxxxxxxxxx
print(Deliveries.isnull().sum())
Deliveries.head()
xxxxxxxxxx
season 0
batsman 0
bowler 0
innings 0
non_striker 0
bowled_over 0
batsman_team 0
player_out 0
fielder_caught_out 0
type_out 0
extras_wides 0
extras_legbyes 0
extras_noballs 0
extras_byes 0
extras_penalty 0
total_extras_runs 0
batsman_runs 0
total_runs 0
dtype: int64
Cool!
Image 6
Now we shall remove the unnecessary values in Matches.csv and fill unknown values.
xxxxxxxxxx
Matches.drop(['id','eliminator','result','umpire3','dl_applied'],inplace=True,axis=1)
Matches['city'].fillna('unknown',inplace=True)
Matches['winner'].fillna('No Result',inplace=True)
Matches['player_of_match'].fillna('unknown',inplace=True)
Matches['umpire1'].fillna('unknown',inplace=True)
Matches['umpire2'].fillna('unknown',inplace=True)
Matches['second_bat_score'].fillna(0,inplace=True)
Matches['win_by_runs'].fillna(0.0,inplace=True)
Matches['win_by_wickets'].fillna(0.0,inplace=True)
xxxxxxxxxx
Matches.isnull().sum()
xxxxxxxxxx
season 0
city 0
date 0
team1 0
team2 0
toss_winner 0
toss_decision 0
winner 0
win_by_runs 0
win_by_wickets 0
overs 0
player_of_match 0
venue 0
umpire1 0
umpire2 0
first_bat_team 0
first_bowl_team 0
first_bat_score 0
second_bat_score 0
dtype: int64
Almost done, just a lil changes here and there.
xxxxxxxxxx
Matches['win_by_runs']=Matches['win_by_runs'].astype(int)
Matches['win_by_wickets']=Matches['win_by_wickets'].astype(int)
xxxxxxxxxx
Matches.head()
Image 7
Okay. All anamolies removed.
What we basically did was
Note: It is better to remove rows or columns with high missing values. We could opt for filling it because it brings down the "real-ness" of the dataset.
xxxxxxxxxx
df=Deliveries.loc[Deliveries['season']>=2016]
df=df.groupby('batsman').sum()
df.reset_index(level=0, inplace=True)
df=df[['batsman','batsman_runs']]
df.sort_values(by=['batsman_runs'],ascending=False,inplace=True)
df.head(10)
Now we are grouping the data by batsman and sorting them based on the number of runs they scored.
x.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
batsman | batsman_runs | |
---|---|---|
49 | DA Warner | 1871 |
232 | V Kohli | 1826 |
190 | S Dhawan | 1805 |
16 | AM Rahane | 1626 |
202 | SK Raina | 1569 |
7 | AB de Villiers | 1498 |
142 | MS Dhoni | 1389 |
108 | KL Rahul | 1342 |
181 | RG Sharma | 1270 |
22 | AT Rayudu | 1267 |
Same with the bowler except here we calculate the total wickets taken.
xxxxxxxxxx
df1=Deliveries.loc[Deliveries['season']>=2016]
df1=df1.groupby('bowler').sum()
df1.reset_index(level=0, inplace=True)
df1=df1[['bowler','type_out']]
xxxxxxxxxx
df1.sort_values(by=['type_out'],ascending=False,inplace=True)
df1.head(10)
xxxxxxxxxx
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
bowler | type_out | |
---|---|---|
69 | JJ Bumrah | 67 |
168 | SP Narine | 66 |
173 | Sandeep Sharma | 66 |
192 | YS Chahal | 63 |
185 | UT Yadav | 61 |
21 | B Kumar | 56 |
146 | Rashid Khan | 53 |
103 | MJ McClenaghan | 49 |
11 | AR Patel | 48 |
68 | JD Unadkat | 45 |
And a simple bar plot.
xxxxxxxxxx
import palettable
sns.set_style('darkgrid')
pal=palettable.cartocolors.qualitative.Prism_10.mpl_colors
plt.figure(figsize=[15,7])
plt.title('Highest Run Scorers from 2016-2019')
ax=sns.barplot(x='batsman',y='batsman_runs',data=df.head(10),palette=pal)
t=ax.set(xlabel='Batsman',ylabel='Runs Scored')
We can see that DA Warner scored the highest runs followed by Kohli and so on.
Plotting for bowlers gives the below image.
Inference? Comment below * Wink *
xxxxxxxxxx
pal=palettable.cartocolors.qualitative.Prism_10.mpl_colors
plt.figure(figsize=[15,7])
plt.title('Highest Wicket Takers from 2016-2019')
sns.set_style('darkgrid')
ax=sns.barplot(x='bowler',y='type_out',data=df1.head(10),palette=pal)
t=ax.set(xlabel='Bowlers',ylabel='Wickets Taken')
We have come up with our own function to calculate the points for a player after referring some official websites of BCCI and IPL.
The function we came up with is valid and usable because it isn't bias and giving points to bowlers and batsman based on good credibility.
xxxxxxxxxx
def mvpbat(x):
if x==4:
return 2.5
elif x==6:
return 3.5
else:
return 0
def mvpbowl(x):
if x==1:
return 2.5
else:
return 0
def mvpdotball(x):
if x==0:
return 1
else:
return 0
def mvpfield(x):
if x=='0':
return 0
else:
return 2.5
Deliveries['mvp_batsman'] = [mvpbat(x) for x in Deliveries['batsman_runs']]
Deliveries['mvp_bowler'] = [mvpbowl(x) for x in Deliveries['type_out']]
Deliveries['mvp_dot'] = [mvpdotball(x) for x in Deliveries['batsman_runs']]
Deliveries['mvp_fielder']= [mvpfield(x) for x in Deliveries['fielder_caught_out']]
Deliveries['mvp_bowler']+=Deliveries['mvp_dot']
Deliveries.drop('mvp_dot',axis=1,inplace=True)
xxxxxxxxxx
Deliveries.head()
Image 8
xxxxxxxxxx
Deliveries['mvp_bowler'].describe()
xxxxxxxxxx
count 176573.000000
mean 0.526170
std 0.820316
min 0.000000
25% 0.000000
50% 0.000000
75% 1.000000
max 3.500000
Name: mvp_bowler, dtype: float64
xxxxxxxxxx
Deliveries['mvp_fielder'].describe()
xxxxxxxxxx
count 176573.000000
mean 0.088547
std 0.462091
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 2.500000
Name: mvp_fielder, dtype: float64
xxxxxxxxxx
mvpbat=Deliveries.loc[Deliveries['season']>=2016]
mvpbat=mvpbat.groupby(['batsman','season']).sum()
mvpbat.reset_index(level=(0,1), inplace=True)
mvpbat.sort_values(by=['batsman_runs'],ascending=False,inplace=True)
mvpbat.head(10)
mvpbat.drop(['mvp_bowler','bowled_over','type_out','extras_wides','extras_legbyes','extras_noballs','extras_byes','total_extras_runs'
,'total_runs','extras_penalty'],axis=1,inplace=True)
mvpbat.rename(index=str,columns={'batsman':'player'},inplace=True)
mvpbat.head()
xxxxxxxxxx
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
player | season | batsman_runs | mvp_batsman | mvp_fielder | |
---|---|---|---|---|---|
114 | DA Warner | 2016 | 755 | 275.5 | 50.0 |
524 | V Kohli | 2016 | 661 | 212.0 | 57.5 |
253 | KS Williamson | 2018 | 657 | 241.0 | 47.5 |
116 | DA Warner | 2019 | 617 | 211.0 | 55.0 |
423 | S Dhawan | 2016 | 560 | 190.0 | 30.0 |
xxxxxxxxxx
mvpbowl=Deliveries.loc[Deliveries['season']>=2016]
mvpbowl=mvpbowl.groupby(['bowler','season']).sum()
mvpbowl.reset_index(level=(0,1), inplace=True)
mvpbowl.sort_values(by=['type_out'],ascending=False,inplace=True)
mvpbowl.head(10)
mvpbowl.drop(['mvp_batsman','mvp_fielder','bowled_over','batsman_runs','extras_wides','extras_legbyes','extras_noballs','extras_byes','total_extras_runs'
,'total_runs','extras_penalty'],axis=1,inplace=True)
mvpbowl.rename(index=str,columns={'bowler':'player'},inplace=True)
xxxxxxxxxx
mvp=mvpbat.merge(mvpbowl,how='outer')
mvp.fillna(0,inplace=True)
mvp['total_value']=mvp['mvp_batsman']+mvp['mvp_bowler']+mvp['mvp_fielder']
xxxxxxxxxx
mvp.head()
IMage 9
xxxxxxxxxx
mvp.groupby(['player','season']).sum().sort_values('total_value',ascending=False).head(10)
Image 10
xxxxxxxxxx
season2016=mvp[mvp['season']==2016]
season2017=mvp[mvp['season']==2017]
season2018=mvp[mvp['season']==2018]
season2019=mvp[mvp['season']==2019]
season2016bat=mvpbat[mvpbat['season']==2016]
season2017bat=mvpbat[mvpbat['season']==2017]
season2018bat=mvpbat[mvpbat['season']==2018]
season2019bat=mvpbat[mvpbat['season']==2019]
season2016bowl=mvpbowl[mvpbowl['season']==2016]
season2017bowl=mvpbowl[mvpbowl['season']==2017]
season2018bowl=mvpbowl[mvpbowl['season']==2018]
season2019bowl=mvpbowl[mvpbowl['season']==2019]
xxxxxxxxxx
#Most Valuable Player for each season
season2016[season2016['total_value']==max(season2016['total_value'])]
Image 11
xxxxxxxxxx
season2017[season2017['total_value']==max(season2017['total_value'])]
Im 12
xxxxxxxxxx
season2018[season2018['total_value']==max(season2018['total_value'])]
im 13
xxxxxxxxxx
#Most Consistent Batsman
top10bat=pd.concat([season2016bat.sort_values('mvp_batsman',ascending=False).head(10),
season2017bat.sort_values('mvp_batsman',ascending=False).head(10),
season2018bat.sort_values('mvp_batsman',ascending=False).head(10)])
xxxxxxxxxx
top10bat.groupby('player').sum().sort_values('mvp_batsman',ascending=False)['mvp_batsman'].head(1)
xxxxxxxxxx
player
AM Rahane 490.5
Name: mvp_batsman, dtype: float64
xxxxxxxxxx
#Most Consistent Bowler
top10bowl=pd.concat([season2016bowl.sort_values('mvp_bowler',ascending=False).head(10),
season2017bowl.sort_values('mvp_bowler',ascending=False).head(10),
season2018bowl.sort_values('mvp_bowler',ascending=False).head(10)])
xxxxxxxxxx
top10bowl.groupby('player').sum().sort_values('mvp_bowler',ascending=False)['mvp_bowler'].head(1)
xxxxxxxxxx
player
JJ Bumrah 541.5
Name: mvp_bowler, dtype: float64
xxxxxxxxxx
#Worst Player
season2016[season2016['total_value']==min(season2016['total_value'])].tail(1)
Im 14
xxxxxxxxxx
season2017[season2017['total_value']==min(season2017['total_value'])].tail(1)