The task is to analyze ball-by-ball IPL data from 2008 to 2019 and use that analysis to pick your own dream team for the IPL.
For year 2016, 2017 and 2018
I.
II.
III. Rank top 25 players for the year 2016 to 2019
IV. Identify most improved player from 2018 to 2019
V. Find out which stadium had the most runs and which had the fewest
VI. Find out where (which stadium or which bowler) bowlers perform better (Take most wickets while giving least runs)
VII. Now the interesting part: decide which players would make a dream team for 2020, based on the historical data
The Indian Premier League (IPL) is a Twenty20 cricket league in India. It is usually played in April and May every year. As of 2019, the title sponsor of the league is Vivo. The league was founded by the Board of Control for Cricket in India (BCCI) in 2008.
The data contains result of each match and each delivery for 11 seasons.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Deliveries=pd.read_csv(r'innings_data.csv')
Matches=pd.read_csv(r'match_data.csv')
Deliveries.head()
Matches.head()
Deliveries.isnull().sum()*100/Deliveries.shape[0]
Matches.isnull().sum()*100/Matches.shape[0]
print('Deliveries Details')
print('Shape: ',Deliveries.shape)
print('Size: ',Deliveries.size)
print('Dimensions: ',Deliveries.ndim)
print('Matches Details')
print('Shape: ',Matches.shape)
print('Size: ',Matches.size)
print('Dimensions: ',Matches.ndim)
Deliveries=Deliveries.sort_index()
Matches=Matches.sort_values('id')
Deliveries.describe()
Matches.describe()
Deliveries.dtypes
Matches.dtypes
Deliveries.drop(['replacements','id'],inplace=True,axis=1)
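# Any dismissal description (letters and spaces) becomes '1'; the column is then cast to a 0/1 integer flag
Deliveries['type_out']=Deliveries['type_out'].replace(r'[a-zA-Z\s]+','1',regex=True)
Deliveries['type_out']=Deliveries['type_out'].astype(int)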
print(Deliveries.isnull().sum())
Deliveries.head()
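The regex replacement above turns any dismissal description into a wicket flag. A minimal sketch of a more explicit way to build the same flag, assuming (as the `astype(int)` call suggests) that `type_out` holds the string '0' for deliveries without a wicket. The toy frame and the `wicket` column name are made up for illustration:

```python
import pandas as pd

# Toy stand-in for the deliveries table (values are made up).
toy = pd.DataFrame({'type_out': ['0', 'caught', '0', 'run out', 'bowled']})

# Assumption: '0' means no wicket; anything else is a dismissal.
toy['wicket'] = (toy['type_out'] != '0').astype(int)
print(toy)
```

Like the original conversion, this flag also counts run outs, which are not normally credited to the bowler.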
Matches.drop(['id','eliminator','result','umpire3','dl_applied'],inplace=True,axis=1)
# Fill missing values column by column; assigning the result back avoids relying on in-place edits of a column selection
Matches=Matches.fillna({'city':'unknown',
                        'winner':'No Result',
                        'player_of_match':'unknown',
                        'umpire1':'unknown',
                        'umpire2':'unknown',
                        'second_bat_score':0,
                        'win_by_runs':0.0,
                        'win_by_wickets':0.0})
Matches.isnull().sum()
Matches['win_by_runs']=Matches['win_by_runs'].astype(int)
Matches['win_by_wickets']=Matches['win_by_wickets'].astype(int)
Matches.head()
# Total runs per batsman for seasons 2016 onwards
df=Deliveries.loc[Deliveries['season']>=2016]
df=df.groupby('batsman',as_index=False)['batsman_runs'].sum()
df=df.sort_values(by='batsman_runs',ascending=False)
df.head(10)
# Total wickets per bowler for seasons 2016 onwards
df1=Deliveries.loc[Deliveries['season']>=2016]
df1=df1.groupby('bowler',as_index=False)['type_out'].sum()
df1=df1.sort_values(by='type_out',ascending=False)
df1.head(10)
import palettable
sns.set_style('darkgrid')
pal=palettable.cartocolors.qualitative.Prism_10.mpl_colors
plt.figure(figsize=[15,7])
plt.title('Highest Run Scorers from 2016-2019')
ax=sns.barplot(x='batsman',y='batsman_runs',data=df.head(10),palette=pal)
t=ax.set(xlabel='Batsman',ylabel='Runs Scored')
pal=palettable.cartocolors.qualitative.Prism_10.mpl_colors
plt.figure(figsize=[15,7])
plt.title('Highest Wicket Takers from 2016-2019')
sns.set_style('darkgrid')
ax=sns.barplot(x='bowler',y='type_out',data=df1.head(10),palette=pal)
t=ax.set(xlabel='Bowlers',ylabel='Wickets Taken')
def mvpbat(x):
    if x==4:
        return 2.5
    elif x==6:
        return 3.5
    else:
        return 0

def mvpbowl(x):
    if x==1:
        return 2.5
    else:
        return 0

def mvpdotball(x):
    if x==0:
        return 1
    else:
        return 0

def mvpfield(x):
    if x=='0':
        return 0
    else:
        return 2.5
Deliveries['mvp_batsman'] = [mvpbat(x) for x in Deliveries['batsman_runs']]
Deliveries['mvp_bowler'] = [mvpbowl(x) for x in Deliveries['type_out']]
Deliveries['mvp_dot'] = [mvpdotball(x) for x in Deliveries['batsman_runs']]
Deliveries['mvp_fielder']= [mvpfield(x) for x in Deliveries['fielder_caught_out']]
Deliveries['mvp_bowler']+=Deliveries['mvp_dot']
Deliveries.drop('mvp_dot',axis=1,inplace=True)
Deliveries.head()
Deliveries['mvp_bowler'].describe()
Deliveries['mvp_fielder'].describe()
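The scoring rules above (2.5 points for a four, 3.5 for a six, 2.5 per wicket, 1 per dot ball, 2.5 per catch) can also be written without Python-level loops. This is only a sketch on a made-up frame, assuming the same column names and the same '0' placeholder for "no fielder" that the functions above test for:

```python
import numpy as np
import pandas as pd

# Toy deliveries (made-up values) with the columns used by the scoring rules.
toy = pd.DataFrame({
    'batsman_runs': [0, 4, 6, 1, 0],
    'type_out': [0, 0, 0, 0, 1],
    'fielder_caught_out': ['0', '0', '0', '0', 'Some Fielder'],
})

# Batting: 2.5 points for a four, 3.5 for a six, 0 otherwise.
toy['mvp_batsman'] = toy['batsman_runs'].map({4: 2.5, 6: 3.5}).fillna(0)

# Bowling: 2.5 points per wicket plus 1 point per dot ball.
toy['mvp_bowler'] = 2.5 * (toy['type_out'] == 1) + 1.0 * (toy['batsman_runs'] == 0)

# Fielding: 2.5 points whenever a fielder is recorded.
toy['mvp_fielder'] = np.where(toy['fielder_caught_out'] == '0', 0.0, 2.5)
print(toy)
```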
mvpbat=Deliveries.loc[Deliveries['season']>=2016]
mvpbat=mvpbat.groupby(['batsman','season']).sum(numeric_only=True)
mvpbat.reset_index(level=(0,1), inplace=True)
mvpbat.sort_values(by=['batsman_runs'],ascending=False,inplace=True)
mvpbat.head(10)
mvpbat.drop(['mvp_bowler','bowled_over','type_out','extras_wides','extras_legbyes','extras_noballs','extras_byes','total_extras_runs'
,'total_runs','extras_penalty'],axis=1,inplace=True)
mvpbat.rename(index=str,columns={'batsman':'player'},inplace=True)
mvpbat.head()
mvpbowl=Deliveries.loc[Deliveries['season']>=2016]
mvpbowl=mvpbowl.groupby(['bowler','season']).sum(numeric_only=True)
mvpbowl.reset_index(level=(0,1), inplace=True)
mvpbowl.sort_values(by=['type_out'],ascending=False,inplace=True)
mvpbowl.head(10)
mvpbowl.drop(['mvp_batsman','mvp_fielder','bowled_over','batsman_runs','extras_wides','extras_legbyes','extras_noballs','extras_byes','total_extras_runs'
,'total_runs','extras_penalty'],axis=1,inplace=True)
mvpbowl.rename(index=str,columns={'bowler':'player'},inplace=True)
mvp=mvpbat.merge(mvpbowl,how='outer')
mvp.fillna(0,inplace=True)
mvp['total_value']=mvp['mvp_batsman']+mvp['mvp_bowler']+mvp['mvp_fielder']
mvp.head()
mvp.groupby(['player','season']).sum().sort_values('total_value',ascending=False).head(10)
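The outer merge above keeps players who only batted or only bowled in a season and fills the missing side with 0. A small sketch of that behaviour on made-up data, with the join keys spelled out explicitly (the `on=` argument is an addition for clarity, not part of the code above):

```python
import pandas as pd

# Made-up per-season batting and bowling summaries.
bat = pd.DataFrame({'player': ['A', 'B'], 'season': [2016, 2016],
                    'mvp_batsman': [100.0, 20.0], 'mvp_fielder': [5.0, 2.5]})
bowl = pd.DataFrame({'player': ['B', 'C'], 'season': [2016, 2016],
                     'mvp_bowler': [80.0, 120.0]})

# Outer merge: A has no bowling row and C has no batting row, so both get zeros.
combined = bat.merge(bowl, how='outer', on=['player', 'season']).fillna(0)
combined['total_value'] = combined[['mvp_batsman', 'mvp_bowler', 'mvp_fielder']].sum(axis=1)
combined = combined.set_index('player')
print(combined)
```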
season2016=mvp[mvp['season']==2016]
season2017=mvp[mvp['season']==2017]
season2018=mvp[mvp['season']==2018]
season2019=mvp[mvp['season']==2019]
season2016bat=mvpbat[mvpbat['season']==2016]
season2017bat=mvpbat[mvpbat['season']==2017]
season2018bat=mvpbat[mvpbat['season']==2018]
season2019bat=mvpbat[mvpbat['season']==2019]
season2016bowl=mvpbowl[mvpbowl['season']==2016]
season2017bowl=mvpbowl[mvpbowl['season']==2017]
season2018bowl=mvpbowl[mvpbowl['season']==2018]
season2019bowl=mvpbowl[mvpbowl['season']==2019]
#Most Valuable Player for each season
season2016[season2016['total_value']==max(season2016['total_value'])]
season2017[season2017['total_value']==max(season2017['total_value'])]
season2018[season2018['total_value']==max(season2018['total_value'])]
#Most Consistent Batsman
top10bat=pd.concat([season2016bat.sort_values('mvp_batsman',ascending=False).head(10),
season2017bat.sort_values('mvp_batsman',ascending=False).head(10),
season2018bat.sort_values('mvp_batsman',ascending=False).head(10)])
top10bat.groupby('player').sum().sort_values('mvp_batsman',ascending=False)['mvp_batsman'].head(1)
#Most Consistent Bowler
top10bowl=pd.concat([season2016bowl.sort_values('mvp_bowler',ascending=False).head(10),
season2017bowl.sort_values('mvp_bowler',ascending=False).head(10),
season2018bowl.sort_values('mvp_bowler',ascending=False).head(10)])
top10bowl.groupby('player').sum().sort_values('mvp_bowler',ascending=False)['mvp_bowler'].head(1)
#Worst Player
season2016[season2016['total_value']==min(season2016['total_value'])].tail(1)
season2017[season2017['total_value']==min(season2017['total_value'])].tail(1)
season2018[season2018['total_value']==min(season2018['total_value'])].tail(1)
#Worst Batsman
season2016bat[(season2016bat['mvp_batsman']==min(season2016bat['mvp_batsman']))].tail(1)
season2017bat[(season2017bat['mvp_batsman']==min(season2017bat['mvp_batsman']))].tail(1)
season2018bat[(season2018bat['mvp_batsman']==min(season2018bat['mvp_batsman']))].tail(1)
#Worst Bowler
season2016bowl[(season2016bowl['mvp_bowler']==min(season2016bowl['mvp_bowler']))].tail(1)
season2017bowl[(season2017bowl['mvp_bowler']==min(season2017bowl['mvp_bowler']))].tail(1)
season2018bowl[(season2018bowl['mvp_bowler']==min(season2018bowl['mvp_bowler']))].tail(1)
#Top 25 For Each Year
top25=season2016.sort_values('total_value',ascending=False).head(25)
top25
plt.figure(figsize=[20,7])
plt.title('Top 15 Players of 2016')
sns.set_style('darkgrid')
ax=sns.barplot(x='player',y='total_value',data=top25.head(15))
t=ax.set(xlabel='Player',ylabel='MVP Score')
top25=season2017.sort_values('total_value',ascending=False).head(25)
top25
plt.figure(figsize=[20,7])
plt.title('Top 15 Players of 2017')
sns.set_style('darkgrid')
ax=sns.barplot(x='player',y='total_value',data=top25.head(15))
t=ax.set(xlabel='Player',ylabel='MVP Score')
top25=season2018.sort_values('total_value',ascending=False).head(25)
top25
plt.figure(figsize=[20,7])
plt.title('Top 15 Players of 2018')
sns.set_style('darkgrid')
ax=sns.barplot(x='player',y='total_value',data=top25.head(15))
t=ax.set(xlabel='Player',ylabel='MVP Score')
top25=season2019.sort_values('total_value',ascending=False).head(25)
top25
plt.figure(figsize=[20,7])
plt.title('Top 15 Players of 2019')
sns.set_style('darkgrid')
ax=sns.barplot(x='player',y='total_value',data=top25.head(15))
t=ax.set(xlabel='Player',ylabel='MVP Score')
mvpimproved=season2018.merge(season2019,left_on='player',right_on='player')
mvpimproved.head()
mvpimproved=mvpimproved[['player','total_value_x','total_value_y']]
mvpimproved['improvement']=mvpimproved['total_value_y']-mvpimproved['total_value_x']
player=mvpimproved.sort_values('improvement',ascending=False).head(1)
player
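As a sanity check on the "most improved" logic, here is a minimal sketch on made-up scores. It assumes one row per player per season; the explicit `suffixes` are an illustrative choice rather than what the code above uses:

```python
import pandas as pd

# Made-up season totals.
s2018 = pd.DataFrame({'player': ['A', 'B', 'C'], 'total_value': [100.0, 50.0, 80.0]})
s2019 = pd.DataFrame({'player': ['A', 'B', 'D'], 'total_value': [90.0, 120.0, 60.0]})

# Inner merge: only players present in both seasons can be compared.
both = s2018.merge(s2019, on='player', suffixes=('_2018', '_2019'))
both['improvement'] = both['total_value_2019'] - both['total_value_2018']

most_improved = both.loc[both['improvement'].idxmax(), 'player']
print(most_improved)
```

Players who appear in only one of the two seasons (C and D here) drop out of the comparison.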
Stadium=Matches.groupby('city').sum(numeric_only=True)
Stadium['total_runs']=Stadium['first_bat_score']+Stadium['second_bat_score']
Stadium.reset_index(inplace=True)
Stadium=Stadium.sort_values('total_runs',ascending=False).head()
Stadium
plt.figure(figsize=[10,7])
plt.title('Stadiums with most runs')
ax=sns.barplot(x='city',y='total_runs',data=Stadium.head(),palette='pastel')
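The task also asks for the venue with the fewest runs. A hedged sketch on made-up match data, assuming the same `city`, `first_bat_score` and `second_bat_score` columns, that reports both ends of the ranking and also a per-match average, since raw totals favour venues that hosted more matches:

```python
import pandas as pd

# Made-up match results.
toy_matches = pd.DataFrame({
    'city': ['X', 'X', 'Y', 'Z', 'Z', 'Z'],
    'first_bat_score': [150, 180, 120, 200, 160, 140],
    'second_bat_score': [140, 170, 0, 190, 161, 100],
})
toy_matches['total_runs'] = toy_matches['first_bat_score'] + toy_matches['second_bat_score']

per_city = toy_matches.groupby('city')['total_runs'].sum()
highest = per_city.idxmax()   # venue with the most runs overall
lowest = per_city.idxmin()    # venue with the fewest runs overall

# Average runs per match, which does not depend on how many games a venue hosted.
per_match = toy_matches.groupby('city')['total_runs'].mean()
print(highest, lowest, per_match.idxmax())
```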
# Mark every delivery with 1 so that summing this column counts the balls bowled
Deliveries['bowled_over']=1
Deliveries.head()
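bowler=Deliveries.groupby('bowler')[['bowled_over','type_out','total_runs']].sum()
# Economy: runs conceded per six balls; average: runs conceded per wicket
bowler['economy']=6*bowler['total_runs']/bowler['bowled_over']
bowler['average']=bowler['total_runs']/bowler['type_out']
# Keep only bowlers with more than 300 balls bowled; copy so new columns can be added safely
bowler=bowler[bowler['bowled_over']>300].copy()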
bowler['eco_norm']=(bowler['economy']-bowler['economy'].mean())/bowler['economy'].std()
bowler['avg_norm']=(bowler['average']-bowler['average'].mean())/bowler['average'].std()
bowler['wicket_to_runs']=bowler['eco_norm']+bowler['avg_norm']
bowler=bowler.sort_values('wicket_to_runs')
bowler.reset_index(inplace=True)
bowler.head(10)
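The ranking above standardises economy and average (z-scores) and adds them, so a lower combined score means fewer runs conceded per over and per wicket. A worked sketch on three made-up bowlers; the helper name `zscore` and the column names are illustrative only:

```python
import pandas as pd

# Made-up career figures for three bowlers.
toy = pd.DataFrame({
    'bowler': ['A', 'B', 'C'],
    'balls': [600, 600, 600],
    'runs': [700, 800, 900],
    'wickets': [40, 25, 30],
}).set_index('bowler')

toy['economy'] = 6 * toy['runs'] / toy['balls']   # runs per over
toy['average'] = toy['runs'] / toy['wickets']     # runs per wicket

def zscore(s):
    # Centre on the mean and scale by the sample standard deviation.
    return (s - s.mean()) / s.std()

toy['score'] = zscore(toy['economy']) + zscore(toy['average'])
ranking = toy['score'].sort_values().index.tolist()
print(toy)
print(ranking)
```

Standardising first puts the two metrics on the same scale, so neither one dominates the sum simply because its raw numbers are larger.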
from palettable.colorbrewer.qualitative import Paired_12_r
plt.figure(figsize=[20,7])
plt.title('Bowlers who concede the fewest runs and take the most wickets')
sns.set_style('darkgrid')
ax=sns.barplot(x='bowler',y='wicket_to_runs',data=bowler.head(10),palette=Paired_12_r.mpl_colors)
t=ax.set(xlabel='Bowler',ylabel='Balance b/w Average and Economy (Lower the better)')
# Start By Looking at Most Consistent Players For Seasons from 2016-2019
consistent_players=pd.concat([season2016.sort_values('total_value',ascending=False).head(20),
season2017.sort_values('total_value',ascending=False).head(20),
season2018.sort_values('total_value',ascending=False).head(20),
season2019.sort_values('total_value',ascending=False).head(20)])
consistent_players_total=consistent_players.groupby('player').sum().sort_values('total_value',ascending=False)
consistent_players_total.drop('season',axis=1,inplace=True)
consistent_players_total.head(30)
consistent_players_total.sort_values('mvp_batsman',ascending=False).head(30)
consistent_players_total.sort_values('mvp_bowler',ascending=False).head(30)
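Summing the scores of each season's top 20 rewards both high scores and repeated appearances. One way to separate the two, sketched here on made-up data with a top-2 cut-off, is to also count how many seasons each player made the list. The column name `seasons_in_top` is hypothetical:

```python
import pandas as pd

# Made-up season scores for three players.
toy = pd.DataFrame({
    'player': ['A', 'B', 'C'] * 3,
    'season': [2016] * 3 + [2017] * 3 + [2018] * 3,
    'total_value': [300, 200, 100, 50, 220, 210, 310, 205, 90],
})

# Top 2 players of each season.
top2 = toy.sort_values('total_value', ascending=False).groupby('season').head(2)

# Per player: number of seasons in the top 2 and the score summed over those seasons.
summary = top2.groupby('player').agg(
    seasons_in_top=('season', 'nunique'),
    total_value=('total_value', 'sum'),
)
print(summary.sort_values('seasons_in_top', ascending=False))
```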
From the above data, we need to choose
DAWarner=mvp[mvp['player']=='DA Warner']
DAWarner
plt.title("David Warner")
ax=sns.pointplot(x='season',y='mvp_batsman',data=DAWarner)
ticks=np.arange(150,400,step=20)
ax=ax.set(yticks=ticks)
We select David Warner as our first overseas player. We need to select 10 more players, of whom at most 3 can be overseas players. Team so far:
VK=mvp[mvp['player']=='V Kohli']
VK
plt.title("Virat Kohli")
ax=sns.pointplot(x='season',y='mvp_batsman',data=VK)
ticks=np.arange(100,300,step=20)
ax=ax.set(yticks=ticks)
Team So Far:
SD=mvp[mvp['player']=='S Dhawan']
SD
plt.title("Shikhar Dhawan")
ax=sns.pointplot(x='season',y='mvp_batsman',data=SD)
ticks=np.arange(150,300,step=20)
ax=ax.set(yticks=ticks)
Team So Far:
We have picked our top order. We will now pick the middle order, for which we need 2 batsmen and 1 wicketkeeper-batsman. We have 3 overseas slots remaining.
We will skip AM Rahane since he is a top-order batsman and the top order is already covered; let us keep him as a backup. We shall keep our next MVP batsman, KL Rahul, as a backup as well.
SR=consistent_players[consistent_players['player']=='SK Raina']
SR
plt.title("Suresh Raina")
ax=sns.pointplot(x='season',y='mvp_batsman',data=SR)
ticks=np.arange(100,300,step=20)
ax=ax.set(yticks=ticks)
ABD=mvp[mvp['player']=='AB de Villiers']
ABD
plt.title("AB de Villiers")
ax=sns.pointplot(x='season',y='mvp_batsman',data=ABD)
ticks=np.arange(100,300,step=20)
ax=ax.set(yticks=ticks)
We have picked 5 batsmen so far, two of whom are overseas players.
Batting Lineup:
Now, we need to pick a wicket keeper.
Q de Kock is the highest-ranked wicketkeeper in our most consistent batsmen dataframe, but choosing him would leave us with only 1 overseas slot. So we will choose the next best wicketkeeper who is Indian, RR Pant.
RP=mvp[mvp['player']=='RR Pant']
RP
plt.title("Rishabh Pant")
ax=sns.pointplot(x='season',y='mvp_batsman',data=RP)
ticks=np.arange(100,300,step=20)
ax=ax.set(yticks=ticks)
Batting Lineup so far:
We have 2 overseas players remaining. We need to pick 1 batting allrounder, 1 bowling allrounder, 3 pure bowlers.
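The bookkeeping behind these counts can be sketched as a small helper. The limit of 4 overseas players in an 11-player side is the one implied by the walkthrough; the overseas flags are added by hand here, and the helper name `remaining_slots` is hypothetical:

```python
MAX_OVERSEAS = 4   # overseas limit implied by the walkthrough
TEAM_SIZE = 11

# Picks made so far, with a hand-added flag: True means overseas player.
team_so_far = {
    'DA Warner': True,
    'V Kohli': False,
    'S Dhawan': False,
    'SK Raina': False,
    'AB de Villiers': True,
    'RR Pant': False,
}

def remaining_slots(team):
    """Return (players still to pick, overseas slots still available)."""
    overseas_used = sum(team.values())
    return TEAM_SIZE - len(team), MAX_OVERSEAS - overseas_used

players_left, overseas_left = remaining_slots(team_so_far)
print(players_left, overseas_left)
```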
To choose the batting allrounder, we look at the players with the highest mvp_bowler value among the top 50 batsmen (by mvp_batsman) of each season from 2016 to 2019.
top25bat=pd.concat([season2016.sort_values('mvp_batsman',ascending=False).head(50),
season2017.sort_values('mvp_batsman',ascending=False).head(50),
season2018.sort_values('mvp_batsman',ascending=False).head(50),
season2019.sort_values('mvp_batsman',ascending=False).head(50)])
top25bat.drop('season',axis=1,inplace=True)
top25bat.groupby('player').sum().sort_values('mvp_bowler',ascending=False).head(10)
SP=mvp[mvp['player']=='SP Narine']
SP
plt.title("Sunil Narine Batting")
ax=sns.pointplot(x='season',y='mvp_batsman',data=SP)
ticks=np.arange(60,200,step=20)
ax=ax.set(yticks=ticks)
RJ=mvp[mvp['player']=='RA Jadeja']
RJ
plt.title("Ravindra Jadeja Batting")
ax=sns.pointplot(x='season',y='mvp_batsman',data=RJ)
ticks=np.arange(60,200,step=20)
ax=ax.set(yticks=ticks)
KP=mvp[mvp['player']=='KH Pandya']
KP
plt.title("Krunal Pandya Batting")
ax=sns.pointplot(x='season',y='mvp_batsman',data=KP)
ticks=np.arange(60,200,step=20)
ax=ax.set(yticks=ticks)
HP=mvp[mvp['player']=='HH Pandya']
HP
plt.title("Hardik Pandya Batting")
ax=sns.pointplot(x='season',y='mvp_batsman',data=HP)
ticks=np.arange(60,300,step=20)
ax=ax.set(yticks=ticks)
If we look at the mvp_batsman graphs of the 4 players, HH Pandya and KH Pandya are the most consistent, with mvp_batsman around 60. RA Jadeja and SP Narine show a dip in form. If we choose HH Pandya, we will need to choose an extra spinner. If we choose KH Pandya, we will need to choose an extra pacer.
Team so far:
We now need to choose a bowling allrounder. To do so, we look at the players with the highest mvp_batsman value among the top 50 bowlers (by mvp_bowler) of each season from 2016 to 2019.
top25bowl=pd.concat([season2016.sort_values('mvp_bowler',ascending=False).head(50),
season2017.sort_values('mvp_bowler',ascending=False).head(50),
season2018.sort_values('mvp_bowler',ascending=False).head(50),
season2019.sort_values('mvp_bowler',ascending=False).head(50)])
top25bowl.drop('season',axis=1,inplace=True)
top25bowl.groupby('player').sum().sort_values('mvp_batsman',ascending=False).head(10)
SR Watson is retired. Since we already have KH Pandya shortlisted, we can go with him.
plt.title("Krunal Pandya Bowling")
ax=sns.pointplot(x='season',y='mvp_bowler',data=KP)
ticks=np.arange(60,200,step=20)
ax=ax.set(yticks=ticks)
Team so far:
We need to choose 3 bowlers and we still have 2 overseas slots remaining. We can go with either 2 spinners or 2 pacers. The top 2 in the most consistent bowlers list are Rashid Khan and SP Narine. Both are overseas players, and we can afford to pick both of them.
RK=mvp[mvp['player']=='Rashid Khan']
RK
plt.title("Rashid Khan")
ax=sns.pointplot(x='season',y='mvp_bowler',data=RK)
ticks=np.arange(100,300,step=20)
ax=ax.set(yticks=ticks)
plt.title("Sunil Narine Bowling")
ax=sns.pointplot(x='season',y='mvp_bowler',data=SP)
ticks=np.arange(100,300,step=20)
ax=ax.set(yticks=ticks)
We see a dip in Sunil Narine's form from 2017. We shall go with Rashid Khan for now and keep Sunil Narine as a backup. We have to pick 2 more bowlers, at least one of them a pacer, and one more overseas player can be accommodated. The highest-ranked pace bowler in our list is B Kumar.
Batting Lineup:
BK=mvp[mvp['player']=='B Kumar']
BK
plt.title("Bhuvneshwar Kumar")
ax=sns.pointplot(x='season',y='mvp_bowler',data=BK)
ticks=np.arange(100,300,step=20)
ax=ax.set(yticks=ticks)
Batting Lineup:
We now need to pick one more bowler. If we go with a pacer, we can choose JD Unadkat, DJ Bravo or JJ Bumrah. We can also go with Sunil Narine if we want a spinner.
JU=mvp[mvp['player']=='JD Unadkat']
JU
plt.title("Jaydev Unadkat")
ax=sns.pointplot(x='season',y='mvp_bowler',data=JU)
ticks=np.arange(100,300,step=20)
ax=ax.set(yticks=ticks)
DJ=mvp[mvp['player']=='DJ Bravo']
DJ
plt.title("Dwayne Bravo")
ax=sns.pointplot(x='season',y='mvp_bowler',data=DJ)
ticks=np.arange(100,300,step=20)
ax=ax.set(yticks=ticks)
JB=mvp[mvp['player']=='JJ Bumrah']
JB
plt.title("Jasprit Bumrah")
ax=sns.pointplot(x='season',y='mvp_bowler',data=JB)
ticks=np.arange(100,300,step=20)
ax=ax.set(yticks=ticks)
Based on the graphs above, we go with JJ Bumrah. So our playing 11 is as follows. Batting Lineup:
Bench: