Exploratory Data Analysis (EDA), according to Wikipedia, is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. Its primary purpose is to see what the data can tell us beyond the formal modeling or hypothesis testing task. Too much theory? I know. That’s why we will perform a hands-on EDA of an IPL dataset found on Kaggle.
Note: The data set differs from the official stats, so the results of this EDA do not carry over to the real world.
Problem Statement
The task is to analyze ball-by-ball data from 2008 to 2019 and use that analysis to form our own dream team for the IPL. For the years 2016, 2017 and 2018, we need to:
- Find out the most valuable player, and explain why
- Find out the most consistent batsman, and explain why
- Find out the most consistent bowler, and explain why
- Find out the worst player, and explain why
- Find out the worst batsman, and explain why
- Find out the worst bowler, and explain why
- Rank the top 25 players for the years 2016 to 2019
- Identify the most improved player from 2018 to 2019
- Find out which stadium had the most runs and which had the fewest
- Find out which bowler conceded the fewest runs and took the most wickets (economy)
- Design a super awesome dream team for 2020
The First Steps
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Deliveries=pd.read_csv(r'innings_data.csv')
Matches=pd.read_csv(r'match_data.csv')
Let’s look at how the data is formatted and what exactly it contains.
Deliveries.head()
Okay… We have some pretty interesting columns, and some of them might require preprocessing.
Let’s check out [no pun intended] the match data from match_data.csv.
Matches.head()
Great. A very important step in solving any data science problem is understanding what the data is, how it is structured, and how it should be formatted to fit our needs or the problem statement.
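To make that first look a bit more systematic, here is a minimal sketch of a per-column overview. The `overview` helper and the toy `sample` frame are hypothetical names with made-up values; they are not part of the Kaggle dataset or the original notebook.

```python
import pandas as pd

# Toy stand-in for a few deliveries rows (made-up values, not the real data).
sample = pd.DataFrame({
    "season": [2016, 2016, 2017, 2017],
    "batsman": ["A", "B", "A", "C"],
    "batsman_runs": [4, 0, 6, 1],
    "type_out": ["0", "caught", "0", "0"],
})

def overview(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, non-null count and distinct values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "non_null": df.notna().sum(),
        "unique": df.nunique(),
    })

summary = overview(sample)
print(summary)
```

Calling `overview(Deliveries)` or `overview(Matches)` should give the same kind of table for the real frames, which is a quick way to spot columns that are nearly constant or nearly empty.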
Data Cleaning
Now that we have taken a small peek at the structure of the data, we need to deal with null values in our dataset. It is crucial to eliminate as many null values as possible, either by removing the affected records or by filling in the missing values. However, both methods have a catch. If we remove every record that has a null value, we might end up with very little data. If we fill in the missing values instead, we lose some of the “essence” of the data. It is the job of a data scientist to diagnose which method suits each case. Let’s now check what percentage of values is missing in each column.
Deliveries.isnull().sum()*100/Deliveries.shape[0]
id 0.00000
season 0.00000
batsman 0.00000
bowler 0.00000
innings 0.00000
non_striker 0.00000
replacements 99.98301
bowled_over 0.00000
batsman_team 0.00000
player_out 0.00000
fielder_caught_out 0.00000
type_out 0.00000
extras_wides 0.00000
extras_legbyes 0.00000
extras_noballs 0.00000
extras_byes 0.00000
extras_penalty 0.00000
total_extras_runs 0.00000
batsman_runs 0.00000
total_runs 0.00000
dtype: float64
Replacements?! (Duh!)
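The catch with dropping versus filling is easier to see on a tiny example. This is only a sketch on a made-up frame (the `toy` values are invented, and filling with the mean is just one possible strategy, not the one used in this post).

```python
import numpy as np
import pandas as pd

# Toy frame with gaps (made-up values, not the IPL data).
toy = pd.DataFrame({
    "city": ["Mumbai", None, "Delhi", "Chennai"],
    "first_bat_score": [180.0, 150.0, np.nan, 165.0],
})

# Option 1: drop every row that has at least one missing value.
dropped = toy.dropna()

# Option 2: keep every row and fill the gaps with placeholder values.
filled = toy.fillna({
    "city": "unknown",
    "first_bat_score": toy["first_bat_score"].mean(),
})

# Dropping loses rows; filling keeps them but shrinks the spread of the scores.
print(len(toy), len(dropped), len(filled))
print(toy["first_bat_score"].std(), filled["first_bat_score"].std())
```

Dropping halves the toy data, while filling keeps every row but makes the scores look less varied than they really are. That is the loss of “essence” in miniature.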
Matches.isnull().sum()*100/Matches.shape[0]
id 0.000000
season 0.000000
city 1.742627
date 0.000000
team1 0.000000
team2 0.000000
toss_winner 0.000000
toss_decision 0.000000
winner 1.608579
eliminator 98.927614
dl_applied 97.453083
win_by_runs 54.959786
win_by_wickets 46.648794
result 98.391421
overs 0.000000
player_of_match 0.536193
venue 0.000000
umpire1 0.134048
umpire2 0.134048
umpire3 99.731903
first_bat_team 0.000000
first_bowl_team 0.000000
first_bat_score 0.000000
second_bat_score 0.268097
dtype: float64
Eliminator? Check. Result? Check. dl_applied? Check. umpire3? Double check. Other columns with missing values will be handled eventually.
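Picking out the mostly empty columns by eye works for two small tables, but it can also be done in code. The sketch below assumes a hypothetical `missing_percent` helper and an arbitrary 50% cut-off; neither comes from the original notebook, and the toy values are made up.

```python
import numpy as np
import pandas as pd

def missing_percent(df: pd.DataFrame) -> pd.Series:
    """Percentage of missing values per column, largest first."""
    # isna().mean() is the same as isna().sum() / number of rows.
    return df.isna().mean().mul(100).sort_values(ascending=False)

# Toy frame (made-up values) with one mostly empty column.
toy = pd.DataFrame({
    "umpire3": [np.nan, np.nan, np.nan, "X"],   # 75% missing
    "city": ["Mumbai", None, "Delhi", "Pune"],  # 25% missing
    "season": [2016, 2016, 2017, 2017],         # complete
})

report = missing_percent(toy)
THRESHOLD = 50.0  # hypothetical cut-off, chosen only for illustration
mostly_empty = report[report > THRESHOLD].index.tolist()

print(report)
print(mostly_empty)
```

A threshold like this is only a starting point. A column can be mostly empty because the event it records is rare, so each flagged column still deserves a look before it is dropped.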
print('Deliveries Details')
print('Shape: ',Deliveries.shape)
print('Size: ',Deliveries.size)
print('Dimensions: ',Deliveries.ndim)
Deliveries Details
Shape: (176573, 20)
Size: 3531460
Dimensions: 2
print('Matches Details')
print('Shape: ',Matches.shape)
print('Size: ',Matches.size)
print('Dimensions: ',Matches.ndim)
Matches Details
Shape: (746, 24)
Size: 17904
Dimensions: 2
#The ids of the matches are not in order, so we sort the matches by id.
Matches=Matches.sort_values('id')
Deliveries.dtypes
id int64
season int64
batsman object
bowler object
innings object
non_striker object
replacements object
bowled_over float64
batsman_team object
player_out object
fielder_caught_out object
type_out object
extras_wides int64
extras_legbyes int64
extras_noballs int64
extras_byes int64
extras_penalty int64
total_extras_runs int64
batsman_runs int64
total_runs int64
dtype: object
Matches.dtypes
id int64
season int64
city object
date object
team1 object
team2 object
toss_winner object
toss_decision object
winner object
eliminator object
dl_applied object
win_by_runs float64
win_by_wickets float64
result object
overs int64
player_of_match object
venue object
umpire1 object
umpire2 object
umpire3 object
first_bat_team object
first_bowl_team object
first_bat_score float64
second_bat_score float64
dtype: object
Now we have an overview of the entire data. We know which columns have missing values and what their data types are. It is time to make changes to our dataframes to fit our purpose.
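Since the fill value for a gap depends on whether a column is numeric or text, it can help to split the columns by type first. A small sketch on a made-up frame (the `toy` values are invented for illustration):

```python
import pandas as pd

# Toy frame (made-up values) mixing numeric and text columns.
toy = pd.DataFrame({
    "season": [2016, 2017],
    "city": ["Mumbai", "Delhi"],
    "win_by_runs": [12.0, 0.0],
    "winner": ["A", "B"],
})

# Numeric columns can be filled with numbers such as 0;
# everything else needs a text placeholder such as 'unknown'.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
other_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(other_cols)
```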
#Remove the replacements and id columns
Deliveries.drop(['replacements','id'],inplace=True,axis=1)
#type_out holds the kind of dismissal (stumped, run out, ...) and we don't really need that information,
#hence every run of letters or whitespace is replaced with '1' to mark a wicket
Deliveries['type_out']=Deliveries['type_out'].replace(r'[a-zA-Z\s]+','1',regex=True)
Deliveries['type_out']=Deliveries['type_out'].astype(int)
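To see what the regex replacement does, here is a sketch on a toy Series. It assumes that deliveries without a dismissal are stored as the string '0', which the zero missing values in `type_out` suggest but the post does not state; the sample values are made up.

```python
import pandas as pd

# Made-up values; assumes "no dismissal" is stored as the string '0'.
type_out = pd.Series(["0", "caught", "run out", "0", "lbw"])

# Regex route: any run of letters or whitespace becomes '1', then cast to int.
via_regex = type_out.replace(r"[a-zA-Z\s]+", "1", regex=True).astype(int)

# Comparison route: anything that is not '0' counts as a wicket.
via_compare = (type_out != "0").astype(int)

print(via_regex.tolist())
print(via_compare.tolist())
```

Both routes give the same wicket flag on this sample. The comparison version does not depend on which characters appear in the dismissal names, so it may be the safer choice if that assumption about '0' holds.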
print(Deliveries.isnull().sum())
Deliveries.head()
season 0
batsman 0
bowler 0
innings 0
non_striker 0
bowled_over 0
batsman_team 0
player_out 0
fielder_caught_out 0
type_out 0
extras_wides 0
extras_legbyes 0
extras_noballs 0
extras_byes 0
extras_penalty 0
total_extras_runs 0
batsman_runs 0
total_runs 0
dtype: int64
Sweet!!
Let’s proceed with the changes in the Matches dataframe.
Matches.drop(['id','eliminator','result','umpire3','dl_applied'],inplace=True,axis=1)
#Fill each column and assign the result back; calling fillna(inplace=True)
#on a single column triggers chained-assignment warnings in recent pandas versions
Matches=Matches.fillna({
    'city':'unknown',
    'winner':'No Result',
    'player_of_match':'unknown',
    'umpire1':'unknown',
    'umpire2':'unknown',
    'second_bat_score':0,
    'win_by_runs':0.0,
    'win_by_wickets':0.0,
})
Matches.isnull().sum()
season 0
city 0
date 0
team1 0
team2 0
toss_winner 0
toss_decision 0
winner 0
win_by_runs 0
win_by_wickets 0
overs 0
player_of_match 0
venue 0
umpire1 0
umpire2 0
first_bat_team 0
first_bowl_team 0
first_bat_score 0
second_bat_score 0
dtype: int64
Sweet stuff again!!
Matches['win_by_runs']=Matches['win_by_runs'].astype(int)
Matches['win_by_wickets']=Matches['win_by_wickets'].astype(int)
Matches.head()
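The cast to int only works after the fill, because a column containing NaN has to be stored as float. A minimal sketch with made-up margins:

```python
import numpy as np
import pandas as pd

# Made-up win margins; NaN stands for matches with no run margin recorded.
win_by_runs = pd.Series([12, np.nan, 35, np.nan])

# The NaN values force the whole column to float64.
print(win_by_runs.dtype)

# Once the gaps are filled, the column can be cast to integers.
filled = win_by_runs.fillna(0).astype(int)
print(filled.tolist())
```

Calling `astype(int)` on the unfilled Series would raise an error, which is why the order of the two steps matters.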
Up to this point, all we did was prepare our data for analysis. We got an overview of the data, looked for missing values and handled everything like a boss. This was part 1 of the EDA we will be performing, and it was quite simple and short. The next part will have the entire analysis; until then, try the problem yourself and see where it leads you. This task was given to us (Nithin and Pareekshith) at Ambee during our internship.
Cya.