Exploratory Data Analysis (EDA), according to Wikipedia, is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. Primarily, it is for seeing what the data can tell us beyond the formal modeling or hypothesis-testing task. Too much theory? I know. That’s why we shall perform hands-on EDA on an IPL dataset found on Kaggle.
Note: The dataset differs from the official stats, so any results of our EDA will not hold for the real world.
Problem Statement
The task is to analyze ball-by-ball data from 2008 to 2019 and use it to build your own IPL dream team. For the years 2016, 2017 and 2018, we need to:
- Find the most valuable player – explain why
- Find the most consistent batsman – explain why
- Find the most consistent bowler – explain why
- Find the worst player – explain why
- Find the worst batsman – explain why
- Find the worst bowler – explain why
- Rank the top 25 players for the years 2016 to 2019
- Identify the most improved player from 2018 to 2019
- Find which stadium saw the most runs and which the least
- Find which bowler conceded the fewest runs while taking the most wickets (economy)
- Design a super awesome dream team for 2020.
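One metric from the list above, economy, is simply runs conceded per over (6 legal balls) bowled. Here is a minimal sketch of how it could be computed; the column names (`bowler`, `total_runs`) and the toy data are assumptions for illustration, not the actual Kaggle schema, and extras such as no-balls are ignored:

```python
import pandas as pd

# Toy ball-by-ball data; column names are assumptions, not the real schema
deliveries = pd.DataFrame({
    "bowler": ["A", "A", "A", "A", "A", "A", "B", "B", "B"],
    "total_runs": [0, 1, 4, 0, 2, 0, 6, 1, 0],
})

# Economy = runs conceded per over (6 balls); extras are ignored in this sketch
stats = deliveries.groupby("bowler")["total_runs"].agg(runs="sum", balls="count")
stats["economy"] = stats["runs"] / (stats["balls"] / 6)
print(stats)
```

On the toy data, bowler A concedes 7 runs off a full over (economy 7.0), while B concedes 7 off half an over (economy 14.0).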
The First Steps
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Deliveries = pd.read_csv(r'innings_data.csv')
Let’s understand what format the data is in and what it actually contains.
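A quick first look usually means checking the shape, the dtypes and the first few rows. A small sketch with a hypothetical miniature stand-in for the CSV (the columns here are made up for illustration):

```python
import pandas as pd

# Hypothetical miniature stand-in for innings_data.csv
Deliveries = pd.DataFrame({
    "match_id": [1, 1, 2],
    "batsman": ["X", "Y", "X"],
    "total_runs": [4, 0, 6],
})

print(Deliveries.shape)   # (rows, columns)
print(Deliveries.dtypes)  # data type of each column
print(Deliveries.head())  # first few rows
```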
Okay… we have some pretty interesting columns, and some might require preprocessing.
Let’s check out matches.csv [no pun intended].
Great. A very important step in solving any data science problem is understanding what the data is, what it looks like, and how it should be formatted to fit our needs and the problem statement.
Data Cleaning
Now that we have taken a small peek at the structure of the data, we need to make sure there are no null values in our dataset. Eliminating null values is crucial. It can be done either by dropping the affected records or by filling in the missing values. However, there is a catch with both methods: if we drop every record that contains a null value, we might end up with very little data, while if we fill in the missing values, we lose some of the “essence” of the data. It is the job of a data scientist to diagnose which method should be used in each case. Let’s now check what percentage of values are missing in each column.
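The missing-value percentage per column is a one-liner in pandas. A minimal sketch on a toy frame with deliberate gaps (the real columns come from the Kaggle files):

```python
import numpy as np
import pandas as pd

# Toy frame with deliberate gaps; the real columns come from the Kaggle files
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "umpire3": [np.nan, np.nan, np.nan, np.nan],
    "city": ["Bangalore", np.nan, "Chennai", "Mumbai"],
})

# Fraction of nulls per column, expressed as a percentage
missing_pct = df.isnull().mean() * 100
print(missing_pct)
```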
id    0.00000
Replacements?! (Duh!)
id    0.000000
Eliminator? Check. Result? Check. dl_applied? Check. umpire3? Double check. Other columns with missing values will be handled eventually.
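Columns that are almost entirely empty (like umpire3) can simply be dropped. One hedged way to do this is with a missing-fraction threshold; the 90% cutoff and the toy columns below are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Toy matches frame; umpire3 is almost entirely empty, as in the real file
matches = pd.DataFrame({
    "id": [1, 2, 3],
    "winner": ["CSK", "MI", np.nan],
    "umpire3": [np.nan, np.nan, np.nan],
})

# Drop any column whose missing fraction exceeds the threshold (here 90%)
threshold = 0.9
mostly_empty = matches.columns[matches.isnull().mean() > threshold]
matches = matches.drop(columns=mostly_empty)
print(list(matches.columns))
```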
Shape: (176573, 20)
Shape: (746, 24)
# We can see that the ids of the matches are not sorted, hence the sorting.
id    int64
Now we have an overview of the entire data. We know which columns have missing values and what their data types are. It is time to make changes to our dataframes to fit our purpose.
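Sorting the matches by id is a straightforward `sort_values` call; a minimal sketch on toy data (the `id`/`season` columns mirror the real file, the values are made up):

```python
import pandas as pd

# ids arrive out of order, so we sort before any chronological analysis
matches = pd.DataFrame({"id": [3, 1, 2], "season": [2018, 2016, 2017]})
matches = matches.sort_values("id").reset_index(drop=True)
print(matches["id"].tolist())
```

`reset_index(drop=True)` keeps the row index consecutive after the sort instead of preserving the old positions.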
# Remove the replacements and id columns
# type_out holds the kind of dismissal (e.g. stumped, run out); we don't need
# that detail, so we replace any dismissal with 1
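The two comments above could be realized as follows; this is a sketch, assuming the column names `replacements`, `id` and `type_out` (the last holding NaN when no wicket fell):

```python
import numpy as np
import pandas as pd

# Toy deliveries frame; column names are assumptions for illustration
deliveries = pd.DataFrame({
    "id": [1, 2, 3],
    "replacements": [np.nan, np.nan, np.nan],
    "type_out": [np.nan, "run out", "stumped"],
})

# Drop the columns we don't need
deliveries = deliveries.drop(columns=["replacements", "id"])

# We only care whether a wicket fell, not how, so map any dismissal to 1
deliveries["type_out"] = np.where(deliveries["type_out"].notnull(), 1, 0)
print(deliveries["type_out"].tolist())
```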
Let’s proceed with the changes in the Matches dataframe.
season    0
Sweet stuff again!!
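For the remaining matches columns with only a handful of gaps, filling beats dropping whole rows. A hedged sketch of one way to do it; the placeholder values and toy columns are assumptions, not the notebook's exact choices:

```python
import numpy as np
import pandas as pd

# Toy matches frame with sparse gaps in categorical columns
matches = pd.DataFrame({
    "season": [2016, 2017, 2018],
    "city": ["Bangalore", np.nan, "Chennai"],
    "winner": ["RCB", "MI", np.nan],
})

# Fill sparse categorical gaps with a placeholder instead of dropping rows
matches["city"] = matches["city"].fillna("Unknown")
matches["winner"] = matches["winner"].fillna("No Result")

print(matches.isnull().sum())  # every column should now report 0
```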
Up to this point, all we did was prepare our data for analysis: we got an overview of the data, looked for missing values and handled everything like a boss. This was part 1 of the EDA we will be performing – quite simple and small. The next part will have the entire analysis; until then, try the problem yourself and see where it leads you. This task was given to us (Nithin and Pareekshith) at Ambee during our internship.