With Machine Learning and Data Science buzzwords flying around everywhere, it is natural to want to find out what these terms actually mean. Every aspiring data scientist goes looking for resources to learn from and ends up carving their own path into the dream world of Artificial Intelligence. This article aims to contribute a small part to your long path toward becoming a successful data scientist. OK! Enough intro, let’s dive in.
Exploratory Data Analysis (EDA), according to Wikipedia, is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. Primarily, it is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task. Too much theory? I know. That’s why we will perform a hands-on EDA of an IPL dataset found on Kaggle.
Note: The data set differs from the official stats, so any result of the EDA performed here does not carry over to the real world.

## Problem Statement

The task is to analyze ball-by-ball data from 2008 to 2019 and use that analysis to form our own dream team for the IPL. For the years 2016, 2017 and 2018, we need to:
1. Find the most valuable player and explain why
2. Find the most consistent batsman and explain why
3. Find the most consistent bowler and explain why
4. Find the worst player and explain why
5. Find the worst batsman and explain why
6. Find the worst bowler and explain why
7. Rank the top 25 players for the years 2016 to 2019
8. Identify the most improved player from 2018 to 2019
9. Find which stadium saw the most runs scored and which saw the least
10. Find which bowler conceded the fewest runs and took the most wickets (economy)
11. Design a super awesome dream team for 2020

# The First Steps

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```
```python
Deliveries = pd.read_csv(r'innings_data.csv')
Matches = pd.read_csv(r'match_data.csv')
```
Let’s understand the format of the data and what it actually contains.
`Deliveries.head()`

Okay… We have some pretty interesting columns, and some of them might require preprocessing.
Let’s check out [no pun intended] the Matches dataframe.
`Matches.head()`

Great. A very important step in solving any data science problem is understanding what the data contains, how it is structured and how it should be formatted to fit our needs or the problem statement.

# Data Cleaning

Now that we have taken a small peek at the structure of the data, we need to deal with null values in our dataset. It is crucial to eliminate null values as far as possible, either by removing the affected records or by filling in the missing values. However, both methods have a catch. If we remove every record that has a null value, we might end up with very little data. If we fill the missing values instead, we lose some of the “essence” of the data. It is the data scientist’s job to diagnose which method suits each case. Let’s now check what percentage of values are missing in each column.

`Deliveries.isnull().sum()*100/Deliveries.shape[0]`
```
id                     0.00000
season                 0.00000
batsman                0.00000
bowler                 0.00000
innings                0.00000
non_striker            0.00000
replacements          99.98301
bowled_over            0.00000
batsman_team           0.00000
player_out             0.00000
fielder_caught_out     0.00000
type_out               0.00000
extras_wides           0.00000
extras_legbyes         0.00000
extras_noballs         0.00000
extras_byes            0.00000
extras_penalty         0.00000
total_extras_runs      0.00000
batsman_runs           0.00000
total_runs             0.00000
dtype: float64
```
Replacements?! (Duh!)
`Matches.isnull().sum()*100/Matches.shape[0]`
```
id                   0.000000
season               0.000000
city                 1.742627
date                 0.000000
team1                0.000000
team2                0.000000
toss_winner          0.000000
toss_decision        0.000000
winner               1.608579
eliminator          98.927614
dl_applied          97.453083
win_by_runs         54.959786
win_by_wickets      46.648794
result              98.391421
overs                0.000000
player_of_match      0.536193
venue                0.000000
umpire1              0.134048
umpire2              0.134048
umpire3             99.731903
first_bat_team       0.000000
first_bowl_team      0.000000
first_bat_score      0.000000
second_bat_score     0.268097
dtype: float64
```
Eliminator? Check. Result? Check. dl_applied? Check. umpire3? Double check. Other columns with missing values will be handled eventually.
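Picking out the mostly empty columns can also be done programmatically. The snippet below is a minimal sketch on a toy dataframe: the values, the name `A. Umpire` and the 50% `THRESHOLD` are all invented for illustration and are not taken from the IPL data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Matches dataframe; every value here is invented.
toy_matches = pd.DataFrame({
    "season": [2016, 2017, 2018, 2019],
    "umpire3": [np.nan, np.nan, np.nan, "A. Umpire"],
    "city": ["Mumbai", np.nan, "Chennai", "Delhi"],
})

THRESHOLD = 50.0  # hypothetical cut-off, in percent

# isnull().mean() is the fraction of nulls per column, i.e. sum() / row count.
missing_pct = toy_matches.isnull().mean() * 100

# Names of the columns whose share of nulls exceeds the cut-off.
mostly_empty = missing_pct[missing_pct > THRESHOLD].index.tolist()

trimmed = toy_matches.drop(columns=mostly_empty)
print(mostly_empty)
print(trimmed.columns.tolist())
```

A fixed cut-off still needs judgment. In the real data, `win_by_runs` is about 55% null, yet those nulls carry meaning (the match was won by wickets instead), so it is a column to fill rather than drop.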
```python
print('Deliveries Details')
print('Shape: ', Deliveries.shape)
print('Size: ', Deliveries.size)
print('Dimensions: ', Deliveries.ndim)
```
```
Deliveries Details
Shape:  (176573, 20)
Size:  3531460
Dimensions:  2
```
```python
print('Matches Details')
print('Shape: ', Matches.shape)
print('Size: ', Matches.size)
print('Dimensions: ', Matches.ndim)
```
```
Matches Details
Shape:  (746, 24)
Size:  17904
Dimensions:  2
```
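The three numbers are related: `size` is simply the number of rows times the number of columns, and `ndim` is 2 for any dataframe. Here is a quick sketch on a made-up two-column frame, followed by the same arithmetic on the shapes reported above.

```python
import pandas as pd

# Tiny invented dataframe, used only to show how shape, size and ndim relate.
toy = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

rows, cols = toy.shape
size_matches_shape = toy.size == rows * cols

# Same arithmetic for the shapes printed in the article.
deliveries_cells = 176573 * 20
matches_cells = 746 * 24

print(rows, cols, toy.size, toy.ndim, size_matches_shape)
print(deliveries_cells, matches_cells)
```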
```python
# The ids of the matches are not sorted, hence the sorting.
Matches = Matches.sort_values('id')
```
`Deliveries.dtypes`
```
id                      int64
season                  int64
batsman                object
bowler                 object
innings                object
non_striker            object
replacements           object
bowled_over           float64
batsman_team           object
player_out             object
fielder_caught_out     object
type_out               object
extras_wides            int64
extras_legbyes          int64
extras_noballs          int64
extras_byes             int64
extras_penalty          int64
total_extras_runs       int64
batsman_runs            int64
total_runs              int64
dtype: object
```
`Matches.dtypes`
```
id                    int64
season                int64
city                 object
date                 object
team1                object
team2                object
toss_winner          object
toss_decision        object
winner               object
eliminator           object
dl_applied           object
win_by_runs         float64
win_by_wickets      float64
result               object
overs                 int64
player_of_match      object
venue                object
umpire1              object
umpire2              object
umpire3              object
first_bat_team       object
first_bowl_team      object
first_bat_score     float64
second_bat_score    float64
dtype: object
```
Now we have an overview of the entire data. We know which columns have missing values and what their data types are. It is time to change our dataframes to fit our purpose.
```python
# Remove the replacements and id columns
Deliveries.drop(['replacements', 'id'], inplace=True, axis=1)
# type_out is of the form stumped or run out and we don't really need that information,
# hence we are replacing every dismissal type with 1
Deliveries['type_out'] = Deliveries['type_out'].replace(r'[a-zA-Z\s]+', '1', regex=True)
Deliveries['type_out'] = Deliveries['type_out'].astype(int)
```
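To see what the regex replacement does, here is a small sketch on a toy Series. It assumes that deliveries without a dismissal are stored as the string `"0"`, which is a guess based on the later `astype(int)` succeeding and is not shown in the data preview. Note that a character range written as `A-z` also matches the ASCII characters between `Z` and `a` (such as `_` and `^`), so `A-Z` is the safer spelling.

```python
import pandas as pd

# Toy values; "0" for "no dismissal" is an assumption, not confirmed by the dataset.
type_out = pd.Series(["0", "caught", "run out", "0", "lbw"])

# Any run of letters and whitespace collapses to "1", then the column becomes numeric.
is_out = type_out.replace(r"[a-zA-Z\s]+", "1", regex=True).astype(int)

# The A-z range is broader than it looks: it also swallows an underscore.
loose = pd.Series(["_"]).replace(r"[A-z]+", "1", regex=True).tolist()
strict = pd.Series(["_"]).replace(r"[A-Za-z]+", "1", regex=True).tolist()

# Under the same "0" assumption, a plain comparison gives the same flag without a regex.
is_out_simple = (type_out != "0").astype(int)

print(is_out.tolist())
print(loose, strict)
```

Summing the resulting 0/1 column per bowler or per batsman then counts dismissals directly, which is presumably why the detail of how the player got out is discarded here.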
```python
print(Deliveries.isnull().sum())
Deliveries.head()
```
```
season                0
batsman               0
bowler                0
innings               0
non_striker           0
bowled_over           0
batsman_team          0
player_out            0
fielder_caught_out    0
type_out              0
extras_wides          0
extras_legbyes        0
extras_noballs        0
extras_byes           0
extras_penalty        0
total_extras_runs     0
batsman_runs          0
total_runs            0
dtype: int64
```

Sweet!!
Let’s proceed with the changes in the Matches dataframe.
```python
Matches.drop(['id', 'eliminator', 'result', 'umpire3', 'dl_applied'], inplace=True, axis=1)
# Fill the remaining nulls column by column. Assigning the result back avoids calling
# fillna(inplace=True) on a column selection, which recent pandas versions warn about.
Matches = Matches.fillna({
    'city': 'unknown',
    'winner': 'No Result',
    'player_of_match': 'unknown',
    'umpire1': 'unknown',
    'umpire2': 'unknown',
    'second_bat_score': 0,
    'win_by_runs': 0.0,
    'win_by_wickets': 0.0,
})
```
`Matches.isnull().sum()`
```
season              0
city                0
date                0
team1               0
team2               0
toss_winner         0
toss_decision       0
winner              0
win_by_runs         0
win_by_wickets      0
overs               0
player_of_match     0
venue               0
umpire1             0
umpire2             0
first_bat_team      0
first_bowl_team     0
first_bat_score     0
second_bat_score    0
dtype: int64
```
Sweet stuff again!!
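The trade-off between dropping and filling mentioned earlier is easy to see on a tiny example. This is only a sketch with invented values, not rows from the IPL dataset.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration.
toy = pd.DataFrame({
    "city": ["Mumbai", None, "Chennai", "Delhi"],
    "win_by_runs": [12.0, np.nan, np.nan, 35.0],
})

# Percentage of missing values per column.
missing_pct = toy.isnull().mean() * 100

# Option 1: drop every row that has at least one null. Half of the rows disappear.
dropped = toy.dropna()

# Option 2: keep every row and fill the gaps with placeholder values.
filled = toy.fillna({"city": "unknown", "win_by_runs": 0})

print(missing_pct)
print(len(toy), len(dropped), len(filled))
```

Dropping keeps only genuine values but shrinks the data, while filling keeps every row at the cost of placeholder values that later analysis has to treat with care (a filled `0` is not a real margin of victory).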
```python
Matches['win_by_runs'] = Matches['win_by_runs'].astype(int)
Matches['win_by_wickets'] = Matches['win_by_wickets'].astype(int)
```
`Matches.head()`
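Why were `win_by_runs` and `win_by_wickets` floats in the first place, and why does the cast to `int` only come after filling? A column that contains `NaN` cannot be stored as a plain integer column, so pandas falls back to `float64`. The sketch below shows this on an invented Series.

```python
import numpy as np
import pandas as pd

# Toy win margins; the NaN stands for a match that was not won by runs.
margins = pd.Series([12, np.nan, 35])
dtype_with_nan = margins.dtype  # float64, because NaN is a float value

# Casting to int while a NaN is still present raises an error.
try:
    margins.astype(int)
    cast_failed = False
except ValueError:
    cast_failed = True

# Filling first makes the cast safe.
clean = margins.fillna(0).astype(int)

print(dtype_with_nan, cast_failed, clean.tolist())
```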

Up to this point, all we did was prepare our data for analysis. We got an overview of the data, looked for missing values and handled everything like a boss. That was part 1 of the EDA we will be performing; it was quite short and simple. The next part will have the entire analysis. Till then, try the problem yourself and see where it leads you. This task was given to us (Nithin and Pareekshith) at Ambee during our internship.
Cya.