With Machine Learning and Data Science buzzwords all around us, it is tempting to find out what these terms actually mean. While hunting for resources to learn from and get into the flow, every aspiring data scientist ends up carving their own path into this dream world of Artificial Intelligence. This article aims to contribute a small part to your long journey toward becoming a successful data scientist. OK! Enough intro, let’s dive in.
Exploratory Data Analysis (EDA), according to Wikipedia, is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. Primarily, it is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task. Too much theory? I know. That’s why we shall perform hands-on EDA on an IPL dataset found on Kaggle.
Note: The data set differs from the official stats, so the results of this EDA do not hold for the real world.

Problem Statement

The task is to analyze ball-by-ball data from 2008 to 2019 and use it to form our own dream team for the IPL. For the years 2016, 2017 and 2018, we need to:
  1. Find out most valuable player – explain why
  2. Find out most consistent batsman – explain why
  3. Find out most consistent bowler – explain why
  4. Find out worst player – explain why
  5. Find out worst batsman – explain why
  6. Find out worst bowler – explain why
  7. Rank top 25 players for the years 2016 to 2019
  8. Identify most improved player from 2018 to 2019
  9. Find out which stadium had the most runs and which had the least
  10. Find out which bowler conceded the fewest runs and took the most wickets (economy)
  11. Design a super awesome dream team for 2020.

The First Steps

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Let’s understand how the data is formatted and what exactly it contains.

Okay… We have some pretty interesting columns, and some of them might require preprocessing.
Let’s check out [no pun intended] matches.csv.

Great. A very important step in solving any data science problem is to understand what the data is, how it is structured, and how it should be formatted to fit our needs or the problem statement.
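A minimal sketch of such a first look is given here. The rows are hypothetical and held in memory so that the snippet runs on its own; only a few of the real column names are used, and with the actual file the call would simply be pd.read_csv on matches.csv.

```python
import io
import pandas as pd

# Hypothetical stand-in for matches.csv: three made-up rows, five of the real columns.
sample_csv = io.StringIO(
    "id,season,city,winner,umpire3\n"
    "3,2017,City A,Team A,\n"
    "1,2016,City B,Team B,\n"
    "2,2016,,Team C,\n"
)

# With the real file this would be pd.read_csv("matches.csv").
Matches = pd.read_csv(sample_csv)

print(Matches.head())            # first few rows
print(Matches.columns.tolist())  # column names
Matches.info()                   # dtypes and non-null counts per column
```

Looking at head() and info() together usually shows quickly which columns are mostly empty and which ones need a different type.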

Data Cleaning

Now that we have taken a small peek at the structure of the data, we need to check it for null values. It is crucial to eliminate null values as far as possible, either by dropping the affected records or by filling in the missing values. However, there is a catch in both methods. If we drop every record that has a null value, we might end up with very little data. And if we fill in the missing values, we risk losing the “essence” of the data. It is the job of a data scientist to understand and diagnose which method suits each case. Let’s now check what percentage of values is missing in each column.
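One common way to get such percentages in pandas is to take the column-wise mean of isnull(). The frame below is a hypothetical toy example, not the IPL data.

```python
import pandas as pd

# Hypothetical toy frame; the column names are borrowed from the matches data.
df = pd.DataFrame({
    "season": [2016, 2016, 2017, 2017],
    "city": ["City A", None, "City B", "City C"],
    "umpire3": [None, None, None, "X"],
})

# isnull() gives a boolean frame; the column-wise mean is the fraction missing.
missing_pct = df.isnull().mean() * 100
print(missing_pct)
```

Columns that are missing almost everywhere are candidates for dropping, while columns with only a few gaps are better filled or handled row by row.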

id                     0.00000
season                 0.00000
batsman               0.00000
bowler                 0.00000
innings               0.00000
non_striker           0.00000
replacements         99.98301
bowled_over           0.00000
batsman_team           0.00000
player_out             0.00000
fielder_caught_out     0.00000
type_out               0.00000
extras_wides           0.00000
extras_legbyes         0.00000
extras_noballs         0.00000
extras_byes           0.00000
extras_penalty         0.00000
total_extras_runs     0.00000
batsman_runs           0.00000
total_runs             0.00000
dtype: float64
Replacements?! (Duh!)
id                   0.000000
season               0.000000
city                 1.742627
date                 0.000000
team1               0.000000
team2               0.000000
toss_winner         0.000000
toss_decision       0.000000
winner               1.608579
eliminator         98.927614
dl_applied         97.453083
win_by_runs         54.959786
win_by_wickets     46.648794
result             98.391421
overs               0.000000
player_of_match     0.536193
venue               0.000000
umpire1             0.134048
umpire2             0.134048
umpire3             99.731903
first_bat_team       0.000000
first_bowl_team     0.000000
first_bat_score     0.000000
second_bat_score     0.268097
dtype: float64
Eliminator? Check. Result? Check. dl_applied? Check. umpire3? Double check. Other columns with missing values will be handled eventually.
print('Deliveries Details')
print('Shape: ',Deliveries.shape)
print('Size: ',Deliveries.size)
print('Dimensions: ',Deliveries.ndim)
Deliveries Details
Shape: (176573, 20)
Size: 3531460
Dimensions: 2
print('Matches Details')
print('Shape: ',Matches.shape)
print('Size: ',Matches.size)
print('Dimensions: ',Matches.ndim)
Matches Details
Shape: (746, 24)
Size: 17904
Dimensions: 2
#The ids of the matches are not sorted, hence the sorting.
id                      int64
season                 int64
batsman               object
bowler                 object
innings               object
non_striker           object
replacements           object
bowled_over           float64
batsman_team           object
player_out             object
fielder_caught_out     object
type_out               object
extras_wides           int64
extras_legbyes         int64
extras_noballs         int64
extras_byes             int64
extras_penalty         int64
total_extras_runs       int64
batsman_runs           int64
total_runs             int64
dtype: object
id                    int64
season               int64
city                 object
date                 object
team1               object
team2               object
toss_winner         object
toss_decision       object
winner               object
eliminator           object
dl_applied           object
win_by_runs         float64
win_by_wickets     float64
result               object
overs                 int64
player_of_match     object
venue               object
umpire1             object
umpire2             object
umpire3             object
first_bat_team       object
first_bowl_team     object
first_bat_score     float64
second_bat_score   float64
dtype: object
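The sorting mentioned above and the dtype listings can be reproduced along these lines. This is a sketch on a hypothetical toy frame; sorting by the id column is taken from the comment, while resetting the index afterwards is an assumption.

```python
import pandas as pd

# Hypothetical toy frame with unsorted match ids.
Matches = pd.DataFrame({
    "id": [3, 1, 2],
    "season": [2017, 2016, 2016],
    "win_by_runs": [10.0, None, 4.0],
})

# Sort by id and rebuild the index so that row order matches id order.
Matches = Matches.sort_values("id").reset_index(drop=True)

# One dtype per column; a numeric column with missing values shows up as float64.
print(Matches.dtypes)
```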
Now we have an overview of the entire data. We know which columns have missing values and what their data types are. It is time to change our dataframes to fit our purpose.
#Remove replacements and id
Deliveries = Deliveries.drop(columns=['replacements', 'id'])
#type_out holds values such as stump or run out; we don't need that detail,
#so we replace any such text with 1
Deliveries['type_out'] = Deliveries['type_out'].replace(r'[a-zA-Z\s]+', '1', regex=True)
season                0
batsman               0
bowler               0
innings               0
non_striker           0
bowled_over           0
batsman_team         0
player_out           0
fielder_caught_out   0
type_out             0
extras_wides         0
extras_legbyes       0
extras_noballs       0
extras_byes           0
extras_penalty       0
total_extras_runs     0
batsman_runs         0
total_runs           0
dtype: int64
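To see what the type_out replacement does, here is a small run on hypothetical values. The "0" used for balls without a dismissal is an assumption; the real column may encode that case differently.

```python
import pandas as pd

# Hypothetical toy rows; "0" as the no-dismissal marker is an assumption.
Deliveries = pd.DataFrame({
    "batsman": ["A", "B", "A"],
    "type_out": ["run out", "0", "stump"],
})

# Any run of letters and whitespace becomes "1"; purely numeric strings are untouched.
Deliveries["type_out"] = Deliveries["type_out"].replace(
    r"[a-zA-Z\s]+", "1", regex=True
)
print(Deliveries["type_out"].tolist())

# Null count per column, the same kind of check as the listing above.
print(Deliveries.isnull().sum())
```

The column then works as a simple dismissal flag, which is all that is needed for counting wickets.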

Let’s proceed with the changes in the Matches dataframe.
Matches['winner'] = Matches['winner'].fillna('No Result')
season              0
city               0
date               0
team1               0
team2               0
toss_winner         0
toss_decision       0
winner             0
win_by_runs         0
win_by_wickets     0
overs               0
player_of_match     0
venue               0
umpire1             0
umpire2             0
first_bat_team     0
first_bowl_team     0
first_bat_score     0
second_bat_score   0
dtype: int64
Sweet stuff again!!
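A sketch of how the whole Matches cleanup might look is given here, on hypothetical rows. Only the winner fill appears in this article; the dropped columns are inferred from the listing above, and filling the win margins with 0 is purely an assumption made for this sketch.

```python
import pandas as pd

# Hypothetical toy rows using a few of the real column names.
Matches = pd.DataFrame({
    "id": [1, 2, 3],
    "winner": ["Team A", None, "Team B"],
    "win_by_runs": [12.0, None, None],
    "win_by_wickets": [None, None, 5.0],
    "umpire3": [None, None, None],
})

# Drop the identifier and a column that is almost entirely empty.
Matches = Matches.drop(columns=["id", "umpire3"])

# A match without a winner is labeled explicitly.
Matches["winner"] = Matches["winner"].fillna("No Result")

# Assumption: a missing margin means the win was not of that kind, so use 0.
margin_cols = ["win_by_runs", "win_by_wickets"]
Matches[margin_cols] = Matches[margin_cols].fillna(0)

print(Matches.isnull().sum())
```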

So far, all we did was prepare our data for analysis. We got an overview of the data, looked for missing values and handled everything like a boss. This was Part 1 of the EDA we are performing, and it was quite short and simple. The next part will contain the entire analysis; until then, try the problem yourself and see where it leads you. This task was given to us (Nithin and Pareekshith) at Ambee during our internship.