With Machine Learning and Data Science buzzwords all around us, it is natural to want to know what these terms actually mean. Every novice data scientist hunts for resources to learn from, tries to get into the flow, and ends up carving out their own path into this dream world of Artificial Intelligence. This article aims to contribute a small part to your long path toward becoming a successful data scientist. OK! Enough intro, let’s dive in.

Exploratory Data Analysis (EDA), according to Wikipedia, is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. Primarily, it is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task. Too much theory? I know. That’s why we shall perform a hands-on EDA on an IPL dataset found on Kaggle.
Note: The data set differs from the official stats, so the results of this EDA do not carry over to the real world.

Problem Statement

The task is to analyze ball-by-ball data from 2008 all the way to 2019 and use that analysis to form our own dream team for the IPL. For the years 2016, 2017 and 2018, we need to do the following:

Find out the most valuable player – explain why
Find out the most consistent batsman – explain why
Find out the most consistent bowler – explain why
Find out the worst player – explain why
Find out the worst batsman – explain why
Find out the worst bowler – explain why
Rank the top 25 players for the years 2016 to 2019
Identify the most improved player from 2018 to 2019
Find out which stadium saw the most runs and which saw the fewest
Find out which bowler conceded the fewest runs and took the most wickets (economy)
Design a super awesome dream team for 2020
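As a pointer for the economy question in the list above, here is a minimal sketch of how a bowler’s economy rate could be computed from ball-by-ball rows. It runs on a tiny made-up frame, not the Kaggle data. It assumes the column names that appear in the deliveries data of this article (`bowler`, `total_runs`, `extras_wides`, `extras_noballs`, `extras_byes`, `extras_legbyes`) and the usual cricket convention that byes and leg byes are not charged to the bowler, while wides and no-balls do not count as legal balls. The helper name `economy_rates` is hypothetical, and penalty runs are ignored for brevity.

```python
import pandas as pd


def economy_rates(deliveries: pd.DataFrame) -> pd.Series:
    """Runs conceded per over for each bowler (illustrative helper)."""
    df = deliveries.copy()
    # Byes and leg byes are not charged to the bowler.
    df["runs_conceded"] = (
        df["total_runs"] - df["extras_byes"] - df["extras_legbyes"]
    )
    # Wides and no-balls are re-bowled, so they are not legal balls.
    df["legal_ball"] = (
        (df["extras_wides"] == 0) & (df["extras_noballs"] == 0)
    ).astype(int)
    per_bowler = df.groupby("bowler")[["runs_conceded", "legal_ball"]].sum()
    # Economy = runs conceded per six legal balls.
    return per_bowler["runs_conceded"] * 6 / per_bowler["legal_ball"]


# Made-up data: bowler A bowls one clean over, bowler B bowls one over with a wide.
toy = pd.DataFrame({
    "bowler": ["A"] * 6 + ["B"] * 7,
    "total_runs": [1, 0, 4, 0, 1, 0, 1, 1, 6, 0, 2, 1, 0],
    "extras_wides": [0] * 6 + [1, 0, 0, 0, 0, 0, 0],
    "extras_noballs": [0] * 13,
    "extras_byes": [0] * 13,
    "extras_legbyes": [0, 0, 0, 0, 1, 0] + [0] * 7,
})

economy = economy_rates(toy)
print(economy)
```

A lower value is better for the bowler. Whether economy alone is the right yardstick, or whether it should be combined with wickets taken, is a judgment call left to the analysis itself.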

The First Steps

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Deliveries=pd.read_csv(r'innings_data.csv')
Matches=pd.read_csv(r'match_data.csv')

Let’s look at how the data is structured and what it actually contains.

Deliveries.head()

Okay… We have some pretty interesting columns, and some of them might require preprocessing.
Let’s check out [no pun intended] match_data.csv.

Matches.head()

Great. A very important step in solving any data science problem is understanding what the data is, how it is structured, and how it should be formatted to fit our needs or the problem statement.
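One way to get that understanding quickly is a one-table summary of every column. The sketch below is only an illustration on a made-up frame; the helper name `overview` is our own and not part of pandas.

```python
import pandas as pd


def overview(df: pd.DataFrame) -> pd.DataFrame:
    """One row per column: dtype, non-null count and number of distinct values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "non_null": df.notna().sum(),
        "unique": df.nunique(),  # missing values are not counted as a value
    })


# Made-up data standing in for a few match rows.
toy = pd.DataFrame({
    "season": [2016, 2016, 2017],
    "winner": ["Team A", None, "Team B"],
})

summary = overview(toy)
print(summary)
```

Calling `overview(Deliveries)` or `overview(Matches)` would give the same kind of table for the real data, alongside what `head()` shows.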

Data Cleaning

Now that we have had a small peek at the structure of the data, we need to deal with null values in our dataset. It is crucial to eliminate null values as far as possible, either by removing the affected records or by filling in the missing values. However, there is a catch in both methods. If we remove every record that has a null value, we might end up with very little data. If we fill the missing values, whatever the method, we lose some of the “essence” of the data. It is the job of a data scientist to diagnose which method suits each case. Let’s now check what percentage of values is missing in each column.

Deliveries.isnull().sum()*100/Deliveries.shape[0]

id                     0.00000
season                 0.00000
batsman                0.00000
bowler                 0.00000
innings                0.00000
non_striker            0.00000
replacements          99.98301
bowled_over            0.00000
batsman_team           0.00000
player_out             0.00000
fielder_caught_out     0.00000
type_out               0.00000
extras_wides           0.00000
extras_legbyes         0.00000
extras_noballs         0.00000
extras_byes            0.00000
extras_penalty         0.00000
total_extras_runs      0.00000
batsman_runs           0.00000
total_runs             0.00000
dtype: float64

Replacements?! (Duh!)

Matches.isnull().sum()*100/Matches.shape[0]

id                   0.000000
season               0.000000
city                 1.742627
date                 0.000000
team1                0.000000
team2                0.000000
toss_winner          0.000000
toss_decision        0.000000
winner               1.608579
eliminator          98.927614
dl_applied          97.453083
win_by_runs         54.959786
win_by_wickets      46.648794
result              98.391421
overs                0.000000
player_of_match      0.536193
venue                0.000000
umpire1              0.134048
umpire2              0.134048
umpire3             99.731903
first_bat_team       0.000000
first_bowl_team      0.000000
first_bat_score      0.000000
second_bat_score     0.268097
dtype: float64

Eliminator? Check. Result? Check. dl_applied? Check. umpire3? Double check. Other columns with missing values will be handled eventually.
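The call made above (drop the columns that are almost entirely empty, fill the ones with only a few gaps) can be written down as a simple rule. Here is a minimal sketch on a made-up frame. The helper names `missing_percent` and `split_by_missing` and the 90% cut-off are our assumptions for illustration, not something the dataset prescribes. Note that `df.isna().mean() * 100` gives the same percentages as the `isnull().sum()*100/shape[0]` expression used above.

```python
import pandas as pd


def missing_percent(df: pd.DataFrame) -> pd.Series:
    """Percentage of missing values in each column."""
    return df.isna().mean() * 100


def split_by_missing(df: pd.DataFrame, threshold: float = 90.0):
    """Return (columns to drop, columns to fill) for an assumed cut-off."""
    pct = missing_percent(df)
    to_drop = pct[pct > threshold].index.tolist()
    to_fill = pct[(pct > 0) & (pct <= threshold)].index.tolist()
    return to_drop, to_fill


# Made-up data: one complete column, one with a gap, one completely empty.
toy = pd.DataFrame({
    "venue": ["X", "Y", "Z", "W"],
    "city": ["P", None, "Q", "R"],
    "umpire3": [None, None, None, None],
})

to_drop, to_fill = split_by_missing(toy)
print("drop:", to_drop)
print("fill:", to_fill)
```

Going by the percentages printed above, a 90% cut-off would put eliminator, dl_applied, result and umpire3 on the drop list and leave the remaining gappy columns to be filled. A different cut-off may suit a different dataset.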

print('Deliveries Details')
print('Shape: ',Deliveries.shape)
print('Size: ',Deliveries.size)
print('Dimensions: ',Deliveries.ndim)

Deliveries Details
Shape:  (176573, 20)
Size:  3531460
Dimensions:  2

print('Matches Details')
print('Shape: ',Matches.shape)
print('Size: ',Matches.size)
print('Dimensions: ',Matches.ndim)

Matches Details
Shape:  (746, 24)
Size:  17904
Dimensions:  2

Matches=Matches.sort_values('id')
# The match ids are not in order, hence the sort.

Deliveries.dtypes

id                      int64
season                  int64
batsman                object
bowler                 object
innings                object
non_striker            object
replacements           object
bowled_over           float64
batsman_team           object
player_out             object
fielder_caught_out     object
type_out               object
extras_wides            int64
extras_legbyes          int64
extras_noballs          int64
extras_byes             int64
extras_penalty          int64
total_extras_runs       int64
batsman_runs            int64
total_runs              int64
dtype: object

Matches.dtypes

id                    int64
season                int64
city                 object
date                 object
team1                object
team2                object
toss_winner          object
toss_decision        object
winner               object
eliminator           object
dl_applied           object
win_by_runs         float64
win_by_wickets      float64
result               object
overs                 int64
player_of_match      object
venue                object
umpire1              object
umpire2              object
umpire3              object
first_bat_team       object
first_bowl_team      object
first_bat_score     float64
second_bat_score    float64
dtype: object

Now we have an overview of the entire data. We know which columns have missing values and what their data types are. It is time to change our dataframes to fit our purpose.

# Remove the almost-empty replacements column and the id column
Deliveries.drop(['replacements','id'],inplace=True,axis=1)
# type_out is of the form stump or run out; we don't really need that detail,
# so any such text is replaced with 1 (raw string, and A-Z instead of the typo A-z)
Deliveries['type_out']=Deliveries['type_out'].replace(r'[a-zA-Z\s]+','1',regex=True)
Deliveries['type_out']=Deliveries['type_out'].astype(int)
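To see what the `type_out` replacement does, here is a minimal sketch on a made-up Series. It assumes that deliveries without a dismissal are stored as the string `'0'`; the article does not show the raw values, so treat that as a guess.

```python
import pandas as pd

# Made-up values: '0' for no wicket (an assumption), text for a dismissal.
type_out = pd.Series(["0", "run out", "0", "caught", "stumped"])

# Any run of letters and whitespace becomes '1'; '0' is left untouched.
flag = type_out.replace(r"[a-zA-Z\s]+", "1", regex=True).astype(int)

print(flag.tolist())
print("wickets:", flag.sum())
```

Once the column is a 0/1 flag, summing it per bowler or per match gives the number of wickets directly.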

print(Deliveries.isnull().sum())
Deliveries.head()

season                0
batsman               0
bowler                0
innings               0
non_striker           0
bowled_over           0
batsman_team          0
player_out            0
fielder_caught_out    0
type_out              0
extras_wides          0
extras_legbyes        0
extras_noballs        0
extras_byes           0
extras_penalty        0
total_extras_runs     0
batsman_runs          0
total_runs            0
dtype: int64

Sweet!!
Let’s proceed with the changes in the Matches dataframe.

Matches.drop(['id','eliminator','result','umpire3','dl_applied'],inplace=True,axis=1)
# Fill the remaining gaps column by column. Calling fillna on the dataframe itself
# avoids the chained inplace assignment that recent pandas versions warn about.
Matches.fillna({'city':'unknown','winner':'No Result','player_of_match':'unknown',
                'umpire1':'unknown','umpire2':'unknown','second_bat_score':0,
                'win_by_runs':0.0,'win_by_wickets':0.0},inplace=True)

Matches.isnull().sum()

season              0
city                0
date                0
team1               0
team2               0
toss_winner         0
toss_decision       0
winner              0
win_by_runs         0
win_by_wickets      0
overs               0
player_of_match     0
venue               0
umpire1             0
umpire2             0
first_bat_team      0
first_bowl_team     0
first_bat_score     0
second_bat_score    0
dtype: int64

Sweet stuff again!!

Matches['win_by_runs']=Matches['win_by_runs'].astype(int)
Matches['win_by_wickets']=Matches['win_by_wickets'].astype(int)

Matches.head()
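Why were `win_by_runs` and `win_by_wickets` floats in the first place? A missing value is stored as NaN, which is a float, so an integer column with gaps ends up as float64. The sketch below shows this on a made-up Series. It also shows pandas’ nullable `Int64` dtype, an alternative we did not use in this article.

```python
import numpy as np
import pandas as pd

# Made-up margins: the middle match has no run margin recorded.
win_by_runs = pd.Series([35, np.nan, 7])
before = win_by_runs.dtype  # NaN forces a float dtype

# Option used in this article: fill with 0, then cast to a plain integer.
filled = win_by_runs.fillna(0).astype(int)

# Alternative: nullable integers keep the gap visible as <NA>.
nullable = win_by_runs.astype("Int64")

print(before)
print(filled.tolist())
print(nullable)
```

Filling with 0 keeps things simple, but a 0 in `win_by_runs` then stands for “no run margin recorded” and not for a real margin of zero runs. Keep that in mind before averaging the column.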

Up to here, all we did was prepare our data for analysis. We got an overview of the data, looked for missing values and handled everything like a boss. This was part 1 of the EDA we will be performing, and it was quite simple and small. The next part will have the entire analysis; till then, try the problem and see where it leads you. This task was given to us (Nithin and Pareekshith) at Ambee during our internship.
Cya.

Exploratory Data Analysis – Part 1

Published by Nithin on August 31, 2019
