Exploratory Data Analysis of IPL Dataset

Problem Statement

The task is to analyze ball-by-ball IPL data from 2008 to 2019. Using this, we need to come up with an analysis to form our own IPL dream team for the years 2016, 2017 and 2018.

About Data

The Indian Premier League (IPL) is a Twenty20 cricket league in India. It is usually played in April and May every year. As of 2019, the title sponsor of the league is Vivo. The league was founded by the Board of Control for Cricket in India (BCCI) in 2008.

The data contains the result of each match and the details of each delivery for 11 seasons.

Ah, here we go again!

Launching in T minus 3





Import the packages needed for EDA

Read csv(s)
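The import and read steps can be sketched roughly as below. The post doesn't give the file paths, so this sketch feeds pd.read_csv tiny in-memory stand-ins instead of the real files. The columns batsman, bowler, batsman_runs, season and umpire3 show up elsewhere in this post; match_id, inning, id and winner are assumptions made for illustration.

```python
import io

import pandas as pd

# Tiny stand-ins for the real CSV files so the sketch runs anywhere.
# With the real data, pass the file paths to pd.read_csv instead.
deliveries_csv = io.StringIO(
    "match_id,inning,batsman,bowler,batsman_runs\n"
    "1,1,DA Warner,JJ Bumrah,4\n"
    "1,1,DA Warner,JJ Bumrah,0\n"
    "1,1,S Dhawan,JJ Bumrah,1\n"
)
matches_csv = io.StringIO(
    "id,season,winner,umpire3\n"
    "1,2017,Sunrisers Hyderabad,\n"
)

deliveries = pd.read_csv(deliveries_csv)
matches = pd.read_csv(matches_csv)

# head() shows the first 5 records (fewer here, since the sample is tiny).
print(deliveries.head())
print(matches.head())
```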

Displaying the first 5 records of the Deliveries dataframe.

Image 1

Checking out [No pun intended] the first 5 records of Matches


We can pretty much understand what the data is about at a single glance.

Now, we might or might not have missing values in our dataset. But better safe than sorry: let's check and remove the anomalies.

Data Cleaning

Dividing by shape[0] (the number of rows) is just to get a percentage.
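A minimal sketch of that percentage check, assuming a matches DataFrame; the rows below are made up and only the umpire3 column name comes from this post.

```python
import pandas as pd

# Made-up sample: one city is missing and umpire3 is missing everywhere.
matches = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "city": ["Hyderabad", None, "Mumbai", "Pune"],
    "umpire3": [None, None, None, None],
})

# isnull().sum() counts missing values per column;
# dividing by shape[0] (the row count) turns the counts into percentages.
missing_pct = matches.isnull().sum() / matches.shape[0] * 100
print(missing_pct)
```

Columns that come out close to 100% are the natural candidates for dropping.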

Image 3

replacements? (duh!)

Image 4

eliminator, check!

dl_applied, check!

result, check!

umpire3, double check!



Now let's just get into deliveries' details.

So we have 176573 rows with 20 columns each.

And the matches' details are as follows:

Let's sort the index of Deliveries and that of Matches.
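Sorting by index is a one-liner in pandas. Here is a small sketch on a made-up frame whose index is deliberately out of order; the same call would apply to both DataFrames.

```python
import pandas as pd

# Made-up sample with a shuffled index.
deliveries = pd.DataFrame({"batsman_runs": [4, 0, 1]}, index=[2, 0, 1])

# sort_index() returns a new frame ordered by index label.
deliveries = deliveries.sort_index()
print(deliveries)
```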

Image 5


Image 6




Removing unnecessary stuff.

Plus, replacing the value with 1 if a batsman is caught or bowled.
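One way to sketch that flag is below. The source column name dismissal_kind is an assumption (the post doesn't name it); type_out is the column name that appears in the bowler table later on. The rows are made up.

```python
import pandas as pd

# Made-up sample; dismissal_kind is an assumed column name.
deliveries = pd.DataFrame({
    "bowler": ["JJ Bumrah", "JJ Bumrah", "SP Narine", "SP Narine"],
    "dismissal_kind": ["caught", None, "bowled", "run out"],
})

# 1 when the batsman was caught or bowled, 0 for anything else
# (including deliveries with no dismissal at all).
deliveries["type_out"] = (
    deliveries["dismissal_kind"].isin(["caught", "bowled"]).astype(int)
)
print(deliveries)
```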


Image 6

Now we shall remove the unnecessary values in Matches.csv and fill in the unknown values.


Almost done, just a few lil changes here and there.

Image 7

Okay. All anomalies removed.

What we basically did was:

  1. Got an overview of the data.
  2. Checked for anomalies.
  3. Removed them if they couldn't be fixed.
  4. Filled the missing values for others.

Note: It is usually better to remove rows or columns with a high share of missing values. We could fill them in instead, but that brings down the "real-ness" of the dataset.

Data Analysis

Now we are grouping the data by batsman and sorting them based on the number of runs they scored.

     batsman          batsman_runs
49   DA Warner        1871
232  V Kohli          1826
190  S Dhawan         1805
16   AM Rahane        1626
202  SK Raina         1569
7    AB de Villiers   1498
142  MS Dhoni         1389
108  KL Rahul         1342
181  RG Sharma        1270
22   AT Rayudu        1267
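A table like the one above can be produced with a groupby, a sum and a sort. This is a sketch on a handful of made-up deliveries, not the real data; the column names batsman and batsman_runs are the ones shown in the table.

```python
import pandas as pd

# Made-up deliveries, just enough to show the aggregation.
deliveries = pd.DataFrame({
    "batsman": ["DA Warner", "V Kohli", "DA Warner", "S Dhawan", "V Kohli"],
    "batsman_runs": [4, 6, 2, 1, 1],
})

# Total runs per batsman, highest first, top 10 only.
top_batsmen = (
    deliveries.groupby("batsman", as_index=False)["batsman_runs"]
    .sum()
    .sort_values("batsman_runs", ascending=False)
    .head(10)
)
print(top_batsmen)
```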

Same with the bowler except here we calculate the total wickets taken.


     bowler           type_out
69   JJ Bumrah        67
168  SP Narine        66
173  Sandeep Sharma   66
192  YS Chahal        63
185  UT Yadav         61
21   B Kumar          56
146  Rashid Khan      53
103  MJ McClenaghan   49
11   AR Patel         48
68   JD Unadkat       45
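For the bowlers, summing a 0/1 wicket flag per bowler gives the wicket count. A small sketch, assuming type_out holds 1 for a wicket and 0 otherwise; the rows are made up.

```python
import pandas as pd

# Made-up deliveries; type_out is 1 when the delivery took a wicket.
deliveries = pd.DataFrame({
    "bowler": ["JJ Bumrah", "SP Narine", "JJ Bumrah",
               "YS Chahal", "JJ Bumrah", "SP Narine"],
    "type_out": [1, 1, 0, 0, 1, 0],
})

# Summing the flag counts the wickets; nlargest keeps the top 10.
wickets = deliveries.groupby("bowler", as_index=False)["type_out"].sum()
top_bowlers = wickets.nlargest(10, "type_out")
print(top_bowlers)
```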

And a simple bar plot.
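A bar plot along these lines can be sketched with matplotlib. The five rows are copied from the batsman table above; the labels, figure size and the Agg backend are choices made for this sketch, not necessarily what the original plot used.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Top five rows of the batsman table above.
top_batsmen = pd.DataFrame({
    "batsman": ["DA Warner", "V Kohli", "S Dhawan", "AM Rahane", "SK Raina"],
    "batsman_runs": [1871, 1826, 1805, 1626, 1569],
})

fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(top_batsmen["batsman"], top_batsmen["batsman_runs"])
ax.set_xlabel("Batsman")
ax.set_ylabel("Runs scored")
ax.set_title("Top run scorers")
ax.tick_params(axis="x", labelrotation=45)  # keep long names readable
fig.tight_layout()
```

In a notebook, dropping the matplotlib.use line and calling plt.show() would display the figure inline.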


We can see that DA Warner scored the most runs, followed by Kohli and so on.

Plotting for bowlers gives the below image.

Inference? Comment below * Wink *


We have come up with our own function to calculate the points for a player after referring to some official websites of the BCCI and IPL.

The function we came up with is valid and usable because it isn't biased: it awards points to both bowlers and batsmen on a credible basis.

Calculating MVP Values for Players

Image 8





Creating DataFrames for MVPs in Batting, Bowling and Overall


     player          season   batsman_runs   mvp_batsman   mvp_fielder
114  DA Warner       2016     755            275.5         50.0
524  V Kohli         2016     661            212.0         57.5
253  KS Williamson   2018     657            241.0         47.5
116  DA Warner       2019     617            211.0         55.0
423  S Dhawan        2016     560            190.0         30.0



Image 9



Image 10

Dividing the MVP table by season

Finding MVP for each season
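Splitting by season and picking the top scorer in each can be sketched as below. The five rows are the ones from the batting MVP table above, so the "winners" here are only the best among those five rows, not the real season MVPs. Using mvp_batsman as the ranking column is an assumption made for this sketch.

```python
import pandas as pd

# The five rows shown in the batting MVP table above.
mvp = pd.DataFrame({
    "player": ["DA Warner", "V Kohli", "KS Williamson", "DA Warner", "S Dhawan"],
    "season": [2016, 2016, 2018, 2019, 2016],
    "mvp_batsman": [275.5, 212.0, 241.0, 211.0, 190.0],
})

# One DataFrame per season, keyed by the season value.
by_season = {season: frame for season, frame in mvp.groupby("season")}

# idxmax gives the row label of the highest score in each season;
# loc then pulls those rows out of the full table.
season_mvp = mvp.loc[mvp.groupby("season")["mvp_batsman"].idxmax()]
print(season_mvp)
```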

Image 11

Image 12



Image 13

Finding Most Consistent Batsman and Bowler





Worst Player, Batsman and Bowler

Image 14