Indian Premier League Analysis and Prediction
This project (Write a Data Science Blog Post) is part of the Udacity Data Scientist Nanodegree Program. The detailed analysis with all the required code is posted in my GitHub repository.
The Indian Premier League (IPL) is a professional Twenty20 cricket league in India, contested between March and May every year by eight teams representing eight different Indian cities. The league was founded by the Board of Control for Cricket in India (BCCI) in 2008, and the IPL has an exclusive window in the ICC Future Tours Programme.
Since its dawn in 2008, the IPL has attracted viewers all around the globe. The high level of uncertainty and the last-moment nail-biters have urged fans to watch the matches. Within a short period, the IPL has become the highest revenue-generating league in cricket. Data Analytics has been a part of sports entertainment for a long time. In a cricket match, you might have seen the scoreline showing the probability of a team winning based on the current match situation. This, my friend, is Data Analytics in action!
As a cricket fan, I find visualizing cricket statistics mesmerizing, and building a classifier to predict the winning team is equally interesting.
In Machine Learning, problems are mainly categorized into two groups: Regression problems and Classification problems. A Regression problem deals with outputs that are continuous values, while in a Classification problem the outputs are categorical values. Since the output of winner prediction is a categorical value, the problem we are trying to solve is a Classification problem.
The dataset used for this experiment is real and authentic. The dataset is acquired from the Kaggle website. This dataset contains 2 files:
* matches.csv — Match by match data
* deliveries.csv — Ball by ball data
We will initially do exploratory data analysis to understand the data and answer a few questions of interest. Later, we will predict whether a team will win or not.
The motivation behind this project includes answering the following questions:
“What is the probability of winning the game at a particular venue based on the decision to field/bat first on winning the toss?”
“Most dismissals by a wicketkeeper?”
“Does Home Ground Advantage have any effect on the result of the game?”
Importing Dataset and EDA
On performing EDA, we found the following:
1. Matches Dataframe:
* There are 756 rows and 18 columns.
* There are null values in the columns: city, winner, player_of_match, umpire1, umpire2, umpire3.
* There are 5 numerical columns and the rest are categorical columns.
2. Deliveries Dataframe:
* There are 179078 rows and 21 columns.
* There are null values in the columns: player_dismissed, dismissal_kind, fielder.
* There are 8 categorical columns and the rest are numerical columns.
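A minimal sketch of how these checks can be reproduced (the file paths are assumptions based on the Kaggle download):

import pandas as pd

# Load both files; paths assume the raw Kaggle CSVs sit alongside the notebook
df_matches = pd.read_csv('matches.csv')
df_deliveries = pd.read_csv('deliveries.csv')

# Shapes and per-column null counts reported above
print(df_matches.shape, df_deliveries.shape)
print(df_matches.isnull().sum())
print(df_deliveries.isnull().sum())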
Hence, we filled the null values in the ‘city’ column with a certain logic and cleaned the ‘city’, ‘venue’, and ‘team name’ columns for further analysis of our dataset.
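The exact fill logic lives in the notebook; one plausible sketch is to infer a missing city from its venue, since each venue maps to a city:

# Infer missing cities from the venue name (an assumption of this sketch;
# venues with no recorded city anywhere will remain null)
venue_to_city = (df_matches.dropna(subset=['city'])
                 .groupby('venue')['city']
                 .agg(lambda s: s.mode()[0]))
df_matches['city'] = df_matches['city'].fillna(df_matches['venue'].map(venue_to_city))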
Part I: What is the probability of winning the game at a particular venue based on the decision to field/bat first on winning the toss?
One of the important decisions before playing cricket is the toss. Of course, a mere toss decision would not affect a player’s calibre or performance. Still, let us find out how the toss decision in each city impacts the game. This is just an analysis; we cannot and should not treat it as a prediction, since we have not brought any predictive models into consideration yet.
Generally, cricket analysts look at the previous data and recommend what is a good decision to take on winning the toss at different venues.
For instance, we found that if we choose Kolkata as the city and field first on winning the toss, there is a 61% chance of winning the game. Again, we cannot be certain of it, but the historical data lets us make this statement.
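A hedged sketch of how such a figure can be computed from matches.csv (the grouping logic is my assumption, not necessarily the notebook’s exact code):

# Flag matches where the toss winner also won the game
df_matches['toss_winner_won'] = (df_matches['toss_winner'] == df_matches['winner']).astype(int)

# Win rate of the toss winner, split by city and toss decision
toss_outcome = (df_matches
                .groupby(['city', 'toss_decision'])['toss_winner_won']
                .mean()
                .mul(100)
                .round(2))
print(toss_outcome.loc['Kolkata'])  # win % in Kolkata when batting/fielding first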
Part II: Most dismissals by a wicketkeeper?
I chose this question to determine which wicketkeeper to buy in the next auction based on the players’ performance. I used the ball-by-ball deliveries dataset, first determined the names of all the wicketkeepers, and then aggregated the stumpings and catches taken by them, as sketched below.
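A minimal sketch of that aggregation, assuming keepers can be identified as the fielders credited with ‘stumped’ dismissals (only the wicketkeeper can stump a batsman):

# Fielders on stumpings are, by the rules, wicketkeepers
keepers = df_deliveries.loc[df_deliveries['dismissal_kind'] == 'stumped', 'fielder'].unique()

# Catches and stumpings credited to those keepers
# (this also counts catches they took while not keeping; fine for a rough ranking)
keeper_dismissals = (df_deliveries[df_deliveries['fielder'].isin(keepers) &
                                   df_deliveries['dismissal_kind'].isin(['caught', 'stumped'])]
                     .groupby(['fielder', 'dismissal_kind'])
                     .size()
                     .unstack(fill_value=0))
print(keeper_dismissals.sort_values('stumped', ascending=False).head())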
From the above chart, we can infer the following:
KD Karthik, MS Dhoni, and RV Uthappa have been great wicketkeepers in the IPL. We can also say that, when it comes to stumpings, MS Dhoni performs the best, followed by RV Uthappa and KD Karthik.
Part III: Does Home Ground Advantage have any effect on the result of the game?
There is much hype that teams tend to perform well when playing at their home ground. Let us find out if this hype actually holds. For this, we plotted every team’s season-wise wins at different grounds using a stacked bar plot, as sketched below. I have provided the plots for 2018 & 2019, while for the other seasons, you can check here.
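A sketch of one such plot; the team name is an illustrative choice, and the season column is assumed to hold integer years:

import matplotlib.pyplot as plt

# Season-wise wins per venue for one team, as a stacked bar plot
team = 'Mumbai Indians'  # illustrative choice
wins = (df_matches[df_matches['winner'] == team]
        .groupby(['season', 'venue'])
        .size()
        .unstack(fill_value=0))
wins.loc[[2018, 2019]].plot(kind='bar', stacked=True, figsize=(10, 5))
plt.ylabel('Wins')
plt.title(f'{team}: wins by venue, 2018 & 2019')
plt.show()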
It is clearly evident from the above two plots that home advantage is a very big factor in deciding the result of the game. Teams tend to perform well at their home grounds.
Part IV: Predictive Modelling and Evaluation
Data Preprocessing
In our analysis of the data, we found that some of the rows in our target variable ‘winner’ were missing, and hence we drop those rows.
# Drop matches with no recorded winner (e.g., abandoned games)
df_matches = df_matches.dropna(subset=['winner'], axis=0)
The rest of the data cleaning has already been done earlier.
Feature Engineering
Note: The columns taken into consideration are: team1, team2, toss_winner, toss_decision, venue, and winner.
Before we hop on to building models, an important observation has to be acknowledged. Columns like toss_winner, toss_decision, and the winner might make sense to us, but what about the machines?
Hence, we are posing the problem as a binary classification for predicting whether ‘Team 1’ will win or not. For this, we create new features:
1. team1_win: 1 if team 1 wins, else 0.
2. team1_toss_win: 1 if team 1 wins the toss, else 0.
3. team1_bat: 1 if team 1 bats first, else 0.
# Outcome variable: did team1 win the match?
matches.loc[matches["winner"]==matches["team1"],"team1_win"]=1
matches.loc[matches["winner"]!=matches["team1"],"team1_win"]=0

# Did team1 win the toss?
matches.loc[matches["toss_winner"]==matches["team1"],"team1_toss_win"]=1
matches.loc[matches["toss_winner"]!=matches["team1"],"team1_toss_win"]=0

# Did team1 bat first? (won the toss and chose to bat, or lost it and the winner chose to field)
matches["team1_bat"]=0
matches.loc[(((matches["team1_toss_win"]==1) & (matches["toss_decision"]=="bat")) | ((matches["team1_toss_win"]==0) & (matches["toss_decision"]=="field"))),"team1_bat"]=1
Here, we see that team1_bat takes the same value (1) for every row. Strange, right? It’s just how the dataset was built, and hence we drop this column, as it doesn’t provide any help in prediction.
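Dropping it is a one-liner:

# team1_bat is constant (all 1s), so it carries no signal for the model
matches = matches.drop('team1_bat', axis=1)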
Once we build the model, we need to validate it using data that was never exposed to the model. Hence, we split our data into 2 parts with an 80–20 distribution using train_test_split, a utility function provided by scikit-learn. The model is trained on 80% of the data and validated against the other 20%.
from sklearn.model_selection import train_test_split

# Feature selection
prediction_df=matches[["team1","team2","team1_toss_win","team1_win","venue"]]

# Train-Test Split
X = prediction_df.drop('team1_win', axis=1)
target = prediction_df['team1_win']
target=target.astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, target, test_size=0.2, random_state=42)
For the columns to assist the model in prediction, the values should make some sense to the computers. Since our columns are categorical in nature, the models cannot consume them directly, and hence we need to encode the categorical variables into numeric values. We use two popular types of encoding for categorical variables:
1. OneHotEncoder
from sklearn.preprocessing import OneHotEncoder
import numpy as np

# OneHotEncoding for Team1
encoder_team1 = OneHotEncoder()
train_team1 = encoder_team1.fit_transform(X_train['team1'].values.reshape(-1,1))
test_team1 = encoder_team1.transform(X_test['team1'].values.reshape(-1,1))

# OneHotEncoding for Team2
encoder_team2 = OneHotEncoder()
train_team2 = encoder_team2.fit_transform(X_train['team2'].values.reshape(-1,1))
test_team2 = encoder_team2.transform(X_test['team2'].values.reshape(-1,1))

# OneHotEncoding for Venue
encoder_venue = OneHotEncoder()
train_venue = encoder_venue.fit_transform(X_train['venue'].values.reshape(-1,1))
test_venue = encoder_venue.transform(X_test['venue'].values.reshape(-1,1))

# Converting the binary feature into an array
pq_tr = np.array(X_train['team1_toss_win']).reshape(-1,1)
pq_te = np.array(X_test['team1_toss_win']).reshape(-1,1)

# Horizontally stacking all the columns into sparse feature matrices
from scipy.sparse import hstack
X_train_ohe = hstack((train_team1,train_team2,pq_tr))
X_test_ohe = hstack((test_team1,test_team2,pq_te))
2. LabelEncoder
from sklearn.preprocessing import LabelEncoder

encoder= LabelEncoder()
# Each fit_transform call re-fits the encoder on that column, so labels
# are assigned per column independently
matches["team1"]=encoder.fit_transform(matches["team1"])
matches["team2"]=encoder.fit_transform(matches["team2"])
matches["winner"]=encoder.fit_transform(matches["winner"].astype(str))
matches["toss_winner"]=encoder.fit_transform(matches["toss_winner"])
matches["venue"]=encoder.fit_transform(matches["venue"])
Building, Training & Testing the Model
For a Classification problem, multiple algorithms can train a classifier on the data we have and, using the learned patterns, predict the outcomes for given input conditions. We will try DecisionTreeClassifier, RandomForestClassifier, LogisticRegression, and SVM with both types of encoding and choose the algorithm best suited for our data distribution.
Using OneHotEncoded Features
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

#Logistic Regression
logreg = LogisticRegression()
logreg.fit(X_train_ohe, y_train)
y_pred = logreg.predict(X_test_ohe)
print('Accuracy of logistic regression classifier on test set: {:.4f}'.format(logreg.score(X_test_ohe, y_test)))

#Decision Tree Classifier
dtree=DecisionTreeClassifier()
dtree.fit(X_train_ohe,y_train)
y_pred = dtree.predict(X_test_ohe)
print('Accuracy of Decision Tree Classifier on test set: {:.4f}'.format(dtree.score(X_test_ohe, y_test)))

#SVM
svm=SVC()
svm.fit(X_train_ohe,y_train)
y_pred = svm.predict(X_test_ohe)
print('Accuracy of SVM Classifier on test set: {:.4f}'.format(svm.score(X_test_ohe, y_test)))

#Random Forest Classifier
randomForest=RandomForestClassifier(n_estimators=100)
randomForest.fit(X_train_ohe,y_train)
y_pred = randomForest.predict(X_test_ohe)
print('Accuracy of Random Forest Classifier on test set: {:.4f}'.format(randomForest.score(X_test_ohe, y_test)))
Using LabelEncoded Features
#Logistic Regression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
print('Accuracy of logistic regression classifier on test set: {:.4f}'.format(logreg.score(X_test, y_test)))

#Decision Tree Classifier
dtree=DecisionTreeClassifier()
dtree.fit(X_train,y_train)
y_pred = dtree.predict(X_test)
print('Accuracy of Decision Tree Classifier on test set: {:.4f}'.format(dtree.score(X_test, y_test)))

#SVM
svm=SVC()
svm.fit(X_train,y_train)
y_pred = svm.predict(X_test)
print('Accuracy of SVM Classifier on test set: {:.4f}'.format(svm.score(X_test, y_test)))

#Random Forest Classifier
randomForest=RandomForestClassifier(n_estimators=100)
randomForest.fit(X_train,y_train)
y_pred = randomForest.predict(X_test)
print('Accuracy of Random Forest Classifier on test set: {:.4f}'.format(randomForest.score(X_test, y_test)))
It is evident from the above results that SVM with label-encoded features gives us the highest accuracy, 63.27%, of all the algorithm-encoding combinations for this data distribution. We also observed that label-encoded features tend to give higher accuracy than one-hot encoded features here.
Even though the accuracy is not high enough to be useful, it gives a basic idea of the strategies and methodologies used in designing a solution to a Machine Learning problem.
What can we do next?
There are many other factors affecting the outcome of a match, like the weather, a player’s form, home ground advantage, etc., which are not included here.
We can also create a feature based on the weighted sum of each player’s form to get the overall strength of a team. Similarly, many other features can be engineered.
Try adding these features and playing around with them. We can also perform hyperparameter tuning to get better performance, as sketched below, and experiment with other ML/DL models.
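As a starting point for tuning, a minimal grid search over the SVM; the parameter ranges below are illustrative assumptions, not tuned values:

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Small illustrative grid; widen it once a promising region is found
param_grid = {'C': [0.1, 1, 10], 'gamma': ['scale', 0.01, 0.1], 'kernel': ['rbf', 'linear']}
grid = GridSearchCV(SVC(), param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
print('Test accuracy: {:.4f}'.format(grid.score(X_test, y_test)))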
Lastly!
We did a decent job of first analysing the data and answering a few questions of interest, and then predicting whether an IPL team will win by converting the task into a binary classification problem.
If you have any questions, suggestions or feedback, feel free to reach out to me on Email or LinkedIn. To see more about this analysis, see the link to my GitHub and you are more than welcome to contribute.
Happy Coding!