FIFA 2022 World Cup Prediction • Vedang Wartikar

What’s The Project About?

Predicting Group Stage of the 2022 FIFA World Cup

GitHub Link: github.com/vedangwartikar/fifa-2022-group-stage-pred

Project Overview

This project aims to compare various machine learning models to predict the outcomes of FIFA World Cup 2022 matches. The objective is to determine which model performs best in predicting match outcomes and tournament progression. Furthermore, a custom implementation of top-performing models was developed to simulate match results based on historical data.

Motivation

With the FIFA World Cup 2022 being an ongoing global event at the time of this project, it presented a unique opportunity to apply machine learning models to real-world sports data. The goal was to explore how machine learning can benefit sports analysis, aiding strategic decision-making in match predictions, substitutions, and game tactics.

Data Sources

Three datasets were used in this project:

international_matches.csv: Used for building and training the models, consisting of around 6,000 processed matches.
fifa_rankings_2022.csv: Contains FIFA rankings for teams as of 2022, used to make predictions for World Cup matches.
world_cup_teams.csv: Contains the official tournament schedule, teams, and stages to sequentially predict match outcomes.

Key Features

Team FIFA Rank: The current rank of the home and away teams.
Team FIFA Points: The total FIFA points for each team.
Match Outcome: Binary classification, indicating if a team wins (1) or loses/draws (0).

The project focuses on predicting whether a team qualifies for the next round based on these features.

Machine Learning Models Used

We evaluated multiple machine learning models to determine their performance on the dataset:

Logistic Regression
Naive Bayes
Decision Tree
K-Nearest Neighbors (KNN)
Support Vector Machines (SVM)
AdaBoost

Train and Test Accuracies

The comparison between train and test accuracies across models is shown below:

From the above figure, Logistic Regression, Naive Bayes, and SVM performed well, with Decision Tree overfitting the training data.

Cross-Validation Results

We applied 10-fold cross-validation to assess the model performance. The results show that Logistic Regression and Naive Bayes achieved the highest median accuracy, approximately 73%.

ROC/AUC Curves

The ROC curves highlight that Logistic Regression performed the best in terms of AUC, with SVM and Naive Bayes following closely behind.

Custom Model Implementations

In addition to using pre-built models from the scikit-learn library, we implemented Logistic Regression and Naive Bayes from scratch.

Logistic Regression Implementation

The model was built using the gradient descent algorithm, and performed slightly better than the scikit-learn implementation. Below is the confusion matrix for both:

Custom Logistic Regression Accuracy: 72.76%
scikit-learn Logistic Regression Accuracy: 72.15%

Naive Bayes Implementation

The custom Naive Bayes model also achieved similar performance to scikit-learn’s version, with an accuracy of 72.37%.

Decision Tree

Although Decision Tree was implemented, scikit-learn’s version performed significantly better. The custom Decision Tree struggled with generalization, particularly on deeper trees.

World Cup Simulation Results

We used the best-performing algorithm (Logistic Regression) to simulate the World Cup schedule and predict outcomes match-by-match.

For instance, the model correctly predicted the outcome of Costa Rica vs. Spain, where Spain was predicted to win with a 67.49% probability. However, some predictions were incorrect, as underdog teams such as Morocco performed better than expected.

Prediction Accuracy by Stage

Group Stage Accuracy: 68.75%
Round of 16 Accuracy: 75%

Here’s the final group stage prediction (visualized) Confusion Matrix

Feel free to contribute through a PR or report any issues here.