Who Will Win the 2018 FIFA World Cup? – Model Thinking

The Soccer World Cup kicked off today in Russia. In the first game, Russia slayed minnows Saudi Arabia by a margin of 5-0. Though the main focus is on games involving Latin American and European teams, it was a great start to the tournament.

If you are mad about data and predictions like us, or just a crazy fan, you have been trying to find any source that can tell you something about the outcome of the championship. For fans it is confirmation bias: they only want to find news items that extol their teams. For data scientists like us, the question is which statistical model and which data set can give us that slight edge, letting us predict with better significance and capture the proverbial Alpha.

When we looked around, there were tons of predictions. They can be divided into two groups.

Expert Picks

There are predictions from experts, like this one at The Telegraph or by USA Today. Most of these predictions use a team of experts who follow the game and teams across the globe and understand team dynamics and other finer details.

Machine Learning/AI Picks

This is the second category, and the more interesting one for us. Here a number of reputable institutions have developed models to predict the outcome. Goldman Sachs is using an AI model that ran more than 2,000,000 (2 million) simulations to get the final result. MIT Technology Review published a prediction that used a random forest algorithm, taking other predictions as input to run its simulations.

If you are thinking about running your own prediction, there is a ready-made model available on GitHub. The only thing you need is to collect some data that you think will be important in predicting an outcome and use it to train the model.

The PICK – BRAZIL

It was interesting to see that while there were so many approaches, the final outcome in the majority of cases was Brazil.
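To make the simulation idea above concrete, here is a toy Monte Carlo sketch in the same spirit as the "2 million simulations" approach. The team ratings, the logistic win-probability formula, and the pure knockout format are all illustrative assumptions, not anyone's actual model:

```python
import random
from collections import Counter

# Hypothetical team strength ratings -- made-up numbers for illustration only.
RATINGS = {"Brazil": 2.0, "Germany": 1.9, "Spain": 1.8, "France": 1.8,
           "Argentina": 1.7, "Belgium": 1.6, "Portugal": 1.5, "Russia": 1.2}

def win_prob(a, b):
    """Probability that team a beats team b, from the rating gap (logistic curve)."""
    return 1.0 / (1.0 + 10 ** (-(RATINGS[a] - RATINGS[b])))

def play_knockout(teams, rng):
    """Play one single-elimination bracket and return the champion."""
    round_teams = list(teams)
    while len(round_teams) > 1:
        nxt = []
        for i in range(0, len(round_teams), 2):
            a, b = round_teams[i], round_teams[i + 1]
            nxt.append(a if rng.random() < win_prob(a, b) else b)
        round_teams = nxt
    return round_teams[0]

def simulate(n_runs=20000, seed=42):
    """Run many tournaments and count how often each team wins the title."""
    rng = random.Random(seed)
    titles = Counter()
    for _ in range(n_runs):
        teams = list(RATINGS)
        rng.shuffle(teams)  # a fresh random draw for each tournament
        titles[play_knockout(teams, rng)] += 1
    return titles

if __name__ == "__main__":
    titles = simulate()
    for team, wins in titles.most_common(3):
        print(f"{team}: {wins / 20000:.1%} of simulated titles")
```

The point of such a sketch is that the championship probabilities fall out of match-level assumptions; whoever gets the highest rating going in tends to come out on top, which previews the confirmation-bias discussion below.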
Most of the experts/models picked Brazil as the eventual winner. The interesting part of this outcome was that no two predictions had the same path to the final result, yet most of them reached the same conclusion. This similarity can be attributed to two problems:

Confirmation Bias: Most models, when run blindly, will occasionally throw out a result that contradicts expectations. Every modeler worth their salt has, at some point, run a model that showed the opposite behavior to what was expected. At those times we usually went back to the data, massaged it using some technique, and tried to force it to give the expected outcome. The same is true of experts.

Correlation Bias: Because of the recent form of top Brazilian players as individuals in different club teams, most of the models assume they will play great together as a national team. Some even predict the team will be better than the sum of its parts; this is positive correlation. While that might turn out to be true, past events have shown it often is not: two star players together might not equal the sum of their parts and may instead reduce each other's performance. This is negative correlation. Before the tournament it is hard to say anything about this correlation, yet most of the models assume it will be positive.

How to Get a Real Prediction

Whether you choose an AI approach or random-forest-based machine learning, the old proverb "Garbage In = Garbage Out" holds for any modeling approach. So if you are looking to build a good, realistic prediction model, you need good clean data. Once you have clean data you can try different modeling approaches and create various models to see what works and what does not. But the good starting point is always good data.

What Is Good Clean Data?

This is the part where things start getting hazy.
Good clean data is something the modeler decides on, and this decision makes or breaks the model (as we pointed out earlier with data massaging). Good data can be individual records across various teams. Below is a short list of data everyone knows needs to be collected to predict an outcome:

Individual past performance records
Team past performance records

If you are thinking about predicting the outcome, you can reasonably assume that starting with these two data sets should give you a good result. But if that were the case, why do most of the models offer differing paths? The answer is again an old proverb: "The devil is in the details."

Let's start with individual past performances. There are many variables that affect what data gets collected for this part. Some examples:

What length of past performance should we consider? 2 years? 5 years?
Which tournament performances should we consider (club level, national level, local level)?
How do we deal with players with different levels of experience?
Most players' performance degrades over time, with splines at different age points; should that be a variable?
If a player has not played much international football, how do we model him?
Has the player played with the same teammates before? If yes, was the performance different?
Has the player played in Russia before? If yes, how did he perform?
How do individual team match-ups affect the player's performance?

This list, though good, is not definitive. But you need to answer some, if not all, of these questions if you are thinking about building a good model. If you take team performance, then due to the nature of a team sport, team dynamics and coaching dynamics come into play as well. For example, the recent firing of the Spanish coach can have a huge impact on team performance, but it is hard to predict that from any past experience (model or expert).
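The questions above translate directly into feature-engineering decisions. Here is a minimal sketch of collapsing a player's raw match records into model-ready features; every field name, the exponential recency weighting, and the half-life value are illustrative assumptions, not a prescribed schema:

```python
from datetime import date

def recency_weight(match_date, today, half_life_days=365):
    """Down-weight older matches exponentially (the one-year half-life is an assumption)."""
    return 0.5 ** ((today - match_date).days / half_life_days)

def player_features(matches, today):
    """Collapse a player's raw match records into model-ready features.

    Each match is a dict with hypothetical fields: date, rating (a 0-10
    match rating), level ('club' or 'international'), venue_country.
    """
    total_w = weighted_rating = 0.0
    intl_caps = russia_matches = 0
    for m in matches:
        w = recency_weight(m["date"], today)
        total_w += w
        weighted_rating += w * m["rating"]
        intl_caps += m["level"] == "international"
        russia_matches += m["venue_country"] == "Russia"
    return {
        "form": weighted_rating / total_w if total_w else 0.0,  # recency-weighted rating
        "intl_caps": intl_caps,               # international-experience proxy
        "russia_experience": russia_matches,  # venue familiarity
    }

# A made-up two-match history for one player.
matches = [
    {"date": date(2018, 5, 1), "rating": 8.0, "level": "club", "venue_country": "Spain"},
    {"date": date(2016, 6, 1), "rating": 6.0, "level": "international", "venue_country": "Russia"},
]
print(player_features(matches, today=date(2018, 6, 14)))
```

Note how each question from the list becomes a concrete choice here: the look-back window becomes a half-life, Russian venue experience becomes a count, and so on; changing any of these choices changes the model.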
Conclusion – No Model Can Predict a Perfect Outcome; It Is the Data and Approach That Matter

If after reading all this you are feeling lost or frustrated, don't lose hope. In all this noise there are still ways to make an informed prediction that can give you a leg up on your competition. The need of the hour is the right approach:

1. Understand the outcome you are trying to predict.
2. Analyze the factors that can impact the outcome (this step improves the more knowledge you gain about the outcome).
3. Check whether there are external factors (like the firing of the coach in this case) that can impact the outcome.
4. Once you have collected all the factors, run a correlation analysis to see if there are competing factors that can skew your result. If yes, separate them.
5. Normalize the factors.
6. Run the model.
7. Analyze the results. If you are happy, this is your outcome; if not, repeat steps 2-6.

Happy watching football, and go Brazil!

Use Model Thinking for Investing

Also published on Medium.
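Steps 4-6 above can be sketched in a few lines. This toy version uses made-up per-team feature vectors and assumed weights; the 0.9 correlation cutoff and the weighted-score "model" are illustrative stand-ins for whatever analysis and model you actually choose:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def zscore(xs):
    """Normalize a feature to zero mean and unit variance (step 5)."""
    n = len(xs)
    m = sum(xs) / n
    s = math.sqrt(sum((x - m) ** 2 for x in xs) / n)
    return [(x - m) / s for x in xs]

# Made-up per-team features: attack rating, last season's goals, squad experience.
attack = [85, 78, 90, 70, 88]
goals = [70, 60, 75, 50, 72]      # nearly a rescaled copy of `attack`
experience = [30, 45, 28, 50, 33]

# Step 4: screen for strongly correlated (competing) factors and separate them.
r = pearson(attack, goals)
if abs(r) > 0.9:
    print(f"attack/goals correlation {r:.2f}: keep only one of the pair")

# Step 5: normalize the surviving factors.
features = [zscore(attack), zscore(experience)]

# Step 6: a trivial "model" -- a weighted score (the weights are assumptions).
weights = [0.7, -0.3]
scores = [sum(w * f[i] for w, f in zip(weights, features))
          for i in range(len(attack))]
best = max(range(len(scores)), key=scores.__getitem__)
print("predicted strongest team index:", best)
```

Step 7 is then the loop you run by hand: inspect the scores, and if they look implausible, go back and revisit which factors you collected and how you weighted them.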