

Machine Learning - Preseason Predictions

How could this new season go? Let's have a machine predict it.


As the staff at Big D Soccer prepares to release our personal predictions for the coming season, I thought it would be an interesting experiment to start a machine learning series. Today's article is the first of (hopefully) many on machine learning and soccer throughout the season, and we are starting with season-long predictions. First I will break down the very basics of machine learning, then we will tackle what the predictions are, what might change, and why.

What is machine learning?

Well, here is a great starting point. Machine learning refers to the algorithms and methods that build models from data in order to describe that data and make predictions from it. What does that mean exactly? Think about it this way: in the same way that you or I would take in a soccer game, watch an interview, read about training and injuries, and then turn that into an informed decision about the outcome of a game, a machine learning model takes in the data it is given, learns the relationships in that data, and uses them to give a prediction. There are countless articles, books, videos and more about the various types of machine learning models and methods, so I will not get too deep into the math behind them. Throughout the season we will look at a number of different methods and compare the results to see what works best (and why).

Two big types of machine learning

Regression is what we will be using today for our projections, but machine learning has another side as well: classification. This is what it sounds like, a way of organizing and classifying data. If we wanted to examine game stats to determine what position groups our players truly belong in (hint: coming in a future article), then we would use a classifier model to pick which “position” or role a player belongs to. Regression is, most simply, estimating the relationship between variables. The model is “trained” on past data to learn those relationships, and then, given new inputs, its output is our prediction.
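
To make the difference concrete, here is a minimal sketch in Python. The numbers are made up for illustration (they are not real MLS data), and the helper names `fit_line` and `nearest_centroid` are hypothetical. The first half is regression (the output is a number), the second half is classification (the output is a label).

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares fit for one input variable: returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    denominator = sum((x - x_bar) ** 2 for x in xs)
    slope = numerator / denominator
    return slope, y_bar - slope * x_bar


def nearest_centroid(groups, value):
    """Return the label whose group average is closest to value."""
    return min(groups, key=lambda label: abs(mean(groups[label]) - value))


# Regression: made-up team shots per 90 minutes -> season points.
shots = [10.0, 12.0, 14.0, 16.0]
points = [40.0, 46.0, 52.0, 58.0]
slope, intercept = fit_line(shots, points)
predicted_points = slope * 13.0 + intercept  # a number

# Classification: made-up player shots per 90 minutes for two position groups.
groups = {"defender": [0.3, 0.5, 0.4], "forward": [2.8, 3.1, 3.4]}
predicted_role = nearest_centroid(groups, 2.5)  # a label
```

Real models use many inputs at once, but the split is the same: regression answers "how many?", classification answers "which one?".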

The happy future of soccer analytics!

Instead of using only the expected, traditional data points to predict the final points totals for Major League Soccer teams, I chose to use a mixture of those traditional stats and some less obvious pieces of data. Taking only data from the past five seasons (2014-2018), I collected what was needed, and here is what I chose to use for the first model:

  • Coaching Change - was there a change in coach during the offseason (Yes or No)
  • Conference in which the team plays (East or West)
  • Expansion Team - is the team an expansion team (Yes or No)
  • Average Age of Roster
  • Shots per 90 minutes
  • Shots on Goal per 90 minutes
  • Fouls Committed per 90 minutes
  • Fouls Received per 90 minutes
  • Corners per 90 minutes
  • Offsides per 90 minutes
  • Penalty Kicks Won
  • Penalty Kicks Conceded
  • Total Team Salary
  • Number of Home Grown Players
  • Number of Designated Players
  • The Salary of those Designated Players
  • Average Attendance
  • Percentage of Stadium Filled on Average (using MLS capacity numbers, not stadium max)
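
A model only understands numbers, so inputs like the ones listed above have to be turned into a numeric row first. The sketch below shows one way that could look in Python. It is an assumption, not the exact pipeline used here: the field names, the `encode_team` helper, the 0/1 coding, and the example values are all hypothetical, and only a handful of the inputs are shown.

```python
def encode_team(row):
    """Turn one team-season record (a dict) into a list of numbers."""
    return [
        1.0 if row["coaching_change"] == "Yes" else 0.0,  # Yes/No -> 1/0
        1.0 if row["conference"] == "West" else 0.0,      # West/East -> 1/0
        1.0 if row["expansion_team"] == "Yes" else 0.0,   # Yes/No -> 1/0
        float(row["average_age"]),
        float(row["shots_per_90"]),
        float(row["total_salary"]) / 1_000_000,           # dollars -> millions
    ]


# Made-up example record, not real data for any club.
example = {
    "coaching_change": "No",
    "conference": "West",
    "expansion_team": "No",
    "average_age": 26.4,
    "shots_per_90": 12.5,
    "total_salary": 9_500_000,
}
features = encode_team(example)
```

Scaling salary down to millions keeps one very large column from swamping the others, which matters for the kind of model used below.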

That might seem like a lot of data, but honestly we could have gone after a whole lot more. One of the main advantages of using machine learning is that we can explore the relationships between seemingly unrelated pieces of data. I used a neural network here for regression, which just means that there is a bunch of stuff going on behind the scenes that is likely not super interesting, but dramatically changes the outcome of our predictions. Specifically this is using a multilayer perceptron regressor, which uses backpropagation for training. We could have just used a linear regression here and not involved a neural network at all, but I was working on a neural network for something unrelated and wanted to test this out, so here we are. If there is interest I can dive deeper into what these terms mean, so feel free to comment and let me know if there is a desire for a deeper dive into this type of stuff.
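
For the curious, here is a small sketch of what a multilayer perceptron regressor trained with backpropagation does under the hood. This is not the network behind the predictions in this article. The single hidden layer, the tanh activation, the learning rate, the helper names (`train_mlp`, `predict`, `mse`), and the toy data are all assumptions chosen to keep the example short.

```python
import math
import random


def train_mlp(rows, targets, hidden=4, lr=0.05, epochs=2000, seed=0):
    """Train a one-hidden-layer perceptron regressor with stochastic gradient descent."""
    rng = random.Random(seed)
    n_in = len(rows[0])
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]
    b2 = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, targets):
            # Forward pass: tanh hidden layer, then a linear output.
            h = [math.tanh(sum(w * v for w, v in zip(w1[j], x)) + b1[j])
                 for j in range(hidden)]
            pred = sum(w2[j] * h[j] for j in range(hidden)) + b2
            # Backward pass: gradients of 0.5 * (pred - y) ** 2.
            err = pred - y
            for j in range(hidden):
                # Error pushed back through tanh, using w2[j] before it is updated.
                grad_h = err * w2[j] * (1.0 - h[j] ** 2)
                w2[j] -= lr * err * h[j]
                b1[j] -= lr * grad_h
                for i in range(n_in):
                    w1[j][i] -= lr * grad_h * x[i]
            b2 -= lr * err
    return w1, b1, w2, b2


def predict(model, x):
    """Run one input row through the trained network."""
    w1, b1, w2, b2 = model
    h = [math.tanh(sum(w * v for w, v in zip(w1[j], x)) + b1[j])
         for j in range(len(w2))]
    return sum(w2[j] * h[j] for j in range(len(w2))) + b2


def mse(model, rows, targets):
    """Mean squared error of the model on the given rows."""
    return sum((predict(model, x) - y) ** 2 for x, y in zip(rows, targets)) / len(rows)


# Toy data: one input scaled to 0..1, target is points / 100 (made-up values).
rows = [[0.0], [0.25], [0.5], [0.75], [1.0]]
targets = [0.35, 0.42, 0.48, 0.55, 0.63]
model = train_mlp(rows, targets)
```

The backward pass is the backpropagation part: the prediction error is pushed back through the network to work out how much each weight contributed, and every weight is nudged in the direction that shrinks the error.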

Still with me?

Yeah. Cool. Show me the results

Here’s what you came for.

Western Conference Prediction

Team                      Predicted Points
Los Angeles FC            53
Seattle Sounders FC       53
FC Dallas                 52
Real Salt Lake            50
Vancouver Whitecaps FC    48
LA Galaxy                 47
Portland Timbers          47
Sporting Kansas City      47
Houston Dynamo            42
San Jose Earthquakes      40
Minnesota United          38
Colorado Rapids           35

There are some interesting things here, most notably how incredibly close the conference is projected to be come the end of the season. Sporting Kansas City is likely too low, but since they sit only six points from the top of the conference, it could be argued that the projection is perfectly reasonable. Seattle is probably a bit higher than I would expect, but that is likely due to their recent appearances in the MLS Cup Final. Teams that have been more successful recently tend to have higher projections, but Sporting KC is a bit of an outlier here, so we will want to look at that later.

Now to the Eastern Conference, let’s see who comes out on top.

Eastern Conference Prediction

Team                      Predicted Points
Atlanta United            63
Toronto FC                51
New York Red Bulls        50
New England Revolution    49
New York City FC          49
Montreal Impact           48
D.C. United               46
Philadelphia Union        45
Chicago Fire              44
FC Cincinnati             44
Orlando City SC           44
Columbus Crew SC          43

Looks like Atlanta is on top again. No surprise there. Toronto bouncing back is a non-trivial prediction since they are without Giovinco for the first time since 2014. The Revolution are quite high here, perhaps because of their “prudent” spending patterns, perhaps not.

What could change, and why?

Each and every one of these numbers will likely change when I rerun this after week one. Since we have no game data (shots, shots on goal, etc.) for 2019, we are using projections from past seasons weighted for trends, coaching changes, and so on. As the 2019 data comes in, we will get a clearer picture of the relationships between these variables and the all-important season point totals. We could also add other contributing data points here, such as goals, expected goals, unique starting lineups, pace of play*, and on and on. The beautiful game has immense amounts of data, and the stuff that is publicly available can provide us with some serious insights into the game, the league, and our own team, FC Dallas.
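
As a rough illustration of what "projections from past seasons weighted for trends" could look like, here is one simple scheme in Python: a recency-weighted average where each older season counts half as much as the one after it. This is an assumption for illustration only. The `weighted_projection` helper, the decay of 0.5, and the sample values are hypothetical, and the actual weighting used for these predictions may differ.

```python
def weighted_projection(values, decay=0.5):
    """Recency-weighted average; values are ordered oldest to newest."""
    n = len(values)
    # The newest season gets weight 1, the one before it decay, then decay ** 2, ...
    weights = [decay ** (n - 1 - i) for i in range(n)]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


# Made-up shots per 90 minutes for one team over three seasons.
shots_per_90 = [11.0, 12.0, 13.0]
projection = weighted_projection(shots_per_90)
```

Because the most recent season carries the most weight, the projection lands above the plain three-season average of 12, following the upward trend.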

That about wraps it up for our first surface dip into machine learning predictions. Look for another article this weekend that will be predicting score lines for the MLS opening weekend matches. If there is something you would like for me to explore or if you have some ideas please feel free to let me know and we can figure out if it is something worth putting in an article. If not, then I may put it out on social media if there is interest. Hit up the comments with feedback and your thoughts. What looks off? What is surprising and what do you think looks just about right?

* I am currently authoring a research paper on Pace of Play as a new metric in soccer. I’ll share some insights and some of the methodology as the season goes along here.