Recently I have been pondering player positions and responsibilities. Deciding that this was a worthwhile project, I started collecting data and doing some reading. Similar analysis has been done on basketball players in the NBA, but my brief look revealed no comparable soccer analysis. Soccer has more than twice as many players on the field at once, which certainly makes the task more difficult, as does the unavailability of data. Basic statistics are available (goals, assists, minutes played, etc.), but anything further takes much more work to acquire. Basketball, by contrast, has a site (basketball-reference.com) where all of the data one could want is readily available.
Steve, my close personal machine learning friend, offered his help and thus the project kicked off. For the purposes of this article we will look specifically at MLS players over the past five seasons, focusing primarily on FC Dallas players and their roles. Part of my difficulty when starting this project was determining what exactly a “position” is in the game of soccer. Many players can play the same “position” but have vastly different responsibilities, while players in different “positions” can be given the same tasks and responsibilities throughout a game. It became apparent that the term position was not really what I was hoping to describe. Roles seemed a better fit, as the term encompasses all of the things that players are asked to do during a game.
Collecting data was a moderately long process, but after finishing we wound up with more than 140 variables. I should note here that (x, y, z) location data was not included in this analysis, though including it would likely change the results slightly. The data points consist of pass length, passing percentage, aerials, shot locations, shot distances, and many more. They are included on a per-game basis as well as on a per-96-minute basis. Per-game data has some fairly obvious flaws, most notably that it does not distinguish between a player who starts and plays an entire game and a player who comes on in the 85th minute. We have a lot of variables, which is interesting but also poses some problems for us, because it can create a lot of noise and prevent us from finding the true clusters.
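To make the per-96-minute idea concrete, here is a minimal sketch of the normalization. The function name and the example numbers are made up for illustration; they are not taken from the actual dataset.

```python
def per_96(total: float, minutes_played: float, window: float = 96.0) -> float:
    """Scale a raw count to a rate per `window` minutes played."""
    if minutes_played <= 0:
        raise ValueError("minutes_played must be positive")
    return total * window / minutes_played

# Two hypothetical appearances with the same raw count but very
# different time on the field.
starter = per_96(total=3, minutes_played=96)   # played the whole match
late_sub = per_96(total=3, minutes_played=12)  # short cameo off the bench
print(starter, late_sub)
```

A per-game view would treat both appearances as "3 per game," while the per-96 rate separates them. It also shows why tiny minute totals can produce inflated rates.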
Each row in the dataset consists of a player, the year of the season data, and all of the features/variables from that season. Players are separated into unique rows for each year to account for roles changing from season to season, as well as players moving from one team to another, thus changing tactics and responsibilities. I should also note here that I have only included players who played 900 minutes or more during a season so as to exclude some of the low-lying outliers. I have thought about altering the minute requirement but 900 seems about right as it gives us ten full games or so of data.
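The row layout and the minutes cutoff described above could be sketched roughly like this. The `PlayerSeason` class, its field names, and the sample rows are hypothetical; only the 900-minute threshold comes from the article.

```python
from dataclasses import dataclass

MIN_MINUTES = 900  # cutoff stated in the article

@dataclass(frozen=True)
class PlayerSeason:
    """One row: a player in a single season (hypothetical field names)."""
    player: str
    season: int
    team: str
    minutes: int

def eligible(rows, min_minutes=MIN_MINUTES):
    """Keep only player-seasons that meet the minutes requirement."""
    return [row for row in rows if row.minutes >= min_minutes]

# Made-up rows: the same player appears once per season, so a change of
# team or role shows up as a separate row.
rows = [
    PlayerSeason("Player A", 2016, "Team X", 2450),
    PlayerSeason("Player A", 2017, "Team Y", 1310),
    PlayerSeason("Player B", 2017, "Team X", 640),
]
kept = eligible(rows)
print([(row.player, row.season) for row in kept])
```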
Variables and Principal Component Analysis (PCA)
When we have lots of variables it can become cumbersome to find the best fit and the most descriptive features of our data. At the beginning of analysis I had 145 features, but using Principal Component Analysis I narrowed that list down to 40 new features (components). PCA builds new components out of combinations of the original features, ordered by how much of the variance in our data each one accounts for. The 40 components account for nearly 94% of our total variance, which means that using just these 40 components we can summarize the vast majority of the information that we would get from the original 145 features. To determine how many components are optimal, we plot the number of components against the cumulative percentage of variance explained.
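For readers who want to see how a cumulative-variance curve like that is built, here is a rough sketch using only NumPy. It runs on synthetic data, and standardizing the features first is an assumption, since the scaling used for the real analysis is not stated here.

```python
import numpy as np

def cumulative_explained_variance(X):
    """Cumulative share of total variance captured by the first 1, 2, ... components."""
    X = np.asarray(X, dtype=float)
    # Standardize so features on large scales do not dominate (an assumption;
    # the article does not say how the data was scaled).
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    # Squared singular values of the centered data are proportional to the
    # variance explained by each principal component.
    singular_values = np.linalg.svd(Xs, compute_uv=False)
    variance = singular_values ** 2
    return np.cumsum(variance) / variance.sum()

def components_needed(cumulative, target):
    """Smallest number of components whose cumulative share reaches `target`."""
    k = int(np.searchsorted(cumulative, target)) + 1
    return min(k, len(cumulative))

# Synthetic stand-in for the player-season table: 200 rows, 12 features
# driven by 3 hidden factors plus a little noise.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 12))
X = hidden @ mixing + 0.1 * rng.normal(size=(200, 12))

cumulative = cumulative_explained_variance(X)
k = components_needed(cumulative, target=0.94)
print(f"{k} components reach 94% of the variance")
```

Plotting `cumulative` against the component count gives the kind of curve described above; the cutoff is wherever the curve crosses the share of variance you are willing to settle for.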
Each component is made up of a set of the original features and I have provided the top 15 of those features for a few of the components below. I named each of the components and moving forward I will refer to the components as “Descriptive Traits.” These traits help us describe a player’s true role on the field and for each team that they have played with. Here are four of the traits that I’ve included for you to examine so that you will have a better idea of what goes into each component.
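One plausible way to pull out the top features for a component is to rank the original features by the absolute size of their loadings. The sketch below assumes a loadings matrix with one row per component; the feature names and numbers are invented for illustration.

```python
import numpy as np

def top_features(loadings, feature_names, component, n=15):
    """Return the n features with the largest absolute loading on one component."""
    row = np.asarray(loadings, dtype=float)[component]
    order = np.argsort(-np.abs(row))[:n]  # strongest loadings first
    return [(feature_names[i], float(row[i])) for i in order]

# Hypothetical features and loadings (rows = components, columns = features).
feature_names = ["long_passes", "aerials_won", "shots_in_box", "key_passes"]
loadings = np.array([
    [0.10, -0.70, 0.65, 0.20],
    [0.80, 0.05, -0.10, 0.55],
])

top_two = top_features(loadings, feature_names, component=0, n=2)
print(top_two)
```

The sign matters when naming a trait: a large negative loading means the component rises as that feature falls.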
These four traits make up 10% of the total traits that describe an MLS player’s role on the field. Now that we have our 40 (!) traits, we can cluster our player data. There are many ways to cluster the data, including KMeans, Hierarchical clustering, Affinity Propagation, Spectral Clustering, and so on. After trying several models I decided upon Hierarchical clustering, as it builds nested clusters by successively merging or splitting them, and it provided better results than the KMeans clustering.
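To illustrate the bottom-up (merging) flavor of hierarchical clustering, here is a deliberately naive sketch in NumPy. The average-linkage rule and the toy points are assumptions made for the example, not the settings used in this analysis; optimized versions exist in libraries such as SciPy and scikit-learn.

```python
import numpy as np

def agglomerative_labels(X, n_clusters):
    """Naive bottom-up clustering with average linkage (slow, fine for small data)."""
    X = np.asarray(X, dtype=float)
    clusters = [[i] for i in range(len(X))]  # every point starts alone
    # Pairwise Euclidean distances between all points.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    while len(clusters) > n_clusters:
        best = (np.inf, 0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average distance between members of the two clusters.
                d = dist[np.ix_(clusters[a], clusters[b])].mean()
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge the closest pair
        del clusters[b]
    labels = np.empty(len(X), dtype=int)
    for label, members in enumerate(clusters):
        labels[members] = label
    return labels

# Toy data: two tight groups of three points each.
points = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
          [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
labels = agglomerative_labels(points, n_clusters=2)
print(labels)
```

Stopping the merges at different cluster counts gives the nested structure: the two-cluster solution is always a coarsening of the three-cluster one.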
Selecting the number of clusters can be a tricky process as well. Using the sum of squared distances and the elbow method for optimal K (basically trying out several possible K values procedurally and then looking for the point where the sum of squared distances stops dropping sharply), I was able to find 27 discernibly distinct player roles for MLS players.
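The elbow idea can be sketched in two small pieces: a function that scores a clustering by its sum of squared distances, and a simple heuristic that finds the sharpest bend in the resulting curve. The second-difference heuristic and the curve values below are illustrative assumptions, not the exact procedure or numbers from this analysis.

```python
import numpy as np

def within_cluster_ss(X, labels):
    """Sum of squared distances from each point to its own cluster centroid."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for label in np.unique(labels):
        members = X[labels == label]
        total += ((members - members.mean(axis=0)) ** 2).sum()
    return float(total)

def elbow_k(ks, sse):
    """Pick the K where the curve bends most (largest second difference)."""
    second_diff = np.diff(sse, n=2)
    # second_diff[i] is centered on position i + 1 of the curve.
    return ks[int(np.argmax(second_diff)) + 1]

# Scoring one clustering of four made-up points.
X = [[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]]
two_clusters = within_cluster_ss(X, [0, 0, 1, 1])
one_cluster = within_cluster_ss(X, [0, 0, 0, 0])

# A made-up curve: big gains up to K=3, small gains afterwards.
ks = [1, 2, 3, 4, 5, 6]
sse = [100.0, 60.0, 30.0, 25.0, 22.0, 20.0]
best_k = elbow_k(ks, sse)
print(two_clusters, one_cluster, best_k)
```

In practice the curve is rarely this clean, so the elbow is usually confirmed by eye rather than taken from a formula alone.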
FC Dallas Players
This article will focus primarily on FC Dallas players (present and past) and where they are classified. I have a few specific players who I want to point out and talk about how their roles changed over their time with the club and then we can let the discussion really begin.
First, let’s start with a former player: Mauro Diaz, the colossus from Concepcion.
- Mauro Diaz (2015 & 2016) - Group 8 - 23 player-seasons in this role from 2013-2018
- Mauro Diaz (2017) - Group 17 - 46 player-seasons in this role from 2013-2018
This example was one that I was very curious about, as Mauro was coming back from a very serious injury in 2017. His play-style beforehand was different than it was when he returned. Of course this could be explained by his work to recover fitness, but I think that he intentionally altered his play-style in 2017. There are some fantastic, extremely impactful players in group 8, which shows the impact Mauro Diaz had during that two-year period when he truly dominated the league.
Now contrast player group 8 with player group 17, where Mauro landed in his last season with FC Dallas. There he was grouped with players like Sebastian Lletget (2018), Boniek Garcia (2017), and Christian Techera (2014) instead of those like Thierry Henry (2014), Diego Valeri (2014, 2016, 2018), and Benny Feilhaber (2015) in group 8. In no way is this a shot at any of those very good players, but the group 8 players are the type that teams game plan and prepare around.
Lastly, we have Michael Barrios, who I have admittedly not been incredibly fond of over the past several seasons. I will not spend too much time here on the reasons for my feelings about him but I want to explore his season role groupings in a bit more detail.
- Barrios (2015) - Group 12
- Barrios (2016) - Group 6
- Barrios (2017) - Group 10
- Barrios (2018) - Group 10
Barrios had his best goalscoring year in 2016, but with fewer assists than in any subsequent year; in fact, he has already matched that number (2) this season. You can see the other players in player group 6 in the gallery, as well as the players in roles 10 and 12 (note that role 12 is truncated because it has 105 player-seasons, so it has been shortened to fit in an image). Group 10 is where he has been placed for the past two seasons, and I think that this role contains some hard-working players who are primarily providers for other players in the league. Take a look and let me know what you think.
Player Groups vs Player Traits
Here I wanted to show you guys a few of the Trait - Player Group relationships. I have selected two roles to show you how the different traits weigh into player group assignments. Player Group 4 is primarily composed of one type of central defender (Hedges is here for two of the past five years). Group 9 spans a variety of traditional “positions” whose players all function in much the same way, similar to a central midfielder, though there are some wide players in this group as well (examples: Graham Zusi as a back and as a midfielder, Alessandrini from the Galaxy, or Higuain from the Crew).
Yup, that text is small. The reason I did not include the entire heat-map is because it is enormous (over 1000 cells) and that would make it nearly unreadable. You can see which traits are most important to the different roles. I tried to give the traits names that are indicative of the features they contain, but ask away if you have any questions.
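One plausible way to build the numbers behind a heat-map like this is to average each trait score within each player group, giving one row per group and one column per trait. The sketch below assumes that approach; the scores and group numbers are made up.

```python
import numpy as np

def group_trait_means(scores, groups):
    """Average trait score per group: rows are groups, columns are traits."""
    scores = np.asarray(scores, dtype=float)
    groups = np.asarray(groups)
    group_ids = np.unique(groups)
    means = np.vstack([scores[groups == g].mean(axis=0) for g in group_ids])
    return group_ids, means

# Four hypothetical player-seasons scored on two traits, split across
# two groups (labels chosen only to echo the examples in the text).
scores = [[1.0, -1.0],
          [3.0, -3.0],
          [0.0, 2.0],
          [0.0, 4.0]]
groups = [4, 4, 9, 9]

group_ids, means = group_trait_means(scores, groups)
print(group_ids)
print(means)
```

With 27 groups and 40 traits the full matrix has 1,080 cells, which is why only a slice fits in a readable image.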
Applications Moving Forward
Here is where I find this experiment quite interesting: the future applications. In the next article, I will expand upon the processes here and examine championship team makeup, player groups/roles that FC Dallas is missing, what players could help them get over the hump, as well as some extrapolated estimations of current form players and the player groupings to which they would belong (Note: 2019 data was not included as the season is incomplete). Other applications would be projected player growth, looking specifically at the HGPs that FC Dallas has signed and where they are trending with their development and minutes played.
If you like this kind of stuff, please comment below and ask any questions you might have. I’d like to do more deep analysis like this, so let me know if it is interesting to you. Also, if you have any questions about other players and where they stand, just let me know and I can provide that in the comments as well.
Oh and before I forget...
Lastly, here we have this week’s projected standings for the end of the season. I won’t dive into any detail but just list it here. Remember that this isn’t based on typical data that one would use to predict results, and is mostly a thought experiment.
Season Prediction Week 6
|Team|Projected Points|
|---|---|
|Los Angeles FC|61|
|Real Salt Lake|57|
|New York City FC|56|
|Seattle Sounders FC|55|
|Sporting Kansas City|55|
|New York Red Bulls|52|
|New England Revolution|51|
|Vancouver Whitecaps FC|50|
|Orlando City SC|49|
|Columbus Crew SC|46|
|San Jose Earthquakes|42|