Monday, July 16, 2012

Phylogenetic network of the FIFA World Cup


Since this is post #50 in this blog, I thought I might try something ambitious, just to celebrate.

There have been several attempts to provide visualizations of the relative success of the different national teams at the FIFA World Cup competitions. This is quite a complex task, because there have been 19 competitions so far, and at least 74 teams have competed in the finals at least once. The relationships between these teams represent a network within each competition, based on their relative success at the games they play, and this network changes through time across the various competitions. Here, I review some of the previous network analyses, and then I present a combined analysis of all of the competitions based on a phylogenetic network.

Background to the Association Football World Cup

The Fédération Internationale de Football Association (FIFA) World Cup™ competition has been played every 4 years since 1930, except 1942 and 1946. Teams qualify for places in the finals by playing against other teams within defined geographical regions: Europe, Africa, Asia, Oceania, North+Central America, and South America. Most of the teams qualify for the finals by succeeding within their region, but the remainder qualify in a subsequent inter-region competition. The host nation(s) automatically qualify.

The number of teams competing has changed dramatically over the years (13-204), as has the number of teams accepted into the finals (13-32). Here is a summary of the finalists at the time of the 2010 competition (yellow represents previous finals participations and red the 2010 one). It also shows the seven countries who have won the competition.


Given the 80 years over which the competitions have been held, there have been some changes in the political entities that the teams represent. Confusion over this issue affects some of the graphs shown below. FIFA officially attributes the various results as follows:
(i) all West Germany results go to Germany (leaving 1 finals result for East Germany);
(ii) all Yugoslavia and Serbia & Montenegro results are attributed to Serbia (since the break-up, both Croatia and Slovenia have reached the finals independently);
(iii) all Czechoslovakia results are attributed to both the Czech Republic and Slovakia;
(iv) all USSR results go to Russia (only Ukraine has reached the finals independently).

The results available for analysis are for the finals only, at the end of which FIFA provides an ordering of these teams based on their success in the finals. The full data are presented at the official FIFA site, and a summary is reproduced at Wikipedia.

In the data, some of the "zero" results are attributable to the team not competing in that year's World Cup competition at all, while others are attributable to the team not getting to the finals. Only the Brazilian and French teams have competed in all 19 editions, and only the Brazilians have made it to the finals every time.

The format of the finals competition has also changed over the years, at least partly in response to the increasing number of teams involved. Nevertheless, an official ranking of all finals teams has been produced for each edition. Otherwise: the ball is round, a team has 11 players (plus a substitute or three), and a match takes 90 minutes (possibly with extra time, and maybe a bizarre lottery called a "penalty shoot-out").

Previous Analyses

The finals competition is usually considered to be the most widely viewed sporting event in the world, surpassing even the Olympic Games. Not unexpectedly, there is now an enormous internet presence before, during and after each Cup, and some of the web sites have rather impressive data visualizations. These consist of (i) pre-competition viewing information, team analyses and result forecasts, (ii) competition game presentations, and (iii) post-game summaries and incredibly detailed de-constructions of each game (with every move made by each player and its effect on the outcome).

Some of the data visualizations for the 2010 competition have been collected for viewing at:

An example of a network analysis used as a team summary is illustrated here, in which directed line graphs (from 2010 Football World Cup Graphs) show the pattern of ball movement, averaged across the first-round games from the 2010 finals. The nodes of the graphs represent the players and the arrows represent the ball passes, with the size and colour of the arrows representing the number of passes between players.


Note that the German team mainly builds their attacks from their defenders (notably #17 and #16), which is a strategy they have been successfully using for many decades (they are, on average, the most consistently successful team in World Cup history). The English team works mainly from the midfield, and concentrates their attack through player #10. (Note that although #16 receives the ball frequently from the defenders, he usually just passes it back.) Sadly, this concentration presents an easily predicted strategy, and it focuses play precisely where the Germans are concentrating their defensive work (on the field, German #16 plays near English #4). So, if these two teams were to meet, and played in a similar manner to what is shown in the networks, the outcome is easy to forecast. (The Germans won 4-1 when the two teams met in the second round of the competition.) Forecasting is not always this easy, of course.

This is a type of social network, and it is amenable to examination using the standard network summaries for each player. For example, closeness centrality (the inverse of the summed shortest path lengths to all other nodes, measuring how easy it is to reach a given node in the network) measures how well connected a player is in the team; and betweenness centrality (the number of inter-node shortest paths on which a node lies, measuring the extent to which a node lies on a path between other nodes) measures how much the ball flow between players depends on each player. Both teams have relatively evenly distributed centrality values in this example, so that no single player can be said to be a "key" player for either team.
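
The two centrality measures can be illustrated with a minimal sketch in Python. The passing network below is invented purely for illustration (the player labels and links are not taken from the 2010 graphs), the passes are treated as unweighted, and closeness is computed from outgoing shortest paths, which is only one of several possible conventions for a directed graph.

```python
from collections import deque

# Hypothetical directed passing network: passer -> list of receivers.
# Player labels and links are invented for illustration only.
passes = {
    "GK": ["D1", "D2"],
    "D1": ["D2", "M1"],
    "D2": ["D1", "M1"],
    "M1": ["D1", "F1", "F2"],
    "F1": ["M1", "F2"],
    "F2": ["F1"],
}


def bfs(graph, source):
    """Breadth-first search returning distances, shortest-path counts,
    predecessor lists and the order in which nodes were visited."""
    dist = {source: 0}
    sigma = {node: 0 for node in graph}
    sigma[source] = 1
    preds = {node: [] for node in graph}
    order = []
    queue = deque([source])
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in graph[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
            if dist[w] == dist[v] + 1:
                sigma[w] += sigma[v]
                preds[w].append(v)
    return dist, sigma, preds, order


def closeness(graph):
    """Number of reachable nodes divided by the summed shortest path
    lengths to them (0.0 if nothing is reachable)."""
    result = {}
    for node in graph:
        dist, _, _, _ = bfs(graph, node)
        total = sum(dist.values())
        result[node] = (len(dist) - 1) / total if total else 0.0
    return result


def betweenness(graph):
    """Unnormalized betweenness, accumulated source by source
    (the approach usually credited to Brandes)."""
    score = {node: 0.0 for node in graph}
    for source in graph:
        _, sigma, preds, order = bfs(graph, source)
        delta = {node: 0.0 for node in graph}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    return score


if __name__ == "__main__":
    for name, values in (("closeness", closeness(passes)),
                         ("betweenness", betweenness(passes))):
        print(name, {k: round(v, 3) for k, v in values.items()})
```

In this toy network the single midfielder links the defenders to the forwards, so that player gets the largest betweenness value; a team with evenly spread values, as in the real graphs, would show no such bottleneck.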

Another form of social network connects the players and their country teams and club teams (from FIFA 2010 World Cup as Networks). In this example, the nodes represent the countries in the 2010 finals, and two countries are connected if they have players who share the same club in which they play. Node size represents (closeness?) centrality in the network.
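
As a rough sketch of how such a network could be assembled, the Python below links two countries whenever at least one club appears in both squads. The squad records, country names and club names are placeholders, not real 2010 data, and the function name is hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical squad records: (country, club) for each player.
players = [
    ("Country A", "Club X"),
    ("Country A", "Club Y"),
    ("Country B", "Club X"),
    ("Country C", "Club Y"),
    ("Country C", "Club Z"),
    ("Country D", "Club W"),
]


def shared_club_edges(records):
    """Return the set of country pairs that share at least one club."""
    countries_by_club = defaultdict(set)
    for country, club in records:
        countries_by_club[club].add(country)
    edges = set()
    for countries in countries_by_club.values():
        # Every pair of countries with a player at this club gets an edge.
        for a, b in combinations(sorted(countries), 2):
            edges.add((a, b))
    return edges


if __name__ == "__main__":
    for edge in sorted(shared_club_edges(players)):
        print(edge)
```

A country whose players all play for clubs that no other squad uses (Country D here) ends up with no edges at all, which is the low-centrality situation described for the Spanish team.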


Centrality in an organizational social network has been linked to team and individual performance, because the players could transfer knowledge from different clubs to their own country teams. In this example, their large centrality might have been a contributor to the Netherlands' success, as they appeared in the final. However, they then lost to the Spanish team, who have a very low centrality (most of their players play in Spain).

In addition to this sort of analysis, there have also been attempts to summarize and visualize the competition results from the entire 19-cup history. For example, there is an interactive set of bar charts at: FIFA World Cup Statistics with Tableau.

However, a network can try to summarize the information in a single diagram, rather than a set of diagrams. This has been tried by:
Ulrik Brandes (2006) Centrality: Concepts and Methods. NetSci 2006 Workshop, 16-19 May 2006, Bloomington, Indiana, USA.
The two networks shown here summarize the data from the matches at the World Cup finals from 1930–2002. The nodes represent the teams, and the arrows represent the results of all of the matches played between each pair of teams (the arrows point to the winner).

Betweenness centrality

Note that the German team, with the greatest average success across all competitions, has the greatest betweenness centrality (defined above), rather than the Brazilian team, who have won the most Cups (5 to the Germans' 3, at the time of the graph).

Closeness centrality

There is no single "most central" team for closeness centrality, but instead a group of those teams who had appeared most often in the finals, to that time (the unlabelled teams include France, Hungary, former Yugoslavia and Czechoslovakia).

Node-degree centrality is not shown here (the number of incident edges to a node, measuring how well-connected each node is), but Brazil is most central based on that measure, having appeared in all of the Cup finals and thus having played against more teams than anyone else.

Another interesting attempt to simultaneously view the entire dataset using networks is provided by:
Adel Ahmed, Xiaoyan Fu, Seok-Hee Hong, Quan Hoang Nguyen, Kai Xu (2010) Visual analysis of history of World Cup: a dynamic network with dynamic hierarchy and geographic clustering. Pages 25-39 in M.L. Huang, Q.V. Nguyen, K. Zhang (eds) Visual Information Communication. Springer, New York.
The two network visualizations shown here summarize the data from the matches at the World Cup finals from 1930–2006. As above, in the network for each year the nodes represent the teams, and the arrows represent the results of all of the matches. The network summaries are based on node-degree centrality, as the more successful teams play in more games. Unfortunately, the dataset used separates the results for "Germany" and "West Germany" (contrary to FIFA), thus reducing the apparent success of the German team.

The first graph displays the centrality values as a wheel, with the size of each node representing the value. The yearly values are arranged in concentric circles (coloured by year), with 1930 in the centre, and the countries are represented by the spokes (indicated by their flags, and grouped into their geographical regions). This reveals the change in centrality value for each team through time. The German, Brazilian and Italian teams, for example, each have an almost continuous series of nodes. This construction can be viewed more clearly in the animation of the graph provided by the authors.


The next graph arranges the teams in concentric circles based on groupings of their centrality values (they are grouped based on the range of values across the 18 Cups), with the team having the highest value (Brazil) in the centre. All of the games played are represented as connecting lines. This graph thus super-imposes the results for all 18 competitions (i.e. it is the union of the separate networks for each year). The size of the nodes represents the largest centrality value observed for each team.
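
The superimposition described here can be sketched in Python under some simplifying assumptions: each competition is a list of matches, node-degree centrality is taken to be the number of distinct opponents in that year, and the yearly slices below are invented. This is not the authors' code, only an illustration of the union of networks and of the largest yearly value per team.

```python
from collections import defaultdict

# Hypothetical yearly slices: year -> list of (team, team) matches.
slices = {
    1930: [("A", "B"), ("A", "C")],
    1934: [("A", "B"), ("B", "D"), ("B", "C")],
}


def degrees(matches):
    """Node-degree centrality for one slice: number of distinct opponents."""
    opponents = defaultdict(set)
    for first, second in matches:
        opponents[first].add(second)
        opponents[second].add(first)
    return {team: len(others) for team, others in opponents.items()}


def superimpose(yearly):
    """Return the union of all yearly edges and, for each team,
    the largest degree observed in any single year."""
    union_edges = set()
    largest = defaultdict(int)
    for year in sorted(yearly):
        for first, second in yearly[year]:
            union_edges.add(frozenset((first, second)))
        for team, value in degrees(yearly[year]).items():
            largest[team] = max(largest[team], value)
    return union_edges, dict(largest)


if __name__ == "__main__":
    edges, node_sizes = superimpose(slices)
    print(len(edges), "edges in the union network")
    print(node_sizes)
```

Repeated meetings collapse into a single edge in the union, while the node sizes keep the best single-year value rather than a total.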


The authors present an animation of this graph, in order to show the change in the 18 component networks through time. Each competition network forms a slice that can be viewed separately.

These networks provide no particularly deep insight into the history of the World Cup, in the sense that they summarize only patterns that are already obvious in the data. Nevertheless, they are effective summaries of a complex time-series of dynamic networks.

Phylogenetic Analysis

A phylogenetic analysis seeks to uncover the historical patterns associated with a group of objects for which multi-variable data have been collected. It is thus related to other multivariate analysis techniques, such as ordination and clustering, as well as to line-graph visualization techniques.

The network analysis assumes, of course, that the data have been formed by some historical process(es), and it produces a visualization that places objects with similar histories near each other in the network. The World Cup data are thus ideal for this type of analysis.

For my analysis, the FIFA rank-order data for each Cup were range-scaled to vary from 1 (last in the order) to 2 (first in the order), to deal with the varying number of finalists. Absence from the finals was coded as 0 (which could be due to not competing that year, or to competing but not qualifying for the finals).
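
A minimal sketch of this coding step, in Python, is given below. It assumes a simple linear interpolation between 1 and 2 (the exact formula is not stated above), and the team names are placeholders.

```python
def scale_ranks(ranking, all_teams):
    """Range-scale one competition's final ranking.

    ranking   -- finalists ordered from first to last place
    all_teams -- every team in the dataset; absentees are coded as 0
    """
    n = len(ranking)
    scores = {team: 0.0 for team in all_teams}
    for position, team in enumerate(ranking, start=1):
        if n == 1:
            scores[team] = 2.0
        else:
            # Linear scaling: first place -> 2, last place -> 1.
            scores[team] = 1.0 + (n - position) / (n - 1)
    return scores


if __name__ == "__main__":
    teams = ["A", "B", "C", "D", "E", "F"]
    finalists = ["C", "A", "E", "B", "D"]  # hypothetical order, best first
    print(scale_ranks(finalists, teams))
```

Because every competition is mapped onto the same 1 to 2 interval, finishing last of 13 and last of 32 both score 1, and any appearance in the finals scores higher than absence.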

For those teams that have changed through time (listed above), I have followed (i) and (ii). For (iii), I have attributed all results to Czech/Slovakia, since the Czech Republic and Slovakia have never been in the finals together. [It is unnecessary to have the results duplicated, since the two countries would be almost perfectly correlated.] For (iv) I have also attributed all results to Russia/Ukraine, since they have never been in the finals together.

The similarity among the 19 scores for each pair of teams was calculated using the Steinhaus dissimilarity. The Steinhaus dissimilarity ignores "negative matches", as discussed in a previous blog post, so that two teams are not considered to be similar just because they were both absent from the finals in the same years. This is important, because (a) there are another c. 130 teams who have always been absent from the finals (and would then need to be accommodated), and (b) we would need to somehow account for the two different reasons for being absent from the finals.
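
The effect of ignoring "negative matches" can be seen in a short Python sketch. It uses the usual form of the Steinhaus (Bray-Curtis) dissimilarity, one minus twice the summed minima divided by the summed totals; the score vectors are invented, and returning 0 for two all-zero vectors is an arbitrary choice made here.

```python
def steinhaus(x, y):
    """Steinhaus dissimilarity between two equal-length score vectors."""
    shared = sum(min(a, b) for a, b in zip(x, y))
    total = sum(x) + sum(y)
    if total == 0:
        return 0.0  # assumption: two all-absent teams are treated as identical
    return 1.0 - 2.0 * shared / total


if __name__ == "__main__":
    # Hypothetical scaled scores for two teams over four competitions.
    team_a = [2.0, 0.0, 1.5, 0.0]
    team_b = [1.5, 0.0, 2.0, 0.0]
    print(steinhaus(team_a, team_b))
    # Dropping the years in which both teams were absent changes nothing.
    print(steinhaus([2.0, 1.5], [1.5, 2.0]))
```

Years in which both teams score 0 add nothing to either the numerator or the denominator, so shared absence cannot make two teams look more alike.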

A Neighbor-net analysis was used to display the between-team similarities as a phylogenetic network. This decomposes the similarities into a series of bi-partitions of the teams, and then tries to display as many of these bi-partitions as possible in two dimensions. Each bi-partition represents the division of the teams into two sub-groups, where the data indicate that the two sub-groups differ in some way. That is, countries that are closely connected in the network are similar to each other based on their World Cup results, and those that are further apart are progressively more different from each other.
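
The following Python sketch is not the Neighbor-net algorithm itself. It only illustrates what a set of weighted bi-partitions means: the distance shown between two teams in a splits network is the summed weight of the bi-partitions that separate them. The four placeholder teams, the splits and their weights are all invented.

```python
# Placeholder teams and hypothetical weighted bi-partitions.
# Each split is stored as one side of the partition plus its weight.
TEAMS = {"T1", "T2", "T3", "T4"}
SPLITS = [
    (frozenset({"T1", "T2"}), 0.5),    # {T1, T2} | {T3, T4}
    (frozenset({"T2", "T3"}), 0.25),   # {T2, T3} | {T1, T4}
    (frozenset({"T1"}), 0.125),        # {T1} | everyone else
]


def split_distance(a, b, splits):
    """Sum the weights of the splits that put a and b on opposite sides."""
    return sum(weight for side, weight in splits
               if (a in side) != (b in side))


def compatible(side_a, side_b, teams):
    """Two splits are compatible if at least one of the four
    intersections of their sides is empty."""
    rest_a, rest_b = teams - side_a, teams - side_b
    return any(not (p & q)
               for p in (side_a, rest_a)
               for q in (side_b, rest_b))


if __name__ == "__main__":
    print(split_distance("T1", "T3", SPLITS))
    print(compatible(SPLITS[0][0], SPLITS[1][0], TEAMS))
```

The first two splits are incompatible, so they cannot both be drawn on a tree; a network displays such a pair as a box of parallel edges.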


To interpret the graph, it can be noted that the biggest split (i.e. best supported by the data) separates North Korea, Greece, Algeria, New Zealand and Honduras into a partition apart from the other teams. Inspection of the original data shows that these five teams all appeared in the 2010 finals and did poorly, while not appearing in most of the other finals.

The network has two main bi-partitions of interest, and the split that separates each sub-group is highlighted in red or blue in the graph. This pattern of two bi-partitions thus creates four quadrants in the network. The lower-left quadrant (from Romania clockwise round to Serbia/Montenegro) contains those teams who have been successful on most of those occasions when they have appeared in the finals (e.g. they have made it to the quarter-finals). Note that the most successful teams (Brazil, Germany, Italy) do not stand out in this group. The left quadrant (Mexico to South Korea) contains those teams whose finals results have varied from very good to very poor. The upper quadrant (Ecuador to Iran) contains those teams who have usually been moderately successful whenever they have qualified for the finals (e.g. they have made it to the second round). The right quadrant (Jamaica to Norway) contains those teams who have usually been unsuccessful when they have appeared in the finals (e.g. they have been eliminated in the first round).

This phylogenetic network thus provides a very effective summary of the main features of the World Cup results when averaged over all of the competitions.

If we want an alternative network summary that emphasizes the success of the most successful teams, then it would have to include the "negative matches", because one of the main indicators of a team's success is the fact that they have appeared in most of the finals (i.e. they have few zeroes, indicating absence). The similarity measure that includes these, but is otherwise equivalent to the Steinhaus dissimilarity, is the Manhattan distance. Note that this analysis treats all absences from the finals as equivalent, and (arbitrarily) includes only those teams who have made it to the finals at least once.
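
The practical difference between the two measures can be sketched in Python with three invented score vectors: two teams that each reached the finals once, in different years, and one team that was always present. The Manhattan distance is the plain sum of absolute differences, so it is not divided by the summed scores as the Steinhaus dissimilarity is.

```python
def manhattan(x, y):
    """Sum of absolute differences; a shared absence contributes 0,
    i.e. it counts as agreement."""
    return sum(abs(a - b) for a, b in zip(x, y))


def steinhaus(x, y):
    """The same sum of absolute differences, divided by the summed scores."""
    total = sum(x) + sum(y)
    return manhattan(x, y) / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical scaled scores over six competitions.
    rare_a = [0.0, 0.0, 1.2, 0.0, 0.0, 0.0]
    rare_b = [0.0, 0.0, 0.0, 0.0, 1.1, 0.0]
    regular = [2.0, 1.8, 1.9, 1.7, 2.0, 1.6]

    for name, measure in (("Manhattan", manhattan), ("Steinhaus", steinhaus)):
        print(name,
              "rare_a vs rare_b:", round(measure(rare_a, rare_b), 3),
              "rare_a vs regular:", round(measure(rare_a, regular), 3))
```

Under the Manhattan distance the two rarely present teams are close to each other and far from the regular finalist, because their many shared absences count as agreement. Under the Steinhaus dissimilarity they are maximally different, since they never appeared in the same finals.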


This network basically highlights those teams who have appeared in the semi-finals at least once (I have labelled only these teams on the network). It does, however, also strongly emphasize the most successful teams, with most of the winners at the very top of the graph. Uruguay is separated from the other winners because it has rarely done well since its early wins in 1930 and 1950. The other (unlabelled) teams in the upper part of the network are Mexico, Romania, Scotland (top to bottom on the left) and Switzerland (on the right), who regularly have made it to the quarter-finals but never to the semi-finals.

This phylogenetic network thus provides an effective summary of the successful teams when averaged over all of the competitions.

Note that the two phylogenetic networks succeed in a very different manner from the other types of network, as shown above. Here the summary is based on calculating the similarity of the teams' results and then displaying this as a network, whereas in the other, more traditional, approach the networks are first derived as direct displays of the data and then their centrality is calculated and displayed.

Furthermore, the traditional networks consist only of edges connecting "observed" nodes, whereas the phylogenetic networks have many extra "inferred" nodes and edges. These inferred nodes are designed to support the display of sets of incompatible bi-partitions — they are not intended to be hypothetical extra teams!

These are thus quite different approaches to the same visualization problem. They decompose the problem in different ways and produce different visualizations. They seem to be equally effective, however.
