Wednesday, December 26, 2012

Network analysis of McDonald's fast-food


Season's greetings: or as we say here in Sweden: God jul och gott nytt år! In many cultures, over-dosing on food is traditional at this time of year, so it is appropriate to have a food-related blog post this week. [Note there is a follow-up blog post called: Is there good and bad fast-food?]

In 1954 a multi-mixer salesman decided to check out the fast-food operation of a couple of brothers in California, named McDonald, and then offered to form a partnership with them. Nearly 60 years later, there are approximately 34,000 fast-food stores with this name worldwide, although there are very few in Africa, and I don't think they have any in Antarctica. This virus-like growth has been accompanied by negative comments on the nutritional quality of the food. Indeed, the international Slow Food organization, which cares very much about the traditional quality of food, was first formed to contest the opening of a McDonald's store near the historic Spanish Steps in Rome.

In a blog post amusingly entitled Infinite Mixture Models with Nonparametric Bayes and the Dirichlet Process, Edwin Chen looked at the nutritional content of the food provided by this culinary mega-chain. Another way to look these data is to use a phylogenetic network as a means of exploratory data analysis, which is what I provide here.

The data

The data are taken from the official document McDonald's USA Nutrition Facts for Popular Menu Items, dated 7 August 2012. The stated purpose of the document is this: "We provide a nutrition analysis of our menu items to help you balance your McDonald's meal with other foods you eat. Our goal is to provide you with the information you need to make sensible decisions about balance, variety and moderation in your diet." I presume that the data are accurate, at least on average for each menu item.

The data consist of measurements of 12 dietary characteristics for 82 of the menu food items available in the USA (excluding the drinks), not all of which are necessarily available in other countries. The data for each characteristic are listed as "% Daily Value" based on a 2,000 calorie diet, which thus standardizes the data to the same scale across all of the variables, thereby making them directly comparable. The characteristics are:
  • Calories
  • Total Fat
  • Saturated Fat
  • Carbohydrates
  • Cholesterol
  • Dietary Fiber
  • Protein
  • Sodium
  • Vitamin A
  • Vitamin C
  • Calcium
  • Iron
Sugar values are also listed in the original document, but there is no FDA recommended daily allowance available for sugar, and so these cannot be included in my analysis. The FDA argument is that we get enough calories out of the fat and protein that we eat, and so we don't actually need any extra sugar in our diet.

The analysis

I have analyzed these data using the manhattan distance and a neighbor-net network. The result is shown in the figure. Menu items that are closely connected in the network are similar to each other based on their dietary characteristics, and those that are further apart are progressively more different from each other.

Click to enlarge.

The network is basically a blob, with three side branches. The blob (with menu items labelled in black) represents what we could call "typical" McDonald's products, while the side-branches (with the labels in various colours) represent more extreme products, with either more or less of some of the measured characteristics. Basically, the menu items are increasingly "bad for you" at right and better for you at the very bottom.

I have numbered parts of the side-branches as 1–7 in the figure, and coloured the food names, to indicate various groups that are worth discussing. These groups can be described as follows:
  1. The dark red menu items to the right of (1) include the two Big Breakfast with Hotcakes, each of which provides 55% of your daily calorie needs (while most of the menu items to the left provide <40% each), and 40% of your iron requirements.
  2. The menu items to the right of (2) (in red or dark red) include all four of the Big Breakfasts, which each provide nearly 200% of the recommended daily intake of cholesterol (most of the other items provide < 90%), >60% of the total fat requirement, and >65% of your sodium needs (ie. salt).
  3. The items to the right of (3) (in orange, red or dark red) each provide >35% of your calorie needs and >60% of the total fat requirement, while those to the left provide less.
  4. The purple items below (4) include all of the menu items with added egg, and so these have the next highest cholesterol levels after the four Big Breakfasts, at 80-95% of your daily requirements (the other menu items have <50%).
  5. The four menu items to the left of (5) (in light green) are unique in containing >130% of your vitamin C needs, while the other items each provide <35%.
  6. The items to the left of (6) (in blue or light green) include all of the fruit-based items and potato-based items (fries, hash browns) plus the chicken nuggets and bites. These are distinguished by having relatively low values of several characteristics, including those that are bad for you (total fat, saturated fat, cholesterol, and sodium) and those that are good for you (protein and calcium).
  7. The green items below (7) include all nine of the Premium Salads (but not the Side Salad!). They each provide 160% of your vitamin A requirements (the other items each contain <20%), 25-35% of the vitamin C (most of the others contain <15%), and >15% of your dietary fiber.
The conclusions

This means that, for a healthy diet, you should steer well clear of the four "Big Breakfast" menu items, as they are very extreme even by McDonald's standards — you will need to take part in a great deal of strenuous exercise, to burn off their calories, cholesterol, fat and salt. The "Premium Salads", on the other hand, are extreme by having much less of these "bad for you" characteristics than is usual for McDonald's.

Unfortunately, while the the fruit concoctions (in light green and light blue) look good in the network, they also have more sugar (20-30 g each) than any of the other menu items (except the Cinnamon Melts), and therefore not everyone is a fan of them (eg. Mark Bittman; see also the MNN blog). Incidentally, it is also worth pointing out that fast-food in the USA is often much saltier than it is elsewhere in the world (see the article in Health magazine).

Finally, you might also like to compare the network locations of the menu items to William Harris' 10 Most Popular McDonald's Menu Items of All Time (not all of which have been included in my analysis):
  1. French Fries
  2. Big Mac
  3. Snack Wrap
  4. Happy Meal
  5. Egg McMuffin
  6. Apple Dippers and Baked Apple Pie
  7. Chicken McNuggets and Chicken Select Strips
  8. Premium Salads
  9. Double Cheeseburger
  10. McGriddles Breakfast Sandwich

Wednesday, December 19, 2012

Networks can outperform PCA ordinations in phylogenetic analysis


In a previous blog post I noted that there are potentially Distortions and artifacts in Principal Components Analysis analysis of genome data, in which the Principal Components Analysis (PCA) ordination axes are not independent of each other. This happens when the samples form one or more monotonic gradients, in which case higher-dimensional axes of the ordination plot are curved and twisted relative to the lower axes. The so-called Arch Effect is the best-known example of this problem. The danger is that the ordination diagrams will be interpreted as showing biological patterns when these artifactual patterns are not in the original datasets.

This mathematical artifact can be dealt with by: (i) mathematically correcting the distortion; (ii) using a Non-metric Multidimensional Scaling (NMDS) ordination instead; or (iii) using a phylogenetic network to display the multidimensional data, rather than using an ordination. Here, I illustrate this third possibility.

Network analysis of a 1-dimensional gradient

In the previous blog post I used the following small dataset as an example of a genetic dataset with a 1-dimensional pattern of relationships among the taxa. Here the taxa form a series of overlapping groups, defined by the pattern of Cs.

The diagonal dataset.

In the previous post I also illustrated the PCA ordination of this dataset, which shows the classic Arch Effect, rather than the expected linear arrangement of samples in the scatter plot. Below is a table showing whether analyses using two different types of data-display network can effectively display all of the taxon groups for datasets of this type but with varying numbers of taxa. All analyses were performed using the SplitsTree program.


As you can see, the NeighborNet analysis successfully displays all of the relevant bipartitions in all of the datasets. On the other hand, the related Neighbor-Joining tree only succeeds up to a taxon size of six. Six taxa can be represented by a simple linear tree, which can display three non-trivial bipartitions plus six trivial partitions, which is all that is needed when each taxon consists of five Cs. (Note that the NeighborNet and the Median network are identical to this tree, as well, for six taxa or less.)

The next two figures show the NeighborNet graphs for taxon sizes of 10 and 20, respectively, as examples.

NeighborNet of the diagonal dataset for n=10.

NeighborNet for the diagonal dataset for n=20.

The table also shows that the Median network analysis is only successful for small numbers of taxa. For greater numbers of taxa, some of the bipartitions are not displayed, as listed in the table. For example, for a taxon size of 10, the split {2,3,4,5,6}{1,7,8,9,10} is not displayed, as illustrated in the next figure.

Median network for the diagonal dataset for n=10.

For even larger numbers of taxa, the Median network superimposes some of the taxa, even though they have distinct sequences, and thus it cannot display all of the bipartitions, by definition. For example, for a taxon size of 20, taxon_9 and taxon_10 are superimposed, as shown in the next figure.

Median network for the diagonal dataset for n=20.

Network analysis of a 2-dimensional gradient

In the previous blog post I used the following dataset as an example of a genetic dataset with a 2-dimensional pattern of relationships among the taxa. Here the taxa form two series of overlapping groups, one defined by the pattern of Cs and the other by the pattern of Ts.

The grid dataset.

In the previous post I also illustrated the PCA ordination of this dataset, which shows a complicated 3-dimensional version of the 2-dimensional pattern. On the other hand, the Median network perfectly reproduces the original 2-dimensional grid arrangement of the taxa, as shown next, so that the network diagram is exactly as expected.

Median network for the grid dataset.

Sadly, the NeighborNet analysis does not fare quite so well, as it produces a much more complicated arrangement of the taxa. This arrangement does approximately represent the original grid-like pattern contained in the data, but it is presented in an unnecessarily convoluted way.

NeighborNet for the grid dataset.

Conclusion

It appears that data-display networks can be used to display multidimensional information, in the same manner as an ordination analysis. These networks try to display all of the relevant bipartitions in the data, rather than simply displaying a scatter plot of the samples.

However, there are considerable differences in the ability of the various network analyses to display the patterns inherent in the original data. In particular, NeighborNet seems to be very effective for displaying 1-dimensional patterns, while Median networks are more effective for 2-dimensional patterns.

Finally, it is worth noting that David Bryant has already thought of comparing NeighborNet with a Multi-dimensional Scaling (MDS) ordination, using an empirical dataset (2006, Radiation and network breaking in Polynesian linguistics. In: Forster P, Renfrew C (eds) Phylogenetic Methods and the Prehistory of Languages. McDonald Institute Press, University of Cambridge, pp 111-118). He found that they approximately agreed with each other.

Monday, December 17, 2012

The Future of Phylogenetic Networks: Videos


This note is just to inform everyone that video files of some of the presentations from the workshop The Future of Phylogenetic Networks are now online at a DailyMotion web page called Phylogenetic Networks. This page has been set up by Philippe Gambette, who also recorded the videos.

Those of you who were there can reminisce about what happened, and those of you who were not there can see what it was all about.


Wednesday, December 12, 2012

Distortions and artifacts in Principal Components Analysis for analysis of genome data


It is perfectly clear from examples in the genomics literature that there is a problem occurring with the use of Principal Components Analysis (PCA) ordinations, and that it is apparently going undetected by users. This problem involves a spurious second axis in the output graph that is a curvilinear function of the first axis. The danger is that the ordination diagrams will be interpreted as showing biological patterns (see François et al. 2010; Arenas et al. 2013), when these patterns are not in the original datasets. I illustrate this problem here using some very simple datasets, and point to some examples of it in the literature. I then suggest some possible solutions, including the use of phylogenetic networks.

Introduction

Genomic data are multivariate, in the sense that we have multiple characters (the nucleotide sequences) for multiple organisms. "Ordination" is the term used when we arrange the multivariate data in a rational order, which is then usually displayed as a scatter plot, with the points representing the organisms. The spatial relationships of the points on the graph then indicate how similar the organisms are to each other (based on their genomes) — close points are more similar to each other than are distant points, as shown by the example in the first figure.

Salmela et al. (2011), Figure 2d.
Each dot represents the genome of one Swedish person,
coloured according to their geographical location within northern Sweden.

There are many mathematical techniques for ordination, although they fall into two main groups: (i) eigenanalysis methods, and (ii) non-metric multidimensional scaling methods. These methods have a long history in some parts of biology, for example in ecology (Gauch 1982; Jongman et al. 1987; Kent & Coker 1992; Leps & Smilauer; Legendre & Legendre 2012), although they are virtually unknown in other parts. (Note: the eigenanalysis ordinations are sometimes confusingly called metric multidimensional scaling ordinations.)

The eigenanalysis method called Principal Components Analysis (PCA) was introduced by Patterson et al. (2006) to the study of genomic data, having previously been introduced into genetics by Menozzi et al. (1978)Cavalli-Sforza and Edwards (1964), and it has become quite popular in the literature since then. The basic idea is to summarize the data by calculating the correlation coefficient between each pair of organisms, and then to display as much as possible of the common pattern of variation along the first ordered axis of the graph, and as much of the remainder along the second axis, and so on. The different axes are expected to show independent patterns of genetic relationship among the organisms, with the first axis showing the most important pattern, the second axis the second-most important pattern, and so on.

However, ecologists have long known that the first two components of a PCA (and related eigenanalysis techniques such as correspondence analysis) often show an arching pattern, which is a mathematical artifact of the eigenanalysis rather than a real biological pattern in the original data (Gauch et al. 1977). This is most apparent when the sampling units cover a long ecological gradient and those samples at each end of the gradient have few species in common (Minchin 1987; see also this blog post). This is called the Arch Effect (or the Horseshoe Effect, or the Guttman Effect), in which the second axis is curved and twisted relative to the first axis, and thus the axes are not independent of each other.

The problem illustrated

Basically, the distortion involves the data pattern being displayed using one more axis than is needed for that data pattern. That is, a 1-dimensional pattern is displayed using 2 dimensions, instead, and a 2-dimensional pattern is displayed using 3 dimensions, etc. I will first illustrate this with a simple 1-dimensional pattern.

This first dataset consists of 20 nucleotide sequences. Each sequence consists of only two nucleotides (19 A and 5 C), and the pattern of Cs forms a simple gradient (ie. there is only one pattern in the data). This gradient can be displayed using only one ordination axis, which would arrange the taxa in order from Taxon_1 to Taxon_20. However, note that the patterns at the two ends of the gradient have nothing in common as far as the pattern of Cs is concerned.

The gradient dataset.

The next figure is the Principal Components Analysis (PCA) ordination diagram of these data. Note that the 1-dimensional pattern is displayed using 2 dimensions, arranging the 20 taxa in an arch rather than in a straight line. (Note that the open part of the arch can face either up or down; in this case it happens to face up.) So, it appears in the scatter plot that Taxon_1 and Taxon_20 are rather similar to each other (ie. they are near each other), which is not true. Furthermore, the dots are not equally spaced, which they should be given the pattern in the original data. The second dimension, along with the unequal spacing of the points, is a mathematical artifact of the eignenanalysis rather than a pattern in the data. It is caused by the unimodal distribution of the pattern of Cs along the gradient.

PCA ordination of the gradient data.
Not all of the taxa are labelled.

The second dataset consists of 24 nucleotide sequences. Each sequence consists of three nucleotides (8 A, 5 C and 4 T), and the pattern of Cs forms one gradient while the pattern of Ts forms a second gradient (orthogonal to the first one). These two independent patterns can be displayed using two ordination axes, which would arrange the taxa in a 6x4 rectangular grid. (If this is not clear, look at the bottom two figures of this post.)

The grid dataset.

This next figure is the PCA ordination diagram of these data. Note that the 2-dimensional pattern is displayed using 3 dimensions, only two of which are actually shown in the diagram. (It is standard to show only the first two axes, on the assumption that these contain most of the important information.) These axes arrange the 24 taxa in an arch in the first two dimensions, rather than in a straight line, and then separate pairs of taxa along the third axis. So, in two dimensions pairs of taxa appear to sit on top of one another, indicating a similarity that is not present in the original dataset. Once again, the use of the third dimension is an artifact of the eignenanalysis rather than a pattern in the data. It is caused by the unimodal distribution of the patterns of Cs and Ts along the two gradients.

PCA ordination of the grid data.

Clearly, when the samples form gradients then eigenanalysis is not an appropriate ordination technique, although it is very useful under many other circumstances. This problem with gradients has previously been pointed out for genomic data when they form a clinal geographical pattern (Novembre & Stephens 2008; Reich et al. 2008), but clearly the problem involves much more than just geographical patterns. Also, the problem is being ignored by users — most of the examples below were published after the paper by Novembre & Stephens.

Literature examples

The Arch Effect has appeared repeatedly in the genomics literature (often produced using the SmartPCA, Plink or Eigenstrat computer programs), and some examples are illustrated here. Note that the axes are not always in the same orientation, but the arch should be found on axis 2 relative to axis 1, and it can be oriented either up or down.

These first two examples come from the very paper that first introduced PCA to genomics studies.

Patterson et al. (2006), Figure 5.

Patterson et al. (2006), Figure 8.

These next two examples are from studies of genomic data from Scandinavia.

Salmela et al. (2011), Figure 2c.

Skoglund et al. (2012), Figure 1c.
Note that axis 1 is shown vertically.

The next two examples also come from worldwide genome studies. Note that the second of these refers to the ordination as metric multidimensional scaling (MDS).

Stoneking & Krause (2011), Box 1a.

Jakobsson et al. (2008), Figure 1d.

The next two examples also come from a worldwide genome study, but in this case the relationship of the Arch Effect to the geographical sampling is made explicit.

Wang et al. (2012), Figure 1b.

Wang et al. (2012), Figure 5b.

The next example illustrates rotation of the ordination axes, so that the arch is now aligned along the two axes.

Novembre and Ramachandran (2011), Figure 4e.

Finally, the Gene Expression blog has an example, where the ordination is called a metric multidimensional scaling (MDS) ordination. This is based on ~3,000 individuals and 250,000 SNPs.

Possible solutions

There are three possible solutions to the Arch Effect problem in eigenanalysis ordinations.

First, one can try to mathematically correct the distortion. This approach was actively pursued in ecology via what is known as detrending and rescaling (Hill & Gauch 1980), in which the axes are forced to be independent of each other after the eigenanalysis. This approach works, in the sense that the Arch Effect disappears, but it has the drawback that all relationships between the axes are removed, including those that require more than one dimension to be displayed (Palmer 1993). So, while this method is still commonly encountered in the ecological literature, it is not uniformly recommended.

This approach is illustrated in the next figure, which is a Detrended Correspondence Analysis (DCA, with detrending by second-order polynomials) ordination analysis of the second of my simple datasets above. Note that the original 2-dimensional grid arrangement of the taxa is perfectly preserved, so that the ordination diagram is exactly as expected.

DCA ordination of the grid data.

The second, and most obvious, solution to the problem is not to use an eigenanalysis ordination in the first place. That is, not all ordination techniques produce the same result. In this case we could use a Non-metric Multidimensional Scaling (NMDS) ordination, instead. The latter is a very different type of technique, mathematically, which explicitly tries to arrange the points in the ordination plot so that their relative distances reflect as closely as possible the pairwise "distances" in the original dataset (here, a distance is the inverse of a similarity). This technique is known to avoid the Arch Effect (Minchin 1987).

This is illustrated in the next figure, which is a NMDS ordination analysis of the second dataset. Note that the original 2-dimensional grid arrangement of the taxa is almost perfectly preserved. The main disadvantages of NMDS are the computational time required, the difficulty of finding the optimal solution for large datasets (eigenanalysis is significantly faster; Zheng et al. 2012), and the fact that the number of ordination axes needs to be specified beforehand.

NMDS ordination of the grid data.
Not all of the taxa are labelled.

The third solution to the problem is to use a phylogenetic network to display the multidimensional data, rather than using an ordination. Conceptually, this is similar to the NMDS type of ordination, but it uses a line graph to display the results rather than using a scatter plot. I have illustrated this in the next blog post (Networks can outperform PCA ordinations in phylogenetic analysis), where I report that this solution works quite well, too.

Conclusion

Every time you see an arch in a PCA plot, you might very well be looking at a mathematical artifact. You should bear this in mind when you interpret the ordination pattern, especially if you wish to draw biological conclusions from the diagram. You will, of course, see plenty of PCA ordinations without any evidence of an arch, and these are probably safe to interpret naively.

PCA can be useful when we expect genomes to be monotonically related to one another, but this is likely to be rather rare in nature. It is much more likely that genomes will have nonlinear relationships, such as unimodal patterns, with a peak at an intermediate part of any gradients that are present. Under these circumstances it produces artifacts, which may be mistakenly interpreted as representing real biological patterns. Under these circumstances, an alternative method of multivariate analysis should be used.

More information about ordination techniques can be found at the Ordination Methods web page.

Note: updated Friday December 21 2012.

References

Arenas M, François O, Currat M, Ray N, Excoffier L (2013) Influence of admixture and paleolithic range contractions on current European diversity gradients. Molecular Biology and Evolution 30: 57-61.

Cavalli-Sforza L, Edwards AWF (1964) Analysis of human evolution. Proceedings of the 11th International Congress of Genetics, The Hague, 1963. Genetics Today 3: 923-933.

François O, Currat M, Ray N, Han E, Excoffier L, Novembre J (2010) Principal component analysis under population genetic models of range expansion and admixture. Molecular Biology and Evolution 27: 1257-1268.

Hill MO, Gauch HG (1980) Detrended Correspondence Analysis: an improved ordination technique. Vegetatio 42: 47-58.

Gauch HG (1982) Multivariate Analysis in Community Structure. Cambridge University Press, Cambridge.

Gauch HG, Whittaker RH, Wentworth TR (1977) A comparative study of reciprocal averaging and other ordination techniques. Journal of Ecology 65: 157-174.

Jongman RHG, ter Braak CJF, van Tongeren OFR (1987) Data Analysis in Community and Landscape Ecology. Pudoc, Wageningen.

Kent M, Coker P (1992) Vegetation Description and Analysis: a Practical Approach. Belhaven Press, London.

Legendre P, Legendre L (2012) Numerical Ecology, 3rd edition. Elsevier, Amsterdam.

Leps J, Smilauer P (2003) Multivariate Analysis of Ecological Data using Canoco. Cambridge University Press, Cambridge.

Menozzi P, Piazza A, Cavalli-Sforza L (1978) Synthetic maps of human gene frequencies in Europeans. Science 201: 786-792.

Minchin PR (1987) An evaluation of relative robustness of techniques for ecological ordinations. Vegetatio 69: 89-107.

Novembre J, Ramachandran S (2011) Perspectives on human population structure at the cusp of the sequencing era. Annual Review of Genomics and Human Genetics 12: 245-74.

Novembre J, Stephens M (2008) Interpreting principal component analyses of spatial population genetic variation. Nature Genetics 40: 646-649.

Palmer MW (1993) Putting things in even better order: the advantages of Canonical Correspondence Analysis. Ecology 74: 2215-2230.

Patterson N, Price AL, Reich D (2006) Population structure and eigenanalysis. PLoS Genetics 2: e190.

Reich D, Price AL, Patterson N (2008) Principal component analysis of genetic data. Nature Genetics 40: 491-492.

Salmela E, Lappalainen T, Liu J, Sistonen P, Andersen PM, Schreiber S, Savontaus ML, Czene K, Lahermo P, Hall P, Kere J (2011) Swedish population substructure revealed by genome-wide single nucleotide polymorphism data. PLoS One 6: e16747.

Skoglund P, Malmström H, Raghavan M, Storå J, Hall P, Willerslev E, Gilbert MT, Götherström A, Jakobsson M (2012) Origins and genetic legacy of Neolithic farmers and hunter-gatherers in Europe. Science 336: 466-469.

Stoneking M, Krause J (2011) Learning about human population history from ancient and modern genomes. Nature Reviews Genetics 12: 603-614.

Wang C, Zöllner S, Rosenberg NA (2012) A quantitative comparison of the similarity between genes and geography in worldwide human populations. PLoS Genetics 8: e1002886.

Zheng X, Levine D, Shen J, Gogarten SM, Laurie C, Weir BS (2012) A high-performance computing toolset for relatedness and Principal Component Analysis of SNP data. Bioinformatics 28: 3326-3328.

Monday, December 10, 2012

Data enrichment in phylogenetics


Since this is post #100 in this blog, I thought that we might celebrate with something humorous. Since evolutionists often have a tough time, this post is about how to get more out of your phylogenetic analyses than you previously thought was possible.

In 1957, Henry R. Lewis published an article about The Data-Enrichment Method (Operations Research 5: 551-554). This method was intended "to improve the quality of inferences drawn from a set of experimentally obtained data ... without recourse to the expense and trouble of increasing the size of the sample data." This distinguishes the method from similarly named techniques, such as the likelihood method of Data Augmentation, which require actual data.

Clearly, such a method is of great interest to all empirical scientists, especially those without much grant money. Indeed, The Data Enrichment Method was immediately expanded by other interested parties (see Operations Research 5: 858-859, and 6: 136), who pointed out that it can be applied iteratively to great effect, and that it can be used to support an hypothesis and also its opposite.

The important requirements for the Data Enrichment Method are: (i) a nested set of data patterns, and (ii) an a priori expectation about what should be the answer to the experimental question. All scientists should have the latter, of course, since they are supposed to be testing the expectation by calling it an "hypothesis".

Most interestingly for us, phylogeneticists will often be able to meet requirement (i), as well, because their data often form a nested set, representing the shared derived character states from which a phylogenetic tree will be derived. I therefore once wrote an article examining the application of The Data Enrichment Method to phylogenetics, where it does indeed work very well. You do need at least some data to start with, and so it does not free you entirely from the inconvenience and embarrassment of uncontrollable empirical results.

This article appeared in 1992 in the Australian Systematic Botany Society Newsletter 71: 2–5. Since this issue of the Newsletter is not online, presumably no-one has read this article since then. However, you should read it, and so I have linked to a PDF copy of the paper:
A new method for increasing the robustness of cladistic analyses

After reading it, you might like to think about how to apply this method to phylogenetic networks. The mixture of horizontal gene flow with vertical descent breaks the simple nested data pattern of a phylogenetic tree, which complicates the application of data enrichment to networks.

Wednesday, December 5, 2012

How networks differ from bootstrapped trees

I have noted before (Networks and bootstraps as tree-support criteria) that data-display networks can produce quite different results from bootstrap values on phylogenetic trees. For example, a splits-graph assesses character support for alternative bipartitions of the dataset, whereas a bootstrapped tree assesses  support for those branches that appear only in the tree. These two data evaluations will often be congruent, but they can also differ notably. Here, I use a published dataset to illustrate the two ways in which they can differ.

The dataset is from: Wang N., Braun E.L., Kimball R.T. (2012) Testing hypotheses about the sister group of the Passeriformes using an independent 30-locus data set. Molecular Biology and Evolution 29: 737–750. There are 28 taxa and 25,700 aligned nucleotides.

I used the SplitsTree program to calculate (i) a Neighbor-Joining tree with 1,000 bootstrap pseudoreplicates, and (ii) a NeighborNet graph. In both cases the simple p-distance was used.

Assessment

The graph below shows the split weights (or edge lengths) for the 58 splits that were included in the NeighborNet graph. These form a collection of what are called circular splits, and it is important to note that this collection does not include all of the splits supported by the data. Those splits not in the NeighborNet graph are shown in green with a split weight of 0.00001 (rather than zero), to accommodate the log scale.

The graph also shows the bootstrap percentages for all of the splits in the NeighborNet graph plus all of those branches with a bootstrap frequency greater than 1/100.  Splits that did not appear in any of the bootstrap pseudoreplicates are shown in pink.


Those splits / branches where there is an approximate agreement between the tree and the network are shown in blue. There is a roughly s-shaped relationship between the split weights and the bootstrap percentages, so that an increase in one is associated with an increase in the other.

However, in the range of the graph where there is 100% bootstrap support there are 8 splits (in pink) with a large split weight but 0% bootstrap support. These are splits that contradict at least one better-supported split. The better-supported splits appear as branches in the NeighborJoining tree at the expense of these 8 splits. This is the limitation of a tree representation, as it cannot accommodate alternative patterns, no matter how well-supported they are by the character data.

The important thing to realize is that these splits cannot appear in any bootstrap pseudoreplicate, because they are out-weighed in the character resampling performed by the bootstrap procedure. For each bootstrap pseudoreplicate the resampled data are forced into a tree, and thus all contradictory splits are ignored each time, no matter how well-supported they are. These splits therefore get 0% bootstrap support even though there is considerable character support for them.

Equally importantly, there is one edge that appears in the bootstrap assessment with high support (87%) but which does not appear in the NeighborNet graph at all, shown in green in the graph. The first step of the NeighborNet algorithm decides on a set of circular splits, and only these splits will appear in the splits graph, no matter how well-supported other splits might be.

In this example, there is a nested set of taxa that appears in the tree ((13,14)((15,16)17)), but the NeighborNet finds greater support for several contradictory partitions such as {16,17,27,28} and {15,16,17,27}, and thus cannot display the partition {15,16,17}, although it can accommodate the other three partitions: {13,14}, {15,16} and {13,14,15,16,17}.

So, the nested set gets high bootstrap support simply because it fits onto a tree. That is, given the partitions {13,14}, {15,16}, and {13,14,15,16,17}, all of which are well supported by the character data, then {15,16,17} will be supported as well because it fits neatly onto the tree, irrespective of the strength of its character support (the location of 17 is not well supported no matter where it is on the tree).

Conclusion

Trees and networks will often be in agreement, especially if the data are very tree-like. However, they can differ in two ways: (i) the network may show well-supported character patterns that are not included in the tree, and (ii) the tree may show well-supported branches that are not accommodated by the network-building algorithm. Well-supported branches on a tree are not necessarily well-supported by the character data, and absence of a branch from a tree does not necessarily mean that it has little character support.

Monday, December 3, 2012

Faux phylogenies


It is, of course, possible to produce a phylogeny of any group of objects that vary in their intrinsic characteristics, and where those characteristics can be inferred to vary through time. One popular subject is the Tree of Life, but where "life" is defined in rather a loose fashion. Here are a few examples of what I mean.

Note: there is a follow-up post (Faux phylogenies II), as well.

Legendary figures

This first one is reproduced from pages 90-91 of Bart Simpson's Guide to Life (1993, Harper Collins). It includes a series of somewhat legendary figures; and newts apparently evolved twice.

Bart Simpson's Tree of Life
© Matt Groening

Cartoon animals

Mike Keesey, from the Three-Pound Monkey Brain blog, has an example of his own, in which he took a stab at a phylogeny of cartoon animals (it is the third phylogeny on that blog page, or click the image below).

Mike Keesey's
phylogeny of cartoon animals

Kalle Anka

In 1993, the cartoonist Don Rosa produced a genealogy of Donald Duck and his family, intended to resolve decades of contradictions among the comic-book stories.

Don Rosa's Donald Duck genealogy
characters © Disney

There are many other versions of this genealogy, most of which are linked at this page. The first one on that page is the most detailed and complete version of the family tree.

Monstrasinu

The July-August 2012 edition of the Annals of Improbable Research (vol. 18, no. 4, pp. 15-17) contained an article by Matan Shelomi, Andrew Richards, Ivana Li and Yukinari Okido called "A phylogeny and evolutionary history of the Pokémon". Below is a low-resolution image, but a much higher resolution version of the phylogeny (2.9 MB) can be found here.

Pokémon phylogeny
characters © Nintendo

Asian lóng and Eurasian dragons

At the same time, Rob Colautti produced a t-shirt design from his phylogeny of dragon-like organisms, which is based on a neighbor-joining analysis of 27 distinct traits for 76 pieces of historical artwork.

Rob Colautti's dragon phylogeny

According to his Facebook page, he is intending to publish this phylogenetic analysis in the afore-mentioned Annals of Improbable Research, and to make the dataset available online, as well.