Wednesday, March 13, 2013

Topological restrictions: some comments


In a recent post (Different topological restrictions of rooted phylogenetic networks. Which make biological sense?), Leo and Steven synthesized a lot of the issues that have been raised in recent years, both in this blog and in the literature, about the underlying models for rooted phylogenetic networks. This is an excellent summary (and explanation) of the current situation; please read it if you have not already done so.

The key issue for biologists is: what are we trying to model biologically? In one sense the answer is trivial: "evolutionary history". However, this answer is uninformative, in the sense that it is too broad to be practical. We do not know enough about evolutionary history for it to be obvious which model(s) we should use, and we presumably never will know (we were not there to study it, and it happens too slowly to study it now).

Nevertheless, we need to make decisions about models, and in many cases quite detailed decisions. Therefore, we need public discussion about the various possible models that have been suggested, as well as suggestions for new models. Unfortunately, the impetus for this has almost always come form the computational side rather than the biological side, if only from practical necessity — a mathematician cannot proceed without a model. (A biologist can't either, but they seem to be much more comfortable with vague word models rather than quantitative mathematical ones.)

So, I will provide an initial response to Leo and Steven's points, as a staring point for discussion, in the order in which they raised them. Hopefully, someone else will have something to say, as well.

Rooted

If there is assumed to be a single origin of life, and evolution is assumed to be acyclic, then there will always be at least one root in every evolutionary network. Indeed, if we are studying a monophyletic group of organisms then there will always be precisely one root. But if we are studying a polyphyletic group then technically there are multiple roots, in the sense that we cannot reconstruct the most recent common ancestor of the group. In practice, we have ignored this issue, and simply reconstructed the ancestor anyway, as best we can, since that is what the current tree-building algorithms do. In this sense, the only difference between an evolutionary tree and an evolutionary network is the complication caused by horizontal gene transfer, in which we might consider a species' genome to be polyphyletic, as well.

So, a single artificial root will probably suffice — see Steven's earlier comment: "the multiple-root situation can probably be reduced to the single-root situation by having some kind of high-degree (i.e. unrefined) artificial root which is the parent of the real roots."

Acyclic

Acyclicity is an obvious criterion for evolution, because a descendant cannot be its own ancestor. So, in a network we should treat this as "sacred" — an evolutionary network in its final form cannot show any directed cycles.

This does not, I guess, preclude algorithms that do not themselves require acyclicity, as biology always violates mathematical assumptions (eg. normality, homoscedasticity, etc). However, I think that we should require the output to be acyclic.

This is not necessarily an issue for reconciliation between gene trees and species trees, as raised in the original blog post. In this scenario the resulting phylogeny is a tree, and therefore cannot show cycles, by definition.

Time-consistent

Time consistency is a stronger form of acyclicity, in the sense that the acyclic parts of a network are also restricted by time's arrow. It is another obvious criterion for evolution, because a descendant cannot reticulate with its predecessors.

However, this requirement is confounded by incomplete taxon sampling, as noted by Leo and Steven. To deal with this, we can allow "ghost" lineages representing hypothesized missing taxa. This does, nevertheless, raise the issue of where we draw the line. Any two trees can be reconciled by allowing enough ghost lineages, just as is done by algorithms for the gene duplication-loss problem, where minimizing the numbers of unobserved duplications and losses is the optimality criterion. Similarly, how many ghost lineages should we allow in a reticulating network? Perhaps we should be minimizing them, too.

There is also the more fundamental issue of whether time-consistency should actually be a requirement at all. It has been pointed out a number of times that this is not a requirement for anthropology, for example, where transfer of non-genetic information through time is not only possible but quite common. This has been discussed in earlier blog posts.

So, in a general sense we cannot make time consistency a mandatory requirement of evolutionary networks. However, we could do so for strictly biological networks, where information transfer occurs with gene flow, which must follow time's arrow.

Degree-restricted

Nodes with indegree 1 and outdegree 1 are usually considered to be unnecessary, because they do not represent anything more than would an edge (or arc). The only times they have been seriously suggested are when placing known fossils onto a tree, and the fossil is claimed to represent an ancestor of one of the nodes. However, this is considered to be inappropriate, because we can never be sure that any specified fossil is actually an ancestor, as opposed to being a sister to one of the ancestors (see the blog post at Evolving Thoughts).

Indegree >2 has also been considered to be problematic in phylogenetics. If a sexually reproducing organism has two parents, then theoretically we cannot observe indegree>2. In practice, however, evolutionary events may occur over a short enough time-scale that we cannot distinguish a series of individual indegree-2 events, so that for practical purposes a network might end up showing indegree >2. This is analogous to "soft polytomies" in trees, where outdegree >2 represents uncertainty.

Tree-child, Tree-sibling, Reticulation-visible

These network restrictions all refer to so-called "invisible" nodes (nodes from which all paths end up in reticulations). All nodes in a tree are visible, and so there is no analogous situation in tree-building from which we might derive a response (as I have done several times above).

The basic issue with invisible nodes is restricting their occurrence. Theoretically, we could add an infinite number of invisible nodes to a network, but few, if any, of them would be based on any inference from the data. So, tree-child networks ban them entirely, whereas tree-sibling and reticulation-visible networks allow them under particular circumstances.

It is difficult for me to see exactly how I should interpret invisible nodes biologically. How can I reconstruct their features, for example? Invisible nodes tend to appear when combining multiple trees into a network, and so they are not usually constructed directly from character data. Biologically, they might exist, but there seems to be little reason to consider them as well-supported inferences, as we do for visible nodes.

In this sense, they have much more of the feel of mathematical artifacts than do visible nodes.

Galled trees, Galled networks, Level-k

These are all artificial restrictions on networks that do nothing more than limit the complexity (and thus make the algorithms more tractable). There seems to be no good biologicals basis for using any of these criteria as restrictions (as opposed to using them as network descriptors).

There is, however, a good logical basis for their use — simpler networks are to be preferred over more complex ones, at least as working hypotheses for evolutionary history. The problem with this attitude seems to be that sometimes simpler networks look less biologically realistic than do more complex ones (eg. level-1 networks might spread the reticulations across the network whereas level-3 networks can concentrate them in one place).

Regular and normal

These always seem to me to be of mathematical interest for characterizing networks, rather than being of any biological interest. But maybe that is just my ignorance.

Directed Acyclic Graph or tree-with-edges-added?

To me , this is one of the BIG questions. Until biologists come to grips with this, I do not see how the computational people are going to make the helpful contributions that they obviously desire (or, at least, the ones I know). I have thought (and read) about this a lot, and I have come to the (perhaps unfortunate) conclusion that the answer is: "both".

From the mathematical point of view, the problem is simply that certain network topologies will not be considered when we add reticulation nodes to a tree. So, we automatically exclude those topologies if our algorithm starts with a tree. From the biological point of view, the issue is whether we consider evolutionary history to be approximately tree-like or whether it is fundamentally reticulated.

My reading of the literature is that those people dealing with hybridization in eukaryotes are leaning towards the "tree obscured by vines" model, whereas those dealing with horizontal gene transfer in prokaryotes prefer the "anastomosing plexus" model. Perhaps it is too soon to tell whether this dichotomy will continue; but, certainly, discussion about the hybridization issue goes back to the early 1980s, and since then the opinion about the fundamentally tree-like nature of history has been repeatedly expressed. Moreover, those people using reduced median networks and median joining networks for mitochondrial data seem to be interested in simplifying their initially complex network as much as possible, and then interpreting the resulting network in terms of a few conflicting parsimony trees; I interpret this as a preference for trees over networks.

Of course, preferring a reticulated tree model may just be historical inertia, whose importance in the daily lives of scientists should not be underestimated.

No comments:

Post a Comment