In this blog we have always championed the use of Exploratory Data Analysis prior to phylogenetic analyses. This approach explores the characteristics of the data before making formal inferences about possible evolutionary scenarios. One of the reasons for doing this is the possibility of data errors. That is, we need to distinguish between estimation errors deriving from our experimental procedures and real biological scenarios, because both of these will result in complex patterns in our data.
One possible classification of the potential causes of complex data patterns in phylogenetics is this:
(i) incorrect data
— inadequate data-collection protocol
— poor laboratory / museum / herbarium technique
— lack of quality control after data collection
(ii) inappropriate sampling
— distant outgroup
— rapid evolutionary rates
— short internal branches
(iii) model mis-specification
— wrong assessment of primary homology
— wrong substitution model
— different optimality criteria
— deep coalescence
— horizontal gene transfer
— genome fusion
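Problems in group (i) are usually the easiest to screen for before any tree or network is built. As a minimal sketch only (plain Python; the function names, the set of characters treated as missing, and the 50% threshold are all illustrative assumptions, not taken from any particular package), one might tabulate the proportion of missing data per sequence and each taxon's mean uncorrected distance to the others, and then look at the outliers:

```python
import math
from itertools import combinations

MISSING = set("N?-")  # assumption: characters counted as missing data


def missing_fraction(seq):
    """Fraction of sites in one aligned sequence that are gaps or unknown."""
    return sum(ch in MISSING for ch in seq.upper()) / len(seq)


def p_distance(a, b):
    """Proportion of differing sites, over sites that are known in both sequences."""
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper())
             if x not in MISSING and y not in MISSING]
    if not pairs:
        return float("nan")  # no comparable sites at all
    return sum(x != y for x, y in pairs) / len(pairs)


def screen_alignment(alignment, max_missing=0.5):
    """Return (missing fraction, mean p-distance, flagged taxa) for an alignment.

    `alignment` maps taxon name -> aligned sequence string.
    `max_missing` is an arbitrary illustrative threshold.
    """
    if len({len(s) for s in alignment.values()}) != 1:
        raise ValueError("sequences are not aligned to a common length")
    missing = {t: missing_fraction(s) for t, s in alignment.items()}
    dists = {t: [] for t in alignment}
    for t1, t2 in combinations(alignment, 2):
        d = p_distance(alignment[t1], alignment[t2])
        if not math.isnan(d):
            dists[t1].append(d)
            dists[t2].append(d)
    mean_dist = {t: (sum(v) / len(v) if v else float("nan"))
                 for t, v in dists.items()}
    flagged = [t for t in alignment if missing[t] > max_missing]
    return missing, mean_dist, flagged
```

A sequence that is mostly missing data, or one whose mean distance is far from the rest, is not necessarily wrong. It is simply the first place to look before invoking a biological explanation.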
The scientific literature contains a number of prime examples in which a claim of biological complexity has subsequently been questioned, and attributed to estimation error instead.
For example, many of you will have noted the recent attention given to the release of genome sequences from the tardigrades, a group of microscopic animals often said to be the most resistant in the world to extreme environmental conditions. Two rival papers have appeared:
Thomas C. Boothby et al. (2015) Evidence for extensive horizontal gene transfer from the draft genome of a tardigrade. Proceedings of the National Academy of Sciences of the USA 112: 15976–15981.
Georgios Koutsovoulos et al. (2015) The genome of the tardigrade Hypsibius dujardini. BioRxiv preprint 33464. [Now published as: Georgios Koutsovoulos et al. (2016) No evidence for extensive horizontal gene transfer in the genome of the tardigrade Hypsibius dujardini. Proceedings of the National Academy of Sciences of the USA]
The former paper attributes the observed phylogenetic complexity to horizontal gene transfer, one of the biological processes at the end of the list above, while the latter attributes it to sequencing errors (group i). This situation is discussed in more detail elsewhere on the web.
The current situation potentially has unfortunate consequences. For example, there are continual comments, particularly from zoologists, that horizontal gene flow is rare, even though there is a large amount of evidence to the contrary. If strong claims of gene flow turn out to be erroneous, then situations like the current one can only add fuel to this argument. There is no quantitative basis for an assertion that gene flow is rare in zoology: those who have looked for reticulate evolution in animals have found it, and those who haven't haven't.
In the end, data-display networks are useful for displaying incongruent data patterns, but the source of the incongruence needs to be identified before these networks are turned into evolutionary networks (either explicitly drawn or verbally implied).
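One simple way to quantify how much incongruence a distance matrix contains, before deciding what caused it, is the quartet delta score used in delta plots. For each set of four taxa, the three possible pairings of distances are summed and sorted; on a perfect tree the two largest sums are equal, so the score is 0, and it rises towards 1 as the quartet becomes more box-like. The following is a minimal sketch in plain Python, where the helper names and the nested-dictionary layout for the distances are my own assumptions for illustration:

```python
from itertools import combinations


def symmetric(pairs):
    """Build a nested dict d[a][b] == d[b][a] from {(a, b): distance}."""
    d = {}
    for (a, b), dist in pairs.items():
        d.setdefault(a, {})[b] = dist
        d.setdefault(b, {})[a] = dist
    return d


def quartet_delta(d, x, y, u, v):
    """Delta score of one quartet: 0 if tree-like, up to 1 if maximally box-like."""
    sums = sorted([d[x][y] + d[u][v],
                   d[x][u] + d[y][v],
                   d[x][v] + d[y][u]])
    spread = sums[2] - sums[0]
    if spread == 0:
        return 0.0  # all three sums equal (star-like quartet)
    return (sums[2] - sums[1]) / spread


def mean_delta(d):
    """Average delta score over all quartets of taxa in the distance dict."""
    scores = [quartet_delta(d, *q) for q in combinations(sorted(d), 4)]
    if not scores:
        raise ValueError("need at least four taxa")
    return sum(scores) / len(scores)


# Distances that fit the tree ((A,B),(C,D)) exactly
tree_like = symmetric({("A", "B"): 2, ("C", "D"): 2, ("A", "C"): 3,
                       ("A", "D"): 3, ("B", "C"): 3, ("B", "D"): 3})
# The same quartet with conflicting signal between two of the pairings
box_like = symmetric({("A", "B"): 2, ("C", "D"): 2, ("A", "C"): 3,
                      ("B", "D"): 3, ("A", "D"): 4, ("B", "C"): 4})

print(mean_delta(tree_like), mean_delta(box_like))
```

Note that a high score only says that the data are not tree-like. It says nothing about whether the cause is data error, inappropriate sampling, model mis-specification, or a real reticulation event, which is exactly the distinction that has to be made by other means.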