Monday, April 16, 2018

Networks in the news, at last


Phylogenetic networks do not always fare very well in the traditional media. The general public has enough trouble dealing with a phylogenetic tree, let alone networks. For example, many people still believe that Darwin claimed that monkeys are our ancestors (a chain-based relationship) rather than our cousins (a tree-based relationship) — who knows what they must think about humans inter-breeding with Neandertals (a network-based relationship).

Nevertheless, a few news reports about a recent network-based paper have suggested that the situation might be improving.


The paper in question is:
Úlfur Árnason, Fritjof Lammers, Vikas Kumar, Maria A. Nilsson, Axel Janke. Whole-genome sequencing of the blue whale and other rorquals finds signatures for introgressive gene flow. Science Advances 4: eaap9873. 2018.
This paper details extensive genomic admixture among six species of baleen whales. The phylogenetic scenarios involving gene flow cannot be represented by a tree, of course, so the authors include the following set of networks (along with a median network).


News reports on this paper have appeared in at least two places; they discuss the difference between networks and "Darwinian trees", and do quite a good job of it.

For example, this quotation is from the New York Times ("Baleen Whales intermingled as they evolved, and share DNA with distant cousins"):
The relationships are so complicated, however, that the senior researcher Axel Janke said "family tree" is too simple a metaphor. Instead, the species, all part of a group called rorquals, have evolved more into a network, sharing large segments of DNA with even distant cousins. Scientists expressed surprise that there had been so much intermingling of baleen whales, given the variety of sizes and shapes.
This quotation is from Popular Science ("A new study on whales suggests Darwin didn't quite get it right"):
Evolutionary network analysis takes the tree metaphor and turns it into a complex web, which acknowledges the different kinds of familial connections shown by whole-genome sequencing. Comparing the whole genomes of rorquals shows that genetics is much more fluid than the Darwinian “tree” model, Janke says.
"Gene flow and hybridization is more common than biologists usually think," Janke says. Analysis of the rorquals’ genes shows that they've interbred in different ways at various times in their evolutionary history. This doesn't make much sense if you rely only on Darwin's model, where branches of the family tree never touch again after they separate.
I think that these give us all a reason for optimism.

Monday, April 9, 2018

The curious case(s) of tree-like matrices with no synapomorphies


(This is a joint post by Guido Grimm and David Morrison)

Phylogenetic data matrices can have odd patterns in them, which presumably represent phylogenetic signals of some sort. This seems to apply particularly to morphological matrices. In this post, we will show examples of matrices that are packed with homoplasious characters, and thus lead to trees with a low Consistency Index (CI), but which nevertheless have high tree-likeness, as measured by a high Retention Index (RI) and a low matrix Delta Value (mDV). We will also try to explore the reasons for this apparently contradictory situation.
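The matrix Delta Value can be computed directly from a distance matrix. Below is a minimal sketch of the quartet-based delta value, as I understand it from Holland et al. (2002): for each quartet of taxa, the three possible pairwise-distance sums are ordered (m1 ≥ m2 ≥ m3) and the quartet's delta is (m1 − m2) / (m1 − m3); the mDV is the mean over all quartets, with 0 indicating perfectly tree-like (additive) distances. The function and data names are my own, not from any particular software package.

```python
from itertools import combinations

def matrix_delta_value(D):
    """D: dict mapping frozenset({taxon_a, taxon_b}) -> pairwise distance."""
    taxa = sorted({t for pair in D for t in pair})
    d = lambda a, b: D[frozenset((a, b))]
    deltas = []
    for i, j, k, l in combinations(taxa, 4):
        # the three possible pairings of the quartet, largest sum first
        m1, m2, m3 = sorted([d(i, j) + d(k, l),
                             d(i, k) + d(j, l),
                             d(i, l) + d(j, k)], reverse=True)
        deltas.append(0.0 if m1 == m3 else (m1 - m2) / (m1 - m3))
    return sum(deltas) / len(deltas)

# Distances that fit a tree exactly (additive distances) give mDV = 0:
tree_like = {frozenset(p): v for p, v in [
    (("A", "B"), 2), (("A", "C"), 4), (("A", "D"), 5),
    (("B", "C"), 4), (("B", "D"), 5), (("C", "D"), 3)]}
print(matrix_delta_value(tree_like))  # -> 0.0
```

Real morphological matrices sit somewhere between 0 and 1; the values reported below were computed with dedicated software, not this toy function.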

Background

A colleague of ours was recently asked, when trying to publish a paper, to explain why there were low CI but high RI values in his study. This reminded Guido of a set of analyses he started about a decade ago, using an arbitrary selection of plant morphological matrices he had access to.

The idea of that study was to advocate the use of networks for phylogenetic studies using morphological matrices, based on the two dozen data sets that he had at hand. The datasets were each used to infer trees and quantify branch support, under three different optimality criteria: least-squares (via neighbour-joining, NJ), maximum likelihood, and maximum parsimony. This study was never wrapped up for a formal paper, for several reasons (one being that 10 years ago Guido had absolutely no idea which journal could possibly consider publishing such a paper, another that he struggled to find many suitable published matrices).

The signals detected in the collected matrices were quite different from each other. The set included matrices with very high matrix Delta Values (mDV), indicating non-treelike signal, as well as matrices with astonishingly low mDVs for morphological data. Equally divergent were the CI and RI of the inferred equally most-parsimonious trees (MPTs) and the NJ tree. The data for the MPTs and the primary matrices are shown in the first graph, as a series of scatterplots, where each axis covers the values 0–1. (Note: in most cases the NJ topologies are as optimal as the MPTs, and have similar CI and RI values.)


As you can see, the CI values (parsimony-uninformative characters not considered) are not correlated with either the RI or mDV values, whereas the latter two are highly correlated, with one exception.

The most tree-like matrix (mDV = 0.184, a value typically found for molecular matrices allowing for inference of unambiguous trees) was that of Hufford & McMahon (2004) on Besseya and Synthyris. The number of MPTs was undetermined: using a ChuckScore of 39 steps (the best value found in test runs), PAUP* found more than 80,000 MPTs with a CI of 0.39 (the third-lowest of all of the datasets), but an RI of 0.9 (the highest value found).

A strict consensus network of the 80,003 equally parsimonious solutions, the network equivalent of the commonly seen strict consensus tree cladograms. Trivial splits are collapsed. Colours are added solely for orientation (see next graph).

Oddly, the NJ tree had the same number of steps (under parsimony), but a much higher CI (0.69). The proportion of branches with a bootstrap support of > 50% was twice as large in the distance-based framework as under parsimony.

Bootstrap consensus networks based on 10,000 pseudoreplicates each. Left, distance-based and inferred using the Neighbour-joining algorithm; right, using a branch-and-bound search under parsimony as the optimality criterion (one tree saved per replicate). Edge lengths reflect branch support of sole or competing alternatives; alternatives found in less than 20% of the replicates are not shown; trivial splits are collapsed. Same colour scheme as above, for orientation.

The Neighbour-net based on this matrix has quite an interesting structure. Tree-like portions are clearly visible (hence the low mDV), but the branches are well-developed trunks rather than twigs. The large number of MPTs is mainly due to the relative indistinctness of many OTUs from each other.


Neighbour-net based on simple mean (Hamming) morphological distances. Same colour scheme as above.
This distance-based 2-dimensional graph captures all of the main aspects of the tree inferences and bootstrap analyses, with one notable exception: B. alpina, which is clearly part of the red clade in the tree-based analyses. We can see that the orange group, B. wyomingensis and close relatives, is (morphology-wise) less derived than the red species group. Although B. alpina is usually placed in a red clade, it represents a morphotype much more similar to the orange cluster, as it lacks most of the derived character suite that defines the rest of the red clade. In trees, B. alpina is accordingly connected to the short red root branch as the first-diverging "sister", with a very short to zero-length terminal branch; in the network, however, it is placed intermediate between the poorly differentiated but morphologically inhomogeneous oranges and the strongly derived reds — being a slightly reddish orange. This reddishness may reflect a shared common origin of B. alpina and the other reds, in which case the tree-based inferences show us the true tree. Alternatively, it may reflect a parallel derivation in a member of the B. wyomingensis species aggregate, in which case the unambiguous clade would be a pseudo-monophylum (see also our recent posts on Clades, cladistics, and why networks are inevitable and Let's distinguish between Hennig and cladistics).

Interpretation: what do low CI but high RI values stand for?

The distinction between the Consistency Index and the Retention index has been of long-standing practical importance in phylogenetics. For a detailed discussion, you can consult the paper by Gavin Naylor and Fred Kraus (The Relationship between s and m and the Retention Index. Systematic Biology 44: 559-562. 1995).

For each character, the consistency index is the fraction of changes in the character that are implied to be unique on the given tree (i.e. one change for each character state): m / s, where m = the minimum possible number of character-state changes on the tree, and s = the observed number of character-state changes on the tree. The ensemble consistency index for the dataset (CI) is the sum of the m values across all characters divided by the sum of the s values.

The retention index (also called the homoplasy excess ratio) for each character quantifies the apparent synapomorphy in the character that is retained as synapomorphy on the tree: (g - s) / (g - m), where g = the greatest amount of change that the character may require on the tree. Once again, the ensemble retention index for the dataset (RI) is calculated from the sums of the g, s and m values across all characters.
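The two ensemble indices can be sketched in a few lines of code from per-character (m, s, g) values. The example character values below are invented purely for illustration; real values come from the matrix and the tree.

```python
# Ensemble CI and RI following the definitions above:
# m = minimum changes, s = observed changes on the tree,
# g = maximum possible changes. Characters are (m, s, g) tuples.
def ensemble_ci(chars):
    return sum(m for m, _, _ in chars) / sum(s for _, s, _ in chars)

def ensemble_ri(chars):
    m_sum = sum(m for m, _, _ in chars)
    s_sum = sum(s for _, s, _ in chars)
    g_sum = sum(g for _, _, g in chars)
    return (g_sum - s_sum) / (g_sum - m_sum)

# Invented example: many extra steps (s >> m, hence a low CI), but the
# extra steps are few relative to the worst case g (hence a high RI) --
# the apparently contradictory pattern discussed in this post.
chars = [(1, 3, 10), (1, 2, 8), (2, 4, 12)]
print(round(ensemble_ci(chars), 2))  # -> 0.44
print(round(ensemble_ri(chars), 2))  # -> 0.81
```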

Both CI and RI are comparative measures of homoplasy — that is, the degree to which the data fit the given tree. However, CI is negatively correlated with both the number of taxa and the number of characters, and it is inflated by the inclusion of parsimony-uninformative characters. RI is less sensitive to these characteristics. However, RI is inflated by the presence of unique states in multi-state characters that have some other states shared among taxa and, therefore, are potentially synapomorphic.

It is these different responses to character-state distributions (among the taxa) that apparently create the situation noted above for morphological data. Neither CI nor RI directly measures tree-likeness, but instead they are related to homoplasy. So, it is the relative character-state distributions among the taxa that matter in determining their values, not just the tree itself.

For example, increasing the number of states per character will, in general, increase CI faster than RI. Increasing the number of states per character that occur in only one taxon will, in general, increase RI faster than CI.

Take-home message

This is just another example demonstrating that morphological data sets should not be used to infer (parsimony) trees alone, but should be analysed using a combination of Neighbour-nets and support Consensus Networks. No matter which optimality criterion is preferred by the researcher, the signal in such matrices is typically not trivial. It calls for exploratory data analysis, and for inference methods that are able to capture more than a trivial sequence of dichotomies.

Monday, April 2, 2018

Things you can learn in a blink about your data


As phylogeneticists, we commonly have to deal with data that we don't initially understand. In this post, I'll use a recently published 8-gene dataset on lizards to show how much can be learned prior to any deeper analysis, just from producing a few Neighbour-nets.

The data

Solovyeva et al. (Cenozoic aridization in Central Eurasia shaped diversification of toad-headed agamas, PeerJ, 2018) sampled species of toad-headed agamas (lizards) across their natural range (north-western China to the western side of the Caspian Sea), to study their genetic differentiation in time and space. To do so they used two datasets. The mitochondrial data covers four gene regions: coxI, cytB, nad2, and nad4, and are complemented by four nuclear gene regions: AKAP9, NKTR, BDNF, RAG1.

This caught my eye, because the authors' preferred trees have a bunch of low branch-support values, so this would be a good opportunity to advocate some Consensus networks. They also report only values above a certain threshold, as apparently recommended by several reviewers. My own reviewers have often recommended the same, but I have always ignored this — I believe we should give the actual value, because it makes a difference whether support is just below the threshold (e.g. a bootstrap support, BS, of 49) or non-existent (BS < 5). The authors also note that their mitochondrial and nuclear genealogies are not fully congruent. In short, the signal from their matrix is probably not trivial, but could be interesting.

In contrast to many other journals, PeerJ has a strict open-data policy. Solovyeva et al. provide each gene as a FASTA-formatted alignment in the Supporting Information. So let's have some quick-and-dirty Neighbour-nets.

Using Neighbour-nets to decide on an analysis strategy

A comprehensive outgroup sampling can avoid outgroup-rooting artefacts, but adding very distant outgroups comes at a price. We need to invest much more computational effort, because the inference programmes not only try to optimize our focus group, but the entire taxon set. Another principal question is: what can an outgroup taxon provide as information for rooting an ingroup, while being completely different? Furthermore, when we do an ML (or Bayesian) analysis, e.g. with RAxML, we leave it to the program to optimize a substitution model (even when we predefine a model, its parameters will usually be optimized by the inference software on the fly). By adding distant outgroups, we optimize a model for them plus our focus group — by not using any outgroup, we optimize a model suiting just the situation in our focus group.

Fig. 1 shows the Neighbour-net (uncorrected, codon-naive p-distances) for the first of the mitochondrial genes, coxI (the others are similar), and it tells us a lot about the data to be used for the tree inferences.

Fig. 1 Neighbour-net based on mitochondrial (coxI) uncorrected p-distances. The diffuse, non-treelike signal expressed in the A and B fans will be a hard nut for the tree inference, and will have little influence on questions dealing with the focal genus.
We can see that outgroup diversity is much higher than that of the focus group, and that most outgroup taxa are very distinct from the ingroup. Looking at the closest outgroups (Stellagama, Agama, Laudakia, Paralaudakia, Xenagama, Pseudotrapelus), we see that finding an unambiguous sister taxon to the focal genus will be difficult. And we can see that including more-distant taxa just gives the algorithm much more work (note the A and B bushes), but will hardly have any benefit for rooting the ingroup.
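For orientation, the uncorrected ("codon-naive") p-distance behind such a Neighbour-net is simply the proportion of differing sites between two aligned sequences. A minimal sketch (the handling of gaps and ambiguous bases is my own simplification, not necessarily what any particular program does):

```python
# Uncorrected p-distance: proportion of differing sites among all sites
# that are unambiguous in both aligned sequences.
def p_distance(seq1, seq2, missing="-?N"):
    pairs = [(a, b) for a, b in zip(seq1.upper(), seq2.upper())
             if a not in missing and b not in missing]
    if not pairs:
        return float("nan")
    return sum(a != b for a, b in pairs) / len(pairs)

print(p_distance("ACGTACGT", "ACGTACGA"))  # -> 0.125 (1 of 8 sites differs)
print(p_distance("ACGT-CGT", "ACGTACGA"))  # gap site ignored -> 1/7
```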

We also can see that the 3rd codon position is probably saturated to some degree, and that we will be dealing with a high level of stochasticity (randomly distributed mutation patterns) here — all terminal edges are long to very long. Since the same holds for the other three mitochondrial regions, it would not be a bad idea to do an additional inference including only the 1st and 2nd codon positions, if all taxa are to be included.
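Splitting an aligned coding gene into the two partitions suggested above (1st + 2nd vs. 3rd codon positions) is straightforward. A sketch, assuming the alignment starts in reading frame:

```python
# Split an in-frame coding alignment into 1st+2nd codon positions vs.
# 3rd codon positions (indices 0, 1 vs. 2 within each codon).
def split_codon_positions(seq):
    pos12 = "".join(c for i, c in enumerate(seq) if i % 3 != 2)
    pos3 = seq[2::3]
    return pos12, pos3

p12, p3 = split_codon_positions("ATGGCATTC")
print(p12)  # -> ATGCTT
print(p3)   # -> GAC
```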

Using Neighbour-nets to understand the basic signal properties of your data

Fig. 2 shows the Neighbour-net (again, uncorrected p-distances) for one of the nuclear genes, AKAP9. The outgroup sample is somewhat different, but we can immediately see that this gene has more power to infer unambiguous phylogenetic relationships among the sampled taxa — the graph has distinctly tree-like portions. We also see that saturation of the 3rd codon position is much less of an issue here, compared to the coxI gene (Fig. 1) — the terminal edges are comparatively short, with respect to the central edge bundles. [Nonetheless, it is never wrong to analyze coding gene data partitioned: 1st and 2nd codon positions vs. 3rd codon position.]


Fig. 2 Neighbour-net based on the nuclear (AKAP9) genetic distances. Note the much more treelike structure of the graph, the generally shorter terminal edges, and last-but-not-least the notable difference between ingroup (focal genus) and outgroup taxa.
For the general differentiation patterns, compare the minute extent of the focal group, green background in Fig. 2 vs. the prominent bush in Fig. 1. It is clear that including distant outgroups will not have any benefit. We may even consider reducing the outgroup sample (if one has to include an outgroup at all) to the two genetically closest genera Stellagama and Paralaudakia.

Similarly structured graphs are found for the other three nuclear genes.


Producing some quick Neighbour-nets doesn't hurt

Sometimes reviewers will pick on them — "distance-based phenetic method" is a complaint I used to get a lot. In this case, you can still produce them just to get some basic impressions of your data set. This will help you to understand the results of your tree inferences, including why some of your branches have ambiguous support.

It comes as little surprise that the taxa one can identify in these networks as likely sister genera of the focal genus come up as sister taxa in the explicit phylogenetic analyses done by Solovyeva et al. — e.g., their fig. 2, showing the combined mitochondrial tree, and their fig. 3, showing the combined nuclear tree.

Solovyeva et al. (2018) performed some incongruence tests (AU topology test) using single-gene inferences (going further than many other studies), but did not dig deeper. One of the authors answered my question about potential signal issues that may cause topological incongruence between ML and Bayesian trees, as well as ambiguous support, but he considers this to be solely a problem with methods — different algorithms prefer different phylogenies. Having looked at the basic differentiation patterns in the gene regions using Neighbour-nets, it may be more than just an issue with methods — ML and Bayesian analyses should always support the same splits when using the same or similar substitution models.

Like many other studies, the authors also use the data for Bayesian dating and dating-dependent biogeographic analysis. Lacking any ingroup fossils, the authors could only constrain nodes within the outgroup subtree, which are nodes far from those that they discuss and estimate. I have my doubts that we can put much faith in the uncorrelated clock process to handle such extreme differences between the focus group (ingroup) and the (constrained) outgroup-taxon lineages as seen in Fig. 2. Estimates for rate shifts between outgroup and ingroup usually render ingroup age estimates too young, compared to age estimates obtained with ingroup fossils. This is something that can be deduced directly from a graph like the one in Fig. 2.

Data and networks can be found at figshare

The original paper provides a comprehensive supplement with a lot of interesting information, but the FASTA files, each comprising a single gene region, have a few editing issues and are not ready to use as-is. Hence, I transformed them into NEXUS files, and generated a combined data matrix. The files and the Neighbour-nets for each gene region (and a full single-gene maximum likelihood analysis) can be found on figshare.

Monday, March 26, 2018

It's the system, stupid! More thoughts on sound change in language history


In various blog posts in the past I have tried to emphasize that sound change in linguistics is fundamentally different from the kind of change in phenotype / genotype that we encounter in biology. The most crucial difference is that sound sequences, i.e. the words or parts of words we use when communicating, do not manifest as a physical substance but — as linguists say — "ephemerally", i.e. through the air flow that comes out of the mouth of a speaker and is perceived as an acoustic signal by the listener. This is in strong contrast to DNA sequences, for example, which are undeniably somewhere "out there". They can be sliced and investigated, and they preserve information for centuries if not millennia, as the recent boom in archaeogenetics illustrates.

Here, I explore the consequences of this difference in a bit more detail.

Language as an activity

Language, as Wilhelm von Humboldt (1767-1835) — the boring linguist who investigated languages from his armchair while his brother Alexander was traveling the world — put it, is an activity (energeia). If we utter sentences, we pursue this activity and produce sample output of the system hidden in our heads. Since the sound signal is only determined by the capacity of our mouth to produce certain sounds, and the capacity of our brain to parse the signals we hear, we find a much stronger variation in the different sounds available in the languages of the world than we find when comparing the alphabets underlying DNA or protein sequences.

Despite the large variation in the sound systems of the world's languages, it is clear that there are striking common tendencies. A language without vowels does not make much sense, as we would have problems pronouncing the words or perceiving them at longer distances. A language without consonants would also be problematic; and even artificial communication systems developed for long-distance communication, like the different kinds of yodeling practiced in different parts of the world, make use of consonants to allow for a clearer distinction between vowels (see the page about Yodeling on Wikipedia). But, between both extremes we find great variation in the languages of the world, and this does not seem to follow any specific pattern that could point to any kind of selective pressure, although scholars have repeatedly tried to demonstrate it (see Everett et al. 2015 and the follow-up by Roberts 2018).

What is also important here is that, not only is the number of the sounds we find in the sound system of a given language highly variable, but there is also variation in the rules by which sounds can be concatenated to form words (called the phonotactics of a language), along with the frequency of the sounds in the words of different languages. Some languages tolerate clusters of multiple consonants (compare Russian vzroslye or German Herbst), others refuse them (compare the Chinese name for Frankfurt: fǎlánkèfú), yet others allow words to end in voiced stops (compare English job in standard pronunciation), and some turn voiced stops into voiceless ones (compare the standard pronunciation of Job in German as jop).

Language as a system

Language is a system which essentially concatenates a fixed number of sounds to sequences, being only restricted by the encoding and decoding capacities of its users. This is the core reason why sound change is so different from change in biological characters. If we say that German d goes back to Proto-Germanic *θ (pronounced as th in path), this does not mean that there were a couple of mutations in a couple of words of the German language. Instead it means that the system which produced the words for Proto-Germanic changed the way in which the sound *θ was produced in the original system.

In some sense, we can think metaphorically of a typewriter in which we replace one letter by another. As a result, whenever we want to type a given word in the way we know it, we will type it with the new letter instead. But this analogy would be too restricted, as we can also add new letters to the typewriter, or remove existing ones. We can also split one letter key into two, as happens in the case of palatalization, a very common type of sound change during which sounds like [k] or [g] turn into sounds like [tʃ] or [dʒ] when followed by front vowels (compare Italian cento "hundred", which was pronounced [kɛntum] in Latin and is now pronounced as [tʃɛnto]).
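As a toy illustration, palatalization can be written as a context-sensitive rewrite rule over the sound system: [k] becomes [tʃ] whenever a front vowel follows. Only this single step is modelled here; everything else about the words (e.g. the later changes to the Latin ending) is deliberately left alone.

```python
import re

# Context-sensitive rewrite: k -> tʃ before the front vowels i, e, ɛ.
# The lookahead keeps the conditioning vowel in place.
def palatalize(word):
    return re.sub(r"k(?=[ieɛ])", "tʃ", word)

print(palatalize("kɛntum"))  # -> tʃɛntum
print(palatalize("kantum"))  # unchanged: no front vowel follows the k
```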

Sound change is not the same as mutation in biology

Since it is the sound system that changes during the process we call sound change, and not the words (which are just a reflection of the output of the system), we cannot equate sound change with mutations in biological sequences, since mutations do not recur across all sequences in a genome, replacing one DNA segment by another one, which may not even have existed before. The change in the system, as opposed to the sequences that the system produces, is the reason for the apparent regularity of sound change.

This culminates in Leonard Bloomfield's (1887-1949) famous (at least among old-school linguists) expression that 'phonemes [i. e., the minimal distinctive units of language] change' (Bloomfield 1933: 351). From the perspective of formal approaches to sequence comparison, we could restate this as: 'alphabets change'. Hruschka et al. (2015) have compared sound change with concerted evolution in biology. We can state the analogy in simpler terms: sound change reflects systemics in language history, and concerted evolution results from systemic changes in biological evolution. It's the system, stupid!

Given that sound systems change during language history, the problem of character alignment (i.e. determining homology/cognacy) in linguistics cannot be solved directly with the same techniques that are used in biology, where the alphabets are assumed to be constant and alignments are supposed to identify mutations alone. If we want to compare sequences in linguistics, where the sequences are basically drawn from different alphabets, we need to find out which sounds correspond to which sounds across languages while at the same time trying to align them.

An artificial example for the systemic grounding of sound change

Let me provide a concrete artificial example, to illustrate the peculiarities of sound change. Imagine two people who originally spoke the same language, but then suffered from diseases or accidents that prevented them from producing their speech in the way they did before. Let the first person suffer from a cold, which blocks the nose, and therefore turns all nasal sounds into the corresponding voiced stops, i.e. n becomes a d, ng becomes a g, and m becomes a b. Let the other person suffer from the loss of the front teeth, which makes it difficult to pronounce the sounds s and z correctly, so that they sound like a th (in its voiceless and voiced forms, as in thing vs. that).


Artificial sound change resulting from a cold or the loss of the front teeth.

If we now let both persons pronounce the same words in their original language, they won't sound very similar anymore, as I have tried to depict in the following table (dh denotes the th in words like father, as opposed to the voiceless th in words like thatch).

No.   Speaker Cold   Speaker Tooth
1     bass           math
2     buzic          mudhic
3     dose           nothe
4     boizy          moidhy
5     sig            thing
6     rizig          ridhing

By comparing the words systematically, however, bearing in mind that we need to find the best alignment and the mapping between the alphabets, we can retrieve a set of what linguists call sound correspondences. We can see that the s of speaker Cold corresponds to the th of speaker Tooth, z corresponds to dh, b to m, d to n, and g to ng. Having probably figured out by now that my words were taken from the English language (spelling voiced s consistently as z), it is easy even to come up with a reconstruction of the original words (mass, music [= muzik], nose, noisy [= noizy], etc.).

Reconstructing ancestral sounds in our artificial example with help of regular sound correspondences.
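The two artificial sound changes can be sketched as ordered rewrite rules applied to the whole lexicon, which is exactly what makes the resulting correspondences regular. The sketch below reproduces rows 1, 2, 5 and 6 of the table, assuming the underlying spellings mass, muzic, sing and rizing (voiced s written as z, as above); the rule sets and their ordering are my own simplification.

```python
# Apply an ordered list of rewrite rules to a word. Order matters:
# digraphs (ng) and geminates (ss) must be rewritten before single
# letters, so that e.g. rizing -> rizig rather than rizidg.
def apply_changes(word, rules):
    for old, new in rules:
        word = word.replace(old, new)
    return word

cold = [("ng", "g"), ("n", "d"), ("m", "b")]      # blocked nose: nasals -> stops
tooth = [("ss", "th"), ("s", "th"), ("z", "dh")]  # lost teeth: s/z -> th/dh

for w in ["mass", "muzic", "sing", "rizing"]:
    print(w, "->", apply_changes(w, cold), "/", apply_changes(w, tooth))
# mass -> bass / math, muzic -> buzic / mudhic,
# sing -> sig / thing, rizing -> rizig / ridhing
```

Because the rules act on the system (every word passes through the same rules), undoing them word by word recovers the common ancestor — the logic behind the reconstruction above.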

Summary

Systemic changes are difficult to handle in phylogenetic analyses. They leave specific traces in the evolving objects we investigate that are often difficult to interpret. While it has long been known to linguists that sound change is an inherently systemic phenomenon, it is still very difficult to communicate to non-linguists what this means, and why it is so difficult for us to compare languages by comparing their words. Although it may seem tempting to compare languages using simple sequence-alignment algorithms, treating differences as if they resulted from mutations in biological sequences (see for example Wheeler and Whiteley 2015), this is basically an oversimplified approach.

Simple models undeniably have their merits, especially when dealing with big datasets that are difficult to inspect manually — there is nothing to say against their use. But we should always keep in mind that we can, and should, do much better than this. Handling systemic changes remains a major challenge for phylogenetic approaches, no matter whether they use trees, networks, bushes, or forests.

Given the peculiarity of sound change in linguistic evolution, and how well the phenomena are understood in our discipline, it seems worthwhile to invest time in exploring ways to formalize and model the process. During the past two decades, linguists have taken a lot of inspiration from biology. The time will come when we need to pay something back. Providing models and analyses to deal with systemic processes like sound change might be a good start.

References

Bloomfield, L. (1973) Language. Allen & Unwin: London.

Everett, C., D. Blasi, and S. Roberts (2015) Climate, vocal folds, and tonal languages: connecting the physiological and geographic dots. Proceedings of the National Academy of Sciences 112.5: 1322-1327.

Hruschka, D., S. Branford, E. Smith, J. Wilkins, A. Meade, M. Pagel, and T. Bhattacharya (2015) Detecting regular sound changes in linguistics as events of concerted evolution. Current Biology 25.1: 1-9.

Roberts, S. (2018) Robust, causal, and incremental approaches to investigating linguistic adaptation. Frontiers in Psychology 9: 166.

Wheeler, W. and P. Whiteley (2015) Historical linguistics as a sequence optimization problem: the evolution and biogeography of Uto-Aztecan languages. Cladistics 31.2: 113-125.

Monday, March 19, 2018

Comparing neighbour-nets and PCA graphs – the example of Mediterranean oaks


Distance matrices offer many avenues for exploring data. A common method is Principal Component Analysis (PCA). A much less common method is the use of Neighbour-nets. We have previously compared PCA and Neighbour-nets using theoretical data. In this post, I'll compare a PCA graph and the corresponding Neighbour-net using some empirical data.
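When the input is a distance matrix (rather than raw variables), the "PCA" is a classical multidimensional scaling, also known as Principal Coordinates Analysis: the squared distances are double-centred and the leading eigenvectors give the plotting coordinates. A minimal numpy sketch (not the code used in the paper discussed below):

```python
import numpy as np

# Classical MDS / PCoA: ordinate taxa from a pairwise distance matrix.
def pcoa(D, k=2):
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centring matrix
    B = -0.5 * J @ (D ** 2) @ J           # double-centred Gram matrix
    vals, vecs = np.linalg.eigh(B)        # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:k]      # keep the k largest
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))

# Four points spaced 1 apart on a line: the first axis recovers the line
# (centred coordinates, up to sign).
D = np.abs(np.subtract.outer(np.arange(4.0), np.arange(4.0)))
coords = pcoa(D, k=1)
print(np.round(coords.ravel(), 3))
```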

Genetic differentiation in Mediterranean oaks

In the paper by Vitelli et al. (2017), we explored the phylogeographic structuring of a group of Mediterranean oak species. The species represented the westernmost populations of one of the main Eurasian oak lineages: the evergreen Quercus section Ilex ("Ilex oaks"; see Denk et al. 2017 for an up-to-date classification of oaks; see also this figshare-spread-sheet). It was a follow-up study to the one by Simeone et al. (2016).

We found that one species, the most widespread (Quercus ilex), carries plastids of quite different origins. The 2016 paper identified three main plastid haplotypes in the Ilex oaks: the unique (within the entire genus) "Euro-Med" haplotype; the "Cerris-Ilex" haplotype shared with western Eurasian members of the (essentially deciduous) section Cerris, the sister clade of section Ilex (see Denk & Grimm 2010; confirmed by NGS SNP data, Hipp et al. 2015); and the "WAHEA" haplotype, an east-bound haplotype of section Ilex. Vitelli et al. aimed to characterise the range of these three main haplotypes throughout the four Ilex oak species found in the Mediterranean.

Figure 1 shows the two multivariate data analyses, along with a map of the sample locations.

Fig. 1 Phylogeographic structure of Quercus section Ilex around the Mediterranean (after Vitelli et al. 2017). a. PCA graph, and b. Neighbour-net based on the same inter-haplotype pairwise distance matrix. c. A map depicting the distribution of main haplotype groups labelled by Roman numerals: I haplotypes of the "WAHEA" lineage, II "Cerris-Ilex"-lineage, III–VI, subtypes of the "Euro-Med" lineage (cf. Simeone et al. 2016, fig. 1)

Regarding the overall diversification pattern, the PCA graph and the Neighbour-net show similar things. The "Euro-Med" lineage is the most diverse group, with four subgroups — two larger (and widespread) ones (haplotypes IV, V) and two rare ones (III, VI) only found in the Aegean region.
  • According to the PCA, haplotype III (colored olive) is intermediate between "Euro-Med" IV (blue) and the haplotype II (yellow), which represents another lineage of oak haplotypes, the Aegean/Northern Turkish "Cerris-Ilex" lineage. The same can be seen in the Neighbour-net.
  • The PCA further places haplotype VI (red) as equidistant to all of the other types, with IV and I (green; representing the oriental "WAHEA" lineage) being a bit closer. In the Neighbour-net, we can sum up the length of the connecting edge-bundles to find the same pattern. A difference between the two analyses is that VI is connected only with part of V (purple) by a pronounced edge bundle, but not connected to I (green). This is strikingly different from III, which shares an edge bundle with II and IV+V.

At this point in the analyses, we can exploit the dual nature of the Neighbour-net: it acts both as a distance-based 2-dimensional graph and as a meta-phylogenetic network (Fig. 2). Based on the PCA, which is also a 2-dimensional depiction of the differentiation, one might be tempted to interpret VI as a bridge between IV/V and I, not much different from how III bridges between II and IV (Fig. 1). The network (Figs 1, 2), on the other hand, informs us that VI is a likely relative of V, which in turn is a likely relative of IV; the only connection between I and VI is their increasing distinctness from the other haplotypes of the "Euro-Med" lineage, III/IV/V.

Fig. 2 The main splits expressed in the Neighbour-net. III may either be sister to II, or part of a clade comprising IV and V.

Using the main split patterns in the Neighbour-net, we can infer the one phylogenetic hypothesis, a tree, that can accommodate them all (Fig. 3).

Fig. 3 The tree solution congruent with the major split patterns (Fig. 2).

I rejected the alternative sister relationship between II and III because it would imply a sister clade that includes IV, V and VI but not III, which clashes with the affinity of III to IV and V (Fig. 2). Interpreting III as sister to IV and V explains both its affinity to II (the putative sister lineage of III–VI) and its affinity to IV and V.

We might accept that all three plastome lineages are reciprocally monophyletic (in a quite broad sense), meaning that each lineage evolved from a pool of closely related mother plants. If so, then the higher similarity between III ("Euro-Med") and II ("Cerris-Ilex") may reflect a relative lack of derivation, whereas the dissimilarity of VI ("Euro-Med") and I ("WAHEA") to all other types may reflect a higher level of distinctness. We can then come up with a "cactus"-type metaphorical tree (Fig. 4) that explains both the Neighbour-net and the PCA graph.

Fig. 4 A "cactus"-type tree metaphor for the evolution of oak plastomes (based on the results of Simeone et al. 2016 and Vitelli et al. 2017, and – outside the focus group, i.e. Mediterranean oaks of subgenus Cerris – some partly arcane, not-yet-published knowledge I have access to)
We thus learn more from the Neighbour-net than from the PCA.

There's no reason to stop with a PCA

One empirical example is far from conclusive, but it shows what Neighbour-nets have to offer.

Trees are fine for proposing phylogenetic hypotheses, but we should always be aware of equally valid alternatives to the tree that we have optimized. With increasing numbers of taxa, inferring optimal trees and assessing their alternatives requires increasing effort and checking. For many questions, PCA has been used as a quick alternative, including in large-sample genetic studies (see Continued misuse of PCA in genomics studies).

Neighbour-nets are just a natural step further towards a phylogeny. They come with very little extra effort and can use the same data basis: a matrix of pairwise distances. In the case of genetic data, which usually reflect at least the main aspects of the actual phylogeny behind them (trivial or complex), the "true tree", they should be obligatory. They are much more than just a clustering approach (even though their algorithm builds on a cluster algorithm) or a bivariate analysis: Neighbour-nets are meta-phylogenetic networks that have the capacity to contain the one or many topologies explaining the data. And they are as straightforward as PCA when it comes to recognising "natural" (coherent and equal) groups, in contrast to phylogenetic trees.
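To make the shared starting point concrete, here is a minimal Python sketch, using an invented toy distance matrix (not the oak data), showing that the very matrix one would feed to a Neighbour-net program can also be ordinated directly, via Principal Coordinates Analysis (PCoA, the distance-based sibling of PCA):

```python
import numpy as np

# Toy pairwise distance matrix for five hypothetical haplotypes (A-E).
# Any symmetric distance matrix used for a Neighbour-net can equally be
# fed into a Principal Coordinates Analysis (classical MDS).
D = np.array([
    [0.0, 0.2, 0.5, 0.6, 0.9],
    [0.2, 0.0, 0.4, 0.5, 0.8],
    [0.5, 0.4, 0.0, 0.3, 0.7],
    [0.6, 0.5, 0.3, 0.0, 0.6],
    [0.9, 0.8, 0.7, 0.6, 0.0],
])

def pcoa(dist, k=2):
    """Classical MDS: embed a distance matrix in k dimensions."""
    n = dist.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centring matrix
    B = -0.5 * J @ (dist ** 2) @ J             # double-centred Gram matrix
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1]             # largest eigenvalues first
    vals = np.maximum(vals[order][:k], 0.0)    # clip negative eigenvalues
    vecs = vecs[:, order][:, :k]
    return vecs * np.sqrt(vals)

coords = pcoa(D)
print(coords.shape)  # (5, 2): one 2-D point per haplotype
```

The same matrix, written out in NEXUS or PHYLIP format, can be loaded into SplitsTree to infer the corresponding Neighbour-net, so the two visualizations really do rest on identical input.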

Postscript

I would have liked to add some more examples with non-genetic data, i.e. data sets where the distances are not the result of an explicit phylogenetic process. But this requires much more effort, since none of the PCA studies I browsed documented the distance data/matrix used. However, I'm sure that inferring a Neighbour-net based on no-matter-what similarity data used for a PCA can be a fruitful and revealing endeavour (which is why you find Neighbour-nets based on U.S. gun legislation, breast sizes, languages, cryptocurrencies, etc. on this blog, but few PCAs). So, try it out the next time you make a PCA, and share the results, e.g. by using our comment option or even a post as guest-blogger.

Don't miss these earlier posts on similar topics:

Also, this paper introduces Neighbor-nets to the wider audience of multivariate data analyses:

References

Denk T, Grimm GW. 2010. The oaks of western Eurasia: traditional classifications and evidence from two nuclear markers. Taxon 59: 351–366.

Denk T, Grimm GW, Manos PS, Deng M, Hipp AL. 2017. An updated infrageneric classification of the oaks: review of previous taxonomic schemes and synthesis of evolutionary patterns. In: Gil-Pelegrín E, Peguero-Pina JJ, and Sancho-Knapik D, eds. Oaks Physiological Ecology. Heidelberg, New York: Springer, p. 13–38. Free Pre-Print at bioRxiv [major change: Ponticae and Virentes accepted as additional sections in final version]

Hipp AL, Manos P, McVay JD, ... , Avishai M, Simeone MC. 2015 [abstract]. A phylogeny of the World's oaks. Botany 2015. Edmonton.

Simeone MC, Grimm GW, Papini A, Vessella F, Cardoni S, Tordoni E, Piredda R, Franc A, Denk T. 2016. Plastome data reveal multiple geographic origins of Quercus Group Ilex. PeerJ 4: e1897 [open access, comments/questions welcomed]

Vitelli M, Vessella F, Cardoni S, Pollegioni P, Denk T, Grimm GW, Simeone MC. 2017. Phylogeographic structuring of plastome diversity in Mediterranean oaks (Quercus Group Ilex, Fagaceae). Tree Genetics and Genomes 13:3.

Monday, March 12, 2018

Tattoo Monday XIV


Tattoos are quite common among modern women. So, for today's collection, here are some circular phylogenetic trees of various sizes and in various locations.



For anyone who wants to pursue the matter, there is a Reddit thread on the topic:
Ladies with tattoos - What are some negative (or positive) comments you've gotten from strangers because of your tattoos? Where are your tattoos and what are they of?

Monday, March 5, 2018

Visualizing U.S. gun laws


The Founding Fathers of the USA made the decision to explicitly insist on the right of all US citizens to bear arms, because they felt that to do otherwise could be the foundation of what we would now call a Police State. The right was granted in the well-known 2nd Amendment, along with the right to form a militia (to fend off the British, among others). This may have been a reasonable way to achieve freedom in the 1700s; and it was certainly the basis of the reputation of the Wild West in the 1800s.

However, increasingly during the 1900s, and especially now in the 2000s, the practical consequences of this part of the US Constitution have come into question. Indeed, due to recent events in some states, this facet of the United States has come to world-wide attention, because its gun legislation is quite unique. However, this is an over-simplification, because there are substantial differences between the fifty states (and the District of Columbia). This blog post provides a practical look at the similarities and differences in these gun laws.

The 2nd Amendment of the U.S. Constitution, the Bill of Rights:
"A well regulated Militia, being necessary to the security of a free State, the right of the people to keep and bear Arms, shall not be infringed."

Gun legislation in the United States

Gun legislation is not a federal business, as one might think when following the news. The USA is a union of states, rather than a federation, with the states retaining all political rights that they have not delegated to the federal government (i.e. inter-state and inter-nation laws). This differs from almost all other countries, in which the federal government retains all political rights that it does not delegate to the states or counties.

In particular, the US state legislations are highly diverse regarding how to exercise the basic (constitutional) right to bear arms. Some states retain the original 1700s interpretation while others have made it rather hard to carry guns, either openly or concealed.

The web site GunsToCarry, for example, breaks the legislation down to five general points:
  • Does one need a permit to carry a gun?
  • Does one need a permit to purchase a gun?
  • Does one need to register an owned gun?
  • Is it allowed to carry the gun in the open?
  • Are there background checks when one privately sells or buys a gun?
For each state, each of these questions can be answered by 'Yes' or 'No', for both hand guns (pistols, revolvers, etc) and long guns (rifles) separately.

In addition, further restrictions/modifications are listed. For instance, there are variations regarding the general policy for getting a permit for a gun ("Unrestricted"; "Shall Issue"; "May Issue") and how it is exercised. Hawaii, to take one example, requires permits, and the general policy is "May Issue", meaning that the state may issue a permit or decide not to, on a case-by-case basis. In reality, the bars to getting a permit are so high in Hawaii that people normally don't get one. The other state where "May Issue" is exercised as "No Issue" is New Jersey. Another characteristic is that some states, such as California, do not allow private sales unless they are done via a licensed dealer or a state law enforcement department.

This all leads to 17 characters that can contribute to differences between states. These can be illustrated in a simple network. The outcome is shown below, after some technical details about how to produce the picture.


Technical details

To provide a pictorial overview of these differences, we can use a particular type of network, called a phylogenetic network. We first calculate pairwise distances between the states, quantifying their differences, and then use a neighbor-net to create the picture as a single graph.

The five main questions provide ten binary (2-state) characters (No = 0, Yes = 1), but I chose an ordered ternary (3-state) character for open carry, to account for local variation (open carry allowed in general = 0, not state-wide = 1, not at all = 2). For ordered ternary characters, the change from e.g. "Unrestricted" (0) to "May Issue" (2) counts as two differences. To even out the impact of binary and ternary characters, all binary characters were given a weight of two. Hence, a distance of 0 (between any given pair of states) means that the two states have the same legislation in all scored characters; and a distance of 2 would mean that two states differ completely in their legislation.

I excluded one character (the maximum number of rounds allowed per magazine) that provides little discriminatory signal, since it can only be scored for the rather few states that have a magazine size restriction (either 10 or 15 rounds) for hand or long guns (or both).
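The weighting scheme can be sketched in a few lines of Python. The three states and their scores below are invented for illustration, not taken from the actual matrix; only the weighting logic (binary characters weighted two, ordered ternary characters counting steps) follows the description above:

```python
from itertools import combinations

# Hypothetical mini-matrix: three states scored for two binary characters
# (permit needed, registration needed; No = 0, Yes = 1) and one ordered
# ternary character (open carry: allowed = 0, not state-wide = 1, not at all = 2).
states = {
    "StateA": [0, 0, 0],
    "StateB": [1, 0, 1],
    "StateC": [1, 1, 2],
}
weights = [2, 2, 1]  # binary characters weighted two, ternary weighted one

def distance(x, y):
    # Ordered characters count the number of steps between states,
    # so 0 -> 2 on a ternary character counts as two differences.
    return sum(w * abs(a - b) for w, a, b in zip(weights, x, y))

for s1, s2 in combinations(states, 2):
    print(s1, s2, distance(states[s1], states[s2]))
# StateA StateB 3
# StateA StateC 6
# StateB StateC 3
```

The resulting pairwise distance matrix (here unnormalized) is exactly the kind of input a Neighbour-net takes.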

Gun legislation in the states of the U.S.A.

The interpretation of the network is straightforward. States that are closely connected in the network are similar to each other based on their gun laws, and those that are further apart are progressively more different from each other. Find your own state, and you can immediately see which states are similar to it. (For more details, see: How to interpret splits graphs)


Figure 1 A neighbour-net visualizing the differences in gun legislation in the U.S.A. Blue stars indicate states where guns have to be registered.

The graph well captures the differences in the state legislations. States without gun control, i.e. no permits needed, no registration, free-to-carry, no limitation of magazine sizes, form one endpoint of the network (highlighted in red).

At the opposite end of the graph, highlighted in green, are those states requiring permits for having, buying or selling a gun, that don't endorse open carry, and limit the size of magazines to 10/15 rounds. This part of the network is spread out because each state shows a different combination of controls. The most restrictive states are the right-most ones (Hawaii, Connecticut, California and District of Columbia).

In between these two endpoints, come the states that exercise some control (e.g. on handguns only). These are generally more similar to the no-control states, in that they may require one or another permit, but otherwise have no or few restrictions.

You will note the position of both Texas and Florida (states that joined the Union in the 19th century and were part of the Confederacy) in the network — they are both down the end with the fewest gun controls. You will also note the position of the most densely populated states, which are mostly down the other end.

Finally, here is the same graph with two historical groups of states highlighted, representing two phases of the development of the modern USA. The nature of modern gun laws is not randomly distributed among these groups.

Figure 2 Same graph as in Fig. 1, showing the original Thirteen Colonies (1700s) and the states of the Confederacy (1800s).


Conclusion

Clearly, the United States provides a variety of gun legislation, from strong control to almost none. This inevitably leads to strongly opposing opinions among the public when it comes to guns, since any call for more control also calls into question a basic constitutional right.

The network also provides a guide-graph for any tourists who might be concerned about U.S. gun legislation. They should visit states such as California if they wish to feel safer, or Alaska if they are searching for a little wild-west feeling.


More plots, links, etc can be found in the related long-read. It includes mapping results of recent and earlier tight presidential elections, population density, real GDP, and number of firearm-related deaths per 100,000 inhabitants; links for further reading and some thoughts on the issue.
  

Data

I have provided a fileset on Figshare, including the matrix used (annotated NEXUS, generated with and optimized for Mesquite; "simple" NEXUS with set-up details for PAUP*), the resultant distance matrices (raw, PHYLIP-formatted; analyzed, Splits-NEXUS-formatted), and the figures (for this post and the related long-read).

Monday, February 26, 2018

Tossing coins: linguistic phylogenies and extensive synonymy


The procedures by which linguists sample data when carrying out phylogenetic analyses of languages are sometimes fundamentally different from the methods applied in biology. This is particularly obvious in the matter of the sampling of data for analysis, which I will discuss in this post.

Sampling data in historical linguistics

The reason for the difference is straightforward: while biologists can now sample whole genomes and search across those genomes for shared gene families, linguists cannot sample the whole lexicon of several languages. The problem is not that we could not apply cognate detection methods to whole dictionaries; in fact, there are recent attempts to do exactly this (Arnaud et al. 2017). The problem is that we simply do not know exactly how many words we can find in any given language.

For example, the Duden, a large lexicon of the German language, recently added 5000 more words, mostly due to recent technological innovations, which have led to new words that we frequently use in German, such as twittern "to tweet", Tablet "tablet computer", or Drohnenangriff "drone attack". In total, it now lists 145,000 words, and the majority of these words have been coined in complex processes involving language-internal derivation of new word forms, but also a large amount of borrowing, as one can see from the three examples.

One could argue that we should only sample those words that most of the speakers of a given language know, but even there we are far from being able to provide reliable statistics, not to speak of the fact that these numbers may well vary greatly across different language families and cultural and sociolinguistic backgrounds. Brysbaert et al. (2016), for example, estimate that
an average 20-year-old native speaker of American English knows 42,000 lemmas and 4,200 non-transparent multiword expressions, derived from 11,100 word families.
But in order to count as "near-native" in a certain language, including the ability to pursue studies at a university, the Common European Framework of Reference for Languages requires only between 4,000 and 5,000 words (Milton 2010; see also List et al. 2016). How many word families this includes is not clear, and may, again, depend directly on the target language.

Lexicostatistics

When Morris Swadesh (1909-1967) established the discipline of lexicostatistics, he made the first attempt to approach the problems we face in historical linguistics with the help of quantitative methods. He started from a sample of 215 concepts (Swadesh 1950), which he later reduced to only 100 (Swadesh 1955), because he was afraid that some concepts would often be denoted by words that are borrowed, or would simply not be expressed by single words in certain language families. Since then, linguists have been trying to refine this list further, either by extending it (Starostin 1991 added 10 more concepts to Swadesh's list of 100 concepts) or by reducing it even further (Holman et al. 2008 reduced the list to 40 concepts).

While it is not essential how many concepts we use in the end, it is important to understand that our current phylogenetic approaches do not start simply by comparing words; instead, we sample parts of the lexicon of our languages with the help of a list of comparative concepts (Haspelmath 2010), which we then consecutively translate into the target languages. This sampling procedure was not necessarily invented by Morris Swadesh, but he was the first to establish its broader use, and we have directly inherited it when applying our phylogenetic methods (see this earlier post for details on lexicostatistics).

Synonymy in linguistic datasets

Having inherited the procedure, we have also inherited its problems, and, unfortunately, there are many problems involved with this sampling procedure. Not only do we have difficulties determining a universal diagnostic test list that could be applied to all languages, we also have considerable problems in standardizing the procedure of translating a comparative concept into the target languages, especially when the concepts are only loosely defined. The concept "to kill", for example, seems to be a rather straightforward example at first sight. In German, however, we have two words that could express this meaning equally well: töten (cognate with English dead) and umbringen (partially cognate with English to bring). In fact, as with all languages in the world, there are many more words for "to kill" in German, but these can easily be filtered out, as they usually are euphemisms, such as eliminieren "to eliminate", or neutralisieren "to neutralize". The words töten and umbringen, however, are extremely difficult to distinguish with respect to their meaning, and speakers often use them interchangeably, depending, perhaps, on register (töten being more formal). But even for me as a native speaker of German, it is incredibly difficult to tell when I use which word.

One solution to making a decision as to which of the words is more basic could be corpus studies. By counting how often and in which situations one term is used in a large corpus of German speech, we might be able to determine which of the two words comes closer to the concept "to kill" (see Starostin 2013 for a very elegant example for the problem of words for "dog" in Chinese). But in most cases where we compile lists of languages, we do not have the necessary corpora.

Furthermore, since corpus studies on competing forms for a given concept are extremely rare in linguistics, we cannot exclude the possibility that the frequencies of two words expressing the same concept are, in the end, the same, and that the words just represent a state of equilibrium in which speakers use them interchangeably. Whether we like it or not, we have to accept that there is no general principle for avoiding these cases of synonymy when compiling our datasets for phylogenetic analyses.

Tossing coins

What should linguists do in such a situation, when they are about to compile the dataset that they want to analyze with the modern phylogenetic methods, in order to reconstruct some eye-catching phylogenetic trees? In the early days of lexicostatistics, scholars recommended being very strict, demanding that only one word in a given language should represent each comparative concept. In cases like German töten and umbringen, they recommended tossing a coin (Gudschinsky 1956), in order to make the procedure as objective as possible.

Later on, scholars relaxed the criteria, and just accepted that in a few — hopefully very few — cases there would be more than one word representing a comparative concept in a given language. This principle has not changed with the quantitative turn in historical linguistics. In fact, thanks to the procedure by which cognate sets across concept slots are dichotomized in a second step, scholars who only care about the phylogenetic analyses and not about the underlying data may easily overlook that the Nexus file from which they try to infer the ancestry of a given language family may list a large number of synonyms, wherever the classical scholars simply did not know how to translate one of their diagnostic concepts into the target languages.

Testing the impact of synonymy on phylogenetic reconstruction

The obvious question to ask at this stage is: does this actually matter? Can't we just ignore it and trust that our phylogenetic approaches are sophisticated enough to find the major signals in the data, so that we can just ignore the problem of synonymy in linguistic datasets? In an early study, almost 10 years ago, when I was still a greenhorn in computing, I made an initial study of the problem of extensive synonymy, but it never made it into a publication, since we had to shorten our more general study, of which the synonymy test was only a small part. This study has been online since 2010 (Geisler and List 2010), but is still awaiting publication; and instead of including my quantitative test on the impact of extensive synonymy on phylogenetic reconstruction, we just mentioned the problem briefly.

Given that the problem of extensive synonymy turned up frequently in recent discussions with colleagues working on phylogenetic reconstruction in linguistics, I decided that I should finally close this chapter of my life, and resume the analyses that had been sleeping in my computer for the last 10 years.

The approach is very straightforward. If we want to test whether the choice of translations leaves traces in phylogenetic analyses, we can just take the pioneers of lexicostatistics literally, and conduct a series of coin-tossing experiments. We start from a "normal" dataset of the kind people use in phylogenetic studies. These datasets usually contain a certain amount of synonymy (not an extreme amount, but it is not surprising to find two, three, or even four translations in the datasets that have been analysed in recent years). If we now have the computer toss a coin in each situation where only one word should be chosen, we can easily create a large sample of datasets, each of which is synonym-free. Analysing these datasets and comparing the resulting trees is again straightforward.
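The coin-tossing step itself can be sketched in a few lines of Python. The word lists below are invented for illustration, and toss_coins is a hypothetical helper, not part of LingPy:

```python
import random

# Toy word lists: each concept maps to one or more translations per language.
# The data are invented; real datasets would come from e.g. IELex or the ABVD.
wordlists = {
    "German":  {"to kill": ["töten", "umbringen"], "hand": ["Hand"]},
    "English": {"to kill": ["kill"],               "hand": ["hand"]},
}

def toss_coins(data, rng):
    """Resolve synonymy by randomly keeping one translation per concept."""
    return {
        language: {concept: rng.choice(words)
                   for concept, words in concepts.items()}
        for language, concepts in data.items()
    }

rng = random.Random(42)  # fixed seed, so the trials are reproducible
# Create a large sample of synonym-free datasets, one per coin-tossing trial.
trials = [toss_coins(wordlists, rng) for _ in range(1000)]
print(len(trials))  # 1000 synonym-free datasets
```

Each of the 1000 resulting datasets can then be analysed exactly as the original one, and the resulting trees compared.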

I wrote some Python code, based on our LingPy library for computational tasks in historical linguistics (List et al. 2017), and selected four publicly available datasets for my studies, namely: one Indo-European dataset (Dunn 2012), one Pama-Nyungan dataset (Australian languages; Bowern and Atkinson 2012), one Austronesian dataset (Greenhill et al. 2008), and one Austro-Asiatic dataset (Sidwell 2015). The following table lists some basic information about the number of concepts, the number of languages, and the average synonymy, i.e., the average number of words expressing a concept in the data.

Dataset Concepts Languages Synonymy
Austro-Asiatic 200 58 1.08
Austronesian 210 45 1.12
Indo-European 208 58 1.16
Pama-Nyungan 183 67 1.1

For each dataset, I made 1000 coin-tossing trials, in which I randomly picked only one word wherever more than one word was given as the translation of a given concept in a given language. I then computed a phylogeny for each newly created dataset with the help of the neighbor-joining algorithm on the distance matrix of shared cognates (Saitou and Nei 1987). In order to compare the trees, I employed the generalized Robinson-Foulds distance, as implemented in LingPy by Taraka Rama. Since I did not have time to wait to compare all 1000 trees against each other (this takes a long time when computing the analyses for four datasets), I randomly sampled 1000 tree pairs. It is, however, easy to repeat the results and compute the distances for all tree pairs exhaustively. The code and the data that I used can be found online at GitHub (github.com/lingpy/toss-a-coin).
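For readers who want a feeling for the tree comparison, here is a self-contained Python sketch of the plain Robinson-Foulds distance, computed on trees represented as sets of bipartitions; the taxa and trees are invented for illustration, and the generalized variant used above (which also matches similar, not only identical, splits) is more involved:

```python
# Trees are represented as sets of non-trivial splits (bipartitions); each
# split is given by the frozenset of taxa on one side of the branch.
taxa = {"A", "B", "C", "D", "E"}

tree1 = {frozenset({"A", "B"}), frozenset({"A", "B", "C"})}
tree2 = {frozenset({"A", "C"}), frozenset({"D", "E"})}

def normalize(split, all_taxa):
    """Represent each bipartition by its lexicographically smaller side."""
    other = frozenset(all_taxa - split)
    return min(split, other, key=lambda side: sorted(side))

def rf_distance(t1, t2, all_taxa):
    s1 = {normalize(s, all_taxa) for s in t1}
    s2 = {normalize(s, all_taxa) for s in t2}
    return len(s1 ^ s2)  # number of splits found in only one of the trees

print(rf_distance(tree1, tree2, taxa))  # → 2
```

Note that the normalization step treats a split and its complement as the same bipartition, which is why the {D,E} split of tree2 matches the {A,B,C} split of tree1.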

Some results

As shown in the following table, where I have added the averaged generalized Robinson-Foulds (GRF) distances for the pairwise tree comparisons, it becomes obvious that — at least for distance-based phylogenetic calculations — the problem of extensive synonymy and the choice of translational equivalents has an immediate impact on phylogenetic reconstruction. In fact, the average differences reported here are higher than the ones we find when comparing phylogenetic reconstructions based on automatic pipelines with those based on manual annotation (Jäger 2013).

Dataset Concepts Languages Synonymy Average GRF
Austro-Asiatic 200 58 1.08 0.20
Austronesian 210 45 1.12 0.19
Indo-European 208 58 1.16 0.59
Pama-Nyungan 183 67 1.1 0.22

The most impressive example is the Indo-European dataset, where we have an incredible average distance of 0.59. This result almost seems surreal, and at first I thought that my lazy sampling procedure had introduced a bias. But a second trial confirmed the distance (0.62); and when comparing each of the 1000 trial trees with the tree we receive when not excluding the synonyms, the distance is even slightly higher (0.64).

When looking at the consensus network of the 1000 trees (created with SplitsTree4; Huson & Bryant 2006), using no threshold (to make sure that the full variation could be traced) and the mean for the calculation of branch lengths, which is shown below, we can see that the variation introduced by the synonyms is indeed real.


The consensus network of the 1000 tree sample for the Indo-European language sample

Notably, the Germanic languages are highly incompatible, followed by Slavic and Romance. In addition, we find quite a lot of variation around the root. Furthermore, when looking at the table below, which shows the ten languages with the largest number of synonyms in the Indo-European data, we can see that most of them belong to the highly incompatible Germanic branch.

Language Subgroup Synonymous Concepts
OLD_NORSE Germanic 83
FAROESE Germanic 77
SWEDISH Germanic 68
OLD_SWEDISH Germanic 65
ICELANDIC Germanic 64
OLD_IRISH Celtic 61
NORWEGIAN_RIKSMAL Germanic 54
GUTNISH_LAU Germanic 50
ORIYA Indo-Aryan 50
ANCIENT_GREEK Greek 46

Conclusion

This study should be taken with due care, as it is a preliminary experiment, and I have only tested four datasets, using a rather rough procedure of sampling the distances. It is perfectly possible that Bayesian methods (as they are now "traditionally" used for phylogenetic analyses in historical linguistics) can deal with this problem much better than distance-based approaches. It is also clear that by sampling the trees in a more rigorous manner (e.g. by setting a threshold to include only those splits that occur frequently enough), the network will look much more tree-like.

However, even if it turns out that the results are exaggerating the situation due to some theoretical or practical errors in my experiment, I think that we can no longer ignore the impact that our data decisions have on the phylogenies we produce. I hope that this preliminary study can eventually lead to some fruitful discussions in our field that may help us to improve our standards of data annotation.

I should also make it clear that this is in part already happening. Our colleagues from Moscow State University (led by George Starostin, in the form of the Global Lexicostatistical Database project) try very hard to improve the procedure by which translational equivalents are selected for the languages they investigate. The same applies to colleagues from our department in Jena, who are working on an ambitious database for the Indo-European languages.

In addition to linguists trying to improve the way they sample their data, however, I hope that our computational experts could also begin to take the problem of data sampling in historical linguistics more seriously. A phylogenetic analysis does not start with a Nexus file. Especially in historical linguistics, where we often have very detailed accounts of individual word histories (derived from our qualitative methods), we need to work harder to integrate software solutions and qualitative studies.

References

Arnaud, A., D. Beck, and G. Kondrak (2017) Identifying cognate sets across dictionaries of related languages. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing 2509-2518.

Bowern, C. and Q. Atkinson (2012) Computational phylogenetics and the internal structure of Pama-Nyungan. Language 88. 817-845.

Brysbaert, M., M. Stevens, P. Mandera, and E. Keuleers (2016) How many words do we know? Practical estimates of vocabulary size dependent on word definition, the degree of language input and the participant’s age. Frontiers in Psychology 7. 1116.

Dunn, M. (ed.) (2012) Indo-European Lexical Cognacy Database (IELex). http://ielex.mpi.nl/.

Geisler, H. and J.-M. List (2010) Beautiful trees on unstable ground: notes on the data problem in lexicostatistics. In: Hettrich, H. (ed.) Die Ausbreitung des Indogermanischen. Thesen aus Sprachwissenschaft, Archäologie und Genetik. Reichert: Wiesbaden.

Greenhill, S., R. Blust, and R. Gray (2008) The Austronesian Basic Vocabulary Database: From bioinformatics to lexomics. Evolutionary Bioinformatics 4. 271-283.

Gudschinsky, S. (1956) The ABC’s of lexicostatistics (glottochronology). Word 12.2. 175-210.

Haspelmath, M. (2010) Comparative concepts and descriptive categories. Language 86.3. 663-687.

Holman, E., S. Wichmann, C. Brown, V. Velupillai, A. Müller, and D. Bakker (2008) Explorations in automated lexicostatistics. Folia Linguistica 20.3. 116-121.

Huson, D. and D. Bryant (2006) Application of phylogenetic networks in evolutionary studies. Molecular Biology and Evolution 23.2. 254-267.

Jäger, G. (2013) Phylogenetic inference from word lists using weighted alignment with empirical determined weights. Language Dynamics and Change 3.2. 245-291.

List, J.-M., J. Pathmanathan, P. Lopez, and E. Bapteste (2016) Unity and disunity in evolutionary sciences: process-based analogies open common research avenues for biology and linguistics. Biology Direct 11.39. 1-17.

List, J.-M., S. Greenhill, and R. Forkel (2017) LingPy. A Python Library For Quantitative Tasks in Historical Linguistics. Software Package. Version 2.6. Max Planck Institute for the Science of Human History: Jena.

Milton, J. (2010) The development of vocabulary breadth across the CEFR levels: a common basis for the elaboration of language syllabuses, curriculum guidelines, examinations, and textbooks across Europe. In: Bartning, I., M. Martin, and I. Vedder (eds.) Communicative Proficiency and Linguistic Development: Intersections Between SLA and Language Testing Research. Eurosla: York. 211-232.

Saitou, N. and M. Nei (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular Biology and Evolution 4.4. 406-425.

Sidwell, P. (2015) Austroasiatic Dataset for Phylogenetic Analysis: 2015 version. Mon-Khmer Studies (Notes, Reviews, Data-Papers) 44. lxviii-ccclvii.

Starostin, S. (1991) Altajskaja problema i proischo\vzdenije japonskogo jazyka [The Altaic problem and the origin of the Japanese language]. Nauka: Moscow.

Starostin, G. (2013) K probleme dvuch sobak v klassi\cceskom kitajskom jazyke: canis comestibilis vs. canis venaticus? [On the problem of two words for dog in Classical Chinese: edible vs. hunting dog?]. In: Grincer, N., M. Rusanov, L. Kogan, G. Starostin, and N. \cCalisova (eds.) Institutionis conditori: Ilje Sergejevi\ccu Smirnovy.[In honor of Ilja Sergejevi\cc Smirnov].L. RGGU: Moscow. 269-283.

Swadesh, M. (1950) Salish internal relationships. International Journal of American Linguistics 16.4. 157-167.

Swadesh, M. (1955) Towards greater accuracy in lexicostatistic dating. International Journal of American Linguistics 21.2. 121-137.

Monday, February 19, 2018

We want to publish our phylogenetic data – including networks, but where?


(This is a joint post by Guido Grimm and David Morrison)

About five years ago, David wrote two posts regarding issues with the public availability and release of phylogenetic data. Since then, the situation has improved somewhat for science, but we still have not progressed as far as we should have. In this post, we share some anecdotes, and give some tips on where you can store your networks.

David asked an interesting question: Why are phylogeneticists so reluctant to present their actual data in the first place? In this schematic, this asks why the arrow connecting "Data Product" to "Reality" is so often missing.


The archiving of primary data (the data matrix) and its derivatives (e.g. phylogenies) should be obligatory, so that the basic data are publicly available, the results can be verified by others, and any errors identified and eliminated.

There is no good reason to hold data back. We may have put a lot of effort into our data sets, but if we don't share them then this effort benefits only ourselves, and it becomes null and void once we have published our paper. We may also leave science (via retirement or otherwise), or stop maintaining our professional homepage, at which point our data legacy will likely drift off in a puff of smoke.

On the other hand, when we make the data publicly available, others can take it from there. Indeed, we may even meet new collaborators, if they are interested in the same line of research. Just as importantly, we are no longer responsible for keeping it at hand for eventual requests. This is one of the chief advantages of sites like ResearchGate, which automate this sort of administrative effort.

If the re-users of our data are honest scientists, then they will (of course) cite us for our data matrix. But if they have to sit down to harvest the genebanks, and re-create the matrix from scratch, then why should they cite the people that produced the data? Just as importantly, making data sets accessible enables teachers / lecturers to make use of them in their courses, having at hand one (or more, when the data have been re-used) publications for discussion.

It also gives developers test datasets for new algorithms and programs. For instance, Guido's best-cited (first-author) paper on Google Scholar (Grimm et al. Evolutionary Bioinformatics 2006) has been cited 66 times (as of February 13th), mainly because the maple dataset has become a tricky test set in a large number of bioinformatics papers, passed from one bioinformatician to the next. It is for this reason that our compilation of verified empirical network datasets was first created.

Finally, for most of us our research is made possible by public money, so we do not actually own our data, personally. It really belongs to the public, who funded it, so there should be public access to it — we cannot monopolize expertise that is created by public funding.

As an aside, it avoids responses such as these (all of which are real, and quite common):
I cannot send you the data because I don't have a backup on my new computer
I don't have the data, only the late Ph.D. student has it, who has left the lab
I can't find the data, because I have changed universities
I'm not sure if I can share the data, as it was a collaborative project
I expect to be a co-author, even if I do no further work.

Tides have turned, somewhat

There are now quite a few journals that expect each phylogenetic data matrix, and the inferred tree, to be stored in a public repository. For instance, BioMed Central journals such as BMC Evolutionary Biology (now owned by Springer-Nature) expect you to store your (phylogenetic) data in a public repository such as TreeBase or Dryad. However, few journals enforce the documentation of primary data (e.g. Nature, the same publisher's flagship journal, does not), treating it only as a recommendation. The easiest way to enforce archiving would be to refuse to review any manuscript whose data have not already been deposited.

TreeBase, which is free of charge, is still only an option when you deal with simple data: a matrix and a tree, or a few trees inferred from the matrix — network-formatted genealogies cannot be stored, only trees. When you have networks, a compilation of analysis files, or trees with labels that do not refer to species (in a taxonomic sense), it is not an option. For example, the TreeBase submission of the above-mentioned maple data is defunct, because the maximum likelihood trees were based on individual clones or consensus sequences. The main result, "bipartition networks" based on the ML bootstrap pseudoreplicate samples, cannot be handled; and naked matrices are not published anymore (you need a tree to go with the matrix).

Dryad has no file type or content limitations, but it charges a fee (albeit a modest one). A few of the journals enforcing data storage, such as Systematic Biology, cover the cost, but Springer-Nature's BMC Evolutionary Biology does not — given what they charge for a publication (> $2,500), they should. Springer-Nature has now launched an open research initiative of its own, with open data components (e.g. LOD), but so far little has changed (see e.g. the fresh paper on Citrus in Nature); and it would be surprising if making data openly accessible came with no extra costs for the authors.

Ideally, there would be an online supplement

Providing the data as an open-access online supplement directly linked to the paper seems to be a natural choice. Everyone that finds the paper can then directly access the related data and main analysis files.

Journals such as PeerJ, or the Public Library of Science (PLoS) series, make it possible to upload a wide range of file formats as online supplements. While most journals now have online supplements, relatively few allow uploading of, for example, a packed (zipped) archive file. This is the only practical option when you want to provide not only the raw NEXUS file and a NEWICK-formatted text file with the tree, but also, e.g., the bootstrap samples, the Bayesian sampled topology file, and the support consensus networks based on them — such as an annotated (graphically enhanced) split-NEXUS file generated with SplitsTree, a fully annotated matrix, or the outcome of a median network analysis from the NETWORK program. There is usually some limitation on the maximum size (storage space generates real costs for the publisher).
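As a concrete illustration, packing such a supplement bundle takes only a few lines of Python (a minimal sketch; the file names and contents here are hypothetical placeholders, not real analysis files):

```python
import zipfile

# Hypothetical supplement files: the data matrix, the ML tree,
# the bootstrap pseudoreplicate trees, and a SplitsTree network file.
supplement_files = {
    "matrix.nex": "#NEXUS\nBEGIN DATA;\n...\nEND;\n",
    "ml_tree.nwk": "((A,B),(C,D));\n",
    "bootstraps.nwk": "((A,B),(C,D));\n((A,C),(B,D));\n",
    "consensus_network.nex": "#NEXUS\nBEGIN SPLITS;\n...\nEND;\n",
}

def pack_supplement(archive_name, files):
    """Write all supplement files into one compressed zip archive."""
    with zipfile.ZipFile(archive_name, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, content in files.items():
            zf.writestr(name, content)

pack_supplement("supplement.zip", supplement_files)
```

A single archive like this keeps the matrix, trees and network files together, so a reader downloading the supplement gets the complete analysis in one piece.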

A nice touch of PeerJ is that each supplement file gets its own DOI, similar to Dryad's annotation procedure, making the uploaded data archives/files individually referenceable.


More alternatives

Most, if not all, journals with good online supplement storage are open access journals, where you have to pay to publish — currently a bit over $1,000 for PeerJ, and ~$1,500 for e.g. PLoS ONE (PeerJ also offers individual life-long publishing plans). Perhaps a basic problem with open access is that it moves the financial cost from the reader to the writer — which is not good if you have little funding to do your work.

So what do you do when you publish in a traditional journal with few online storage options?

One alternative is Figshare, where you have up to 20 GB of storage for free, and can upload a variety of file types, including images, spreadsheets, and data archives. Uploading images and data to repositories like Dryad or Figshare may also be a good option where restrictive copyright clauses are still occasionally found in publication agreements. Before submitting the final version, you simply publish the data and figures there under a CC-BY licence, and reference them accordingly in your copyrighted book chapter or paper.

An increasing number of institutions now also provide the possibility to store (permanently) research data produced at the institution. So it is always worth asking the IT department or the university library about the availability of such an option. And some countries, such as Austria, have launched their own open data platforms.

Uploading data files to ResearchGate is probably not an option for network-affine research, as it allows only PDF files (which then need to be text-extractable). As phylogeneticists, we want to distribute our (usually NEXUS-, FASTA- or PHYLIP-formatted) matrices and primary inference-results files, so that they become part of the scientific world.
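Those matrix formats are plain text and easy to convert between, which is one more reason to share them rather than a PDF. As a sketch (pure Python, no error handling beyond an alignment-length check; the taxon names are made up), here is a converter from an aligned FASTA string to relaxed PHYLIP:

```python
def fasta_to_phylip(fasta_text):
    """Convert an aligned FASTA string to relaxed PHYLIP format.

    Assumes the sequences are already aligned (all the same length).
    """
    seqs = {}
    name = None
    for line in fasta_text.strip().splitlines():
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].split()[0]   # taxon name = first word of header
            seqs[name] = []
        elif name:
            seqs[name].append(line)      # sequence may span several lines
    seqs = {n: "".join(parts) for n, parts in seqs.items()}
    lengths = {len(s) for s in seqs.values()}
    if len(lengths) != 1:
        raise ValueError("sequences are not aligned (unequal lengths)")
    width = max(len(n) for n in seqs) + 2
    header = f"{len(seqs)} {lengths.pop()}"
    body = "\n".join(f"{n:<{width}}{s}" for n, s in seqs.items())
    return header + "\n" + body

# Hypothetical two-taxon toy alignment
example = ">taxon_A\nACGT-ACGT\n>taxon_B\nACGTTACGT\n"
print(fasta_to_phylip(example))
```

The PHYLIP header line (number of taxa, then alignment length) is what most tree-inference programs read first, so the length check matters: an unaligned FASTA file would silently produce an invalid matrix.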

There is also the possibility of generic cloud storage, which is often free, or at least available to users of certain operating systems or programs. Unfortunately, this is entirely a short-term option, no different from a personal home page; and it may be a target for hackers, anyway.


Final comment

One frequently raised issue not mentioned so far is the concept of a gray area of social or personal responsibility. That is, there might be unforeseen or undesirable consequences to a general obligation to provide full documentation of primary data. This is always an issue in the medical and social sciences, for example, where the exposure of personal data might lead to societal problems. Even in palaeontology, there may be legitimate concerns about, for example, making the GPS coordinates of special fossil sites publicly available.

However, there is nothing to stop an author highlighting such issues at the time of their manuscript submission, and the editor asking for comments from the reviewers, who are supposed to be experts in the particular field.

Some further relevant links (please feel free to point out more)

Join the discussion by using our comments below; or provide your answer to the open question at the PeerJ Questions portal: Should we be forced to publish primary data integral to our results?

Twitter has the hashtag #OpenData, used by people / organisations promoting or providing open data, as well as those who are (so far) only allegedly dedicated to it (such as Springer-Nature and RELX-Elsevier).

The open source software environment RStudio for R allows knitting and publishing HTML files (and other file formats) on their RPubs server, which can be a convenient way to permanently store your R-obtained results and scripts (e.g. Potts & Grimm, 2017).

Preprint servers such as arXiv, bioRxiv, and PeerJ Preprints also provide the option to attach supplementary data files (there are usually size limits), using a wide range of file formats including zipped archives. arXiv had to end its data storage programme in 2013, but still accepts "ancillary files" for raw data, code, etc. "up to a few MB" (which should be enough for a phylogenetic data matrix).

For Austrian/German-speaking users, as noted above, there is Austria's new Open Data Portal (ODP). So far, German is the only language selectable from the drop-down menu, but there seem to be no registration restrictions.