Tuesday, December 20, 2016

Isogloss maps are hypergraphs are bipartite networks


Linguists are a very special people. They are very proud, especially when biologists tell them how to do phylogenetic analyses; but their pride is often also justified, as many phylogenetic concepts were initially or independently developed by linguists, be it the family tree model, proposed by Čelakovský (1853) years before Darwin's (1859) tree, or even the cladistic principle of synapomorphies, which are called "exclusively shared innovations" in linguistics (see Brugmann 1884).

Linguists also invented one interesting kind of data-display which so far has never been used by biologists (at least as far as I know): maps of isogloss boundaries. The term "isogloss" is unfortunate, as it has multiple usages in linguistics, and its history seems to go back to a naive borrowing from chemistry (but I have not really followed the literature here). On most occasions, it just means "shared trait". That is, it denotes a feature shared between two or more languages; and given that languages may share many different features, isoglosses for a group of related languages may yield a very complex type of data. Isoglosses are somehow related to the wave theory, the arch-enemy of the family tree in linguistics, which I described as a mystical theory some time ago, since it never really made it to a clear-cut model that could be formalized (The Wave Theory: the predecessor of network thinking in historical linguistics).

Some linguists, nevertheless, insist that the waves that are the core of the wave theory are nothing other than isoglosses. More specifically, the waves represent innovations that contribute to the separation of languages (a change in pronunciation of a word here, a change in grammar there), but which are not transmitted vertically — they spread across the speakers of a language and may even cross linguistic borders. One early visualization of these waves can be found in Bloomfield (1933), as shown here:


What Bloomfield essentially does here is pick certain traits of Indo-European languages, calling them isoglosses, and arrange them on a quasi-geographic map of Indo-European languages in such a way that all languages sharing a trait are inside one of these isogloss boundaries.

Only recently did I realise what this actually means, when I found the "Bible of Network Theory" by Newman (2010) and started reading at a random page, which, as it turned out, treated hypergraphs. Hypergraphs, as I learned from Newman, are graphs in which one edge can connect more than two nodes, and Newman used exactly the same visualization for these hyperedges as Bloomfield had used in 1933, when Bloomfield did not know that he was actually proposing a rather complex network structure.

Even more interesting than the complex graph structure is that hypergraphs can likewise be displayed, without losing any information, as bipartite networks, in which we distinguish two fundamental kinds of nodes and in which connections are only allowed between nodes of different kinds. In order to do so, one just converts each hyperedge into a node that connects to all the nodes (languages in our case) that the hyperedge connects in the hypergraph. In the same way that Bloomfield labeled the hyperedges in his legend, we can label the isogloss nodes that connect to the languages. The following image shows the resulting bipartite network for Bloomfield's hypergraph:


If you now ask what this tells us after all, I will disappoint you: so far it does not tell us anything, as it is just a display of the data in a different fashion. Note, however, that hypergraph visualization is not a trivial problem, and if you have enclaves not sharing a trait, it may even be impossible to visualize a hyperedge in two-dimensional space with a single line drawn around all of its nodes. Bipartite networks are easier to handle in this regard. Even more importantly, however, bipartite graphs are also easy to handle algorithmically, and biologists are currently developing new methods to handle them (Corel et al. 2016).
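To make the conversion from hyperedges to isogloss nodes concrete, here is a minimal Python sketch. It only uses the two isoglosses that are spelled out in this post, not Bloomfield's full map, and the label "sibilants for velars" as well as the function names are my own assumptions for the purpose of illustration.

```python
# Toy data (assumption): only the two isoglosses described in this post,
# each hyperedge being the set of language groups that share the trait.
hypergraph = {
    "sibilants for velars": {"Balto-Slavic", "Indo-Iranian", "Armenian", "Albanian"},
    "past e-": {"Greek", "Indo-Iranian"},
}


def to_bipartite(hyperedges):
    """Turn each hyperedge into an isogloss node linked to its languages."""
    return {
        (isogloss, language)
        for isogloss, languages in hyperedges.items()
        for language in languages
    }


def to_hypergraph(edges):
    """Invert the conversion by regrouping the languages per isogloss node."""
    hyperedges = {}
    for isogloss, language in edges:
        hyperedges.setdefault(isogloss, set()).add(language)
    return hyperedges


edges = to_bipartite(hypergraph)
for isogloss, language in sorted(edges):
    print(f"{isogloss} -- {language}")

# The round trip gives back the original hyperedges, so nothing was lost.
assert to_hypergraph(edges) == hypergraph
```

Every edge in the result links one isogloss node to one language node, and never two nodes of the same kind, which is all that "bipartite" requires. The round trip only works as long as every isogloss covers at least one language, which seems a safe assumption for isogloss maps.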

If we visualize the Bloomfield data in a bipartite network using network visualization software such as Cytoscape, we can conveniently explore the data, and arrange the nodes in order to search for patterns in the isoglosses. The following visualization, for example, shows that Bloomfield chose the data well in order to illustrate the amount of conflicting, apparently non-tree-like, signal in Indo-European languages (remember that linguists tend to dislike trees, but not necessarily in a productive way), as the data describes more of a circular structure than a strict hierarchy.
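What "conflicting, non-tree-like signal" means can be sketched in a few lines of Python. The sketch assumes that each isogloss is read as an exclusively shared innovation that defines a subgroup on a rooted tree; under that assumption, two isoglosses fit the same tree only if their language sets are nested or disjoint. The data are again just the two isoglosses spelled out in this post, and all names in the code are hypothetical.

```python
from itertools import combinations

# Toy data (assumption): the two isoglosses described in this post.
isoglosses = {
    "sibilants for velars": {"Balto-Slavic", "Indo-Iranian", "Armenian", "Albanian"},
    "past e-": {"Greek", "Indo-Iranian"},
}


def compatible(a, b):
    """Two language sets fit one rooted tree if they are nested or disjoint."""
    return a <= b or b <= a or a.isdisjoint(b)


def conflicts(traits):
    """List all pairs of isoglosses that cannot both be subgroups of one tree."""
    return [
        (x, y)
        for x, y in combinations(sorted(traits), 2)
        if not compatible(traits[x], traits[y])
    ]


for x, y in conflicts(isoglosses):
    shared = sorted(isoglosses[x] & isoglosses[y])
    print(f"'{x}' conflicts with '{y}' (overlap: {', '.join(shared)})")
```

Here the two isoglosses overlap in Indo-Iranian without one containing the other, so they are reported as a conflict. Such a check only detects the conflict in the data display; it says nothing about how the conflict came about.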


In order to really interpret this kind of data, however, we should not forget that this is still a data-display network. It is by no means a phylogenetic analysis, as we only show how a certain amount of data selected by a scholar is distributed over the given language groups. A true phylogenetic analysis will need to interpret these data, making bold claims about the history of those shared traits.

The occurrence of sibilants (s-like sounds, like [s, z, ʃ, ʒ]) in place of certain velar sounds (k-like sounds, like [k, g, x]), for example, is a trait shared by Balto-Slavic, Indo-Iranian, Armenian, and Albanian, but this does not mean that they all inherited it from a common ancestor, as the process of palatalization, by which velar sounds turn into affricates and fricatives (compare French cent, from Latin centum, which was pronounced kentum), is very frequent in the languages of the world, and may well reflect independent evolution.

Apart from independent development, which would actually force us to revise our network by deleting the respective edges because they are not homologous in the strict sense, we may also have to deal with differential loss. This quite likely happened with the shared feature labeled as "past e-" in the network, referring to the past tense in Ancient Greek and Indo-Iranian, which was augmented by the prefix e-.

A further reason for those commonalities labelled as isoglosses by linguists may also be simple lateral transfer due to language contact.

Proponents of the wave theory have taken this kind of data as proof that the family tree model is essentially wrong. While I would agree that the family tree model shows only a certain aspect of language evolution, and may therefore be boring at times (and even wrong, if we do not manage to correctly interpret the nature of shared traits), I have a hard time understanding why linguists still insist that isogloss maps are an alternative model of language evolution. They are surely not, in the same way in which splits graphs are not phylogenetic networks, as David emphasized in a recent blogpost.

Unless we add the missing time dimension and analyse how the shared traits originated, isogloss maps and hypergraphs will remain nothing more than an interesting form of data visualization. Given the recent research on bipartite networks, however, we may have some hope that the mysterious waves in historical linguistics may not only find a formal model of representation, but even bring us to the point where we gain new insights into the history of our languages.

References
  • Bloomfield, L. (1973 [1933]) Language. Allen & Unwin: London.
  • Brugmann, K. (1884) Zur Frage nach den Verwandtschaftsverhältnissen der indogermanischen Sprachen [Questions regarding the closer relationship of the Indo-European languages]. Internationale Zeitschrift für allgemeine Sprachwissenschaft 1. 228-256.
  • Čelakovský, F. (1853) Čtení o srovnavací mluvnici slovanské [Lectures on comparative grammar of Slavic]. V komisí u F. Řivnáče: Prague.
  • Corel, E., P. Lopez, R. Méheust, and E. Bapteste (2016) Network-thinking: graphs to analyze microbial complexity and evolution. Trends Microbiol. 24.3: 224-237.
  • Darwin, C. (1859) On the origin of species by means of natural selection, or, the preservation of favoured races in the struggle for life. John Murray: London.
  • Newman, M. (2010) Networks. An Introduction. Oxford University Press: Oxford.

1 comment:

  1. Just a guess, but the term 'isogloss' probably comes from a semantic extension of XIXth century usage of 'isophone' (which denoted not precisely "phonetic isoglosses" as today, but homophones between languages) and 'isoseme', terms used in the science of "etymology". Here an example from an Italian manual: https://books.google.com.br/books?id=1Ig_AAAAIAAJ&pg=PA244&dq=isofona&hl=pt-BR&sa=X&redir_esc=y#v=onepage&q=isofona&f=false
