Meeting Darwin's last challenge: toward a global tree of human languages and genes (LanGeLin)

ERC Advanced Grant project, 2012-2018

Professor Giuseppe Longobardi (Principal Investigator)

State of the art

In the Origin of Species, Darwin (1859, ch. 14) predicted that: "If we possessed a perfect pedigree of mankind, a genealogical arrangement of the races of man would afford the best classification of the various languages now spoken throughout the world; and if all extinct languages, and all intermediate and slowly changing dialects, were to be included, such an arrangement would be the only possible one". Darwin's prediction is the natural point of departure for an innovative, frontier research project on the crucial interaction between biology and historical linguistics, that is, on the population-language congruence hypothesis. The LANGELIN project aims precisely to renew the multidisciplinary tools for testing this hypothesis, thus contributing to C. Renfrew’s so-called 'New Synthesis' and to the study of human diversity.

In biology, population comparisons have made decisive progress over the past few decades thanks to advances in genetics, prompting attempts to address quantitatively the problem posed by Darwin’s prediction. Some geneticists (Sokal 1988; Cavalli-Sforza et al. 1988) argued for a large-scale correspondence between the distribution of classical genetic markers (blood groups, serum proteins, etc.) and certain long-range language classifications found in the linguistic literature. Their work has met with serious criticism, however, and has remained very controversial, especially among linguists, since virtually no professional historical linguist unconditionally accepts the reliability of the linguistic genealogies used as matches in such experiments. Indeed, most linguists have denied the very possibility of a reliable global or long-range classification of languages, for serious and valid methodological reasons. These reasons brought the interdisciplinary debate on large-scale population-language congruence close to a dead end; we believe theoretical linguistics is now in a position to overcome them.

More recent genetic work also showed that, in a large number of case studies, patterns of genetic and linguistic diversity do appear locally well correlated (Barbujani 1991; Chen et al. 1995; Belle and Barbujani 2007; Tishkoff et al. 2009).

However, to address and test the congruence hypothesis at a general, worldwide rather than regional, level, a broad-scope analysis of linguistic and genetic variation is inevitably needed. This has proved an impossible challenge so far, because of the weakness of long-distance language taxonomies mentioned above and the lack of dedicated datasets with a sufficient number of populations that speak different languages and have, at the same time, been typed for many genetic polymorphisms. We now propose to tackle these problems by reconstructing the relationships within a sufficiently global selection of languages by means of a radically new taxonomic method (PCM, Longobardi and Guardiano 2009), and by comparing these relationships with those inferred from the DNA of the corresponding human populations, reconstructed with the tools made available by the genomic revolution of the last ten years. For this purpose, linguists and geneticists will build a dedicated database, for the first time jointly selecting the most relevant and realistically accessible pairs of languages/populations.

Problems with linguistic methods

Any linguistic taxonomic method with global ambitions should be able to identify sets of correspondence characters that are both safe from chance (i.e. probabilistically reliable) and universally applicable. The two methods used so far (i.e. the classical comparative method and Greenberg's mass comparison) are essentially based on entities ultimately characterized by lexical arbitrariness (broadly understood to include roots and grammatical morphemes, as well as the sound laws connecting them crosslinguistically), and each fails with respect to either universality or reliability.

The soundness of the classical comparative method as a scientific tool is warranted by the fact that common ancestry is acknowledged only if two languages precisely agree in very improbable phenomena, like regular or at least recurrent sound correspondences. Yet, precisely because it is based on the detection of improbable, hence rare, correspondences, the method applies well only to languages so similar that their kinship is often already obvious, and it is unfit for long-range comparisons. It can to some extent measure relative distances (e.g. by counting shared cognates or sound laws), but of course it cannot be applied universally, i.e. to languages for which etymological cognates have not been safely identified.

Greenberg's mass comparison takes 'universal' lists of meanings and checks their expression in many different languages at a time, guessing common ancestry simply from overall surface resemblance in sound/meaning. The identification of (degrees of) language relatedness is undermined by the difficulty of establishing any serious probabilistic criterion for cognation (cf. Ringe 1992 among many others), let alone of computing precise linguistic distances. Therefore, Greenberg's approach produced many highly controversial taxonomic proposals.

For all these reasons, lexical methods have proved to be a poor choice as heuristics for wide-range language phylogenies and parallelisms with biological classifications. We want to pursue this goal in a radically different way.

New linguistic tools

As hinted above, reliable population classifications were indeed made possible by a deeper level of evidence (genetic markers) discovered by recent advances in molecular biology: the new data (drawn from a universal list of discrete options, i.e. DNA polymorphisms), though less accessible to immediate observation, are more immune from environmental selection and thus more likely to retain genealogical information than were external morphological characters. For language classifications, instead, no comparable exploitation of theoretical advances in linguistics has been systematically attempted. The cognitive revolution prompted, inter alia, by the rise of formal grammar and Chomsky's so-called 'biolinguistic' program has so far completely neglected phylogenetic issues and related problems central to the historical-linguistic paradigm.

Yet, we contend that modern linguistic theories can provide taxonomic characters formally more analogous to genetic markers and equally apt to serve phylogenetic purposes (Longobardi 2003). Here, we note that the basic intuition behind several typological and formal theories is that many syntactic differences, like genetic polymorphisms, are drawn from a universal list of discrete options (called parameters in formal grammar). Again like genetic markers (Cavalli-Sforza et al. 1994: 18), parameters are also likely to be less subject to instability induced by adaptive pressures from the material environment, and seem more immune than other cultural traits to individual conscious innovations.

Owing to their universal and discrete nature, parameters as taxonomic characters combine the two strengths of the classical method and of mass comparison (without sharing their respective disadvantages): comparisons of parameter values are potentially able to 1) overcome all issues about the safety of correspondences, which affect Greenberg's method; 2) be applied to any set of languages, unlike the classical method. Since the core grammar of every natural language can in principle be represented by a string of binary symbols (e.g. a succession of 0,1 or +,-; cf. Clark and Roberts, 1993), each coding the value of a parameter, such strings can be unambiguously collated and used to define exact correspondence sets.
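
As a minimal sketch of how such strings could be collated, the Python fragment below aligns two parameter strings position by position and sorts the positions into identities and differences. The six-parameter strings, the variable names and the function name `collate` are hypothetical and serve only as an illustration; they are not taken from any actual parametric database.

```python
def collate(a: str, b: str) -> tuple[list[int], list[int]]:
    """Return (identities, differences) as lists of 0-based parameter positions."""
    if len(a) != len(b):
        raise ValueError("both languages must be coded for the same parameter list")
    identities = [i for i, (x, y) in enumerate(zip(a, b)) if x == y]
    differences = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    return identities, differences


# Hypothetical values of six binary parameters in two languages.
lang_a = "++-+-+"
lang_b = "+--+++"

identities, differences = collate(lang_a, lang_b)
print("identities at positions: ", identities)
print("differences at positions:", differences)
```

Because every position holds one of two symbols, each comparison has a clear-cut outcome, which is what makes the correspondence sets exact.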

The idea that certain structural traits of languages (crucially including a subset of syntactic features) can encode a relevant phylogenetic signal was pioneered in Nichols' (1992) work, and was tested empirically by Dunn et al. (2005) on a specific language area, providing encouraging hints. We claim that this is true to a very general extent, provided that comparison is conducted with a significant number of taxonomic characters formulated at the appropriate level of abstractness (parameter values).

In fact, parameters cannot simply be equated with structural traits, since they try to represent more 'abstract' differences, often exhibiting a high degree of deductive depth with respect to surface syntactic contrasts: e.g. many observable distinctions between two related languages often reduce to a single parametric difference, which presumably arose exactly once in the divergence history of the two languages. Therefore, counting similarities in surface patterns rather than in parameter values could turn out to be misleading for arithmetically assessing areal or genealogical relatedness. The potential drawback of building phylogenies on surface (though structural) differences, rather than on supposed primitives of syntactic variation, grows substantially when moving from a circumscribed set of similar languages to a more extensive and geographically large-grained basis, as we are planning to do for LANGELIN, since we must compare quite diverse varieties, whose surface properties do not single out immediate points of minimal collation (also cf. Roberts 1998 for relevant remarks). On the surface, the risk of missing correspondences or mistaking pattern similarities for real correspondences presents itself again, as in lexical mass comparisons. Parametric correspondences are thus more likely to play the role of remote-from-surface but highly precise and reliable evidence that DNA polymorphisms play in population genetics.

Moreover, owing to their formal properties, parameter values can exactly measure syntactic distances among languages; therefore they can serve as a perfect input for replicable clustering hypotheses testable through such bio-statistical algorithms as justly advocated in McMahon and McMahon (2005). No single binary parameter can ever prove kinship between two languages, of course; however, since parametric comparisons yield clear-cut answers (discreteness), one can provide calculations of probabilistic thresholds, contrary to the situation of mass comparison.

Longobardi and Guardiano (2009) further developed the insights of Longobardi (2003) into a new operational strategy termed the Parametric Comparison Method (PCM). The experiment consisted of:

the formulation of over 60 binary parameters (covering a compact module of syntax, i.e. the nominal subdomain), with values set in more than 25 prevailingly Euro-Mediterranean languages from different families with well-known histories;
the elaboration of Jaccard distance matrices among such languages precisely calculated from this parametric database, which were eventually fed to computer-run phylogenetic algorithms.
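
The two steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the distance is taken to be the number of differences divided by the number of identities plus differences, counted over the parameters that are set in both languages; the symbol `0` for a parameter that is not set, the language labels and all values are hypothetical. Whether this matches the exact coefficient of the cited study in every detail is not confirmed here.

```python
from itertools import combinations

# Hypothetical parameter values: '+' and '-' are set values, '0' marks a
# parameter that is not set in that language (an assumption of this sketch).
LANGUAGES = {
    "LangA": "++-+0-",
    "LangB": "+--+--",
    "LangC": "-+++0+",
}


def parametric_distance(a: str, b: str) -> float:
    """Differences / (identities + differences) over comparable parameters."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "0" and y != "0"]
    if not pairs:
        raise ValueError("no comparable parameters")
    differences = sum(1 for x, y in pairs if x != y)
    return differences / len(pairs)


def distance_matrix(langs: dict[str, str]) -> dict[tuple[str, str], float]:
    """Distance for every unordered pair of languages, keyed by sorted names."""
    return {
        (p, q): parametric_distance(langs[p], langs[q])
        for p, q in combinations(sorted(langs), 2)
    }


matrix = distance_matrix(LANGUAGES)
for pair, d in matrix.items():
    print(pair, round(d, 3))
```

A matrix of this kind is the sort of input that distance-based phylogenetic programs expect, after conversion to the file format each program requires.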

The results obtained have been very satisfactory:

the distribution of calculated language distances is clearly non-random and calls for historical explanation;
the numerical and computational tools provided correct phylogenies (under repeated tests with some of Felsenstein’s PHYLIP programs; also cf. Rigon 2009) in independently known cases;
the probability of relatedness for the closest language pairs has been shown to be clearly significant by a detailed large-scale computer-run experiment of random generation (Longobardi, Bortolussi, Guardiano, Sgarro 2011);
tests on the rate of evolution of parametric syntax and lexicon suggest more stability for the former.
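
The logic of a random-generation test can be illustrated with a toy Monte Carlo simulation. The sketch below is not the procedure of the cited experiment: it assumes, purely for simplicity, that all parameters are independent and that both values are equally likely, and it uses a plain proportion of differing positions as the distance. The function names and the figures passed to them are hypothetical.

```python
import random


def proportion_of_differences(a, b):
    """Share of positions at which two equally long value lists differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def chance_probability(observed, n_params=60, trials=2000, seed=0):
    """Estimate how often two random languages are at least as close as `observed`.

    Assumption of this sketch: parameters are independent and equiprobable.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        a = [rng.randint(0, 1) for _ in range(n_params)]
        b = [rng.randint(0, 1) for _ in range(n_params)]
        if proportion_of_differences(a, b) <= observed:
            hits += 1
    return hits / trials


# A very small distance is hardly ever reached by chance, while a
# distance near one half is reached by about half of the random pairs.
print(chance_probability(0.10))
print(chance_probability(0.50))
```

Under these assumptions, a distance far below the bulk of the random distribution is unlikely to be accidental, which is the kind of probabilistic threshold that discrete characters make computable.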

Thus, the idea that parametric syntax encodes a relevant phylogenetic signal, a perfectly falsifiable empirical hypothesis, is strikingly confirmed by the tests conducted so far.

An exploration of the applicability of these new perspectives to the language-gene problem has already been pursued by the PI and the Co-I in Colonna, Boattini, Guardiano, Pettener, Longobardi, Barbujani (2010). That paper shows that there is indeed a correlation between linguistic distances, calculated according to Longobardi and Guardiano (2009) in a subset of eight populations, and the respective genetic distances, and that this correlation is apparently not mediated by geographic distances.

On these grounds it is now conceivable to apply Longobardi and Guardiano's (2009) method in a more global and thorough perspective, construing its eventual outcomes as a 'grammatical anthropology' along the model of molecular anthropology and against the background of its classifications. Refining the available parametric toolkit (parameters and questionnaires to elicit data from native speakers), we propose to probe the nominal syntax of a manageable but significant number of languages, distributed across 5 continents and the major established or alleged macro-families, to arrive at a sufficiently global genealogical tree. The latter will be compared to that provided by DNA polymorphisms.

Advances in genetic methodology

On the biological side, recent years have seen an impressive increase in the available information on human genome diversity, and a radical improvement in the bio-statistical methods used to describe and interpret such diversity. These advances have profoundly influenced modern views on the timing of important events in human evolutionary history, emphasizing the continuum between processes at the phylogenetic and population-genetic scales. The already available information, stored in public databases or otherwise accessible, will allow a first set of analyses and comparisons with the existing linguistic data, a strategy already pursued by Longobardi, Barbujani and prospective collaborators of the LANGELIN team in Colonna et al. (2010). However, the match between the populations and the languages already studied is not always available or exact, so these analyses will essentially represent a further preliminary test of the potential and the pitfalls of the proposed approach.

A large set of worldwide populations, strategically selected on the basis of both linguistic and anthropological considerations, will then be sampled on purpose and typed using state-of-the-art facilities for large-scale analysis of the genome. Approximately 1 million single-nucleotide polymorphisms (SNPs) will be typed, covering both the autosomes and the regions of the genome transmitted by one parent only, namely the Y chromosome and mitochondrial DNA. In this way, it will also be possible to take into consideration the different migrational histories of males and females, and their impact on the global language-gene relationships. The Illumina platform we plan to use for LANGELIN is characterized not only by the high representation of slow-evolving markers such as SNPs, but also by the possibility of comparing its high-throughput genotypes with the data already present in public databases (like the Human Genome Diversity Panel - HGDP-CEPH; Li et al. 2008) or generated by genome-wide association studies (as in Heath et al. 2008).

This dataset will represent the starting point for the statistical analyses, which will include:

correlations of measures of genetic and linguistic distance;
comparisons of trees inferred from genetic and linguistic distances;
comparisons of the geographic location of zones of increased genetic change, or genetic boundaries, with the distribution of language barriers;
computer simulations of evolutionary models based on the inferred linguistic differences, and quantitative comparison of the observed genetic data with those simulated under various such models.
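
The first analysis in this list, the correlation of two distance matrices, is commonly assessed with a permutation procedure of the Mantel type, because the entries of a distance matrix are not independent of one another. The source does not name a specific test, so the Python sketch below is only one possible choice; the two 5 x 5 matrices are invented toy values and the function names are hypothetical.

```python
import math
import random


def upper_triangle(m):
    """Entries above the diagonal of a square matrix, row by row."""
    n = len(m)
    return [m[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def mantel(a, b, permutations=999, seed=0):
    """Return (r, p): matrix correlation and one-sided permutation p-value."""
    rng = random.Random(seed)
    n = len(a)
    observed = pearson(upper_triangle(a), upper_triangle(b))
    at_least = 1  # the observed arrangement counts as one permutation
    for _ in range(permutations):
        order = list(range(n))
        rng.shuffle(order)
        # Relabel the populations of b: rows and columns move together.
        shuffled = [[b[order[i]][order[j]] for j in range(n)] for i in range(n)]
        r = pearson(upper_triangle(a), upper_triangle(shuffled))
        if r >= observed - 1e-12:
            at_least += 1
    return observed, at_least / (permutations + 1)


# Invented toy distances among five populations (symmetric, zero diagonal).
LINGUISTIC = [
    [0.00, 0.10, 0.30, 0.60, 0.70],
    [0.10, 0.00, 0.25, 0.55, 0.65],
    [0.30, 0.25, 0.00, 0.50, 0.60],
    [0.60, 0.55, 0.50, 0.00, 0.20],
    [0.70, 0.65, 0.60, 0.20, 0.00],
]
GENETIC = [
    [0.00, 0.02, 0.04, 0.09, 0.11],
    [0.02, 0.00, 0.05, 0.08, 0.10],
    [0.04, 0.05, 0.00, 0.07, 0.09],
    [0.09, 0.08, 0.07, 0.00, 0.03],
    [0.11, 0.10, 0.09, 0.03, 0.00],
]

r, p = mantel(LINGUISTIC, GENETIC)
print(f"r = {r:.3f}, p = {p:.3f}")
```

Controlling for a third matrix such as geographic distance would require a partial version of the test, which this sketch does not implement.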

The first step for all these analyses will be the estimation of matrices of genetic distances. We refer to matrices, rather than a single matrix, because whole-genome SNPs and haplotype data on the Y chromosome and the mtDNA provide independent and complementary information. This will be an important factor of novelty of the project: we will separately analyse different portions of genome, characterized by different mutation rates, likely to better discriminate either major continental groups, or closely-related populations.

Research strategy

We will address the basic question about the potential correlation of genetic and linguistic distances, but also the issue of gene/language parallelism with respect to vertical vs. horizontal (genetic admixture/syntactic borrowing) transmission. To address the latter issue, we plan to place special focus on genetic/syntactic comparisons in the two contact areas whose problematic status already emerges from Longobardi and Guardiano (2009): the classical Balkan domain (with a relevant extension to Southern Italy) and North-Western Europe (including the British Isles). Most preliminary analyses of gene-language correlations will deal with Europe, which was the first continent studied in the modern scientific search for the relationships between genetic and linguistic variation (Sokal 1988, Barbujani et al. 1995). Based on previous studies, the outcome will probably be positive. However, a subtler question is to what extent the apparent correlations may depend on the particular linguistic classifications employed, or on the confounding factor of other, correlated variables, mainly geographic distance: claims that gene diversity prevailingly correlates with geography have recently been revived (Novembre et al. 2008) precisely for Europe, while the syntactic distances of Longobardi and Guardiano (2009) do not obviously correlate with geographical ones (Rigon 2009, Colonna et al. 2010).

When extending the analysis to other continents, a manageable number of strategically selected languages will be used to test some controversial long-distance proposals and lay the grounds for global comparison; the collection areas can be preliminarily divided into three macro-domains: Asia (with an appendix in Australasia), Africa and the Americas.

A) In Asia, we plan to perform parametric analyses of a core of 8 or 9 varieties which, once added to Longobardi and Guardiano’s (2009) sample, will be sufficient to begin testing a huge number of proposed but controversial long-range taxonomic hypotheses:

the possible unity of Japanese with Korean and of both with Turkish;
the possible grouping of Indo-European, Uralic, Altaic, and possibly Dravidian (Greenberg's Eurasiatic superfamily) as distinct from Semitic (and other Afro-Asiatic languages in Greenberg's terms), rather than the grouping of all of them as branches of a 'Nostratic' superfamily;
the remote possibility of detecting the even more weakly supported relation proposed by some between Basque and Chinese as opposed to the languages of the previous point;
the existence of a South-Asian syntactic convergence area at least between Indo-Aryan and Dravidian;
the existence of some syntactic convergence also in the Far-East, between Korean, Japanese, Chinese, and possibly Austronesian languages such as Maori.

B) The addition of a few new African languages to Longobardi and Guardiano’s (2009) sample will enable us to use PCM to test the solidity of Greenberg’s two most ambitious African proposals:

the Afro-Asiatic super-family, which in our sample should group together Hebrew and Arabic closely with Amharic, and more distantly with Somali, and the Omotic variety we plan to sample for the project;
the Niger-Congo super-family, into which PCM should eventually collapse a language from the extreme West of Africa (Wolof) with Eastern Bantu varieties.

C) One of the most controversial domains of linguistic and genetic classification is offered by the Americas. Even a choice of 6 or 7 languages from this domain, if crucially including Navajo on one side, and several non-Na-Dené varieties on the other, is sufficient to address the most famous and disputable long-range proposal of all, Greenberg's alleged Amerind super-family. This is a remarkable objective for the interdisciplinary ambition of LANGELIN, also because, in the harsh debate around Greenberg’s daring idea, a much-debated role has been played by Cavalli-Sforza’s and other scholars’ attempts at matching biological classifications.

Thus, the linguistic research at this stage plans to achieve three different goals: i) testing the most fascinating and controversial existing hypotheses of super-families by a new method theoretically exempt from the shortcomings of Greenberg's practice; ii) providing measures of long-range linguistic distances comparable with the genetic distances of corresponding populations; iii) laying the basis for the global tree of languages (and corresponding genes) to be addressed in the final stage of the research.

Linguists and geneticists, after choosing together the languages/populations to be sampled, will collect data in parallel, according to specific needs and criteria of their respective fields. The geneticists will sample 20-25 individuals from the extra-European populations corresponding to the selected linguistic varieties (cf. again Fig. 1). For mathematical data treatment and reconstruction of the genetic and linguistic genealogies the two groups of researchers will interact closely, adopting the same statistical tools, either by selecting them among those of evolutionary biology, or by elaborating new ones on purpose.

Matches and possible conclusions: toward a global tree of genes and languages

The final natural step toward searching for the possible congruence between populations and languages, and ultimately testing Darwin's prediction, will be the systematic comparison, on a global scale, of the phylogenies resulting from the previous research stages.

Starting from genetic distance- and character-matrices and from the corresponding syntactic distance- and character-matrices, geneticists and linguists will construct global phylogenetic trees and networks by means of various algorithms; the amount of similarity between the respective topologies will be computed quantitatively and qualitatively. If a salient amount of dissimilarity between genetic and syntax-based trees is found, the necessary conclusion will be that Darwin's hypothesis and Cavalli-Sforza’s claim were in fact too strong, and that the diffusion of languages does not significantly correlate with the distribution of populations. If, instead, we find a significant amount of similarity, this will crucially support Darwin's hypothesis and give evidence that the transmission of languages is normally accompanied by robust demic expansion of the corresponding populations. In the latter case we will proceed to carefully single out the minority of non-overlapping portions of our trees (i.e. cases of language- or gene-substitutions) and to propose them for a historical explanation. In either case, it will be important to single out all the cases where splits in the linguistic and the genetic trees are solidly congruent, and linguists will be in the unique position to exploit genetic dating techniques (molecular clocks) to hypothesize pre-historic dates of separation between languages. This is a crucial point, where interdisciplinary research can generate real scientific progress.
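
One simple way to quantify the similarity between two topologies is to count the groupings (clades) that appear in one tree but not in the other, in the spirit of the Robinson-Foulds distance. The source does not specify which measure will be used, so the Python sketch below is only an illustration; the nested-tuple tree encoding, the function names and the four-population trees are all hypothetical.

```python
def leaves(tree):
    """Set of leaf labels under a node; a leaf is a plain string."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def clades(tree):
    """Leaf sets below every internal node of a rooted tree, root excluded."""
    found = set()

    def visit(node, is_root):
        if isinstance(node, str):
            return
        if not is_root:
            found.add(leaves(node))
        for child in node:
            visit(child, False)

    visit(tree, True)
    return found


def clade_distance(t1, t2):
    """Number of clades present in exactly one of the two rooted trees."""
    if leaves(t1) != leaves(t2):
        raise ValueError("both trees must contain the same populations")
    return len(clades(t1) ^ clades(t2))


# Hypothetical rooted trees over the same four populations/languages.
genetic_tree = (("A", "B"), ("C", "D"))
syntactic_tree = ((("A", "B"), "C"), "D")

print(clade_distance(genetic_tree, genetic_tree))    # identical topologies
print(clade_distance(genetic_tree, syntactic_tree))  # partly congruent topologies
```

The clades shared by both trees are the solidly congruent splits, while those counted by the distance point to the non-overlapping portions that call for a historical explanation.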

Whatever the final answer to the question raised by Darwin's prediction, LANGELIN will be able to support the "New Synthesis" enterprise with a revolutionary contribution: the historical application of formal grammar, the use and refinement of PCM, the exchange of the most recent methods and results between linguistics and genetics, and the study of jointly selected samples of languages/populations are likely to shape a new discipline at the crossroads of bio-cognitive and historical sciences, grammatical anthropology. This approach will stand alongside molecular anthropology as a tool for investigating the balance between demic and cultural diffusion of languages, and for better complementing archaeology and palaeo-anthropology in the study of human origins and dispersal.