A Comparative Study of the Andean Languages

Introduction to Our Database





A Database for Sensitive Measurements of Similarity in Lexical Semantics

The Basic Structure of our Database

Our Data for Three Simple Illustrative Meanings

Four    Three    Tooth


Our method for calculating the degree of similarity between any pair of languages in their lexical semantics has been deliberately designed to be as sensitive as possible.  Above all we have sought to get away from the blunt and flawed assumption that one can always identify one and only one lexeme for any one given meaning.

Sensitivity to the complex reality of the overlaps and differences between languages in their lexical semantics necessarily entails a much more complex database structure than for traditional lexicostatistics.  It is therefore not possible to represent in simple tables the complex form of our database structure.  Our full database is in Microsoft Access 2000/XP, and it is this that we plan to release for download from this webpage in a user-friendly version once we have published our own first articles on the basis of our results (January 2005), and pending the outcome of an application for funding to support this (March 2005).

In the tables below we simply present a few of the simplest types of cases, in a much simplified format relative to that in our full Access database, to illustrate the types of data we shall be making available.


The Basic Structure of our Database

For each meaning in our list of 150, we have a ‘reference set’ of all the different unrelated morphemes found in that meaning across all the Andean languages in our study.  Different phonetic realisations of the ‘same’ morpheme, such as regional pronunciations kimsa or kinsa for the meaning three, are not unrelated but correlate with each other, so they both fall under the same ‘reference correlate’.  This is similar to the concept of a ‘cognate’, only in our database loanwords also count as correlates.  For full details, see our full article in Revista Andina (Heggarty 2005).  The correlate form itself is given in its assumed original proto-form (followed by other variants where it is not clear which is primary).  For more details see our separate webpage on this.

For any given list-meaning, each different correlate is given a reference number, generally depending on which language family it is most typically found in, and these are also colour coded, as follows:

   correlates 1 to 8             found most typically in the Quechua family;  or ‘pan-Andean’ in Quechua and other Andean language families

   correlates 11 to 14         found most typically in the Aymara family             

   correlate 16                    found only in Chipaya                  

   correlates 17 to 19         Spanish loanwords         


This numbering system is simply for practicality, for in most cases it is helpful to distinguish words found predominantly in one family.  It has no impact on our quantifications of similarity, however, nor should it be taken to imply any judgement on our part as to which family a given root is original to.  The only exception is for Spanish loanwords (i.e. correlates numbered from 17 to 19), to which we do award a special status, so that we can identify and minimise in our calculations the confusing impact of Spanish.  Spanish influence has affected different Andean varieties to greater and lesser extents, which reflects modern social factors rather than the original relationships between Andean language varieties that we are trying to investigate.

Indeed many roots, such as kimsa three or warmi woman, are found widely in both Quechua and Aymara varieties.  Whichever of the two families one might wish or be able to identify as the original source of a root like this, we normally put it within correlates 1 to 8, in this case as ‘pan-Andean’ correlates rather than specifically Quechua ones.  Again, this by no means implies any assumption on our part that it is necessarily more natively Quechua than Aymara.

(Chipaya was added fairly late to our database, and ideally we would, in further development of this method, increase the number of correlate slots given over to it.  Nonetheless, even the current system, in which we sometimes enter multiple different Chipaya roots in the single Chipaya slot, does not distort the results in any way, since Chipaya is the only variety we cover in the Uru-Chipaya family, and any roots it shares with any other Andean language family are entered in the correlate slots for that family.)
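
As an informal illustration of the numbering convention described above, the following Python sketch maps a correlate number to the group it belongs to.  It is not taken from our Access database:  the names CORRELATE_RANGES, correlate_group and is_spanish_loan, and the label strings, are assumptions made purely for this example.

```python
from typing import Optional

# Number ranges as listed above. The label strings are shorthand invented
# for this sketch, not field values from the actual database.
CORRELATE_RANGES = [
    (range(1, 9), "Quechua or pan-Andean"),   # correlates 1 to 8
    (range(11, 15), "Aymara"),                # correlates 11 to 14
    (range(16, 17), "Chipaya"),               # correlate 16
    (range(17, 20), "Spanish loanword"),      # correlates 17 to 19
]


def correlate_group(number: int) -> Optional[str]:
    """Return the group label for a correlate number, or None if unused."""
    for numbers, label in CORRELATE_RANGES:
        if number in numbers:
            return label
    return None


def is_spanish_loan(number: int) -> bool:
    """Only Spanish loanwords (17 to 19) are given a special status."""
    return correlate_group(number) == "Spanish loanword"
```

Here correlate_group(1) gives the Quechua or pan-Andean label and is_spanish_loan(17) is true, while numbers outside the listed ranges, such as 9 or 20, return None.  As stated above, the label says nothing about which family a root is original to.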



Our Data for Three Simple Illustrative Meanings

Here we illustrate only three simple meanings;  other cases become considerably more complex, but this also means that they become increasingly difficult to display in simple table formats, so we leave these for our full Access database structure to be published here shortly.


The Meaning four

The meaning four illustrates a fairly simple case where each language family (Quechua, Aymara, and Uru-Chipaya) has its own root not used in the other families;  indeed, Quechua has two different roots distributed across different varieties.  Some Quechua varieties have also lost the native root entirely and now use the borrowed Spanish form, so this too must be included in the reference set in order to represent those varieties.

1          tawa

2          ĉusku

11        puši

16        paqpik

17        cuatro


The Meaning three

The meaning three illustrates a case where there is no need to specify different correlates for Quechua and Aymara, since both share the same root.  In this case, correlate 1 is not just a Quechua correlate, but a wider ‘pan-Andean’ one.  Nor is any Spanish loanword specified, since all the Andean varieties we covered have retained a native form for this meaning.

1          kimsa

16        čhep


The Meaning tooth

The meaning tooth illustrates a more complex case, in several respects:

   All Aymara languages use not just one correlate but two morphemes in compounds which literally mean ‘mouth bone’.

   The bone root is the same correlate in both surviving branches of Aymara (southern and central), but the mouth root is different, so there is only partial overlap between the forms in the two branches.

   However, the full compound noun ‘mouth bone’ is not used in all cases, and the bone root is only added where necessary to be more specific.  When it is not, there is no overlap between the Aymara branches.

   Moreover, while the Aymara bone root is unknown in Quechua, the mouth root in central (but not southern) Aymara is indeed known in all the Quechua varieties we covered, though not in the specific meaning tooth but only in the related general meaning mouth (so there is some overlap in their lexical semantics for tooth, but only a little).

Even if there were space, this is not the place to go into the complexities of how our method analyses such constellations of overlapping data to give detailed quantifications of the degrees of difference between all the languages concerned;  for details, see Heggarty (2005) and Heggarty (in preparation), or download our Access database.  These sources explain how we represent in our database the various correlates listed here in the structures in which they are used in each individual language.  So for example Cuzco Quechua just has kiru for tooth, but also has a form of the correlate šimi for the related meaning mouth.  Jaqaru, meanwhile, has the compound šimi ĉ'akha (‘mouth bone’) for tooth, though it can also use just šimi where the specific meaning (tooth as distinct from mouth) is clear in context.  Here our purpose is simply to illustrate the data that it is necessary to store for this meaning in the set of reference correlates, which are:

1          kiru

11        laka

12        ĉ'akha

13        šimi

16        izhqi
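
The three reference sets listed above can be pictured as a simple lookup table, as in the following Python sketch.  This is a much simplified stand-in for the structure of our full Access database, and it does not attempt any of our similarity calculations;  the names REFERENCE_SETS, correlate_number and compound_numbers are hypothetical, introduced only for this illustration.

```python
from typing import List, Optional

# Reference correlates for the three illustrative meanings, keyed by
# correlate number. The forms are copied from the lists above.
REFERENCE_SETS = {
    "four": {1: "tawa", 2: "ĉusku", 11: "puši", 16: "paqpik", 17: "cuatro"},
    "three": {1: "kimsa", 16: "čhep"},
    "tooth": {1: "kiru", 11: "laka", 12: "ĉ'akha", 13: "šimi", 16: "izhqi"},
}


def correlate_number(meaning: str, form: str) -> Optional[int]:
    """Find the correlate number of a reference form within one meaning."""
    for number, reference_form in REFERENCE_SETS[meaning].items():
        if reference_form == form:
            return number
    return None


def compound_numbers(meaning: str, compound: str) -> List[Optional[int]]:
    """Look up each space-separated part of a compound separately."""
    return [correlate_number(meaning, part) for part in compound.split()]
```

With this table, the Jaqaru compound šimi ĉ'akha for tooth resolves to correlates 13 and 12 via compound_numbers("tooth", "šimi ĉ'akha"), while Cuzco Quechua kiru resolves to correlate 1.  The sketch matches reference forms exactly, so regional variants such as kinsa for kimsa would first need to be mapped to their reference correlate.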



